Multiple sequence alignment Fasta file manipulation

5 minute read

Published: June 23, 2022

This bit of code can be a great help to subset a Fasta file alignment based on certain conditions taken from a sample information file (e.g. subset only the samples collected in Spring, or those in Winter and greater than y cm, or everything except those in Autumn, etc.).

This blogpost covers the following:

Uploads an aligned Fasta file, a sample information data sheet (containing, for example, GPS localities, morphospecies names, collector name, sex), and a binary traits data sheet (0s and 1s depending on whether the sample belongs to a particular class (e.g. denoting sampling location, male or female, host plant).
Subsets the Fasta file based on particular conditions from the sample information sheet, and writes that as a new Fasta file
Subset the binary data sheet to contain only the rows for the samples present in the subsetted Fasta file
Order the new binary dataset to match the order of the sequences in the subsetted Fasta file
Write a text file to serve as a TRAITS block for PopArt (which generates haplotype networks)

The example below uses three input files:

🗒️ (1) a CSV dataset with sample information, which follows this example format:

sequence_ID	morphospecies_assignment	coi	nuclear_28S	morphotype	latitude	longitude
MN133285	sp1	TRUE	TRUE	GS	-33.2817	32.20973
MN133288	sp1	TRUE	TRUE	GS	-33.2817	32.20973
MN133296	sp1	TRUE	TRUE	NGS	-33.2817	32.20973
KCP_2020_00037	sp1	TRUE	TRUE	NGS	-33.2817	32.20973
GFS_2020_00614	sp2	TRUE	TRUE	GS	-33.2817	32.20973
GFS_2020_00562	sp2	TRUE	FALSE	GS	-33.2817	32.20973

The sequence_ID column names must match the Fasta sequence names exactly.

🗒️ (2) A binary dataset for the presence or absence of desired traits (e.g. localities, hosts), which follows this example format:

	adonax	agayanus	ecapensis	ecurvula
MN133285	0	1	0	0
MN133288	1	0	0	0
MN133296	1	0	0	0
KCP_2020_00037	0	0	1	0
GFS_2020_00614	0	0	0	1
GFS_2020_00562	0	0	0	1

The order of the Fasta sequences in the .fas file and the sequence IDs in the two dataframes above don’t have to be in the same order. The file above shows host grass species for Tetramesa wasps.

🗒️ (3) An aligned FASTA file (.fas)

R CODE

# (1) Read in sample information
source_mods = readxl::read_xlsx("G:/My Drive/Tetramesa Project/Source Modifiers/source_modifiers_updated_10_06_2021.xlsx", 
                                sheet = "cleaned_tetramesa_clarke"
)

# (2) Read in binary values
host.plants = readxl::read_xlsx("G:/My Drive/Tetramesa Project/Source Modifiers/source_modifiers_updated_10_06_2021.xlsx", 
                                sheet = "host_plants"
)

# Change any NA cells to contain zero values
host.plants[is.na(host.plants)] = 0

# Change the values in the dataframe to become characters (in case they are read as strings)
host.plants[] = lapply(host.plants, as.character)

# If the file is read into R using the readxl:: readxlsx() function, then this line changes the first column to rownames
host.plants = tibble::column_to_rownames(host.plants, var = "sample_id")

# Clean up the column names
host.plants = janitor::clean_names(host.plants)

# Subset the sample information datasheet accordingly. In my data, the columns after spyramidalis contain other information, which isn't needed here. 
# I only want to look at the Arundo donax (adonax) to Sporobolus pyramidalis (spyramidalis) columns
host.plants.filter = dplyr::select(host.plants, adonax:spyramidalis)

# (3) Read in the MSA fasta file
seqs_28S = ape::read.FASTA("C:/Users/s1000334/Documents/PhD_Tetramesa/Raw Sequences/Tetramesa/ALIGNED_READY/Tetramesa/Single genes/More_eurytomidae_included/28S/28S_only_aligned.fasta")

##########################
# FILTERING
##########################

# I want to subset sequences for only those that belong to the Golden shouldered (GS) morphotype from my sample info sheet (source_mods):
GS = dplyr::filter(source_mods, morphotype == "GS")

# Create an empty fasta file to which the subsetted sequences will be written
file.create("28S_GS.fasta")

# Loop through each sequence_ID name in the subsetted GS dataframe, and extract the corresponding sequence name from the uploaded Fasta alignment file.
# Then write that sequence to a new file. In this way, you will end up with a subsetted Fasta alignment with only the golden shouldered specimens

for(i in 1:nrow(GS)){
  
  for(j in 1:length(seqs_28S)){
    if(GS$sequence_ID[i] == labels(seqs_28S[j]))
      ape::write.FASTA( seqs_28S[j], "28S_GS.fasta", append = T )
  }
  
}

###############################################################
# CREATE A TRAITS BLOCK FOR POPART (HAPLOTYPE NETWORK PROGRAM)
###############################################################

# Using the newly subsetted Fasta alignment, we now want to subset the binary values matrix (host.plants) to match it

# Read in the subsetted Fasta file created above
seqs_28S_GS = ape::read.FASTA("C:/Users/s1000334/Documents/PhD_Tetramesa/Raw Sequences/Tetramesa/ALIGNED_READY/Tetramesa/Single genes/More_eurytomidae_included/28S/PopArt/28S_GS.fasta")

# Extract the rows in the binary data frame (host.plants) for the only the sample names that are present in the subsetted Fasta file (seqs_28S_GS)
host.plants.popart.block = host.plants.filter %>% dplyr::filter(row.names(host.plants.filter) %in% labels(seqs_28S_GS))

# Get the binary data frame into the same order as the sequence names in the subsetted Fasta file
host.plants.popart.block.ordered = host.plants.popart.block[match(labels(seqs_28S_GS), rownames(host.plants.popart.block)),]

# Find which columns have only zeros, so that those can be removed (no point in keeping a trait column if there are no sequence representatives in it!)
zero_cols = c()
for(i in 1:ncol(host.plants.popart.block.ordered)){
  if( all(host.plants.popart.block.ordered[i] == 0) )
    zero_cols[i] = i
}

# Remove the NA values from the vector
zero_cols = zero_cols[!is.na(zero_cols)]

# Remove the columns that had only zeroes
host.plants.popart.block.ordered = dplyr::select(host.plants.popart.block.ordered, -zero_cols )

# Write the resulting binary matrix to a text file, which can be appended to the end of the sequence alignment Nexus file as a TRAITS block
# Just remove the comma after each sample name to get it into the correct format for PopArt
write.table(host.plants.popart.block.ordered, file = "popart_traits_block.txt", quote = FALSE, col.names = TRUE, sep = ",")

The TRAITS block at the end of the Nexus file for PopArt will look something like this:

BEGIN TRAITS;
Dimensions NTRAITS=8;
Format labels=yes separator=Comma;
TraitLabels cdactylon, ecurvula, egummiflua, elehm, eplana, etef, hhirta, spyramidalis;
matrix

GFS_2019_00030 0, 1, 0, 0, 0, 0, 0, 0
GFS_2019_00043 0, 1, 0, 0, 0, 0, 0, 0
GFS_2020_00079 0, 1, 0, 0, 0, 0, 0, 0
GFS_2020_00083 0, 1, 0, 0, 0, 0, 0, 0
GFS_2020_00085 0, 1, 0, 0, 0, 0, 0, 0
;

END;

Twitter Facebook LinkedIn

Clarke van Steenderen

Multiple sequence alignment Fasta file manipulation

R CODE

Plot a phylogeny in R using ggtree

Creating a climate map with GPS coordinate points

Creating an elevation map with GPS coordinate points

Renaming all the sequences in a FASTA file automatically