Renaming all the sequences in a FASTA file automatically

Published: November 17, 2021

I ran into the problem of renaming sequences in a FASTA sequence alignment after downloading over 200 sequences from GenBank for four different genes. Each sequence was named only by its GenBank accession number (e.g. MK1526). I wanted each sequence name to include the species name as well, for example MK1526_Canis_lupus. To avoid the tedious task of doing this by hand, I wrote a few lines of R code that I hope will be of use to others with the same issue.

One first needs a .csv file (for example, exported from Excel) with a column of species/taxon names and a column with the associated GenBank accession number. For example:

taxon_name	country	genbank
Ectrosia schultzii Benth.	Australia	MK872740
Cladoraphis spinosa (L.f.) S.M.Phillips	South Africa	GU360600
Ectrosia lasioclada (Merr.) S.T.Blake	Australia	MK872659

The aim is to generate the fourth column (“name”) shown below:

taxon_name	country	genbank	name
Ectrosia schultzii Benth.	Australia	MK872740	MK872740_Ectrosia_schultzii
Cladoraphis spinosa (L.f.) S.M.Phillips	South Africa	GU360600	GU360600_Cladoraphis_spinosa
Ectrosia lasioclada (Merr.) S.T.Blake	Australia	MK872659	MK872659_Ectrosia_lasioclada

That column is then used to rename the sequences in the FASTA file.
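
As a quick way to try the idea without any files, here is a minimal base R sketch that builds the “name” column from the three example rows above. The data frame `taxa` and the helper vector `binomial` are names chosen for this illustration only, and the sketch uses base R string functions rather than stringr.

```r
# The three example rows from the table above, typed in directly
taxa <- data.frame(
  taxon_name = c("Ectrosia schultzii Benth.",
                 "Cladoraphis spinosa (L.f.) S.M.Phillips",
                 "Ectrosia lasioclada (Merr.) S.T.Blake"),
  country = c("Australia", "South Africa", "Australia"),
  genbank = c("MK872740", "GU360600", "MK872659"),
  stringsAsFactors = FALSE
)

# Split each taxon name on whitespace and keep the first two words
# (genus and species), joined by an underscore
parts <- strsplit(trimws(taxa$taxon_name), "\\s+")
binomial <- vapply(parts,
                   function(p) paste(head(p, 2), collapse = "_"),
                   character(1))

# Drop anything from an opening round bracket onwards, in case the
# authority is attached to the species name without a space
binomial <- sub("[(].*", "", binomial)

# Join accession number and genus_species with an underscore
taxa$name <- paste(taxa$genbank, binomial, sep = "_")
print(taxa$name)
```

The resulting names can be compared against the “name” column in the table above before running the same steps on a full csv file.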

# set your working directory to your desired folder location
# read the csv file
taxa_rps16trnk = read.csv("rps16trnk.csv")

library(stringr)
# change the first column so that it only contains the first two words (genus and species, separated by a space)
taxa_rps16trnk$taxon_name = stringr::word(taxa_rps16trnk$taxon_name, 1,2, sep=" ") 
# remove any cases of new line characters
taxa_rps16trnk$taxon_name = str_replace_all(taxa_rps16trnk$taxon_name, "[\r\n]" , "") 
# remove any characters from a round bracket onwards, in case there isn't a space between the species name and the bracket
taxa_rps16trnk$taxon_name = gsub("[(].*","", taxa_rps16trnk$taxon_name)
# add an underscore between the genus and species name
taxa_rps16trnk$taxon_name = gsub(" ", "_", taxa_rps16trnk$taxon_name)

# paste the GenBank number and genus_species name together, separated by an underscore, into a new column
# ("genbank" is the accession column in the example table above; use your own column name if it differs)
taxa_rps16trnk$name = paste(taxa_rps16trnk$genbank, taxa_rps16trnk$taxon_name, sep = "_")

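# optionally write this as a csv file to your working directory
# (note that this overwrites the input file; row.names = FALSE avoids adding an extra column of row numbers)
write.csv(taxa_rps16trnk, "rps16trnk.csv", row.names = FALSE)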

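# create a vector of the new names
new = taxa_rps16trnk$name
# read the FASTA file to get the current sequence names (uses the ape package)
current = ape::read.FASTA("rps16trnk.fas")

# create a dataframe with two columns: one with the current sequence names directly from the FASTA file, and one with the new names you want to change them to
# this assumes that the rows of the csv file are in the same order as the sequences in the FASTA file
ref = data.frame(matrix(nrow = length(new), ncol = 2))
ref[,1] = names(current)
ref[,2] = new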

# use the phylotools library to replace the names
library(phylotools)
rename.fasta(infile = "rps16trnk.fas", ref, outfile = "renamed_rps16trnk.fasta")
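
The reference table above pairs old and new names by position, so it depends on the csv rows being in the same order as the sequences in the FASTA file. If that is not guaranteed, one option is to pair them by accession number instead. The following is a minimal base R sketch of that idea, not the phylotools implementation: the toy sequences, the temporary files, and the variable names are all made up for illustration, and it assumes each FASTA header consists of the accession number only.

```r
# Toy FASTA file written to a temporary location (sequences are made up)
fasta_path <- tempfile(fileext = ".fas")
writeLines(c(">GU360600", "ACGTAC",
             ">MK872740", "ACGTTT",
             ">MK872659", "ACGAAC"), fasta_path)

# Lookup table, deliberately in a different row order from the FASTA file
taxa <- data.frame(
  genbank = c("MK872740", "GU360600", "MK872659"),
  name = c("MK872740_Ectrosia_schultzii",
           "GU360600_Cladoraphis_spinosa",
           "MK872659_Ectrosia_lasioclada"),
  stringsAsFactors = FALSE
)

# Read the file and pick out the header lines (those starting with ">")
lines <- readLines(fasta_path)
is_header <- startsWith(lines, ">")
old_names <- sub("^>", "", lines[is_header])

# For each current name, find the table row with the same accession
idx <- match(old_names, taxa$genbank)
if (anyNA(idx)) {
  stop("No table entry for: ",
       paste(old_names[is.na(idx)], collapse = ", "))
}

# Two-column reference table: current names, then new names
ref <- data.frame(old = old_names, new = taxa$name[idx],
                  stringsAsFactors = FALSE)

# Replace the headers and write the renamed file
lines[is_header] <- paste0(">", ref$new)
out_path <- tempfile(fileext = ".fasta")
writeLines(lines, out_path)
print(ref)
```

A `ref` table built this way has the same two-column layout as the one passed to rename.fasta above, so it could be used there in place of the position-based version.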

Clarke van Steenderen