Converting GTDB faa files to a DIAMOND database with taxonomy
The Genome Taxonomy Database (GTDB) is a database of high-quality, complete bacterial and archaeal genomes. It covers the phylogenetic tree in a consistent way, so that no one type of microbe is overrepresented in the database. The latest release of the GTDB (R226) contains 732,475 genomes in total. Of these, 143,614 genomes are selected as species representatives that capture most known microbial diversity.
I’ve wanted to search for proteins in the GTDB, which requires converting the GTDB fasta files of the species representative genomes into a database that’s searchable by a tool like BLASTp, DIAMOND, or MMseqs2.
MMseqs2 has a convenient built-in command, mmseqs databases, that handles the downloading and building of a protein sequence database, which is handy.
However, building an equivalent GTDB database with taxonomy information as well as genome identifiers for DIAMOND requires a lot more manual effort.
The following is a solution I’ve come up with to build such a DIAMOND database.
Downloading fasta files and taxdump information for the GTDB
First, download and extract all the fasta files for the species representative genomes. Tip: use the https://data.ace.uq.edu.au/public/gtdb mirror, since the other mirror is extremely slow.
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release226/226.0/genomic_files_reps/gtdb_proteins_aa_reps_r226.tar.gz
tar xvf gtdb_proteins_aa_reps_r226.tar.gz
The extracted archive contains a directory called protein_faa_reps, which contains two child directories, bacteria and archaea, which themselves contain all the fasta files for each individual genome/species.
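For orientation, a quick way to eyeball the layout and count the extracted genomes is something like this (the exact count will depend on the release):
ls protein_faa_reps/
# archaea  bacteria
find protein_faa_reps/ -name '*_protein.faa.gz' | wc -l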
Next, download the taxdump information for the GTDB.
The GTDB provides this information in a taxdump archive under the auxiliary files directory.
Sadly, this archive doesn’t contain any sort of taxid map file (a list of genome IDs in one column with the corresponding taxonomy IDs in a second column).
Thankfully, the GitHub repo shenwei356/gtdb-taxdump provides such a taxid map file for each GTDB release.
The files I want for DIAMOND are taxid.map, names.dmp, and nodes.dmp, which can be found in the repo’s release files.
Download these files and extract them too.
wget https://github.com/shenwei356/gtdb-taxdump/releases/download/v0.6.0/gtdb-taxdump-R226.tar.gz
tar xvf gtdb-taxdump-R226.tar.gz
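For reference, each line of taxid.map simply pairs a genome accession with its GTDB taxid, like so (this illustrative line reuses the genome/taxid pair from the example hit later in this post):
GCA_002509665.1	1052962652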
If you’re interested in generating the taxid.map file from scratch, have a look at the documentation from shenwei356/gtdb-taxdump for how to do this.
Another way I’ve found comes from the authors of the MMseqs2 wiki and uses an amazing Bash/Awk script.
I’ve copied and pasted their script below, with some minor edits.
# Script from the MMseqs2 wiki with minor edits
# build name.dmp, node.dmp from GTDB taxonomy
mkdir taxonomy/ && cd "$_"
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release226/226.0/genomic_files_all/ssu_all_r226.fna.gz
gzip -d ssu_all_r226.fna.gz
buildNCBITax=$(cat << 'EOF'
BEGIN{
    ids["root"]=1;
    rank["c"]="class";
    rank["d"]="superkingdom";
    rank["f"]="family";
    rank["g"]="genus";
    rank["o"]="order";
    rank["p"]="phylum";
    rank["s"]="species";
    taxCnt=1;
    print "1\t|\t1\t|\tno rank\t|\t-\t|" > "nodes.dmp";
    print "1\t|\troot\t|\t-\t|\tscientific name\t|" > "names.dmp";
}
/^>/{
    str=$2
    for(i=3; i<=NF; i++){ str=str" "$i }
    n=split(str, a, ";");
    prevTaxon=1;
    for(i=1; i<=n; i++){
        if(a[i] in ids){
            prevTaxon=ids[a[i]];
        }else{
            taxCnt++;
            split(a[i], b, "_");
            printf("%s\t|\t%s\t|\t%s\t|\t-\t|\n", taxCnt, prevTaxon, rank[b[1]]) > "nodes.dmp";
            printf("%s\t|\t%s\t|\t-\t|\tscientific name\t|\n", taxCnt, b[3]) > "names.dmp";
            ids[a[i]]=taxCnt;
            prevTaxon=ids[a[i]];
        }
    }
    gsub(">", "", $1);
    printf("%s\t%s\n", $1, ids[a[n]]) > "mapping";
}
EOF
)
awk -F'\\[loc' '{print $1}' ssu_all_r226.fna | awk "$buildNCBITax"
touch merged.dmp
touch delnodes.dmp
cd ..
mkdir ssu tmp
mmseqs createdb taxonomy/ssu_all_r226.fna ssu/ssu
# add taxonomy to GTDB
mmseqs createtaxdb ssu/ssu tmp --ncbi-tax-dump taxonomy/ --tax-mapping-file taxonomy/mapping
However, I’ll be sticking with the pre-made taxid.map file from shenwei356/gtdb-taxdump for the rest of this post.
Adding genome IDs to the fasta accessions
Each fasta file in the protein_faa_reps dir is named with its genome ID, starting with the GB_GCA_ or RS_GCF_ prefix and ending with the suffix _protein.faa.gz.
Each fasta header within the files contains a unique protein accession identifier and other information from Prodigal.
These protein accession IDs are not very useful on their own.
It would be more helpful if each fasta header contained the genome ID too, since the genome ID can be used to search against the GenBank, RefSeq, or GTDB web servers.
We can add genome IDs to each header with Bash and Awk:
# navigate to the extracted faa file directory
cd protein_faa_reps
# make output dir
mkdir -p reformatted_files
# REFORMAT FILES
# read all faa files and write new ones with better accession IDs.
for file in ./{bacteria,archaea}/*.gz; do
# set variable names
# remove file path from file name
filename=${file##*/}
# remove file suffix
stemname=${filename%_*}
# remove RS_ and GB_ prefix to get the genome ID
MAGID=$(echo $stemname | sed -E 's/^(RS_|GB_)//')
# write new fasta file with accession IDs formatted to read as >GENOMEID~PROTEINID
zcat ${file} | awk -vMAGID="${MAGID}" '/^>/{gsub(/^>/,"",$1); printf ">%s~%s\n", MAGID, $1; next}{print}' >./reformatted_files/${stemname}_withgenomeids.faa
echo "written reformatted faa file for genome: ${MAGID}"
done
Now each fasta file will be in the format of:
>GENOMEID~PROTEINID
SEQUENCE
>GENOMEID~PROTEINID
SEQUENCE
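As a quick sanity check, the number of headers in a reformatted file should match its source file; something like the following works (the genome ID here is just the one from the example hit later in this post, so adjust the paths for whichever genome you pick):
zcat ./archaea/GB_GCA_002509665.1_protein.faa.gz | grep -c '^>'
grep -c '^>' ./reformatted_files/GB_GCA_002509665.1_withgenomeids.faa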
Something I want to try in the future is adding other useful info into the fasta headers so that the DIAMOND flag --outfmt 6 salltitles will return it.
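A rough sketch of one way to do that would be to keep the whole original Prodigal header after the genome ID rather than only the first field, e.g. this untested variation on the zcat/awk line in the loop above:
zcat ${file} | awk -vMAGID="${MAGID}" '/^>/{sub(/^>/,""); printf ">%s~%s\n", MAGID, $0; next}{print}' >./reformatted_files/${stemname}_withgenomeids.faa
DIAMOND should still treat the text up to the first space as the accession, while salltitles/stitle would return the full header.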
Creating the TaxID map file for each faa accession
Making the TaxID map file for each fasta protein accession ID is quite simple with the shenwei356/gtdb-taxdump taxid.map file.
I initially overcomplicated this step by using a nested for loop on an array of taxid.map entries, and popping out entries after each iteration with mapfile.
This was extremely slow (though it was cool learning about uncommonly used Bash features).
The better solution is to just grep over the taxid.map file and extract the TaxID that way:
# make output dir for the per-genome taxid map files
mkdir -p taxmapping_dir
# for each new formatted faa file, find its corresponding taxID and write all accession IDs and taxIDs to a new file
for file in ./reformatted_files/*.faa; do
# set variable names
# remove file path from file name
filename=${file##*/}
# remove file suffix
stemname=${filename%_*}
# get the genome ID
MAGID=$(echo $stemname | sed -E 's/^(RS_|GB_)//')
# get the TaxID from the taxid.map file
taxid=$(grep "${MAGID}" ../gtdb-taxdump-R226/taxid.map | cut -f2)
# write new taxid map file in the format of GENOMEID~PROTEINID\tTAXID
awk -vTAXID="${taxid}" -vOFS="\t" '/^>/{gsub(/^>/,"",$1); print $1, TAXID}' ${file} >./taxmapping_dir/${stemname}_taxmapping.txt
echo "written taxmap file for genome: ${MAGID}"
done
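One small caveat: grep as used above does substring matching, so if you want to be strict about matching whole genome IDs (and stop at the first hit), something like this should also work:
taxid=$(grep -m1 -w "${MAGID}" ../gtdb-taxdump-R226/taxid.map | cut -f2)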
By the way, I’m a big fan of using the awk -v flag for passing external variables into an Awk program.
I also find it way easier to set the output field separator OFS variable with a -v flag rather than using a BEGIN{OFS="\t"} block.
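For example, these two one-liners produce the same tab-separated output, and I find the first easier to read:
printf 'a b\n' | awk -vOFS='\t' '{print $1, $2}'
printf 'a b\n' | awk 'BEGIN{OFS="\t"}{print $1, $2}'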
Combining the reformatted fasta files and taxid map files for DIAMOND makedb
All the reformatted fasta files and taxid map files need to be combined into one big file each.
# make the header row for the taxidmap file so DIAMOND recognises it
echo -e "accession.version\ttaxid" >taxidmap
# loop over all taxidmap files and combine them
for file in ./taxmapping_dir/*.txt; do
cat ${file} >>taxidmap
echo "combined ${file} into full taxidmap"
done
# move to a more convenient location
mv taxidmap ../gtdb-taxdump-R226
# loop over all reformatted fasta files and combine them
for file in ./reformatted_files/*.faa; do
cat ${file} >>gtdb226_combined_faa_withgenomeids.faa
gzip ${file}
echo "combined faa ${file} and gzipped"
done
# compress this big fasta file (DIAMOND makedb can use .gz fasta files as input)
gzip gtdb226_combined_faa_withgenomeids.faa
echo "gzipped full combined faa file"
mv gtdb226_combined_faa_withgenomeids.faa.gz ..
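Before running makedb, it’s worth a quick check that every sequence got a taxid entry; the header count of the combined fasta and the number of data lines in taxidmap (skipping its header row) should both equal the total number of proteins:
zcat ../gtdb226_combined_faa_withgenomeids.faa.gz | grep -c '^>'
tail -n +2 ../gtdb-taxdump-R226/taxidmap | wc -l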
Making the DIAMOND database with taxonomy information
Now we can finally construct the DIAMOND database.
We will use the nodes.dmp and names.dmp files provided by the shenwei356/gtdb-taxdump repo.
# NOTE: need diamond binary accessible, assumes you have installed diamond
# NOTE: needs quite a lot of RAM
# set path to GTDB reformatted and combined fasta file
DBDIR=path/to/your/database/dir
# run DIAMOND
./diamond makedb \
--in ${DBDIR}/gtdb226_combined_faa_withgenomeids.faa.gz \
--db ${DBDIR}/gtdb226_combined_faa_withgenomeids \
--taxonmap ${DBDIR}/gtdb-taxdump-R226/taxidmap \
--taxonnodes ${DBDIR}/gtdb-taxdump-R226/nodes.dmp \
--taxonnames ${DBDIR}/gtdb-taxdump-R226/names.dmp
If everything runs successfully, we should see a log message like this confirming that all accessions have been mapped to a TaxID:
Closing the input file... [0.189s]
Closing the database file... [0.055s]
Database sequences 437238178
Database letters 138730167844
Accessions in database 437238178
Entries in accession to taxid file 23560576
Database accessions mapped to taxid 437238178
Database sequences mapped to taxid 437238178
Database hash 4660cada12b8409dd679240a0193968d
Total time 2748s
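If you want to inspect the finished database again later, DIAMOND’s dbinfo subcommand will print basic statistics (sequence and letter counts) for it:
./diamond dbinfo --db ${DBDIR}/gtdb226_combined_faa_withgenomeids.dmnd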
Querying the DIAMOND database and reformatting the output tsv file
To run a simple BLASTP job with DIAMOND, provide a fasta query file and run the following:
# change these
DBDIR=path/to/your/database/dir
INPUT=path/to/your/input/fasta_file
OUTPUT=path/to/your/output/dir
./diamond blastp \
-q $INPUT/your_fasta.faa \
-d $DBDIR/gtdb226_combined_faa_withgenomeids.dmnd \
-o $OUTPUT/quick_test.tsv \
--fast \
--max-target-seqs 1 \
--header simple \
--outfmt 6 qseqid sseqid pident evalue bitscore qcovhsp scovhsp staxids sskingdoms skingdoms sphylums sscinames salltitles stitle full_qseq full_sseq
An example output will look like this:
qseqid sseqid pident evalue bitscore qcovhsp scovhsp staxids sskingdoms skingdoms sphylums sscinames salltitles stitle full_qseq full_sseq
WP_004030875.1 GCA_002509665.1~DADN01000311.1_4 100 2.88e-253 697 100 99.7 1052962652 Archaea 0 Methanobacteriota 002509665 GCA_002509665.1~DADN01000311.1_4 GCA_002509665.1~DADN01000311.1_4 MKLAILGAGCYRTHAASGITNFSRACEVAEQVGKPEIAMTHSTIAMGAELKELAGIDEIVVSDPVFDNDFTVIDDFEYEAVIEAHKKDPESIMPQIREKVNAVAKDLPKPPKGAIHFTHPEDLGFEVTTDDNEAVQDADWVMTWFPKGDMQMGIIKEFADNLKEGAILTHACTVPTTMFQKIFEDLSSDEMNIAPKVNVSSYHPGAVPEMKGQVYIAEGYASEDAICKLVDWGVAARGDAFKLPAELLGPVCDMCSALTAITYAGILSYRDSVMNILGAPAGFAQMMAKESLTQVTDLMNSVGIDHMEEKLDPGALLGTADSMNFGAAADVLPSVLEVLENRKGKGPTCNI MKLAILGAGCYRTHAASGITNFSRACEVAEQVGKPEIAMTHSTIAMGAELKELAGIDEIVVSDPVFDNDFTVIDDFEYEAVIEAHKKDPESIMPQIREKVNAVAKDLPKPPKGAIHFTHPEDLGFEVTTDDNEAVQDADWVMTWFPKGDMQMGIIKEFADNLKEGAILTHACTVPTTMFQKIFEDLSSDEMNIAPKVNVSSYHPGAVPEMKGQVYIAEGYASEDAICKLVDWGVAARGDAFKLPAELLGPVCDMCSALTAITYAGILSYRDSVMNILGAPAGFAQMMAKESLTQVTDLMNSVGIDHMEEKLDPGALLGTADSMNFGAAADVLPSVLEVLENRKGKGPTCNI*
But it would be helpful if the sseqid column were split into two new columns showing the genome ID and the protein ID of each hit, i.e. splitting each GENOMEID~PROTEINID entry.
We can do this with a simple line of Awk.
The following will change the header line and split each GENOMEID~PROTEINID entry at the ~ character.
awk 'FNR==1{$2="genomeid" OFS "protaccession"; print; next}{gsub(/~/,OFS,$2); print}' quick_test.tsv
And the resulting output, which now contains separate genomeid and protaccession columns:
qseqid genomeid protaccession pident evalue bitscore qcovhsp scovhsp staxids sskingdoms skingdoms sphylums sscinames salltitles stitle full_qseq full_sseq
WP_004030875.1 GCA_002509665.1 DADN01000311.1_4 100 2.88e-253 697 100 99.7 1052962652 Archaea 0 Methanobacteriota 002509665 GCA_002509665.1~DADN01000311.1_4 GCA_002509665.1~DADN01000311.1_4 MKLAILGAGCYRTHAASGITNFSRACEVAEQVGKPEIAMTHSTIAMGAELKELAGIDEIVVSDPVFDNDFTVIDDFEYEAVIEAHKKDPESIMPQIREKVNAVAKDLPKPPKGAIHFTHPEDLGFEVTTDDNEAVQDADWVMTWFPKGDMQMGIIKEFADNLKEGAILTHACTVPTTMFQKIFEDLSSDEMNIAPKVNVSSYHPGAVPEMKGQVYIAEGYASEDAICKLVDWGVAARGDAFKLPAELLGPVCDMCSALTAITYAGILSYRDSVMNILGAPAGFAQMMAKESLTQVTDLMNSVGIDHMEEKLDPGALLGTADSMNFGAAADVLPSVLEVLENRKGKGPTCNI MKLAILGAGCYRTHAASGITNFSRACEVAEQVGKPEIAMTHSTIAMGAELKELAGIDEIVVSDPVFDNDFTVIDDFEYEAVIEAHKKDPESIMPQIREKVNAVAKDLPKPPKGAIHFTHPEDLGFEVTTDDNEAVQDADWVMTWFPKGDMQMGIIKEFADNLKEGAILTHACTVPTTMFQKIFEDLSSDEMNIAPKVNVSSYHPGAVPEMKGQVYIAEGYASEDAICKLVDWGVAARGDAFKLPAELLGPVCDMCSALTAITYAGILSYRDSVMNILGAPAGFAQMMAKESLTQVTDLMNSVGIDHMEEKLDPGALLGTADSMNFGAAADVLPSVLEVLENRKGKGPTCNI*
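The command above rebuilds each record with Awk’s default output separator (a space). If you’d rather keep the file strictly tab-separated, setting the input and output field separators explicitly should do it:
awk -F'\t' -vOFS='\t' 'FNR==1{$2="genomeid" OFS "protaccession"; print; next}{gsub(/~/,OFS,$2); print}' quick_test.tsv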
SUMMARY: Running all these steps with a SLURM script
Example of a full SLURM script below. Note: be sure to change parameters and target directories for your own case.
#!/bin/bash
#SBATCH -D ./
#SBATCH -J GTDB
#SBATCH --time=48:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12
#SBATCH --mem=40000
#SBATCH --output=log-%j.out
#SBATCH --error=log-%j.err
# set database directory
DBDIR=/home/user/path/to/your/database/dir
cd $DBDIR
# DOWNLOAD & EXTRACT THE GTDB PROTEIN FAA FILES
wget -v -P . https://data.ace.uq.edu.au/public/gtdb/data/releases/release226/226.0/genomic_files_reps/gtdb_proteins_aa_reps_r226.tar.gz
tar xvf gtdb_proteins_aa_reps_r226.tar.gz
# DOWNLOAD & EXTRACT TAXID INFO
# acquire prepared taxdump files from: https://github.com/shenwei356/gtdb-taxdump/releases/tag/v0.6.0
wget -v -P . https://github.com/shenwei356/gtdb-taxdump/releases/download/v0.6.0/gtdb-taxdump-R226.tar.gz
tar xvf gtdb-taxdump-R226.tar.gz
# navigate to the extracted faa file directory
cd protein_faa_reps
# prep output dirs
if [[ ! -d reformatted_files ]]; then
mkdir -p reformatted_files
fi
if [[ ! -d ./taxmapping_dir ]]; then
mkdir -p taxmapping_dir
fi
# REFORMAT FILES
# read all faa files and write new ones with better accession IDs.
for file in ./{bacteria,archaea}/*.gz; do
# set variable names
filename=${file##*/}
stemname=${filename%_*}
MAGID=$(echo $stemname | sed -E 's/^(RS_|GB_)//')
# write new fasta file with accession IDs formatted to read as >GENOMEID~PROTEINID
zcat ${file} | awk -vMAGID="${MAGID}" '/^>/{gsub(/^>/,"",$1); printf ">%s~%s\n", MAGID, $1; next}{print}' >./reformatted_files/${stemname}_withgenomeids.faa
echo "written reformatted faa file for genome: ${MAGID}"
done
# WRITE TAXID MAP FILES
# for each new formatted faa file, find its corresponding taxID and write all accession IDs and taxIDs to a new file
for file in ./reformatted_files/*.faa; do
# set variable names
filename=${file##*/}
stemname=${filename%_*}
MAGID=$(echo $stemname | sed -E 's/^(RS_|GB_)//')
taxid=$(grep "${MAGID}" ../gtdb-taxdump-R226/taxid.map | cut -f2)
# write new taxid map file in the format of GENOMEID~PROTEINID\tTAXID
awk -vTAXID="${taxid}" -vOFS="\t" '/^>/{gsub(/^>/,"",$1); print $1, TAXID}' ${file} >./taxmapping_dir/${stemname}_taxmapping.txt
echo "written taxmap file for genome: ${MAGID}"
done
# MAKE THE FULL TAXID MAP FILE FOR DIAMOND
echo -e "accession.version\ttaxid" >taxidmap
for file in ./taxmapping_dir/*.txt; do
cat ${file} >>taxidmap
echo "combined ${file} into full taxidmap"
done
mv taxidmap ../gtdb-taxdump-R226
# COMBINE AND COMPRESS THE NEW REFORMATTED FAA FILES FOR DIAMOND
for file in ./reformatted_files/*.faa; do
cat ${file} >>gtdb226_combined_faa_withgenomeids.faa
gzip ${file}
echo "combined faa ${file} and gzipped"
done
gzip gtdb226_combined_faa_withgenomeids.faa
echo "gzipped full combined faa file"
mv gtdb226_combined_faa_withgenomeids.faa.gz ..
echo "done!"
# RUN DIAMOND MAKE DATABASE COMMAND
# NOTE: need diamond binary accessible, assumes you have installed diamond
# NOTE: needs quite a lot of RAM
# move back to the database dir so that ./diamond resolves as in the earlier section
cd ${DBDIR}
./diamond makedb \
--in ${DBDIR}/gtdb226_combined_faa_withgenomeids.faa.gz \
--db ${DBDIR}/gtdb226_combined_faa_withgenomeids \
--taxonmap ${DBDIR}/gtdb-taxdump-R226/taxidmap \
--taxonnodes ${DBDIR}/gtdb-taxdump-R226/nodes.dmp \
--taxonnames ${DBDIR}/gtdb-taxdump-R226/names.dmp
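To use it, save the script (e.g. as build_gtdb_diamond_db.sh, an arbitrary name), adjust the paths and SLURM parameters, and submit it with sbatch:
sbatch build_gtdb_diamond_db.sh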
Other methods
Shoutout to nick-youngblut/gtdb_to_taxdump’s GitHub repo, which does the same thing but with nicer Python scripts rather than my ugly slapdash Bash scripts.