Tutorial
After installing mkCOInr you have a file system like this:
mkCOInr
├── data
│ ├── bold_taxon_list_2022-02-24.txt
│ ├── example
│ │ ├── custom_lineages_verified.tsv
│ │ ├── my_sequences.tsv
│ │ ├── taxon_list_eukaryota.tsv
│ │ ├── taxon_list_insecta.tsv
│ │ └── taxon_list.tsv
│ └── one_seq_per_order_658.fas
├── doc
... (abbreviated)
└── scripts
├── add_taxids.pl
├── dereplicate.pl
├── download_bold.pl
├── download_taxonomy.pl
├── format_bold.pl
├── format_custom.pl
├── format_db.pl
├── format_ncbi.pl
├── get_subtaxa.pl
├── mkdb.pm
├── pool_and_dereplicate.pl
├── select_region.pl
└── select_taxa.pl
Customize database
- In the first part of the tutorial, I will start from the COInr database in each major step to illustrate
These steps can be executed independently.
The last example shows how to create a pipeline by combining different commands.
The creation of the COInr database is explained in the Create COInr from BOLD and NCBI section. You can download this database from Zenodo and customize it to your needs.
Download and untar COInr
You will need to change the date in the filename, and get the up-to-date link from zenodo for later releases.
cd mkCOInr
wget https://zenodo.org/record/6555985/files/COInr_2022_05_06.tar.gz
tar -zxvf COInr_2022_05_06.tar.gz
rm COInr_2022_05_06.tar.gz
For shortening the paths in this tutorial, rename COInr_2022_05_06 directory to COInr.
mv COInr_2022_05_06 COInr
This gives the following file structure
mkCOInr
├── COInr
│ ├── COInr.tsv
│ └── taxonomy.tsv
├── data
│ ├── bold_taxon_list_2022-02-24.txt
│ ├── example
│ │ ├── custom_lineages_verified.tsv
│ │ ├── my_sequences.tsv
│ │ ├── taxon_list_eukaryota.tsv
│ │ ├── taxon_list_insecta.tsv
│ │ └── taxon_list.tsv
│ └── one_seq_per_order_658.fas
...(abbreviated)
└── scripts
├── add_taxids.pl
├── dereplicate.pl
...(abbreviated)
- The COInr database is composed of two files
COInr.tsv contains sequenceIDs, taxIDs and sequences
taxonomy.tsv contains all taxIDs and associated information
Add custom sequences to a database
Format custom files
The input tsv file (-custom) contains seqIDs, taxon names (can be at any taxonomic level) and sequences (see the example data/example/my_sequences.tsv). The format_custom.pl script will suggest one or more lineages for each taxon name based on the existing lineages in taxonomy.tsv. It will also consider synonyms.
perl scripts/format_custom.pl -custom data/example/my_sequences.tsv -taxonomy COInr/taxonomy.tsv -outdir tutorial/custom/1_format
The output lineage file (custom_lineages.tsv) looks like this:
phylum class order family subfamily genus species homonymy seqIDs
Mollusca Bivalvia Cardiida Cardiidae Acanthocardia Acanthocardia paucicostata 0 Seq113;Seq88
NA NA NA NA NA NA Ilia nucleus 0 Seq117
Streptophyta Magnoliopsida Ericales Ericaceae Leucothoe 1 Seq96
Arthropoda Malacostraca Amphipoda Leucothoidae Leucothoe 1 Seq96
Annelida Polychaeta Phyllodocida Polynoidae 0 Seq65
This output should should be checked manually to see if the lineages are coherent. If homonymy, choose the correct lineage (e.g. for Leucothoe genus), then delete homonymy column.
If a taxon name is not present in the taxonomy file, the lineage should be completed manually (e.g. Ilia nucleus in the example file).
I created a revised version of the lineage file (data/example/custom_lineages_verified.tsv), which will be used in the next step:
phylum class order family subfamily genus species seqIDs
Mollusca Bivalvia Cardiida Cardiidae Acanthocardia Acanthocardia paucicostata Seq113;Seq88
Arthropoda Malacostraca Decapoda Leucosiidae Ilia Ilia nucleus Seq117
Arthropoda Malacostraca Amphipoda Leucothoidae Leucothoe Seq96
Annelida Polychaeta Phyllodocida Polynoidae Seq65
See details in description section: format_custom.pl script.
Add taxIDs to custom sequences
The add_taxids.pl script will
- For each lineage in the input file
Find an existing taxID at the lowest possible taxonomic level. taxIDs can be either from NCBI, or negative taxID already present in taxonomy.tsv.
Add new arbitrary (negative) taxIDs to taxa not yet in the taxonomy file
Link each new taxID to an existing one as a child and include info to the updated taxonomy file
Update the taxonomy.tsv file
perl scripts/add_taxids.pl -lineages data/example/custom_lineages_verified.tsv -sequences tutorial/custom/1_format/custom_sequences.tsv -taxonomy COInr/taxonomy.tsv -outdir tutorial/custom/2_add_taxids
See details in description section: add_taxids.pl script.
Dereplicate custom sequences
The dereplicate.pl script will eliminate sequences that are substrings of another sequence of the same taxID. Use sequences_with_taxIDs.tsv file (output of the previous script) as the input.
perl scripts/dereplicate.pl -tsv tutorial/custom/2_add_taxids/sequences_with_taxIDs.tsv -outdir tutorial/custom/3_dereplicate -out custom_dereplicated_sequences.tsv
The output file is in the same format as the input tsv file.
See details in description section: dereplicate.pl script.
Pool and dereplicate datasets
- Use two dereplicated sequence tsv files:
COInr.tsv (pool of BOLD and NCBI, downloaded from Zenodo)
custom_dereplicated_sequences.tsv (output of the previous script)
pool_and_dereplicate.pl will pool the files and dereplicate sequences of the taxIDs that are present in both files.
perl scripts/pool_and_dereplicate.pl -tsv1 COInr/COInr.tsv -tsv2 tutorial/custom/3_dereplicate/custom_dereplicated_sequences.tsv -outdir tutorial/custom -out COInr_custom.tsv
The output is the same format as the input tsv file.
See details in description section: pool_and_dereplicate.pl script.
Custom database
- Your custom database is composed of two files:
the dereplicated sequence file (COInr_custom.tsv)
the last version of the taxonomy file (taxonomy_updated.tsv)
For simplicity, move the updated taxonomy file to the same folder as the sequence file.
mv tutorial/custom/2_add_taxids/taxonomy_updated.tsv tutorial/custom/taxonomy_updated.tsv
This database can be further customized, or you can simply be formated to your taxonomic assignment program by the format_db.pl script.
Select sequences from existing database
Select sequences for a list of taxa with a minimum taxonomic rank
Sequences can be selected for a list of taxa and/or for a minimum taxonomic level (species/genus/family/order/class/phylum/kingdom/domain/root)
The input file (-taxon_list) contains a list of taxa and eventually their taxIDs (see example data/example/taxon_list.tsv).
perl scripts/select_taxa.pl -taxon_list data/example/taxon_list.tsv -tsv COInr/COInr.tsv -taxonomy COInr/taxonomy.tsv -min_taxlevel species -outdir tutorial/select_taxa_0 -out COInr_selected.tsv
The main output is a sequence tsv file (COInr_selected.tsv). A lineage file (taxa_with_lineages.tsv) is also written for all taxa in the taxon_list to check if they are coherent with the target taxon names.
See details in description section: select_taxa.pl script.
Excluding sequences of a taxon list
With the same script it is also possible to eliminate sequences of taxa instead of selecting them. Set the negative_list option to 1 to do that.
perl scripts/select_taxa.pl -taxon_list data/example/taxon_list.tsv -tsv COInr/COInr.tsv -taxonomy COInr/taxonomy.tsv -min_taxlevel species -outdir tutorial/select_taxa_1 -out COInr_reduced.tsv -negative_list 1
See details in description section: select_taxa.pl script.
Select region
Sequences can be trimmed to a specific region of the COI gene by the select_region.pl script. To define the region, you can either give a fasta file with sequences trimmed to the region of interest, or you can detect it automatically by e-pcr.
Select region using the e_pcr option
The primers used in this example are amplifying a Leray fragment (ca. 313 bp of the second half of the barcode region).
perl scripts/select_region.pl -tsv COInr/COInr.tsv -outdir tutorial/select_region/ePCR -e_pcr 1 -fw GGNTGAACNGTNTAYCCNCC -rv TAWACTTCDGGRTGNCCRAARAAYCA -trim_error 0.3 -min_amplicon_length 280 -max_amplicon_length 345 -min_overlap 20 -tcov 0.8 -identity 0.7
Select region using the bait_fas option
Using the e_pcr option is an easy way to produce some sequences trimmed to the target region, and they can be used as a database to align all other sequences to them. However, if the parameters of the e_pcr are relaxed, it can produce some false positives. An alternative solution is to use a small, taxonomically divers fasta file, with sequences already trimmed to the target region (-bait_fas option). An example of such a file is given in the data directory (data/one_seq_per_order_658.fas). It contains one sequence for each taxonomic order among the taxa that have a compete mitochondrial genome available in GenBank. Sequences are trimmed to the approximately 658 bp (depending on the taxon) barcode fragment of the COI gene.
perl scripts/select_region.pl -tsv COInr/COInr.tsv -outdir tutorial/select_region/bait_fas -e_pcr 0 -bait_fas data/one_seq_per_order_658.fas -tcov 0.8 -identity 0.7
See details in description section: select_region.pl script.
Format database
- Format the database to one of the following formats
qiime
rdp
full
blast
vtam
sintax
qiime
perl scripts/format_db.pl -tsv COInr/COInr.tsv -taxonomy COInr/taxonomy.tsv -outfmt qiime -outdir COInr/qiime -out COInr_qiime
rdp
perl scripts/format_db.pl -tsv COInr/COInr.tsv -taxonomy COInr/taxonomy.tsv -outfmt rdp -outdir COInr/rdp -out COInr_rdp
You should use the rdp_calssifier or qiime’s feature-classifier to train the database using the output files of this script if you have used the rdp or qiime options.
full
The full option, gives a tsv file with seqIDs, ranked lineages, taxIDs for each sequence, and this is a very easy-to-parse, complete file.
perl scripts/format_db.pl -tsv COInr/COInr.tsv -taxonomy COInr/taxonomy.tsv -outfmt full -outdir COInr/full -out COInr_full
sintax
perl scripts/format_db.pl -tsv COInr/COInr.tsv -taxonomy COInr/taxonomy.tsv -outfmt sintax -outdir COInr/sintax -out COInr_sintax
blast
For making a BLAST database, the taxonomy file is not necessary and the indexed files in the output folder are ready to use.
perl scripts/format_db.pl -tsv COInr/COInr.tsv -outfmt blast -outdir COInr/blast -out COInr_blast
vtam
The vtam option produces a BLAST database and a taxonomy file adapted to VTAM .
perl scripts/format_db.pl -tsv COInr/COInr.tsv -taxonomy COInr/taxonomy.tsv -outfmt vtam -outdir COInr/vtam -out COInr_vtam
See details in description section: format_db.pl script.
Chaining steps to make a custom database
In the above examples, we have started from the COInr database. However, you can chain the different commands.
- Bellow, I will show you how to create a database with the following characteristics:
Eukaryota sequences
Excluding insects
Enriched with custom sequences
Sequences assigned at least to genus level
Trimmed to the Leray fragment (ca. 313 nt of the second half of the barcode region) of the COI gene (keep sequences if cover at least 90% of the target region)
rdp_classifier format
- Notes:
It is a good idea to start with steps that are relatively quick and reduce the size of the database.
Since, over 70% of the sequences are from Insecta in COInr, we will start by eliminating them.
The custom sequences are all Non-Insect Eukaryotes, so we can add custom sequences to the reduced dataset. Otherwise, we should have started by adding custom sequences. This solution is also fine, but gives large intermediate files.
The selection of the target region is the most computationally intensive, and the more diverse the dataset, the less precise it is. So it is preferable to do this at the end of the pipeline.
Exclude Insecta and sequences with resolution lower than genus
perl scripts/select_taxa.pl -taxon_list data/example/taxon_list_insecta.tsv -tsv COInr/COInr.tsv -taxonomy COInr/taxonomy.tsv -min_taxlevel genus -outdir tutorial/chained/1_noInsecta -out COInr_noIns.tsv -negative_list 1
Keep Eukaryota
perl scripts/select_taxa.pl -taxon_list data/example/taxon_list_eukaryota.tsv -tsv tutorial/chained/1_noInsecta/COInr_noIns.tsv -taxonomy COInr/taxonomy.tsv -outdir tutorial/chained/2_Eukaryota -out COInr_noIns_Euk.tsv
Add custom sequences
perl scripts/format_custom.pl -custom data/example/my_sequences.tsv -taxonomy COInr/taxonomy.tsv -outdir tutorial/chained/3_add_custom/1_format
Check and format the custom_lineages.tsv and make custom_lineages_verified.tsv as in Add custom sequences to a database section.
perl scripts/add_taxids.pl -lineages data/example/custom_lineages_verified.tsv -sequences tutorial/chained/3_add_custom/1_format/custom_sequences.tsv -taxonomy COInr/taxonomy.tsv -outdir tutorial/chained/3_add_custom/2_add_taxids
perl scripts/dereplicate.pl -tsv tutorial/chained/3_add_custom/2_add_taxids/sequences_with_taxIDs.tsv -outdir tutorial/chained/3_add_custom/3_dereplicate -out custom_dereplicated_sequences.tsv
Add the formatted, dereplicated custom sequences to the sequences in COInr_noIns_Euk.tsv
perl scripts/pool_and_dereplicate.pl -tsv1 tutorial/chained/2_Eukaryota/COInr_noIns_Euk.tsv -tsv2 tutorial/chained/3_add_custom/3_dereplicate/custom_dereplicated_sequences.tsv -outdir tutorial/chained/3_add_custom -out COInr_noIns_Euk_custom.tsv
mv tutorial/chained/3_add_custom/2_add_taxids/taxonomy_updated.tsv tutorial/chained/3_add_custom/taxonomy_updated.tsv
Keep only sequences with genus or higher resolution
We have eliminated sequences with lower than genus resolution from COInr in the first step (-min_taxlevel genus). However, among the custom sequences we had a sequence with an unknown genus. So let’s redo the selection for a minimum taxonomic level.
Yes, you are right! We could have just avoided adding that sequence to the database in the previous step. But if you have many custom sequences, you might just be lazy to check the custom sequences manually, and in that case you can use mkCOInr to this for you.
Attention: From now on, we have to use the updated taxonomy file (taxonomy_updated.tsv), since some of the taxa of the custom sequences might not be in the original taxonomy.tsv file.
perl scripts/select_taxa.pl -tsv tutorial/chained/3_add_custom/COInr_noIns_Euk_custom.tsv -taxonomy tutorial/chained/3_add_custom/taxonomy_updated.tsv -outdir tutorial/chained/4_genus -out COInr_noIns_Euk_custom_genus.tsv
Trim to Leray region
perl scripts/select_region.pl -tsv tutorial/chained/4_genus/COInr_noIns_Euk_custom_genus.tsv -outdir tutorial/chained/5_select_region -e_pcr 1 -fw GGNTGAACNGTNTAYCCNCC -rv TAWACTTCDGGRTGNCCRAARAAYCA -trim_error 0.3 -min_amplicon_length 280 -max_amplicon_length 345 -min_overlap 20 -tcov 0.9 -identity 0.7
Format for RDP_classifier
perl scripts/format_db.pl -tsv tutorial/chained/5_select_region/trimmed.tsv -taxonomy tutorial/chained/3_add_custom/taxonomy_updated.tsv -outfmt rdp -outdir tutorial/chained/6_rdp -out COInr_customized
Create COInr from BOLD and NCBI
The following steps describe how COInr database (available at Zenodo ) was produced.
Download NCBI taxonomy
Download NCBI taxonomy dmp file and create taxonomy.tsv.
cd mkCOInr
perl scripts/download_taxonomy.pl -outdir COInr_new/taxonomy
See details in description section: download_taxonomy.pl script.
NCBI sequences
Download NCBI sequences
The following command will download Coding DNA Sequence (CDS) fasta files of all sequences with COI, CO1, COXI or COX1 in the title lines and complete mitochondrial genomes. It takes several hours (days) to run this command.
nsdpy -r "COI OR COX1 OR CO1 OR COXI OR (complete[Title] AND genome[Title] AND Mitochondrion[Filter])" -T -v --cds
The results are found in the NSDPY_results/yyyy-mm-dd_hh-mm-ss folder.
The sequences.fasta file contains all CDS sequences. Sequences are correctly oriented but should still be filtered to keep only COI sequences. TaxIDs.txt contains the sequenceIDs and the TaxIDs.
Move the results of nsdpy to the COInr_new/ncbi/download directory and clean up the directory.
mkdir -p COInr_new/ncbi
mv NSDPY_results/yyyy-mm-dd_hh-mm-ss COInr_new/ncbi/download
mv report.tsv COInr_new/ncbi/download
rmdir NSDPY_results
Format NCBI sequences
- The format_ncbi.pl script will
Select COI sequences and clean them.
Eliminate identical sequences of the same taxID.
Clean tax names and taxids.
perl scripts/format_ncbi.pl -cds COInr_new/ncbi/download/sequences.fasta -taxids COInr_new/ncbi/download/TaxIDs.txt -taxonomy COInr_new/taxonomy/taxonomy.tsv -outdir COInr_new/ncbi/format
The major output is a sequence tsv file with taxIDs.
See details in description section: format_ncbi.pl script.
Dereplicate NCBI sequences
Eliminate sequences that are substring of another sequence of the same taxID.
perl scripts/dereplicate.pl -tsv COInr_new/ncbi/format/ncbi_sequences.tsv -outdir COInr_new/ncbi/dereplicate -out ncbi_dereplicated_sequences.tsv
The output is the same format as the input tsv file.
See details in description section: dereplicate.pl script.
BOLD sequences
Download BOLD sequences
The download_bold.pl script is deprecated. The BOLD API used in download_bold.pl do not allow anymore to download large data files.
It is possible, however, to download all public sequences as a data package from https://www.boldsystems.org/index.php/datapackages. You need to have a BOLD account for downloading the data package in (tar.gz compressed) format, that contains a TSV file with sequences, taxonomic lineages and other metadata. This uncompressed file will be the input of format_bold_package.pl.
Format BOLD sequences
- The format_bold_package.pl script will
Select COI sequences and clean them
Select sequences with out without BIN_URI according to the delete_noBIN argument
Eliminate identical sequences of the same lineage
Clean lineages and make a list with corresponding sequenceIDs
perl scripts/format_bold_package.pl -bold_data COInr_new/bold/download/BOLD_Public.26-Apr-2024.tsv -outdir COInr_new/bold/format -delete_noBIN 1
- The major output is the following:
bold_lineages.tsv (all identical lineages are pooled into a same line)
See details in description section: format_bold_package.pl script.
Add taxIDs to BOLD sequences
- For each lineage the add_taxids.pl script will
Find an existing taxID at the lowest level possible. TaxIDs can be either from NCBI, or negative taxID already present in taxonomy.tsv.
Add new arbitrary (negative) taxIDs to taxa, that are not yet in taxonomy.tsv
Link each new taxID to existing one as a child and include info to the updated taxonomy file
Update the input taxonomy file
perl scripts/add_taxids.pl -lineages COInr_new/bold/format/bold_lineages.tsv -sequences COInr_new/bold/format/bold_sequences.tsv -taxonomy COInr_new/taxonomy/taxonomy.tsv -outdir COInr_new/bold/add_taxids
- The main output files are the following:
See details in description section: add_taxids.pl script.
Dereplicate BOLD sequences
Eliminate sequences that are substring of another sequence of the same taxID.
perl scripts/dereplicate.pl -tsv COInr_new/bold/add_taxids/sequences_with_taxIDs.tsv -outdir COInr_new/bold/dereplicate -out bold_dereplicated_sequences.tsv
The output is the same format as the input tsv file.
See details in description section: dereplicate.pl script.
Pool and dereplicate datasets
Use the dereplicated sequence files from BOLD and NCBI. The pool_and_dereplicate.pl script will pool the files and dereplicate sequences of a taxID that are present in both files.
perl scripts/pool_and_dereplicate.pl -tsv1 COInr_new/bold/dereplicate/bold_dereplicated_sequences.tsv -tsv2 COInr_new/ncbi/dereplicate/ncbi_dereplicated_sequences.tsv -outdir COInr_new -out COInr.tsv
The output is the same format as the input tsv file.
See details in description section: pool_and_dereplicate.pl script.
Move the taxonomy file to the same directory
mv COInr_new/bold/add_taxids/taxonomy_updated.tsv COInr_new/taxonomy.tsv