Tutorial

After installing mkCOInr you have a file system like this:

mkCOInr
├── data
│   ├── bold_taxon_list_2022-02-24.txt
│   ├── example
│   │   ├── custom_lineages_verified.tsv
│   │   ├── my_sequences.tsv
│   │   ├── taxon_list_eukaryota.tsv
│   │   ├── taxon_list_insecta.tsv
│   │   └── taxon_list.tsv
│   └── one_seq_per_order_658.fas
├── doc
... (abbreviated)
└── scripts
        ├── add_taxids.pl
        ├── dereplicate.pl
        ├── download_bold.pl
        ├── download_taxonomy.pl
        ├── format_bold.pl
        ├── format_custom.pl
        ├── format_db.pl
        ├── format_ncbi.pl
        ├── get_subtaxa.pl
        ├── mkdb.pm
        ├── pool_and_dereplicate.pl
        ├── select_region.pl
        └── select_taxa.pl

Customize database

In the first part of the tutorial, I will start from the COInr database in each major step to illustrate

How to include custom sequences
How to select or eliminate sequences of a list of taxa or a minimum resolution
How to select a target region
How to format a dataset to different database formats

These steps can be executed independently.

The last example shows how to create a pipeline by combining different commands.

The creation of the COInr database is explained in the Create COInr from BOLD and NCBI section. You can download this database from Zenodo and customize it to your needs.

Download and untar COInr

You will need to change the date in the filename, and get the up-to-date link from zenodo for later releases.

cd mkCOInr
wget https://zenodo.org/record/6555985/files/COInr_2022_05_06.tar.gz
tar -zxvf COInr_2022_05_06.tar.gz
rm COInr_2022_05_06.tar.gz

For shortening the paths in this tutorial, rename COInr_2022_05_06 directory to COInr.

mv COInr_2022_05_06 COInr

This gives the following file structure

mkCOInr
├── COInr
│   ├── COInr.tsv
│   └── taxonomy.tsv
├── data
│   ├── bold_taxon_list_2022-02-24.txt
│   ├── example
│   │   ├── custom_lineages_verified.tsv
│   │   ├── my_sequences.tsv
│   │   ├── taxon_list_eukaryota.tsv
│   │   ├── taxon_list_insecta.tsv
│   │   └── taxon_list.tsv
│   └── one_seq_per_order_658.fas
...(abbreviated)
└── scripts
        ├── add_taxids.pl
        ├── dereplicate.pl
        ...(abbreviated)

The COInr database is composed of two files

COInr.tsv contains sequenceIDs, taxIDs and sequences
taxonomy.tsv contains all taxIDs and associated information

Add custom sequences to a database

Format custom files

The input tsv file (-custom) contains seqIDs, taxon names (can be at any taxonomic level) and sequences (see the example data/example/my_sequences.tsv). The format_custom.pl script will suggest one or more lineages for each taxon name based on the existing lineages in taxonomy.tsv. It will also consider synonyms.

perl scripts/format_custom.pl -custom data/example/my_sequences.tsv -taxonomy COInr/taxonomy.tsv -outdir tutorial/custom/1_format

The output lineage file (custom_lineages.tsv) looks like this:

phylum  class   order   family  subfamily       genus   species homonymy        seqIDs
Mollusca        Bivalvia        Cardiida        Cardiidae               Acanthocardia   Acanthocardia paucicostata      0       Seq113;Seq88
NA      NA      NA      NA      NA      NA      Ilia nucleus    0       Seq117
Streptophyta    Magnoliopsida   Ericales        Ericaceae               Leucothoe               1       Seq96
Arthropoda      Malacostraca    Amphipoda       Leucothoidae            Leucothoe               1       Seq96
Annelida        Polychaeta      Phyllodocida    Polynoidae                              0       Seq65

This output should should be checked manually to see if the lineages are coherent. If homonymy, choose the correct lineage (e.g. for Leucothoe genus), then delete homonymy column.

If a taxon name is not present in the taxonomy file, the lineage should be completed manually (e.g. Ilia nucleus in the example file).

I created a revised version of the lineage file (data/example/custom_lineages_verified.tsv), which will be used in the next step:

phylum  class   order   family  subfamily       genus   species seqIDs
Mollusca        Bivalvia        Cardiida        Cardiidae               Acanthocardia   Acanthocardia paucicostata      Seq113;Seq88
Arthropoda      Malacostraca    Decapoda        Leucosiidae             Ilia    Ilia nucleus    Seq117
Arthropoda      Malacostraca    Amphipoda       Leucothoidae            Leucothoe               Seq96
Annelida        Polychaeta      Phyllodocida    Polynoidae                              Seq65

See details in description section: format_custom.pl script.

Add taxIDs to custom sequences

The add_taxids.pl script will

For each lineage in the input file
- Find an existing taxID at the lowest possible taxonomic level. taxIDs can be either from NCBI, or negative taxID already present in taxonomy.tsv.
- Add new arbitrary (negative) taxIDs to taxa not yet in the taxonomy file
- Link each new taxID to an existing one as a child and include info to the updated taxonomy file
Make a tsv file with sequences and taxIDs
Update the taxonomy.tsv file

perl scripts/add_taxids.pl -lineages data/example/custom_lineages_verified.tsv -sequences tutorial/custom/1_format/custom_sequences.tsv -taxonomy COInr/taxonomy.tsv -outdir tutorial/custom/2_add_taxids

See details in description section: add_taxids.pl script.

Dereplicate custom sequences

The dereplicate.pl script will eliminate sequences that are substrings of another sequence of the same taxID. Use sequences_with_taxIDs.tsv file (output of the previous script) as the input.

perl scripts/dereplicate.pl -tsv tutorial/custom/2_add_taxids/sequences_with_taxIDs.tsv -outdir tutorial/custom/3_dereplicate -out custom_dereplicated_sequences.tsv

The output file is in the same format as the input tsv file.

See details in description section: dereplicate.pl script.

Pool and dereplicate datasets

Use two dereplicated sequence tsv files:

COInr.tsv (pool of BOLD and NCBI, downloaded from Zenodo)
custom_dereplicated_sequences.tsv (output of the previous script)

pool_and_dereplicate.pl will pool the files and dereplicate sequences of the taxIDs that are present in both files.

perl scripts/pool_and_dereplicate.pl -tsv1 COInr/COInr.tsv -tsv2 tutorial/custom/3_dereplicate/custom_dereplicated_sequences.tsv -outdir tutorial/custom -out COInr_custom.tsv

The output is the same format as the input tsv file.

See details in description section: pool_and_dereplicate.pl script.

Custom database

Your custom database is composed of two files:

the dereplicated sequence file (COInr_custom.tsv)
the last version of the taxonomy file (taxonomy_updated.tsv)

For simplicity, move the updated taxonomy file to the same folder as the sequence file.

mv tutorial/custom/2_add_taxids/taxonomy_updated.tsv tutorial/custom/taxonomy_updated.tsv

This database can be further customized, or you can simply be formated to your taxonomic assignment program by the format_db.pl script.

Select sequences from existing database

Select sequences for a list of taxa with a minimum taxonomic rank

Sequences can be selected for a list of taxa and/or for a minimum taxonomic level (species/genus/family/order/class/phylum/kingdom/domain/root)

The input file (-taxon_list) contains a list of taxa and eventually their taxIDs (see example data/example/taxon_list.tsv).

perl scripts/select_taxa.pl -taxon_list data/example/taxon_list.tsv -tsv COInr/COInr.tsv -taxonomy COInr/taxonomy.tsv  -min_taxlevel species  -outdir tutorial/select_taxa_0 -out COInr_selected.tsv

The main output is a sequence tsv file (COInr_selected.tsv). A lineage file (taxa_with_lineages.tsv) is also written for all taxa in the taxon_list to check if they are coherent with the target taxon names.

See details in description section: select_taxa.pl script.

Excluding sequences of a taxon list

With the same script it is also possible to eliminate sequences of taxa instead of selecting them. Set the negative_list option to 1 to do that.

perl scripts/select_taxa.pl -taxon_list data/example/taxon_list.tsv -tsv COInr/COInr.tsv -taxonomy COInr/taxonomy.tsv  -min_taxlevel species -outdir tutorial/select_taxa_1 -out COInr_reduced.tsv -negative_list 1

See details in description section: select_taxa.pl script.

Select region

Sequences can be trimmed to a specific region of the COI gene by the select_region.pl script. To define the region, you can either give a fasta file with sequences trimmed to the region of interest, or you can detect it automatically by e-pcr.

Select region using the e_pcr option

The primers used in this example are amplifying a Leray fragment (ca. 313 bp of the second half of the barcode region).

perl scripts/select_region.pl -tsv COInr/COInr.tsv -outdir tutorial/select_region/ePCR -e_pcr 1 -fw GGNTGAACNGTNTAYCCNCC -rv TAWACTTCDGGRTGNCCRAARAAYCA -trim_error 0.3 -min_amplicon_length 280 -max_amplicon_length 345 -min_overlap 20 -tcov 0.8 -identity 0.7

Select region using the bait_fas option

Using the e_pcr option is an easy way to produce some sequences trimmed to the target region, and they can be used as a database to align all other sequences to them. However, if the parameters of the e_pcr are relaxed, it can produce some false positives. An alternative solution is to use a small, taxonomically divers fasta file, with sequences already trimmed to the target region (-bait_fas option). An example of such a file is given in the data directory (data/one_seq_per_order_658.fas). It contains one sequence for each taxonomic order among the taxa that have a compete mitochondrial genome available in GenBank. Sequences are trimmed to the approximately 658 bp (depending on the taxon) barcode fragment of the COI gene.

perl scripts/select_region.pl -tsv COInr/COInr.tsv -outdir tutorial/select_region/bait_fas -e_pcr 0 -bait_fas data/one_seq_per_order_658.fas -tcov 0.8 -identity 0.7

See details in description section: select_region.pl script.

Format database

Format the database to one of the following formats

qiime
rdp
full
blast
vtam
sintax

qiime

perl scripts/format_db.pl -tsv COInr/COInr.tsv -taxonomy COInr/taxonomy.tsv -outfmt qiime -outdir COInr/qiime -out COInr_qiime

rdp

perl scripts/format_db.pl -tsv COInr/COInr.tsv -taxonomy COInr/taxonomy.tsv -outfmt rdp -outdir COInr/rdp -out COInr_rdp

You should use the rdp_calssifier or qiime’s feature-classifier to train the database using the output files of this script if you have used the rdp or qiime options.

full

The full option, gives a tsv file with seqIDs, ranked lineages, taxIDs for each sequence, and this is a very easy-to-parse, complete file.

perl scripts/format_db.pl -tsv COInr/COInr.tsv -taxonomy COInr/taxonomy.tsv -outfmt full -outdir COInr/full -out COInr_full

sintax

perl scripts/format_db.pl -tsv COInr/COInr.tsv -taxonomy COInr/taxonomy.tsv -outfmt sintax -outdir COInr/sintax -out COInr_sintax

blast

For making a BLAST database, the taxonomy file is not necessary and the indexed files in the output folder are ready to use.

perl scripts/format_db.pl -tsv COInr/COInr.tsv -outfmt blast -outdir COInr/blast -out COInr_blast

vtam

The vtam option produces a BLAST database and a taxonomy file adapted to VTAM .

perl scripts/format_db.pl -tsv COInr/COInr.tsv -taxonomy COInr/taxonomy.tsv -outfmt vtam -outdir COInr/vtam -out COInr_vtam

See details in description section: format_db.pl script.

Chaining steps to make a custom database

In the above examples, we have started from the COInr database. However, you can chain the different commands.

Bellow, I will show you how to create a database with the following characteristics:

Eukaryota sequences
Excluding insects
Enriched with custom sequences
Sequences assigned at least to genus level
Trimmed to the Leray fragment (ca. 313 nt of the second half of the barcode region) of the COI gene (keep sequences if cover at least 90% of the target region)
rdp_classifier format

Notes:

It is a good idea to start with steps that are relatively quick and reduce the size of the database.
Since, over 70% of the sequences are from Insecta in COInr, we will start by eliminating them.
The custom sequences are all Non-Insect Eukaryotes, so we can add custom sequences to the reduced dataset. Otherwise, we should have started by adding custom sequences. This solution is also fine, but gives large intermediate files.
The selection of the target region is the most computationally intensive, and the more diverse the dataset, the less precise it is. So it is preferable to do this at the end of the pipeline.

Exclude Insecta and sequences with resolution lower than genus

perl scripts/select_taxa.pl -taxon_list data/example/taxon_list_insecta.tsv -tsv COInr/COInr.tsv -taxonomy COInr/taxonomy.tsv  -min_taxlevel genus -outdir tutorial/chained/1_noInsecta -out COInr_noIns.tsv -negative_list 1

Keep Eukaryota

perl scripts/select_taxa.pl -taxon_list data/example/taxon_list_eukaryota.tsv -tsv tutorial/chained/1_noInsecta/COInr_noIns.tsv -taxonomy COInr/taxonomy.tsv -outdir tutorial/chained/2_Eukaryota -out COInr_noIns_Euk.tsv

Add custom sequences

perl scripts/format_custom.pl -custom data/example/my_sequences.tsv -taxonomy COInr/taxonomy.tsv -outdir tutorial/chained/3_add_custom/1_format

Check and format the custom_lineages.tsv and make custom_lineages_verified.tsv as in Add custom sequences to a database section.

perl scripts/add_taxids.pl -lineages data/example/custom_lineages_verified.tsv -sequences tutorial/chained/3_add_custom/1_format/custom_sequences.tsv -taxonomy COInr/taxonomy.tsv -outdir tutorial/chained/3_add_custom/2_add_taxids

perl scripts/dereplicate.pl -tsv tutorial/chained/3_add_custom/2_add_taxids/sequences_with_taxIDs.tsv -outdir tutorial/chained/3_add_custom/3_dereplicate -out custom_dereplicated_sequences.tsv

Add the formatted, dereplicated custom sequences to the sequences in COInr_noIns_Euk.tsv

perl scripts/pool_and_dereplicate.pl -tsv1 tutorial/chained/2_Eukaryota/COInr_noIns_Euk.tsv -tsv2 tutorial/chained/3_add_custom/3_dereplicate/custom_dereplicated_sequences.tsv -outdir tutorial/chained/3_add_custom -out COInr_noIns_Euk_custom.tsv

mv tutorial/chained/3_add_custom/2_add_taxids/taxonomy_updated.tsv tutorial/chained/3_add_custom/taxonomy_updated.tsv

Keep only sequences with genus or higher resolution

We have eliminated sequences with lower than genus resolution from COInr in the first step (-min_taxlevel genus). However, among the custom sequences we had a sequence with an unknown genus. So let’s redo the selection for a minimum taxonomic level.

Yes, you are right! We could have just avoided adding that sequence to the database in the previous step. But if you have many custom sequences, you might just be lazy to check the custom sequences manually, and in that case you can use mkCOInr to this for you.

Attention: From now on, we have to use the updated taxonomy file (taxonomy_updated.tsv), since some of the taxa of the custom sequences might not be in the original taxonomy.tsv file.

perl scripts/select_taxa.pl -tsv tutorial/chained/3_add_custom/COInr_noIns_Euk_custom.tsv -taxonomy tutorial/chained/3_add_custom/taxonomy_updated.tsv -outdir tutorial/chained/4_genus -out COInr_noIns_Euk_custom_genus.tsv

Trim to Leray region

perl scripts/select_region.pl -tsv tutorial/chained/4_genus/COInr_noIns_Euk_custom_genus.tsv -outdir tutorial/chained/5_select_region -e_pcr 1 -fw GGNTGAACNGTNTAYCCNCC -rv TAWACTTCDGGRTGNCCRAARAAYCA -trim_error 0.3 -min_amplicon_length 280 -max_amplicon_length 345 -min_overlap 20 -tcov 0.9 -identity 0.7

Format for RDP_classifier

perl scripts/format_db.pl -tsv tutorial/chained/5_select_region/trimmed.tsv -taxonomy tutorial/chained/3_add_custom/taxonomy_updated.tsv -outfmt rdp -outdir tutorial/chained/6_rdp -out COInr_customized

Create COInr from BOLD and NCBI

The following steps describe how COInr database (available at Zenodo ) was produced.

Download NCBI taxonomy

Download NCBI taxonomy dmp file and create taxonomy.tsv.

cd mkCOInr
perl scripts/download_taxonomy.pl -outdir COInr_new/taxonomy

See details in description section: download_taxonomy.pl script.

NCBI sequences

Download NCBI sequences

The following command will download Coding DNA Sequence (CDS) fasta files of all sequences with COI, CO1, COXI or COX1 in the title lines and complete mitochondrial genomes. It takes several hours (days) to run this command.

nsdpy -r "COI OR COX1 OR CO1 OR COXI OR (complete[Title] AND genome[Title] AND Mitochondrion[Filter])" -T -v --cds

The results are found in the NSDPY_results/yyyy-mm-dd_hh-mm-ss folder.

The sequences.fasta file contains all CDS sequences. Sequences are correctly oriented but should still be filtered to keep only COI sequences. TaxIDs.txt contains the sequenceIDs and the TaxIDs.

Move the results of nsdpy to the COInr_new/ncbi/download directory and clean up the directory.

mkdir -p COInr_new/ncbi
mv NSDPY_results/yyyy-mm-dd_hh-mm-ss COInr_new/ncbi/download
mv report.tsv COInr_new/ncbi/download
rmdir NSDPY_results

Format NCBI sequences

The format_ncbi.pl script will

Select COI sequences and clean them.
Eliminate identical sequences of the same taxID.
Clean tax names and taxids.

perl scripts/format_ncbi.pl -cds COInr_new/ncbi/download/sequences.fasta -taxids COInr_new/ncbi/download/TaxIDs.txt -taxonomy COInr_new/taxonomy/taxonomy.tsv -outdir COInr_new/ncbi/format

The major output is a sequence tsv file with taxIDs.

See details in description section: format_ncbi.pl script.

Dereplicate NCBI sequences

Eliminate sequences that are substring of another sequence of the same taxID.

perl scripts/dereplicate.pl -tsv COInr_new/ncbi/format/ncbi_sequences.tsv -outdir COInr_new/ncbi/dereplicate -out ncbi_dereplicated_sequences.tsv

The output is the same format as the input tsv file.

See details in description section: dereplicate.pl script.

BOLD sequences

Download BOLD sequences

The download_bold.pl script is deprecated. The BOLD API used in download_bold.pl do not allow anymore to download large data files.

It is possible, however, to download all public sequences as a data package from https://www.boldsystems.org/index.php/datapackages. You need to have a BOLD account for downloading the data package in (tar.gz compressed) format, that contains a TSV file with sequences, taxonomic lineages and other metadata. This uncompressed file will be the input of format_bold_package.pl.

Format BOLD sequences

The format_bold_package.pl script will

Select COI sequences and clean them
Select sequences with out without BIN_URI according to the delete_noBIN argument
Eliminate identical sequences of the same lineage
Clean lineages and make a list with corresponding sequenceIDs

perl scripts/format_bold_package.pl -bold_data COInr_new/bold/download/BOLD_Public.26-Apr-2024.tsv -outdir COInr_new/bold/format -delete_noBIN 1

The major output is the following:

bold_sequences.tsv
bold_lineages.tsv (all identical lineages are pooled into a same line)

See details in description section: format_bold_package.pl script.

Add taxIDs to BOLD sequences

For each lineage the add_taxids.pl script will

Find an existing taxID at the lowest level possible. TaxIDs can be either from NCBI, or negative taxID already present in taxonomy.tsv.
Add new arbitrary (negative) taxIDs to taxa, that are not yet in taxonomy.tsv
Link each new taxID to existing one as a child and include info to the updated taxonomy file
Update the input taxonomy file

perl scripts/add_taxids.pl -lineages COInr_new/bold/format/bold_lineages.tsv -sequences COInr_new/bold/format/bold_sequences.tsv -taxonomy COInr_new/taxonomy/taxonomy.tsv -outdir COInr_new/bold/add_taxids

The main output files are the following:

See details in description section: add_taxids.pl script.

Dereplicate BOLD sequences

Eliminate sequences that are substring of another sequence of the same taxID.

perl scripts/dereplicate.pl -tsv COInr_new/bold/add_taxids/sequences_with_taxIDs.tsv -outdir COInr_new/bold/dereplicate -out bold_dereplicated_sequences.tsv

The output is the same format as the input tsv file.

See details in description section: dereplicate.pl script.

Pool and dereplicate datasets

Use the dereplicated sequence files from BOLD and NCBI. The pool_and_dereplicate.pl script will pool the files and dereplicate sequences of a taxID that are present in both files.

perl scripts/pool_and_dereplicate.pl -tsv1 COInr_new/bold/dereplicate/bold_dereplicated_sequences.tsv -tsv2 COInr_new/ncbi/dereplicate/ncbi_dereplicated_sequences.tsv -outdir COInr_new -out COInr.tsv

The output is the same format as the input tsv file.

See details in description section: pool_and_dereplicate.pl script.

Move the taxonomy file to the same directory

mv COInr_new/bold/add_taxids/taxonomy_updated.tsv COInr_new/taxonomy.tsv