.. _tutorial_tutorial:

Tutorial
============

After installing mkCOInr you have a file system like this:

.. code-block:: bash

    mkCOInr
    ├── data
    │   ├── bold_taxon_list_2022-02-24.txt
    │   ├── example
    │   │   ├── custom_lineages_verified.tsv
    │   │   ├── my_sequences.tsv
    │   │   ├── taxon_list_eukaryota.tsv
    │   │   ├── taxon_list_insecta.tsv
    │   │   └── taxon_list.tsv
    │   └── one_seq_per_order_658.fas
    ├── doc
    ... (abbreviated)
    └── scripts
        ├── add_taxids.pl
        ├── dereplicate.pl
        ├── download_bold.pl
        ├── download_taxonomy.pl
        ├── format_bold.pl
        ├── format_custom.pl
        ├── format_db.pl
        ├── format_ncbi.pl
        ├── get_subtaxa.pl
        ├── mkdb.pm
        ├── pool_and_dereplicate.pl
        ├── select_region.pl
        └── select_taxa.pl

.. _customize_tutorial:

Customize database
-------------------------------------------------

In the first part of the tutorial, I start from the COInr database in each major step to illustrate

- :ref:`How to include custom sequences `
- :ref:`How to select or eliminate sequences of a list of taxa or a minimum resolution `
- :ref:`How to select a target region `
- :ref:`How to format a dataset to different database formats `

These steps can be executed independently. The last example shows how to :ref:`create a pipeline ` by combining different commands.

The creation of the COInr database is explained in the :ref:`Create COInr from BOLD and NCBI section `. You can download this database from `Zenodo `_ and customize it to your needs.

.. _download_coinr_tutorial:

Download and untar COInr
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For later releases, get the up-to-date link from Zenodo and change the date in the file name accordingly.

.. code-block:: bash

    cd mkCOInr
    wget https://zenodo.org/record/6555985/files/COInr_2022_05_06.tar.gz
    tar -zxvf COInr_2022_05_06.tar.gz
    rm COInr_2022_05_06.tar.gz

To shorten the paths in this tutorial, rename the COInr_2022_05_06 directory to COInr.

.. code-block:: bash

    mv COInr_2022_05_06 COInr

This gives the following file structure

.. code-block:: bash

    mkCOInr
    ├── COInr
    │   ├── COInr.tsv
    │   └── taxonomy.tsv
    ├── data
    │   ├── bold_taxon_list_2022-02-24.txt
    │   ├── example
    │   │   ├── custom_lineages_verified.tsv
    │   │   ├── my_sequences.tsv
    │   │   ├── taxon_list_eukaryota.tsv
    │   │   ├── taxon_list_insecta.tsv
    │   │   └── taxon_list.tsv
    │   └── one_seq_per_order_658.fas
    ...(abbreviated)
    └── scripts
        ├── add_taxids.pl
        ├── dereplicate.pl
    ...(abbreviated)

The COInr database is composed of two files

- :ref:`COInr.tsv ` contains :ref:`sequenceIDs `, :ref:`taxIDs ` and sequences
- :ref:`taxonomy.tsv ` contains all taxIDs and associated information

.. _add_custom_sequences_tutorial:

Add custom sequences to a database
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. _format_custom_tutorial:

Format custom files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :ref:`input tsv file ` (-custom) contains :ref:`seqIDs `, taxon names (at any taxonomic level) and sequences (see the example data/example/my_sequences.tsv).

The :ref:`format_custom.pl ` script suggests one or more lineages for each taxon name based on the existing lineages in :ref:`taxonomy.tsv `. It also considers synonyms.

.. code-block:: bash

    perl scripts/format_custom.pl -custom data/example/my_sequences.tsv -taxonomy COInr/taxonomy.tsv -outdir tutorial/custom/1_format

The output lineage file (custom_lineages.tsv) looks like this:

.. code-block:: bash

    phylum	class	order	family	subfamily	genus	species	homonymy	seqIDs
    Mollusca	Bivalvia	Cardiida	Cardiidae		Acanthocardia	Acanthocardia paucicostata	0	Seq113;Seq88
    NA	NA	NA	NA	NA	NA	Ilia nucleus	0	Seq117
    Streptophyta	Magnoliopsida	Ericales	Ericaceae		Leucothoe		1	Seq96
    Arthropoda	Malacostraca	Amphipoda	Leucothoidae		Leucothoe		1	Seq96
    Annelida	Polychaeta	Phyllodocida	Polynoidae				0	Seq65

This output should be checked manually to verify that the lineages are coherent. In case of homonymy, choose the correct lineage (e.g. for the *Leucothoe* genus), then delete the homonymy column. If a taxon name is not present in the taxonomy file, the lineage should be completed manually (e.g. *Ilia nucleus* in the example file).

I created a revised version of the lineage file (data/example/custom_lineages_verified.tsv), which will be used in the next step:

.. code-block:: bash

    phylum	class	order	family	subfamily	genus	species	seqIDs
    Mollusca	Bivalvia	Cardiida	Cardiidae		Acanthocardia	Acanthocardia paucicostata	Seq113;Seq88
    Arthropoda	Malacostraca	Decapoda	Leucosiidae		Ilia	Ilia nucleus	Seq117
    Arthropoda	Malacostraca	Amphipoda	Leucothoidae		Leucothoe		Seq96
    Annelida	Polychaeta	Phyllodocida	Polynoidae				Seq65

See details in description section: :ref:`format_custom.pl ` script.

.. _add_taxids_custom_tutorial:

Add taxIDs to custom sequences
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :ref:`add_taxids.pl ` script will

- For each lineage in the input file

  - Find an existing taxID at the lowest possible taxonomic level. :ref:`taxIDs ` can be either from NCBI, or negative taxIDs already present in :ref:`taxonomy.tsv `.
  - Add new arbitrary (negative) taxIDs to taxa not yet in the taxonomy file
  - Link each new taxID to an existing one as a child and include this information in the updated taxonomy file

- Make a :ref:`tsv file with sequences and taxIDs `
- Update the :ref:`taxonomy.tsv ` file

.. code-block:: bash

    perl scripts/add_taxids.pl -lineages data/example/custom_lineages_verified.tsv -sequences tutorial/custom/1_format/custom_sequences.tsv -taxonomy COInr/taxonomy.tsv -outdir tutorial/custom/2_add_taxids

See details in description section: :ref:`add_taxids.pl ` script.

.. _dereplicate_custom_tutorial:

Dereplicate custom sequences
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :ref:`dereplicate.pl ` script eliminates sequences that are substrings of another sequence of the same :ref:`taxID `. Use the :ref:`sequences_with_taxIDs.tsv ` file (output of the previous script) as the input.

.. code-block:: bash

    perl scripts/dereplicate.pl -tsv tutorial/custom/2_add_taxids/sequences_with_taxIDs.tsv -outdir tutorial/custom/3_dereplicate -out custom_dereplicated_sequences.tsv

The output file is in the same format as the input tsv file.

See details in description section: :ref:`dereplicate.pl ` script.

.. _pool_and_dereplicate_custom_tutorial:

Pool and dereplicate datasets
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Use two dereplicated :ref:`sequence tsv files `:

- COInr.tsv (pool of BOLD and NCBI, downloaded from Zenodo)
- custom_dereplicated_sequences.tsv (output of the previous script)

:ref:`pool_and_dereplicate.pl ` pools the files and dereplicates the sequences of the taxIDs that are present in both files.

.. code-block:: bash

    perl scripts/pool_and_dereplicate.pl -tsv1 COInr/COInr.tsv -tsv2 tutorial/custom/3_dereplicate/custom_dereplicated_sequences.tsv -outdir tutorial/custom -out COInr_custom.tsv

The output is in the same format as the input tsv file.

See details in description section: :ref:`pool_and_dereplicate.pl ` script.
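To make the dereplication rule used in the two previous steps concrete (a sequence is dropped when it is a substring of another sequence of the same taxID), here is a minimal shell sketch run on a toy file. It is an illustration only and not how dereplicate.pl or pool_and_dereplicate.pl are implemented. The three-column layout without a header line (seqID, taxID, sequence), the demo data and the ``naive_dereplicate`` function name are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Illustration only: NOT the implementation used by dereplicate.pl.
# Assumed (hypothetical) layout: seqID <TAB> taxID <TAB> sequence, no header line.

demo=$(mktemp)
printf 'Seq1\t101\tACGTACGTAA\n'  > "$demo"
printf 'Seq2\t101\tCGTACG\n'     >> "$demo"   # substring of Seq1, same taxID: dropped
printf 'Seq3\t202\tCGTACG\n'     >> "$demo"   # same sequence, different taxID: kept
printf 'Seq4\t101\tTTTTGGGG\n'   >> "$demo"   # not a substring of anything: kept

naive_dereplicate() {
  # Quadratic pairwise comparison; fine for a toy file, far too slow for COInr.
  awk -F'\t' '
    { id[NR] = $1; tax[NR] = $2; seq[NR] = $3 }
    END {
      for (i = 1; i <= NR; i++) {
        keep = 1
        for (j = 1; j <= NR; j++) {
          if (i == j || tax[i] != tax[j]) continue
          # drop i if it is contained in a longer sequence, or is an identical later copy
          if (index(seq[j], seq[i]) > 0 && (length(seq[j]) > length(seq[i]) || j < i)) {
            keep = 0
            break
          }
        }
        if (keep) print id[i] "\t" tax[i] "\t" seq[i]
      }
    }' "$1"
}

# Collect the seqIDs that survive dereplication on one line.
kept_ids=$(naive_dereplicate "$demo" | cut -f1 | paste -sd' ' -)
echo "kept: $kept_ids"
rm -f "$demo"
```

The comparison is restricted to sequences sharing a taxID, which is why Seq3 survives even though its sequence is contained in Seq1.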
Custom database
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Your custom database is composed of two files:

- the dereplicated sequence file (COInr_custom.tsv)
- the last version of the taxonomy file (taxonomy_updated.tsv)

For simplicity, move the updated taxonomy file to the same folder as the sequence file.

.. code-block:: bash

    mv tutorial/custom/2_add_taxids/taxonomy_updated.tsv tutorial/custom/taxonomy_updated.tsv

This database can be further customized, or simply formatted for your taxonomic assignment program with the :ref:`format_db.pl ` script.

.. _select_sequences_custom_tutorial:

Select sequences from existing database
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Select sequences for a list of taxa with a minimum taxonomic rank
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Sequences can be selected for a list of taxa and/or for a minimum taxonomic level (species/genus/family/order/class/phylum/kingdom/domain/root).

The input file (:ref:`-taxon_list `) contains a list of taxa and, optionally, their taxIDs (see example data/example/taxon_list.tsv).

.. code-block:: bash

    perl scripts/select_taxa.pl -taxon_list data/example/taxon_list.tsv -tsv COInr/COInr.tsv -taxonomy COInr/taxonomy.tsv -min_taxlevel species -outdir tutorial/select_taxa_0 -out COInr_selected.tsv

The main output is a :ref:`sequence tsv file ` (COInr_selected.tsv).

A :ref:`lineage file ` (taxa_with_lineages.tsv) is also written for all taxa in the taxon_list, so that you can check whether the lineages are coherent with the target taxon names.

See details in description section: :ref:`select_taxa.pl ` script.

Excluding sequences of a taxon list
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

With the same script it is also possible to eliminate the sequences of the listed taxa instead of selecting them. To do that, set the *negative_list* option to 1.

.. code-block:: bash

    perl scripts/select_taxa.pl -taxon_list data/example/taxon_list.tsv -tsv COInr/COInr.tsv -taxonomy COInr/taxonomy.tsv -min_taxlevel species -outdir tutorial/select_taxa_1 -out COInr_reduced.tsv -negative_list 1

See details in description section: :ref:`select_taxa.pl ` script.

.. _select_region_custom_tutorial:

Select region
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Sequences can be trimmed to a specific region of the COI gene by the :ref:`select_region.pl ` script. To define the region, you can either give a fasta file with sequences trimmed to the region of interest, or you can detect it automatically by e-pcr.

.. _select_region_e_pcr_custom_tutorial:

Select region using the e_pcr option
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The primers used in this example amplify a Leray fragment (ca. 313 bp of the second half of the barcode region).

.. code-block:: bash

    perl scripts/select_region.pl -tsv COInr/COInr.tsv -outdir tutorial/select_region/ePCR -e_pcr 1 -fw GGNTGAACNGTNTAYCCNCC -rv TAWACTTCDGGRTGNCCRAARAAYCA -trim_error 0.3 -min_amplicon_length 280 -max_amplicon_length 345 -min_overlap 20 -tcov 0.8 -identity 0.7

.. _select_region_bait_fas_custom_tutorial:

Select region using the bait_fas option
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Using the *e_pcr* option is an easy way to produce some sequences trimmed to the target region, which can then be used as a database to align all other sequences to. However, if the e_pcr parameters are relaxed, it can produce some false positives.

An alternative solution is to use a small, taxonomically diverse fasta file, with sequences already trimmed to the target region (-*bait_fas* option).

An example of such a file is given in the data directory (data/one_seq_per_order_658.fas). It contains one sequence for each taxonomic order among the taxa that have a complete mitochondrial genome available in GenBank.
Sequences are trimmed to the barcode fragment of the COI gene, which is approximately 658 bp long (depending on the taxon).

.. code-block:: bash

    perl scripts/select_region.pl -tsv COInr/COInr.tsv -outdir tutorial/select_region/bait_fas -e_pcr 0 -bait_fas data/one_seq_per_order_658.fas -tcov 0.8 -identity 0.7

See details in description section: :ref:`select_region.pl ` script.

.. _format_db_custom_tutorial:

Format database
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Format the database to one of the following formats

- qiime
- rdp
- full
- blast
- vtam
- sintax

**qiime**

.. code-block:: bash

    perl scripts/format_db.pl -tsv COInr/COInr.tsv -taxonomy COInr/taxonomy.tsv -outfmt qiime -outdir COInr/qiime -out COInr_qiime

**rdp**

.. code-block:: bash

    perl scripts/format_db.pl -tsv COInr/COInr.tsv -taxonomy COInr/taxonomy.tsv -outfmt rdp -outdir COInr/rdp -out COInr_rdp

If you used the rdp or qiime option, train the database with rdp_classifier or qiime's feature-classifier, using the output files of this script.

**full**

The full option gives a :ref:`tsv file ` with the seqID, ranked lineage and taxID of each sequence. It is a complete file that is easy to parse.

.. code-block:: bash

    perl scripts/format_db.pl -tsv COInr/COInr.tsv -taxonomy COInr/taxonomy.tsv -outfmt full -outdir COInr/full -out COInr_full

**sintax**

.. code-block:: bash

    perl scripts/format_db.pl -tsv COInr/COInr.tsv -taxonomy COInr/taxonomy.tsv -outfmt sintax -outdir COInr/sintax -out COInr_sintax

**blast**

For making a BLAST database, the taxonomy file is not necessary and the indexed files in the output folder are ready to use.

.. code-block:: bash

    perl scripts/format_db.pl -tsv COInr/COInr.tsv -outfmt blast -outdir COInr/blast -out COInr_blast

**vtam**

The vtam option produces a BLAST database and a taxonomy file adapted to `VTAM `_ .

.. code-block:: bash

    perl scripts/format_db.pl -tsv COInr/COInr.tsv -taxonomy COInr/taxonomy.tsv -outfmt vtam -outdir COInr/vtam -out COInr_vtam

See details in description section: :ref:`format_db.pl ` script.

.. _chained_custom_tutorial:

Chaining steps to make a custom database
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the above examples, we always started from the COInr database. However, you can also chain the different commands.

Below, I will show you how to create a database with the following characteristics:

- Eukaryota sequences
- Excluding insects
- Enriched with custom sequences
- Sequences assigned at least to genus level
- Trimmed to the Leray fragment (ca. 313 nt of the second half of the barcode region) of the COI gene (sequences are kept if they cover at least 90% of the target region)
- rdp_classifier format

**Notes**:

- It is a good idea to start with steps that are relatively quick and reduce the size of the database.
- Since over 70% of the sequences in COInr are from Insecta, we will start by eliminating them.
- The custom sequences are all non-insect eukaryotes, so we can add them to the reduced dataset. Otherwise, we would have had to start by adding the custom sequences. That solution is also fine, but gives large intermediate files.
- The selection of the target region is the most computationally intensive step, and the more diverse the dataset, the less precise it is. So it is preferable to do this at the end of the pipeline.

.. _exclude_insecta_tutorial:

Exclude Insecta and sequences with resolution lower than genus
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    perl scripts/select_taxa.pl -taxon_list data/example/taxon_list_insecta.tsv -tsv COInr/COInr.tsv -taxonomy COInr/taxonomy.tsv -min_taxlevel genus -outdir tutorial/chained/1_noInsecta -out COInr_noIns.tsv -negative_list 1

.. _keep_eukaryota_tutorial:

Keep Eukaryota
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    perl scripts/select_taxa.pl -taxon_list data/example/taxon_list_eukaryota.tsv -tsv tutorial/chained/1_noInsecta/COInr_noIns.tsv -taxonomy COInr/taxonomy.tsv -outdir tutorial/chained/2_Eukaryota -out COInr_noIns_Euk.tsv

.. _add_custom_chained_tutorial:

Add custom sequences
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    perl scripts/format_custom.pl -custom data/example/my_sequences.tsv -taxonomy COInr/taxonomy.tsv -outdir tutorial/chained/3_add_custom/1_format

Check and correct custom_lineages.tsv and make custom_lineages_verified.tsv as in the :ref:`Add custom sequences to a database ` section.

.. code-block:: bash

    perl scripts/add_taxids.pl -lineages data/example/custom_lineages_verified.tsv -sequences tutorial/chained/3_add_custom/1_format/custom_sequences.tsv -taxonomy COInr/taxonomy.tsv -outdir tutorial/chained/3_add_custom/2_add_taxids

    perl scripts/dereplicate.pl -tsv tutorial/chained/3_add_custom/2_add_taxids/sequences_with_taxIDs.tsv -outdir tutorial/chained/3_add_custom/3_dereplicate -out custom_dereplicated_sequences.tsv

Add the formatted, dereplicated custom sequences to the sequences in COInr_noIns_Euk.tsv

.. code-block:: bash

    perl scripts/pool_and_dereplicate.pl -tsv1 tutorial/chained/2_Eukaryota/COInr_noIns_Euk.tsv -tsv2 tutorial/chained/3_add_custom/3_dereplicate/custom_dereplicated_sequences.tsv -outdir tutorial/chained/3_add_custom -out COInr_noIns_Euk_custom.tsv

    mv tutorial/chained/3_add_custom/2_add_taxids/taxonomy_updated.tsv tutorial/chained/3_add_custom/taxonomy_updated.tsv

.. _keep_genus_tutorial:

Keep only sequences with genus or higher resolution
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the first step, we eliminated the COInr sequences with a resolution lower than genus (-min_taxlevel genus).
However, among the custom sequences we had a sequence with an unknown genus. So let's redo the selection for a minimum taxonomic level.

Yes, you are right! We could have simply avoided adding that sequence to the database in the previous step. But if you have many custom sequences, you might not want to check them all manually, and in that case you can use mkCOInr to do this for you.

**Attention**: From now on, we have to use the updated taxonomy file (taxonomy_updated.tsv), since some of the taxa of the custom sequences might not be in the original taxonomy.tsv file.

.. code-block:: bash

    perl scripts/select_taxa.pl -tsv tutorial/chained/3_add_custom/COInr_noIns_Euk_custom.tsv -taxonomy tutorial/chained/3_add_custom/taxonomy_updated.tsv -min_taxlevel genus -outdir tutorial/chained/4_genus -out COInr_noIns_Euk_custom_genus.tsv

.. _trim_to_leray_tutorial:

Trim to Leray region
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    perl scripts/select_region.pl -tsv tutorial/chained/4_genus/COInr_noIns_Euk_custom_genus.tsv -outdir tutorial/chained/5_select_region -e_pcr 1 -fw GGNTGAACNGTNTAYCCNCC -rv TAWACTTCDGGRTGNCCRAARAAYCA -trim_error 0.3 -min_amplicon_length 280 -max_amplicon_length 345 -min_overlap 20 -tcov 0.9 -identity 0.7

.. _format_rdp_chained_tutorial:

Format for RDP_classifier
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: bash

    perl scripts/format_db.pl -tsv tutorial/chained/5_select_region/trimmed.tsv -taxonomy tutorial/chained/3_add_custom/taxonomy_updated.tsv -outfmt rdp -outdir tutorial/chained/6_rdp -out COInr_customized

.. _create_coinr_tutorial:

Create COInr from BOLD and NCBI
-------------------------------------------------

The following steps describe how the COInr database (available at `Zenodo `_ ) was produced.

.. _download_ncbi_taxonomy_tutorial:

Download NCBI taxonomy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Download the NCBI taxonomy dmp file and create :ref:`taxonomy.tsv `.

.. code-block:: bash

    cd mkCOInr
    perl scripts/download_taxonomy.pl -outdir COInr_new/taxonomy

See details in description section: :ref:`download_taxonomy.pl ` script.

.. _ncbi_sequences_tutorial:

NCBI sequences
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Download NCBI sequences
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following command downloads the Coding DNA Sequence (CDS) fasta files of all sequences with COI, CO1, COXI or COX1 in the title lines, and of complete mitochondrial genomes. This command takes several hours, possibly days, to run.

.. code-block:: bash

    nsdpy -r "COI OR COX1 OR CO1 OR COXI OR (complete[Title] AND genome[Title] AND Mitochondrion[Filter])" -T -v --cds

The results are found in the NSDPY_results/yyyy-mm-dd_hh-mm-ss folder.

The sequences.fasta file contains all CDS sequences. Sequences are correctly oriented but should still be filtered to keep only COI sequences.

TaxIDs.txt contains the sequenceIDs and the TaxIDs.

Move the results of nsdpy to the COInr_new/ncbi/download directory and clean up the directory.

.. code-block:: bash

    mkdir -p COInr_new/ncbi
    mv NSDPY_results/yyyy-mm-dd_hh-mm-ss COInr_new/ncbi/download
    mv report.tsv COInr_new/ncbi/download
    rmdir NSDPY_results

Format NCBI sequences
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :ref:`format_ncbi.pl ` script will

- Select COI sequences and clean them.
- Eliminate identical sequences of the same taxID.
- Clean tax names and taxIDs.

.. code-block:: bash

    perl scripts/format_ncbi.pl -cds COInr_new/ncbi/download/sequences.fasta -taxids COInr_new/ncbi/download/TaxIDs.txt -taxonomy COInr_new/taxonomy/taxonomy.tsv -outdir COInr_new/ncbi/format

The major output is a :ref:`sequence tsv file with taxIDs `.
See details in description section: :ref:`format_ncbi.pl ` script.

Dereplicate NCBI sequences
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Eliminate sequences that are substrings of another sequence of the same :ref:`taxID `.

.. code-block:: bash

    perl scripts/dereplicate.pl -tsv COInr_new/ncbi/format/ncbi_sequences.tsv -outdir COInr_new/ncbi/dereplicate -out ncbi_dereplicated_sequences.tsv

The output is in the same format as the input tsv file.

See details in description section: :ref:`dereplicate.pl ` script.

.. _bold_sequences_tutorial:

BOLD sequences
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Download BOLD sequences
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :ref:`download_bold.pl ` script is deprecated. The BOLD API used in download_bold.pl no longer allows downloading large data files.

It is possible, however, to download all public sequences as a data package from `https://www.boldsystems.org/index.php/datapackages `_. You need a BOLD account to download the data package, a tar.gz compressed archive that contains a TSV file with sequences, taxonomic lineages and other metadata. The uncompressed TSV file is the input of format_bold_package.pl.

Format BOLD sequences
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :ref:`format_bold_package.pl ` script will

- Select COI sequences and clean them
- Select sequences with or without BIN_URI according to the delete_noBIN argument
- Eliminate identical sequences of the same lineage
- Clean lineages and make a list with corresponding sequenceIDs

.. code-block:: bash

    perl scripts/format_bold_package.pl -bold_data COInr_new/bold/download/BOLD_Public.26-Apr-2024.tsv -outdir COInr_new/bold/format -delete_noBIN 1

The major output files are the following:

- :ref:`bold_sequences.tsv `
- :ref:`bold_lineages.tsv ` (all identical lineages are pooled into a single line)

See details in description section: :ref:`format_bold_package.pl ` script.

Add taxIDs to BOLD sequences
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For each lineage the :ref:`add_taxids.pl ` script will

- Find an existing :ref:`taxID ` at the lowest possible taxonomic level. TaxIDs can be either from NCBI, or negative taxIDs already present in :ref:`taxonomy.tsv `.
- Add new arbitrary (negative) taxIDs to taxa that are not yet in taxonomy.tsv
- Link each new taxID to an existing one as a child and include this information in the updated taxonomy file
- Update the input taxonomy file

.. code-block:: bash

    perl scripts/add_taxids.pl -lineages COInr_new/bold/format/bold_lineages.tsv -sequences COInr_new/bold/format/bold_sequences.tsv -taxonomy COInr_new/taxonomy/taxonomy.tsv -outdir COInr_new/bold/add_taxids

The main output files are the following:

- :ref:`sequences_with_taxIDs.tsv `
- :ref:`taxonomy_updated.tsv `

See details in description section: :ref:`add_taxids.pl ` script.

Dereplicate BOLD sequences
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Eliminate sequences that are substrings of another sequence of the same taxID.

.. code-block:: bash

    perl scripts/dereplicate.pl -tsv COInr_new/bold/add_taxids/sequences_with_taxIDs.tsv -outdir COInr_new/bold/dereplicate -out bold_dereplicated_sequences.tsv

The output is in the same format as the input tsv file.

See details in description section: :ref:`dereplicate.pl ` script.

.. _pool_and_dereplicate_tutorial:

Pool and dereplicate datasets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use the dereplicated sequence files from BOLD and NCBI.
The :ref:`pool_and_dereplicate.pl ` script will pool the files and dereplicate sequences of a taxID that are present in both files.

.. code-block:: bash

    perl scripts/pool_and_dereplicate.pl -tsv1 COInr_new/bold/dereplicate/bold_dereplicated_sequences.tsv -tsv2 COInr_new/ncbi/dereplicate/ncbi_dereplicated_sequences.tsv -outdir COInr_new -out COInr.tsv

The output is the same format as the input tsv file.

See details in description section: :ref:`pool_and_dereplicate.pl ` script.

**Move the taxonomy file to the same directory**

.. code-block:: bash

    mv COInr_new/bold/add_taxids/taxonomy_updated.tsv COInr_new/taxonomy.tsv
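Since the sequence file and the taxonomy file must stay in sync, a quick consistency check after the final move can be useful. The sketch below is not part of mkCOInr. It assumes that the taxID is the second column of the sequence file and the first column of the taxonomy file, and that both files start with a header line. The ``missing_taxids`` function name and the demo files are hypothetical, so adjust the column numbers to your actual files.

```shell
#!/usr/bin/env bash
# Sanity-check sketch, not part of mkCOInr.
# Assumptions: sequence tsv = seqID, taxID, sequence; taxonomy tsv has the taxID
# in column 1; both files start with a header line.

missing_taxids() {
  # $1 = sequence tsv, $2 = taxonomy tsv
  # Prints each taxID used in the sequence file but absent from the taxonomy file.
  awk -F'\t' '
    FNR == 1  { next }                   # skip the header line of each file
    NR == FNR { known[$1] = 1; next }    # first file read: taxonomy
    !($2 in known) && !seen[$2]++ { print $2 }
  ' "$2" "$1"
}

# Toy demonstration on temporary files.
tmp=$(mktemp -d)
printf 'tax_id\tparent_tax_id\n1\t1\n-5\t1\n' > "$tmp/taxonomy.tsv"
printf 'seqID\ttaxID\tsequence\nS1\t1\tACGT\nS2\t-5\tACGA\nS3\t999\tTTTT\n' > "$tmp/seqs.tsv"

missing=$(missing_taxids "$tmp/seqs.tsv" "$tmp/taxonomy.tsv")
echo "taxIDs missing from the taxonomy file: $missing"
rm -rf "$tmp"
```

On the files produced above, the check would be called as ``missing_taxids COInr_new/COInr.tsv COInr_new/taxonomy.tsv``. An empty output means every taxID in the sequence file was found in the taxonomy file.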