Description

add_taxids.pl

Aim

For each lineage

  • Find the smallest taxon that matches an already existing taxID in taxonomy.tsv

  • Assign arbitrary, negative taxIDs to taxa under this taxon

Input files

Parameters/options

NONE

Algorithm

For each lineage finds the lowest taxon that matches an already existing taxID :

  • Finds the smallest taxon that matches an already existing taxID (taking into account all taxIDs in the input taxonomy file)

  • Accepts taxID if at least 0.6 of the taxa in the upward input lineages matches the lineage of the taxID (for species level do not count the genus, since it matches necessarily the species name) OR Both the input taxon and taxID have a phylum rank

  • If the match between the input lineage and the taxID lineage is <=0.25, go to the next taxlevel

  • If the match between the input lineage and the taxID lineage is between 0.25 and 0.6, print lineage to ambiguous file. It should be checked manually.

The search for an existing taxID, takes into account

  • Existing NCBI or previous arbitrary (negative) taxIDs

  • All synonym taxon names in the taxonomy file

  • If there are more than one taxIDs for the taxon name, take the one with the highest proportion of taxa matching the upwards lineage and then the one that has the same taxonomic rank

If there is no taxID for the lowest taxonomic levels in the lineage, assign arbitrary negative taxIDs to them. New taxIDs are linked as a child to an existing taxID, and the taxonomy file is updated with them.

Output

dereplicate.pl

Aim

Taxonomically aware dereplication.

Input files

Parameters/options

  • vsearch_path (path to vsearch executables if not in the PATH)

Algorithm

Compare sequences of the same taxID. Delete sequences that are substring of another sequence (100% identity on the overlapping region, and one sequence covers the other completely). If more than 10.000 sequences for the same taxID, first, cluster the sequences with 100% of identity using the cluster_fast algorithm of vsearch, then use the substring search for each cluster.

Output

download_bold.pl

Aim

Download BOLD data in tsv format for a list of taxa.

Input files

Parameters/options

  • try_download (integer; Try to download files try_download times if some of the downloaded files are incomplete; Default: 3)

  • max_record_n (integer; If more than max_record_n records for a taxon, cut up the taxon to subtaxa; Do not cut up input taxa if max_record_n is 0; Default: 0)

  • taxonomy.tsv Only necessary id max_record_n > 0

Algorithm

The script downloads all sequences and lineages for all taxa on the taxon_list from BOLD.

The taxon_list file was constructed manually from taxa on https://www.boldsystems.org/index.php/TaxBrowser_Home. Each taxon on the list has less than 750.000 specimen records in BOLD. The taxon_list constructed on 2022-02-24 is available on github (data/bold_taxon_list_2022-02-24.txt). It contains all taxa available in BOLD. This file might need to be updated later.

For constructing a taxon specific taxon_list (e.g. include only taxa of Arthropoda), the bold_taxon_list_2022-02-24_details.tsv file is available in the data directory, where the lineage of each taxon and the number of specimen records on are also included. You can use this file to easily select the taxa of your interest and make a custom taxon_list.

Alternatively, a list of taxa (including large taxa such as Arthropoda) can be given and the input taxa can be cut up automatically to subtaxa of less than max_record_n records each. This method has the advantage of avoiding the manual construction of the taxon list (as for bold_taxon_list_2022-02-24.txt). However, the subtaxa produced by the script are based on the NCBI taxonomy, and in case of divergent nomenclature between BOLD and NCBI, some of the subtaxa can be missed. For example, if the Chordata phylum is cut up to classes, it will contain the Lepidosauria class. Lepidosauria gives 0 results, since in BOLD the class field contains Reptilia instead of Lepidosauria, thus missing out BOLD orders like Crocodylia, Rhynchocephalia, Squamata, Testudines.

Download is done using BOLD’s API. First a small stat file is downloaded to access the number of records available for the taxa, then the tsv file is downloaded with sequences and metainfo. The script checks if the number of downloaded records corresponds to the expected one (based on the stat file). If there is an error, it removes the file and retries the download try_download times.

If the file exists already, the download is skipped. In this way, if the program stops (for example hitting wall time on a server), it can be simply restarted and the taxa with successful downloads will not be rerun.

NOTE: The download for a long list of taxa can take several days since it is not parallelized. You can cut up the input list and run each of them on a separate computer and move the output files to the same folder afterwards.

Output

  • json file for each taxon with the total number of records

  • tsv file for each taxon with the following columns:

processid sampleid recordID catalognum fieldnum institution_storing collection_code bin_uri phylum_taxID phylum_name class_taxID class_name order_taxID order_name family_taxID family_name subfamily_taxID subfamily_name genus_taxID genus_name species_taxID species_name subspecies_taxID subspecies_name identification_provided_by identification_method identification_reference tax_note voucher_status tissue_type collection_event_id collectors collectiondate_start collectiondate_end collectiontime collection_notesite_code sampling_protocol lifestage sex reproduction habitat associated_specimens associated_taxa extrainfo notes lat lon coord_source coord_accuracy elev depth elev_accuracy depth_accuracy country province_state region sector exactsite image_ids image_urls media_descriptors captions copyright_holders copyright_years copyright_licenses copyright_institutions photographers sequenceID markercode genbank_accession nucleotides trace_ids trace_names trace_links run_dates sequencing_centers directions seq_primers marker_codes

download_taxonomy.pl

Aim

Download the NCBI taxonomy dmp files (https://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/) and prepare taxonomy.tsv file.

Input

Parameters/options

  • skip_download (0/1; if 1, skips download, only prepares taxonomy file; Default: 0)

Algorithm

Downloads ncbi taxonomy dump files from https://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/ to the ncbi_tax subdirectory and decompress them. Prepares a taxonomy.tsv file with all taxIDs in it.

Output

format_bold.pl

Aim

Prepare a single sequence file and a lineage file from all tsv files downloaded from BOLD. Clean and orient sequences.

Input files

  • download_dir (name of the folder containing the downloaded BOLD tsv files)

  • outdir

Parameters/options

  • marker_list (List of markers to be selected; Default: ‘COI-5P COI-3P’)

  • check_name (0/1; If 1 keeps only taxa with valid Latin name format: Default: 1)

  • max_n (positive integer; eliminates sequences with max_n or more consecutive Ns; Default:5)

  • min_length (positive integer; minimum length of the cleaned sequence; Default:100)

  • max_length (positive integer; maximum length of the cleaned sequence; Default:2000)

  • check_orientation (0/1; if 1, checks the orientation of the sequences; Default: 0)

  • blast_path (Optional; Path to the BLAST executables if they is not in your PATH)

Algorithm

Clean downloaded files and pool information to lineage and sequence files

  • Eliminate partial lines (mostly errors in the database)

  • Select sequences for a given marker list

  • Clean sequences
    • Correct sequence IDs

    • Gaps deleted

    • Non-TCGA changed to N

    • External Ns deleted

    • Sequences with more than max_n consecutive Ns are deleted

    • Keep sequences with length in a min_length and max_length range

  • Clean lineages
    • If check_name, keep only names matching a correct Latin name format (only letters, spaces and -, correct capitalization)

    • Pool identical lineages into one line with the list of valid sequence IDs in the last field

    • Eliminate lines with environmental and metagenomic samples

  • Orient sequence (optional)
    • Count the TAA, TAG STOP codons in each reading frame

    • Choose the orientation where there is no STOP codon

    • If there is a STOP codon in all frames OR there are frames without STOP codon both in stand + and -, class it as ambiguous

    • Make a small “reference” db form randomly sampled oriented sequences

    • Blast ambiguous sequences to check orientation

    • Write sequences without hit to the bold_ambiguous_orientation.fas

Output

  • bold_sequences.tsv

  • bold_lineages.tsv

  • bold_partial_lines.tsv (lines in the input tsv files that did not have sequences)

  • bold_ambiguous_orientation.fas (sequences that could not be oriented in check_orientation option)

format_bold_package.pl

Aim

Prepare a single sequence file and a lineage file from a BOLD data package file downloaded from BOLD <https://www.boldsystems.org/index.php/datapackages>_ Clean and orient sequences.

Input files

  • bold_data (name of the data package downloaded from BOLD; tsv format)

  • outdir

Parameters/options

  • marker_list (List of markers to be selected; Default: ‘COI-5P COI-3P’)

  • delete_noBIN (0/1/2; Default 1;
    • If 0, do not delete sequneces without BIN_URI

    • If 1, delete sequences without BIN_URI, if there are other sequences with BIN_URI for the taxon

    • If 2, delete all sequences without BIN_URI, even if there are no other sequences with BIN_URI for the taxon)

  • check_name (0/1; If 1 keeps only taxa with valid Latin name format: Default: 1)

  • max_n (positive integer; eliminates sequences with max_n or more consecutive Ns; Default:5)

  • min_length (positive integer; minimum length of the cleaned sequence; Default:100)

  • max_length (positive integer; maximum length of the cleaned sequence; Default:2000)

  • check_orientation (0/1; if 1, checks the orientation of the sequences; Default: 0)

  • blast_path (Optional; Path to the BLAST executables if they is not in your PATH)

Algorithm

Select and clean sequences and pool information to lineage and sequence files

  • Eliminate partial lines (mostly errors in the database)

  • Select sequences for a given marker list

  • Delete sequences without BIN according to the delete_noBIN parameter

  • Clean sequences
    • Correct sequence IDs

    • Gaps deleted

    • Non-TCGA changed to N

    • External Ns deleted

    • Sequences with more than max_n consecutive Ns are deleted

    • Keep sequences with length in a min_length and max_length range

  • Clean lineages
    • If check_name, keep only names matching a correct Latin name format (only letters, spaces and -, correct capitalization)

    • Pool identical lineages into one line with the list of valid sequence IDs in the last field

    • Eliminate lines with environmental and metagenomic samples

  • Orient sequence (optional)
    • Count the TAA, TAG STOP codons in each reading frame

    • Choose the orientation where there is no STOP codon

    • If there is a STOP codon in all frames OR there are frames without STOP codon both in stand + and -, class it as ambiguous

    • Make a small “reference” db form randomly sampled oriented sequences

    • Blast ambiguous sequences to check orientation

    • Write sequences without hit to the bold_ambiguous_orientation.fas

Output

  • bold_sequences.tsv

  • bold_lineages.tsv

  • bold_partial_lines.tsv (lines in the input tsv files that did not have sequences)

  • bold_ambiguous_orientation.fas (sequences that could not be oriented in check_orientation option)

format_custom.pl

Aim

Make a lineage file for each input taxon name. Prepare input for add_taxid.pl

The output lineage file should be checked manually

  • To see if the suggested lineages are plausible

  • To select the correct lineage if there is more than one (1 in homonymy column) for the same taxon name

Before the next step (add_taxids.pl)

Input

Parameters/options

  • max_n (positive integer; eliminate sequences with max_n or more consecutive Ns; Default: 5)

  • min_length (positive integer; minimum length of the cleaned sequence; Default: 100)

  • max_length (positive integer; maximum length of the cleaned sequence; Default: 2000 )

  • check_seqid_format (0/1; if 1 check if seqID resemble to bold and ncbi formats, print out warnings, if yes; Default: 1)

Algorithm

Match names to all taxon names in taxonomy.tsv (including synonyms)

  • Write a lineage to all taxon names where name matches to an existing name in taxonomy.tsv

  • If homonymy, create a lineage for each homonym, and write 1 to the homonymy column

  • If taxon name corresponds to a Latin name format (Genus species) and species name is not known, get the lineage for the genus.

If the check_seqid_format option is activated

  • If some of the sequence IDs are not unique, list the duplicates IDs and exit

  • If sequence ID format is similar to accessions used in BOLD and NCBI/EMBL/DDBJ, list IDs but continue

  • A fairly safe format is xxx_xxx####, where x is a letter, # is a digit

Clean sequences

  • gaps deleted

  • non-TCGA changed to N

  • external Ns deleted

  • sequences with more than max_n consecutive Ns are deleted

Output

format_db.pl

Aim

Make a database in blast, rdp, qiime or full tsv format from the sequence tsv and taxonomy.tsv files

Input

  • sequence tsv

  • taxonomy.tsv

  • outfmt (rdp/blast/qiime/full/vtam/sintax; choose the format of the database)

  • outdir

  • out (string for naming the output files)

Parameters/options

  • blast_path (Optional; path to the blast executables if it is not in your PATH)

Algorithm

BLAST db, VTAM

  • Prepare a fasta file with sequences and the taxIDs (seqID, taxID)

  • Run the makeblastdb command of blast to make indexed files ready to be used as a blast database

  • for VTAM format, prepare taxonomy file as well as the BLAST database. They can be used directly in VTAM.

RDP, QIIME, sintax and FULL

  • Prepare a ranked lineage for each taxID

  • Taxon names are concatenated with taxID to avoid homonymy

  • Missing taxonomic levels are completed by using the name of a higher-level taxon concatenated with the taxonomic ranks

  • Prepare a trainseq fasta and a taxon file for rdp and qiime

  • Prepare a single tsv file for full

  • Prepare a single fasta file with lineages incuded in the definition lines for sintax

The trainseq fasta and the taxon files can be used by the train command of rdp_classifier or feature-classifier of QIIME to train the dataset before classification.

The sintax fasta file is ready to be used as a database for SINTAX.

The full tsv format is an easy to parse tsv file with ranked lineage and taxID for each sequence.

Output

blast option

vtam option

rdp option

quiime option

sintax option

full option

format_rdp.pl

Aim

Format the RDP training dataset to sequence tsv file with taxIDs and taxonomy.tsv

Input files

Parameters/options

NA

Algorithm

Output

format_ncbi.pl

Aim

Format the CDS and taxID files to a sequence tsv file with taxIDs. Select and clean sequences.

Input files

  • cds (CDS fasta file; output of nsdpy)

  • taxids (tsv file with the seqID and taxID columns; output of nsdpy)

  • taxonomy.tsv

  • outdir

Parameters/options

  • check_name (0/1; If one, keep only taxa with valid Latin name fomat: Default: 1)

  • max_n (positive integer; eliminate sequences with max_n or more consecutive Ns; Default: 5)

  • min_length (positive integer; minimum length of the cleaned sequence; Default: 100)

  • max_length (positive integer; maximum length of the cleaned sequence; Default: 2000)

Algorithm

  • Select sequences
    • Check if gene names and protein names correspond to COI

    • Eliminate genes if they have introns

    • Can have more than one COI gene in the same mitochondrion

    • Accept only if valid taxID, replace old non-valid taxIDs by up-to-date taxIDs

    • Eliminate sequences from environmental and metagenomic samples

    • If check_name is activated, take the taxID of the smallest taxon with a valid Latin name, otherwise keep the original taxID.

  • Clean sequences
    • Upper case

    • Replace non-ATCG by N

    • Delete gaps and external Ns

    • Delete sequence if more than max_n consecutive Ns

    • Keep sequences if length is between min_length and max_length

    • Sequences are already in a correct orientation in the input file, since that are coming from CDS files

Output

get_subtaxa.pl

Aim

List subtaxa of the input taxon at the next major taxonomic rank (e.g. list all orders of the input class)

Input files

Parameters/options

NONE

Algorithm

  • If taxon is a taxon name, get all taxIDs that correspond to this name (e.g. 1266065 and 50622 for Plecoptera)

  • Determine the next lowest major taxonomic rank (phylum, class, order, family, genus, species) for each taxID (e.g. if taxId is an order or suborder or superfamily, the next major tax rank is family)

  • List subtaxa of each taxID of this taxonomic rank.

Output

  • tsv with the following columns
    • taxon

    • taxID

    • subtaxon

    • taxID (of the subtaxa)

pool_and_dereplicate.pl

Aim

Pool 2 dereplicated sequence tsv files and do a taxonomically-aware dereplication for taxIDs present in both input files

Input files

  • tsv1

  • tsv2

  • outdir

  • out (name of the output dereplicated sequence tsv file)

Parameters/options

  • vsearch_path (path to vsearch executables if not in the PATH)

Algorithm

Pool sequences from two input tsv files. The dereplication is done only for taxID shared by the two input files, since they have been dereplicated individually. The algorithm of dereplication is identical to the one used in dereplicate.pl

Output

reduce_metadata.pl .. _reduce_metadata_reference:

reduce_metadata.pl

Aim

Select metadata of BOLD sequences that appear in the final file of COInr.tsv; The output is used to be able to trace back BOLD sequences to their autors and get their metadata.

Input files

Algorithm

Select lines from the BOLD data package, that reamin in the COInr database.

Output

select_region.pl

Aim

Select a target region from input sequences. As an input, either a primer pair should be given to identify the target region in some sequences by e-pcr, or a fasta file containing taxonomically diverse sequences limited to the target region. The sequences are then aligned to the target sequences and trimmed according to the alignment

Input files

  • sequence tsv with taxIDs

  • outdir

  • bait_fas (A small phylogenetically diverse fasta file with sequences already trimmed to the target region; Optional; Can be produced by e-pcr included in the script.)

Parameters/options

E-pcr related parameters

  • e_pcr (0/1; if 1, identify the target region of the sequences by e-pcr in the first step)

  • fw (optional; if e_pcr is done, the sequence of the forward primer that amplifies the target region)

  • rv (optional; if e_pcr is done, the sequence of the reverse primer that amplifies the target region)

  • trim_error (real [0-1], the proportion of mismatches allowed between the primer and the sequence during the e_pcr; Default: 0.3)

  • min_overlap (the minimum overlap between primer and the sequence during e-pcr; Default: 20)

  • min_amplicon_length (The minimum length of the amplicon after primer trimming; Default: 100)

  • max_amplicon_length (The minimum length of the amplicon after primer trimming; Default: 2000)

  • cutadapt_path (Optional; Path to cutadapt if it is not in your PATH)

usearch_global related parameters for trimming sequences according to the alignments

  • tcov (minimum coverage of the target sequence in usearch_global hits; Default: 0.5)

  • identity (minimum identity between the sequence and the target in usearch_global hits; Default: 0.7)

  • vsearch_path (Optional; path to the vsearch if it is not in your PATH)

Algorithm

A fasta file is prepared from the input tsv sequence file.

Sequences are aligned to a small pool of target sequences already limited to the target region (bait_fas). The alignment is done by the usearch_global command of vsearch which makes global alignments (unlike BLAST).

The best hit is used for each sequence to orient and trim them to the target region. Only hits with a minimum target coverage (tcov) and percentage of identity (identity) are used.

The bait_fas file can be either previously prepared by the users and given as an input, or be produced by e-pcr using the e-pcr related parameters.

To reduce runtime, sequences in the bait_fas are clustered by -cluster_fast algorithm of vsearch and the centroids are used as a target file for the usearch_global

Output

  • cutadapt_trimmed.fas (if e_pcr; fasta file with sequences recognized and trimmed by E-pcr; equivalent of bait_fas )

  • target_centroids.fas (fasta file of centroids of the clustering of bait_fas or cutadapt_trimmed.fas)

  • trimmed.tsv (sequence tsv files with taxIDs trimmed to the target region)

  • untrimmed.tsv (sequence tsv files with taxIDs where the target region is not found)

select_taxa.pl

Aim

Select or eliminate sequences that belong to taxa on a taxon list. Select sequences with a minimum taxonomic resolution (e.g. assigned at least to genus).

Input files

Parameters/options

  • negative_list (1/0; if 1, keeps all taxa except the ones on the taxon list; Default: 0)

  • min_taxlevel (species/genus/family/order/class/phylum/kingdom/root; Default: root)

Algorithm

Select sequences that belong to the taxa on a taxon list if negative_list==0 (default). Select sequences that DO NOT belong to the taxa on in a taxon list, if negative_list==1.

Keep only sequences that are assigned to at least to min_taxlevel rank.

If taxIDs are not given in the taxon_list file the script uses all taxIDs that matches the taxon name.

A lineage file is written for each taxon in taxon_list and the corresponding taxIDs.

  • It should be checked manually if lineages are coherent with the target taxa

  • Homonymy column indicates if there are more than one taxID for a taxon

  • If there are incoherent lineages, make a new taxon_list file based on the lineage file including taxon names and taxIDs and rerun the script with the new taxon_list.

Output