Description
add_taxids.pl
Aim
For each lineage
Find the smallest taxon that matches an already existing taxID in taxonomy.tsv
Assign arbitrary, negative taxIDs to taxa under this taxon
Input files
lineages (lineage tsv without taxID; output of format_custom.pl or format_bold_package.pl)
sequences (sequence tsv without taxID)
Parameters/options
NONE
Algorithm
For each lineage finds the lowest taxon that matches an already existing taxID :
Finds the smallest taxon that matches an already existing taxID (taking into account all taxIDs in the input taxonomy file)
Accepts taxID if at least 0.6 of the taxa in the upward input lineages matches the lineage of the taxID (for species level do not count the genus, since it matches necessarily the species name) OR Both the input taxon and taxID have a phylum rank
If the match between the input lineage and the taxID lineage is <=0.25, go to the next taxlevel
If the match between the input lineage and the taxID lineage is between 0.25 and 0.6, print lineage to ambiguous file. It should be checked manually.
The search for an existing taxID, takes into account
Existing NCBI or previous arbitrary (negative) taxIDs
All synonym taxon names in the taxonomy file
If there are more than one taxIDs for the taxon name, take the one with the highest proportion of taxa matching the upwards lineage and then the one that has the same taxonomic rank
If there is no taxID for the lowest taxonomic levels in the lineage, assign arbitrary negative taxIDs to them. New taxIDs are linked as a child to an existing taxID, and the taxonomy file is updated with them.
Output
dereplicate.pl
Aim
Taxonomically aware dereplication.
Input files
out (name of the output dereplicated sequence tsv file)
Parameters/options
vsearch_path (path to vsearch executables if not in the PATH)
Algorithm
Compare sequences of the same taxID. Delete sequences that are substring of another sequence (100% identity on the overlapping region, and one sequence covers the other completely). If more than 10.000 sequences for the same taxID, first, cluster the sequences with 100% of identity using the cluster_fast algorithm of vsearch, then use the substring search for each cluster.
Output
download_bold.pl
Aim
Download BOLD data in tsv format for a list of taxa.
Input files
Parameters/options
try_download (integer; Try to download files try_download times if some of the downloaded files are incomplete; Default: 3)
max_record_n (integer; If more than max_record_n records for a taxon, cut up the taxon to subtaxa; Do not cut up input taxa if max_record_n is 0; Default: 0)
taxonomy.tsv Only necessary id max_record_n > 0
Algorithm
The script downloads all sequences and lineages for all taxa on the taxon_list from BOLD.
The taxon_list file was constructed manually from taxa on https://www.boldsystems.org/index.php/TaxBrowser_Home. Each taxon on the list has less than 750.000 specimen records in BOLD. The taxon_list constructed on 2022-02-24 is available on github (data/bold_taxon_list_2022-02-24.txt). It contains all taxa available in BOLD. This file might need to be updated later.
For constructing a taxon specific taxon_list (e.g. include only taxa of Arthropoda), the bold_taxon_list_2022-02-24_details.tsv file is available in the data directory, where the lineage of each taxon and the number of specimen records on are also included. You can use this file to easily select the taxa of your interest and make a custom taxon_list.
Alternatively, a list of taxa (including large taxa such as Arthropoda) can be given and the input taxa can be cut up automatically to subtaxa of less than max_record_n records each. This method has the advantage of avoiding the manual construction of the taxon list (as for bold_taxon_list_2022-02-24.txt). However, the subtaxa produced by the script are based on the NCBI taxonomy, and in case of divergent nomenclature between BOLD and NCBI, some of the subtaxa can be missed. For example, if the Chordata phylum is cut up to classes, it will contain the Lepidosauria class. Lepidosauria gives 0 results, since in BOLD the class field contains Reptilia instead of Lepidosauria, thus missing out BOLD orders like Crocodylia, Rhynchocephalia, Squamata, Testudines.
Download is done using BOLD’s API. First a small stat file is downloaded to access the number of records available for the taxa, then the tsv file is downloaded with sequences and metainfo. The script checks if the number of downloaded records corresponds to the expected one (based on the stat file). If there is an error, it removes the file and retries the download try_download times.
If the file exists already, the download is skipped. In this way, if the program stops (for example hitting wall time on a server), it can be simply restarted and the taxa with successful downloads will not be rerun.
NOTE: The download for a long list of taxa can take several days since it is not parallelized. You can cut up the input list and run each of them on a separate computer and move the output files to the same folder afterwards.
Output
json file for each taxon with the total number of records
tsv file for each taxon with the following columns:
processid sampleid recordID catalognum fieldnum institution_storing collection_code bin_uri phylum_taxID phylum_name class_taxID class_name order_taxID order_name family_taxID family_name subfamily_taxID subfamily_name genus_taxID genus_name species_taxID species_name subspecies_taxID subspecies_name identification_provided_by identification_method identification_reference tax_note voucher_status tissue_type collection_event_id collectors collectiondate_start collectiondate_end collectiontime collection_notesite_code sampling_protocol lifestage sex reproduction habitat associated_specimens associated_taxa extrainfo notes lat lon coord_source coord_accuracy elev depth elev_accuracy depth_accuracy country province_state region sector exactsite image_ids image_urls media_descriptors captions copyright_holders copyright_years copyright_licenses copyright_institutions photographers sequenceID markercode genbank_accession nucleotides trace_ids trace_names trace_links run_dates sequencing_centers directions seq_primers marker_codes
download_taxonomy.pl
Aim
Download the NCBI taxonomy dmp files (https://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/) and prepare taxonomy.tsv file.
Input
Parameters/options
skip_download (0/1; if 1, skips download, only prepares taxonomy file; Default: 0)
Algorithm
Downloads ncbi taxonomy dump files from https://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/ to the ncbi_tax subdirectory and decompress them. Prepares a taxonomy.tsv file with all taxIDs in it.
Output
ncbi_tax subdirectory with dmp files
format_bold.pl
Aim
Prepare a single sequence file and a lineage file from all tsv files downloaded from BOLD. Clean and orient sequences.
Input files
download_dir (name of the folder containing the downloaded BOLD tsv files)
Parameters/options
marker_list (List of markers to be selected; Default: ‘COI-5P COI-3P’)
check_name (0/1; If 1 keeps only taxa with valid Latin name format: Default: 1)
max_n (positive integer; eliminates sequences with max_n or more consecutive Ns; Default:5)
min_length (positive integer; minimum length of the cleaned sequence; Default:100)
max_length (positive integer; maximum length of the cleaned sequence; Default:2000)
check_orientation (0/1; if 1, checks the orientation of the sequences; Default: 0)
blast_path (Optional; Path to the BLAST executables if they is not in your PATH)
Algorithm
Clean downloaded files and pool information to lineage and sequence files
Eliminate partial lines (mostly errors in the database)
Select sequences for a given marker list
- Clean sequences
Correct sequence IDs
Gaps deleted
Non-TCGA changed to N
External Ns deleted
Sequences with more than max_n consecutive Ns are deleted
Keep sequences with length in a min_length and max_length range
- Clean lineages
If check_name, keep only names matching a correct Latin name format (only letters, spaces and -, correct capitalization)
Pool identical lineages into one line with the list of valid sequence IDs in the last field
Eliminate lines with environmental and metagenomic samples
- Orient sequence (optional)
Count the TAA, TAG STOP codons in each reading frame
Choose the orientation where there is no STOP codon
If there is a STOP codon in all frames OR there are frames without STOP codon both in stand + and -, class it as ambiguous
Make a small “reference” db form randomly sampled oriented sequences
Blast ambiguous sequences to check orientation
Write sequences without hit to the bold_ambiguous_orientation.fas
Output
bold_partial_lines.tsv (lines in the input tsv files that did not have sequences)
bold_ambiguous_orientation.fas (sequences that could not be oriented in check_orientation option)
format_bold_package.pl
Aim
Prepare a single sequence file and a lineage file from a BOLD data package file downloaded from BOLD <https://www.boldsystems.org/index.php/datapackages>_ Clean and orient sequences.
Input files
bold_data (name of the data package downloaded from BOLD; tsv format)
Parameters/options
marker_list (List of markers to be selected; Default: ‘COI-5P COI-3P’)
- delete_noBIN (0/1/2; Default 1;
If 0, do not delete sequneces without BIN_URI
If 1, delete sequences without BIN_URI, if there are other sequences with BIN_URI for the taxon
If 2, delete all sequences without BIN_URI, even if there are no other sequences with BIN_URI for the taxon)
check_name (0/1; If 1 keeps only taxa with valid Latin name format: Default: 1)
max_n (positive integer; eliminates sequences with max_n or more consecutive Ns; Default:5)
min_length (positive integer; minimum length of the cleaned sequence; Default:100)
max_length (positive integer; maximum length of the cleaned sequence; Default:2000)
check_orientation (0/1; if 1, checks the orientation of the sequences; Default: 0)
blast_path (Optional; Path to the BLAST executables if they is not in your PATH)
Algorithm
Select and clean sequences and pool information to lineage and sequence files
Eliminate partial lines (mostly errors in the database)
Select sequences for a given marker list
Delete sequences without BIN according to the delete_noBIN parameter
- Clean sequences
Correct sequence IDs
Gaps deleted
Non-TCGA changed to N
External Ns deleted
Sequences with more than max_n consecutive Ns are deleted
Keep sequences with length in a min_length and max_length range
- Clean lineages
If check_name, keep only names matching a correct Latin name format (only letters, spaces and -, correct capitalization)
Pool identical lineages into one line with the list of valid sequence IDs in the last field
Eliminate lines with environmental and metagenomic samples
- Orient sequence (optional)
Count the TAA, TAG STOP codons in each reading frame
Choose the orientation where there is no STOP codon
If there is a STOP codon in all frames OR there are frames without STOP codon both in stand + and -, class it as ambiguous
Make a small “reference” db form randomly sampled oriented sequences
Blast ambiguous sequences to check orientation
Write sequences without hit to the bold_ambiguous_orientation.fas
Output
bold_partial_lines.tsv (lines in the input tsv files that did not have sequences)
bold_ambiguous_orientation.fas (sequences that could not be oriented in check_orientation option)
format_custom.pl
Aim
Make a lineage file for each input taxon name. Prepare input for add_taxid.pl
The output lineage file should be checked manually
To see if the suggested lineages are plausible
To select the correct lineage if there is more than one (1 in homonymy column) for the same taxon name
Before the next step (add_taxids.pl)
Correct/delete/complete lines if lineage in not correct. Try to use taxon names compatible with NCBI taxonomy (https://www.ncbi.nlm.nih.gov/taxonomy/ )
Delete the homonymy column
Input
Parameters/options
max_n (positive integer; eliminate sequences with max_n or more consecutive Ns; Default: 5)
min_length (positive integer; minimum length of the cleaned sequence; Default: 100)
max_length (positive integer; maximum length of the cleaned sequence; Default: 2000 )
check_seqid_format (0/1; if 1 check if seqID resemble to bold and ncbi formats, print out warnings, if yes; Default: 1)
Algorithm
Match names to all taxon names in taxonomy.tsv (including synonyms)
Write a lineage to all taxon names where name matches to an existing name in taxonomy.tsv
If homonymy, create a lineage for each homonym, and write 1 to the homonymy column
If taxon name corresponds to a Latin name format (Genus species) and species name is not known, get the lineage for the genus.
If the check_seqid_format option is activated
If some of the sequence IDs are not unique, list the duplicates IDs and exit
If sequence ID format is similar to accessions used in BOLD and NCBI/EMBL/DDBJ, list IDs but continue
A fairly safe format is xxx_xxx####, where x is a letter, # is a digit
Clean sequences
gaps deleted
non-TCGA changed to N
external Ns deleted
sequences with more than max_n consecutive Ns are deleted
Output
format_db.pl
Aim
Make a database in blast, rdp, qiime or full tsv format from the sequence tsv and taxonomy.tsv files
Input
outfmt (rdp/blast/qiime/full/vtam/sintax; choose the format of the database)
out (string for naming the output files)
Parameters/options
blast_path (Optional; path to the blast executables if it is not in your PATH)
Algorithm
BLAST db, VTAM
Prepare a fasta file with sequences and the taxIDs (seqID, taxID)
Run the makeblastdb command of blast to make indexed files ready to be used as a blast database
for VTAM format, prepare taxonomy file as well as the BLAST database. They can be used directly in VTAM.
RDP, QIIME, sintax and FULL
Prepare a ranked lineage for each taxID
Taxon names are concatenated with taxID to avoid homonymy
Missing taxonomic levels are completed by using the name of a higher-level taxon concatenated with the taxonomic ranks
Prepare a single tsv file for full
Prepare a single fasta file with lineages incuded in the definition lines for sintax
The trainseq fasta and the taxon files can be used by the train command of rdp_classifier or feature-classifier of QIIME to train the dataset before classification.
The sintax fasta file is ready to be used as a database for SINTAX.
The full tsv format is an easy to parse tsv file with ranked lineage and taxID for each sequence.
Output
blast option
Indexed files ready to be used as a BLAST database
vtam option
Indexed files ready to be used as a BLAST database
taxonomy.tsv adapted to VTAM
rdp option
quiime option
sintax option
full option
format_rdp.pl
Aim
Format the RDP training dataset to sequence tsv file with taxIDs and taxonomy.tsv
Input files
Parameters/options
NA
Algorithm
Reformat the rdp taxonomy file to taxonomy.tsv format
Make sequence tsv file with taxIDs from the fasta file
Output
format_ncbi.pl
Aim
Format the CDS and taxID files to a sequence tsv file with taxIDs. Select and clean sequences.
Input files
cds (CDS fasta file; output of nsdpy)
taxids (tsv file with the seqID and taxID columns; output of nsdpy)
Parameters/options
check_name (0/1; If one, keep only taxa with valid Latin name fomat: Default: 1)
max_n (positive integer; eliminate sequences with max_n or more consecutive Ns; Default: 5)
min_length (positive integer; minimum length of the cleaned sequence; Default: 100)
max_length (positive integer; maximum length of the cleaned sequence; Default: 2000)
Algorithm
- Select sequences
Check if gene names and protein names correspond to COI
Eliminate genes if they have introns
Can have more than one COI gene in the same mitochondrion
Accept only if valid taxID, replace old non-valid taxIDs by up-to-date taxIDs
Eliminate sequences from environmental and metagenomic samples
If check_name is activated, take the taxID of the smallest taxon with a valid Latin name, otherwise keep the original taxID.
- Clean sequences
Upper case
Replace non-ATCG by N
Delete gaps and external Ns
Delete sequence if more than max_n consecutive Ns
Keep sequences if length is between min_length and max_length
Sequences are already in a correct orientation in the input file, since that are coming from CDS files
Output
get_subtaxa.pl
Aim
List subtaxa of the input taxon at the next major taxonomic rank (e.g. list all orders of the input class)
Input files
Parameters/options
NONE
Algorithm
If taxon is a taxon name, get all taxIDs that correspond to this name (e.g. 1266065 and 50622 for Plecoptera)
Determine the next lowest major taxonomic rank (phylum, class, order, family, genus, species) for each taxID (e.g. if taxId is an order or suborder or superfamily, the next major tax rank is family)
List subtaxa of each taxID of this taxonomic rank.
Output
- tsv with the following columns
taxon
taxID
subtaxon
taxID (of the subtaxa)
pool_and_dereplicate.pl
Aim
Pool 2 dereplicated sequence tsv files and do a taxonomically-aware dereplication for taxIDs present in both input files
Input files
Parameters/options
vsearch_path (path to vsearch executables if not in the PATH)
Algorithm
Pool sequences from two input tsv files. The dereplication is done only for taxID shared by the two input files, since they have been dereplicated individually. The algorithm of dereplication is identical to the one used in dereplicate.pl
Output
reduce_metadata.pl .. _reduce_metadata_reference:
reduce_metadata.pl
Aim
Select metadata of BOLD sequences that appear in the final file of COInr.tsv; The output is used to be able to trace back BOLD sequences to their autors and get their metadata.
Input files
out (name of the output tsv file)
Algorithm
Select lines from the BOLD data package, that reamin in the COInr database.
Output
select_region.pl
Aim
Select a target region from input sequences. As an input, either a primer pair should be given to identify the target region in some sequences by e-pcr, or a fasta file containing taxonomically diverse sequences limited to the target region. The sequences are then aligned to the target sequences and trimmed according to the alignment
Input files
bait_fas (A small phylogenetically diverse fasta file with sequences already trimmed to the target region; Optional; Can be produced by e-pcr included in the script.)
Parameters/options
E-pcr related parameters
e_pcr (0/1; if 1, identify the target region of the sequences by e-pcr in the first step)
fw (optional; if e_pcr is done, the sequence of the forward primer that amplifies the target region)
rv (optional; if e_pcr is done, the sequence of the reverse primer that amplifies the target region)
trim_error (real [0-1], the proportion of mismatches allowed between the primer and the sequence during the e_pcr; Default: 0.3)
min_overlap (the minimum overlap between primer and the sequence during e-pcr; Default: 20)
min_amplicon_length (The minimum length of the amplicon after primer trimming; Default: 100)
max_amplicon_length (The minimum length of the amplicon after primer trimming; Default: 2000)
cutadapt_path (Optional; Path to cutadapt if it is not in your PATH)
usearch_global related parameters for trimming sequences according to the alignments
tcov (minimum coverage of the target sequence in usearch_global hits; Default: 0.5)
identity (minimum identity between the sequence and the target in usearch_global hits; Default: 0.7)
vsearch_path (Optional; path to the vsearch if it is not in your PATH)
Algorithm
A fasta file is prepared from the input tsv sequence file.
Sequences are aligned to a small pool of target sequences already limited to the target region (bait_fas). The alignment is done by the usearch_global command of vsearch which makes global alignments (unlike BLAST).
The best hit is used for each sequence to orient and trim them to the target region. Only hits with a minimum target coverage (tcov) and percentage of identity (identity) are used.
The bait_fas file can be either previously prepared by the users and given as an input, or be produced by e-pcr using the e-pcr related parameters.
To reduce runtime, sequences in the bait_fas are clustered by -cluster_fast algorithm of vsearch and the centroids are used as a target file for the usearch_global
Output
cutadapt_trimmed.fas (if e_pcr; fasta file with sequences recognized and trimmed by E-pcr; equivalent of bait_fas )
target_centroids.fas (fasta file of centroids of the clustering of bait_fas or cutadapt_trimmed.fas)
trimmed.tsv (sequence tsv files with taxIDs trimmed to the target region)
untrimmed.tsv (sequence tsv files with taxIDs where the target region is not found)
select_taxa.pl
Aim
Select or eliminate sequences that belong to taxa on a taxon list. Select sequences with a minimum taxonomic resolution (e.g. assigned at least to genus).
Input files
out (name of the output dereplicated sequence tsv file)
Parameters/options
negative_list (1/0; if 1, keeps all taxa except the ones on the taxon list; Default: 0)
min_taxlevel (species/genus/family/order/class/phylum/kingdom/root; Default: root)
Algorithm
Select sequences that belong to the taxa on a taxon list if negative_list==0 (default). Select sequences that DO NOT belong to the taxa on in a taxon list, if negative_list==1.
Keep only sequences that are assigned to at least to min_taxlevel rank.
If taxIDs are not given in the taxon_list file the script uses all taxIDs that matches the taxon name.
A lineage file is written for each taxon in taxon_list and the corresponding taxIDs.
It should be checked manually if lineages are coherent with the target taxa
Homonymy column indicates if there are more than one taxID for a taxon
If there are incoherent lineages, make a new taxon_list file based on the lineage file including taxon names and taxIDs and rerun the script with the new taxon_list.