Input/Output
The list of arguments and options of a script can be obtained by typing:
perl name_of_the_script.pl -help
Lineage files
BOLD data package TSV
tsv file with the following columns:
processid
sampleid
specimenid
museumid
fieldid
inst
bin_uri
identification
funding_src
kingdom
phylum
class
order
family
subfamily
genus
species
subspecies
identified_by
voucher_type
collectors
collection_date
collection_date_accuracy
life_stage
sex
reproduction
extrainfo
notes
coord
coord_source
coord_accuracy
elev
depth
elev_accuracy
depth_accuracy
country
province
country_iso
region
sector
site
collection_time
habitat
collection_note
associated_taxa
associated_specimen
species_reference
identification_method
recordset_code_arr
gb_acs
marker_code
nucraw
sequence_run_site
processid_minted_date
sequence_upload_date
identification_rank
The reduced metadata file does not contain the sequence and starts with the sequence ID used in COInd (BOLD_marker_processid format).
processid sampleid specimenid museumid fieldid inst bin_uri identification funding_src kingdom phylum class order family subfamily genus species subspecies identified_by voucher_type collectors collection_date collection_date_accuracy life_stage sex reproduction extrainfo notes coord coord_source coord_accuracy elev depth elev_accuracy depth_accuracy country province country_iso region sector site collection_time habitat collection_note associated_taxa associated_specimen species_reference identification_method recordset_code_arr gb_acs marker_code nucraw sequence_run_site processid_minted_date sequence_upload_date identification_rank
AAASF001-17 CBGSFMX-0101 7804897 None CBGSFMX-0101 Universidad Autonoma de Nuevo Leon BOLD:ADP3520 Lutzomyia cruciata None Animalia Arthropoda Insecta Diptera Psychodidae Phlebotominae Lutzomyia Lutzomyia cruciata None Jorge J. Rodriguez Rojas None Wilbert P 2016-10-28 None Adult M S None Slide mounted with Euparal (19.3786,-88.1892) None None None None None None Mexico Quintana Roo None Candelaria None None None None None None None None Morphological ['AAASF', 'DS-17IBMWP', 'DS-UNIQUE17'] MK851247 COI-5P AACATTATATTTTATTTTTGGAGCCTGAGCAGGAATAGTGGGAACATCTTTAAGAATTTTAATTCGAGCAGAATTAGGTCACCCCGGTGCTTTAATTGGTGATGATCAAATTTATAATGTTATTGTTACAGCTCATGCATTTGTAATAATTTTTTTTATAGTTATACCTATTATAATTGGAGGATTTGGTAACTGATTAGTTCCTTTAATATTAGGAGCCCCTGATATAGCATTCCCTCGAATAAATAATATAAGATTTTGACTTTTACCCCCCTCTCTTACTCTCCTTCTTACAAGAAGTATAGTTGAAACTGGGGCAGGAACAGGATGAACTGTTTATCCACCTCTTTCAAGAAATATTGCCCATAGAGGAGCTTCTGTTGATTTAGCAATTTTTTCCCTACATTTAGCCGGGATTTCATCTATTCTTGGAGCAGTAAATTTTATTACTACAGTTATTAATATACGATCTGCTGGAATTACATTAGATCGAATACCTTTATTTGTTTGATCTGTAATAATTACTGCGGTACTTCTATTATTATCATTACCTGTTTTAGCAGGTGCAATTACAATACTTCTAACTGATCGTAATCTAAATACTTCTTTTTTTGACCCTGCGGGAGGTGGGGATCCAATTTTATATCAACATTTATTT Instituto Politecnico Nacional, Centro de Biotecnologia Genomica 30-May-2017 12-Jun-2017 species
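A sequence ID in the BOLD_marker_processid format can be assembled from the marker_code and processid columns of a data-package row. The following is a minimal sketch, assuming the file is tab-separated with a header line. The function name bold_seq_id and the three-column demo input are hypothetical and are not part of the scripts.

```python
import csv
import io

def bold_seq_id(row):
    """Build a BOLD_marker_processid style ID from one data-package row (a dict)."""
    return f"BOLD_{row['marker_code']}_{row['processid']}"

# In-memory stand-in for a data package, reduced to three columns.
# The sequence is shortened for readability.
demo = "processid\tmarker_code\tnucraw\nAAASF001-17\tCOI-5P\tAACATTATAT\n"

rows = list(csv.DictReader(io.StringIO(demo), delimiter="\t"))
seq_ids = [bold_seq_id(row) for row in rows]
print(seq_ids)
```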
lineage tsv without taxID
tsv file with the following columns:
phylum
class
order
family
subfamily
genus
species
seqIDs
All identical lineages are pooled into a single line. The seqIDs are listed in the last column, separated by semicolons.
phylum class order family subfamily genus species seqIDs
Acanthocephala 12418139
Acanthocephala Archiacanthocephala Gigantorhynchida Gigantorhynchidae Mediorhynchus 5445424;3143887
Acanthocephala Archiacanthocephala Gigantorhynchida Gigantorhynchidae Mediorhynchus Mediorhynchus gallinarum 15188348;15188349;5445423
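Reading one line of this file could look like the sketch below. It assumes that the columns are tab-separated and that a missing taxonomic level is an empty field, which is not visible in the flattened example above. The function name parse_lineage_line is hypothetical.

```python
RANKS = ["phylum", "class", "order", "family", "subfamily", "genus", "species"]

def parse_lineage_line(line):
    """Split one lineage line into a rank -> name dict and a list of seqIDs."""
    fields = line.rstrip("\n").split("\t")
    lineage = dict(zip(RANKS, fields[:len(RANKS)]))
    seq_ids = fields[len(RANKS)].split(";")
    return lineage, seq_ids

# Demo line built from the example above; the empty subfamily and species
# fields are an assumption about the tab layout.
demo_line = (
    "Acanthocephala\tArchiacanthocephala\tGigantorhynchida\t"
    "Gigantorhynchidae\t\tMediorhynchus\t\t5445424;3143887\n"
)
lineage, seq_ids = parse_lineage_line(demo_line)
print(lineage["genus"], seq_ids)
```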
lineage tsv with taxID (output of select_taxa.pl)
tsv file with the following columns:
taxon
taxID
homonymy
number of sequences
domain
kingdom
phylum
class
order
family
subfamily
genus
species
taxon taxID homonymy number of sequences domain kingdom phylum class order family genus species
Abylidae 316207 0 33 Eukaryota Metazoa Cnidaria Hydrozoa Siphonophorae Abylidae
lineage tsv with taxID (output of add_taxids.pl)
tsv file with the following columns:
lowest_taxname
lowest_rank
lowest_TaxID
phylum
class
order
family
subfamily
genus
species
seqIDs
lowest_taxname lowest_rank lowest_TaxID phylum class order family subfamily genus species seqIDs
Acanthocephala phylum 10232 Acanthocephala 12418139
Mediorhynchus genus 60535 Acanthocephala Archiacanthocephala Gigantorhynchida Gigantorhynchidae Mediorhynchus 3143887;5445424
custom lineages tsv
tsv file with the following columns:
phylum
class
order
family
subfamily
genus
species
homonymy
seqIDs
phylum class order family subfamily genus species homonymy seqIDs
Cnidaria Hydrozoa Leptothecata Aglaopheniidae Aglaophenia 0 OEB_MLR10
Bryozoa Gymnolaemata Cheilostomatida Margarettidae Margaretta Margaretta cereoides 1 OEB_EH13;OEB_EH17;OEB_EH19
Streptophyta Magnoliopsida Gentianales Apocynaceae Margaretta Margaretta cereoides 1 OEB_EH13;OEB_EH17;OEB_EH19
ambiguous lineages
tsv file with the following columns:
match_lineage_proportion
ncbi_taxname
ncbi_taxlevel
ncbi_TaxID
phylum
class
order
family
subfamily
genus
species
ncbi_domain
ncbi_kingdom
ncbi_phylum
ncbi_class
ncbi_order
ncbi_family
ncbi_genus
ncbi_species
ncbi_taxname
seqIDs
match_lineage_proportion ncbi_taxname ncbi_taxlevel ncbi_TaxID phylum class order family subfamily genus species ncbi_domain ncbi_kingdom ncbi_phylum ncbi_class ncbi_order ncbi_family ncbi_genus ncbi_species ncbi_taxname seqIDs
0.4 Bolbophorus genus 186184 Platyhelminthes Trematoda Diplostomida Diplostomidae Bolbophorinae Bolbophorus Eukaryota Metazoa Platyhelminthes Trematoda Strigeidida Bolbophoridae Bolbophorus Bolbophorus 12416284;9942141;15268484;12416286;12416287;12416283;12417832;3490428;12417833;5993483;12416282;12416285;12416280;12416281
0.33 Sylon hippolytes species 399056 Arthropoda Hexanauplia Clistosaccidae Sylon Sylon hippolytes Eukaryota Metazoa Arthropoda Thecostraca Sylonidae Sylon Sylon hippolytes Sylon hippolytes 2631808;2631807;2631809;2631789;2631806;2631805
Sequence files
sequence tsv without taxID
tsv file with the following columns:
seqID
sequence
seqID sequence
12418139 AGATATTGGTATATTATATATTTTGTTTGCGTTATGAAGAGGC...
3143887 GTGATATATATAATGTCATCGGTATGAAGTGGTATTATAGGGGTGAT...
sequence tsv with taxID
tsv file with the following columns:
seqID
taxID
sequence
seqID taxID sequence
11611742 10236 GGGATAATATATATTTTGCTTGCATTGTGGAGGG...
10907577 -9466 TAAGATTTTGAATATTACCTCCATCAATTACATT...
GU179406_1 2921812 GGACTCCTTGGTACTTCTATAAGATTGCTTCTGT...
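A minimal reading sketch for this format, assuming tab-separated columns and a single header line. The taxIDs are converted to integers so that negative values such as -9466 in the example above are preserved. The function name read_sequence_tsv is hypothetical.

```python
def read_sequence_tsv(lines):
    """Yield (seqID, taxID, sequence) tuples, skipping the header line."""
    for number, line in enumerate(lines):
        if number == 0:
            continue  # header: seqID, taxID, sequence
        seq_id, tax_id, sequence = line.rstrip("\n").split("\t")
        yield seq_id, int(tax_id), sequence

# Demo rows reuse the IDs of the example above; sequences are shortened.
demo = [
    "seqID\ttaxID\tsequence\n",
    "11611742\t10236\tGGGATAATAT\n",
    "10907577\t-9466\tTAAGATTTTG\n",
    "GU179406_1\t2921812\tGGACTCCTTG\n",
]
records = list(read_sequence_tsv(demo))
negative = [seq_id for seq_id, tax_id, _ in records if tax_id < 0]
print(negative)
```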
custom sequences tsv
tsv file with the following columns:
seqID
taxon name (any taxonomic level)
sequence
seqID taxon_name sequence
xxx_10236 Porifera GGGATAATATATATTTTGCTTGCATTGTGGAGGG...
xxx_10907577 Margaretta TAAGATTTTGAATATTACCTCCATCAATTACATT...
RDP classifier trainset fasta
Fasta file in RDP Classifier trainseq format. It can be downloaded from https://sourceforge.net/projects/rdp-classifier/files/RDP_Classifier_TrainingData/
>AJ000684 Mycobacterium heidelbergense str. 2554/91 Type domain__Bacteria; phylum__Actinobacteria; class__Actinobacteria; order__Mycobacteriales; family__Mycobacteriaceae; genus__Mycobacterium
gaacgctggcggcgtgcttaacacatgcaagtcgaacggaaaggtctctt
>EF599163 Vibrio atlanticus str. LMG 24300 Type domain__Bacteria; phylum__Proteobacteria; class__Gammaproteobacteria; order__Vibrionales; family__Vibrionaceae; genus__Vibrio
gtttgatcctggctcagattgaacgctggcggcaggcctaacacatgcaa
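The lineage can be extracted from such a definition line by looking for the rank__name pairs. This is a sketch only, assuming that rank names are lowercase letters and that pairs are separated by semicolons, as in the two records above. The function name parse_trainset_header is hypothetical.

```python
import re

def parse_trainset_header(header):
    """Return (seqID, {rank: name}) from an RDP trainset definition line."""
    header = header.strip()
    seq_id = header[1:].split()[0]  # first token after '>'
    pairs = re.findall(r"([a-z]+)__([^;]+)", header)
    lineage = {rank: name.strip() for rank, name in pairs}
    return seq_id, lineage

header = (
    ">AJ000684 Mycobacterium heidelbergense str. 2554/91 Type "
    "domain__Bacteria; phylum__Actinobacteria; class__Actinobacteria; "
    "order__Mycobacteriales; family__Mycobacteriaceae; genus__Mycobacterium"
)
seq_id, lineage = parse_trainset_header(header)
print(seq_id, lineage["genus"])
```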
Database formats
BLAST database files
Binary files ready to be used by BLAST.
blastdb_name.nhr
blastdb_name.nin
blastdb_name.nog
blastdb_name.nsd
blastdb_name.nsi
blastdb_name.nsq
full tsv
The sequence tsv and the taxonomy files can be converted by format_db.pl into a full tsv file containing the following columns:
seqID
taxon
taxID
taxlevel
domain
domain_taxID
kingdom
kingdom_taxID
phylum
phylum_taxID
class
class_taxID
order
order_taxID
family
family_taxID
genus
genus_taxID
species
species_taxID
sequence
seqID taxon taxID taxlevel domain domain_taxID kingdom kingdom_taxID phylum phylum_taxID class class_taxID order order_taxID family family_taxID genus genus_taxID species species_taxID sequence
5423724 Aspidoscopulia australia 1001026 8 Eukaryota 2759 Metazoa 33208 Porifera 6040 Hexactinellida 60882 Hexactinosida 98040 Farreidae 98041 Aspidoscopulia 999811 Aspidoscopulia_australia 1001026 GGATCTCTATTAGAAGACGACCACACCTATAACGTTGTAGTTACAGCTCACGC...
QIIME
QIIME trainseq fasta
Fasta file with only seqIDs in the definition line
>OEB_CA11
AGTGGTCTCAGTGCTTTAATTCGCATTGAGTTAAGTCAGCCAGGTGGTTTAATGGGCAATG...
>OEB_EH10
AGTGGGTAGAGGGTTAAGAGCTTTGATCCGGGTCGAACTAAGTCAACCTGGAGGTTTACTA...
QIIME taxon file
file with the following columns:
seqID
lineage
The taxonomic levels of the lineage are separated by semicolons (;).
Negative taxIDs are allowed. Empty taxlevels are filled out using the name of higher-level taxa.
OEB_CA11 k__Metazoa_33208; p__Bryozoa_10205; c__Gymnolaemata_10206; o__Cheilostomatida_10207; f__Adeonidae_558780; g__Reptadeonella_2576536; s__Reptadeonella_violacea_-35055
OEB_EH46 k__Metazoa_33208; p__Bryozoa_10205; c__; o__; f__; g__; s__
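The lineage strings of the two example lines can be reproduced with the sketch below. It infers the layout from the examples only: a one-letter rank prefix, the taxon name with spaces replaced by underscores, the taxID, and a bare prefix for levels without a taxon. The function name qiime_lineage is hypothetical.

```python
PREFIXES = ["k", "p", "c", "o", "f", "g", "s"]

def qiime_lineage(taxa):
    """taxa: seven (name, taxID) tuples or None, ordered from kingdom to species."""
    parts = []
    for prefix, taxon in zip(PREFIXES, taxa):
        if taxon is None:
            parts.append(f"{prefix}__")  # level left empty, as in OEB_EH46
        else:
            name, tax_id = taxon
            parts.append(f"{prefix}__{name.replace(' ', '_')}_{tax_id}")
    return "; ".join(parts)

full = qiime_lineage([
    ("Metazoa", 33208), ("Bryozoa", 10205), ("Gymnolaemata", 10206),
    ("Cheilostomatida", 10207), ("Adeonidae", 558780),
    ("Reptadeonella", 2576536), ("Reptadeonella violacea", -35055),
])
partial = qiime_lineage([("Metazoa", 33208), ("Bryozoa", 10205)] + [None] * 5)
print(full)
print(partial)
```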
RDP
RDP trainseq fasta
Fasta file with the definition as follows
>seqID cellularOrganisms;domain_taxID;kingdom_taxID;phylum_taxID;class_taxID;order_taxID;family_taxID;genus_taxID;species_taxID
Negative taxIDs are allowed. Empty taxlevels are filled out using the name of higher-level taxa (e.g. Polychaeta_6341_order).
>MG655623_1 cellularOrganisms;Eukaryota_2759;Metazoa_33208;Ctenophora_10197;Nuda_1919246;Beroida_37538;Beroidae_37539;Beroe_10199;Beroe_forskalii_140453
ATTTTAGATAAATGATTAGGTTCTGTTTATCATTACAATATTGCTTCTTTATATTTTTTTTTTTCTATTTCTTTAGGGTTTTGTGCCTTTTTTTATTCTTTTATTATAAGATTGTCTTTAGTTTGGCCTTTTGCATTTCTATCTTCAGGTTCTATCTATTTGCATTACGTTACTT
>7437763 cellularOrganisms;Eukaryota_2759;Metazoa_33208;Annelida_6340;Polychaeta_6341;Polychaeta_6341_order;Orbiniidae_46603;Orbinia_195262;Orbinia_johnsoni_-91
CGAACAGAACTAGGCCAACCCGGCTCTCTTCTTGGAAGAGACCAACTATACAATACAATTGTTACCGCTCACGCAGTATTAATAATTTTCTTTCTTGTAATGCCCGTCCTAATTGGAGGATTTGGCAACTGACTTGTCCCTTTAAT
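The filling of empty taxlevels can be illustrated as follows. This is a sketch under assumptions: an empty level takes the label of the nearest higher-level taxon followed by the rank name, which matches Polychaeta_6341_order above. How several consecutive empty levels are labeled is not shown in the examples, so that part is a guess. The function name rdp_definition is hypothetical.

```python
RANKS = ["domain", "kingdom", "phylum", "class",
         "order", "family", "genus", "species"]

def rdp_definition(seq_id, taxa):
    """taxa: eight (name, taxID) tuples or None, ordered from domain to species.

    Assumes the domain is always present.
    """
    labels = []
    last_named = None  # label of the most recent level that has a real taxon
    for rank, taxon in zip(RANKS, taxa):
        if taxon is not None:
            name, tax_id = taxon
            last_named = f"{name.replace(' ', '_')}_{tax_id}"
            labels.append(last_named)
        else:
            # Assumption: reuse the nearest named higher taxon plus the rank.
            labels.append(f"{last_named}_{rank}")
    return f">{seq_id} cellularOrganisms;" + ";".join(labels)

definition = rdp_definition("7437763", [
    ("Eukaryota", 2759), ("Metazoa", 33208), ("Annelida", 6340),
    ("Polychaeta", 6341), None, ("Orbiniidae", 46603),
    ("Orbinia", 195262), ("Orbinia johnsoni", -91),
])
print(definition)
```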
RDP taxon file
file with the following columns separated by asterisks (*):
taxID
taxon_name_taxID
parent taxID
taxonomic rank index (0: root, 1: domain, 2: kingdom, 3: phylum, 4: class, 5: order, 6: family, 7: genus, 8: species)
taxonomic rank
Negative taxIDs are allowed. Empty taxlevels are filled out using the name of higher-level taxa.
-1*Acanthogyrus_cheni_-1*2493664*8*species
-10*Amynthas_sexpectatus_-10*195544*8*species
-100*Meiodrilus_adhaerens_-100*2723626*8*species
-1000*Runcinia_erythrina_-1000*486328*8*species
-10000*Psaltoda_claripennis_-10000*1225615*8*species
-10001*Psaltoda_flavescens_-10001*1225615*8*species
...
-35075*Polychaeta_6341_order*6341*5*order
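A line of the taxon file can be split into its five fields as sketched below. The function name parse_rdp_taxon_line and the dictionary keys are hypothetical.

```python
def parse_rdp_taxon_line(line):
    """Split one asterisk-separated taxon line into named fields."""
    tax_id, label, parent, rank_index, rank = line.strip().split("*")
    return {
        "taxID": int(tax_id),           # may be negative
        "label": label,                 # taxon_name_taxID
        "parent_taxID": int(parent),
        "rank_index": int(rank_index),
        "rank": rank,
    }

species = parse_rdp_taxon_line("-1*Acanthogyrus_cheni_-1*2493664*8*species")
filled = parse_rdp_taxon_line("-35075*Polychaeta_6341_order*6341*5*order")
print(species["taxID"], filled["label"])
```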
SINTAX
SINTAX fasta
Fasta file with the definition line as follows
>seqID;tax=k:kingdom_taxID,p:phylum_taxID,c:class_taxID,o:order_taxID,f:family_taxID,g:genus_taxID,s:species_taxID
Negative taxIDs are allowed. Empty taxlevels are filled out using the name of higher-level taxa (e.g. Monostilifera_6227_family).
>KF935544_1;tax=k:Metazoa_33208,p:Nemertea_6217,c:Enopla_6225,o:Monostilifera_6227,f:Monostilifera_6227_family,g:Vieitezia_1068817,s:Vieitezia_luzmurubeae_1068818
ATTTTAGATAAATGATTAGGTTCTGTTTATCATTACAATATTGCTTCTTTATATTTTTTTTTTTCTATTTCTTTAGGGTTTTGTGCCTTTTTTTATTCTTTTATTATAAGATTGTCTTTAGTTTG
>PP587771_1;tax=k:Metazoa_33208,p:Chordata_7711,c:Mammalia_40674,o:Rodentia_9989,f:Sciuridae_55153,g:Sciurus_10001
CCTCCTCTAGCAGGAAATCTAGCCCATGCAGGAGCCTCAATAGATCTAACTATTTTCTCACTCCACCTGGCAGGTGTTTCCTCCATCTTAGGGGCAATTAATTTTATTACTACTATTATCAATAT
>BOLD_COI-5P_YBIVV3784-23;tax=k:Metazoa_33208,p:Arthropoda_6656,c:Insecta_50557,o:Diptera_7147,f:Rhagionidae_92609,g:Chrysopilus_124301,s:Chrysopilus_alaskaensis_-9996
TTTATATTTTATCTTTGGAGCTTGAGCGGGTATAGTAGGAACATCTCTTAGTATATTAATTCGAGCAGAATTAGGCCATCCTGGAGCATTAATTGGTGACGATCAAATTTATAATGTGATTGTAA
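For checking a file in this format, a definition line can be split back into the seqID and its rank labels. This is a sketch based only on the layout shown above. The function name parse_sintax_header is hypothetical.

```python
def parse_sintax_header(header):
    """Return (seqID, {rank letter: label}) from a SINTAX definition line."""
    body = header.strip().lstrip(">")
    seq_id, _, tax = body.partition(";tax=")
    # Each item looks like 'g:Sciurus_10001'; split on the first colon only.
    ranks = dict(item.split(":", 1) for item in tax.split(",") if item)
    return seq_id, ranks

seq_id, ranks = parse_sintax_header(
    ">PP587771_1;tax=k:Metazoa_33208,p:Chordata_7711,c:Mammalia_40674,"
    "o:Rodentia_9989,f:Sciuridae_55153,g:Sciurus_10001"
)
print(seq_id, ranks["g"], "s" in ranks)
```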
VTAM database files
Binary BLAST database files ready to be used by BLAST.
blastdb_name.nhr
blastdb_name.nin
blastdb_name.nog
blastdb_name.nsd
blastdb_name.nsi
blastdb_name.nsq
Taxonomy file with the following columns:
tax_id
parent_tax_id
rank
name_txt
old_tax_id (a former taxID that has been merged into tax_id)
taxlevel
tax_id parent_tax_id rank name_txt old_tax_id taxlevel
1 1 no rank root 0
2 131567 domain Bacteria 1
6 335928 genus Azorhizobium 7
7 6 species Azorhizobium caulinodans 395 8
9 32199 species Buchnera aphidicola 28241 8
10 1706371 genus Cellvibrio 7
11 1707 species Cellulomonas gilvus 8
13 203488 genus Dictyoglomus 7
14 13 species Dictyoglomus thermophilum 8
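The parent_tax_id column allows a lineage to be rebuilt by walking from a taxon up to the root. The sketch below assumes a tab-separated file with the header shown above and empty old_tax_id fields where no merge applies. The function names load_taxonomy and lineage are hypothetical.

```python
import csv
import io

def load_taxonomy(handle):
    """Index the taxonomy rows by integer tax_id."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {int(row["tax_id"]): row for row in reader}

def lineage(tax_id, taxonomy):
    """Return taxon names from tax_id upwards, stopping at the root
    or at the first parent that is missing from the table."""
    names = []
    while tax_id in taxonomy:
        row = taxonomy[tax_id]
        names.append(row["name_txt"])
        parent = int(row["parent_tax_id"])
        if parent == tax_id:  # the root row is its own parent
            break
        tax_id = parent
    return names

# Three rows taken from the example above.
demo = (
    "tax_id\tparent_tax_id\trank\tname_txt\told_tax_id\ttaxlevel\n"
    "1\t1\tno rank\troot\t\t0\n"
    "6\t335928\tgenus\tAzorhizobium\t\t7\n"
    "7\t6\tspecies\tAzorhizobium caulinodans\t395\t8\n"
)
taxonomy = load_taxonomy(io.StringIO(demo))
names = lineage(7, taxonomy)
print(names)
```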
Other
outdir
Name of the directory where the output files are written
out
String for naming output files