Input/Output

This list of arguments and options can be obtained by typing

perl name_of_the_script.pl -help

Lineage files

BOLD data package TSV

tsv file with the following columns:

-processid -sampleid -specimenid -museumid -fieldid -inst -bin_uri -identification -funding_src -kingdom -phylum -class -order -family -subfamily -genus -species -subspecies -identified_by -voucher_type -collectors -collection_date -collection_date_accuracy -life_stage -sex -reproduction -extrainfo -notes -coord -coord_source -coord_accuracy -elev -depth -elev_accuracy -depth_accuracy -country -province -country_iso -region -sector -site -collection_time -habitat -collection_note -associated_taxas -associated_specimen -species_reference -identification_method -recordset_code_arr -gb_acs -marker_code -nucraw -sequence_run_site -processid_minted_date -sequence_upload_date -identification_rank

The reduced metadata file do not contain the sequence, and statrs with the sequence ID used in COInd (BOLD_marker_processid format)

processid       sampleid        specimenid      museumid        fieldid inst    bin_uri identification  funding_src     kingdom phylum  class   order   family  subfamily       genus   species subspecies      identified_by   voucher_type    collectors      collection_date collection_date_accuracy        life_stage      sex     reproduction    extrainfo       notes   coord   coord_source    coord_accuracy  elev    depth   elev_accuracy   depth_accuracy  country province        country_iso     region  sector  site    collection_time habitat collection_noteassociated_taxa  associated_specimen     species_reference       identification_method   recordset_code_arr      gb_acs  marker_code     nucraw  sequence_run_site       processid_minted_date   sequence_upload_date    identification_rank
AAASF001-17     CBGSFMX-0101    7804897 None    CBGSFMX-0101    Universidad Autonoma de Nuevo Leon      BOLD:ADP3520    Lutzomyia cruciata      None    AnimaliaArthropoda      Insecta Diptera Psychodidae     Phlebotominae   Lutzomyia       Lutzomyia cruciata      None    Jorge J. Rodriguez Rojas        None    Wilbert P       2016-10-28      None    Adult   M       S       None    Slide mounted with Euparal      (19.3786,-88.1892)      None    None    None    None    None    None    Mexico  Quintana Roo    None    Candelaria      None    None    None    None    None    None    None    None    Morphological   ['AAASF', 'DS-17IBMWP', 'DS-UNIQUE17']  MK851247        COI-5P  AACATTATATTTTATTTTTGGAGCCTGAGCAGGAATAGTGGGAACATCTTTAAGAATTTTAATTCGAGCAGAATTAGGTCACCCCGGTGCTTTAATTGGTGATGATCAAATTTATAATGTTATTGTTACAGCTCATGCATTTGTAATAATTTTTTTTATAGTTATACCTATTATAATTGGAGGATTTGGTAACTGATTAGTTCCTTTAATATTAGGAGCCCCTGATATAGCATTCCCTCGAATAAATAATATAAGATTTTGACTTTTACCCCCCTCTCTTACTCTCCTTCTTACAAGAAGTATAGTTGAAACTGGGGCAGGAACAGGATGAACTGTTTATCCACCTCTTTCAAGAAATATTGCCCATAGAGGAGCTTCTGTTGATTTAGCAATTTTTTCCCTACATTTAGCCGGGATTTCATCTATTCTTGGAGCAGTAAATTTTATTACTACAGTTATTAATATACGATCTGCTGGAATTACATTAGATCGAATACCTTTATTTGTTTGATCTGTAATAATTACTGCGGTACTTCTATTATTATCATTACCTGTTTTAGCAGGTGCAATTACAATACTTCTAACTGATCGTAATCTAAATACTTCTTTTTTTGACCCTGCGGGAGGTGGGGATCCAATTTTATATCAACATTTATTT      Instituto Politecnico Nacional, Centro de Biotecnologia Genomica        30-May-2017     12-Jun-2017     species

lineage tsv without taxID

tsv file with the following columns:

  • phylum

  • class

  • order

  • family

  • subfamily

  • genus

  • species

  • seqIDs

All identical lineages are pooled into a single line, seqIDs are in the last column separated by semicolons

phylum  class   order   family  subfamily       genus   species seqIDs
Acanthocephala                                                  12418139
Acanthocephala  Archiacanthocephala     Gigantorhynchida        Gigantorhynchidae               Mediorhynchus           5445424;3143887
Acanthocephala  Archiacanthocephala     Gigantorhynchida        Gigantorhynchidae               Mediorhynchus   Mediorhynchus gallinarum        15188348;15188349;5445423

lineage tsv with taxID (output of select_taxa.pl)

tsv file with the following columns:

  • taxon

  • taxID

  • homonymy

  • number of sequences

  • domain

  • kingdom

  • phylum

  • class

  • order

  • family

  • subfamily

  • genus

  • species

taxon   taxID   homonymy        number of sequences     domain  kingdom phylum  class   order   family  genus   species
Abylidae        316207  0       33      Eukaryota       Metazoa Cnidaria        Hydrozoa        Siphonophorae   Abylidae

lineage tsv with taxID (output of add_taxids.pl)

tsv file with the following columns:

  • lowest_taxname

  • lowest_rank

  • lowest_TaxID

  • phylum

  • class

  • order

  • family

  • subfamily

  • genus

  • species

  • seqIDs

lowest_taxname  lowest_rank     lowest_TaxID    phylum  class   order   family  subfamily       genus   species seqIDs
Acanthocephala  phylum  10232   Acanthocephala                                                  12418139
Mediorhynchus   genus   60535   Acanthocephala  Archiacanthocephala     Gigantorhynchida        Gigantorhynchidae               Mediorhynchus           3143887;5445424

custom lineages tsv

tsv file with the following columns:

  • phylum

  • class

  • order

  • family

  • subfamily

  • genus

  • species

  • homonymy

  • seqIDs

phylum  class   order   family  subfamily       genus   species homonymy        seqIDs
Cnidaria        Hydrozoa        Leptothecata    Aglaopheniidae          Aglaophenia             0       OEB_MLR10
Bryozoa Gymnolaemata    Cheilostomatida Margarettidae           Margaretta      Margaretta cereoides    1       OEB_EH13;OEB_EH17;OEB_EH19
Streptophyta    Magnoliopsida   Gentianales     Apocynaceae             Margaretta      Margaretta cereoides    1       OEB_EH13;OEB_EH17;OEB_EH19

ambiguous lineages

tsv file with the following columns:

  • match_lineage_proportion

  • ncbi_taxname

  • ncbi_taxlevel

  • ncbi_TaxID

  • phylum

  • class order

  • family

  • subfamily

  • genus

  • species

  • ncbi_domain

  • ncbi_kingdom

  • ncbi_phylum

  • ncbi_class

  • ncbi_order

  • ncbi_family

  • ncbi_genus

  • ncbi_species

  • ncbi_taxname

  • seqIDs

match_lineage_proportion        ncbi_taxname    ncbi_taxlevel   ncbi_TaxID      phylum  class   order   family  subfamily       genus   species ncbi_domain     ncbi_kingdom    ncbi_phylum     ncbi_class      ncbi_order      ncbi_family     ncbi_genus      ncbi_species    ncbi_taxname    seqIDs
0.4     Bolbophorus     genus   186184  Platyhelminthes Trematoda       Diplostomida    Diplostomidae   Bolbophorinae   Bolbophorus             Eukaryota       Metazoa Platyhelminthes Trematoda       Strigeidida     Bolbophoridae   Bolbophorus             Bolbophorus     12416284;9942141;15268484;12416286;12416287;12416283;12417832;3490428;12417833;5993483;12416282;12416285;12416280;12416281
0.33    Sylon hippolytes        species 399056  Arthropoda      Hexanauplia             Clistosaccidae          Sylon   Sylon hippolytes        Eukaryota       Metazoa Arthropoda      Thecostraca             Sylonidae       Sylon   Sylon hippolytes        Sylon hippolytes        2631808;2631807;2631809;2631789;2631806;2631805

Sequence files

sequence tsv without taxID

tsv file with the following columns:

  • seqID

  • sequence

seqID   sequence
12418139        AGATATTGGTATATTATATATTTTGTTTGCGTTATGAAGAGGC...
3143887 GTGATATATATAATGTCATCGGTATGAAGTGGTATTATAGGGGTGAT...

sequence tsv with taxID

tsv file with the following columns:

  • seqID

  • taxID

  • sequence

seqID   taxID   sequence
11611742        10236   GGGATAATATATATTTTGCTTGCATTGTGGAGGG...
10907577        -9466   TAAGATTTTGAATATTACCTCCATCAATTACATT...
GU179406_1      2921812 GGACTCCTTGGTACTTCTATAAGATTGCTTCTGT...

custom sequences tsv

tsv file with the following columns:

  • seqID

  • taxon name (any taxonomic level)

  • sequence

seqID   taxon_name      sequence
xxx_10236 Porifera      GGGATAATATATATTTTGCTTGCATTGTGGAGGG...
xxx_10907577    Margaretta      TAAGATTTTGAATATTACCTCCATCAATTACATT...

RDP classifier trainset fasta

Fasta file in RDP Classifier trainseq format. Can be downloaded from https://sourceforge.net/projects/rdp-classifier/files/RDP_Classifier_TrainingData/

>AJ000684       Mycobacterium heidelbergense str. 2554/91 Type  domain__Bacteria; phylum__Actinobacteria; class__Actinobacteria; order__Mycobacteriales; family__Mycobacteriaceae; genus__Mycobacterium
gaacgctggcggcgtgcttaacacatgcaagtcgaacggaaaggtctctt
>EF599163       Vibrio atlanticus str. LMG 24300 Type   domain__Bacteria; phylum__Proteobacteria; class__Gammaproteobacteria; order__Vibrionales; family__Vibrionaceae; genus__Vibrio
gtttgatcctggctcagattgaacgctggcggcaggcctaacacatgcaa

Database formats

BLAST database files

Binary files ready to use by BLAST.

  • blastdb_name.nhr

  • blastdb_name.nin

  • blastdb_name.nog

  • blastdb_name.nsd

  • blastdb_name.nsi

  • blastdb_name.nsq

full tsv

The sequence tsv and the taxonomy files can be formatted by format_db.pl to a full tsv file containing the following columns:

  • seqID

  • taxon

  • taxID

  • taxlevel

  • domain

  • domain_taxID

  • kingdom

  • kingdom_taxID

  • phylum

  • phylum_taxID

  • class

  • class_taxID order

  • order_taxID

  • family

  • family_taxID

  • genus

  • genus_taxID

  • species

  • species_taxID

  • sequence

seqID   taxon   taxID   taxlevel        domain  domain_taxID    kingdom kingdom_taxID   phylum  phylum_taxID    class   class_taxID     order   order_taxID     family  family_taxID    genus   genus_taxID     species species_taxID   sequence
5423724 Aspidoscopulia australia        1001026 8       Eukaryota       2759    Metazoa 33208   Porifera        6040    Hexactinellida  60882   Hexactinosida   98040   Farreidae       98041   Aspidoscopulia  999811  Aspidoscopulia_australia1001026 GGATCTCTATTAGAAGACGACCACACCTATAACGTTGTAGTTACAGCTCACGC...

QIIME

QIIME trainseq fasta

Fasta file with only seqIDs in the definition line

>OEB_CA11
AGTGGTCTCAGTGCTTTAATTCGCATTGAGTTAAGTCAGCCAGGTGGTTTAATGGGCAATG...
>OEB_EH10
AGTGGGTAGAGGGTTAAGAGCTTTGATCCGGGTCGAACTAAGTCAACCTGGAGGTTTACTA...

QIIME taxon file

file with the following columns:

  • seqID

  • lineage

The taxonomic levels of the lineage are separated by ;

Negative taxIDs are allowed. Empty taxlevels are filled out using the name of higher-level taxa.

OEB_CA11        k__Metazoa_33208; p__Bryozoa_10205; c__Gymnolaemata_10206; o__Cheilostomatida_10207; f__Adeonidae_558780; g__Reptadeonella_2576536; s__Reptadeonella_violacea_-35055
OEB_EH46        k__Metazoa_33208; p__Bryozoa_10205; c__; o__; f__; g__; s__

RDP

RDP trainseq fasta

Fasta file with the definition as follows

>seqID cellularOrganisms;domain_taxID;kingdom_taxID;phylum_taxID;class_taxID;order_taxID;family_taxID;genus_taxID;species_taxID

Negative taxIDs are allowed Empty taxlevels are filled out using the name of higher-level taxa (e.g. Polychaeta_6341_order).

>MG655623_1 cellularOrganisms;Eukaryota_2759;Metazoa_33208;Ctenophora_10197;Nuda_1919246;Beroida_37538;Beroidae_37539;Beroe_10199;Beroe_forskalii_140453
ATTTTAGATAAATGATTAGGTTCTGTTTATCATTACAATATTGCTTCTTTATATTTTTTTTTTTCTATTTCTTTAGGGTTTTGTGCCTTTTTTTATTCTTTTATTATAAGATTGTCTTTAGTTTGGCCTTTTGCATTTCTATCTTCAGGTTCTATCTATTTGCATTACGTTACTT
>7437763 cellularOrganisms;Eukaryota_2759;Metazoa_33208;Annelida_6340;Polychaeta_6341;Polychaeta_6341_order;Orbiniidae_46603;Orbinia_195262;Orbinia_johnsoni_-91
CGAACAGAACTAGGCCAACCCGGCTCTCTTCTTGGAAGAGACCAACTATACAATACAATTGTTACCGCTCACGCAGTATTAATAATTTTCTTTCTTGTAATGCCCGTCCTAATTGGAGGATTTGGCAACTGACTTGTCCCTTTAAT

RDP taxon file

file with the following columns separated by stars:

  • taxID

  • taxon_name_taxID

  • parent taxID

  • taxonomic rank index (‘root’,1, ‘domain’,2, ‘kingdom’,3, ‘phylum’,4, ‘class’,5, ‘order’,6, ‘family’,7, ‘genus’,8, ‘species’)

  • taxonomic rank

Negative taxIDs are allowed. Empty taxlevels are filled out using the name of higher-level taxa.

-1*Acanthogyrus_cheni_-1*2493664*8*species
-10*Amynthas_sexpectatus_-10*195544*8*species
-100*Meiodrilus_adhaerens_-100*2723626*8*species
-1000*Runcinia_erythrina_-1000*486328*8*species
-10000*Psaltoda_claripennis_-10000*1225615*8*species
-10001*Psaltoda_flavescens_-10001*1225615*8*species
...
-35075*Polychaeta_6341_order*6341*5*order

SINTAX

SINTAX fasta

Fasta file with the definition line as follows

>seqID;tax=k:kingdom_taxID,p:phylum_taxID,c:class_taxID,o:order_taxID,f:family_taxID,g:genus_taxID,s:species_taxID

Negative taxIDs are allowed Empty taxlevels are filled out using the name of higher-level taxa (e.g. Monostilifera_6227_family).

>KF935544_1;tax=k:Metazoa_33208,p:Nemertea_6217,c:Enopla_6225,o:Monostilifera_6227,f:Monostilifera_6227_family,g:Vieitezia_1068817,s:Vieitezia_luzmurubeae_1068818
ATTTTAGATAAATGATTAGGTTCTGTTTATCATTACAATATTGCTTCTTTATATTTTTTTTTTTCTATTTCTTTAGGGTTTTGTGCCTTTTTTTATTCTTTTATTATAAGATTGTCTTTAGTTTG
>PP587771_1;tax=k:Metazoa_33208,p:Chordata_7711,c:Mammalia_40674,o:Rodentia_9989,f:Sciuridae_55153,g:Sciurus_10001
CCTCCTCTAGCAGGAAATCTAGCCCATGCAGGAGCCTCAATAGATCTAACTATTTTCTCACTCCACCTGGCAGGTGTTTCCTCCATCTTAGGGGCAATTAATTTTATTACTACTATTATCAATAT
>BOLD_COI-5P_YBIVV3784-23;tax=k:Metazoa_33208,p:Arthropoda_6656,c:Insecta_50557,o:Diptera_7147,f:Rhagionidae_92609,g:Chrysopilus_124301,s:Chrysopilus_alaskaensis_-9996
TTTATATTTTATCTTTGGAGCTTGAGCGGGTATAGTAGGAACATCTCTTAGTATATTAATTCGAGCAGAATTAGGCCATCCTGGAGCATTAATTGGTGACGATCAAATTTATAATGTGATTGTAA

VTAM database files

BLAST database binary files ready to use by BLAST.

  • blastdb_name.nhr

  • blastdb_name.nin

  • blastdb_name.nog

  • blastdb_name.nsd

  • blastdb_name.nsi

  • blastdb_name.nsq

Taxonomy file with the following columns:

  • tax_id

  • parent_tax_id

  • rank

  • name_txt

  • old_tax_id (old_tax_id merged to tax_id)

  • taxlevel

tax_id  parent_tax_id   rank    name_txt        old_tax_id      taxlevel
1       1       no rank root            0
2       131567  domain  Bacteria                1
6       335928  genus   Azorhizobium            7
7       6       species Azorhizobium caulinodans        395     8
9       32199   species Buchnera aphidicola     28241   8
10      1706371 genus   Cellvibrio              7
11      1707    species Cellulomonas gilvus             8
13      203488  genus   Dictyoglomus            7
14      13      species Dictyoglomus thermophilum               8

Other

outdir

Name of the directory to write output files

out

String for naming output files