Overview of mkCOInr
=================================================

mkCOInr is a series of Perl scripts that aims to create **COInr, a large, comprehensive, COI database from NCBI-nucleotide and BOLD**.

The COInr database is composed of two files:

- :ref:`COInr.tsv `, which contains :ref:`sequenceIDs `, :ref:`taxIDs ` and sequences
- :ref:`taxonomy.tsv `, which contains all taxIDs and associated information

COInr is freely available and can be downloaded from `Zenodo `_.
A new version is planned annually.

Further scripts allow users to customize the database.

Major features of mkCOInr:

- Mass download of sequences and their taxonomic lineages from the NCBI-nucleotide and BOLD databases
- TaxIDs are used to avoid problems with homonyms and synonyms
- Creation of a coherent taxID system. The hierarchical structure of the NCBI taxIDs is completed, if necessary, with new, negative taxIDs.
- When sequences with unknown taxIDs are added, taxon names are matched to taxonomic lineages already in the database, either to identify the correct existing taxID or to assign a new one.
- Taxonomically aware dereplication
- Creation of a ready-to-use database in BLAST, RDP_classifier, QIIME, VTAM or a FULL tsv format

**COInr**

- Is not specific to a particular region of the COI gene. Sequences can be partial and can cover any part of the COI gene.
- Includes all cellular organisms, even Bacteria.
- Includes sequences with incomplete lineages (e.g. assigned to a family without further precision).
- Takes taxa into account only if their names follow a correct Latin name format (e.g. instead of 'Proterorhinus sp. BOLD:EUFWF4948-19', the sequence is assigned to the genus *Proterorhinus* without a species name).

The database can be used directly for similarity-based taxonomic assignment of metabarcoding data obtained with any COI marker (primer pair), from any geographical region or target group.
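As an illustration of the Latin-name rule listed above, here is a minimal Python sketch. It is not the mkCOInr implementation (mkCOInr is written in Perl): the function name ``clean_taxon_name``, the regular expressions and the list of open-nomenclature qualifiers are all assumptions made for this example.

```python
import re

# Assumed name formats (not taken from mkCOInr): a capitalized genus,
# optionally followed by a lowercase species epithet.
GENUS_RE = re.compile(r"^[A-Z][a-z-]+$")
EPITHET_RE = re.compile(r"^[a-z-]+$")

# Assumed list of qualifiers that look like epithets but are not names.
NON_EPITHETS = {"sp", "spp", "cf", "aff", "nr"}


def clean_taxon_name(raw_name):
    """Return the part of raw_name that follows a Latin name format.

    Returns "Genus species" when both parts look valid, "Genus" when only
    the genus does, and None when even the first word is not a valid genus.
    """
    tokens = raw_name.split()
    if not tokens or not GENUS_RE.match(tokens[0]):
        return None
    genus = tokens[0]
    if len(tokens) >= 2:
        epithet = tokens[1]
        if EPITHET_RE.match(epithet) and epithet not in NON_EPITHETS:
            return f"{genus} {epithet}"
    # The second word is missing or is not a valid epithet: keep the genus only.
    return genus


if __name__ == "__main__":
    print(clean_taxon_name("Proterorhinus sp. BOLD:EUFWF4948-19"))
```

With the example from the list, the sketch keeps only the genus *Proterorhinus*, because ``sp.`` is not accepted as a species epithet.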
The database can also be **used as a starting point to create smaller, more specific custom databases**. Sequences can be selected for:

- A particular gene region (amplicon of a given primer pair) using :ref:`select_region `
- A list of taxa (sequences of a taxon list can be either selected or eliminated) using :ref:`select_taxa `
- A user-defined minimal taxonomic resolution using :ref:`select_taxa `

Additionally, it is possible to **add custom sequences**. This can save a considerable amount of time and effort, since one of the main challenges of creating a custom database is the mass downloading of sequences and their pooling into a coherent taxonomic system.

COInr, or the custom databases derived from it, **can be formatted to different database formats** (qiime, rdp, blast, vtam, full) by :ref:`format_db `

.. _fig1_Flowchart:

.. figure:: img/COInr_flowchart_readme.png
   :scale: 50 %
   :alt: Figure 1

   **Figure 1.** The full pipeline to create COInr and options to make a custom database.

**Further details**

- The taxonomic origin of the sequences is not checked, but taken at face value from the source database.
- At the scale of the complete database, I did not find a satisfactory method to blacklist sequences that are probably incorrectly assigned. However, if a small custom database is produced, the use of a phylogenetic method like `SATIVA `_ becomes feasible and is recommended to eliminate sequences of dubious origin.
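To make the taxon-based selection described above more concrete, the following Python sketch shows how a hierarchical taxID system, including negative taxIDs for taxa absent from NCBI, can be used to keep or eliminate sequences. It is only an illustration under stated assumptions: the toy taxIDs, the data layout and the helper names ``lineage`` and ``select_by_taxa`` are hypothetical and do not reproduce the actual ``select_taxa`` script or the real file formats.

```python
# Hypothetical toy taxonomy: taxID -> (parent taxID, taxon name).
# The taxIDs are invented for this example; the negative taxID stands for
# a taxon that was not present in the NCBI taxonomy.
TAXONOMY = {
    1: (1, "root"),
    10: (1, "family A"),
    20: (10, "genus A1"),
    -1: (20, "species added with a new negative taxID"),
    30: (1, "family B"),
}

# Hypothetical sequence records: (sequenceID, taxID, sequence).
SEQUENCES = [
    ("seq1", 20, "ACGT"),
    ("seq2", -1, "ACGA"),
    ("seq3", 30, "TTGA"),
]


def lineage(tax_id, taxonomy):
    """Return the taxIDs from tax_id up to the root (the root is its own parent)."""
    path = []
    seen = set()
    while tax_id not in seen:
        seen.add(tax_id)
        path.append(tax_id)
        tax_id = taxonomy[tax_id][0]  # move to the parent taxID
    return path


def select_by_taxa(sequences, taxonomy, wanted_tax_ids, keep=True):
    """Keep (keep=True) or eliminate (keep=False) sequences of the listed taxa."""
    wanted = set(wanted_tax_ids)
    selected = []
    for record in sequences:
        in_wanted_taxon = bool(wanted & set(lineage(record[1], taxonomy)))
        if in_wanted_taxon == keep:
            selected.append(record)
    return selected


if __name__ == "__main__":
    print([rec[0] for rec in select_by_taxa(SEQUENCES, TAXONOMY, [10])])
    print([rec[0] for rec in select_by_taxa(SEQUENCES, TAXONOMY, [10], keep=False)])
```

Because every taxID, positive or negative, has a parent in the same table, a sequence assigned to a newly created taxon is still found when its family is selected.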