Overview of mkCOInr

mkCOInr is a series of Perl scripts that aims to create COInr, a large, comprehensive, COI database from NCBI-nucleotide and BOLD.

The COInr database is composed of two files

COInr.tsv, that contains sequenceIDs, taxIDs and sequences
taxonomy.tsv that contains all taxIDs and associated information

COInr is freely available and can be easily downloaded from Zenodo

It is planned to produce a new version annually.

Further scripts allow users to customize the database.

Major features of the creation of mkCOInr:

Mass download of sequences and their taxonomic lineages from NCBI-nucleotide and BOLD databases
TaxIDs are used to avoid problems with homonyms and synonyms
Creation of a coherent taxID system. The hierarchical structure of the NCBI taxIDs is completed if necessary with new, negative taxIDs.
When adding sequences with unknown taxIDs, taxon names are matched to already existing taxonomic lineages in the database to identify a correct existing taxID, or to assign a new one.
Taxonomically aware demultiplexing
Creation of a ready-to-use database in BLAST, RDP_classifier QIIME, VTAM or a FULL tsv format

COInr

Is not specific to a particular region of the COI gene. Sequences can be partial and can cover any part of the COI gene.
All cellular organisms are included, even Bacteria.
Sequences with incomplete lineages (e.g. assigned to a family without further precision) are present in the database
Taxa are taken into account only with correct latin name formats (e.g. instead of ‘Proterorhinus sp. BOLD:EUFWF4948-19’, the sequence is assigned to Proterorhinus genus without a species name)

The database can be used directly for similarity-based taxonomic assignations of metabarcoding data with any COI marker (primer pairs) of any geographical regions or target group.

Alternatively, the database can be used as a starting point to create smaller, more specific custom databases.

Sequences can be selected for :

A particular gene region (amplicon of a given primer pair) using select_region

List of taxa (sequences of a taxon list can be either selected or eliminated) using select_taxa

User-defined minimal taxonomic resolution using select_taxa

Additionally, it is also possible to add custom sequences.

This can save a considerable amount of time and effort, since one of the most important challenges of creating a custom database is the mass downloading of the sequences and their pooling into a coherent taxonomic system.

COInr or the custom databases derived from it can be formated to different database formats (qiime, rdp, blast, vtam, full) by format_db

**Figure 1.** The full pipeline to create COInr and options to make a custom database.

Further precisions

The taxonomic origin of the sequences is not checked, but taken as a face value from the source database.

At the scale of the complete database I did not find a satisfactory method to blacklist sequences that are probably incorrectly assigned. However, if a small custom database is produced, the use of a phylogenetic method like SATIVA becomes feasible and recommended to eliminate sequences of dubious origin.