Overview of mkCOInr

mkCOInr is a series of Perl scripts that aims to create COInr, a large, comprehensive COI database from NCBI-nucleotide and BOLD.

The COInr database is composed of two files

COInr is freely available and can be easily downloaded from Zenodo.

It is planned to produce a new version annually.

Further scripts allow users to customize the database.

Major features of the creation of COInr:
  • Mass download of sequences and their taxonomic lineages from NCBI-nucleotide and BOLD databases

  • TaxIDs are used to avoid problems with homonyms and synonyms

  • Creation of a coherent taxID system. The hierarchical structure of the NCBI taxIDs is completed if necessary with new, negative taxIDs.

  • When adding sequences with unknown taxIDs, taxon names are matched to already existing taxonomic lineages in the database to identify a correct existing taxID, or to assign a new one.

  • Taxonomically aware dereplication

  • Creation of a ready-to-use database in BLAST, RDP_classifier, QIIME, VTAM or a FULL tsv format
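The taxID logic described in the list above (match a taxon name to an existing lineage, otherwise assign a new negative taxID) can be illustrated with a minimal Python sketch. This is not the mkCOInr implementation, which is written in Perl. The `Taxonomy` class, its methods, the toy taxIDs (1, 10, 20) and the species name 'Proterorhinus exemplum' are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Taxonomy:
    """Toy taxonomy: taxID -> (name, parent taxID).

    Positive IDs stand in for NCBI taxIDs (the values here are made up).
    """
    nodes: dict = field(default_factory=dict)
    next_new_id: int = -1  # newly created taxa get negative IDs: -1, -2, ...

    def find(self, name, parent_name=None):
        """Return taxIDs whose name matches. If parent_name is given, the
        parent's name must match too, which separates homonyms."""
        hits = []
        for taxid, (node_name, parent_id) in self.nodes.items():
            if node_name != name:
                continue
            if parent_name is not None:
                parent = self.nodes.get(parent_id)
                if parent is None or parent[0] != parent_name:
                    continue
            hits.append(taxid)
        return hits

    def resolve(self, name, parent_name):
        """Return an existing taxID for (name, parent_name), or create a new
        negative taxID under the parent when the parent is known."""
        hits = self.find(name, parent_name)
        if hits:
            return hits[0]
        parents = self.find(parent_name)
        if len(parents) != 1:
            return None  # unknown or ambiguous parent: leave unresolved
        taxid = self.next_new_id
        self.next_new_id -= 1
        self.nodes[taxid] = (name, parents[0])
        return taxid


tax = Taxonomy(nodes={
    1: ("root", None),
    10: ("Gobiidae", 1),
    20: ("Proterorhinus", 10),
})
# Known genus within a known family: the existing taxID is reused.
existing = tax.resolve("Proterorhinus", "Gobiidae")
# Hypothetical species absent from the taxonomy: a negative taxID is created.
created = tax.resolve("Proterorhinus exemplum", "Proterorhinus")
print(existing, created)
```

Working with taxIDs rather than names means that two taxa sharing a name but sitting under different parents stay distinct, while the negative range keeps newly created IDs from colliding with NCBI ones.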

COInr
  • Is not specific to a particular region of the COI gene. Sequences can be partial and can cover any part of the COI gene.

  • Includes all cellular organisms, even Bacteria.

  • Retains sequences with incomplete lineages (e.g. assigned to a family without further precision).

  • Takes taxon names into account only when they have a correct Latin name format (e.g. instead of ‘Proterorhinus sp. BOLD:EUFWF4948-19’, the sequence is assigned to the genus Proterorhinus without a species name).
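The name-format rule in the last point can be sketched as follows. The exact rules applied by mkCOInr are not given in this section, so the regular expressions, the `PLACEHOLDERS` set and the `clean_taxon_name` function are assumptions chosen only to reproduce the example above.

```python
import re

# Assumed shape of a Latin binomial: capitalised genus, lowercase epithet.
BINOMIAL = re.compile(r"^[A-Z][a-z]+ ([a-z]+(?:-[a-z]+)?)$")
GENUS = re.compile(r"^[A-Z][a-z]+$")
# Assumed list of epithets that mark an undetermined species.
PLACEHOLDERS = {"sp", "spp", "cf", "aff", "nr"}


def clean_taxon_name(raw_name):
    """Return (name, rank) with rank 'species' or 'genus',
    or (None, None) if no Latin name can be recovered."""
    name = raw_name.strip()
    match = BINOMIAL.match(name)
    if match and match.group(1) not in PLACEHOLDERS:
        return name, "species"
    # Fall back to the first word if it looks like a genus name.
    first_word = name.split()[0] if name else ""
    if GENUS.match(first_word):
        return first_word, "genus"
    return None, None


print(clean_taxon_name("Proterorhinus sp. BOLD:EUFWF4948-19"))
print(clean_taxon_name("Proterorhinus marmoratus"))
```

Treating the first word as a genus is a simplification; a real pipeline would presumably also check that the name exists in the taxonomy.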

The database can be used directly for similarity-based taxonomic assignment of metabarcoding data with any COI marker (primer pair), from any geographical region or target group.

Alternatively, the database can be used as a starting point to create smaller, more specific custom databases.

Sequences can be selected for:

  • A particular gene region (amplicon of a given primer pair) using select_region

  • A list of taxa (sequences of the listed taxa can be either selected or eliminated) using select_taxa

  • A user-defined minimal taxonomic resolution using select_taxa
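The two taxon-based selection criteria above can be pictured with a small Python sketch. It does not reproduce the interface or file formats of select_taxa; the `RANKS` list, the record layout and the `select` function are hypothetical.

```python
# Assumed rank order, from least to most resolved.
RANKS = ["phylum", "class", "order", "family", "genus", "species"]


def resolution(lineage):
    """Index in RANKS of the lowest rank that is filled in, or -1 if none."""
    level = -1
    for i, rank in enumerate(RANKS):
        if lineage.get(rank):
            level = i
    return level


def select(records, keep_taxa=None, min_rank=None):
    """Return IDs of records that belong to one of keep_taxa (if given)
    and are resolved at least down to min_rank (if given)."""
    selected = []
    for seq_id, lineage in records:
        if keep_taxa is not None and not keep_taxa & set(lineage.values()):
            continue
        if min_rank is not None and resolution(lineage) < RANKS.index(min_rank):
            continue
        selected.append(seq_id)
    return selected


# Illustrative records: (sequence ID, partial lineage).
records = [
    ("seq1", {"family": "Gobiidae", "genus": "Proterorhinus"}),
    ("seq2", {"family": "Gobiidae"}),
    ("seq3", {"family": "Cyprinidae", "genus": "Cyprinus",
              "species": "Cyprinus carpio"}),
]
print(select(records, keep_taxa={"Gobiidae"}))
print(select(records, min_rank="genus"))
```

Eliminating a taxon list instead of selecting it would be the same test with the first condition inverted.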

Additionally, it is also possible to add custom sequences.

This can save a considerable amount of time and effort, since one of the most important challenges of creating a custom database is the mass downloading of the sequences and their pooling into a coherent taxonomic system.

COInr or the custom databases derived from it can be formatted to different database formats (qiime, rdp, blast, vtam, full) by format_db.

Figure 1. The full pipeline to create COInr and options to make a custom database.

Further details

  • The taxonomic origin of the sequences is not checked, but is taken at face value from the source database.

  • At the scale of the complete database, I did not find a satisfactory method to blacklist sequences that are probably incorrectly assigned. However, if a small custom database is produced, the use of a phylogenetic method like SATIVA becomes feasible and is recommended to eliminate sequences of dubious origin.