MitoGeneExtractor
1. Mitochondrial DNA (mtDNA) sequences are often found as byproducts in next-generation sequencing (NGS) datasets that were originally created to capture genomic or transcriptomic information of an organism. These mtDNA sequences are often discarded, wasting this valuable sequencing information.
2. We developed MitoGeneExtractor, a tool that extracts mitochondrial protein-coding genes (PCGs) of interest from NGS libraries through multiple sequence alignments of sequencing reads to amino acid references. General references, for example at the order level, are sufficient for mining mitochondrial PCGs. In a case study, we applied MitoGeneExtractor to recently published genomic datasets of 1993 birds and extracted complete or nearly complete sequences of all 13 mitochondrial PCGs for a large proportion of libraries. Compared to an existing assembly-guided sequence reconstruction algorithm, MitoGeneExtractor was faster and substantially more sensitive.
3. We compared COI sequences mined with MitoGeneExtractor to COI databases. In most samples, the mined sequences showed high similarity to database sequences, and their taxonomic assignment agreed with the assigned morphospecies. In some cases of incongruent taxonomic assignments, we found evidence for contamination in NGS libraries.
4. MitoGeneExtractor allows fast extraction of mitochondrial PCGs from a wide range of NGS datasets. We recommend routinely harvesting and curating mitochondrial sequence information from genomic resources. MitoGeneExtractor output can be used to identify contaminated NGS libraries and to validate the species identity of the sequenced animal based on the extracted COI sequences.
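
The idea of matching nucleotide reads against an amino acid reference (point 2) can be illustrated with a small Python sketch. This is a toy illustration only and not MitoGeneExtractor's implementation: it translates a read in all six reading frames using the vertebrate mitochondrial genetic code and checks for an exact substring match, whereas a real tool would use a tolerant protein-to-DNA alignment. All function names, the `min_len` parameter, and the example sequences are hypothetical.

```python
from itertools import product

# Vertebrate mitochondrial genetic code (NCBI translation table 2),
# listed in TCAG codon order: TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
VERT_MITO_AA = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), VERT_MITO_AA)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(read):
    """Translate a read in three frames on each strand."""
    read = read.upper()
    return [
        translate(strand[offset:])
        for strand in (read, reverse_complement(read))
        for offset in range(3)
    ]


def read_matches_reference(read, reference_aa, min_len=5):
    """Toy criterion (an assumption): some frame is an exact substring
    of the amino acid reference and spans at least min_len residues."""
    return any(
        len(peptide) >= min_len and peptide in reference_aa
        for peptide in six_frame_translations(read)
    )


if __name__ == "__main__":
    reference = "MFINRWLFST"          # made-up amino acid reference fragment
    read = "ATGTTCATTAACCGATGA"       # made-up read; TGA encodes W in this code
    print(read_matches_reference(read, reference))
    print(read_matches_reference(reverse_complement(read), reference))
```

Because the comparison happens at the amino acid level, synonymous nucleotide differences between the sequenced species and the reference do not prevent a match, which is one way to see why a general reference can be sufficient.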