De novo species delimitation in metabarcoding datasets using ecology and phylogeny
Background: Metabarcoding studies allow a wide variety of taxa to be analysed simultaneously in a fraction of the time taken by morphological identification, but they currently rely on sequence similarity-based methods to delimit operational taxonomic units (OTUs). Similarity-based OTU clustering can lead to inaccurate estimates of diversity, species’ distributions or responses to change, so there is a critical need for methods that delimit species in metabarcoding datasets.
Methods: We introduce SNAPhy (Species delimitation using Niche And PHYlogeny), a novel approach which utilises ecological and phylogenetic information to delimit de novo OTUs in metabarcoding datasets and avoids the problems associated with current OTU clustering methods. Sequencing reads are first divided into ecological groups based on co-occurrence, thereby reducing data complexity and facilitating the use of evolutionary and phylogenetic models (e.g. BEAST and GMYC) to delimit species-level groupings within discrete ecologically informed phylogenies. The utility of SNAPhy is demonstrated using an 18S rDNA nuclear small subunit (nSSU) dataset representing replicated samples taken along the entire length of an estuarine salinity gradient, and SNAPhy is then compared to existing OTU clustering methods.
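The first step described above, dividing sequences into ecological groups by co-occurrence, can be illustrated with a minimal sketch. The abstract does not specify how co-occurrence is measured, so the Jaccard similarity on presence/absence profiles, the single-linkage grouping, the `min_similarity` cut-off and all sample and sequence names below are assumptions made purely for illustration; they are not the SNAPhy implementation.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of sample IDs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cooccurrence_groups(occurrence, min_similarity=0.5):
    """Group sequences with similar sample distributions (single linkage).

    occurrence: dict mapping sequence ID -> set of sample IDs where it occurs.
    min_similarity: hypothetical cut-off, chosen only for this example.
    """
    # Union-find structure: each sequence starts in its own group.
    parent = {seq: seq for seq in occurrence}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Merge any pair of sequences whose sample sets overlap strongly enough.
    for s, t in combinations(occurrence, 2):
        if jaccard(occurrence[s], occurrence[t]) >= min_similarity:
            parent[find(s)] = find(t)

    groups = {}
    for seq in occurrence:
        groups.setdefault(find(seq), []).append(seq)
    return sorted(sorted(g) for g in groups.values())


# Toy presence/absence data along an invented salinity gradient.
occurrence = {
    "seqA": {"marine_1", "marine_2"},
    "seqB": {"marine_1", "marine_2", "marine_3"},
    "seqC": {"fresh_1", "fresh_2"},
    "seqD": {"fresh_1", "fresh_2"},
    "seqE": {"mid_1"},
}
groups = cooccurrence_groups(occurrence)
print(groups)
```

Each resulting group would then be small enough to pass to a separate phylogenetic analysis, which is the complexity reduction the Methods paragraph refers to.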
Results: The OTU clustering methods compared all yielded different numbers of OTUs and different taxonomic distributions of OTUs, which we suggest reflects known differences among taxa in the degree of intraspecific divergence. SNAPhy and UCLUST (with a 98% similarity threshold) gave the most plausible numbers of OTUs, especially within the Nematoda. Additionally, the degree of variation within nematode OTUs delimited by SNAPhy lies within the range of variation observed in deeply metabarcoded individuals.
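To see why a fixed similarity threshold changes the number of OTUs recovered, consider the following toy sketch of greedy centroid clustering. It is a generic illustration in the spirit of threshold-based methods, not UCLUST itself; the reads are invented, assumed to be pre-aligned to equal length, and identity is simply the fraction of matching positions.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first centroid it matches at >= threshold.

    A sequence that matches no existing centroid becomes a new centroid.
    """
    centroids = []
    clusters = []
    for seq in seqs:
        for i, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                clusters[i].append(seq)
                break
        else:
            centroids.append(seq)
            clusters.append([seq])
    return clusters


# Invented 20-bp reads with 0, 1, 2 and 3 mismatches relative to the first read.
reads = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGTACGTACGA",
    "ACGTACGTACGTACGTACAA",
    "TTTTACGTACGTACGTACGT",
]
counts = {t: len(greedy_cluster(reads, t)) for t in (0.85, 0.90, 0.95, 0.99)}
print(counts)
```

The same four reads collapse into one cluster at the loosest threshold and split into four at the strictest, so a single static threshold cannot suit taxa that differ in intraspecific divergence.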
Discussion: SNAPhy avoids the static clustering threshold problems associated with current OTU clustering methods and instead focuses on genuine biological diversity delimited according to a general lineage species concept. We suggest that the SNAPhy approach should play a crucial role in future sequencing-based biodiversity assessment by providing more accurate estimates of species diversity and distributions than current methods, thereby enabling more accurate impact assessments and better informing managerial decisions.