The Leibniz Institute for the Analysis of Biodiversity Change
is a research museum of the Leibniz Association
In phylogenomics, character matrices with extensive missing data are frequently used. These missing data have potentially detrimental effects on the accuracy and robustness of tree inference.
Therefore, many investigators select taxa and genes with high data coverage. The drawback of such selections is that they rely exclusively on data coverage and do not consider the actual signal in the data. Simply selecting taxa and genes with high data coverage might thus not deliver data matrices with optimal signal. As an alternative, we have developed a heuristic that
(1) assesses the information content of genes in supermatrices using a measure of tree-likeness combined with data coverage and
(2) reduces supermatrices with a simple hill-climbing procedure to matrices with high total information content.
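The two steps above can be illustrated with a minimal Python sketch. It is not the published implementation. Several parts are assumptions made only for illustration: per-gene tree-likeness scores are taken as given, tree-likeness and coverage are combined by simple multiplication, the objective `score` (mean information content raised to `alpha`, times the retained fraction of cells) is hypothetical, and the names `info_matrix`, `score`, and `reduce_matrix` are invented here.

```python
from typing import List, Set, Tuple


def info_matrix(present: List[List[int]], tree_likeness: List[float]) -> List[List[float]]:
    """Combine coverage (1 = sequence present, 0 = missing) with per-gene
    tree-likeness. ASSUMPTION: a plain product stands in for the combination."""
    return [[tree_likeness[j] if row[j] else 0.0 for j in range(len(row))]
            for row in present]


def score(info: List[List[float]], taxa: Set[int], genes: Set[int],
          alpha: float = 2.0) -> float:
    """Hypothetical objective: reward a high mean information content of the
    retained cells, but penalise discarding a large share of the matrix."""
    cells = len(taxa) * len(genes)
    if cells == 0:
        return 0.0
    mean_info = sum(info[i][j] for i in taxa for j in genes) / cells
    retained = cells / (len(info) * len(info[0]))
    return (mean_info ** alpha) * retained


def reduce_matrix(info: List[List[float]], min_taxa: int = 4, min_genes: int = 1,
                  alpha: float = 2.0) -> Tuple[List[int], List[int], float]:
    """Simple hill climbing: repeatedly drop the single taxon or gene whose
    removal raises the score most; stop when no removal improves it."""
    taxa = set(range(len(info)))
    genes = set(range(len(info[0])))
    best = score(info, taxa, genes, alpha)
    while True:
        candidates = []
        if len(taxa) > min_taxa:
            candidates += [(score(info, taxa - {i}, genes, alpha), "taxon", i)
                           for i in sorted(taxa)]
        if len(genes) > min_genes:
            candidates += [(score(info, taxa, genes - {j}, alpha), "gene", j)
                           for j in sorted(genes)]
        if not candidates:
            break
        new_score, kind, idx = max(candidates)
        if new_score <= best:
            break  # local optimum reached
        (taxa if kind == "taxon" else genes).discard(idx)
        best = new_score
    return sorted(taxa), sorted(genes), best


# Toy data: four taxa covered for two informative genes, plus one taxon that
# is covered only for a third, weakly tree-like gene.
present = [[1, 1, 0], [1, 1, 0], [1, 1, 0], [1, 1, 0], [0, 0, 1]]
tree_likeness = [0.9, 0.8, 0.2]
kept_taxa, kept_genes, final_score = reduce_matrix(info_matrix(present, tree_likeness))
print(kept_taxa, kept_genes, round(final_score, 4))
```

Because hill climbing accepts only improving moves, the result is a local optimum and depends on the chosen objective. The lower bound of four taxa is likewise an assumption, chosen because smaller unrooted trees carry no topological information.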
Selecting a data subset with the proposed approach increased the chance of recovering correct partial trees more than 10-fold.
Our simulations and analyses of empirical data demonstrate that the selection of data subsets can be improved with formal approaches compared with simply selecting taxa and genes of high data coverage. We are further developing this approach into a hypothesis-driven selection of an optimal concatenated supermatrix.