The Leibniz Institute for the Analysis of Biodiversity Change
is a research museum of the Leibniz Association
Random similarity between sequences or sequence sections can impede phylogenetic analyses and the identification of gene homologies. In addition, randomly similar sequences or ambiguously aligned sequence sections can bias the estimation of substitution model parameters. Phylogenomic studies have shown that biases in model estimation and tree reconstruction do not disappear even with large datasets; in fact, they can become more pronounced as more data are added. It is therefore important to identify possible random similarity within sequence alignments before model estimation and tree reconstruction. Several approaches have already been suggested to identify and treat problematic alignment sections, such as GBLOCKS or noisy. We propose an alternative method that identifies random similarity within multiple sequence alignments based on Monte Carlo resampling within a sliding window. The method infers similarity profiles from pairwise sequence comparisons and subsequently calculates a consensus profile. Consensus profiles thus identify dominating patterns of non-random similarity or randomness within sections of multiple sequence alignments. The method therefore appears to be a powerful tool for identifying possible sources of bias in tree reconstruction or gene identification. The approach has been extended to amino acid and nucleotide data. Together with Dr. Patrick Kück and Sandra Meid, both of ZFMK, it is currently being developed further to visualize total randomness among the sequences of a multiple sequence alignment.
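To make the idea of sliding-window Monte Carlo resampling more concrete, the following is a minimal Python sketch. It is not the implementation described above. The function names, the simple match-count score, the null model (residues drawn at random from each sequence's own composition), and the 95% cutoff are all assumptions chosen for illustration. For every window, the observed number of matching positions in a sequence pair is compared with scores from resampled random windows; a window scores +1 if it is more similar than expected by chance and -1 otherwise. The consensus profile is then taken as the mean of all pairwise profiles.

```python
import random
from itertools import combinations


def match_score(a, b):
    """Count identical, non-gap positions between two equal-length windows."""
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


def pairwise_profile(seq_a, seq_b, window=6, n_resamples=200,
                     quantile=0.95, rng=None):
    """Score each window of an aligned sequence pair as +1 or -1.

    Assumed null model (illustrative only): random windows are drawn with
    replacement from the residues of each sequence, so the null reflects
    the composition of the pair. Both sequences must have the same length
    and contain at least one non-gap residue.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch reproducible
    pool_a = [c for c in seq_a if c != "-"]
    pool_b = [c for c in seq_b if c != "-"]
    profile = []
    for start in range(len(seq_a) - window + 1):
        observed = match_score(seq_a[start:start + window],
                               seq_b[start:start + window])
        # Monte Carlo null distribution of scores for random windows.
        null = sorted(
            match_score(rng.choices(pool_a, k=window),
                        rng.choices(pool_b, k=window))
            for _ in range(n_resamples)
        )
        threshold = null[int(quantile * (n_resamples - 1))]
        # +1: more similar than expected by chance; -1: not distinguishable.
        profile.append(1 if observed > threshold else -1)
    return profile


def consensus_profile(alignment, **kwargs):
    """Average the pairwise profiles of all sequence pairs, window by window."""
    seqs = list(alignment.values())
    profiles = [pairwise_profile(a, b, **kwargs)
                for a, b in combinations(seqs, 2)]
    return [sum(column) / len(profiles) for column in zip(*profiles)]


if __name__ == "__main__":
    # Hypothetical toy alignment, not real data.
    toy = {
        "taxon1": "ACGTTGCAAGCTCAGTACGGTCAT",
        "taxon2": "ACGTTGCAAGCTGTCAAGTCCTGA",
        "taxon3": "ACGTTGCAAGCTTACGGATCAGTC",
    }
    print(consensus_profile(toy, window=6, n_resamples=200))
```

In this sketch, consensus values close to +1 mark windows in which most sequence pairs are more similar than the null model predicts, while values close to -1 mark windows dominated by random similarity. Such windows would be candidates for exclusion before model estimation and tree reconstruction.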