Abstract: | Phylogenomics heavily relies on well-curated sequence data sets that comprise, for each gene, exclusively 1:1 orthologos. Paralogs are treated as a dangerous nuisance that has to be detected and removed. We show here that this severe restriction of the data sets is not necessary. Building upon recent advances in mathematical phylogenetics, we demonstrate that gene duplications convey meaningful phylogenetic information and allow the inference of plausible phylogenetic trees, provided orthologs and paralogs can be distinguished with a degree of certainty. Starting from tree-free estimates of orthology, cograph editing can sufficiently reduce the noise to find correct event-annotated gene trees. The information of gene trees can then directly be translated into constraints on the species trees. Although the resolution is very poor for individual gene families, we show that genome-wide data sets are sufficient to generate fully resolved phylogenetic trees, even in the presence of horizontal gene transfer.Molecular phylogenetics is primarily concerned with the reconstruction of evolutionary relationships between species based on sequence information. To this end, alignments of protein or DNA sequences are used, whose evolutionary history is believed to be congruent to that of the respective species. This property can be ensured most easily in the absence of gene duplications and horizontal gene transfer (HGT). Phylogenetic studies judiciously select families of genes that rarely exhibit duplications (such as rRNAs, most ribosomal proteins, and many of the housekeeping enzymes). In phylogenomics, elaborate automatic pipelines such as HaMStR (1), are used to filter genome-wide data sets to at least deplete sequences with detectable paralogs (homologs in the same species).In the presence of gene duplications, however, it becomes necessary to distinguish between the evolutionary history of genes (gene trees) and the evolutionary history of the species (species trees) in which these genes reside. Leaves of a gene tree represent genes. Their inner nodes represent two kinds of evolutionary events, namely the duplication of genes within a genome—giving rise to paralogs—and speciations, in which the ancestral gene complement is transmitted to two daughter lineages. Two genes are (co)orthologous if their last common ancestor in the gene tree represents a speciation event, whereas they are paralogous if their last common ancestor is a duplication event; see refs. 2 and 3 for a more recent discussion on orthology and paralogy relationships. Speciation events, in turn, define the inner vertices of a species tree. However, they depend on both the gene and the species phylogeny, as well as the reconciliation between the two. The latter identifies speciation vertices in the gene tree with a particular speciation event in the species tree and places the gene duplication events on the edges of the species tree. Intriguingly, it is nevertheless possible in practice to distinguish orthologs with acceptable accuracy without constructing either gene or species trees (4). Many tools of this type have become available over the last decade; see refs. 5 and 6 for a recent review. The output of such methods is an estimate Θ of the true orthology relation Θ?, which can be interpreted as a graph GΘ whose vertices are genes and whose edges connect estimated (co)orthologs.Recent advances in mathematical phylogenetics suggest that the estimated orthology relation Θ contains information on the structure of the species tree. To make this connection, we combine here three abstract mathematical results that are made precise in Materials and Methods below.- i)Building upon the theory of symbolic ultrametrics (7), we showed that in the absence of horizontal gene transfer, the orthology relation of each gene family is a cograph (8). Cographs can be generated from the single-vertex graph K1 by complementation and disjoint union (9). This special structure of cographs imposes very strong constraints that can be used to reduce the noise and inaccuracies of empirical estimates of orthology from pairwise sequence comparison. To this end, the initial estimate of GΘ is modified to the closest correct orthology relation GΘ? in such a way that a minimal number of edges (i.e., orthology assignments) are introduced or removed. This amounts to solving the cograph-editing problem (10, 11).
- ii)It is well known that each cograph is equivalently represented by its cotree (9). The cotree is easily computed for a given cograph. In our context, the cotree of GΘ? is an incompletely resolved event-labeled gene tree. That is, in addition to the tree topology, we know for each internal branch point whether it corresponds to a speciation or a duplication event. Even though adjacent speciations or adjacent duplications cannot be resolved, the tree faithfully encodes the relative order of any pair of duplication and speciation (8). In the presence of horizontal gene transfer, GΘ may deviate from the structural requirements of a cograph. Still, the situation can be described in terms of edge-colored graphs whose subgraphs are cographs (7, 8), so that the cograph structure remains an acceptable approximation.
- iii)Every triple (rooted binary tree on three leaves) in the cotree that has leaves from three species and is rooted in a speciation event also appears in the underlying species tree (12). Thus, the estimated orthology relation, after editing to a cograph and conversion to the equivalent event-labeled gene tree, provides much information on the species tree. This result allows us to collect, from the cotrees for each gene family, partial information on the underlying species tree. Interestingly, only gene families that harbor duplications, and thus have a nontrivial cotree, are informative. If no paralogs exist, then the orthology relation GΘ is a clique (i.e., every family member is orthologous to every other family member) and the corresponding cotree is completely unresolved, and hence contains no triple. On the other hand, full resolution of the species tree is guaranteed if at least one duplication event between any two adjacent speciations is observable. The achievable resolution therefore depends on the frequency of gene duplications and the number of gene families.
Despite the variance reduction due to cograph editing, noise in the data, as well as the occasional introduction of contradictory triples as a consequence of horizontal gene transfer, is unavoidable. The species triples collected from the individual gene families thus will not always be congruent. A conceptually elegant way to deal with such potentially conflicting information is provided by the theory of supertrees in the form of the largest set of consistent triples (13, 14). The data will not always contain a sufficient set of duplication events to achieve full resolution. To this end, we consider trees with the property that the contraction of any edge leads to the loss of an input triple. There may be exponentially many alternative trees of this type. They can be listed efficiently using Semple’s algorithms (15). To reduce the solution space further, we search for a least resolved tree in the sense of ref. 16, i.e., a tree that has the minimum number of inner vertices. It constitutes one of the best estimates of the phylogeny without pretending a higher resolution than actually supported by the data. In SI Appendix, we discuss alternative choices.The mathematical reasoning summarized above, outlined in Materials and Methods, and presented in full detail in SI Appendix, directly translates into a computational workflow, . It entails three NP-hard combinatorial optimization problems: cograph editing (11), maximal consistent triple set (17–19), and least resolved supertree (16). We show here that they are nevertheless tractable in practice by formulating them as Integer Linear Programs (ILP) that can be solved for both artificial benchmark data sets and real-life data sets, comprising genome-scale protein sets for dozens of species, even in the presence of horizontal gene transfer.Open in a separate windowOutline of the computational framework. Starting from an estimated orthology relation Θ, its graph representation GΘ is edited to obtain the closest cograph GΘ*, which, in turn, is equivalent to a (not necessarily fully resolved) gene tree T and an event labeling t. From (T, t), we extract the set ?? of all relevant species triples. As the triple set ?? need not be consistent, we compute the maximal consistent subset ??? of ??. Finally, we construct a least resolved species tree from ???. |