Motivation: Horizontal gene transfer (HGT) is believed to be ubiquitous among bacteria, and plays a major role in their genome diversification as well as their ability to develop resistance to antibiotics. In light of its evolutionary significance and implications for human health, developing accurate and efficient methods for detecting and reconstructing HGT is imperative.
Results: In this article we provide a new HGT-oriented likelihood framework for many problems that involve phylogeny-based HGT detection and reconstruction. Beside the formulation of various likelihood criteria, we show that most of these problems are NP-hard, and offer heuristics for efficient and accurate reconstruction of HGT under these criteria. We implemented our heuristics and used them to analyze biological as well as synthetic data. In both cases, our criteria and heuristics exhibited very good performance with respect to identifying the correct number of HGT events as well as inferring their correct location on the species tree.
Supplementary information: Supplementary data are available at Bioinformatics online.
Unlike eukaryotes, which evolve largely through vertical lineal descent, bacteria acquire genetic material through the transfer of DNA segments across species boundaries—a process known as horizontal gene transfer (HGT). This process plays a major role in bacterial genome diversification (Doolittle et al., 2003), and is a significant mechanism by which bacteria develop resistance to antibiotics (Paulsen, 2003). In the presence of HGT, the evolutionary history of a set of organisms is modeled by a phylogenetic network1, which is a directed acyclic graph obtained by positing a set of edges between pairs of the branches of an organismal tree to model the horizontal transfer of genetic material (Moret et al., 2004).
Therefore, to reconstruct an accurate Tree (or Network) of Life and to unravel bacterial genomic complexities, developing accurate criteria and efficient methods for reconstructing and assessing the quality of phylogenetic networks is imperative. A large body of work has been introduced in recent years to address phylogenetic network reconstruction and evaluation. In general, three categories of non-tree-like models have been addressed, all of which have been introduced under the umbrella concept of phylogenetic networks. However, major differences exist among the three categories. Splits networks (e.g. Huson and Bryant, 2006) are graphical models that capture incompatibilities in the data owing to various factors, not necessarily HGT or hybrid speciation. The second category is that of recombination networks (e.g. Gusfield and Bansal, 2005), which are used to model the evolution of haplotypes and genes at the population level. HGT networks are the extension of phylogenetic trees to enable the modeling of reticulation events, such as HGT and hybrid speciation [these are also called reticulate networks in Huson and Bryant (2006)]; Henceforth, we refer by phylogenetic networks to the latter type. See Linder et al. (2004) for a detailed survey of the various phylogenetic network models and methodologies.
One of the most accurate and commonly used criteria for reconstructing phylogenetic trees is maximum likelihood (ML) (Felsenstein, 1981). Roughly speaking, this criterion considers a phylogenetic tree from a probabilistic perspective as a generative model, and seeks the model (i.e. tree) that maximizes the likelihood of observing a given set of sequences at the leaves of the tree.
Likelihood in the general network setting has been investigated in the past by various works. However, no HGT-specific-likelihood framework has ever been suggested. von Haeseler and Churchill (1993) provided a framework for evaluating likelihoods on networks and subsequently (Strimmer and Moulton, 2000) provided an approach to assess this likelihood. These works consider the network as an arbitrary set of splits and therefore fall into the first category. They are characterized by the combined analysis approach, which entails combining all gene datasets first (by sequence concatenation), and then analyze the combined dataset. A serious drawback of this approach is that when individual genes are governed by different evolutionary mechanisms and models (a scenario that is very common in reticulate evolution), combining multiple data sets is problematic (Nakhleh et al., 2003). Likelihood on networks has also been considered in the setting of recombination networks (e.g. Husmeier and McGuire, 2002). These methods, similar to ours, are tailored to identify breakpoints along the given sequences; however, their underlying model is different from ours as they model a different process.
In this work, we extend the ML criterion to handle specifically HGT-oriented phylogenetic networks, and propose a set of criteria and efficient heuristics for computing them. Our extension is based on the fundamental observation that, barring recombination, the evolutionary history of a gene is modeled by a tree, such that a phylogenetic network can be modeled by its constituent trees (Nakhleh et al., 2005). We propose a set of ML criteria for phylogenetic networks; these criteria differ in how the tree information is used, which variant of the ML criterion is used and finally what input is provided. Further, we investigate the computational complexity of some of these criteria and devise a set of efficient heuristics for reconstructing and evaluating phylogenetic networks based on them. In particular, we prove that scoring the likelihood of a phylogenetic network is NP-hard in general, and provide an empirically efficient exact algorithm for the problem, relying on the notion of bi-connected components. Further, we devise an efficient branch-and-bound heuristic and EM algorithm for the problem of adding a number of HGT edges to a tree to obtain an optimal phylogenetic network.
We have implemented our criteria and heuristics and studied their performance on biological as well as synthetic data. For the biological data, we analyzed two datasets. The first dataset includes the Rubisco gene in eubacteria and plastids, which was previously analyzed by Delwiche and Palmer (1996), who postulated a set of HGT events for it. The second dataset includes ribosomal protein rpl12e of a group of 14 Archaeal organisms, which was suspected to include HGT events (Tailliez et al., 2002).
For the synthetic data, we simulated multiple datasets with various HGT events and applied our techniques to the data. In both cases, our criteria and heuristics performed very well with respect to the identification of the correct number of HGT events as well as their placements on the organismal trees.
2 MAXIMUM LIKELIHOOD OF PHYLOGENETIC NETWORKS
2.1 Preliminaries and definitions
Let T = (V, E) be a tree, where V and E are the tree nodes and tree edges, respectively, and let L(T) denote its leaf set and I(T) its internal nodes. Further, let χ be a set of taxa (species). Then, T is a phylogenetic tree over χ if there is a bijection between χ and L(T). Henceforth, we will identify the taxa set with the leaves they are mapped on to, and let denote the set of leaf-labels. A tree T is said to be rooted if the set of edges E is directed and there is a single distinguished internal node r with in-degree 0.
A phylogenetic network N = N(T) = (V′, E′) over the taxa set χ is derived from a rooted tree T = (V, E) by adding a set H of edges to T, without creating cycles, where each edge h ∈ H is added as follows: (i) split an edge e ∈ E by adding new node, ve; (ii) split an edge e′ ∈ E by adding new node, ; and (iii) finally, add a directed reticulation edge from ve to . The resulting network is a rooted directed acyclic graph.
Figure 1a shows a phylogenetic network obtained by adding the edge (X, Y) to the underlying organismal tree.
The tree in Figure 1b models the evolution of all genetic material that is vertically inherited from the ancestral organism (the species tree), whereas the tree in Figure 1c models the evolution of horizontally transferred genetic material. We denote by T(N) the set of all trees contained inside network N. Each such tree is obtained by the following two steps: (i) for each node of in-degree 2, remove one of the incoming edges, and then (ii) for every node x of in-degree and out-degree 1, whose parent is u and child is v, remove node x and its two adjacent edges, and add a new edge from u to v. For example, the set T(N) of the network N in Figure 1a contains only the two trees that are shown in Figure 1b and 1c.
2.2 Likelihood of phylogenetic networks
Under the ML criterion, a phylogenetic tree is viewed as a probabilistic model from which input sequences S are assumed to be sampled. For given input sequences S the i-th site, Si, is the set of values at the i-th position for every sequence in S.2 In this article, we assume that the sites are independently and identically distributed (iid). Since the parameters of the phylogenetic tree, M, are unknown, they are usually estimated from the observed sequences by maximizing the likelihood function P(S∣M) (Felsenstein, 1981). In general, the overall likelihood of the aligned sequences S given the model M is obtained by the product of the likelihood of every site i given M as follows:
When calculating the likelihood of a tree for a given site, two variants are considered: average likelihood (Steel and Penny, 2000) and ancestral likelihood (Pupko et al., 2000) (the former is the more popular of the two). The ML criterion assumes a model of evolution. We consider here the Jukes–Cantor model of sequence evolution (Jukes and Cantor, 1969). However, all the results here can be easily generalized to any other group-based model of sequence evolution. Our concept and major parts of our results can also be generalized to cases when independence across edges is not maintained. Given a set of aligned sequences , a tree T with I(T) internal nodes (), and the edge transition probabilities p, , the average likelihood of obtaining site Si under T is defined as follows:
A natural way of extending this setting to networks is as follows. The topology of a phylogenetic network is defined as above; however, in this case since tree edges have transition probabilities, when adding a reticulation edge between the edges e, e′ ∈ E we should mention where along the edges e and e′ we add the two new vertices. The transition probability of the reticulation edge is always 0, meaning that there are no substitutions along it (the reason being that HGT is instantaneous at the scale of evolution). However, each reticulation edge r = (e, e′) has reticulation probability br associated with it. This probability denotes the probability of a DNA segment being transferred along that edge. Figure 2 describes a simple phylogenetic network.
The likelihood of a network is obtained as a function of the likelihoods of the trees contained in it. Here again we consider two variants. In the first, the likelihood is the sum of the likelihood of all the trees of the network, where for each tree T we also need to multiply the resultant likelihood by P(T). Again, we can choose between the two tree likelihood functions described above [(Equations (2) and (3)]. Thus, we get the following equation:
Likelihood criterion for the trees: Is ancestral likelihood or average likelihood used to assess each tree likelihood in the network?
Tree criterion: At each site, is the best tree likelihood or the sum of the likelihoods over all trees taken?
Input data: Which of the network's parameters (topology and/or probabilities) are given? We consider three possibilities:
The tiny problem: The network topology, transition probabilities and reticulation probabilities are given.
The small problem: The initial tree and reticulation edges (i.e. network topology) are given but not the transition or reticulation probabilities.
The big problem: An initial network (usually a tree) is given and a set of additional reticulation edges is sought.
In the Supplementary Material for this paper (see the Abstract) we give proofs for the NP-hardness of part of the tiny and small versions of the problems. We show that the following problems are NP-hard: tiny best tree ancestral sequences, tiny best tree average sequences, tiny all trees ancestral sequences, tiny all trees average sequences and small best tree ancestral sequences. In this section, we describe algorithms and heuristics for dealing with these NP-hard problems. The simplest algorithm for the tiny problem is to decompose the network into all its constituent trees and analyze each tree separately by either the algorithm of Felsenstein (1981) for the average case or of Pupko et al. (2000) for the ancestral case. The complexity of such a naive approach is for each site, where r is the number of reticulation edges in the network and n is the number of leaves in the tree.
The component-wise naive algorithm for the tiny problem. We now describe a more efficient algorithm that takes into consideration independent components in the network. For a network N and a node v ∈ V (N), Nv denotes the set of nodes that are reachable from v. Let u,v be two nodes in N. We say that u and v are unrelated if u ∉ Nv and v ∉ Nu. In order to compute likelihood in a bottom-up fashion, we need the following property to hold: for every two internal nodes u, v, , , or . Note that this property always holds for trees, but not necessarily for networks. Indeed, this is the underlying principle in Felsenstein's pruning algorithm to compute likelihood on a tree (Felsenstein, 1981).
A biconnected component (bi-component) is a subgraph induced by a maximal set of vertices W, such that in the underlying graph, there are two vertex disjoint paths between any two vertices in W (we assume all the edges are undirected).
A node in a bi-component B is a leaf in B if it has no children in (the directed graph of) B. Otherwise it is internal in B. Also, a node is a root in B if it has no ancestor in B. It is easy to see that every bi-component has at least one leaf and exactly one root. Also, every two bi-components are internal-vertex-disjoint.
Let B be a bi-component with a root r. Let V (B) denote the set of vertices of B. Then for every internal nodethere is an unrelated internal nodes.t. (and by definition). In particular, every leaf in a bi-component has at least two parents.
The above observation implies that a more efficient algorithm than the naive algorithm can be devised to handle bi-components independently.
Let N be a network. Then B(N), the bi-component graph of N is a graph (V(B), E(B)) where V is the set of bi-components in N and two bi-components B1, B2 are connected by a (directed) edge in B(N) if 1. There is an edge in N between v1 ∈ B1 and v2 ∈ B2. 2. There is a vertex, and v is a leaf in B1 (and necessarily internal in B2).
We have the following observation.
Let N be a phylogenetic network. Then B(N) is a tree.
Based on the above observation, a better solution is to decompose the network into bi-components, run the naive exhaustive algorithm inside every bi-component and the tree algorithm in the bi-component. Let r(B) be the number of reticulation edges in a bi-component B, B∗ the largest bi-component in N and r∗ the maximum of r(B) over all bi-components of N. Then, the above improvement reduces the complexity of the algorithm to . This improvement can be quite important as the bi-components can be sparse in a network with few reticulation edges.
Heuristics for the small versions. In this case, we have the network topology but not the probabilities, which we infer from the data. For the average likelihood version, we use hill climbing to compute the optimal parameters for the given topology. For the ancestral likelihood version we propose a new Expectation—Maximization (EM) algorithm. We start with random initial edge lengths. Next we perform the following two steps until convergence: (1) Find optimal internal assignments and a tree topology at each site, given the edge lengths; (2) Find the best edge lengths, given these assignments and tree topologies. Step (1) can be performed using our exact algorithm for the tiny problem. Step (2) can be performed in polynomial time as we now describe. After Step (1) we have labeling for each node in the tree, and a tree for each site. For an edge e, let Se denote the set of sites (out of k sites) where the edge e appears in the best tree. Let h(Se, e) denote the Hamming distance between the two endpoints of the edge for the sites in Se. In Step (2) we set , for tree edges e, and , for reticulation edges e. It can be shown that repeating this procedure leads to a local critical point on the likelihood surface.
Heuristics for the big versions. For accelerating the running time for the big problems, we use a branch and bound (B&B) heuristic. In general, the B&B technique is based on the decreasing monotonicity of the objective function with respect to partial inputs. In other words, the value of the objective function for a certain input is not greater than any valid part of that input. The B&B principle asserts that if the value of the objective function for some partial input is smaller than some known value on the total input, then exploring all inputs that extend this partial input can be avoided. B&B is normally applied to NP-hard problem and can lead to very efficient running times. In our case, the input is a phylogenetic network and a set of characters and the objective function is the likelihood of the network w.r.t. the character set.
Let N = (V,E) be a phylogenetic network. We say that is an edge separated subnetwork of N if , , and there exists a single edge that connects a node in V and a node in V′. The following observations establish an upper bound for the likelihood of a phylogenetic network.
If a network N′ is a valid phylogenetic network and is an edge separated subnetwork of phylogenetic network N, then P(S∣N′) > P(S∣N).
For a set S of sequences of length k, we denote by (n ≥ 0, m ≤ k) the set of these sequences restricted only to the positions n ≤ i ≤ m.
For any phylogenetic network N and set S of sequences, we have.
Based on these observations, we propose a two-step heuristic: (i) optimize the likelihood of sub-networks; if a candidate network contains an edge separated sub-network with low-likelihood score, it is not the network with the ML score; (ii) optimize the likelihood of a network on part of the sites; if a candidate network has low-likelihood score for these sites, it is not the network with the ML score.
4 EMPIRICAL PERFORMANCE
We implemented our ML criteria and algorithms and tested them first on simulated data. Next, using insights from this study we applied our software on two real biological datasets. The problem we sought to solve is the big problem in which only the organismal tree and the input sequences are given and the task is to reconstruct the network topology and sets of edge probabilities. Specifically, we investigated the performance of ML with respect to its ability to infer the correct number and location of HGT events.
Simulation. We used the
The best results we obtained for this type of experiment were under the ancestral likelihood model. In all our experiments (26) the best reconstructed edges (the ones that contributed the most to the network likelihood) were indeed the HGT edges (in the model network). The sixth edge exhibited a lesser contribution and the seventh lesser than that. Moreover, all the sites where the HGT edges were inferred (but not all of them) were from the 1501-th site to the 2000-th site.
Results on biological data. We analyzed two biological datasets. In the first we considered the rubisco gene rbcL of 15 plastids, cyanobacteria and proteobacteria organisms. This is a subset of the dataset considered by Delwiche and Palmer (1996) (owing to ML computational intensity, we could not analyze the whole 48-taxon set) for which multiple HGT events were conjectured by the authors. The dataset consists of two sequences from each of the α-, β- and γ-proteobacteria groups, two from cyanobacteria, one from green plastids, one from red plastids, one cyanophora and four from form II rubisco sequences. For this dataset, we obtained the species (organismal) tree which was reported in Delwiche and Palmer (1996) and Boc and Makarenkov (2003). The species tree is based on 16S rRNA and other evidence. The 532 sites long alignment is available from
Delwiche and Palmer (1996) hypothesized the occurrence of several HGT events of the rubisco genes as opposed to ancient gene duplication and loss. Our findings, based on the ML criterion are as follows: (1) The two most significant HGT edges are H1 and H2 as shown in Figure 3, and they group the form II rubisco (Thiobacillus denitrificans II and Hydrogenovibrio II) of the β- and γ-proteobacteria, respectively, together with the form II rubisco (Rhodobacter capsulatus) of the α-proteobacteria. This result agrees with the grouping of these three organisms, based on the form II rubisco, into one clade, as shown in Figure 2 of Delwiche and Palmer (1996). (2) The third most significant HGT edge is H3 in Figure 3, and it indicates an HGT of the red type form I rubisco from the α-proteobacteria (Rhodobacter sphaeroides I) to the β-proteobacteria (Alcaligenes H16 plasmid). This result agrees with the grouping of all red type form I rubisco genes in one clade in Figure 2 of Delwiche and Palmer (1996). (3) The fourth most significant HGT edge is H4 in Figure 3, which completes the grouping of the form II rubisco genes, along with H1 and H2, by indicating an HGT of the form II rubisco from R.capsulatus to Gonyaulax (an α-proteobacteria and plastid, respectively). This result is also supported by the single clade of all form II rubisco in Figure 2 of Delwiche and Palmer (1996). (4) The fifth most significant HGT edge is H5 in Figure 3, which indicates an HGT of the form I rubisco from the red and brown plastids (Cyanidium) to the α-proteobacteria (R.sphaeroides I). This HGT event is in agreement with the grouping of the red type form I rubisco in Figure 2 of Delwiche and Palmer (1996). (5) The last HGT edge is H6 in Figure 3 which groups the red and brown plastids with the α-, β- and γ-proteobacteria. Such grouping is supported by the red type form I rubisco group in Figure 2 of Delwiche and Palmer, (1996). To summarize, the HGT edges computed by our heuristic agree with the grouping of the organisms based on the forms I and II rubisco, as hypothesized by Delwiche and Palmer (1996).
We also compared our results to the results reported by Boc and Makarenkov (2003), who used a distance method for discovering HGT and analyzed a dataset similar to ours. Three of our edges (H1, H3, H4) appeared in Boc and Makarenkov (2003). Two other edges (H2, H5) appeared also in Boc and Makarenkov (2003), but in the opposite direction. One edge, H6, appeared in our results but did not appear in the results of Boc and Makarenkov (2003). In general, our results are similar to the result reported by Boc and Makarenkov (2003) but simpler. Our results included six HGTs, whereas the solution of Boc and Makarenkov (2003) included eight HGTs.
The second biological dataset includes the ribosomal protein rpl12e over a group of 14 archaeal organisms, which was mentioned in the work of Tailliez et al. (2002). We used the organismal tree described in Tailliez et al. (2002) which was based on both the concatenation of 57 ribosomal proteins (7175 positions) and the concatenation of SSU and LSU rRNA (3933 positions) (the two methods gave identical trees). The sequences were downloaded from NCBI and were aligned by ClustalW. Tailliez et al. (2002) claimed that the phelogenetic tree based on these proteins is different from the organismal tree of these archaea. We used our method for explaining this difference. The results are described in Figures 4 and 5 and suggest that three HGT events can explain the difference between the organismal tree and the phylogenetic tree for the rpl12e proteins.
From the definition of the ML criterion for networks it follows that adding an edge never decreases the likelihood score. Therefore, a significant question is when to stop adding edges (stopping rule). To answer this question, we plotted the improvement in the likelihood score as a function of the number of HGT edges added; the results are shown in Figure 6 for the first dataset and Figure 5 for the second dataset. The first figure shows that while adding the first five edges achieves drastic improvements in the likelihood score, adding the sixth edge results in a much lower improvement, which is indicated by the slow decrease in the likelihood score. Similarly, the second figure shows that the first three edges achieve the major improvements in the likelihood score.
5 CONCLUSIONS AND FUTURE RESEARCH
Phylogenetic networks model evolutionary histories of sets of organisms in the presence of non-tree-like evolutionary events such as HGT and hybrid speciation. In this paper, we introduced a new ML framework for reconstructing and evaluating phylogenetic networks. This framework gave rise to an array of computational problems. We addressed the most basic and fundamental of these problems such as the complexity of reconstruction, hardness of the ‘tiny’ variants, devising efficient heuristics and algorithms, and showing the viability of this criterion. We implemented our methods and tested them on a large set of simulated data. We also analyzed two biological datasets. On the dataset of eubacteria and plastids (Delwiche and Palmer, 1996) we confirmed previous conjecturs made by Delwiche and Palmer (1996), and on the archaeal dataset of Tailliez et al. (2002) we explained the discrepancy detected by Tailliez et al. (2002) by three HGT edges. The empirical analysis carried out here was aimed to demonstrate the relevance of the ML criterion. However, since even the simplest models we assumed here are NP-hard, it is easy to be convinced that even the relatively simple methods we employed here are fairly involved.
Future research directions include (1) developing more computationally efficient algorithmic techniques to enable analysis of large datasets; (2) modeling of dependence among sites (we intend to investigate HMMs for this purpose); and (3) exploring various distributions of reticulation edge probabilities. It is clear that there is a tradeoff between the first and the other two points. Although using HMMs and more complicated distributions of reticulation probabilities will make the model more accurate, it will also dramatically increase the running time of the method. Thus, we believe that simpler probabilistic models, as we described here, are practically important.
The authors would like to thank Derek Ruths and Satish Rao for helpful discussions and data, and the anonymous reviewers for their helpful comments. Experiments were run on the Rice Terascale Cluster, funded by NSF under grant EIA-0216467, Intel, and HP. L.N. was supported in part by DOE grant DE-FG02-06ER25734 and NSF grant CCF-0622037. S.S. was supported in part by NSF grant CCR-0105533.
Conflict of Interest: none declared.