-
PDF
- Split View
-
Views
-
Cite
Cite
Saharon Rosset, Efficient inference on known phylogenetic trees using Poisson regression, Bioinformatics, Volume 23, Issue 2, 15 January 2007, Pages e142–e147, https://doi.org/10.1093/bioinformatics/btl306
Close -
Share
Abstract
Motivation: We suggest the use of Poisson regression for time inference and hypothesis testing on a bifurcating Phylogenetic tree with known topology. This method is computationally simple and naturally accommodates variable substitution rates across different sites, without requiring the estimation of these rates. We identify the assumptions under which this is a maximum-likelihood inference approach and show that in some realistic situations—in particular, when the probability of repeated mutation within each branch of the tree is small—these assumptions hold with high probability.
Results: Our motivating domain is human mitochondrial DNA trees, and we illustrate our method on a problem of estimating the time to most recent common ancestor of all non-African mtDNA, using publicly available data. We test for molecular clock violations using multiple comparisons, and conclude that the global molecular clock hypothesis cannot be rejected based on these data.
Contact:srosset@us.ibm.com
1 INTRODUCTION
In this article we concentrate on the inference phase of phylogenetic analysis. That is, given a phylogenetic tree, how can we estimate parameters (such as time) and test hypotheses about the mutation process (such as molecular clock/lineage independence). The tree construction phase, which is arguably the most difficult part of the process, and often the most scientifically important one, is not addressed here, and we assume that we start from a ‘correct’ phylogenetic tree, which we will assume here is bifurcating (although our methods can also account for multi-furcation).
We propose the use of Poisson regression (PR) to estimate parameters, do inference on these parameters and test hypotheses on this tree. We show that this allows us to accommodate site-dependent mutation rates and different mutational clock assumptions. We show that our Poisson regression method is a maximum-likelihood phylogenetic inference approach if we assume:
We observe all mutations, i.e. there are no multiple mutations in the same site in any branch of the Phylogenetic tree.
The overall mutation rate (i.e. the sum of mutation probabilities in all sites) is fixed, regardless of the current state, which holds under some commonly used nucleotide substitution models, the most general of which is the Kimura (1981) 3-parameter model (K3ST).
We compare PR with two prevalent approaches: maximum-likelihood inference and the commonly used simple ρ method (Forster et al., 1996). PR is more efficient than the former (and equivalent to it statistically under the assumptions mentioned above) and statistically more sound than the latter.
Finally, we analyze the human mtDNA tree using PR. We argue that the probability of repeated mutation within branches of the tree is negligible; therefore, the use of PR is well justified. We use PR to estimate the time to most recent common ancestor of all non-African mtDNA, and also employ a multiple comparison approach to test for molecular clock violations. Our TMRCA confidence interval is 67 000 ± 13 400—older and significantly tighter than the commonly quoted estimates of Ingman et al. (2000). Our testing for molecular clock violations concludes that the global clock hypothesis cannot be rejected based on these data.
2 THE POISSON REGRESSION METHOD
Consider a known rooted phylogenetic tree for non-recombining DNA, such as the simple tree in Figure 1. Assume the terminal nodes (leaves) are extant samples and the internal nodes of the tree represent their common ancestors. Further assume that we have mutation counts on each branch of the tree, e.g. from using the Fitch algorithm (Fitch, 1971). We associate parameters with non-leaf nodes (denoted β1, β2, β3, …). These represent distances from each such node to the terminal nodes (present day) under a molecular clock assumption, measured in terms of expected mutation counts. Thus, we expect to encounter ‘about’ β1 mutations when traversing any path from the ancestor at the root to an extant samples at the leaves. We assume the observed mutation counts on each branch of the tree (denoted by xij) have a Poisson(λij) distribution with a parameter, i.e. the difference between the parameter at the beginning and the end of the branch, or the difference between the expected mutation distance from the starting node and ending node of this branch. For branches leading directly to leaves, the Poisson parameter is simply the β parameter of the starting node for this branch (since the leaves represent the present). In Figure 1 we illustrate this setting on two branches—the one between nodes 1 and 3 and the one from internal node 4 to leaf 8.
Simple phylogenetic tree, with the Poisson assumptions made explicit on two branches.
Simple phylogenetic tree, with the Poisson assumptions made explicit on two branches.
Since the Poisson parameters of the observed counts on the branches are linear functions of the β parameters, finding the maximum-likelihood estimates of the β parameters is simply a Poisson regression with an identity link function. This link function (identity) is different from the one usually used for Poisson regression, which is the log link (the natural link for the Poisson family). However our problem here is still a well-defined Generalized Linear Model (GLM), which can be solved using any good GLM package, such as the glm function in R. See McCullagh and Nelder (1989) for discussion of the theory of GLMs and Venables and Ripley (1994) for discussion of the glm function in S+ and R.
The advantages of using a GLM approach for inference are both computational and statistical:
GLM methodology offers highly efficient solutions to even large phylogenetic problems (see McCullagh and Nelder, 1989). Problems with up to several hundreds of variables (internal nodes in the tree) can easily be solved in <1 min of CPU time on a standard desktop computer. This efficiency allows us to solve big problems, and also to experiment with many different solutions to smaller ones, like we do in Section 5, where we enumerate many possible ways of relaxing the molecular clock assumption and compare their outcomes.
The theory of GLMs offers easily attainable uncertainty estimates on the estimated parameters, in particular confidence intervals. These are based on asymptotic Gaussian approximations of the likelihood around the ML estimates and are not guaranteed to be exactly correct for our finite samples. However they are typically reasonable for Poisson quantities, especially for relatively large values of Poisson parameters. An alternative approach to inference would be to use the parametric bootstrap (Efron and Tibshirani, 1994; Felsenstein, 2003), which would also be facilitated by the computational efficiency of the Poisson regression approach.
If we draw a random sample of mutation counts on all branches and fit the Poisson regression, we get the results in Table 1.
Sample output of PR
| Node . | True value . | Estimate . | Std Error . |
|---|---|---|---|
| β1 | 20 | 23.2003 | 2.8596 |
| β2 | 9 | 7.2598 | 1.4690 |
| β3 | 13 | 13.7575 | 2.1200 |
| β4 | 1 | 1.6489 | 0.8685 |
| β5 | 7 | 7.2598 | 1.4690 |
| β6 | 5 | 8.0170 | 1.7664 |
| β7 | 3 | 3.6563 | 1.2859 |
| Node . | True value . | Estimate . | Std Error . |
|---|---|---|---|
| β1 | 20 | 23.2003 | 2.8596 |
| β2 | 9 | 7.2598 | 1.4690 |
| β3 | 13 | 13.7575 | 2.1200 |
| β4 | 1 | 1.6489 | 0.8685 |
| β5 | 7 | 7.2598 | 1.4690 |
| β6 | 5 | 8.0170 | 1.7664 |
| β7 | 3 | 3.6563 | 1.2859 |
Sample output of PR
| Node . | True value . | Estimate . | Std Error . |
|---|---|---|---|
| β1 | 20 | 23.2003 | 2.8596 |
| β2 | 9 | 7.2598 | 1.4690 |
| β3 | 13 | 13.7575 | 2.1200 |
| β4 | 1 | 1.6489 | 0.8685 |
| β5 | 7 | 7.2598 | 1.4690 |
| β6 | 5 | 8.0170 | 1.7664 |
| β7 | 3 | 3.6563 | 1.2859 |
| Node . | True value . | Estimate . | Std Error . |
|---|---|---|---|
| β1 | 20 | 23.2003 | 2.8596 |
| β2 | 9 | 7.2598 | 1.4690 |
| β3 | 13 | 13.7575 | 2.1200 |
| β4 | 1 | 1.6489 | 0.8685 |
| β5 | 7 | 7.2598 | 1.4690 |
| β6 | 5 | 8.0170 | 1.7664 |
| β7 | 3 | 3.6563 | 1.2859 |
Some of these estimates (e.g. β3, β5) are very close to their true values and some are a little off. Most importantly, the standard deviation estimates in the third column seem well calibrated. In particular, all parameter estimates are within the 1.7 estimated standard deviations of their true value.
3 PR AND MAXIMUM LIKELIHOOD
Maximum-likelihood (ML) methods (Felsenstein, 2003; Yang, 1993) assume a mutation (or substitution) model and, subject to that model, attempt to find the tree and tree parameters (specifically, branch lengths) which best explain the observed data, where the quality of explanation is measured by the likelihood of the data, given the parameters. In the context of learning tree structure the maximum-likelihood approach has been applied extensively. It has also been applied to inference, in particular hypothesis testing, when the phylogenetic tree is assumed known. See Felsenstein (2003, Chapters 16 and 19) for a review of tree generation and inference using maximum-likelihood approaches. For our purpose we point out two important facts about this area:
The essence of the ML approach is in calculating the maximum-likelihood parameters for a single tree, with a tree-search wrapped around that. Thus, efficient calculations for ‘known’ trees are critical for tree generation as well.
Allowing for variable mutation rates by site significantly complicates the ML analysis. In fact Yang [(1993, 1994, 1997), Author Webpage], who originally showed how to adapt ML to variable rates with a Gamma prior, advocates practical use of a variable rate approach for inference only and not for tree construction, owing to the increased computational burden.
Our Poisson regression is a maximum-likelihood approach, under the Poisson model we assume. The interesting question is, can we consider it to be an exact (or at least approximate) maximum-likelihood approach for phylogenetic inference and under what assumptions?
The answer, as we show below, is that the Poisson regression estimate is indeed a maximum-likelihood estimate of tree branch lengths if we assume the following:
(1) An appropriately simple substitution model, in particular one where the mutation probability is independent of the current nucleotide (and consequently all 4 nt are equally likely to appear). This is true of three of the most commonly used substitution models:
The Jukes–Cantor model (JC69; Jukes and Cantor, 1969), which assumes that all substitutions are equally likely.
The Kimura 2-parameter model (K80; Kimura,1980), which allows for different probability of transitions and transversions.
(2) That we observe all mutations. This is typically not a realistic assumption, since on a phylogenetic tree we can only ‘observe’ (or infer) the haplotypes on the nodes of the tree, representing inferred ancestors just before branching events. Assuming the tree topology and inferred ancestors are correct [strong assumption, but one usually required to facilitate inference, e.g. Langley and Fitch (1974)], the situation in which we observe all mutations, is if no repeated or back mutations have occurred along any single branch of the tree. In that case, we know that all sites that differ between a starting and ending node of a branch (ancestor and descendent) have mutated once along this branch. It is important to emphasize that we do not need to assume there are no repeated mutations anywhere in the tree. The same nucleotide position can mutate many times, as long as no two mutations happen on the same branch of the tree. This is a non-testable assumption, and indeed in some situations it is also not reasonable. We show in Section 5.2 that it is reasonable for our motivating domain, of human mtDNA trees, when we consider only coding region mutations, without the D-loop.
The assumption of observing all mutations has a significant simplifying effect on the likelihood function. It means that a maximum-likelihood analysis does not have to account for all possible combinations of unobserved mutations that could lead to the observed and inferred haplotypes on the tree. Hence it may not be surprising that these two assumptions lead us to the relatively simple PR formulation as a maximum-likelihood approach. We now prove this equivalence rigorously.
First, it is easy to show (e.g. Huelsenbeck and Crandall, 1997) that a site with mutation rates as in (3) has a total number of mutations in time t distributed as Pois(t · λ). Given this, assume now a genetic region of length n has an appropriate nucleotide substitution model (i.e. K3ST or simpler) and each nucleotide position has a different overall mutation rate. Denote these rates by λ = λ1, … , λn. This implies that transitions in site j occur at rate a·λj and the two types of transversions occur at rates b1·λj and b2·λj, with a + b1 + b2 + 1. Assume WLOG (this just fixes a scale for the time length estimates). For a fixed phylogenetic tree with m branches, denote the unknown branch lengths by t = t1, … , tm and the mutation counts for each (branch, site) combination by . Let be the mutation counts per branch. Then
is a sufficient statistic for the maximum-likelihood estimation of branch lengths t under the model K3ST. In other words, the maximum-likelihood estimate is independent of λ1, … , λn, a, b1, b2 and the specific identities of the mutations on the edges when we condition on
The same is not true for phylogeny estimation. In particular, the specific mutation rates λ are meaningful for maximum-likelihood generation of trees, since it also implicitly involves estimating the C matrix (mutation counts, dependent on phylogeny), which cannot be decomposed from λ.
Recall, that the tls in this representation are linear functions of the βjs in our original representation (1). Keeping the scaling for which , we can write if l is the branch from node i to node j, and the part of (4) is exactly our original maximum-likelihood problem in (1). Thus we have proved our central result, which makes explicit the connection between PR and ML:
If the substitution model is K3ST or simpler and all mutations on the branches are observed, then the PR solution is the maximum-likelihood phylogenetic inference solution, even if the mutation rates vary by locus.
Theorems 1 and 2 have some interesting implications on maximum-likelihood methods in general. For example, we can conclude from Theorem 1 that if the probability of repeated or back mutation on any branch is very low, and hence the counts cij are ‘almost’ fully observed, then the result of maximum-likelihood inference with and without the variable rate assumption should be almost identical. We claim in Section 5.2 that human mtDNA trees are one such domain, where branch lengths are short enough relative to individual mutation rates (when considering coding region only). Inspecting the literature, we can indeed see the expected effect: cases where the analysis obtains the same results whether a constant or varying mutation rate is assumed. For example, Howell et al. (2004) use the software package PAML (Yang, 1997) to analyze African mtDNA data. Their conclusions regarding molecular clock validity are of independent interest, but for our purpose here we are interested in their comparison of the results of PAML with and without the variable rate analysis (under a Gamma prior), which are practically identical (e.g. rows 5 and 6, and rows 11 and 12 in their Table 3).
4 TMRCA ESTIMATION AND COMPARISON TO ρ
An important use of the PR method is for estimating the age of various ancestral nodes in the phylogenetic tree under a mutational clock assumption. The Poisson parameters estimated at the nodes are estimates of the TMRCA of the lineage starting at this node, in units of mutations. Thus, we are simultaneously estimating the age of all ancestral nodes, taking into account the whole tree.
It is interesting to compare our method with the simple, widely used ρ method (Forster et al., 1996). This method only estimates the age of the root node (i.e. TMRCA) of a whole tree (or sub-tree) by averaging the mutational length of all paths from that node to the leaves of the tree.
ρ is clearly an unbiased estimate under a mutational clock assumption and is certainly a reasonable one. The PR estimate is not guaranteed to be unbiased; however, it is unbiased to first order (McCullagh and Nelder, 1989), and therefore almost unbiased in practice, as it is indeed in the example below. Under the Poisson assumption, there are several pieces of evidence that the variance of the PR estimate is likely smaller than that of ρ:
As a maximum-likelihood estimate, the asymptotic variance of the PR estimate of TMRCA is equal to the Fisher bound, i.e. no estimator of TMRCA can have lower variance.
By the Rao–Blackwell theorem, PR is the minimum variance estimate of its own expectation (which we know to be unbiased to first order).
For a concise review of the statistical principles underlying these statements, see Casella and Berger (2001, chapter 7).
Where we expect PR to have a bigger advantage over ρ in estimating the age of internal nodes in our tree, i.e. TMRCA of sub-trees, since ρ does not take into account the whole tree, just the relevant sub-tree.
We now compare the performance of PR and ρ for estimating the parameters of the top three nodes in the example of Section 2. We know ρ is unbiased, and since it is a simple linear combination of the counts on the branches, we can calculate its variance analytically. To estimate bias and variance of the PR estimates, we simulate 100 000 random draws of data from the model (2) and empirically estimate the mean and variance based on those. Our results are as follows:
Top node (λ = 20). From ρ: Bias = 0, variance = 6.75, which is also the expected meansquared error (MSE). From PR: Bias ∈[−0.014,0.012] (90% confidence interval), variance 6.69, expected MSE is below 6.7. MSE reduction from using PR: 1%.
Left node (λ = 9) From ρ: Bias = 0, variance = 3.5, which is also the expected MSE. From PR: Bias ∈[−0.005,0.013], variance 2.96, expected MSE is below 3. MSE reduction from using PR: 14%.
Right node (λ = 13) From ρ: Bias = 0, variance = 5.5, which is also the expected MSE. From PR: Bias ∈[0,0.02], variance 4.45, expected MSE is below 4.5. MSE reduction from using PR: 18%.
5 EXAMPLE: TIMING THE EXODUS OF MTDNA FROM AFRICA
We now use our approach to investigate the question of the date of the ‘out of Africa’ event for the mitochondrial DNA. For that, we should identify a date for the MRCA of clades M,N,L3 in the mtDNA tree (for a discussion on this tree, see Jobling et al., 2004, chapter 8). Since all non-African DNA are derived from this ancestor, and no more recent ancestor for non-African mtDNA can be defined, the date of the M,N MRCA is a likely earliest bound for the out-of-Africa event for mtDNA. A widely quoted confidence interval for this ancestor is the one given by Ingman et al. (2000) of 52 000 ± 27 500. As we will see, our data and method lead to an older expected age and a much tighter confidence interval.
For this purpose, we have constructed a tree of mtDNA samples using publicly available full sequences from the NCBI data repository (Author Webpage) and the known structure of the mtDNA haplogroup tree (we completely avoided parts of the tree that are disputed in the literature, such as the sub-clades of haplogroups H, T and L0–L1). The resulting tree is shown in Figure 2. We used two randomly drawn samples each from Haplgroups H, V, J, T, I, W, C, Z, D (total 18 samples), for which we fully counted on the known mtDNA phylogeny, as well as seven samples from haplogroup L2, for which we reconstructed their relationship using neighbor joining, leading to a verification of the accepted L2 nomenclature for sub-clades L2a, L2b and L2c (see haplogroup annotation in Fig. 2). We reconstructed the coding region ancestral types using the Fitch algorithm (Fitch, 1971), and the resulting mutation counts for the coding region on each branch are shown in blue in Figure 2. The terminal nodes represent two samples each, with the total number of mutations on both branches on them (e.g. the rightmost node for Z implies the two Z samples differ by 12 SNPs; hence, this is the inferred total number of mutations on both branches). Note that for the PR likelihood (1) we only need this total number for both branches, which share the same Poisson parameter.
mtDNA tree with mutation counts and age estimates (in mutation units).
mtDNA tree with mutation counts and age estimates (in mutation units).
We ran our Poisson regression on this tree, to obtain time estimates and confidence intervals on all the internal nodes. Figure 2 shows the resulting parameter estimates and, for some nodes, a confidence interval of ±2 SD. The resulting estimate for the MN super-clade ancestor is 17.5 ± 3.5 mutations. Using the same conversion factor as in Ingman et al. (2000) of 1.7 × 10−8 substitutions per site per year, this translates to a confidence interval of 67 000 ± 13 400 years for the MN TMRCA. Comparing that to the result of Ingman et al. (2000) quoted above, we can see that we obtain an older age and a tighter confidence interval. The older age is likely not owing to the method used but to the difference in data—the difference in estimated age is clearly within the uncertainty bounds defined by the confidence intervals. We should also comment that our assumption of no repeated mutation should result in PR under-estimating the TMRCA rather than over-estimating, if it is violated.
The tighter confidence interval is probably owing to our use of a better inference approach. Ingman et al. (2000) do not explicitly mention what method they used. We do know, however, that they used a somewhat larger and no less diverse set of samples than our analysis (53 samples, including 5 from L1 and 7 from L2, and representatives from all 9 MN haplogroups above).
5.1 Testing the molecular clock
We would like to test the validity of the molecular clock assumption, i.e. the underlying assumption that the mutation distance from every ancestral node to its leaves is fixed. The typical approach to this is to completely relax the molecular clock assumption and, instead of putting a parameter at each node, put one on each branch in the tree, increasing the number of parameters roughly by 2-fold. A likelihood ratio test (LRT) is then performed between the two resulting nested models [see Felsenstein (2003) for description of this methodology]. For example, Howell et al. (2004) successfully use this approach to reject the molecular clock hypothesis within the L2 haplogroup, based on a detailed tree they construct. In the case of PR, with no clock assumption it simply estimates the parameter for each branch as the observed mutation count. On our tree of Figure 2, the molecular clock hypothesis cannot be rejected based on the resulting LRT, giving a P-value of ∼0.7.
A compromise between the ‘global clock’ and ‘no clock’ assumptions is that there are several different clocks which affect different parts of the tree. This is the approach that Yang (1997) calls ‘local clock’. In particular, we may hypothesize two different clocks, where a sub-tree is subject to a different clock than the rest of the tree. In the tree of Figure 2 there are a total of 24 internal branches defining ‘subtrees’ below them. Each time we assume a rate shift for one such sub-tree, we add just one degree of freedom to our model fit, and hence the resulting LRT uses one degree of freedom. Out of the 24 P-values we get here, the lowest are 0.017 for the branch marked with double asterisk in Figure 2, and 0.05 for the branch marked with single asterisk. Given that we have carried out 24 tests, these P-values cannot be considered significant1. Thus we can safely conclude that our tests reveal no evidence against a global clock hypothesis.
5.2 mtDNA and repeated mutation on branches
As we discussed in Section 3, the assumption of no repeated or back mutation on any branch of the phylogenetic tree is crucial for the interpretation of our PR method as a maximum-likelihood phylogenetic inference approach. We may be interested, therefore, whether this is a realistic assumption for the domain we consider here, of human mtDNA trees, considering coding regions only (and omitting the fast mutating control regions).
To calculate the probability of repeated mutation we need to know the distribution of mutation rates in the relevant region. Tamura and Nei (1993) and others have argued that a Gamma prior with shape parameter α ≈0.1 is appropriate for the distribution of mutation rates in the mtDNA. Furthermore, it is easy to show [as Tamura and Nei (1993) mention] that this Gamma prior with a Poisson likelihood leads to a negative binomial distribution of the number of mutations per site, with parameters: NB(0.1,1−μ/0.1), where μ is the expected number of mutations per site.
Now, consider a representative branch whose true length is 20 000 years. Considering the well-established rough time scale for our tree in Figure 2, it is very likely that the vast majority of its actual branches would be shorter than this length. Using the number proposed by Ingman et al. (2000) of 1.7 × 10−8 substitution per site per year in the mtDNA coding region, which implies that the number of substitutions in a randomly selected coding site in 20 000 years has a NB(0.1,0.9966) distribution, which implies it has probability of ∼6 × 10−7 of being bigger than 1. Considering now that the coding region has about 15 500 total sites in it, we get that the overall probability of any site having more than one substitution on our hypothesize branch of 20 000 years is about 15 500.6 × 10−7 = 0.009
So, a branch of length 20 000 years has probability of slightly <1% of having any sites with repeated mutation. Since our tree has ∼25 branches, we can reasonably assume that it is unlikely to have repeated or back mutations on any branch. Even if it does, we would have to be exceedingly unlucky for it to have more than one or two such events.
An important question is, what would be the effect of this slight under-counting on the Poisson parameter estimates? It is easy to show via simulation that the effect of under-counting by one mutation is always less than one ‘mutation unit’ on the estimates on the tree, and usually much less.
So when we consider this additional uncertainty relative to the confidence intervals in Figure 2, which are typically bigger than one, we can see that taking this additional uncertainty into account is not expected to have a significant effect on our results.
6 CONCLUSIONS
We have proposed the use of Poisson regression with identity link-function as an approach for efficient and reliable inference on known phylogenetic trees. We have shown that the PR method can be considered an approximate maximum-likelihood approach under some non-guaranteed conditions, that are nonetheless reasonable in our motivating domain of the human mtDNA tree. The main advantages of the PR method over the exact maximum-likelihood calculation are as follows:
It implicitly allows variable mutation rates across sites, in the sense that the PR estimates are not affected by the individual rates, but only by the sum of rates across all sites.
Since PR is a Generalized Linear Modeling approach, the extensive inference machinery developed for GLM is at our disposal.
In principle, PR should be significantly more efficient and well behaved computationally than general-purpose maximum-likelihood fitting.
We have demonstrated the PR method on the problem of estimating the out-of-Africa TMRCA of the human mtDNA, and obtained confidence intervals that are significantly tighter than the commonly quoted result of Ingman et al. (2000), obtained with a comparable (in fact, slightly bigger) number of samples.
The mtDNA example we present here is relatively small, as it served to demonstrate the ability of PR to accomplish a tighter confidence interval than Ingman et al. (2000) with comparable data. We plan to apply PR to a much larger dataset of mtDNA, in an attempt to obtain tighter confidence intervals.
This work was inspired by the Genographic project (Author Webpage) and in particular by discussions with Doron M. Behar of Family Tree DNA. The author is also grateful to Gabriela Alexe, Claudia Perlich and Ajay Royyuru of IBM and Trevor Hastie of Stanford University for their help in data preparation and comments.
REFERENCES
1Since these tests are not independent the simple Bonferroni or FDR-type corrections are not directly applicable, but a minimal p-value of 0.017 is by all measures not impressive with 24 tests.



