- Split View
-
Views
-
Cite
Cite
A Oliva, S Pulicani, V Lefort, L Bréhélin, O Gascuel, S Guindon, Accounting for ambiguity in ancestral sequence reconstruction, Bioinformatics, Volume 35, Issue 21, November 2019, Pages 4290–4297, https://doi.org/10.1093/bioinformatics/btz249
- Share Icon Share
Abstract
The reconstruction of ancestral genetic sequences from the analysis of contemporaneous data is a powerful tool to improve our understanding of molecular evolution. Various statistical criteria defined in a phylogenetic framework can be used to infer nucleotide, amino-acid or codon states at internal nodes of the tree, for every position along the sequence. These criteria generally select the state that maximizes (or minimizes) a given criterion. Although it is perfectly sensible from a statistical perspective, that strategy fails to convey useful information about the level of uncertainty associated to the inference.
The present study introduces a new criterion for ancestral sequence reconstruction, the minimum posterior expected error (MPEE), that selects a single state whenever the signal conveyed by the data is strong, and a combination of multiple states otherwise. We also assess the performance of a criterion based on the Brier scoring scheme which, like MPEE, does not rely on any tuning parameters. The precision and accuracy of several other criteria that involve arbitrarily set tuning parameters are also evaluated. Large scale simulations demonstrate the benefits of using the MPEE and Brier-based criteria with a substantial increase in the accuracy of the inference of past sequences compared to the standard approach and realistic compromises on the precision of the solutions returned.
The software package PhyML (https://github.com/stephaneguindon/phyml) provides an implementation of the Maximum A Posteriori (MAP) and MPEE criteria for reconstructing ancestral nucleotide and amino-acid sequences.
1 Introduction
Molecular sequences collected in present day species provide a wealth of information about past evolutionary events. Using relevant probabilistic models of molecular evolution, it is possible to reconstruct the sequences of species ancestral to a sample of taxa [see Merkl and Sterner (2016) for a recent review]. The application of these techniques led to spectacular findings. In particular, the resurrection of ancestral proteins using biochemical processes (Bridgham et al., 2006; Chang and Donoghue, 2000; Thornton, 2004) improved our understanding of the ways evolution takes place at the molecular level.
Phylogenetics provides an adequate framework for the reconstruction of ancestral sequences. Given a phylogenetic tree that depicts the evolutionary history of a sample of taxa along with a set of corresponding homologous sequences, it is possible to estimate the sequences at each internal node of the tree. The parsimony approach (Fitch, 1971) consists in selecting the ancestral sequences that minimize the number of substitutions required to explain the sequences observed at the tips of the tree. It uses information related to the grouping of taxa in the tree while amounts of evolution (i.e. the length of edges in the tree) are ignored. The accuracy of the ancestral sequences estimated with the parsimony approach has been well studied from a theoretical perspective (Fischer and Thatte, 2009; Gascuel and Steel, 2010; Yang et al., 2011) and using simulations (Zhang and Nei, 1997).
Yang et al. (1995) and Koshi and Goldstein (1996) were the first to infer ancestral sequences using the maximum likelihood approach under a phylogenetic model. The inference relies here on a two-step approach. A phylogenetic model (i.e. a tree topology, with edge lengths and substitution model parameters) is first fitted to the data. Finding the nucleotide (or amino-acid or codon) character with the highest (marginal) posterior probability at any internal node in a fixed phylogeny is then relatively straightforward. Pupko et al. (2000) later described an elegant dynamic programming algorithm for the inference of the combination of character states at all internal nodes that maximizes the joint posterior probability.
These methods focus on selecting the best state in the alphabet of nucleotides, amino-acids or codons, i.e. the alphabet the data generating process relies upon. Although it is perfectly sensible from a statistical viewpoint, it does not always reflect potential uncertainty in the inference. Indeed, multiple characters can have high probabilities and selecting only the best one obliterates available information about the variability of ancestral sequence estimates (Blanchet et al., 2017). Ignoring uncertainty in the reconstruction of ancestral sequences is indeed a serious limitation of current estimation techniques that has been known for a long time (Cunningham et al., 1998). Previous attempts to deal with uncertainty in ancestral sequence inference consisted in generating a subset of ancestral sequences randomly sampled from their posterior probabilities (Gaucher et al., 2008) or considering all possible character states (or only a subset of sub-optimal residues) at positions deemed ambiguous (McKeown et al., 2014; Thomson et al., 2005). Eick et al. (2017) compared these two approaches on three families of protein domains and showed that sampling from the posterior distribution of residues produced non-functional proteins in some cases. In practice, the software FastML for estimating ancestral sequences using the maximum likelihood principle (Ashkenazy et al., 2012) can be used to select the k most probable ancestral sequences at each node, where k is fixed by the user.
In the present study, we introduce a new statistical criterion, the minimum posterior expected error (MPEE), and test another one, based on decision theory and the Brier score. Both reveal uncertainty in the inference without compromising on the intelligibility of ancestral character reconstruction. We also introduce and test several other criteria which, unlike MPEE and Brier-based criteria, require setting tuning parameters prior to the data analysis. We focus on the inference of past DNA and protein sequences using simulated data. Our results indicate that accommodating for ambiguity in ancestral sequences using MPEE or Brier amounts to a better use of the available data compared to the traditional approach.
2 Notations
The multiple sequence alignment is noted as d. Its length, i.e. the number of columns in the alignment is l. Each sequence in d is a vector of characters, each character being taken within the alphabet of nucleotides, amino-acids, codons or any other well-defined discrete states in finite number. Let n be the cardinality of . In what follows, corresponds to the sth column of d. Let τ denote the unrooted topology of the phylogeny under scrutiny and e the vector of edge lengths on that tree. We assume that the tree is binary so that there are u – 2 internal vertices, where u is the number of tips. Let x denote the vector of internal nodes and x is one of these nodes. v1(x), v2(x) and v3(x) are the three nodes directly connected to x, i.e. its three direct neighbors. is the random variable giving the ancestral character observed at node x and site s.
3 Inferring ancestral states
In the following two sections, we first introduce the standard criterion, i.e. the maximum a posteriori criterion, defined in the context of an unrooted or a rooted tree. The new, minimum posterior expected error and Brier-based criteria are presented next followed by other, less standard approaches.
3.1 The maximum a posteriori (MAP) criterion
In Equation 3, the marginal posterior probability derives from the conditional likelihood as defined in Equation 2. Since the product in Equation 2 is over the three vertices directly connected to the internal node under scrutiny, the whole set of characters found at the tips of the tree is involved in the calculation of this marginal probability. A distinct approach, which applies to the case where the tree is rooted, is to sum over the two vertices subtending the node under scrutiny (i.e. the two nodes ‘away from the root’, taking the node of interest as reference). This solution amounts to considering that only the data ‘below’ the node of interest convey information about the ancestral state. This reasoning is somehow at odds with the time-reversible Markovian assumption about the substitution process as the position of the root is not identifiable under this class of models. More importantly, this approach unnecessarily ignores part of the data. It is nonetheless implemented in the software RAxML (Stamatakis, 2006). In the following, we will refer to this criterion as MAPr.
3.2 The minimum posterior expected error (MPEE) criterion
The maximum value that can take for any is obtained when and is maximum, i.e. , in which case . Therefore, and the fully ambiguous character set is never strictly optimal (it is only optimal in the trivial case where for all x, in which case and all character sets are also optimal).
We then evaluate , i.e. the ambiguity level that minimizes the posterior expected error at the (i + 1)th point on our grid of αk values. Given and , we then identify , the best scoring character set for each value of i. The inferred character set is finally obtained by selecting the element with the highest multiplicity in the multiset . The computational complexity of this procedure is , thereby making it considerably faster than the ‘full grid’ approach described before.
3.3 Brier-based score
The MPEE criterion is not the only one that can be used to achieve the same goal of inferring ambiguous or non-ambiguous ancestral states. Very recently, Ishikawa and colleagues (2019) described an ancestral sequence reconstruction method based on the Brier scoring rule. They proposed to compare the posterior probabilities of non-ambiguous states to a null (or expected) uniform distribution on states. A squared Euclidean distance characterizes the difference between the observed and each of the n expected distributions. The inferred state, which can be ambiguous or not, is the one that minimizes these distances.
As with the MPEE criterion, one then finds , i.e. the optimal level of ambiguity, and the inferred ancestral state is that corresponding to the subset of non-ambiguous states with the largest posterior probabilities. The time complexity for finding the ancestral state that minimizes Brier-b criterion is , making its calculation very quick in practice and applicable to both nucleotide and amino-acid character states. The corresponding method is called MPPA (Marginal Posterior Probability Approximation) in Ishikawa et al. (2019).
3.4 Other criteria
The last two criteria presented above (MPEE and Brier-b) do not rely on any tuning parameter for inferring ancestral states, ambiguous or not. There are other criteria that achieve the same goal but involve tuning parameter(s) that are generally set in an arbitrary manner before starting the data analysis. We list below some of these criteria which performance were evaluated in the present study:
Threshold criterion (Thresh): a threshold for the posterior probability of any non-ambiguous character set is defined a priori. Any character with a corresponding posterior probability greater or equal to that threshold is considered as valid, i.e. it belongs to the set of possible ancestral states. In our study, these thresholds were set to 1/4 and 1/20 when inferring ancestral nucleotides and amino-acids respectively.
The cumulative probability criterion (CumProb): the (non-ambiguous) character states are first ranked in decreasing order of their posterior probabilities. The list obtained is then traversed from top to bottom. Character states keep on being added to the set of possible ancestral states as long as the corresponding cumulative probability does not exceed a certain threshold fixed a priori. In the present study, the threshold was fixed to 0.9.
The differential criterion (Diff): a ranked list of non-ambiguous character states identical to that used by the CumProb criterion is first built. Characters keep on being added to the set of possible ancestral states as long as the difference between two successive posterior probabilities is less than a given threshold. In our study, the thresholds were set to 1/4 and 1/20 when inferring ancestral nucleotides and amino-acids respectively.
4 Simulations
The performance of the criteria introduced above was assessed using multiple simulated datasets. Each such dataset consists in a phylogenetic tree along which sequences are evolved according to a standard Markov model of evolution. The specifics of these simulations are given below.
4.1 Simulating phylogenies
Random trees were generated with the R package TreeSim (Stadler, 2010). TreeSim creates random trees under a constant rate birth-death process. The function sim.bd.taxa.age was used to generate trees with a fixed number of taxa and fixed time since the most recent common ancestor of the sampled species. The number of taxa in the generated tree was set to 50. The tree height parameter (parameter ‘age’ in sim.bd.taxa.age), H, was randomly drawn for each tree in U[0.1; 1]. 50 trees were generated in total. The birth rate was fixed to 0.1 while the extinction rate was fixed at 0.5. The trees hence generated are ultrametric, i.e. clock-like. Edge lengths are interpreted here as amounts of codon substitutions per (codon) site with an average length equal to 0.08.
4.2 Simulating sequences
We used the software INDELible (Fletcher and Yang, 2009) to simulate codon sequences along the trees. INDELible evolves sequences under a probabilistic model of molecular evolution that accommodates for point substitutions. It also accounts for insertion and deletion events although we did not consider this option in the present study. We used a codon model (the M0 model) that assumes that every codon and every lineage in the phylogeny evolves under the same substitution process whereby the dN/dS and transition/transversion rate ratios are fixed to 0.5 and 2.5 respectively, as suggested in the example file provided with INDELible. Codon frequencies at equilibrium were all equal. Sequences of length 300 codons were generated.
4.3 Computer programs and parameter settings
For each simulated dataset, we inferred ancestral sequences using the GTR nucleotide substitution model (Tavaré, 1986), thereby using a model distinct from the one that generated the sequences. The parameters of the GTR model were estimated using maximum likelihood under the ‘true’, i.e. the simulated phylogeny. A discrete gamma distribution was used here to accommodate for the variability of substitution rates across single sites in the alignment, thereby taking into account (although imperfectly) the rate heterogeneity due to the structure of the genetic code.
We also reconstructed ancestral states from the protein sequences resulting from the simulated codon sequences translated into amino-acids using the standard genetic code. We (wrongly) assumed that amino-acid sequences evolved under the LG model of substitution model (Le and Gascuel, 2008). For both nucleotide and amino-acid ancestral reconstruction, sequences were inferred along the true tree topology so that it is straightforward to match ancestral sequences to the corresponding estimated ones.
PhyML (Guindon et al., 2010) and RAxML (Stamatakis, 2006) were used to reconstruct ancestral sequences under the GTR and LG models. While the tree topologies used for the estimation were set to the true ones and fixed throughout the analysis, edge lengths and substitution model parameters (for the GTR model) were optimized with each software prior to the ancestral sequence reconstruction.
5 Results
Figure 1 presents the performance of the six criteria for inferring ancestral states considered in this study. Although sequences were simulated under a stochastic process describing changes between codons, a model of substitutions between nucleotides was assumed for the ancestral reconstruction (see sections 4.2 and 4.3), thereby ignoring the non-independence between individual columns in each alignment. For each nucleotide site in the alignment and each internal node in the phylogeny, an ancestral state was inferred using one of the six criteria and compared to the ‘true’ (i.e. simulated) nucleotide. Only cases where the maximum posterior probability of any nucleotide were smaller than 0.95 were considered here in order to focus on examples where the inference is not obvious.
For each pair of barplots and each criterion, corresponding to a given level of ambiguity in the inferred states, the ratio between the number of incorrectly reconstructed states (the true state is not in the set of inferred states, barplots on the right of each pair) and the number of correct inferences (the true state is in the set of inferred states, barplots on the left) gives an indication about the accuracy of the criterion. This ratio is also given for each criterion as a percentage in the column separating each pair of barplots (see ‘% err’). Also, comparing the plots obtained for the different ambiguity levels provides information about the precision of the different approaches: precise criteria will mostly infer non-ambiguous states (‘# states: 1’ in the figure) while imprecise ones will return ambiguous characters (‘# states: 2, 3 and 4’). The ‘% tot.’ figure given below each error percentage corresponds to the percentage of inferred states in the corresponding ambiguity level. For instance, 100% of the states inferred using MAP are non-ambiguous while only 68% of those inferred with the Thresh criterion are non-ambiguous.
The percentage of errors obtained with the various criteria when considering only non-ambiguous inferred ancestral nucleotides varies from 24% for MAP to 8% for CumProb. The other criteria (Thresh, Diff, Brier-b and MPEE) all behave roughly the same with 14 to 18% of errors when considering non-ambiguous inferred states and less than 5% for ambiguous states. All criteria vary however in terms of the precision of the inference. Apart from MAP which, by construction, always returns non-ambiguous ancestral states, CumProb suffers from a lack of precision compared to other criteria (only 33% of inferences made with this criterion are non-ambiguous). The Diff criterion behaves well in terms of precision (83% of non-ambiguous), along with accuracy (18% of errors), this last figure being slightly greater than that of Brier-b and MPEE. Thresh, Brier-b and MPEE behave similarly in terms of precision and accuracy although the last two appear to be slightly more precise.
The results obtained for the reconstruction of ancestral amino-acid sequences are largely similar to that derived with nucleotides (Fig. 2). Yet, it is worth noting that inferences performed with the Thresh criterion are seldom non-ambiguous (6% of non-ambiguous), which contrasts with the results obtained with nucleotide sequences (68% of non-ambiguous). Diff is, here again, not behaving as well as Brier-b or MPEE or any other criterion except MAP in terms of accuracy. Also, CumProb lacks precision with only 27% of all inferences being non-ambiguous. It is also interesting to note that Brier-b and MPEE perform very similarly, with a slight advantage for MPEE in terms of the percentage of errors when inferring ambiguous ancestral amino-acids (7 and 5% of errors with Brier-b for doubly- and triply-ambiguous inferences against 4 and 2% for MPEE). Brier-b score also appears to be slightly less precise compared to MPEE (39 versus 31% of ambiguous characters inferred on amino-acid data with Brier-b versus MPEE and 34 versus 31% for nucleotide data).
We tested a range of values for the tuning parameters involved in the Diff, CumProb and Thresh criteria. With nucleotide data, setting the tuning parameters to more stringent values (0.05 for Diff, 0.5 for CumProb and 0.45 for Thresh), i.e. pushing these criteria towards selecting non-ambiguous states, forcing the accuracy and the precision of these three approaches to become virtually identical to that of MAP (23%, 22% and 22% of errors for these three methods respectively, with 96, 96 and 95% of non-ambiguous inferences). We also adjusted the tuning parameters of Diff, CumProb and Thresh so that the precision with these three criteria is close to that achieved by Brier-b and MPEE when considering non-ambiguous nucleotide states. The percentage of errors obtained with these criteria are then all very close to that obtained with Brier-b and MPEE (14% for Diff and CumProb, 15% for Thresh). This observation suggests that, considering non-ambiguous inferences only, it may not be possible to achieve a better accuracy than that obtained with Brier-b and MPEE for the precision obtained with these two techniques.
Finally, the percentage of errors obtained with MAPr, i.e. the version of MAP that applies to rooted phylogenies and only considers characters below the internal node where the ancestral reconstruction is performed, are 46 and 49% with nucleotide and amino-acid data respectively. Compared to MAP (24 and 28% of errors), it is fairly obvious that ignoring part of the data is clearly detrimental and should be avoided.
In the present study, we used Ishikawa et al. (2019) Brier-based score as a tool to infer ancestral characters. Yet, the Brier score was originally proposed as a metric to measure the accuracy of probabilistic predictions. Following Ishikawa et al. (2019), we use this metric here to compare the performance of the various criteria considered in our study (see Equation 13). Results presented in Table 1 show that, for nucleotide sequences, the best methods are MPEE, Brier-b and Thresh, with scores fairly close to best (i.e. minimum) achievable score. The performance obtained with Diff and CumProb are slightly inferior while that of MAP is worse than Diff and CumProb. MAPr is lagging far behind, with performance relatively close to that obtained with random predictions. Results obtained with amino-acid sequences are similar, although in this case, Thresh does not perform as well as what is observed with nucleotide data.
Criterion . | Nucleotides . | Amino-acids . |
---|---|---|
MPEE | 0.37 | 0.45 |
Brier-b | 0.37 | 0.44 |
Diff | 0.40 | 0.54 |
CumProb | 0.41 | 0.47 |
Thresh | 0.37 | 0.54 |
MAP | 0.48 | 0.56 |
MAPr | 0.93 | 0.97 |
Posterior | 0.33 | 0.39 |
Random | 1.50 | 1.90 |
Criterion . | Nucleotides . | Amino-acids . |
---|---|---|
MPEE | 0.37 | 0.45 |
Brier-b | 0.37 | 0.44 |
Diff | 0.40 | 0.54 |
CumProb | 0.41 | 0.47 |
Thresh | 0.37 | 0.54 |
MAP | 0.48 | 0.56 |
MAPr | 0.93 | 0.97 |
Posterior | 0.33 | 0.39 |
Random | 1.50 | 1.90 |
Note: Averages were computed over the simulated datasets for which the highest posterior probability was smaller than 0.95. The second-to-last row gives the mean Brier score obtained when using the posterior probabilities of each nucleotide or amino-acid as predictive distribution, which is the minimum value the expected score can take (provided the posterior is evaluated under the model that generated the data). The last row gives the mean Brier score when predicted ancestral states are chosen uniformly at random, thereby defining an upper bound for the average score.
Criterion . | Nucleotides . | Amino-acids . |
---|---|---|
MPEE | 0.37 | 0.45 |
Brier-b | 0.37 | 0.44 |
Diff | 0.40 | 0.54 |
CumProb | 0.41 | 0.47 |
Thresh | 0.37 | 0.54 |
MAP | 0.48 | 0.56 |
MAPr | 0.93 | 0.97 |
Posterior | 0.33 | 0.39 |
Random | 1.50 | 1.90 |
Criterion . | Nucleotides . | Amino-acids . |
---|---|---|
MPEE | 0.37 | 0.45 |
Brier-b | 0.37 | 0.44 |
Diff | 0.40 | 0.54 |
CumProb | 0.41 | 0.47 |
Thresh | 0.37 | 0.54 |
MAP | 0.48 | 0.56 |
MAPr | 0.93 | 0.97 |
Posterior | 0.33 | 0.39 |
Random | 1.50 | 1.90 |
Note: Averages were computed over the simulated datasets for which the highest posterior probability was smaller than 0.95. The second-to-last row gives the mean Brier score obtained when using the posterior probabilities of each nucleotide or amino-acid as predictive distribution, which is the minimum value the expected score can take (provided the posterior is evaluated under the model that generated the data). The last row gives the mean Brier score when predicted ancestral states are chosen uniformly at random, thereby defining an upper bound for the average score.
6 Discussion
This study focuses on new statistical criteria for the inference of ancestral nucleotide and amino-acid sequences. The main motivation behind our work was to improve the way uncertainty in the ancestral reconstruction is dealt with. In particular, in cases where multiple nucleotides or amino-acids have similar marginal posterior probabilities, selecting only one of them can be problematic. Doing so is prone to an increased amount of errors in the inference (see the accuracy of MAP in our simulations). Also, systematically selecting a single character state in the inference fails to convey important information about uncertainty in the decision process.
We test two classes of criteria here. One class has criteria that rely on tuning parameters that are set prior to the analysis (Thresh, CumProb and Diff). The others do not rely on any arbitrarily set parameter (MAP, Brier-b and MPEE). We assessed the performance of these techniques on sequences simulated under a codon model and analyzed under nucleotide and amino-acid substitution models. Hence, in both cases, the models used for reconstructing ancestral states departed from the actual process that generated the sequences. While all criteria have strengths and weaknesses, depending on the type of data (nucleotides or amino-acids), the main goal of the analysis and the resources available (can one afford to consider multiple plausible ancestral states or only a single one will be taken into account?), Brier-b and MPEE behave well compared to other techniques. In fact, our results suggest that these two criteria are optimal in the sense that knowing the tuning parameter values for Thresh, CumProb or Diff that yield the same precision as that of Brier-b and MPEE, would not lead to higher accuracy. The MPEE criterion slightly outperforms Brier-b on our simulations, although the difference between the two in terms of both precision and accuracy is moderate.
The Brier-b and MPEE criteria for ancestral state reconstruction have a sound statistical basis. These approaches help unveil relevant information about the uncertainty in the inference in a concise format, which is relevant to biologists. Moreover, the computational overload associated to the application of these new criteria is negligible compared to the standard approach (i.e. MAP), thereby offering a practical alternative. We thus recommend that these criteria are considered alongside the traditional approach for inferring the biochemical properties of ancient genetic sequences. In cases where reconstruction is ambiguous, as pointed out by Brier-b and/or MPEE, considering multiple alternative ancestral states could lead to a greater complexity in downstream analyses. Yet, this increase of complexity may also help unveil plausible properties of ancient molecules that would have been ignored otherwise.
7 Software availability
The MPEE and MAP criteria for inferring ancestral nucleotide and amino-acid states are implemented and documented in the PhyML software package (https://github.com/stephaneguindon/phyml). The scripts that were used to perform the simulations and analyze the results are available from https://github.com/stephaneguindon/ancestral.
Acknowledgements
We would like to thank Fabio Pardi for discussions along with Prof. Tal Pupko and two anonymous reviewers for their insightful suggestions that helped improve the original manuscript.
Funding
This research was supported by the Institut Français de Bioinformatique (RENABI-IFB, Investissements d’Avenir, ANR-11-INBS-0013) and the Agence Nationale pour la Recherche through the project GENOSPACE.
Conflict of Interest: none declared.
References