The effect of non-reversibility on inferring rooted phylogenies

Most phylogenetic models assume that the evolutionary process is stationary and reversible. As a result, the root of the tree cannot be inferred as part of the analysis because the likelihood of the data does not depend on the position of the root. Yet defining the root of a phylogenetic tree is a key component of phylogenetic inference because it provides a point of reference for polarising ancestor/descendant relationships and therefore interpreting the tree. In this paper we investigate the effect of relaxing the reversibility assumption and allowing the position of the root to be another unknown quantity in the model. We propose two hierarchical models which are centred on a reversible model but perturbed to allow non-reversibility. The models differ in the degree of structure imposed on the perturbations. The analysis is performed in the Bayesian framework using Markov chain Monte Carlo methods. We illustrate the performance of the two non-reversible models in analyses of simulated datasets using two types of topological priors. We then apply the models to a real biological dataset, the radiation of polyploid yeasts, for which there is a robust biological opinion about the root position. Finally we apply the models to a second biological dataset for which the rooted tree is controversial: the ribosomal tree of life. We compare the two non-reversible models and conclude that both are useful in inferring the position of the root from real biological datasets.

The root of a phylogenetic tree is fundamental to its biological interpretation, providing a critical reference point for polarising ancestor-descendant relationships and for determining the order in which key traits evolved along the tree (Embley and Martin 2006). The most common way to root an evolutionary tree is to use an outgroup to the clade of interest, or ingroup; the root is then placed on the branch connecting the outgroup to the ingroup (Penny 1976;Huelsenbeck et al. 2002). However, this approach can be problematic if the outgroup is only distantly related to the ingroup because the long branch leading to the outgroup can induce phylogenetic artifacts such as long branch attraction (LBA), potentially interfering with the inference of ingroup relationships and the root position (Felsenstein 1978;Holland et al. 2003;Bergsten 2005). It has been proposed that the three-domains tree of life, in which the eukaryotes represent the sister group to a monophyletic Archaea, could have resulted from LBA (Tourasse and Gouy 1999;Williams et al. 2013). Another drawback of outgroup rooting has been observed when the ingroup and outgroup taxa differ substantially in nucleotide or amino acid composition: the position of the root of the ingroup becomes unstable, depending on the model (Tarrío et al. 2000;Foster 2004). Outgroup rooting is also difficult to apply to the question of rooting the universal tree, for which no obvious outgroup is available. One solution to this problem has been to use pairs of paralogous genes which diverged from each other before the last common ancestor of all cellular life, so that one paralogue can be used to root a tree of the other (Iwabe et al. 1989;Brown and Doolittle 1995;Hashimoto and Hasegawa 1996;Baldauf 1996). However, for any given gene it is difficult to unambiguously establish that duplication took place before the divergence of the domains of life. The number of genes to which this technique can be applied is also limited.
An alternative, but perhaps under-explored, approach to rooting trees is to use a non-reversible and/or non-stationary substitution model in which changing the root position changes the likelihood of the tree. Unlike reversible substitution models, in which the probability of evolutionary change in one direction is the same as in the other (Felsenstein 1981), non-reversible models allow us to infer the direction of evolution and therefore the ancestor/descendant relationships among species on the tree. Non-reversible models have been found to be useful for correctly inferring the root position for simulated data containing a high degree of non-reversibility (Huelsenbeck et al. 2002). The first model to relax stationarity and reversibility assumptions was a general model of DNA evolution (Barry and Hartigan 1987). Unfortunately, this model required a large number of parameters, since it assigned different rate matrices to each edge of the phylogenetic tree (Jayaswal et al. 2005). One of the earliest attempts to reduce the number of parameters was made by assigning the same exchangeability parameters, but different composition vectors, to each branch of the tree (Yang and Roberts 1995). Heaps et al. (2014) fitted a similar model in a Bayesian framework, but adopted a prior over composition vectors which allowed information to be shared between branches. A further reduction of the number of model parameters was achieved by assigning composition vectors to groups of edges rather than to a single edge (Yang and Roberts 1995;Foster 2004). Blanquart and Lartillot (2006) used a variation of this idea by assuming the compositional shifts occurred according to a Poisson process, independently of speciation events. However, these models still involved substantially more parameters than their standard reversible counterparts. In the context of nucleotide evolution, Galtier and Gouy (1998) reduced the number of parameters in the model of Yang and Roberts (1995) by using a single G+C component, rather than three free parameters for the composition vector. But this inevitably came at the cost of a loss of information from the alignment. Jayaswal et al. (2011) attempted to simplify the general model of DNA evolution using a heuristic to reduce the number of rate matrices using the distances between them as a similarity criteria, and forcing the most similar rate matrices to be identical. Starting from a model with a distinct rate matrix for each edge, this method reduced the number of rate matrices until the optimal number was achieved, using the Akaike information criterion (AIC) (Akaike 1974) or Bayesian information criterion (BIC) (Schwarz 1978) as the optimality criteria. However, given the heuristic nature of the model search, the algorithm offered no assurance of identifying a global optimum.
In this paper, we develop two non-reversible models which each requires only one rate matrix, thus ensuring a parsimonious model. Non-reversibility is achieved by extending a standard reversible model to allow a non-reversible perturbation of the instantaneous rates of substitution between the nucleotides. The models differ in the structure of the perturbation. We take a Bayesian approach to inference and demonstrate the sensitivity of the analysis to different topological priors. We then test our models on simulated data and on a real biological dataset for which there is a robust biological opinion about the position of the root. Finally, we apply our models to an open question in biology: the root of the tree of life.

Model description
We consider a number of aligned homologous sequences and aim to infer the evolutionary relationships among these sequences. These relationships can be described in the form of a bifurcating tree, where each edge represents the period of time over which point mutations accumulate, and each bifurcation represents a speciation event. The nucleotides at each site of a sequence alignment on n taxa can be thought of as independent realisations of a random variable X = (x 1 , ..., x n ) T on a discrete space where x i ∈ Ω and Ω = {A, G, C, T}, for i = 1, . . . , n. The evolutionary process operating along each edge of the tree is described by a continuous time Markov process, where the future value of a nucleotide at any given site depends on its current value only and does not depend on its past values given this current value, that is where t > t n > t n−1 > ... > t 2 > t 1 . The process can therefore be specified by a transition matrix P ( ) = {p ij ( )} whose elements p ij ( ) represent the probabilities of changing from one nucleotide to another over a branch of length . Equivalently we can represent the process through an instantaneous rate matrix Q, where P ( ) = exp(Q ). The off-diagonal elements of Q represent an instantaneous rate of change from one nucleotide to another during an infinitesimal period of time. The diagonal elements are specified so that every row sums to zero. If branch lengths need to be expressed in terms of expected number of substitutions per site then the Q matrix has to be rescaled so that − Q ii π Q,i = 1, where π Q = (π Q,A , π Q,G , π Q,C , π Q,T ) is the theoretical stationary distribution of the process, which can be calculated from Q.
Most phylogenetic models are time-reversible. Reversibility implies that π Q,i p ij = π Q,j p ji and allows the rate matrix to be represented in the form Q = SΠ, where S is a matrix of the exchangeability parameters S = {ρ ij }, i = j, and Π = diag(π) is a diagonal matrix containing the elements of π Q . While the reversibility assumption makes statistical models simpler, it has no biological justification, and is applied for computational convenience only. Indeed, Squartini and Arndt (2008) showed that there is often evidence of non-reversibility in biological datasets. The most general reversible rate matrix, with six exchangeability parameters, is the GTR model. The HKY85 model is a widely used special case with only two distinct ρ ij , one of which is fixed to prevent arbitrary rescaling of the Q matrix. The rate matrix Q of this model is then specified by the compositional frequency vector π = (π A , π G , π C , π T ) and by the transition-transversion rate ratio κ as Here the symbol is used to indicate that the diagonal elements are specified such that every row sums to zero. We consider two hierarchical non-reversible models based on a log-normal perturbation of the off-diagonal elements of the rate matrix of the HKY85 model. The first model, henceforth called the NR model, utilises one perturbation component, while the more complex model, henceforth called the NR2 model, utilises two perturbation components. We use log-normal perturbations because they provide a computationally convenient way to relax the unrealistic biological assumption of reversibility.

NR model.-
We denote the off-diagonal elements of the rate matrix of the NR model by q ij , and the off-diagonal elements of the rate matrix of the HKY85 model by q H ij , where i = j. The non-reversibility of the NR model is achieved by a log-normal perturbation of the off-diagonal elements of the rate matrix Q H using a perturbation component σ as represented in the following directed acyclic graph (DAG): Working element-wise on a log-scale, the off-diagonal elements of the rate matrix of the NR model can be expressed as, for i = j where the ij are independent N (0, σ 2 ) quantities. Here the perturbation standard deviation σ represents the extent to which Q departs from a HKY85-structure: the larger its value, the greater the degree of departure. This parameter is treated as an unknown quantity whose value we learn about during the analysis. In the NR model we assume that the variation between the overall rate of substitution events at sites is modelled by a Gamma distribution with mean equal to 1 (Yang 1993). For computational convenience we approximate the continuous Gamma Ga(α, α) distribution with a discrete Ga(α, α) distribution with four categories (Yang 1994).
NR2 model.-Under the NR model, departures from HKY85-structure could lead to a non-reversible model or simply a general reversible rate matrix. As such the two types of deviation are confounded and so for any given dataset, learning that σ is large does not necessarily provide evidence of non-reversibility. The NR2 model addresses this issue, thereby aiding model interpretation, by using a two-stage process to perturb the underlying HKY85 rate matrix Q H . The first perturbation is within the space of GTR matrices, perpendicular to the subspace of HKY85 matrices, leading to a reversible rate matrix denoted Q R . The second perturbation acts on Q R and is within the space of general rate matrices but perpendicular to the subspace of GTR matrices, leading to a general non-reversible rate matrix denoted Q. These two random perturbations have different variance parameters σ 2 R and σ 2 N respectively. The general structure of this model can be represented by the following DAG: The two-stage perturbation procedure is explained in the Appendix.
Prior distribution NR model.-The aim of the analysis is to make inference about the parameters of the model: the composition vector π, the transition-transversion rate ratio κ, the perturbation standard deviation σ, the off-diagonal elements of the rate matrix Q, the shape parameter α, the branch lengths and the rooted topology τ . Since we adopt the Bayesian approach to inference, we express our initial uncertainty about these unknown parameters though a prior distribution which takes the form π(π, κ, σ, Q, α, , τ ) = π(π)π(κ)π(σ)π(Q|π, κ, σ)π(α)π( )π(τ ).
The composition vector π is defined on the four-dimensional simplex, that is, it has four positive elements, constrained to sum to one. We choose to assign it a Dirichlet prior, π ∼ D(a π π 0 ), where π 0 = (0.25, 0.25, 0.25, 0.25) is the mean and a π is a concentration parameter (we take a π = 4). We adopt a log-normal prior for the transition-transversion rate ratio κ ∼ LN (log κ 0 , ν 2 ), where κ 0 = 1 and ν = 0.8. The parameters of the prior for κ represent our belief that the probability of κ exceeding 2 is 0.2, i.e Pr(κ < 2) = 0.8. This prior is exchangeable with respect to the nucleotide labels. The perturbation parameter σ is assigned an Exponential prior σ ∼ Exp(γ), where the rate γ = 2.3 reflects our prior belief that the probability of σ exceeding 1 is 0.1, i.e Pr(σ < 1) = 0.9. This choice discourages a stationary distribution π Q in which some characters are heavily favoured over the others. The branch lengths are assigned independent Exponential priors i ∼ Exp(µ), where i = 1, ..., k and k is the number of edges. The rate µ equals 10, so that E(µ) = 0.1. The shape parameter α is assigned a Gamma prior, α ∼ Ga(10, 10), which ensures the expected substitution rate in the Ga(α, α) model for site-specific substitution rates is modestly concentrated around 1.
We define a root type as the number of species on each side of the root. For example, the root type 1 : (n − 1) represents a root split on a pendant edge, 2 : (n − 2) represents a root split between two taxa and all others, etc. A uniform prior over rooted topologies assigns a prior probability of more than 0.5 to root splits of the type 1 : (n − 1), in other words, to roots on pendant edges. We felt that deeper roots are generally more biologically plausible and should be assigned higher prior mass, whilst still retaining a diffuse initial distribution. We therefore chose to assign the rooted tree topology a prior according to the Yule model of speciation, which assumes that at any given time each of the species is equally likely to undergo a speciation event. This generates a biologically defensible prior in which all root types receive the same prior probability if n is odd, and a near uniform distribution if n is even, but with n/2 : n/2 root types receiving half the prior probability of the other root types. The probability of generating a n-species tree T under the Yule distribution is calculated by dividing the number of labelled histories for the tree T by the total number of all possible labelled histories on n species (Steel and McKenzie 2001). This probability depends on the complete rooted topology and therefore has to be re-calculated at every iteration of the MCMC algorithm used for inference. To save computational time, we therefore additionally introduce an approximation to the Yule prior, which we term the structured uniform prior, which assigns equal prior probability to all root-types. In order to sample a rooted topology from this distribution we first sample a root type uniformly. We then sample uniformly from the set of all rooted topologies with that root type. Computationally, this prior is more convenient than the Yule prior because its mass function is independent of the particular unrooted topology and only considers the root split. It also has the advantage of being uniform on root types for all n. Posterior sensitivity to the choice of topological prior will be discussed in the Results section.

NR2 model.-
The parameters of the NR2 model are: the composition vector π, the transition-transversion rate ratio κ, the perturbation standard deviation on the reversible plane σ R , the perturbation standard deviation on the non-reversible plane σ N , the shape parameter α, the branch lengths and the rooted topology τ . We also have latent variables comprising ν 1 , . . . , ν 5 for the reversible perturbation, and η 1 , η 2 , η 3 for the non-reversible perturbation (see Appendix). The prior distribution of these unknowns takes the form The rate heterogeneity parameter α, branch lengths , rooted topology τ and the parameters π and κ of the reversible Q H matrix are assigned the same priors as those used for the NR model. Both perturbation standard deviations are assigned the same prior as their analogue, σ, in the NR model, i.e. σ R ∼ Exp(2.3) and σ N ∼ Exp(2.3). As discussed in the Appendix, ν i ∼ N (0, σ 2 R ) for i = 1, . . . , 5 independently, and η i ∼ N (0, σ 2 N ) for i = 1, 2, 3 independently.

Likelihood
The likelihood function summarises the information available from the data about the unknowns in the model including the phylogenetic tree. Since we assume that alignment sites evolved independently of each other, the likelihood can be expressed as a product of the likelihoods of the n individual sites of the alignment. If we denote θ to be the parameters of the substitution process, the likelihood takes the form where D i is the column of nucleotides at site i. The probability of the data at a site i is given by where v and w are the vertices at the two ends of edge and X(i) is the nucleotide at vertex i which is only observed at the leaves. The sum is taken over all functions X(i) from the vertices to Ω such that X(i) matches data D i for leaf vertices. We assume a stationary model and so take the probability at the root π X(root) to be π Q,X(root) , which comes from π Q , the theoretical stationary distribution associated with Q (note that this is not the same as π, the stationary distribution of the underlying HKY85 model).
This distribution is analytically intractable and so we use Markov chain Monte Carlo (MCMC) methods, specifically a Metropolis-within-Gibbs sampling scheme. At each iteration of the MCMC algorithm the following moves are performed: (a) update the parameters of the substitution model (π, κ, σ, Q, α); (b) update the branch lengths and the rooted topology τ .
In move (a) we update the parameters using a Dirichlet random walk proposal for π and log-normal random walk proposals for the other parameters. Move (b) consists of a series of Metropolis-Hastings steps to update each branch length one at a time using a log-normal random walk proposal and then updating the rooted topology and branch lengths (in a joint move) through three types of proposal: nearest-neighbour interchange (NNI), sub-tree prune and regraft (SPR), and a proposal which moves the root; see (Heaps et al. 2014) for complete details of all three moves.

MCMC implementation.-
In the following section, all results are based on (almost) un-autocorrelated posterior samples of size 2K. These samples were obtained by running the algorithm for at least 200K iterations, discarding at least 50K iterations as burn-in and then thinning by taking every 100th iterate to remove autocorrelation. Convergence was diagnosed using the procedure described in Heaps et al. (2014). This involved initialising two MCMC chains at different starting points and graphically comparing the chains through properties based on model parameters and the relative frequencies of sampled clades. In all cases, the graphical diagnostics gave no evidence of any lack of convergence.

Analysis of simulated data
We used simulations to investigate the behaviour of these models. In order to simulate the alignments, we first fixed the underlying reversible HKY85 rate matrix (Q H matrix) using the values π = (0.25, 0.25, 0.25, 0.25) and κ = 2. Then we applied different types of perturbation to the Q H matrix.

NR model.-
To investigate the effect of different levels of non-reversibility, four different values of the perturbation standard deviation were used to simulate the data: σ = 0.05, 0.1, 0.2, 0.3. These values were chosen so that the prior for the stationary distribution induced by the log-normal perturbation would be in the range of values estimated for real data; as σ increases, significant support is given to highly biased compositions, and for σ > 0.3 these are biologically unrealistic ( Supplementary Fig. 1). For each value of the perturbation standard deviation nine different datasets of length 2000 bp were simulated, the first five having different rate matrices (datasets 1 -5), and the last five having the same rate matrix (datasets 5, 5A, 5B, 5C, 5D). Thus the former five datasets have different stationary distributions, while the latter five datasets have the same stationary distribution. This type of alignment simulation allows us to investigate different sources of variability in the data. The datasets were simulated under the NR model on a rooted version of an unrooted 30-taxa tree derived from a recent analysis (Supplementary The posterior distribution of the root splits is displayed in Figure 1, with the analysis of one representative simulated alignment shown for each value of the perturbation standard deviation. The true root split was recovered with the highest posterior probability, with all other possible root splits receiving much lower posterior support. The results of the other eight simulated alignments for each value of σ looked very similar, with the correct root split receiving the highest posterior support (not shown). This suggests that the NR model can recover the true root, at least on simulated data. In order to evaluate the sensitivity of these results to the topological prior, the same analysis was performed using the structured uniform prior ( Supplementary Fig. 3). This analysis gave very similar results, as we might expect given the similarity between the two priors.
It is worth noting that while the true root split received the highest posterior support in these analyses, the posterior modal unrooted topology and the majority rule consensus tree did not exactly match the true unrooted topology, though the differences were very minor ( Supplementary Fig. 4). We reasoned that this result may be explained by the presence of a number of short branches in the tree used in the simulations, which might complicate inference of the unrooted topology. We evaluated this hypothesis in two ways. First, we simulated another dataset using the NR model, but under a slightly modified tree in which the branch lengths were re-sampled from a shifted Exponential distribution to ensure that the resulting tree had no short branches ( i ∼ Exp(10) + 0.05). Inference on this dataset recovered the true unrooted topology, consistent with the explanation of short branches causing the problem in the original analysis. Second, we simulated a dataset using a reversible HKY85 model on the original tree -including short branches -and reanalysed it using the HKY85 model. As with the NR model, the inferred unrooted topology did not exactly match the true one. These results suggest that while the NR model performs similarly to the reversible HKY85 model when inferring unrooted topologies, it is also able to extract sufficient information for inferring the position of the root.

NR2 model.-
The datasets were simulated under the NR2 model in a similar manner as for the NR model. A small value was chosen for the reversible perturbation (σ R = 0.001) since we assume that the information about the root position comes from the non-reversible perturbation. The values for the non-reversible perturbation σ N = 0.1, 0.25, 0.5, 1.0 were chosen so that in the prior for the stationary distribution, some nucleotides are not heavily favoured over the others ( Supplementary Fig. 5). The analysis was performed twice (with the Yule prior and the structured uniform prior). As with the NR model, and most likely for the same reasons, the inferred unrooted topology did not exactly match the true one. The posterior distribution of the root splits for different values of σ N is shown in Figure 2 (for the Yule prior) and Supplementary Figure 6 (for the structured uniform prior). As in the case of the NR model, inference under both priors recovered the true root split as the posterior mode.

Rooting the radiation of palaeopolyploid yeasts
We next investigated the performance of the NR and NR2 models on a real biological dataset for which there is broad biological consensus on the root position (Byrne and Wolfe 2005;Hedtke et al. 2006). The lineage leading to Saccharomyces cerevisiae (brewer's yeast) and its relatives underwent a conserved whole-genome duplication (WGD) about 100 million years ago (Wolfe and Shields 1997; Kellis et al. 2004). Evidence for this WGD, in the form of duplicated genes and genomic regions, is shared by all post-WGD yeasts and defines the group as a clade from which the root of the Saccharomycetales is excluded. Current biological opinion on the rooted phylogeny of these species is summarised in Figure 3  . The root inferred in the outgroup analysis performed within the maximum likelihood framework separates a clade comprising Eremothecium gossypii, Eremothecium cymbalariae, Kluyveromyces lactis, Lachancea kluyveri, Lachancea thermotolerans and Lachancea waltii from the other species (Hedtke et al. 2006). We analysed an alignment of concatenated large and small subunit ribosomal DNA sequences for 20 yeast species, with a combined length of 4460 bp. The sequences were aligned with MUSCLE, and poorly-aligned regions were detected and removed using TrimAl. We analysed this dataset with the NR and NR2 models, using both the Yule prior and the structured uniform prior. Figure 4a shows the posterior distribution of the root splits for both the NR and NR2 models implemented with the structured uniform prior. The root split supported by outgroup rooting (Hedtke et al. 2006) has the highest posterior probability (root 1 on Fig. 3) for both models. However, there is a substantial amount of uncertainty represented by the non-negligible posterior probabilities of the other root splits. While the third most plausible root is placed within the outgroup (root 2 on Fig. 3), the second most plausible root is located within the post-WGD clade (root 3 on Fig. 3). This degree of uncertainty is also reflected in the sensitivity of the analysis to the topological prior: while the structured uniform prior recovered the root supported by the outgroup analysis with the highest posterior support, the Yule prior instead recovered this root with the second-highest support (Fig 4b). The most plausible root inferred with the Yule prior is placed within the post-WGD clade (root 3 on Fig. 3), which contradicts the whole genome duplication analysis. The posterior for the non-reversibility parameter σ is suggestive of a large degree of non-reversibility in the data. This is illustrated in Supplementary Figure 7 which shows the standardised marginal likelihood for σ, indicating the weight of evidence for different values of σ given the prior and the model. The posterior means is around 0.37. For simulated data, we were able to infer the true root with higher posterior support and less uncertainty for such σ. This suggests the presence of other features of the data not accounted for by the model that are masking the root signal.
The rooted majority rule consensus trees from the analyses with the two topological priors are depicted in Fig 5. The unrooted topologies of both consensus trees differ from that supported by the whole genome duplication analysis by the placement of Vanderwaltozyma polyspora. While the WGD analysis places it within the post-WGD clade, in our analysis this taxa is located within the pre-WGD clade. Interestingly, this result is consistent with the analysis performed with the CAT-GTR model (Lartillot and Philippe 2004): on the tree inferred with the CAT-GTR model, the Vanderwaltozyma polyspora is excluded from the post-WGD clade (not shown). This suggests that the non-reversible models can extract meaningful information about the root position, and also capture information for inferring the unrooted topology. However, the minor mismatch of the topologies inferred in our analysis with that supported by WGD and outgroup analyses (Hedtke et al. 2006) confirms the presence of some features of the data which our models do not account for. One of them might be heterogeneity in composition across branches, that is, over time. It has been previously shown that failure to account for compositional heterogeneity can lead to inferring incorrect topologies with strong support (Foster 2004;Cox et al. 2008;Foster et al. 2009;Williams et al. 2012). Thus further refinement of the models, for instance, relaxing the stationarity assumption, might be necessary to improve the ability of the models to provide better insight into the evolution of palaeopolyploid yeasts.
The reason we analyse the posterior modal root split and not the root split on the majority rule consensus tree is that the two root splits need not match each other. This happens because the consensus tree is a conditional summary, computed recursively from the leaves to the root, which depends upon the plausibility of sub-clades. On the other hand, the posterior over root splits is a marginal summary which averages over the relationships expressed elsewhere in the tree; see online Appendix 1 for an illustrative example.

Analysis of the ribosomal tree of life
We have also applied the models to a dataset for which there is still debate about the unrooted topology and root position: the ribosomal tree of life. The debates are centred on two hypotheses. According to the three domains hypothesis, the Archaea are monophyletic, sharing a common ancestor with Eukaryota (Woese 1990). The other hypothesis, called the eocyte hypothesis, suggests that the Archaea are paraphyletic and the Eukaryota originated from within the Archaea (Lake 1988;Rivera and Lake 1992;Cox et al. 2008). Recent analyses of the tree of life ribosomal RNA data have demonstrated that the tree recovered can change depending on the substitution model used. When homogeneous models are used for the analysis they often recover the three domains tree, while heterogeneous models generally recover the eocyte tree (Cox et al. 2008;Williams et al. 2012). Here we analysed aligned concatenated large and small subunit ribosomal RNA sequences from archaebacterial, bacterial and eukaryotic species (36 taxa, 1734 sequence positions), including the recently discovered archaeal groups: Thaumarchaeota, Aigarchaeota and Korarchaeota. These new groups are closely related to Crenarchaeota and together they form the so-called TACK superphylum (Guy and Ettema 2011;Kelly et al. 2011;Williams et al. 2012;Lasek-Nesselquist and Gogarten 2013). Previous analyses of this dataset recovered an eocyte topology, however these analyses were not able to infer the root because they used reversible substitution models (Williams et al. 2012). We analysed these data with both the NR and NR2 models using both the Yule prior and the structured uniform prior. All the analyses recovered the eocyte topology with similar posterior support. Figure 6 shows the majority rule consensus tree from the analysis with the Yule prior (the majority rule consensus tree from the analysis with the structured uniform prior is very similar, shown in Supplementary Fig. 8). The analysis with the Yule prior assigned high posterior support to two roots splits -one on the branch leading to Bacteria (root 1 on Fig. 6), the other within the Bacteria, on the branch leading to Rhodopirellula baltica (root 2 on Fig. 6). This inference is in accord with current biological opinion about the root of the tree of life, which places the root either on the branch leading to the Bacteria, or within the Bacteria (Baldauf 1996;Cavalier-Smith 2006;Skophammer et al. 2007). However, in the analysis performed with the structured uniform prior, the support for the root within the Bacteria decreased and that for the the root on the bacterial branch increased (Fig. 7). This analysis illustrates the sensitivity of the inference to the choice of topological prior, and confirms the importance of prior choice in Bayesian phylogenetics.

Discussion
We have presented two hierarchical non-reversible models for inferring rooted phylogenetic trees. The non-reversibility of both models is achieved by applying a stochastic perturbation to the rate matrix of a reversible model. This perturbation makes the likelihood dependent on the position of the root, enabling us to infer the root directly from the sequence alignment. In the first model we use only one variation component and perform a log-normal perturbation on the space of all possible rate matrices. In contrast, the second model utilises two variation components and the perturbation is performed on the space of reversible and non-reversible rate matrices separately. This separation allows us to judge the extent of the different types of perturbation. The results on the simulated data show that both models are able to correctly infer the position of the root for a range of degrees of non-reversibility, suggesting that modelling non-reversibility is an effective way to learn about the root of the tree.
We have applied our models to two biological datasets. These analyses agree with our simulations in suggesting that our non-reversible models can recover useful rooting information from real biological sequence alignments. The analyses of both the yeast and tree of life datasets recover the widely agreed root. However, both datasets show some prior sensitivity, even though the two topological priors (the Yule prior and the structured uniform prior) share similar features. The inferred non-reversibility parameter provides evidence of a non-negligible degree of non-reversibility from both datasets ( Supplementary  Fig. 7). Despite evidence of the presence of non-reversibility, the analyses display high levels of posterior uncertainty. This suggests that the information about the root may be obscured by other signals that are not accounted for by our current models. In the current implementation, the models assume the evolutionary process is stationary. If this was true then the empirical composition of the four nucleotides would be roughly the same for all taxa in the alignment. However, this is often not the case in experimental data (Foster 2004;Cox et al. 2008), so the models may benefit from further development, such as modelling the non-stationarity of the process. Nonetheless, with the current implementation of the models our findings illustrate that non-reversibility can be used to learn about the root position and suggest that non-reversible models can be useful to infer the root position from real biological datasets.

Appendix 1
The two-stage perturbation relies upon the underlying geometry of the space of Markov rate matrices, and is achieved in the following way. We work on a log-scale element-wise with all matrices, ignoring diagonal elements. The set of all possible 4 × 4 rate matrices M is therefore identified with R 12 which we equip with the standard inner product. The set of HKY85 matrices and GTR matrices form nested sub-manifolds of M . Recall that working element-wise on a log-scale, the off-diagonal elements of the rate matrix of the NR model can be expressed as, for i = j where the ij are independent N (0, σ 2 ) quantities. The element-wise log of the HKY85 matrix Q H in equation (1) is where (π 1 ,π 2 ,π 3 ,π 4 ) = (log π A , log π G , log π C , log π T ),κ = log κ, e i is the i-th standard basis vector of R 4 and s = (1, 1, 1, 1) T . By differentiating with respect to the parameterŝ π 1 ,π 2 ,π 3 andκ we obtain 4 linearly independent vectors in M which are locally tangent to the sub-manifold of HKY85 matrices at Q H , and we denote these V 1 , V 2 , V 3 , V 4 . (Differentiating with respect toπ 4 gives a tangent vector contained in the span of V 1 , V 2 , V 3 .) The tangent vectors in M correspond to the 4 × 4 matrices for i = 1, 2, 3, and V 4 = e 1 e T 2 + e 2 e T 1 + e 3 e T 4 + e 4 e T 3 .
The element-wise log of the general GTR matrix is whereρ ij is the log of the exchangeability parameter ρ ij . By differentiating with respect to theρ ij parameters, it is straightforward to obtain tangent vectors V 5 , . . . , V 9 to the sub-manifold of GTR matrices at Q H , such that the set V 1 , . . . , V 9 is linearly independent. Finally, standard linear algebra can be used to extend this collection to a basis V 1 , . . . , V 12 of R 12 . Next, the QR factorisation algorithm is applied to the 12 × 12 matrix with columns V 1 , . . . , V 12 to obtain an orthonormal basis of tangent vectors W 1 , . . . , W 12 which is used to perturb Q H . First, Q H is perturbed using ν 1 , . . . , ν 5 to obtain a GTR matrix Q R where, for i = j log q R ij = log q H ij + 9 k=5 ν k−4 W kij , and the ν k are independent N (0, σ 2 R ) and W kij is the (i, j)-th element of the 4 × 4 matrix W k . The choice of basis W 1 , . . . , W 12 ensures that this perturbation is locally orthogonal to the sub-manifold of HKY85 matrices. The second stage perturbs Q R into the space of non-reversible rate matrices using η 1 , η 2 , η 3 : for i = j log q ij = log q R ij + 12 k=10 η k−9 W kij , and the η k are independent N (0, σ 2 N ) quantities. This perturbation is locally perpendicular to the sub-manifold of GTR matrices in M . The equation determines the off-diagonal elements of the non-reversible rate matrix Q, while the diagonal elements are fixed in order to make the row sums zero. The size of the perturbation variance σ 2 R can be thought of as representing the extent to which the rate matrix Q departs from the class of HKY85 models remaining within the class of reversible models, while σ 2 N represents the extent to which Q departs from being reversible. As for the NR model, the NR2 model uses a discrete Gamma distribution with 4 categories to model rate heterogeneity between sites.

Online Appendix 1
The root on the majority rule consensus tree and the mode of the posterior distribution for root splits are different point summaries of the posterior distribution for root positions. Both can be approximated from posterior samples of rooted topologies but they need not coincide. For example, suppose the posterior output comprises the following five trees: The clade (A, B) appears on all the trees, and so is included in the consensus tree with probability one. Similarly, the clade (A, B, C) appears on three trees (Tree 2, Tree 3 and Tree 4), and so appears in the consensus tree with support 0.6. Continuing in this fashion, the consensus tree is completed by incorporating the clades (E, F) and (D, E, F) which appear with support 0.8 and 0.6 respectively. Hence, the root position on the consensus tree (shown in the Figure below) separates the taxa A, B, C from D, E, F. On the other hand, the posterior for root splits is given by:   Figure 3. Rooted phylogeny of the palaeopolyploid yeasts supported by the whole-gene duplication analysis (not drawn to scale), reproduced from the YGOB website (Byrne and Wolfe 2005;http://ygob.ucd.ie 2015). The tree is rooted according to the outgroup method performed within the maximum likelihood framework (Hedtke et al. 2006).  Figure 3. In (a), the analysis performed with the structured uniform prior, the root split supported by outgroup rooting (Hedtke et al. 2006) has the highest posterior probability (root 1, highlighted). Root 2 is placed within the outgroup and root 3 is placed within the post-WGD clade. In (b), the analysis performed with the Yule prior, the root split supported by outgroup rooting (Hedtke et al. 2006) has the second highest posterior probability (root 1, highlighted). Root 3 is placed within the post-WGD clade.  Figure 6. Rooted majority rule consensus tree for the tree of life dataset, inferred under the NR model using the Yule prior. The tree supports the eocyte hypothesis by placing the eukaryotes within the Archaea, as a sister group to the TACK superphylum. Roots 1, 2 and 3 are the root splits having the highest posterior support. Posterior support for these root splits is shown in Figure 7.   Figure 8. Rooted majority rule consensus tree for the tree of life dataset, inferred under the NR model using the structured uniform prior. The tree supports the eocyte hypothesis by placing the eukaryotes within the Archaea, as a sister group to the TACK superphylum. Roots 1, 2 and 3 are the root splits having the highest posterior support.