## Abstract

Nucleotide substitution in both coding and noncoding regions is context-dependent, in the sense that substitution rates depend on the identity of neighboring bases. Context-dependent substitution has been modeled in the case of two sequences and an unrooted phylogenetic tree, but it has only been accommodated in limited ways with more general phylogenies. In this article, extensions are presented to standard phylogenetic models that allow for better handling of context-dependent substitution, yet still permit exact inference at reasonable computational cost. The new models improve goodness of fit substantially for both coding and noncoding data. Considering context dependence leads to much larger improvements than does using a richer substitution model or allowing for rate variation across sites, under the assumption of site independence. The observed improvements appear to derive from three separate properties of the models: their explicit characterization of context-dependent substitution within *N*-tuples of adjacent sites, their ability to accommodate overlapping *N*-tuples, and their rich parameterization of the substitution process. Parameter estimation is accomplished using an expectation maximization algorithm, with a quasi-Newton algorithm for the maximization step; this approach is shown to be preferable to ordinary Newton methods for parameter-rich models. Overlapping tuples are efficiently handled by assuming Markov dependence of the observed bases at each site on those at the *N* − 1 preceding sites, and the required conditional probabilities are computed with an extension of Felsenstein's algorithm. Estimated substitution rates based on a data set of about 160,000 noncoding sites in mammalian genomes indicate a pronounced CpG effect, but they also suggest a complex overall pattern of context-dependent substitution, comprising a variety of subtle effects. Estimates based on about 3 million sites in coding regions demonstrate that amino acid substitution rates can be learned at the nucleotide level, and suggest that context effects across codon boundaries are significant.

## Introduction

Many of the simplifying assumptions originally introduced for phylogenetic inference have been relaxed, resulting in models of improved realism and power. For example, the assumptions that each nucleotide or amino acid substitutes for the other at the same average rate, and that this average rate is constant across sites (Neyman 1971; Felsenstein 1981), have both given way to more realistic alternatives (Whelan, Liò, and Goldman 2001). One assumption that has largely been maintained, however—despite general agreement that it is biologically unrealistic—is that the sites at which evolution occurs can be considered independent.

Various methods have been proposed for relaxing site independence in limited ways, without sacrificing the essential strengths of maximum-likelihood methods for phylogenetic inference. For example, models have been introduced that allow for dependence between sites of the same codon (Goldman and Yang 1994; Muse and Gaut 1994; Pedersen, Wiuf, and Christiansen 1998; Yang et al. 2000; Schadt and Lange 2002), correlation of the overall rates of substitution at adjacent sites (Felsenstein and Churchill 1996; Yang 1995), and dependence between paired bases in ribosomal RNA genes (Schöniger and von Haeseler 1994; Muse 1995; Rzhetsky 1995; Tillier and Collins 1995). In addition, models have been developed for the special case of two sequences and a reversible substitution process that allow for general context-dependent substitution, with substitution rates for each base depending on the identity of flanking bases (Jensen and Pedersen 2000; Pedersen and Jensen 2001). With these models, the likelihood computation can no longer be expressed as a product over the sites of an alignment, and exact parameter estimation becomes intractable (a Markov random field arises from the bidirectional dependencies between adjacent sites); Markov chain Monte Carlo (MCMC) or similar methods must be used instead. (Another, simpler model of context-dependent substitution for pairs of sequences has been developed by Arndt, Burge and Hwa (2002), along with an approximation procedure, adapted from nonlinear dynamics, that allows for efficient parameter estimation.) These models accurately reflect an assumed *process* of context-dependent substitution, and therefore, will be described here as “process-based,” in contrast to more empirical models (see below). To our knowledge, process-based models for context-dependent substitution have not yet been extended for use with a general phylogenetic tree.

It is now widely agreed that site independence is an unacceptable assumption in coding regions, where selection acts primarily at the level of amino acids, which are coded for by triplets of adjacent nucleotides. Indeed, the inadequacy of this assumption is the premise underlying codon models, which do appear to improve significantly on the goodness of fit of independent-site models in coding regions (Goldman and Yang 1994). Substitution rates in noncoding DNA, however, are also highly dependent on neighboring bases, and this phenomenon has mostly been ignored in phylogenetic analysis (except that sites subject to the strongest dependencies are often discarded).

Studies of aligned gene/pseudogene pairs in primates have revealed strong evidence of context-dependent substitution in (presumably) neutrally evolving regions, with the rate at which one nucleotide substitutes for another varying as much as 10-fold for different sets of flanking bases, and context-dependent rates overall spanning a more than 50-fold range (Blake, Hess, and Nicholson-Tuell 1992; Hess, Blake, and Blake 1994). Similar effects have been observed recently in connection with single-nucleotide polymorphisms (SNPs) in the human genome (Zhao and Boerwinkle 2002). The strongest context effects in these studies have been seen with C→T transitions in CpG dinucleotides (or G→A transitions on the reverse strand), and are presumably due to methylation and spontaneous deamination of cytosines (Bulmer 1986; Ehrlich, Zhang, and Inamdar 1990). Other effects have been noted as well, however, such as a tendency for higher rates of substitution where purines and pyrimidines alternate, an increased rate of T→C transitions in TpA dinucleotides, and a higher than expected rate of C→A and C→G (G→T and G→A) transversions in CpG dinucleotides (Blake, Hess, and Nicholson-Tuell 1992; Hess, Blake, and Blake 1994). A somewhat weak dependence of the transition/transversion ratio on the A + T content of flanking bases has also been observed in primates (Zhao and Boerwinkle 2002; Yang, Chen, and Li 2002), echoing the stronger dependence observed in the choloroplast and mitochondrial genomes of plants (Morton and Clegg 1995; Morton, Oberholzer, and Clegg 1997; Yang, Chen, and Li 2002). Context effects are strongest for the two immediate neighbors of a given site, but under certain conditions, they can be detected as well for the second and third pairs of flanking bases (Morton 1997; Morton, Oberholzer, and Clegg 1997). Subtle effects may extend as far as 200 bp (Zhao and Boerwinkle 2002), and longer-range dependencies of substitution rates on base composition and other regional properties have also been observed (Bernardi 2000; Hardison, Roskin, and Yang 2003). Context dependence of substitution rates is thought to be the result of neighbor effects on various phenomena, including polymerase fidelity and proofreading, mismatch repair, and mismatch stability (briefly reviewed by Blake, Hess, and Nicholson-Tuell 1992; Morton, Oberholzer, and Clegg 1997).

Most studies of context-dependent substitution rates in noncoding DNA have used simple counting methods and pairwise alignments of highly similar sequences. These methods have some clear deficiencies: for example, they do not allow for multiple substitutions per site or differences in evolutionary distance between aligned pairs of sequences (e.g., in different gene/pseudogene pairs), and they do not benefit from consideration of multiple homologous sequences and a phylogeny. If context-dependent substitution could be incorporated into probabilistic phylogenetic models, better estimates of context-dependent substitution rates might be obtained, as well as improved estimates of branch lengths and (possibly) tree topologies.

In this article, methods are introduced for incorporating context-dependent substitution into phylogenetic models that are more closely related to methods used in computational gene-finding (Durbin et al. 1998; Siepel and Haussler 2004) than to those used with process-based models. The basic strategy here is to extend and generalize codon models for improved handling of context dependence, without sacrificing too much in the way of computational efficiency. We stop short of a process-based description of context-dependent substitution, and as a result are able, for the most part, to retain the standard framework for likelihood computation and parameter estimation (the need for sampling or approximation algorithms is avoided). We begin with the introduction of arbitrary “*N*-th order” models, defined in terms of *N*-tuples of characters (for small *N*), and then a discussion of how these models can be fitted to large data sets using an expectation maximization (EM) algorithm. Next, a method is introduced, based on an assumption of Markov dependence of each site on its *N* − 1 predecessors, that allows *N*-tuples to overlap, so that context effects can be captured between all adjacent pairs of sites. Results are then presented for two large data sets of mammalian sequence—one of about 160,000 sites in noncoding regions and one of about 3 million sites in coding regions—which demonstrate that our context-dependent models produce significant improvements in goodness of fit compared with both codon models and independent-site models. Allowing for overlapping tuples of sites is shown to be important, as is using a sufficiently rich parameterization of the substitution process. Parameter estimates are presented for context-dependent substitution in both non-coding and coding regions, and discussed in some detail.

## Materials and Methods

### Background and Notation

Let a *phylogenetic model***ψ** = (**Q**, **π**, **τ**, **β**) be a four-tuple consisting of a substitution rate matrix **Q**, a vector of equilibrium frequencies **π**, a (rooted) binary tree τ = (

**β**= {β

_{u}|

*u*∈

*r*}} (where β

_{u}corresponds to the branch between node

*u*and its parent, denoted σ(

*u*), and

*r*is the root of the tree). The model is defined with respect to an alphabet

**Σ**of size

*d*, e.g.,

**Σ**= (A, C, G, T). In standard phylogenetic models,

**Q**has dimension

*d*×

*d*, and

**π**has dimension

*d*. The tree has

*n leaves*, or external nodes, corresponding to

*n*present-day taxa; hence, in total there are 2

*n*− 1 nodes and 2

*n*− 2 branches. It is sometimes useful to distinguish between the set of leaves,

The parameters of a phylogenetic model are estimated with respect to a multiple alignment of *n* sequences, whose length (number of columns) we denote *L*. It is assumed that there has existed a sequence of length *L*, with characters drawn from **Σ**, corresponding to every node *u* ∈

*X*

_{u,i}} be a set of random variables, such that

*X*

_{u,i}represents the

*i*th character (1 ≤

*i*≤

*L*) in the sequence corresponding to node

*u*∈

*X*

_{•}denote the set of observed variables,

*X*

_{•}= {

*X*

_{u,i}|

*u*∈

*X*

_{∘}denote the set of latent variables,

*X*

_{∘}= {

*X*

_{u,i}|

*u*∈

*x*

_{•}and

*x*

_{∘}denote instances of

*X*

_{•}and

*X*

_{∘}, respectively, so that

*x*

_{•}is completely defined by the given alignment, and

*x*

_{∘}represents a set of possible ancestral sequences. Let

*X*

_{•,i}and

*X*

_{∘,i}be the sets of observed and latent variables, respectively, corresponding to column

*i*in the alignment,

*X*

_{•,i}= {

*X*

_{u,i}|

*u*∈

*X*

_{∘,i}= {

*X*

_{u,i}|

*u*∈

*x*

_{•,i}and

*x*

_{∘,i}be sets of instances of these variables.

The likelihood of a phylogenetic model **ψ** with respect to an alignment *x*_{•} is obtained by summing over all possible values of the latent variables. Assuming site independence,

*P*(

*x*

_{•,i}|ψ) = ∑

*P*(

*x*

_{•,i},

*x*

_{∘,i}|ψ), can be computed with Felsenstein's “pruning” algorithm (Felsenstein 1981; Durbin et al. 1998) (see Appendix A), assuming that the probability is available of the substitution of any

*b*∈

**Σ**for any

*a*∈

**Σ**over a branch of length

*t*(

*t*≥ 0). These probabilities, denoted

*P*(

*b*|

*a*,

*t*), are based on a continuous-time Markov model of substitution, defined by the rate matrix

**Q**. (The substitution process is assumed to be stationary and homogeneous.) Let the elements of

**Q**be denoted {

*q*

_{a,b}} (for

*a*,

*b*∈

**Σ**), with

*q*

_{a,b}(

*a*≠

*b*) representing the instantaneous rate at which

*a*is replaced by

*b*, and elements on the main diagonal defined such that each row sums to zero; i.e.,

*q*

_{a,a}= −∑

_{b∈Σ−{a}}

*q*

_{a,b}. The probabilities

*P*(

*b*|

*a*,

*t*) are given by the elements of the matrix

**P**(

*t*) = exp(

**Q**

*t*), where exp(

**Q**

*t*) =

*P*(

*b*|

*a*,

*t*) as [exp(

**Q**

*t*)]

_{a,b}, to be clear that it is a function of

**Q**.)

**Q**can be parameterized in various ways (Yang 1994

*a*; Whelan, Liò, and Goldman 2001); in this article, we consider three of the standard parameterizations, corresponding to the HKY, REV (general reversible), and UNR (unrestricted) models (see table 1). By convention,

**Q**is scaled such that

*t*can be interpreted as the expected number of substitutions per site (the scaling constraint is ∑

_{a,b∈Σ:a≠b}π

_{a}

*q*

_{a,b}= 1).

A phylogenetic model can be extended easily to describe *N*-tuples of sites, for small *N*: the alphabet **Σ** is simply replaced with **Σ**^{N}, and the dimensions of **Q** and **π** are adjusted accordingly. As long as the *N*-tuples can be assumed independent, the procedures for computing the likelihood of a model and estimating its parameters remain essentially unchanged, although they become more computationally intensive. (The matrix **P**(*t*) = exp(**Q***t*) now describes the joint probabilities of substitution of *N*-tuples of bases.) We refer to such a model as having *order N*, and when *N* > 1, we call the model *context-dependent.* In this article, four kinds of *N*th order models are considered, corresponding to four parameterizations of the rate matrix **Q**: an unrestricted model (U*N*), a reversible model (R*N*), a strand-symmetric unrestricted model (U*N*S), and a strand-symmetric reversible model (R*N*S) (see table 1). In all cases, the number of parameters is kept manageable by assuming an instantaneous rate of zero for substitutions of multiple nucleotides, as in most codon models. A scaling constraint of

_{a}

*q*

_{a,b}=

*N*allows for branch-length units of substitutions per site. See Appendix B for additional details.

For context-dependent models, the letter *Z* will be used in place of *X* to denote both *N*-tuples of sites and *N*-tuples of random variables. *N*-tuples will be assumed to be composed of adjacent sites, so that *Z*_{u, j} = (*X*_{u, N( j−1)+1}, … , *X*_{u, Nj}), *Z*_{•, j} = (*X*_{•, N( j −1)+1}, … , *X*_{•, Nj}), and *Z*_{∘, j} = (*X*_{∘, N( j−1)+1}, … , *X*_{∘, Nj}) (where 1 ≤ *j* ≤ *L*′, *L*′ = *L*/*N*). As with *x*_{•i} and *x*_{∘i}, *z*_{•, j} and *z*_{∘, j} are instances of *Z*_{•, j} and *Z*_{∘, j}, respectively. To complete the parallel, the variables *Z*_{•}, *Z*_{∘}, *z*_{•}, and *z*_{∘} are defined, but are equivalent to *X*_{•}, *X*_{∘}, *x*_{•}, and *x*_{∘}, respectively. When *N*-tuples are independent (and nonoverlapping, as above), the likelihood of the model is simply *P*(*z*_{•}|ψ) =

*P*(

*z*

_{•, j}|ψ). The probability of each tuple of columns,

*P*(

*z*

_{•, j}|ψ) = ∑

*P*(

*z*

_{•, j},

*z*

_{∘, j}|ψ), can be computed with Felsenstein's algorithm, as before, but with substitution probabilities based on

*N*-tuples of nucleotides. Overlapping

*N*-tuples are discussed below.

Variation in the rate of evolution between different sites can be accommodated, with context-dependent as well as independent-site models, using the *discrete gamma* method of Yang (1993; 1994*b*). This method is based on the introduction of a random scaling parameter for the branch lengths **β**, which is assumed to have a gamma distribution, whose shape parameter α is estimated from the data. The distribution is partitioned it into *k* “rate categories” of equal probability, each with a rate constant *r*_{l} (1 ≤ *l* ≤ *k*), and the probability *P*(*z*_{•, j}|**ψ**) is approximated as

*k*invocations of Felsenstein's algorithm. Even quite small values of

*k*appear to provide adequate approximations of the full continuous distribution;

*k*= 4 was recommended by Yang (1994

*b*).

### Estimating the Parameters of a Context-Dependent Model

A maximum-likelihood estimate (m.l.e.) of the parameters of a phylogenetic model,

_{ψ}

*P*(

*z*

_{•}|ψ), can be obtained by searching over tree topologies, and numerically optimizing (

**Q**,

**π**,

**β**) conditional on each possible

**τ**. In this article,

**τ**is assumed to be known, and the problem is simplified considerably. In addition,

**π**is estimated directly from

*z*

_{•}, as is the standard practice (each frequency π

_{a}is estimated as the observed frequency of tuple

*a*in

*z*

_{•}), so that the critical problem is to obtain m.l.e.'s of

**Q**and

**β**conditional on

**π**and

**τ**.

Many software packages, including PAML (Yang 1997), MOLPHY (Adachi and Hasegawa 1996), fastDNAml (Olsen et al. 1994), PHYLIP (Felsenstein 1993), and PAUP (Swofford 2002), use Newton-Raphson or quasi-Newton algorithms for parameter estimation (Press et al. 1992), computing partial derivatives of the likelihood function numerically or with a combination of numerical and analytical methods. Newton-based optimization algorithms are not well suited for models with richly parameterized rate matrices, however, because of the computational cost of computing partial derivatives of the likelihood function with respect to rate-matrix parameters (the same is true of any method that makes use of function derivatives, such as conjugate-gradient methods). Moreover, they tend to require large numbers of evaluations of the likelihood function, whose computational cost increases exponentially with the order of a context-dependent model, and linearly with the number of sites considered. We will show in this section that an EM algorithm (Dempster, Laird, and Rubin 1977) is generally preferable to quasi-Newton methods for context-dependent models, although quasi-Newton methods can be useful for solving the optimization problem that arises on each iteration of the EM algorithm.

Suppose we seek the m.l.e. of (**Q**, **β**) given **τ** and **π**: (**Q̂**,

_{Q,β}

*P*(

*z*

_{•}|

**Q**, π, τ,

**β**). The theorem behind the EM algorithm says that a (local) maximum of

*P*(

*z*

_{•}|

**Q**,

**π**,

**τ**,

**β**) can be obtained by iterative maximization of the expected value of log

*P*(

*z*

_{•},

*z*

_{∘}|

**Q**,

**π**,

**τ**,

**β**), computed with respect to the (posterior) distribution of

*Z*

_{∘}(Dempster, Laird, and Rubin 1977). It can be shown (Appendix C) that the general EM update rule, by which estimates for iteration

*t*+ 1 are obtained from estimates for iteration

*t*, here becomes where

*E*[

*S*(

*b*,

*a*,

*u*)|

*z*

_{•},

**Q̂**

^{t}, π, τ,

^{t}] is the expected number of substitutions of

*b*∈

**Σ**

^{N}for

*a*∈

**Σ**

^{N}on the branch above

*u*, given the observations at the leaves of the tree and the previously estimated parameters. Obtaining these values (the “E” step of the EM algorithm) can be accomplished with a procedure that is closely related to Felsenstein's algorithm (Appendix C). These expected values are then treated as constants when solving the remaining maximization problem (the “M” step).

This “inner” maximization problem remains nontrivial, but it can be solved numerically with the same kind of algorithm we have rejected for the larger problem of estimating (**Q̂**,

*O*(

*nd*

^{2N}) time, versus

*O*(

*nL*′

*d*

^{2N}) time for the complete likelihood function). The efficiency of the optimization procedure can be further improved by using analytical methods to obtain partial derivatives, approximating the derivatives with respect to rate-matrix parameters, and initializing the parameter values on each iteration with the solution of the previous iteration. The partial derivatives with respect to rate-matrix parameters can be derived using the Cauchy integral formula, as shown recently by Schadt and Lange (2002). See Appendix C for details.

Rate variation can also be accommodated with EM, using the discrete gamma model. In this case, the update rule becomes

where*E*[

*S*(

*b*,

*a*,

*u*,

*l*)|

*z*

_{•},

**Q̂**

^{t}, π, τ,

^{t},

^{t}] is the expected number of substitutions of

*b*∈

**Σ**

^{N}for base

*a*∈

**Σ**

^{N}on the branch above

*u*for rate category

*l*. This quantity reflects the posterior probabilities of the rate categories at each site, as well as the posterior probabilities of substitutions. Note that the shape parameter α of the gamma distribution must be estimated as part of the inner maximization problem, and that the values

*r*

_{1}, … ,

*r*

_{k}are implicitly functions of this parameter. With rate variation, the cost of the E step is increased by a factor of

*k*, and the EM algorithm converges more slowly. See Appendix C for further details.

### Markov Dependence Between Sites

It is possible to allow for overlapping tuples of columns, and hence to capture context effects between all adjacent columns of the alignment *x*_{•}, by assuming Markov dependence of each column *x*_{•,i} on its *N* − 1 predecessors. In this case, the likelihood function can be written as

*P*(

*x*

_{•,i}|

*x*

_{•,i−N+1}, … ,

*x*

_{•,i−1}, ψ) can be computed as where ∑

_{x̃•,i}represents a sum over all possible columns of observed characters at site

*i*(all possible values at the leaves of the tree). The joint probability in the numerator of equation 5 is simply the probability of an observed

*N*-tuple of sites, and hence can be computed using an

*N*th order phylogenetic model. The sum in the denominator can be computed with the same model, using a slight adaptation of Felsenstein's algorithm that sums over all possible ancestral states at sites

*i*−

*N*+ 1, … ,

*i*, as usual, but also over all possible values of

*x*

_{•,i}(see details in Appendix B). The new algorithm uses the same principle that is used in the version of Felsenstein's algorithm that allows for missing data (Appendix A), and as a result, it differs from Felsenstein's algorithm only in its initialization step. Thus, the conditional probability

*P*(

*x*

_{•,i}|

*x*

_{•,i−N+1}, … ,

*x*

_{•,i−1}, ψ) can be computed with two passes through Felsenstein's algorithm, one for the numerator and one for the denominator of Equation 5, each with a different initialization strategy.

When Markov dependence between sites is combined with context-dependent substitution models, it is particularly useful to distinguish between sites of different types. Various *functional categories* of sites might be considered, for example, corresponding to first, second, and third codon positions, and noncoding regions. Alternatively, categories might correspond to different evolutionary rates, or to combinations of functional role and evolutionary rate (Siepel and Haussler 2004). In general, *k* categories are described by *k* phylogenetic models, ψ = (ψ_{1}, … , ψ_{k}), reflecting the different rates and/or patterns of substitution observed in the different categories. If each site can be assigned a category a priori, with *c*(*i*) being the category of the *i*th site (1 ≤ *i* ≤ *L*; 1 ≤ *c*(*i*) ≤ *k*), then in the 1st order case (*N* = 1) we have *P*(*x*_{•}|ψ) =

*LP*(

*x*

_{•,i}|ψ

_{c(i)}). In the general

*N*th order case, different context effects for different categories, as well as different rates and patterns of substitution, can be accommodated similarly, by altering equation 4 so that the probability of each

*x*

_{•,i}is conditioned on

**ψ**

_{c(i)}. For example, with

*N*= 2 and

*k*functional categories, equation 4 would be replaced by: When the categories are not known a priori, it is possible to consider all possible assignments of sites to categories by treating the model as a hidden Markov model, with categories corresponding to states, and emissions corresponding to alignment columns. Such an approach also allows the category of each site to be predicted (Felsenstein and Churchill 1996; Yang 1995; Goldman, Thorne, and Jones 1996; Pedersen and Hein 2003; Siepel and Haussler 2003).

The Markov model presented here is clearly more limited in several respects than the process-based models of Jensen and Pedersen (2000; Pedersen and Jensen 2001). In particular, context effects cannot “cascade” in both directions along a single branch of the tree because inferences are restricted to *N*-tuples of bases along each branch. Furthermore, because the interdependencies are ignored between the ancestral states associated with overlapping column tuples, there is no globally consistent probabilistic treatment of the latent variables and their interdependencies. Nevertheless, the conditional distributions defined for this (*N* − 1)st order Markov process are valid probability distributions, and they legitimately combine to produce a valid likelihood: the sum of the probabilities of all alignments is 1 for a given *L* and *n*. Most importantly, the model allows efficient computation of likelihoods, and (as will be seen below) fits the data well.

Instead of estimating parameters separately for each functional category under the assumption of independent tuples, parameters can be estimated directly with respect to the entire Markov model (equation 4), using a quasi-Newton algorithm (the EM algorithm described above can no longer be used). It is reasonable to expect, however, that parameter estimates will be similar to those obtained under an assumption of independent tuples. This conjecture is examined experimentally later in this article and will be found to hold reasonably well.

### Data

Our noncoding data set was drawn from a 1.8-megabase region of human chromosome 7 containing the CFTR gene and homologous sequences from eight other eutherian mammals (Thomas et al. 2003) (see species in fig. 4). The sequences were aligned with a new program called TBA (for “Threaded Blockset Aligner”; Blanchette et al. 2004), which was designed specifically for alignment of megabase-sized regions of multiple mammalian genomes. TBA builds a multiple alignment from local pairwise alignments, using a progressive alignment strategy and a predefined phylogenetic tree. It can “thread” a set of locally aligned blocks with respect to one sequence, such that every base of that sequence is represented exactly once, and all bases appear in order. In this case, the tree topology of Figure 4 was used, and the alignment was threaded with respect to the human sequence.

Sites were selected from this alignment corresponding to *ancestral repeats* (ARs)—transposons that appear to have been dispersed, and then become inactive, prior to the divergence of the species in question, and that are believed to have been evolving more or less neutrally since that time. Ancestral repeats were identified in the human sequence, using the program RepeatMasker (http://ftp.genome.washington.edu/cgi-bin/RepeatMasker/), and were mapped to the coordinate system of the multiple alignment. The corresponding columns were then extracted and combined into a subalignment, referred to below as the “AR alignment.” Essentially the same set of repeat subfamilies was considered as in recent papers by the Mouse Genome Sequencing Consortium (2002) and Hardison et al. (2003).

The AR alignment was post-processed in various ways to eliminate artifacts from both the alignment procedure and the extraction of columns. To diminish the effects of alignment errors adjacent to indels (due to the placement of gaps to optimize an alignment score), a heuristic was used by which the 3 bases adjacent to each indel were discarded, as were maximal ungapped subsequences shorter than 15 bases in length (discarded bases were replaced by missing data characters—‘N's). Subsequently, columns having actual bases (non-gaps and non-‘N's) present in fewer than four species were also removed, so that sites at which two-thirds or more of the species either had gaps or bases bordering gaps were excluded from our analysis. Whenever one or more columns were removed, they were replaced with *N* − 1 columns of missing data (where *N* is the order of the substitution model to be applied), so that no artificial context was introduced (this “padding” procedure was also used when the AR sites were originally extracted from the complete alignment). The final, cleaned alignment consisted of 162,743 sites, with an average of 5.7 species represented at each site, and a set of padding columns about once every 36 sites. In the end, our results appeared not to be very sensitive to the alignment and post-processing methods (see *Discussion*).

For the coding data set, we used mRNA sequences from various mammalian species that match known genes in the human genome. (It is not yet possible to assemble a data set of coding sites from mammalian genomic DNA with comparable size and species diversity to the noncoding data set described above; the CFTR data set, for example, contains only about 20,000 sites in coding regions.) Using the alignments of nonhuman mRNAs to the human genome that are available from the UCSC Human Genome Browser (Kent et al. 2002), we selected those genes from a nonredundant set of RefSeq genes (Pruitt and Maglott 2001) having aligned mRNAs from at least two of eight (nonhuman) mammalian species. The same species were used as in the CFTR data set, and only autosomal genes were considered. The best matching mRNA from each species was selected for each gene, with a requirement that this mRNA matched the human genome significantly in no other place. (This rule acted approximately like a reciprocal-best criterion for orthology, but it was somewhat more conservative; it was a more natural choice here because of the asymmetry between the human genome and the nonhuman mRNAs, and because of the redundant nature of the mRNA data.) The coding exons of each RefSeq gene were extracted from the human genome, and re-aligned to the putatively orthologous mRNAs using the T-Coffee program (Notredame, Higgins, and Heringa 2000). The resulting multiple alignments were then cleaned by removing all sites upstream of the start codon and downstream of (and including) the stop codon in the human sequence, all sites with gaps, and (maximal) gapless segments of the alignment shorter than 30 sites in length. In addition, alignments were discarded in which frame-shift indels or premature stop codons occurred in any species (if the anomaly occurred in the last 20% of sites in the gene, only the downstream portion was discarded; the criterion for frame-shift indels was based on the total number of gap characters in each sequence between retained gapless blocks of the alignment, relative to the number in the human sequence). After the cleaning procedure, 2,441 alignments (genes) remained, with an average of 3.4 species per alignment (one of which was always human). Aside from the human genes themselves, this data set is dominated by rodent sequences (2,396 alignments include mouse or rat), and in particular by three-way human/mouse/rat alignments (1,375 alignments); however, cow is fairly well represented (695 alignments), as is pig (433 alignments). Chimp and baboon have the poorest representation, each being present in fewer than 30 alignments. For the purposes of fitting phylogenetic models, all 2,441 alignments were concatenated into a single large alignment consisting of 3,081,993 sites; the species that were absent for each gene were represented in the corresponding columns of this alignment by missing data characters (‘N's). This alignment has no gaps or stop codons and maintains reading frame completely, with every consecutive triplet of columns representing a column of codons. It is referred to below as the “mRNA alignment.”

## Results

This section begins with results for the noncoding data set (the AR alignment), where all sites are treated equally, and then turns to results for the coding data set (the mRNA alignment), where it is important to account for the codon position of each site. In both cases, the likelihoods of various models will first be compared, then the estimates obtained for model parameters will be discussed.

### Results for Noncoding Data

Various models, with orders from *N* = 1 to *N* = 3, were fitted to the AR alignment, both assuming constant rates across sites and allowing for rate variation (discrete gamma method, *k* = 4 rate categories). Parameters were estimated with the EM algorithm described above, assuming independence of column tuples (with context-dependent models, the sites of the alignment were arbitrarily partitioned into adjacent *N*-tuples; see below). In all cases, the tree was assumed to have the topology shown later in figure 4, which has been confirmed to be correct by the presence and absence of particular instances of interspersed repeats in the sequences of the CFTR data set (Thomas et al. 2003). Alignment gaps were treated as missing data. Recall that the alignment consisted of *L* = 162, 743 sites, and that alignment gaps and ‘N's accounted for roughly one-third of the characters in each alignment column (on average).

### Model Likelihoods

The log likelihoods of all models, under the assumption of constant rates, are shown in Figure 1*a*. Values are plotted relative to the simplest model, HKY. A striking improvement can be seen as the order of the models is increased from *N* = 1 to *N* = 2, and again from *N* = 2 to *N* = 3, indicating strong context effects in the substitution process. Indeed, these improvements far exceed what is achieved by increasing parameter complexity in models of the same order. Nevertheless, most of the improvements among models of the same order do appear to be significant. In several cases, such models can be compared using the standard likelihood ratio test (Huelsenbeck and Rannala 1997), which is applicable between “nested” models, *M*_{1} ⊆ *M*_{2}; here, HKY ⊆ REV ⊆ UNR, and R*N*S ⊆ R*N* ⊆ U*N* ⊇ U*N*S, for *N* = 2, 3. With all comparable pairs of models, the improvement of the more complex one over the less complex one is statistically significant by this test (*P* < 0.0001), except in the case of U3 versus U3S (*P* = 0.2).

The likelihood ratio test is not applicable for models of different order, and it can be misleading with large data sets (given enough data, it is possible to obtain small *P* values for complex models that fit the data only slightly better than simpler counterparts). The Bayesian information criterion (BIC) (Schwartz 1979) is an alternative way of evaluating improvements in likelihood vis-à-vis increases in model complexity, which can be used for non-nested models, and which considers the size of the data set. The BIC strongly supports the use of second-order models over single-nucleotide models, and it strongly supports third-order models over second-order models (fig. 1*b*). Among second-order models, it finds strand symmetry (compare R2S and R2, U2S, and U2) to be a preferable constraint on the rate matrix to reversibility (compare R2S and U2S, R2, and U2), with U2S achieving the best (lowest) score overall. This relationship does not hold for the third-order models, where the cost of the additional parameters of unrestricted models outweighs the resulting improvements in likelihood, and the assumption of reversibility is supported. We note, however, that the BIC is known to tend to overpenalize complex models (Hastie, Tibshirani, and Friedman 2001).

Figure 2 shows, for selected models, the improvements in likelihood obtained by allowing rate variation and overlapping column tuples. Allowing rate variation with this data set makes only a small difference in likelihood (for most models, allowing rate variation is supported by the likelihood ratio test but not by the BIC), and leaves the relative scores of the models essentially unchanged (fig. 2*a*). Apparently, the rate of substitution is not highly variable in these AR sites, consistent with the assumption of neutral evolution. In contrast, a very large increase in likelihood is observed when Markov dependence between columns is introduced—indeed, the improvement over the single-nucleotide models is nearly doubled for dinucleotide models and increases by about half for nucleotide-triplet models, despite no change in the number of parameters of the model (fig. 2*b*). Notice that, if tuples are assumed independent, then context is ignored at every *N*th boundary between adjacent sites; if Markov dependence is subsequently introduced, then *N* boundaries are considered in place of each *N* − 1, for a relative gain of

*b*represent a lower bound on what can be obtained with the Markov-dependent models; direct optimization of the parameters of these models could improve likelihoods further (see below).

Two additional experiments were performed to test whether the increased likelihoods of context-dependent models, with and without Markov dependence of sites, might still somehow be an artifact of the construction of the models. The first experiment was a twofold cross-validation test based on the same data set (the AR alignment). The alignment was divided in half, selected models were trained on approximately the first *L*/2 sites, and likelihoods were evaluated on the second *L*/2 sites (*L* = 162,743); then the procedure was reversed, with training performed on the second half and likelihoods evaluated on the first. The second experiment involved randomizing the columns of an alignment, so that context was effectively erased, then fitting various models to the randomized data, and comparing likelihoods with those obtained from the original alignment. This experiment was performed with the first half of the AR data. The results of these two experiments, shown in figure 3, confirm that the improved likelihoods of the context-dependent models are due to their ability to capture real properties of the biological data, and not to statistical artifacts.

Finally, we tested the conjecture that parameter estimates can reasonably be obtained under an assumption of independent tuples, and then applied in a model that assumes Markov dependence of sites. At the same time, we examined whether it made any difference how a data set was partitioned into independent *N*-tuples, and whether those tuples were nonoverlapping. We fitted the U2S model to the AR alignment in four different ways: by assuming independent tuples (pairs) and training on (1) the “odd” pairs (all adjacent pairs (*x*_{•,i}, *x*_{•,i+1}) such that *i* is odd) (2) the “even” pairs, (3) both the odd and even pairs (so that each site was actually considered twice); and (4) by optimizing the full Markov model directly (computationally expensive but possible with *N* = 2). The likelihood of each model was then evaluated on the same data under the opposite assumption—that is, Markov dependence in the first three cases and independent tuples in the fourth—and all likelihoods (training and testing) were compared under each assumption. The likelihoods were very similar in all cases (within about 75 units of log likelihood), as were the actual parameter estimates (results not shown).

### Parameter Estimates

All parameter estimates discussed in this section were obtained with the EM algorithm, under the assumption of independent tuples. The estimates of branch-length parameters were very similar under all models (those for U3S with rate variation are shown in fig. 4*a*). Indeed, estimates for 2nd and 3rd order models differed from the values shown in figure 4*a* by an average of only 1% to 2%, and estimates for the HKY model without rate variation (the simplest model considered) differed from them by an average of only 5.2%. In general, branch-length estimates increased slightly with the number of parameters in the substitution model, and in particular with the introduction of rate variation, as has been noted in studies of independent-site models (Yang, Goldman, and Friday 1994). Estimates of the shape parameter α of the gamma distribution were very similar for models of the same order, but they increased slightly with model order (from about 8 for *N* = 1 to about 11 for *N* = 3). The ratio of the expected transition rate to the expected transversion rate,

_{ts}/

_{tv}(Appendix B), was very similar for models of the same order, but it increased slightly with model order (2.03 for

*N*= 1, 2.12–2.14 for

*N*= 2, and 2.16–2.17 for

*N*= 3).

More interesting are the estimates of specific rate-matrix parameters, which provide insight about context-dependent substitution in noncoding DNA. For context-dependent models, the elements of the estimated rate matrix **Q** (which, in the case of reversible models, also incorporate the equilibrium frequencies) spanned a wide range of values (e.g., 0.05–10.9 for the U3S model) and tended to cluster into three main groups, corresponding to transversions, transitions, and CpG transitions. CpG transversions (CG→AG and CG→GG, and their reverse complements) occurred at rates comparable to those of non-CpG transitions. Comparisons of context-dependent models with independent-site models (fig. 5) confirmed that the most pronounced context effects corresponded to CpG transitions, although CpG transversions also occurred at a significantly higher rate than expected (note that the mechanism behind CpG transversions is believed to be different from that behind CpG transitions; see, e.g., Blake, Hess, and Nicholson-Tuell 1992). CpG effects also appear to explain, at least in part, the somewhat poor fit of reversible, context-dependent models (as described above). When estimates of substitution rates under the U3 and U3S models are compared (fig. 6*a*), they are seen to be very closely correlated, indicating that the constraint of strand symmetry does not prohibit a near-optimal version of the matrix from being obtained. A similar comparison of the U3 and R3 models, however (fig. 6*b*), shows substantially poorer agreement for certain classes of substitutions, particularly CpG transversions (*r* = 0.55 for CpG transversions under U3 vs. U3S and *r* = 0.36 for CpG transversions under U3 vs. R3). The rates of these transversions appear to be systematically overestimated under R3, and the rates in the reverse direction appear to be underestimated.

It is worth noting that our results do not reflect the transcription-associated mutational asymmetry recently reported by Green et al. (2003), based on an analysis of the same sequence data, although a substantial portion of the sites we have considered are believed to be transcribed. Three of the nine genes in the region in question, however, are located on the opposite strand from the other six, and so our data set contains a mixture of bases not only that are and are not transcribed, but that correspond to the transcribed and nontranscribed strands. This is presumably why strand asymmetries are not apparent from a comparison of the U*N* and U*N*S models.

The estimated rates for the U3S model were compared to rates of context-dependent substitution in human pseudogenes estimated by Hess, Blake, and Blake (1994), in an expansion of the study by Blake, Hess, and Nicholson-Tuell (1992). The reported “corrected relative rates” were used, which are ratios with respect to the slowest estimated rate, with an upward correction applied to CpG substitutions. The results of the comparison are shown in figure 7. The two sets of estimates are remarkably consistent overall, considering the difference in methods and our use of ARs from a single, local region of the genome instead of pseudogenes from various chromosomes. The largest disagreements correspond to CpG transitions, for which we estimate considerably higher rates (1.8–2.8 times higher than would be predicted by the regression line of figure 7). These disagreements may reflect real differences in the two data sets—e.g., due to differences in the methylation patterns in AR sites versus pseudogenes—but it is likely that they are at least partially a consequence of a non-negligible rate of multiple substitutions per site, or of ascertainment bias in Hess, Blake, and Blake's (1994) methods (their selection procedure, which was designed to eliminate cases of multiple substitutions per site, may have been biased against sites with CpG transitions).

No simple explanations emerged for the variation in the estimates of non-CpG rates, which was considerable, both among transitions and among transversions (non-CpG transition rates for U3S had mean 0.653 and s.d. 0.203 and non-CpG transversion rates had mean 0.161 and s.d. 0.057). Indeed, several reported trends were not supported by our parameter estimates. For example, Hess, Blake, and Blake (1994) observed high rates of TA→CA transitions, but we did not find these rates to be unusual (fig. 7). Our estimates also did not show a significant dependency of the transition/transversion ratio on the A + T content of neighboring bases: the expected proportion of substitutions that are transversions, based on our estimated rate matrix, increased only slightly with the number of flanking A + T bases (under the U3 model, from 0.275 with 0 flanking A + T bases, to 0.290 with 1, and to 0.349 with 2), and this increase seemed to be largely a consequence of the CpG effect (when CpG substitutions were discarded the expected numbers were 0.334 for 0 A + T bases, 0.320 for 1, and 0.349 for 2). (We note that the proportions we estimated *including* CpG substitutions are comparable to those reported for primates by Zhao and Boerwinkle (2002) and Yang, Chen, and Li (2002), who appear not to have corrected for the CpG effect.) Our estimated rates also did not reflect the tendency observed in chloroplast genomes for a higher proportion of transversions when a 5′ pyrimidine is present (Morton, Oberholzer and Clegg 1997). As has been noted for chloroplast genomes (Morton, Oberholzer, and Clegg 1997), the pattern of context-dependent substitution in the noncoding DNA of mammals appears to be complex, and is not easily explained in terms of a small number of well-defined phenomena (such as the CpG effect and the transition/transversion bias).

### Results for Coding Data

A different set of models was fitted to the mRNA alignment, this time including the GYE and GYM versions of Goldman and Yang's (1994) codon model (Appendix B), as well as the HKY, REV, UNR, R3, and U3 models. Second-order models and strand-symmetric models were not considered. All models were fitted with and without rate variation. The EM algorithm was used, assuming independent column tuples (here corresponding to codons, for the third-order models), except for Goldman and Yang's models, which were fitted using the codeml program in the PAML package (Yang 1997). The same tree topology was assumed as in the previous section. Recall that the mRNA alignment consisted of 993 sites, *L* = 3, 081, with 3.4 species represented per site; sites with alignment gaps had been removed.

For the R3 and U3 models, the effect of assuming Markov dependence between columns was also examined. In this case (unlike in the previous section), three separate functional categories were considered, corresponding to the 1st, 2nd, and 3rd codon positions, and a separate third-order model was fitted to the sites of each category, using the EM algorithm. Thus, the third codon position model was treated like an ordinary codon model, but the first and second position models were trained on triples that straddled adjacent codons, so that context effects across codon boundaries were considered. All three models were incorporated into the likelihood calculation, as shown in equation 6.

### Model Likelihoods

Figure 8 shows the log likelihoods and BIC scores of all models, with and without rate variation. The codon models are seen to fit the data overwhelmingly better than the single-nucleotide models, indicating pronounced context-dependence among sites of the same codon. The R3 and U3 models, however, performed substantially better than Goldman and Yang's model; apparently, the pattern of codon substitution is complex, and is characterized only approximately by these simply parameterized models. The use of a physicochemical distance matrix (GYM) produced a relatively small improvement over the naive assumption of equal distances between amino acids (GYE), supporting Yang, Nielsen, and Hasegawa's (1998) conclusion that physicochemical distances and substitution rates are not strongly correlated (see below). We expect that the R3 and U3 models would improve similarly on other existing codon models, which have been reported to perform roughly as well as Goldman and Yang's model (Schadt and Lange 2002).

Allowing for rate variation made a large difference for all models, but a larger difference for single-nucleotide models and Goldman and Yang's codon model than for R3 and U3; the reason is probably that rate variation is being used to compensate for an oversimplified representation of substitution patterns (estimates of the shape parameter α were about 0.40 under the single-nucleotide models, 0.95 under GYE and GYM, and 1.4 under R3 and U3).

The U3 model outscored the R3 model by about 2,000 units of log likelihood (with and without rate variation), a sufficient improvement to reject the hypothesis of reversibility under the likelihood ratio test. The R3 and U3 models are almost equal according to the BIC, however, and indeed, R3 is slightly preferable in the case with rate variation. Despite some evidence against reversibility of codon substitution, it appears to be a reasonable modeling assumption, especially when its mathematical convenience is taken into consideration.

When Markov dependence between sites was introduced with the U3 and R3 models (fig. 9*a*), another large improvement in likelihoods occurred, indicating that context effects that cross codon boundaries are important (see also Pedersen, Wiuf, and Christiansen 1998). This improvement holds up well under cross-validation (fig. 9*b*). It remains to be determined to what extent the CpG effect is responsible for the observed improvements in likelihood (e.g., because of synonymous transitions at CpG), and in general, how similar the context effects in coding DNA (particularly at unconstrained sites) are to the ones observed in putatively neutral DNA.

### Parameter Estimates

The estimated branch lengths based on the mRNA alignment were similar for all models, but they were less consistent than the estimates for the noncoding data set. Results for selected models are shown in Table 2, with branch-length labels defined in figure 4*b*. (Throughout this section, estimates are presented for the versions of R3 and U3 that were trained like ordinary codon models.) The estimates for single-nucleotide models differed considerably at several branches from those for the codon models, and Goldman and Yang's model produced estimates at a few branches that differed significantly from those of R3 and U3. Most of the major differences, however, appeared in the subtree of the primates, and they should be interpreted with caution, because of sparse data for chimp and baboon, and because noise in the mRNA sequences will have a disproportionate effect on short branches. In general, these results suggest that simply parameterized codon models, and even independent-site models with rate variation, may be adequate for branch length estimation, despite the improved fit of richer models.

The relative proportions of branch lengths estimated for the noncoding and coding data sets are fairly similar overall (fig. 4), given the very different evolutionary pressures in coding and noncoding DNA, and the fact that the mRNA data are drawn from around the genome whereas the AR data come from a single local region. We interpret this general congruence as an indication that we have obtained a reasonable approximate characterization of the relative average rates of substitution on the branches of the tree, although the absolute values of the estimates are subject to many possible biases (for example, related to alignment). Interestingly, the ratio of the distances (in substitutions/site) to mouse and human from their common ancestor is estimated to be between about 2.8:1 (mRNA data, R3 with rate variation) and 3.4:1 (AR data, U3S with rate variation), which is considerably higher than the estimate of 2:1 given in the recent analysis of the mouse genome (Mouse Genome Sequencing Consortium 2002). Ratios similar to ours were estimated by Cooper et al. (2003).

Even more than with the noncoding data, a wide range was observed in the estimated substitution rates (from 0.001 to 4.63 for different codon substitutions under the U3 model). The estimated matrices for U3 and R3 were comparable. The estimated transition/transversion ratio,

_{ts}/

_{tv}, was similar for all codon models (e.g., estimates without rate variation were 2.17 for GYE and GYM and 2.15–2.16 for R3 and U3), and it was somewhat lower under the independent-site models (1.97 for HKY, REV, and UNR). Estimates of the ratio of the rates of synonymous and nonsynonymous substitution,

_{s}/

_{a}(Appendix B), were also similar under all codon models (2.80–2.82 for GYE and GYM, 2.80 for R3, and 2.72 for U3, without rate variation).

The amino acid substitution rates induced by the estimates for the R3 and U3 models (see Appendix B) showed good agreement with empirical substitution matrices, such as the one of Jones, Taylor and Thornton (1992; fig. 10). It appears that amino acid substitution rates can be estimated implicitly from nucleotide sequences, given enough data and a sufficiently rich parameterization of the substitution model. In contrast, the rates used by the GYM model, which are based on the physicochemical distances of Miyata, Miyazawa, and Yasunaga (1979; Appendix B), showed much weaker agreement with the induced rates of R3 and U3. These results support the conclusion of Yang, Nielsen, and Hasegawa (1998) that substitution rates and physicochemical distances are not related by a simple mathematical function, and they suggest that the use of physicochemical distances is a weakness of current codon models.

## Discussion

Allowing for context-dependent substitution makes phylogenetic inference considerably harder: when the assumption of site-independence is relaxed, a Markov random field arises, and standard methods can no longer be used for likelihood computation and parameter estimation (Jensen and Pedersen 2000). We have shown, however, that context-dependent substitution can be handled in an approximate way with some simple extensions of codon models, which themselves are direct extensions of Felsenstein's original framework. Our models (like codon models) require the assumption of certain limitations on the interdependence between sites (see below), but in return, they allow for exact inference without too much additional cost in computation. (It should be noted that our models span a wide range of computational demand, from the more simply parameterized second-order models, which are practical for use in ordinary phylogenetic analysis, to richly parameterized third-order models, which may be too costly for some purposes—see Appendix C.) The new models appear to fit real biological data substantially better than independent-site and codon models, in both coding and noncoding regions.

The improvements we observe appear to derive from three more or less separate properties of our models: (1) their ability to capture context-effects within independent *N*-tuples of sites; (2) their ability to allow these *N*-tuples to overlap; and (3) their rich parameterization of the substitution process. Judging by cases in which some apply but others do not, these three properties have roughly equal importance, and their effects are approximately additive. When *N*-tuples are assumed independent (property 1), as in codon and RNA models, standard methods can be used for likelihood computation. In addition, parameter estimation can be accomplished efficiently with an EM algorithm, which performs much better than general-purpose optimization algorithms on parameter-rich models and large data sets. Our models allow *N*-tuples to overlap (property 2) by assuming Markov dependence between sites. Likelihood computations are performed simply and efficiently, using a two-pass extension of Felsenstein's algorithm. Expectation maximization cannot be used to fit the Markov-dependent models directly, but parameter estimates based on an assumption of independent sites appear to be close to optimal.

The importance of a rich parameterization of the substitution process (property 3) indicates that the patterns of context-dependent substitution are complex in both coding and noncoding regions. In noncoding regions, the CpG effect is the single, strongest, clearly identifiable example of context-dependence, but it does not appear to be sufficient to explain the advantage of context-dependent models; judging by the wide variation in estimates of substitution rates, many more subtle effects also occur, and these may be significant in combination. The improvements in goodness of fit that occur when second-order models are replaced by third-order models suggest that important context effects are occurring at the level of nucleotide triplets; however, it is also true that the third-order models provide a better approximation of the actual process of context-dependent substitution (see below), so these increases should be interpreted with caution. Additional work will be required to sort out which are the most important context effects, and whether simpler parameterizations of context-dependent rate matrices can be justified.

In coding regions, a richly parameterized substitution model allows for significantly higher likelihoods than those of Goldman and Yang's codon model, and by extension, other codon models. Goldman and Yang's model is appealing in its use of a small number of parameters, each with a clear biological meaning, but it appears not to characterize the pattern of codon substitution as accurately as our more empirical models (it is important to note, however, that its characterization generally seems adequate for accurate estimation of branch lengths and quantities such as ρ_{ts}/ρ_{tv} and ρ_{s}/ρ_{a}). The most serious problem seems to be with parameterizing nonsynonymous substitution rates in terms of physicochemical distances between amino acids (Yang, Nielsen, and Hasegawa 1998). An alternative would be to use empirically derived amino acid substitution matrices (such as the JTT or Dayhoff matrices), but these matrices are estimated in a way that is not consistent with the assumptions of the continuous-time Markov models used in phylogenetic models (Goldman and Yang 1994). At the cost of additional parameters, our models allow empirical estimation of amino acid rates within this continuous-time Markov framework, and at the same time capture phenomena at the DNA level, such as the transition/transversion bias and the pattern of synonymous substitution. Thus, to an extent, they combine the benefits of the two competing types of models that Yang, Nielsen, and Hasegawa (1998) called “mechanistic” (codon models) and “empirical” (models applied at the amino-acid level, using empirical substitution matrices). On the other hand, our models rely on a large number of parameters—larger than in any phylogenetic model currently in wide use—which increases the computational burden of parameter estimation and implies that large data sets to obtain accurate estimates. The first of these problems, we argue, can be surmounted with better algorithms and faster hardware, and the second will be partially addressed by the large quantities of data generated by the genome sequencing projects. Nevertheless, models with hundreds of parameters clearly will still be inappropriate for some applications. It may be possible in some cases to use a model with a smaller number of parameters (such as Goldman and Yang's), in combination with amino-acid substitution rates that have been estimated on a larger data set, using a less constrained model (such as R3).

Although our models improve on the goodness of fit of independent-site and codon models, the question remains open of how they would compare to models that more faithfully describe the true process of context-dependent substitution, such as those of Jensen and Pedersen (2000; Pedersen and Jensen 2001). As discussed above, our models fail to capture the interdependence of the ancestral states associated with overlapping *N*-tuples, and thus they do not give a consistent probabilistic treatment of the latent variables. This deficiency could be rectified by defining the interactions between all variables, latent and observed, using a general graphical model (Jordan 1999). The result would be a kind of discrete Markov random field with loops of dependency. Although no efficient method exists for exact likelihood calculation from such models, there are efficient approximate algorithms (Murphy, Weiss, and Jordan 1999; Wainwright, Jaakkola, and Willsky 2003; Yedidia, Freeman, and Weiss 2000). Another issue is that our models do not allow context effects to “cascade” outside the boundaries of each *N*-tuple along an individual branch of the tree, as is allowed in the process-based models of Jensen and Pedersen. The importance of this limitation will decrease as the order of a model increases, but it may be non-negligible when *N* = 2 or *N* = 3; it could be assessed by comparing our models with process-based models using MCMC methods. In the end, exact inference with an approximate model (as described in this article) will have to be weighed against approximate inference with a (more) exact model, considering both the goodness of fit of the models and the computational burden of parameter estimation. It is likely that different approaches will be appropriate for different applications.

The usual caveats about modeling assumptions and alignment accuracy apply to all parameter estimates presented in this article. The assumption that the substitution process is homogeneous and stationary, in particular, undoubtedly does not hold in reality, e.g., due to regional variations in G + C content or differences between lineages in the pattern of substitution (G + C content in our AR data set from the CFTR region is relatively uniform, both across sites and across species, but it varies considerably in our genome-wide mRNA data set). It is not yet clear what effect violations of these assumptions may have. Aspects of this question could be examined by relaxing the assumption of homogeneity in the manner of the model of Yang and Roberts (1995). The coding sequences can be aligned with fairly high confidence, but the alignment of noncoding sequences may introduce certain biases. Experiments with alternative alignments—one for which the post-processing steps were omitted and one produced by the MAVID program (Bray and Pachter 2003)—indicated that the absolute values of branch-length estimates are somewhat sensitive to the alignment and cleaning procedure, but relative branch lengths are fairly robust, as are estimates of context-dependent substitution rates (results not shown). The parameter estimates for the coding data set could conceivably be influenced by erroneous assignments of orthology, or by the use of mRNA (rather than genomic) sequence data.

We have focused on using context-dependent phylogenetic models to estimate the pattern and rates of substitution on the branches of a tree of known topology, but their broader role in phylogenetic modeling is worth considering. In simulation studies concerned with the accuracy of topology reconstruction, maximum-likelihood methods (assuming independent sites) were found to be fairly robust against violations of site independence, but they did have somewhat diminished statistical power in the presence of such violations (Schöniger and von Haeseler 1995). Thus, it is reasonable to expect that context-dependent models will not dramatically improve estimates of topologies, but they might be helpful in certain cases, for example, with hard-to-resolve branchings near the root of the tree. On the other hand, consideration of context effects could lead to significant improvements in cases where phylogenetic models are used to discriminate between different types of sites, such as in gene finding (Pedersen and Hein 2003; Siepel and Haussler 2004) and secondary structure prediction (Goldman, Thorne, and Jones 1996). In addition, context-dependent models may be helpful in understanding the evolution of isochores, in which the CpG effect has been suggested to play an important role (Fryxell and Zuckerkandl 2000).

## Supplementary Material

The AR and mRNA alignments and complete descriptions of parameter estimates for all models are available at www.cse.ucsc.edu/∼acs/context/. The software used for parameter estimation is available from the authors upon request (acs@soe.ucsc.edu).

### Appendix A: Felsenstein's Algorithm

Let *u* be a node in the tree, and let *x*

*u*at site

*i*,

*x*

*x*

_{v,i}|

*v*∈

*v*is a descendant of

*u*}. Felsenstein's algorithm computes

*P*(

*x*

*X*

_{u,i}=

*a*) for all nodes

*u*and characters

*a*, using the following recursive rule (here

*v*and

*w*are the children of

*u*): where

*I*is the indicator function. The total probability of the column is obtained as

*P*(

*x*

_{•,i}|ψ) = ∑

_{a}π

_{a}

*P*(

*x*

_{r,i}|

*X*

_{r,i}=

*a*). Felsenstein's algorithm is a special case of what is known in the graphical models literature as the “elimination” algorithm; it is closely related to the “peeling” algorithm from statistical genetics, and the “forward” algorithm for hidden Markov models (Jordan, 2004). The algorithm requires

*O*(

*nLd*

^{2}) time to compute

*P*(

*x*

_{•}|ψ), for

*n*taxa, an alignment of length

*L*, and an alphabet of size

*d.*

Felsenstein's algorithm can be adapted easily to accommodate missing data. If a random variable *X*_{u,i}, corresponding to a leaf *u*, is somehow unobserved (as when alignment character *x*_{u,i} = ‘N’), then the algorithm can be made to sum over all possible values of *x*_{u,i} by changing the base case of the recursion to *P*(*x*

*X*

_{u,i}=

*a*) = 1 (for all

*a*). The resulting algorithm has the desirable property of returning a probability

*P*(

*x*

_{•,i}

**|Q**,

**π**,

**τ**,

**β**), for a column

*x*

_{•,i}with missing data, that is equal to the probability

*P*(

**Q**,

**π**,

**τ**′,

**β**′), for the induced subtree (

**τ**′,

**β**′) of (

**τ**,

**β**) and a version

*x*

_{•,i}that has the missing elements removed. Thus, the probability of a column

*x*

_{•,i}with missing data does not depend on the branches leading to the “missing leaves,” and when fitting a model, the column

*x*

_{•,i}has no bearing on such branches. Alignment gaps are sometimes treated as missing data.

### Appendix B: Context-Dependent Models

#### General *N*th-Order Models

Let δ(*a*, *b*) be the number of characters by which tuples *a* ∈ Σ^{N} and *b* ∈ Σ^{N} differ. We define the general unrestricted substitution model U*N*, in terms of free parameters of the form

_{a,b}, as: This model has

*N*(

*d*− 1)

*d*

^{N}free parameters, where

*d*is the size of the alphabet in question. The general reversible substitution model R

*N*is the same as U

*N*except for the constraint that π

_{a}

*q*

_{a,b}= π

_{b}

*q*

_{b,a}, and the strand-symmetric models U

*N*S and R

*N*S are versions of U

*N*and R

*N*, respectively, such that

*q*

_{a,b}=

*q*a⃖, b⃖ if a⃖ and b⃖ are the reverse complements of

*a*and

*b*, respectively. Notice that R1 is equivalent to REV, and U1 to UNR.

Once a maximum-likelihood estimate

*N*th-order model has been obtained, the expected rates of particular types of substitutions can easily be obtained from the estimated rate matrix,

**Q̂**. For example, the transition rate is given by (Note that equation B.1 ensures that every non-zero element of

**Q**corresponds to a single-nucleotide substitution.) Because of the scaling constraint on the matrix, the transversion rate is simply

_{tv}=

*N*−

_{ts}, and the transition/transversion ratio is

_{ts}/

_{tv}=

_{ts}/(

*N*−

_{ts}). The invariance property of maximum likelihood estimators ensures that estimates obtained in this way are maximum likelihood estimates (Goldman and Yang 1994).

#### Codon Models

A *codon model* is a third order phylogenetic model **ψ** = (**Q**, **π**, **τ**, **β**), defined with respect to a DNA alphabet, such that the equilibrium frequency of each stop codon is zero. A codon model **ψ** has the property of inducing an amino acid model **ψ**′ in the following way. Let **ψ**′ = (**Q**′, **π**′, **τ**, **β**), with **Q**′ and **π**′ defined with respect to an amino acid alphabet, Σ′. Let α(*c*) either indicate the amino acid *a* encoded by codon *c* (*a* ∈ Σ′, *c* ∈ Σ^{3}) or have a null value if *c* is a stop codon. Then the elements of **π**′ are given by

_{c:α(c) = a}π

_{c}, and the elements of

**Q**′ by

_{c1:α(c1) = a1}∑

_{c2:α(c2) = a2}

*q*

**ψ**is reversible then

**ψ**′ will also be reversible.

The codon model of Goldman and Yang (1994), as adapted by Yang, Nielsen and Hasegawa (1998), is defined by the following parameterization of **Q** (here denoted **Q** = {*q*

**D**= {

*d*

*a*

_{1},

*a*

_{2}∈ Σ′) is a symmetric amino-acid distance matrix (with

*d*

_{a,a}= 0 for all

*a*∈ Σ′),

*d*

_{max}is the maximum value in

**D**, and γ

_{1}, γ

_{2}are scaling parameters. The model assumes that

**D**is defined a priori. Goldman and Yang originally used the physicochemical distances of Grantham (1974), but Yang, Nielsen, and Hasegawa (1998) subsequently compared several different physicochemical distance measures and found the one of Miyata, Miyazawa, and Yasunaga (1979) to perform best. They included an “equal distance” model in their comparison, with

*d*

*I*(

*a*

_{1}≠

*a*

_{2}). In this article, the equal distance model is referred to as “GYE,” and the Miyata, Miyazawa, and Yasunaga model as “GYM.”

Quantities of particular interest, once the parameters of a codon model have been estimated, are the expected rates of synonymous (ρ_{s}) and nonsynonymous (ρ_{a}) substitution, and their ratio, ρ_{s}/ρ_{a}. Estimates of these expected rates can be derived from the estimated rate matrix, using the method described above (see equation B.2). They can be used to estimate the ratio *K*_{s}/*K*_{a} (or its inverse, *K*_{a}/*K*_{s}), as described by Goldman and Yang (1994). Note that the convention with codon models has been to scale **Q** such that the expected number of substitutions per *codon* is one, rather than the expected number per *site* (as we propose above). Our convention allows parameter estimates from models of different orders to be compared more easily.

#### The Extension of Felsenstein's Algorithm

The quantity in the denominator of equation 5, ∑

*P*(

*x*

_{•,i−N+1}, … ,

*x*

_{•,i−1},

*x̃*

_{•,i}|ψ), is computed using the missing-data principle described in Appendix A. Let

*Z*

_{u,i}= (

*X*

_{u,i−N+1}, … ,

*X*

_{u,i}) and

*z*

_{u,i}= (

*x*

_{u,i−N+1}, … ,

*x*

_{u,i}). In addition, let

*y*

*u*for sites

*i*−

*N*+ 1, … ,

*i*− 1 (that is, excluding site

*i*), so that

*P*(

*y*

**|**

*Z*

_{u,i}=

*a*) is the probability of those characters given that node

*u*is assigned tuple

*a*(compare to

*x*

*a*∈ Σ

^{N}

*partially matches*tuple

*b*∈ Σ

^{N}if

*a*and

*b*share their first

*N*− 1 characters, and let us denote this relationship as

*a*≈

*b.*The sum in question can be computed using a version of Felsenstein's algorithm (equation A.1) in which the base case of the recursion is altered as follows: As usual, a final step must be performed at the root

*r*of the tree to obtain the desired value: ∑

*P*(

*x*

_{•,i−N+1}, … ,

*x*

_{•,i−1},

_{•,i}|ψ) = ∑

_{a}π

_{a}

*P*(

*y*

*Z*

_{r,i}=

*a*).

### Appendix C: Parameter Estimation by EM

#### Derivation of the Update Rule

By the standard EM algorithm, each new estimate (**Q̂**^{t+1},

^{t+1}) is determined from the previous estimate (

**Q̂**

^{t},

^{t}) as Taking advantage of independence of column tuples, the fact that each

*P*(

*z*

_{•,j},

*z*

_{∘,j}|

**Q**, π, τ, β) (for 1 ≤

*j*≤

*L*′) is simply a product over the edges of the tree, and the fact that π and τ are constants in the maximization, this equation can be rewritten as Now, with

^{t}= (

**Q̂**

^{t}, π, τ,

^{t}) for notational convenience, let

*E*[

*S*(

*b*,

*a*,

*u*)|

*z*

_{•},

*z*

_{∘},

^{t}] be the expected number of substitutions of

*b*∈

**Σ**

^{N}for

*a*∈

**Σ**

^{N}on the edge above node

*u*, given

*z*

_{•},

*z*

_{∘}, and the previous parameter estimates, and let be the expectation of the same quantity considering all possible values of the latent variables. Because

*z*

_{u, j},

*z*

_{σ(u), j}∈

**Σ**

^{N}, the terms in equation C.1 of the form log

*P*(

*z*

_{u, j}|

*z*

_{σ(u), j}, β

_{u}) can be grouped by pairs

*a*,

*b*∈

**Σ**

^{N}such that

*z*

_{u, j}=

*b*and

*z*

_{σ(u), j}=

*a*. Using this idea, and substituting from equation C.2, we obtain equation 2 from the text:

#### The Expectation Step

The expectation (E) step of the EM algorithm consists of computing the expected numbers of substitutions of each type along each edge of the tree (equation C.2). It is not hard to show that, when independence of column tuples is assumed, these values can be obtained directly from the posterior probabilities of each type of substitution at each tuple of sites:

These posterior probabilities can be computed efficiently by combining the intermediate values of Felsenstein's algorithm with a set of complementary values, in an “inside-outside” procedure analogous to the “forward-backward” algorithm used with hidden Markov models (Durbin et al. 1998). (Both are instances of a general “sum-product” algorithm for graphical models [Jordan 2004]). Let*z*

_{ū, j}denote the observed characters at site

*j not*beneath node

*u*; that is,

*z*

_{ū, j}=

*z*

*z*

*a*at site

*j*is where the probabilities in the numerator are implicitly conditioned on

^{t}. The quantities of the form

*P*(

*z*

_{ū, j},

*Z*

_{u, j}=

*a*) can be computed recursively, as follows. Let

*v*be the parent of

*u*and let

*w*be the sibling of

*u*(σ(

*u*) = σ(

*w*) =

*v*). Then, Equation C.4 yields quantities equivalent to the marginal probabilities for ancestral states that were introduced by Yang, Kumar, and Nei (1995). The algorithm of Koshi and Goldstein (1996) for computing these probabilities appears to be essentially equivalent to the inside-outside procedure described here.

The probability of each type of substitution along each edge is given by

See Friedman et al. (2002) for an alternative derivation of these posterior probabilities, which assumes reversibility.When rate variation is allowed, separate expected numbers of substitutions are required for each rate category *l* (equation 3). In this case,

*Y*

_{j}is a random variable indicating the rate category of the

*j*th tuple and is the posterior probability of

*Y*

_{j}=

*l*. In addition,

*P*(

*Z*

_{u, j}=

*b*,

*Z*

_{σ(u), j}=

*a*|

*Y*

_{j}=

*l*,

*z*

_{•, j},

^{t},

^{t}) is simply

*P*(

*Z*

_{u, j}=

*b*,

*Z*

_{σ(u), j}=

*a*|

*z*

_{•, j},

**Q̂**

^{t}, π, τ,

*r*

_{l}

^{t}). This probability can be obtained as above (equations C.4, C.5, and C.6) but with

^{t}.

#### The Maximization Step

The maximization (M) step consists of finding the parameter values (**Q̂**,

*C*

_{b,a,u}=

*E*[

*S*(

*b*,

*a*,

*u*)|

*z*

_{•},

**Q̂**

^{t}, π, τ,

^{t}] can be considered constants (see equation 2). This can be accomplished with a quasi-Newton algorithm, given the partial derivatives of

*f*(

**Q**,

**β**) with respect to the free parameters of the model. The partial derivative with respect to a free parameter

**Q**is diagonalizable as

**S**Λ

**S**

^{−1}, and let the elements of

**S**and

**S**

^{−1}be denoted {

*s*

_{a,b}} and {

*a*,

*b*∈

**Σ**

^{N}. Similarly, let the diagonal elements of

**Λ**be denoted {λ

_{a}},

*a*∈

**Σ**

^{N}. (Note that elements of

**S**,

**S**

^{−1}, and

**Λ**may be complex-valued in the case of unrestricted models.) Thus, and if

_{u}, If

**Q**β

_{u})]

_{a,b}are the elements of the matrix

**Q**β

_{u}), which Schadt and Lange (2002) have shown, using a derivation based on the Cauchy integral formula, to be given by where ∘ denotes the Hadamard (pointwise) product of two matrices, and

**F**= {

*f*

_{a,b}} is defined as The partial derivative for each branch-length parameter can be obtained in

*O*(

*d*

^{3N}) time, but the derivative for each rate-matrix parameter requires

*O*(

*nd*

^{3N}) time. With context-dependent models, the computation of rate-matrix derivatives is a bottleneck for the EM algorithm, and it is worthwhile to approximate them. Using only the first few terms of the Taylor expansion of exp(

**Q**

*t*) turns out to accelerate the computation considerably. For example, if the first four terms are used, where

**Q**′ =

**Q**. Because

**Q**′ is very sparse for our models (it has between 2 and 8 non-zero elements, if the constraint on scaling is temporarily relaxed), matrix multiplications of the form

**XQ**′

**Y**can be accomplished in

*O*(

*d*

^{2N}) time. In addition, the matrices

**Q**

^{k}(

*k*≥ 2) can be precomputed and reused for all rate-matrix parameters. As a result, the time required to compute each rate-matrix derivative is effectively reduced by a factor of

*d*

^{N}(provided the number of rate-matrix parameters is large enough, as in our case).

Unfortunately, even when more than four terms of the Taylor expansion are used, the approximate derivatives are not sufficient for precise maximization of *f* (**Q**, **β**). They are useful, however, in early iterations of the EM algorithm, in which approximate maximization is adequate. Our software switches to exact derivatives when the approximate ones no longer lead to significant improvements in likelihood.

When rate variation is allowed, *f* (**Q**, **β**) becomes

*C*

_{b,a,u,l}=

*E*[

*S*(

*b*,

*a*,

*u*,

*l*)|

*z*

_{•},

**Q̂**

^{t}, π, τ,

^{t},

^{t}] (equation 3). The computation of branch-length and rate-matrix derivatives remains essentially unchanged, but now involves an additional sum over the

*k*rate categories, resulting in a factor of

*k*increase in computational time. In addition, the partial derivative with respect to the shape parameter α is now required. It can be computed as We simply obtain

#### Performance

The performance of the EM algorithm depends on many factors, including the size of the alignment (number of species *n* and number of sites *L*), the order of the model, the number of rate-matrix parameters, and the suitability of the starting values for parameters. Nevertheless, a rough idea of its performance can be obtained from the running times recorded in our experiments with the AR data set (*n* = 9 species, *L* = 162,743 sites). Experiments were performed on a desktop system with a 2.4 GHz Pentium IV processor (500 MB of RAM, Red Hat Linux 7.3), using an implementation of the algorithm written in C and compiled with gcc. The software relies on the LAPACK package (Anderson et al. 1999) for matrix diagonalization, and uses our own implementation of the BFGS quasi-Newton algorithm, based on the one described by Press et al. (1992). Parameters were initialized to arbitrary values (0.1 for all branch-length parameters, rate matrix parameters such that transitions had rates five times greater than transversions). Without rate variation, first-order models required 48.5 s (HKY), 49.2 s (REV), and 83.4 s (UNR), second-order models took between 12.3 min (R2S) and 47.4 min (U2), and third-order models between 18.9 h (R3S) and 6.6 days (U3). (Unrestricted models took considerably longer than reversible ones, in part because of complex arithmetic in derivative computations.) For comparison, the baseml program of the PAML package (version 3.13; Yang 1997) required 95 seconds for HKY and 133 seconds for REV, using the same data set and hardware. When a quasi-Newton algorithm was used in place of EM for context-dependent models, running times increased substantially, e.g., from 39.5 min to 18.0 h with the U2S model. The performance of the EM algorithm was significantly degraded when rate variation was allowed (*k* = 4 rate categories), and it depended heavily on the initial value of α; e.g., with a starting value of α = 1, the REV model took 672 seconds, but with a starting value of α = 8, it required only 158 seconds. We did not try to fit context-dependent models using poor starting values of α; with a starting value of 10, the U2S model took 93.4 min and the U3S model took 8.2 days (compared to 39.5 min and 4.8 days, respectively, without rate variation).

The EM algorithm generally required between 20 and 60 iterations to converge (sometimes more with rate variation). The M step was essentially instantaneous for first-order models, and for second-order models (without rate variation), it typically required between 1 and 5 s with the approximate derivatives, and between 5 and 30 s without them (occasionally 100–200 s, as when first switching to exact derivatives). The time of the M step varied widely for third-order models, depending on the number of iterations required to reach convergence; typical times with the approximate derivatives were between 1 and 20 min, and typical times without them were 20–100 min, but sometimes 8–10 h were required, and occasionally, 48 h or more. The E step, in contrast, was very consistent from one iteration to the next; it required 1.9 s for the UNR model, 38 s for the U2S model, and 27 min for the U3S model (without rate variation). Allowing rate variation increased the time of the E step by a factor of *k*, and the cost of the M step by at least a factor of *k* (usually somewhat more). For this reason, our software first fits a model approximately without rate variation, then switches to the discrete gamma model.

It is worth noting that the most computationally demanding parts of the algorithm—the computation of posterior probabilities of substitution in the E step, and of partial derivatives with respect to free parameters in the M step—are readily parallelizable. We expect that a parallel implementation of the algorithm would considerably reduce the time required to fit third-order models in particular.

Hervé Philippe, Associate Editor

**Table 1**

Model | N a | Parametersb | Description |
---|---|---|---|

HKY | 1 | 1 + 3 | Reversible single-nucleotide model of Hasegawa, Kishino, and Yano (1985), which allows only for non-uniform equilibrium frequencies and a transition/transversion bias |

REV | 1 | 6 + 3 | General reversible single-nucleotide model (Tavaré 1986) |

UNR | 1 | 12 | General unrestricted single-nucleotide model (Yang 1994a) |

R2S | 2 | 24 + 15 | General reversible dinucleotide model with strand symmetry |

R2 | 2 | 48 + 15 | General reversible dinucleotide model |

U2S | 2 | 48 | General unrestricted dinucleotide model with strand symmetry |

U2 | 2 | 96 | General unrestricted dinucleotide model |

R3S | 3 | 148 + 63 | General reversible trinucleotide model with strand symmetry |

R3 | 3 | 288 + 63 | General reversible trinucleotide model |

U3S | 3 | 288 | General unrestricted trinucleotide model with strand symmetry |

U3 | 3 | 576 | General unrestricted trinucleotide model |

GYE | 3 | 2 + 60 | The codon model of Goldman and Yang (1994), with equal distances between amino acids |

GYM | 3 | 3 + 60 | The codon model of Goldman and Yang (1994), with the physicochemical distance matrix of Miyata, Miyazawa, and Yasunaga (1979) |

Model | N a | Parametersb | Description |
---|---|---|---|

HKY | 1 | 1 + 3 | Reversible single-nucleotide model of Hasegawa, Kishino, and Yano (1985), which allows only for non-uniform equilibrium frequencies and a transition/transversion bias |

REV | 1 | 6 + 3 | General reversible single-nucleotide model (Tavaré 1986) |

UNR | 1 | 12 | General unrestricted single-nucleotide model (Yang 1994a) |

R2S | 2 | 24 + 15 | General reversible dinucleotide model with strand symmetry |

R2 | 2 | 48 + 15 | General reversible dinucleotide model |

U2S | 2 | 48 | General unrestricted dinucleotide model with strand symmetry |

U2 | 2 | 96 | General unrestricted dinucleotide model |

R3S | 3 | 148 + 63 | General reversible trinucleotide model with strand symmetry |

R3 | 3 | 288 + 63 | General reversible trinucleotide model |

U3S | 3 | 288 | General unrestricted trinucleotide model with strand symmetry |

U3 | 3 | 576 | General unrestricted trinucleotide model |

GYE | 3 | 2 + 60 | The codon model of Goldman and Yang (1994), with equal distances between amino acids |

GYM | 3 | 3 + 60 | The codon model of Goldman and Yang (1994), with the physicochemical distance matrix of Miyata, Miyazawa, and Yasunaga (1979) |

^{a}Order of the model.

^{b}Number of rate-matrix parameters + number of (free) equilibrium-frequency parameters (the latter are not required for unrestricted models). The scaling constraint on the rate matrix is ignored here. Note that these numbers do not characterize entire phylogenetic models (e.g., branch lengths are not con-sidered). In context-dependent models (*N* ≥ 2), instantaneous substitutions of more than one nucleotide are prohibited.

**Table 2**

Branch | REV | GYM | R3 | REV + Γ | GYM + Γ | R3 + Γ |
---|---|---|---|---|---|---|

1 | 0.0078 | 0.0103 | 0.0120 | 0.0074 | 0.0116 | 0.0125 |

2 | 0.0073 | 0.0043 | 0.0130 | 0.0058 | 0.0044 | 0.0134 |

3 | 0.0327 | 0.0091 | 0.0088 | 0.0257 | 0.0077 | 0.0086 |

4 | 0.0324 | 0.0372 | 0.0262 | 0.0396 | 0.0467 | 0.0325 |

5 | 0.0405 | 0.0200 | 0.0118 | 0.0398 | 0.0199 | 0.0125 |

6 | 0.0079 | 0.0088 | 0.0088 | 0.0086 | 0.0114 | 0.0105 |

7 | 0.0323 | 0.0341 | 0.0340 | 0.0346 | 0.0387 | 0.0371 |

8 | 0.0811 | 0.1004 | 0.1004 | 0.1078 | 0.1398 | 0.1272 |

9 | 0.0350 | 0.0375 | 0.0375 | 0.0382 | 0.0431 | 0.0413 |

10 | 0.0489 | 0.0564 | 0.0563 | 0.0577 | 0.0699 | 0.0656 |

11 | 0.0151 | 0.0180 | 0.0185 | 0.0189 | 0.0241 | 0.0219 |

12 | 0.0468 | 0.0526 | 0.0526 | 0.0521 | 0.0621 | 0.0585 |

13 | 0.0079 | 0.0088 | 0.0088 | 0.0086 | 0.0114 | 0.0105 |

14 | 0.0512 | 0.0577 | 0.0462 | 0.0552 | 0.0604 | 0.0514 |

15 | 0.0208 | 0.0233 | 0.0281 | 0.0225 | 0.0284 | 0.0292 |

16 | 0.0366 | 0.0405 | 0.0366 | 0.0410 | 0.0477 | 0.0411 |

Branch | REV | GYM | R3 | REV + Γ | GYM + Γ | R3 + Γ |
---|---|---|---|---|---|---|

1 | 0.0078 | 0.0103 | 0.0120 | 0.0074 | 0.0116 | 0.0125 |

2 | 0.0073 | 0.0043 | 0.0130 | 0.0058 | 0.0044 | 0.0134 |

3 | 0.0327 | 0.0091 | 0.0088 | 0.0257 | 0.0077 | 0.0086 |

4 | 0.0324 | 0.0372 | 0.0262 | 0.0396 | 0.0467 | 0.0325 |

5 | 0.0405 | 0.0200 | 0.0118 | 0.0398 | 0.0199 | 0.0125 |

6 | 0.0079 | 0.0088 | 0.0088 | 0.0086 | 0.0114 | 0.0105 |

7 | 0.0323 | 0.0341 | 0.0340 | 0.0346 | 0.0387 | 0.0371 |

8 | 0.0811 | 0.1004 | 0.1004 | 0.1078 | 0.1398 | 0.1272 |

9 | 0.0350 | 0.0375 | 0.0375 | 0.0382 | 0.0431 | 0.0413 |

10 | 0.0489 | 0.0564 | 0.0563 | 0.0577 | 0.0699 | 0.0656 |

11 | 0.0151 | 0.0180 | 0.0185 | 0.0189 | 0.0241 | 0.0219 |

12 | 0.0468 | 0.0526 | 0.0526 | 0.0521 | 0.0621 | 0.0585 |

13 | 0.0079 | 0.0088 | 0.0088 | 0.0086 | 0.0114 | 0.0105 |

14 | 0.0512 | 0.0577 | 0.0462 | 0.0552 | 0.0604 | 0.0514 |

15 | 0.0208 | 0.0233 | 0.0281 | 0.0225 | 0.0284 | 0.0292 |

16 | 0.0366 | 0.0405 | 0.0366 | 0.0410 | 0.0477 | 0.0411 |

We thank Eric Green and the NIH Intramural Sequencing Center for permission to use the CFTR sequence data prior to publication; Webb Miller for preparing the multiple alignment for the CFTR data set; Jim Kent, Matt Schwartz, and the UC Santa Cruz genome browser team for advice on retrieving and scoring the mRNA sequences; David Heckerman for a helpful discussion regarding Markov dependence of sites; and Mathieu Blanchette for comments on the manuscript. Nick Goldman encouraged us to write this article in the first place, and he provided several useful references when we finally began to put it together. Ziheng Yang's PAML package was helpful in testing the correctness of our software, and we borrowed its subroutines for computing discrete gamma mean rates. Finally, we thank the anonymous reviewers for several very good suggestions, particularly regarding model testing. Funding was provided by the Howard Hughes Medical Institute (D.H.) and National Human Genome Research Institute grant IP41HG02371 (A.S.).

## Literature Cited

*in*Myers, G., S. Hannenhalli, S. Istrail, P. Pevzner, and M. Waterman, eds. Proceedings of the Sixth Annual International Conference on Computational Biology. Association for Computing Machinery, New York.

*in*Atlas of protein sequence and structure, vol. 5. National Biomedical Research Foundation, Washington, D.C.

*in*K. B. Laskey, and H. Prade, eds. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI). Morgan Kaufmann, San Mateo, Calif.

*in*S. Gupta, and J. Yackel, eds. Statistical decision theory and related topics, Academic Press, New York.