Unbiased Estimate of Synonymous and Nonsynonymous Substitution Rates with Nonstationary Base Composition

Abstract The measurement of synonymous and nonsynonymous substitution rates (dS and dN) is useful for assessing selection operating on protein sequences or for investigating mutational processes affecting genomes. In particular, the ratio dNdS is expected to be a good proxy for ω, the ratio of fixation probabilities of nonsynonymous mutations relative to that of neutral mutations. Standard methods for estimating dN, dS, or ω rely on the assumption that the base composition of sequences is at the equilibrium of the evolutionary process. In many clades, this assumption of stationarity is in fact incorrect, and we show here through simulations and analyses of empirical data that nonstationarity biases the estimate of dN, dS, and ω. We show that the bias in the estimate of ω can be fixed by explicitly taking into consideration nonstationarity in the modeling of codon evolution, in a maximum likelihood framework. Moreover, we propose an exact method for estimating dN and dS on branches, based on stochastic mapping, that can take into account nonstationarity. This method can be directly applied to any kind of codon evolution model, as long as neutrality is clearly parameterized.

We denote by D a set of sequences in an alphabet A, and by T a phylogenetic tree. On each site, each sequence of D is the result of a substitution process from a root sequence along the branches of T . On a given branch b of length t, this substitution process can be represented by a continuous time Markov process (X(τ )) τ .
We define E = {(a, a) ∈ A×A; a = a } the set of all substitutions, and focus on a subset L of events (L ⊂ E). These events are named L−events. In our case, the L−events are the synonymous substitutions or the non-synonymous substitutions and A is the set of codons.
N L denotes the number of L−events that occur along process X. Since X is unknown, N L is unknown. Substitution mapping approach is used to compute the expectation of N L over the distribution of X, given branch b, model M, and data D, i.e. E(N L |b, M, D). Now we define and compute the ability of a model M (with generator Q ) along this process X.
At time τ , during a short time dτ , we define the instantaneous ability of M to perform L−event, A Q L (τ ), as the expectation -on X(τ ) -of the number of L−events that would have been performed by a process following model M during dτ : where Q a,L = a ∈A;(a,a )∈L Q a,a .
The ability A Q L is the mean value of this sum along the process X: P (X(τ ) = a)dτ where T a = t τ =0 P (X(τ ) = a)dτ is the time spent by X in state a. As for stochastic mapping, the expectation of A Q L over all X, E(A Q L |b, M, D) can be efficiently computed. In Minin and Suchard (2008), t.A Q L is defined as the reward of vector Q L = (Q a,L ) a , which expectation over all scenarios given branch b, model M, and data D, can be computed in the same way as E(N L |b, M, D).

Estimates of dN and dS
Here, we show that the proposed estimates of dN and dS, using the ability of a neutral model, are the most likely on a branch b of a tree T , given a model M and data D.
Given a model M with generator Q , the log-likelihood of a process X along a branch b is: where T a is the time spent by X in state a and N aa is the number of substitutions from a to a that occurred in X on branch b. And we consider the expectation on the distribution of X given T , M and D (we remove expectation condition |b, M, D for sake of clarity): E(lL|M ) = a∈A E(T a ).Q aa + (a,a )∈E E(N aa ) log(Q aa ) = − (a,a )∈E E(T a ).Q aa + (a,a )∈E E(N aa ) log(Q aa ) Now, we look for model M that maximizes this likelihood. Actually, we only focus on the factors that define non-neutrality, i.e. the factors that discriminate synonymous substitutions from non-synonymous substitutions. We take into consideration two sets of substitutions : S (resp. N) the set of synonymous (resp. non-synonymous) substitutions. S ∪ N = E.
And we consider that Q can be written as : where Q 0 aa does not depend on the synonymous property of the substitution from a to a (it is the "neutral" part of Q , a part that we do not want to estimate).
Then: with A 0 := A Q 0 . Now, we look for which values of α and β E(lL|M ) is maximized: Finally, given the fixed neutral part Q 0 (and given T, M, D), the most-likely model on branch b is: if (a, a ) ∈ N and the most likely estimator of ω is E(N N ) dN and dS are usually defined as the (non-)synonymous numbers of substitutions per (non-)synonymous nucleotide. In order to fit with this definition, since Q is normalized to perform one substitution per codon and per unit of time on a sequence at equilibrium, the computed estimates have to be divided per 3 and multiplied by the length of the branch.  Figure S2: Estimates of dN , dS and dN dS with a stationary model (left) and non-stationary model (right), on simulated data with changing G+C content and ω = 0.9. θ root : G+C frequency in the root sequence. θ eq : G+C equilibrium frequency of the simulation model.  Figure S3: Estimates of dN , dS and dN dS with a stationary model (left) and non-stationary model (right), on simulated data with changing G+C content and ω = 1. θ root : G+C frequency in the root sequence. θ eq : G+C equilibrium frequency of the simulation model.  Figure S4: Estimates of dN , dS and dN dS with a stationary model (left) and non-stationary model (right), on simulated data with changing G+C content and ω = 1.1. θ root : G+C frequency in the root sequence. θ eq : G+C equilibrium frequency of the simulation model.  Figure S5: Estimates of dN , dS and dN dS with a stationary model (left) and non-stationary model (right), on simulated data with changing G+C content and ω = 2. θ root : G+C frequency in the root sequence. θ eq : G+C equilibrium frequency of the simulation model.  Figure S6: Estimates of dN , dS and dN dS with a stationary model (left) and non-stationary model (right), on simulated data with changing G+C content in codon position 1, and ω = 0.1. θ root : G+C frequency in codon position 1 of the root sequence. θ eq : G+C equilibrium frequency in codon position 1 of the simulation model.  Figure S7: Estimates of dN , dS and dN dS with a stationary model (left) and non-stationary model (right), on simulated data with changing G+C content in codon position 2, and ω = 0.1. θ root : G+C frequency in codon position 2 of the root sequence. θ eq : G+C equilibrium frequency in codon position 2 of the simulation model.  Figure S8: Estimates of dN , dS and dN dS with a stationary model (left) and non-stationary model (right), on simulated data with changing G+C content in codon position 3, and ω = 0.1. θ root : G+C frequency in codon position 3 of the root sequence. θ eq : G+C equilibrium frequency in codon position 3 of the simulation model.  Figure S9: Substitution rates estimated with codeml. Sequences were simulated with changing G+C content and ω = 0.1. From top to bottom: dN , dS and dN dS . θ root : G+C frequency in the root sequence. Dashed curve : estimates on sequences at equilibrium. θ eq : G+C equilibrium frequency of the simulation model.  Figure S11: Estimated difference between equilibrium GC3 in primate clade and root GC3 compared to observed human GC3.