Modeling Codon Rate Variation: Robust Inference of Protein and Nucleotide Selection

There are numerous sources of variation in the rate of synonymous substitutions inside genes, such as direct selection on the nucleotide sequence, or mutation rate variation. However, the majority of the codon models which are developed and widely used today still incorporate an assumption of effectively neutral synonymous substitution rate, constant between sites of each gene. Here we propose a simple yet effective extension to codon models, which incorporates codon substitution rate variation along the gene sequence. We find strong effects of substitution rate variation on positive selection inference. The computational load of our approach remains tractable, and therefore we are able to apply it to genome scale positive selection scans in vertebrates and Drosophila. Our new model is strongly favored by the data. More than 70% of the genes detected by the classical branch-site model are presumably false positives caused by the incorrect assumption of uniform synonymous substitution rate. The new model is able to capture signatures of nucleotide level selection acting on translation initiation and on splicing sites within the coding region. Finally, we show that rate variation is highest in the highly recombining regions, and we hypothesize that recombination and mutation rate variation, such as high CpG mutation rate, are the two main sources of nucleotide rate variation. While we detect fewer genes under positive selection in Drosophila than without rate variation, the genes which we detect contain a stronger signal of adaptation to Wolbachia. We provide software to perform positive selection analysis using the new model.


I. Introduction
Detecting the selective pressure affecting protein coding genes is an important component of molecular evolution and evolutionary genomics. Codon models are one of the main tools used to infer selection on protein coding genes (Koonin and Wolf, 2010). This is done by comparing the rate of nonsynonymous substitutions (d N ) that are changing the amino acid sequence with the rate of synonymous substitutions (d S ) that are supposed not to affect the gene function.
While there is overwhelming evidence of negative and positive selection acting on the amino acid sequence of the proteins (Boyko et al., 2008), the synonymous substitutions affecting the protein coding genes are assumed to be effectively neutral. This is a reasonable first approximation, especially for species with low effective population size, such as many mammals (Keightley et al., 2005;Romiguier et al., 2014). Therefore the synonymous substitution rate can be used as a proxy for the neutral substitution rate, and comparison between d N and d S can be used to identify the selection acting on the level of amino acids (Yang and Bielawski, 2000).
Two similar approaches to model the evolution of codons have been proposed by Goldman and Yang (1994) and Muse and Gaut (1994). In the Goldman and Yang (1994) model, selection pressure on the protein sequence is represented by a single parameter (ω), which defines the ratio of nonsynonymous to synonymous substitutions (d N /d S ). In the Muse and Gaut (1994) model, both d N and d S are estimated as two independent parameters called α (d S ) and β (d N ). The two approaches also differ in the parametrization of the stationary frequency . Neither of these approaches makes a molecular clock assumption, i.e., overall substitution rates are free to vary between the branches of a phylogenetic tree. On the other hand, both approaches originally incorporated an assumption of constant rates or ratio of nonsynonymous and synonymous substitutions between sites over the gene sequence. Moreover, they originally incorporated an assumption of constant ratio between branches. Various extensions have been proposed over the years to those models, allowing the incorporation of variation of selection acting on the protein sequence between codons  and/or between phylogenetic branches (Zhang et al., 2005). Yet for almost all of those new models, an assumption of effectively neutral synonymous substitution rate, which is constant between sites, was kept.
There is no biological reason to assume that the rate of synonymous substitutions is uniform in this way. In practice, it has been suggested (Yang, 2014) that the effect of the variation in the rate of synonymous substitutions should not substantially bias the inference of selection strength. The idea is that the comparison between d N and d S is a contrast between the rates before and after the action of selection on the protein coding gene, while all the other factors will affect both rates in the same way. This reasoning might work when assigning a single ω value to the whole alignment. The estimated value is then simply an average. But for more sophisticated models, where ω varies between branches and sites, violation of the assumption of uniformity of the synonymous rate can affect the performances of the model (Rubinstein et al., 2011).
There are numerous sources of variation in the rate of synonymous substitutions inside genes. First, the raw mutation rate across each genome varies significantly. One of the strongest effects on the mutation rate in mammals is CpG sites. Transitions at CpG sites are more that 10-fold more likely than transversions at non-CpG sites (Leffler et al., 2013) due to spontaneous deamination, which causes a mutation from C to T, or from G to A. Both mutation frequencies and repair efficiency are highly dependent on the context. For example, the mammalian CpG mutation rate is lower in high GC regions (Fryxell and Zuckerkandl, 2000). This is probably related to strand separation and hydrogen bonding in the neighboring region (Segurel et al., 2014). High GC regions themselves are characterized by a higher muta-tion rate, which is probably caused by less efficient repair by the exonuclease domain. There are other context-dependent effects which are known, many of which lack a mechanistic explanation, such as a higher mutation rate away from T with an increasing number of flanking purines (Hwang and Green 2004, for reviews see Hodgkinson andEyre-Walker 2011 andSegurel et al. 2014).
Mutation rate is also affected by replication time: it is usually higher in the late-replicating regions (Stamatoyannopoulos et al., 2009). This effect has been attributed to the interference between RNA and DNA polymerases (Jørgensen and Schierup, 2009) and to the variation in the efficiency of mismatch repair (Supek and Lehner, 2015). It is not clear that this affects variation within genes, as opposed to between genes, but it could affect very long genes.
Mutation rates are also correlated with recombination rates. Some suggest (Lercher and Hurst, 2002;Hellmann et al., 2003Hellmann et al., , 2008) that recombination itself can have a mutagenic effect, possibly through an interaction with indels. Alternatively, this correlation can be a result of GC-biased gene conversion (GC-BGC), whereby mutations increasing GC content have a higher chance of fixation in the population (Duret and Galtier, 2009). While GC-BGC is a fixation bias, in some cases it can create a pattern which is hard to distinguish from positive selection (Ratnakumar et al., 2010).
Finally, the synonymous substitution rate can be affected by selection at the nucleotide level. First, while synonymous substitutions do not affect the protein sequence, they might affect translation efficiency. This effect is not limited to species with large effective population size, such as Drosophila (Carlini and Stephan, 2003), since selection for codon usage was identified even in Homo sapiens (Comeron, 2004) and other mammals, especially for highly expressed genes. It has been suggested that bias in codon usage reflects the tRNAs abundance, and thereby provides a fitness advantage through increased translation efficiency or accuracy of protein synthesis (Bulmer, 1991), although in many cases there is no depen-dency between tRNA abundance and codon frequency, and the source of the bias remains unknown (Plotkin and Kudla, 2011).
Selection at the nucleotide sequence can be also caused by secondary structure avoidance, as secondary structure can reduce translation efficiency (Kudla et al., 2009;Kertesz et al., 2010). Other potentially important sources of selection at the nucleotide sequence, independent of the coding frame, include splicing motifs located within exons, exon-splicing enhancers (Majewski and Ott, 2002), or functional non-coding RNAs, such miRNAs or siRNAs, which often reside within coding sequences (Mattick and Makunin, 2006).
Because of all these mutational and selective biases, it is important to model rate variation not only at the level of protein selection, but also at the nucleotide level. There are in principle two different approaches to incorporate rate variation into codon models. First, it is possible to model synonymous (d S ) and nonsynonymous (d n ) substitution rates separately by extending a two-rate model, as in  and Mayrose et al. (2007). Second, it is possible to incorporate site-specific rates as an independent parameter into onerate models (Scheffler et al., 2006;Rubinstein et al., 2011). In the second case, the substitution rate parameter captures biological factors, such as mutation rates, fixation rates, or nucleotide selection, which act on all substitutions, both synonymous and nonsynonymous.
Here we focus on the second approach, which is traditionally used for large-scale positive selection analyses in eukaryotes (Clark et al., 2007;Markova-Raina and Petrov, 2011;Zhang et al., 2014;Moretti et al., 2014). The selection parameter (ω ∼ d N /d s ) can be directly interpreted as selection acting on the protein sequence. Having separate d N and d S leads to d N being affected by three different factors: mutation rate, nucleotide level selection, and protein level selection. This would make protein selection estimation less straightforward and perhaps less precise. Indeed, Spielman et al. (2016) demonstrates the superior performance of having a compound parameter for the es-timation of selection strength on the protein. While we use a single parameter ω to model the selection strength at the protein level, ω can vary both between alignment sites and tree branches.
While codon models accounting for nucleotide rate variation are available for more than a decade, they are still rarely used for large-scale selection analyses, such as Kosiol et al. (2008); Moretti et al. (2014); Zhang et al. (2014). This is probably because these models have even higher computational demands, and the statistical performance of different approaches to nucleotide rate variation was never compared.
Here we extend the Scheffler et al. (2006) model, which captures variation between codons, i.e., uses a single rate per codon, and perform a direct comparison with Rubinstein et al. (2011), which captures variation between nucleotides, i.e., with three rate parameters per codon. We also assess the impact of nucleotide rate variation on the BS-REL-family model (Murrell et al., 2015). Models based on Goldman and Yang (1994) typically use the maximum likelihood approach on the two nested models in order to detect positive selection. This way the method can identify genes, while individual sites can be detected only using additional posterior analysis. This approach seems to be suitable for large-scale genome analyses, where one is interested in identifying biological functions undergoing positive selection (Clark et al., 2007;Kosiol et al., 2008;Zhang et al., 2014;Daub et al., 2017). We chose Murrell et al. (2015) as a comparison, since it is the only BS-REL model for gene-wide identification of positive selection, while other positive selection models in that family are intended for inference of selection at individual sites, and, thus, cannot be compared directly.
We first use simulations to compare different approaches of modeling synonymous rate variation. Then, we use our model to detect positive selection in twelve Drosophila species and in a vertebrate dataset.
We detect positive selection on genes from those two datasets under our new model, and we demonstrate that it is important to take rate variation into account for such inference of positive selection. We investigate factors affecting the nucleotide substitution rate, and we show that the new model successfully detects synonymous selection acting on regulatory sequences within the coding sequence. We also identify which gene features affect rate heterogeneity the most.

II. New approaches
We model the process of codon substitution as a Markov process defined by the instantaneous rate matrix Q. In a general case, the Q can be written as (Rubinstein et al., 2011): π j i and j differ by one synonymous substitution at site h ρ (h) λ ij ωπ j i and j differ by one nonsynonymous substitution at site h 0 i and j differ by more than one nucleotide Here ρ (h) is the substitution rate for a site h of an alignment (h ∈ 1 . . . H, where H is the alignment length in nucleotides). The variable λ ij is the substitution factor to change from codon i to codon j. It is typically used to account for the difference between transitions and transversion rates (Hasegawa et al., 1985). In this case, λ ij depends on the substitution type but does not depend on the codon position in which substitution occurs. The rate ρ (h) is used to account for various effects that are not captured by the variation in ω. In particular, it accounts for variation in mutation rate and selection acting on the nucleotide sequence. We will use ρ (m,p) = ρ ((m−1) * 3+p) to denote the rate associated with p-th (p ∈ {1, 2, 3}) position of codon m in the alignment (m ∈ 1 . . . M, where M = H 3 is the alignment length in codons).
In Rubinstein et al. (2011), ρ (h) is modeled using a one parameter gamma distribution across sites of the alignment, such that the mean relative substitution rate is equal to 1, i.e., ρ (h) ∼ Gamma(α, 1/α). Keeping a mean rate of 1 is important to avoid biases in the estimation of branch lengths. There is no implicit assignment of rates to sites, as in the CAT model (Lartillot and Philippe, 2004). Instead, a random-effect model is used: the gamma distribution is split into K equally probable discrete categories ρ k using quantiles, and the site likelihood is computed as the average of the likelihoods for each possible rate assignment. This approach adds only one extra parameter to the model, but it is computationally intensive. Indeed, in order to compute a likelihood for the m-th codon in the alignment, it is required to compute likelihoods for this codon given all possible rate assignment, i.e., For K discrete categories, K 3 likelihoods have thus to be computed per codon site.
In Scheffler et al. (2006), unlike Rubinstein et al. (2011), the three positions of each codon have the same rates, i.e., ρ (m,1) = ρ (m,2) = ρ (m,3) ; we denote the rate of a codon m as ρ (m, * ) . Here a codon belongs to one of three categories, each one represented by a single rate value ρ k . The rates and their respective proportions are estimated from the data, which leads to the estimation of four different parameters (two rates, as the third one is fixed by the constraint that the mean rate ρ = 1, and two proportions). This approach is virtually equivalent to adding a branch length multiplier for certain site classes, and therefore likelihood can be computed efficiently.
Here we propose having one rate per codon ρ (m,1) = ρ (m,2) = ρ (m,3) , while allowing this rate to vary following a gamma distribution, ρ k ∼ Gamma(α, 1/α). Similar techniques to model rate variation of codons under constant selective pressure (ω) across the sites and branches of a phylogenetic tree have been used in CodonPhyML (Gil et al., 2013) and in Baele and Lemey (2013).
With this approach we are combining the strengths of both Rubinstein et al. (2011) and Scheffler et al. (2006): we only increase parameter space by one parameter (α), allow a flexible distribution of rates, and keep the computational tractability of the model. Moreover, having similar parametrization and rate distributions allows us to compare the statistical performance of site rate and codon rate variation models.
Using this approach, we extended two widely used codon models: M8  and the branch-site model (Zhang et al., 2005). In principle our approach could be applied to any GY94-based model. In the site model M8, selection pressure represented by the ω parameter varies between the sites of an alignment following a beta distribution, while staying constant over the branches of the phylogenetic tree. In this model, a subset of sites can evolve under positive selection. In the branchsite model, ω varies both between the sites of the alignment and the branches of the phylogenetic tree. In this model, a subset of sites can thus evolve under positive selection on a predefined subset of branches. These two models were implemented in Godon, a codon model optimizer in Go, in three variants: no rate variation, site rate variation (Rubinstein et al., 2011) and codon rate variation as described above.
Three of the six models were implemented and used for the first time to our knowledge: the branch-site model with site rate variation similar to Rubinstein et al. (2011), and gamma distributed codon rate variation as proposed above, for two models: M8 and branch-site.

I. Simulations
Site models We have simulated three datasets using various flavors of the M8 model: a dataset without rate variation, a dataset with site rate variation, and a dataset with codon rate variation (Table 1). We then used the three corresponding mod-   Yang et al. (2000); BS: branch-site model of Zhang et al. (2005); BUSTED: BUSTED model from the BS-REL-family (Murrell et al., 2015).  els to infer positive selection in those datasets. In all three cases, as expected, the model corresponding to the simulations shows the best result ( Fig. 1, Table 2 and Supplementary Table S1). In the absence of rate variation, the statistical performance of the three methods is very similar, even though the M8 model without rate variation has a slightly better performance (Fig. 1A, Supplementary Fig. S1A). Despite increased complexity, the power of the codon variation model is only marginally reduced relative to the model without rate variation ( Supplementary Fig. S2A).
The M8 model on the dataset with site rate With the dataset with codon rate variation ( Fig. 1C, Supplementary Fig. S1C), there is a relatively large decrease in the performance of both other models, while, as expected, accounting for site rate variation performs better compared to the model without rate variation. False positive rates exceed the significance levels when rate variation is not taken into ac-count ( Supplementary Fig. S1C). Codon rate variation even increases the false positive rate above 50% for the model without rate variation at the significance level of 0.05. Stronger rate variation (i.e., smaller α value) causes a higher false positive rate ( Supplementary Fig. S3).
From this we can conclude that a) the performance of models accounting for codon variation is acceptable in all three scenarios, i.e., no rate variation, site rate variation and codon rate variation; b) in the presence of codon rate variation in the data, models not accounting for this kind of variation suffer from a notable loss of statistical performance.
The total computation time needed for the inference was 76 CPU hours for the model with no rate variation, 232 CPU hours for codon rate variation, and 4,023 CPU hours for site rate variation.

Branch-site models
The simulations based on the branch-site model show a qualitatively similar behavior to the simulations based on the M8-type models ( Fig. 2 and Supplementary Fig. S4, Supplementary Tables S2, S3), although the performances are more similar between models. As with M8type models, codon rate variation models performs well in all three cases, while simulating with codon rate variation causes a clear underperformance in both other models. Unlike in the case of M8-type models, false positive rates are only marginally inflated compared to theoretical expectations ( Supplementary Fig. S5). Nevertheless, the model with codon rate variation shows the best performance.
The total computation time needed for the inferece was 37 CPU hours for the model with no rate variation, 101 CPU hours for codon rate variation, and 1,630 CPU hours for site rate variation.
More complex models have a computational cost. Analyses with codon rate variation were 3.0 and 2.8 times slower compared to no rate variation for M8 and branch-site models respectively, while those with site rate variation were 52.7 and 44.3 times slower, respectively. Thus   codon rate variation captures biological signal at a much lower computational cost than site rate variation.

Comparisons with BS-REL
It was demonstrated (Murrell et al., 2015) that in certain cases the statistical power of BS-REL is superior to other methods, including the branch-site model, therefore it is important to study how rate variation affects the performance of those models. The only BS-REL model suitable for the gene-wide identification of positive selection is BUSTED (Murrell et al., 2015), and the current implementation supports neither rate variation nor d S variation as implemented in . The BS-REL framework differs from the branch-site and M8 models by the frequency parametrization. BS-REL is based on the approach of Muse and Gaut (1994), while the branch-site and M8 models are based on the approach of Goldman and Yang (1994). During simulations, we used F0 frequencies, i.e., identical frequencies of all the codons (π i = 1/61). As F0 can be considered a special case of both Muse and Gaut (1994) and Goldman and Yang (1994), we do not expect any bias caused by this difference between the two approaches. Because of differences in the simulations and model assumptions we did not compare model accuracy and receiver operating characteristic (ROC), but we focused instead on the effect of the rate variation on the false positive rate. In other words we are asking whether the unaccounted rate variations in the evolutionary process can cause false positives in positive selection inference.
BUSTED shows highly inflated rates of false positives in the presence of codon rate variation (Fig. 3, Supplementary Fig. S6); at a typical significance level of 0.05 the false positive rate of BUSTED is close to 0.3 and 0.2 for the M8 and branch-site simulations, respectively. In the presence of codon rate variation in the evolutionary process, the proportion of false positives generated by the model can be 4-6 times higher than we expect by chance. This shows that the statistical performance of the BS-REL family models is also affected by not taking into account rate variation.

II. Vertebrate dataset
Given the good performance of our model in the simulations, we applied it to real data. First we used 767 one-to-one orthologs from vertebrate species. This represents a set of genes with high divergence (more than 450 My), conservative evolution (Studer et al., 2008), and relatively low effective population sizes (although some vertebrates have high N e , see Gossmann et al. 2012), thus relatively weak impact of natural selection. We analyzed them with three variants of the branch-site model: no rate variation, site rate variation and codon rate variation. We observed that in most of the cases (each case corresponds to a single branch of a gene tree) the data supports (Akaike information criterion, AIC) the codon rate variation model: out of 26,721 individual branches tested, data supports codon rate variation in 85% of the tests, site rate variation in 15%, and no rate variation model was favored only in a single test (0.01%).
A large proportion of branches detected to be under positive selection with the no rate variation model are not detected to be under positive selection with the codon rate variation model (Table 3). This effect is even stronger when multiple testing correction is used. The majority of the positive predictions from the standard branch-site model are not supported when codon rate variation is accounted for. This suggests that evolution on these branches  can be explained by nucleotide substitution rate variation without positive selection (see Discussion). We observe relatively high agreement in predictions between the site rate variation and codon rate variation models (after multiple testing correction), as those are two different approaches to model the same evolutionary process.
Supplementary Table S4 shows prediction agreement between each model and the best supported model out of three, confirming the good performance of the codon variation model.
The analyses under the branch-site model took 9, 76 and 1732 CPU hours for no variation, codon rate variation and site rate variation models, respectively. Thus codon rate variation model was 8.4 times slower, while site rate variation model was 190 times slower.
With real data, differences between genes are not only stochastic, but are expected to be driven by underlying biological differences. It is thus interesting to find which factors affect rate variation as estimated by the model, as well as to know which genes favored the model with codon rate variation the most.
We focused on gene features associated with the underlying evolutionary process, such as recombination rate, GC content standard deviation (indicative of shifts in recombination hotspots) and expression level (associated with stronger purifying selection, Drummond et al. 2005;Pal et al. 2006;Kryuchkova-Mostacci and Robinson-Rechavi 2015). Parameters which can directly affect the performance of the method were also included in the linear model to avoid potential biases, e.g., number of sequences and alignment length.
Here and below we used three response variables for our analyses. First, we create linear models using the relative support of the model (AIC-based, see Methods) as a response variable. These models allow us to understand for which categories of genes the effect of rate variation is the strongest. Second, models using the α parameter of the gamma distribution (codon rate variation) as a response variable enable us to identify genes properties associated with high substitution rate variance. Finally, we create a model for the proportion of branches which are detected using the model without rate variation, but not using the model accounting for the rate variation. This model allows us to identify the main causes of discrepancy between the results of the two models. In all three cases, the model uses biological properties of genes acquired either from the data (e.g., length of the coding sequence, number of sequences) or from an external source (e.g., maximum expression level, recombination rate).
In this analysis, every group of orthologous genes was treated as a single observation. Estimated parameter values obtained by testing different branches of the tree were averaged. The relative support of the model with codon rate variation is mostly affected by total branch length, alignment length, and mean GC content of the gene (Supplementary Table S5). The positive correlation with tree length and alignment lengths is probably related to the increase in total amount of information available for the model. The relation to GC content might be due to the relationships between recombination rates, substitution rates, and GC content (Duret and Galtier 2009;Rudolph et al. 2016, see Discussion).
For the shape parameter of the gamma distribution α, the strongest explanatory variable is also the length of the alignment (Supplementary Table S6). Counterintuitively, shorter alignments are characterized by larger rate variation (a low α value corresponds to a large gamma distribution variance). We also observe a weak relation with maximal expression level. Highly expressed genes tend to have a higher rate variation, which could be explained by higher nucleotide level selection on certain parts of the gene.

III. Drosophila dataset
The second real dataset we used contains 8,606 one-to-one orthologs from Drosophila genomes. The Drosophila data set is ten-fold larger than the vertebrate dataset. Since analyses on the simulated and vertebrate datasets show a consistent superiority of codon rate over site variation rate, with a much lower computational cost, we ran only no variation and codon variation on the full dataset. However, we ran site variation on a subset of 977 genes selected randomly. Therefore for the Drosophila data, we are mainly focusing on comparing models with and without codon rate variation. Drosophila has large effective population sizes on average (Gossmann et al., 2012), thus stronger impact of natural selection; the genes studied are less biased towards core functions than in the vertebrate dataset, and have lower divergence: about 50 mya for Drosophila (Russo et al., 2013) compared to more than 450 mya for the vertebrate dataset (Betancur-R et al., 2015).
In total 66,656 branches were tested for positive selection. The model with codon rate variation was supported by the data in 97% or 96% of the tests when using AIC or likelihood ratio test (LRT), respectively. On the smaller subset, on which all the three approaches were applied, codon rate variation was supported in 69% of the tests, site rate variation in 29% and no variation in 2%. As with the vertebrate dataset, predictions were not consistent between the models (Table 4, Supplementary  Table S7). The site and the codon rate variation models display a stronger consistency in predictions of positive selection compared to the other model comparisons. In this case, when accounting for multiple testing, the vast majority of predictions of positive selection given by the model without rate variation are not supported by the model accounting for rate variation.
Genes identified to be under positive selection with the model accounting for rate variation are enriched for a single cellular component GO category: inner dynein arm (GO:0036156, q-value= 9.95 · 10 −5 , Supplementary Table S8). Dynein plays an important role in Wolbachia infection (Serbus and Sullivan, 2007), and is thus a likely candidate for strong positive selection (Werren et al., 2008). However, there are no significant cellular component GO categories identified using the branchsite model without rate variation (for dynein arm category q-value=1, Supplementary Table S9).
The slowdown in computational time of the branch-site model with codon rate variation was 9.3 times and the inference took about 10,200 CPU hours.
The relative support of the model with codon rate variation is mainly explained by alignment length, number of sequences and coding sequence length (Table 5). Stronger model support associated with increase in the amount of information (increased coding sequence length  means less gaps for the same alignment length), expression levels and mean GC content is consistent with the vertebrate results (Supplementary Table S5).
We also observe a dependence on the number of exons and on recombination rate. A larger number of exons implies more exonintron junctions, which might affect variation in levels of nucleotide sequence selection (see below). Recombination might affect GC-BGC, mutation rate, and selection strength acting on synonymous sites (Campos et al., 2014).
The rate variation parameter α can be explained by several features of genes (Supplementary Table S10). Most of the effects are not reproduced between the two datasets. While some of them are strongly significant, generally the effect sizes are not very large. The most consistent effect between the two datasets is dependence of the rate variation on GC content.

IV. Signatures of selection at the nucleotide level
Codon rate variation can be influenced by various factors such as mutation bias, fixation bias (e.g., gene conversion), or selection acting against synonymous substitutions. Notably, it is well known that exon regions adjacent to the splicing sites are evolving under purifying selection at the nucleotide level (e.g., see Majewski and Ott, 2002). We determined posterior rates for positions of protein coding gene regions located in the proximity of exon-intron and intron-exon junctions; first exons were excluded from the analysis.
We observe in Drosophila (Fig. 4) that our codon rate variation model captures these selection constraints: the codon substitution rate is lower at the exon-intron junction than at the intron-exon junction, and both have lower rates than the rest of the exon. This is in agreement with splicing motif conservation scores (e.g., see Cartegni et al., 2002), and consistent with negative selection acting on splicing sites.
We also used the model M8 with codon rate variation to simultaneously estimate the effect of factors which affect substitution rates of nucleotide and protein sequences, again in Drosophila. We observed that the model is able to recover opposing trends acting on the 5' region of the protein coding gene (Fig. 5). These trends are probably a results of the high functional importance of the 5' nucleotide sequence, but low functional importance of the corresponding amino acid sequence (see Discussion). It is worth noting that this effect cannot be explained by a dependence between the ρ and the ω estimates, as such a dependence is not observed in other regions ( Supplementary  Fig. S7).
We observe that the top 25% most highly expressed genes show both stronger conservation of the amino acid sequences (Supplementary Fig. S8) and more pronounced decrease in the substitution rate of the 5'-region (Supplementary Fig. S9). Stronger purifying selection acting on highly expressed proteins was previously observed across domains of life (Pál et al., 2001;Rocha and Danchin, 2004;Drummond et al., 2005;Kryuchkova-Mostacci and Robinson-Rechavi, 2015). As for the decrease in substitution rate, ρ is defined relative to the gene-wide average substitution rate. Therefore a stronger relative selection on the 5'-region can be detected above the average increase in purifying selection on highly expressed genes. The relation with expression levels is consistent  with the assumption that we are measuring natural selection on gene sequences in this case, rather than mutation rates.
IV. Discussion

I. Nucleotide level selection in coding regions
There is strong evidence of selection acting on synonymous substitutions within protein coding sequences, and the strength of this selection is expected to vary across the coding region (Chamary et al., 2006). In particular, negative selection strongly affects regulatory sequences, such as exonic splicing enhancers or exon junction regulatory sequences (Cartegni et al., 2002). This variation in the selection strength affecting both synonymous and nonsynonymous substitutions can affect the performance of codon models (Rubinstein et al., 2011) and it is essential to take it into account. While there are multiple ways to do this, for instance by modeling the synonymous and non-synonymous rates separately , here we focused on modeling the ratio of non-synonymous vs synonymous rates as a single parameter (ω), while allowing the substitution rate (ρ) to vary along the sequence.
Our approach succeeds in recovering a signal of splicing motif conservation jointly with negative and positive selection acting on the protein sequence.
We also demonstrate that our model is able to disentangle opposite trends acting on the same sequence, i.e., stronger negative selection acting on the nucleotide sequence combined with weaker amino-acid selection towards the beginning of the reading frame.
Selection on the 5' nucleotide sequence is probably due to selection for translation initiation efficiency (Bentele et al., 2013), and is perhaps related to suppression of mRNA structures at the ribosome binding site. At the same time N-terminal amino acids are more likely to be unstructured, and they are relatively less important to protein function and stability compared to the core (Guharoy and Chakrabarti, 2005).

II. Determinants of rate variation
The vast majority of gene alignments in the study indicated better support for the model with codon rate variation. Moreover, the relative probability of the models incorporating codon rate variation increases with the amount of information available, be it number of sequences, alignment length or total number of substitutions. This indicates that these models are better in describing the underlying evolutionary process, and if we have enough data, these models are favored. We detect a strong signal of nucleotide variation in two quite different datasets. nucleotide evolution, despite lower effective population sizes in many species (Gossmann et al., 2012;Romiguier et al., 2014). Thus, the effect of nucleotide rate variation appears quite general, and will probably be found in many other species. The strongest determinant of the relative support of the model with codon rate variation is GC content. It is well known (Fullerton et al., 2001;Marais et al., 2003;Chamary et al., 2006) that regions with high recombination rates have higher GC as a result of GC-BGC, notably in the species studied here. It has also been shown that models accounting for rate variation show significantly better performance than simpler models in the presence of recombination (Scheffler et al., 2006), even if the true tree topology is used.
The effect of recombination rate as measured in Comeron et al. (2012) on the relative support of codon rate variation is comparable to the effect of GC content alone. In mammals, the higher CpG dinucleotides mutation rates (Kong et al., 2012) can increase the disparity in substitution rates and therefore contribute to the dependence between the GC content and the relative model support.
Yet, we observe this dependence in Drosophila, where there is no significant neighboring base contextual effects on the mutation rate (Keightley et al., 2009). Here, we hypothesize that the GC content can be viewed as a proxy for the average recombination rate over time via GC-BGC. Considering the rapid evolution of recombination hotspots (Ptak et al., 2005), GC content probably captures better historical recombination rates, while the direct measurement of recombination rate captures only the current state.
In both datasets, we observe a significant positive association of rate variation with the maximal expression level. Pressure for translational robustness increases with expression levels (Drummond et al., 2005), and codon choice affects expression level (Bentele et al., 2013). One of the main causes of selection on the codon sequence of highly expressed genes is protein misfolding avoidance , but there is also selection for efficient translation initiation (Pop et al., 2014). It is reasonable to assume that only certain parts of protein coding genes will be affected by strong nucleotide sequence selection, and that this selection will be stronger on more expressed genes. Indeed our results show strong negative selection acting on the coding sequence of translation initiation regions, and the relative selection strength is higher for the 25% highest expressed genes ( Supplementary Fig. S9). This can lead to the overall increase in the substitution rate variation.
Given the stronger negative selection on the coding sequence and the stronger variation in the substitution rate in highly expressed genes, it is especially important to take this variation into account. Having a few quickly evolving codons in combination with a low average ω of highly expressed genes can be mistakenly interpreted as positive selection by models without rate variation.

III. Codon models and rate variation
Widely used mechanistic codon models rely on the assumption of constant synonymous substitution rates. This assumption is often violated due to factors such as mutation bias or nucleotide selection, which vary across the gene. While substitution rate variation can be caused by multiple factors, we use a single compound rate parameter to model this variation.
Here we demonstrate that a simple model captures such rate variation, and that it both detects new biological signal, and substantially decreases the false positive rate in positive selection detection. Not only do we observe this effect in simulations (Fig. 1, 2, 3), but inconsistency between models is even higher when applied to the vertebrate and fly datasets. The vast majority of the predictions of positive selection performed using models without rate variation were not supported by the model with codon rate variation. The comparable power of the two methods ( Supplementary  Fig. S2A) and the strong support for rate variation from the data suggest that most of those positive selection predictions can be explained by the nucleotide rate variation (Tables 3, 4). Thus they should be considered as probable false positives. The proportion of those potential false positives per gene alignment is positively associated with the amount of available information, such as alignment length (Supplementary Table S11). This confirms that loss of power is not the main cause for the lack of detection by the codon rate model, but rather false positive detection by the classical branchsite.
We demonstrate that the underlying biological process is highly variable across positions, and that a model selection procedure is able to capture this once enough information is available. The effect of rate variation on the model performance is stronger in Drosophila than in vertebrates. On the vertebrate dataset, after multiple testing correction, about 70% of the positive predictions might be false positives (Table 3C), while in Drosophila it is 90% (Table 4B). We hypothesize that this is a consequence of the higher effective population size (Gossmann et al., 2012) and thus stronger synonymous selection in Drosophila.
Incorporating rate variation into the model allowed us to identify a strong signal of positive selection acting on the inner dynein arm, likely to be a result of selection against Wolbachia infection in Drosophila. With the model without rate variation, this signal is masked by other, presumably false positive, genes. While there have already been several large-scale analyses of positive selection in Drosophila (Clark et al., 2007;Markova-Raina and Petrov, 2011;Cicconardi et al., 2017), none of them reported positive selection affecting inner dynein arm. Sexual selection, on the other hand, has been repeatedly demonstrated using various approaches (e.g., Lupold et al., 2016).
Both versions of the branch-site test detect a small number of genes under positive selection, relative to expectations from some other approaches (Bierne and Eyre-Walker, 2004). It has been suggested that the presence of positive selection on background branches can cause a decrease in power of the branch-site models (Kosakovsky Pond et al., 2011). However, we observe only a marginal decrease of power in our simulations (see Supplementary). This decrease is small both in the classical branchsite model and in the model with rate variation. Thus, this background selection effect does not seem to explain the small number of genes detected. While the gene list obtained with rate variation is shorter and thus provides less signal for anatomical enrichment, testis notably remains highly significant, as expected for selection on sexual conflict in flies (Supplemen-tary Table S12).
An important question is why accounting for rate variation changes the statistical properties of the test. Indeed it has been argued (Yang, 2014) that comparison between d N and d S is a contrast between the rates before and after the action of selection on the protein, and should not be biased by nucleotide rate variation. We hypothesize that d N /d S overestimation is caused not only by the variation in d S , but also by codon-specific substitution rates. Indeed, having a small percentage of rapidly evolving codons in the gene would not be captured by an overall rate for d S , and therefore would be interpreted as positive selection by models with protein level but without nucleotide level rate variation. Conversely, fully accounting for rate variation allows detecting these codons as rapidly evolving by the signatures of both synonymous and nonsynonymous substitutions.
There is recent evidence that double mutations in coding sequences increase the branchsite model false positive rate from 1.1% to 8.6% in similar datasets to those investigated here (Venkat et al., 2017). The interaction between this effect and rate variation along the gene is worth investigating.
We compared two different models accounting for rate variation: the site variation model of Rubinstein et al. (2011) and our new codon variation model which extends Scheffler et al. (2006). The codon rate variation model can be informally thought of as a special case of the site rate variation model. Despite that, the codon rate variation performs better both in the simulations (Table 2, Supplementary Table S2) and on the vertebrate dataset (Supplementary  Table S4). There are probably two reasons for that. First, the fact that we can assign a rate to a particular nucleotide position does not necessary mean that we can reliably estimate it. Only two amino acids allow single nucleotide synonymous substitution associated with the first or second codon positions. This means that individual position rates can be estimated mostly through non-synonymous substitutions, which are typically rare compared to synonymous ones. Moreover, the branch-site and M8 models allow variation in the nonsynonymous rate over codon positions, which means estimates of ω and site rates are not independent. However, there is no traceable dependency with codon rate variation (Fig. 4, Supplementary  Fig. S7).
Secondly, we expect site rates to be autocorrelated along the sequence since many factors, such as GC content, recombination rate, or chromatin state change slowly over the gene. Indeed we see a signal of such autocorrelation in our data (effect size 0.018, p-value < 0.0001). Therefore having an independent rate for every site is probably redundant.
One of the key advantages of codon variation relative to site variation is computational performance. Having a distinct rate for every position increases the number of site classes for which likelihood computations have to be performed by a factor of K 3 , where K is the number of discrete categories for gamma distribution. Having a rate only for each codon increases the number of site classes by a factor of K. Which means that even for four discrete categories, the slowdown of likelihood computation for site rate gamma model will be about 64 times, vs. only 4 times for codon rate variation model. In practice this ratio between the two models was respected in simulated and vertebrate data. This makes codon rate variation models usable in large real-world datasets, as we demonstrate by analyzing the dataset of 12 Drosophila genomes.
Unlike traditional mechanistic codon models, our new models allow independent estimations of substitution rate at the nucleotide level and of selective pressure on amino acid sequences. It should be noted that individual site rates estimates may be still noisy because of the amount of data available. But given enough data it is possible to have accurate estimates of selection acting on specific regions, e.g., splicing motifs, within coding sequences (Fig. 4).

V. Conclusions
We present here a new codon rate variation model family. These mechanistic codon models relax an unrealistic assumption that the only source of substitution rate variation over the gene sequence is selection on the protein. Failure to account for this leads to both type I and type II errors. We demonstrate that our model has a good statistical performance both in the presence and in the absence of rate variation. We also provide a software implementation of the new models. Rate variation is strongly supported by homologous genes both from species with larger (flies) and smaller (vertebrates) effective population sizes. We are able to capture differences in substitution rates caused by nucleotide selection. Importantly, while being more complex these model remain computationally tractable and therefore can be applied to large-scale datasets. These models open the opportunity of simultaneous analysis of different layers of selection.

I. Sequence simulations
We simulated six datasets ( Table 1) that include either no rate variation across sites (corresponding to the GY94 model), variation between sites (corresponding to the Rubinstein et al. (2011) model) and variation between codons (corresponding to our new approach). Each dataset contains 1,000 alignments simulated under the null hypothesis H0 with no positive selection (all ω ≤ 1) as well as 1,000 alignments under the alternative hypothesis H1 with positive selection (some ω > 1). All the datasets had between 8 and 12 sequences composed of 100 to 400 codons and were simulated using the software cosim. The parameters of each simulation, including the alignment length and the number of species, were generated at random from their respective distribution (Supplementary  Table S13, Supplementary Fig. S10). Maximum values of ω > 1 for H1 were 79 and 15 for the branch-site and M8 models, respectively. Values of α were within the range of values estimated from the real data ( Supplementary  Fig. S11), with an emphasis on smaller values where the variation is stronger. For the simulations including rate variation, we used four discrete gamma categories that we assigned either to sites or to codons. The M8 model assumes that the neutral sites and those under purifying selection have an ω drawn from a beta distribution and we represented this distribution using five discrete categories. Finally, to simulate evolution under the branch-site model, we randomly selected one "foreground" branch of the phylogenetic tree (either internal or terminal) for every simulated alignment.

II. Vertebrate and Drosophila datasets
We analyzed two biological datasets. Our goals were to compare the fit of the different models on real data, and to study which gene features are contributing to the variation of the substitution rate. First, we used a vertebrate one-to-one orthologs dataset (Studer et al. (2008), available at http://bioinfo.unil.ch/supdata/ positiveselection/Singleton.html) consisting of 767 genes (singleton dataset). This dataset was already used in previous studies of codon models (Fletcher and Yang, 2010;Gharib and Robinson-Rechavi, 2013;Davydov et al., 2017).
We also used a subset of one-to-one orthologs from 12 Drosophila species from the Selectome database (release 6, http:// selectome.unil.ch/). This dataset consists of 8,606 genes, and the alignments are filtered to remove unreliably aligned codons (Moretti et al., 2014).

III. Model parameters inference
For all the tests on simulated data we used the correct (i.e., simulated) tree topology, but starting branch lengths were estimated using PhyML v. 20131022 (Guindon et al., 2010) with the model HKY85 (Hasegawa et al., 1985). We did not start the optimization from the true branch lengths, by similarity to a real use-case, when only gene sequences are available, and the true branch lengths are unknown. While tree topology is also inferred in real use-cases, and wrong topology could impact the inference of positive selection (Diekmann and Pereira-Leal, 2015), investigating this is outside the scope of our study. Optimization of all model parameters jointly with branch lengths is not practical and substantially increases the computational load. We instead first estimated branch lengths using the simpler M0 model, which assumes a constant ω across branches and sites, and optimized in a second step the model parameters of the M8 or branch-site models with or without rate variation, while fixing branch lengths. A similar approach was used in previous studies (Scheffler et al., 2006;Moretti et al., 2014).
We show that this approach at least in the case of the absence of variation does not decrease significantly the statistical properties of the positive selection inference ( Supplementary  Fig. S12).
The model optimization was performed in Godon, followed by model selection (see below).
For BUSTED we used an implementation available in HyPhy v. 2.2.6 . When running BUSTED on M8 simulations, positive selection was tested on all the branches jointly. In the case of branch-site model simulations, only the foreground branch (ω ≥ 1) was tested for positive selection.
For the biological datasets, all the internal branches were tested using the branch-site model for positive selection. Tip branches were not tested to reduce the potential effect of sequencing errors. The M8 model was applied to estimate substitution rates and ω for individual sites.

IV. Model selection
During model selection we had six models to choose from: three rate variation approaches and, for each, the absence or presence of positive selection. Although LRT can be used to test for positive selection, it is not possible to use it to compare across all six models that we tested (i.e., any pair of codon rate variation and site rate variation models cannot be represented as a nested pair).
We thus first used the AIC on the alternative model to select one of the three approaches to model rate variation: no rate variation, site rate variation or codon rate variation. For the Drosophila dataset, when only the no rate variation and codon rate variation models were compared, we used both AIC and LRT.
The relative support of each model was computed as a log ratio between Akaike weights (Wagenmakers and Farrell, 2004) of the model with codon rate variation and the model without rate variation.
Once the rate variation model was selected, we performed LRT to detect positive selection on the corresponding pair of models, i.e., model with ω ≤ 1 and model without this constraint. A 50:50 mix of a χ 2 distribution with one degree of freedom and of a point mass of 0 was used as a null distribution (Yang and dos Reis, 2011). False discovery rate was computed using the qvalue package (Storey et al., 2004).

V. Posterior rates inference
In order to estimate substitution rates for individual codons we used an approach similar to Rubinstein et al. (2011). First, we estimated the probability of a codon belonging to each rate as P = Pr(ρ (m, * ) = ρ k |x m , η), where ρ (m, * ) is the rate of codon m, ρ k is the k-th discrete gamma rate, x m is the data observed at codon m, and η are the parameters of the model (e.g., for M8 η = {p 0 , p, q, ω}). In this approach, η is replaced with the maximum likelihood estimate of model parametersη. Thus codon rates can be estimated as a weighted sumρ (m, * ) = ∑ K k Pr(ρ (m, * ) = ρ k |x m ,η)ρ k .
An alternative would be to use Bayes empirical Bayes (BEB, Yang et al., 2005), instead. However, BEB was developed and tested for site detection in particular codon models, and we do not know how well it is applicable to rate variation. On top of that given the increased parametric space of the model, BEB would be computationally intensive. Since we are averaging rates over multiple sites, random noise should not introduce a substantial bias.
Codon site d N /d S ratios in the M8 model can be estimated using a similar approach, while replacing codon rate categories with the ω categories.
Posterior codon rate and d N /d S estimation is implemented in Godon. In all cases we used an alternative model for posterior estimation. Since the null model for every pair is a special case of the alternative, we can use the later for parameter estimation without any significant loss of precision.
To test for autocorrelation, we used the average absolute difference between the posterior rates of the neighboring codon positions as a statistic: |ρ (m, * ) −ρ (m+1, * ) |.
The null distribution was computed by shuffling the rates within all the genes 10'000 times.

VII. Regression analysis
To estimate dependencies between various parameters (variables) we used linear models (lm function, R version 3.3.2). Variables were transformed to have a bell shaped distribution if possible (see Supplementary Table S14, Supplementary Fig. S13). Subsequently, parameters were centered at zero and scaled so that standard deviation was equal to one. This transformation allowed us to compare the estimates of the effects. Because in some cases residuals showed strong heteroscedasticity (Supplementary Fig. S14, S15), we used White standard errors (White, 1980) implemented in the sandwich R package.
Codon model param-