In mammals there is a bias in amino acid usage near splice sites that is explained, in large part, by the high density of exonic splicing enhancers (ESEs) in these regions. Is there a similar bias for the relative use of synonymous codons, and can any such bias be predicted by their abundance in ESEs? Prior reports suggested that such trends may exist. From analysis of human exons, we find that 47 of the 59 codons with at least one synonym show differential usage in the proximity of exon ends, of which 42 remain significant after correction for multiple testing. Within sets of synonymous codons those more preferred near splice sites are generally those that are relatively more abundant within the ESEs. However, the examples given previously appear exceptionally good fits and there exist many exceptions, the usage of lysine's codons being a case in point. Similar results are observed in mouse exons. We conclude that splice regulation impacts on the choice of synonymous codons in mammals, but the magnitude of this effect is less than might at first have been supposed.
Whether selection acts to induce and maintain codon usage bias is a question that is central to the neutralist/selectionist debate and has recently led to the identification of several convincing causative parameters. In bacteria, yeast, Caenorhabditis and Drosophila, biased codon usage is largely explained by selection for translational efficiency (including accuracy) (Ikemura 1985; Akashi and Eyre-Walker 1998; Duret 2002; Wright et al. 2004). The story in vertebrates, however, is far more complex (Kanaya et al. 2001; Lander et al. 2001; Duret 2002; Comeron 2004; dos Reis et al. 2004; Lavner and Kotlar 2005). Recent studies in mammals suggest that the dominant force in amino acid and codon usage is not selection for translational efficiency (dos Reis et al. 2004). Splice-related biases are, however, evident (Willie and Majewski 2004). In particular, selection for the preservation of exonic splicing regulatory elements, most notably exonic splicing enhancers (ESEs), explains the low synonymous substitution rates (Parmley et al. 2006), low SNP density (Fairbrother et al. 2004a; Carlini and Genut 2006) and low protein evolutionary rates (Parmley et al. 2007) near intron-exon boundaries, this being where there is the highest concentration of regulatory elements. It also introduces a predictable amino acid bias: those amino acids encoded by codons that occur frequently in ESE sequences are preferred near intron-exon boundaries (Parmley et al. 2007). These studies accord with several in-depth analyses of individual exons and genes (Pagani et al. 2005; Baralle et al. 2006; Raponi et al. 2007) indicating that both synonymous and non-synonymous mutations can have fitness effects via the modification of splicing, in some instances leading to genetic disorders (Cartegni et al. 2002; Chamary et al. 2006).
To what extent does selection for the preservation of exonic splice regulatory elements affect codon bias? In a few anecdotal cases it has been argued that a codon that is common in ESEs is preferred near boundaries relative to the synonymous codons that feature less commonly in ESEs (Willie and Majewski 2004; Chamary and Hurst 2005). Willie and Majewski (2004) highlighted the usage of GAA compared with GAG, both specifying glutamic acid. GAA features very much more often in splice enhancers, compared with GAG, and is greatly enriched, relative to its synonym, near intron-exon boundaries. Here we ask about the generality of this observation. In particular we ask whether, given what is known of ESEs, it is possible to predict which codons are preferred near intron-exon boundaries.
A data set of over 170,000 internal human exons (see methods; Parmley et al. 2007) was assessed to determine the usage of each codon up to a distance of 30 codons from intron-exon boundaries. At any given distance from the exon end, summing across all exons, we determined the proportion of each codon amongst the set of codons seen at this distance. For each codon we then derived a plot of the proportional usage of that codon as a function of the distance from exon ends. This trend is captured by 2 statistics. First, the correlation between the proportional codon usage and distance from the exon end was assessed by Spearman's rank correlation (Rho). Second, we took the slope (α) of the line relating the distance from the exon boundary to the proportional usage (i.e., proportional usage = α × distance from boundary + β). Naturally the 2 measures are strongly correlated. For subsequent analysis we consider only those 59 codons that specify the amino acids with at least 2-fold degeneracy.
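The two statistics described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used for the analysis: the function names are hypothetical, exons are assumed to be supplied as in-frame lists of codons, and for brevity distance is measured from one exon end only (position 1 is the codon at the boundary).

```python
from collections import Counter

def usage_by_distance(exons, codon, max_dist=30):
    """Proportion of `codon` among all codons observed at each distance
    (1..max_dist) from the exon start, pooled across exons."""
    hits, totals = Counter(), Counter()
    for codons in exons:
        for d, c in enumerate(codons[:max_dist], start=1):
            totals[d] += 1
            if c == codon:
                hits[d] += 1
    dists = sorted(totals)
    return dists, [hits[d] / totals[d] for d in dists]

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def slope_intercept(x, y):
    """Least-squares fit of y = alpha * x + beta."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    alpha = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return alpha, my - alpha * mx

# Toy input (made-up exons, for illustration only).
toy_exons = [["GAA", "GAG", "AAA"], ["GAA", "AAA", "GAG"],
             ["GAG", "GAA", "GAG"], ["GAA", "GAG", "GAG"]]
dists, props = usage_by_distance(toy_exons, "GAA")
rho = spearman_rho(dists, props)
alpha, beta = slope_intercept(dists, props)
```

A negative α (and negative Rho) indicates a codon whose proportional usage falls with distance, i.e. one that is preferred near the boundary.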
From the significance values derived from the Spearman's rank correlation we find that there are significant trends in the usage of synonymous codons near intron-exon boundaries for 47 of the 59 codons, of which 42 are significant after sequential Bonferroni correction (supplementary table 1). We repeated the analysis for 115,466 exons from 14,005 mouse genes. The trends found in mice (supplementary table 2) are very similar to those reported in humans.
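For readers unfamiliar with the sequential Bonferroni correction, the sketch below shows the step-down (Holm) form of the procedure. It assumes a family-wise threshold of 0.05, which is not stated in the text, and the function name is hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per test: True where the test remains significant
    under the sequential (Holm) Bonferroni procedure.
    `alpha` here is the family-wise error rate (an assumed value)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, idx in enumerate(order):
        # The smallest P is compared with alpha/m, the next with alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - rank):
            significant[idx] = True
        else:
            break  # once one test fails, all larger P values fail as well
    return significant
```

Applied to the 59 per-codon P values, a procedure of this kind yields the subset of trends that survive correction for multiple testing.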
Can these trends be explained by the presence of splice control elements? A set of putative hexameric ESE sequences has been determined for several species including human and mouse (Fairbrother et al. 2004b). From these we were able to identify a set of 176 high confidence ESEs, by employing only those present in both the human and the mouse set. To identify those codons that are most common in splice regulatory elements, a hexamer preference index (HPI) was calculated, a modification of that previously developed (Parmley et al. 2007) (see supplementary methods 1). A high HPI value indicates that a given codon is enriched in ESEs relative to the expectation from its content in the genome, allowing for the underlying variance expected given the number of hexamers used as input.
Is it generally the case that codons found commonly in splice enhancers are preferred near exon boundaries? To determine this we considered the correlation between HPI and α, the slope of the line describing codon abundance versus distance from the exon boundary. We find a robust general trend of this variety (Spearman rank correlation, r = −0.5520, P = 7.025 × 10⁻⁶; fig. 1). This analysis, however, conflates amino acid preferences with codon preferences. To control for biases in amino acid usage near boundaries, a series of pairwise analyses was performed between synonymous codons. Under the splice enhancer model we would expect that the synonymous codon that is more abundant in ESEs would have a greater preference for usage near the splice sites (conversely, the one more profoundly avoided in enhancers should be more profoundly avoided near boundaries).
For a synonymous codon pair we considered the difference in slopes (Δα), this difference being plotted against the difference in HPI (ΔHPI). We oriented the comparisons so as to ensure that the difference in HPI was always positive. For 2-fold degenerate amino acids a simple pairwise comparison was implemented, with one comparison for each amino acid. For amino acids with greater degeneracy, every pairwise combination of synonymous codons was assessed. If the ESE model is correct we expect that a negative correlation between these 2 parameters should exist. This indeed we observed: those codons more common in ESEs are more common near splice sites (Spearman rank correlation between Δα and ΔHPI, r = −0.3098, P = 0.0036; fig. 2). More generally, we expect that the codon relatively preferred near boundaries should be relatively enriched in the ESE data set. Indeed, 63 of 87 comparisons accord with this prediction.
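The counts of 59 codons and 87 comparisons follow directly from the standard genetic code if each unordered pair of synonymous codons is compared once. The sketch below enumerates the pairs and orients each so that ΔHPI is non-negative; the function names are hypothetical, and `hpi` and `alpha` are assumed to be dictionaries mapping each codon to its HPI and slope.

```python
from itertools import combinations, product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def synonymous_families():
    """Amino acid -> list of its codons, keeping only families of size >= 2."""
    families = {}
    for codon, aa in CODE.items():
        if aa != "*":  # skip stop codons
            families.setdefault(aa, []).append(codon)
    return {aa: cs for aa, cs in families.items() if len(cs) >= 2}

def synonymous_pairs():
    """Every unordered pair of synonymous codons."""
    return [(aa, a, b)
            for aa, cs in synonymous_families().items()
            for a, b in combinations(cs, 2)]

def oriented_differences(hpi, alpha):
    """Per pair: (amino acid, high-HPI codon, low-HPI codon, dHPI, dAlpha),
    subtracting the low-HPI codon from the high-HPI codon so dHPI >= 0."""
    rows = []
    for aa, a, b in synonymous_pairs():
        hi, lo = (a, b) if hpi[a] >= hpi[b] else (b, a)
        rows.append((aa, hi, lo, hpi[hi] - hpi[lo], alpha[hi] - alpha[lo]))
    return rows

n_codons = sum(len(cs) for cs in synonymous_families().values())  # 59
n_pairs = len(synonymous_pairs())                                 # 87
```

With this orientation, a pair fits the ESE model when its Δα is negative, meaning the codon more enriched in ESEs declines more steeply with distance from the boundary.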
Might the above result stem from the fact that small exons contribute more data to the slope calculation at positions near the boundaries? To examine this possibility, we restricted the analysis to only those exons longer than 60 codons, so that all exons contribute equally at all distances. We find no important differences in the results (see supplementary table 3, Supplementary fig. 1). The same data set also permits estimation of the magnitude of the difference in a codon's usage near and far from exon boundaries. To do this we consider the average of a codon's usage up to a distance of 5 codons from the boundary (excluding the first codon), and compare this to codons 30–33 inclusive (see supplementary methods 2). At the extreme, some codons (e.g. TAT, AGA) are approximately 30% more common close to boundaries, while others (e.g. CGC, CGG) are 30% less common (see supplementary table 4).
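The near-versus-far comparison can be illustrated as follows. This is a sketch under stated assumptions: the near window is taken as codons 2 to 5, the far window as codons 30 to 33, and the change is expressed relative to the far-window mean. The exact definition is given in supplementary methods 2 and may differ; the function name is hypothetical.

```python
def near_far_change(usage, near=(2, 5), far=(30, 33)):
    """Relative change in a codon's proportional usage near the boundary
    versus far from it. `usage` maps distance (in codons, 1 = codon at the
    boundary) to proportional usage. Windows are inclusive."""
    def window_mean(lo, hi):
        vals = [usage[d] for d in range(lo, hi + 1)]
        return sum(vals) / len(vals)

    near_mean = window_mean(*near)
    far_mean = window_mean(*far)
    # Assumed convention: express the difference relative to the far window.
    return (near_mean - far_mean) / far_mean
```

Under this convention, a value of about 0.3 corresponds to a codon that is roughly 30% more common close to the boundary.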
The above results suggest that the need to specify efficient splice enhancers near intron-exon boundaries explains some of the variation in relative codon usage as one approaches intron-exon boundaries. There are, however, numerous exceptions (any data point with a difference in slope greater than zero in figure 2 is unexpected). Are we able to explain the behaviour of those synonymous codons that go against our expectations by controlling for the presence of other splicing control elements within our analysis? Although ESEs are the best studied and most prolific splice regulatory elements, exonic splicing suppressors (ESSs) are also important splice regulators (Wang et al. 2004), especially for alternative splicing events (Wang et al. 2006). A list of 133 decameric putative ESSs has been described in human (Wang et al. 2004), from which we were able to produce a DPI (Decamer Preference Index) for synonymous codons, in the same way as the HPI with minor necessary changes. The difference in DPI does not correlate with the difference in slope (P = 0.23). The indexes were then combined (the mean of the 2) to produce an overall Splice Control Preference Index (SCPI). The pairwise analysis of synonymous codon usage was repeated using the new SCPI as the indicator of codon representation in splice control elements (fig. 3). Although there are still comparisons that do not behave as we would expect, the overall trend is now a little stronger and more significant (Spearman rank correlation, r = −0.3398, P = 0.0014; fig. 3), suggesting that a combined suppressor/enhancer model provides a better fit. Under these conditions, 64 of 87 comparisons fit the expectation that the codon preferred near boundaries is more common in ESEs than its synonym.
Prior analysis comparing GAA with GAG argued that codons needed in splice enhancers are strikingly more abundant near exon boundaries (Willie and Majewski 2004; Chamary and Hurst 2005). More generally, in human and mouse genes, we commonly see trends in the usage of synonymous codons near intron-exon boundaries. However, from 2-fold degenerate amino acids, the previously noted GAA/GAG comparison (E on fig. 2) is by far the best support for the hypothesis. While there exists a general trend for those codons that are preferred near splice sites to be more common in human ESEs (higher HPI), the GAA/GAG comparison is perhaps misleading in the extent to which the trends are predictable.
Perhaps the most striking exception is lysine. This has 2 codons, AAA and AAG, but the one that is more abundant in splice enhancers (AAG) is the one that is relatively avoided near boundaries: both AAA and AAG are preferred near boundaries, in line with the observation that lysine is greatly preferred, but the slope for AAG is less negative than the slope for AAA. This may well be explained by the cryptic splice site avoidance model predicting, as it does, a force against AG-ending codons that might be mistaken for exon ends (Eskesen et al. 2004; Chamary and Hurst 2005). In this regard it is notable that the GAA/GAG comparison matches both models: it just so happens that the synonym that might act as a cryptic splice site (GAG) is also the one less employed in splice enhancers.
In sum, we find that patterns of preference of codon usage as one approaches intron-exon boundaries are modulated in a manner that, to a first order approximation, is explained by splice control elements. This model, however, fails to explain all of the variation, and numerous outliers remain.
Supplementary tables 1, 2, 3 and 4, Supplementary figure 1 and supplementary methods 1 and 2 are available at Molecular Biology and Evolution on-line (http://www.mbe.oxfordjournals.org/).
J.L.P. is funded by the United Kingdom Biotechnology and Biological Sciences Research Council.