Physicochemical amino acid properties better describe substitution rates in large populations

Substitutions between chemically distant amino acids are known to occur less frequently than those between more similar amino acids. This knowledge, however, is not reflected in most codon substitution models, which treat all non-synonymous changes as if they were equivalent in terms of impact on the protein. A variety of methods for integrating chemical distances into models have been proposed, with a common approach being to divide substitutions into radical or conservative categories. Nevertheless, it remains unclear whether the resulting models describe sequence evolution better than their simpler counterparts. We propose a parametric codon model that distinguishes between radical and conservative substitutions, allowing us to assess if radical substitutions are preferentially removed by selection. Applying our new model to a range of phylogenomic data, we find differentiating between radical and conservative substitutions provides significantly better fit for large populations, but see no equivalent improvement for smaller populations. Comparing codon- and amino acid models using these same data shows that alignments from large populations tend to select phylogenetic models containing information about amino acid exchangeabilities, whereas the structure of the genetic code is more important for smaller populations. Our results suggest selection against radical substitutions is, on average, more pronounced in large populations than smaller ones. The reduced observable effect of selection in smaller populations may be due to stronger genetic drift making it more challenging to detect preferences. Our results imply an important connection between the life history of a phylogenetic group and the model that best describes its evolution.


Introduction
Quantifying the impact of natural selection on proteins is of broad interest in evolutionary biology, providing insight into the structural and they adapt to an organism's environment. The most widely used method for studying selection using multiple sequence alignments of proteincoding sequences is to consider the ratio of the non-synonymous substitution rate (dN ) to 10 the synonymous substitution rate (dS), often referred to as ω = dN/dS. These codon-based models assume that dS reflects the neutral rate of evolution and dN represents the rate after selection has acted. The ω measure is well-15 characterised, and established methods allow it to vary along a sequence, through an evolutionary tree, or a combination of both (Goldman and Yang, 1994;Muse and Gaut, 1994;Yang, 1998;Yang andNielsen, 1998, 2002). Although useful,20 ω is coarse-grained because dN takes on the same value regardless of the amino acids being substituted despite widespread evidence that some amino acids substitutions are more likely to occur than others. 25 Early research showed that amino acid substitutions between amino acids with very different physicochemical properties occur less frequently than substitutions between more similar amino acids (Epstein, 1967), leading to 30 the introduction of a range of scales measuring physicochemical distances (Grantham, 1974;Miyata et al., 1979). The original codon models used a dN/dS measure that incorporated these distances (Goldman and Yang, 1994), based on 35 the rationale that selection against more similar amino acid substitutions ought to be weaker than against more distant ones. Subsequent research, however, found that this model frequently provided a poorer fit than the simpler M0 model, 40 which estimates dN/dS but does not capture differences in the selective pressures acting on different amino acid substitutions .
Other studies proposed classifying amino 45 acid substitutions into conservative and radical substitutions, describing substitutions between similar and dissimilar amino acids, respectively.
The relative amount of radical and conservative change can be described by the ratio R/C, 50 which is interpreted as a measure of selective pressure acting on the two different classes of non-synonymous mutations. Several approaches have been proposed to estimate R/C by counting amino acid changes (Popadin et al., 2007;Smith, 55 2003; Weber et al., 2014;Zhang, 2000), and results suggest that radical mutations are more likely to be selected against than conservative mutations (Smith, 2003). Further, organisms with a small effective population size tend to accumulate more 60 radical substitutions than those with a larger effective population size and more efficient natural selection (Eyre-Walker et al., 2002;Popadin et al., 2007;Smith, 2003), although see (Figuet et al., 2016). However, count-based methods 65 may be unreliable in the presence of mutationand composition biases (Dagan et al., 2002;Smith, 2003), given their implicit incorporation of parsimony-type scoring. Attempts to incorporate R/C into a more statistically justified and robust 70 codon substitution model have so far been limited to surveys of positive selection in genes already known to contain sites with accelerated nonsynonymous substitution rates (Sainudiin et al., 2005). Therefore, little is known about whether 75 a model measuring R/C provides a significant improvement in model fit over M0, and to what extent statistical fit supports the notion that negative selection treats radical substitutions differently than conservative ones. 80 An alternative approach to describe variation in the selective pressures acting on amino acids is to model them at the amino acid-as opposed to the codon level. Here, "exchangeability" parameters, estimated from databases of sequences, capture 85 the rate of change between pairs of amino acids and reflect a hazy combination of the genetic code and the selective pressures acting on the substitution. These models range from simple approaches that apply an averaged substitution 90 process across the entire protein to sophisticated mixture models that have different substitution patterns for different sites (Jones et al., 1994;Le and Gascuel, 2008;Whelan and Goldman, 2001).
They have been successfully used for a range of 95 phylogenetic problems, with the mixture models proving valuable at resolving deep phylogenies (Lartillot et al., 2007). Modelling approaches that link the biophysical attributes of amino acid substitution models to codon models have also 100 been developed (Seo and Kishino, 2009;, but are currently not widely used. Recently, statistical model selection methods for choosing between codon and amino acid substitution models have been devised, allowing 105 the state space of the model that best describes a given sequence alignment to be determined (Seo and Kishino, 2009;Whelan et al., 2015).

Analyses of collections of sequence alignments
show that preference for nucleotide, amino acid 110 or codon models is data set dependent. Moreover, the best-fitting model class is correlated with the intensity of selection as estimated by dN/dS, with more constrained alignments tending to select amino acid models and less constrained 115 alignments selecting codon models. These results imply that the factors driving substitution rates between amino acids, and therefore differential model selection, could be driven by the selective pressures acting on biophysical properties, the 120 structure of the genetic code, or a combination of the two (Whelan et al., 2015).
In light of the availability of vast quantities of sequence data spanning an enormous range of taxa, we return to the question of whether 125 chemical distances predict selective preferences in protein-coding sequences. Previous analyses were restricted to a handful of mammalian sequences and therefore may not present a complete picture.
To examine the relative selective pressures acting 130 on different types of amino acid substitutions, we propose a codon model that separates nonsynonymous substitutions into conservative or radical categories. Our approach allows R/C to be robustly estimated in the presence of mutational 135 and compositional bias and long branches with multiple hits. Using a mutation-selection model we then explore how R/C will respond to variation in evolutionary variables, including effective population size (Ne), assuming that 140 radical substitutions have larger negative effects on fitness than do conservative substitutions. We next assess whether the R/C ratios observed in empirical sequence data are consistent with radical substitutions being more disruptive and 145 selectively unfavourable, and find that this is the case for a subset of alignments overwhelmingly belonging to taxa with large effective population sizes and strong selection. Moreover, we see a preference for amino acid models over standard 150 codon models in the same taxa, while small population size predicts a preference for codon models. Taken together, our results confirm that chemical amino acid distance is a predictor of the strength of selection and that organismal 155 life-history influences model selection.

Results
The Conservative or Radical (CoRa) substitution model The base model for describing codon substitutions 160 is the M0 model (Goldman and Yang, 1994) the condition ω C = ω R , which places conservative and radical mutations under the same amount of selective pressure. This restriction is absent in CoRa, so a measure of R/C describes the relative probability of fixing a radical versus a conservative 180 mutation.

Factors affecting R/C estimates
In the Methods section we describe two broad types of method for inferring R/C. The first relies on the formulation of the CoRa model and directly 185 estimates the relative instantaneous substitution rates of dR/dS = ω R and dC/dS = ω C . The second uses observed sequences to count the relative frequency of radical to conservative substitutions, K r /K c , either: i) by using a parsimony approach 190 to count the observable differences between sequences to obtain Kr Kc (t), which misses cases where more than one substitution has occurred at a site; or ii) through stochastic mapping (Minin  For short distances where few substitutions are missed Kr Kc (t) closely resembles Kr Kc (Q). However, when t becomes larger there is an increasingly 245 strong response in Kr Kc (t) to the value of ω that can be attributed to the undercounting of more common non-synonymous substitutions, which becomes more severe as ω increases. Hence, Kr Kc (t) is prone to inflation for long branches. Meanwhile, 250 variation in ω has no effect on the dR/dC = 1 estimate (not shown).
The effect of %GC-content and transitiontransversion rate on R/C is shown in Figure 1   When considering R/C we are interested in the ratio of differential selective pressures affecting different types of non-synonymous change, with 340 the assumption that conservative substitutions are more likely to be fixed than radical substitutions, due to stronger selection acting against the latter (selective coefficients: s c ,s r < 0;s c > s r ; that is, s c is less negative than s r ). Figure 2

Genomic analyses of amino acid substitution patterns
The previous section outlines how we expect dR/dC to respond to different values of N e 380 and the range of values it might take. Next, we investigate whether results from real data match these expectations by examining a range of phylogenomic datasets. It is not possible to obtain accurate estimates of N e directly, so we separate 385 our data a priori into two broad categories of high and low N e based on other authors' observations. Mammals, birds and vertebrates are assumed to have relatively low N e relative to yeast, arthropods and insects, which are assumed 390 to have relatively high N e .
Effective population size predicts relative rates of conservative and radical substitutions To further explore this hypothesis we examine 435 Figure 4, which plotsω R from the CoRa model average, subject to stronger negative selective pressure, consistent with the expectation that they are more often deleterious. We also observe that the high N e datasets tend to have lower  models, which account for biophysical differences between amino acids. In contrast, low N e datasets select codon models, which do not (see Table 1).
Hence, models that incorporate information about of substitution in the CoRa model (Fig 2 a).
To further explore these differences in preferred 495 state-space we examine the difference in AIC between the best-fitting codon model and the best-fitting amino acid model for each alignment. That is, they discard information about the evolution of individual sequences that codon models are able to capture. They might, therefore, 510 be less well tuned to the fine-grained process, but their ability to describe average fitness differences between amino acids overcomes that limitation.
We have also considered the possibility that the amount of divergence affects our ability to 515 detect preferences for different types of amino acid substitutions. We find that the tree lengths (sum of branch lengths) from the low N e datasets tend to be substantially shorter than those from the high N e datasets, but this difference does not 520 seem adequate to account for the differences in the substitution process between the two groups.

Discussion
In this study, we examine how the physicochemical and transition-transversion bias (Dagan et al., 2002;Smith, 2003) under a random alternative code (Freeland and Hurst, 1998;Haig and Hurst, 1991). The genetic code alone is unlikely to explain, however, the preference for simple models in low N e populations. It is more likely that we lack either 590 the statistical power or model resolution to detect that selective differences between broadly binned amino acids for low N e populations. More phylogenetic diversity or longer sequences might help with statistical power, and methods that 595 allow across-site variation in the non-synonymous rates might also be helpful.
The very clear support for codon-over amino acid models observed for low N e alignments may also reflect the drawbacks of not considering 600 information about the codon-level processes that influenced the evolution of the sequences (Seo and Kishino, 2008). The observation that a subset of the same sequence alignments that reject amino acid models nonetheless select CoRa 605 over M0 aligns with the idea that accounting for the structure of the genetic code does offer advantages. Overall, our results are in accord with the notion that selective constraints on amino acid changes are, on average, greater than the 610 relative constraints on radical changes compared to conservative ones (Smith, 2003) and that this is more readily detectable in high N e populations.
That the rate of molecular evolution correlates with population size as a function of the strength 615 of selection is not surprising and has previously been widely discussed (Bromham, 2011;Hua and Bromham, 2017;Woolfit, 2009). Here, we confirm that this observation also extends to the ratio of radical to conservative substitutions, R/C, in 620 accord with theoretical expectations. Although this relationship has previously been hinted at, we can now rule out the possibility that it is merely an artifact of the effect of compositional bias on count-based estimates of R/C (see discussion in 625 Weber et al. (2014)). In addition, we find that life history does not just affect rates of evolution but also more broadly predicts which models are suitable for describing them.
Our results may help explain many previous 630 observations in the literature and provide insight into possible causes for the differences in relative fit of existing phylogenetic models. There is a wide variety of available amino acid substitution models, trained on data ranging from a handful of 635 mammalian mitochondrial sequences  to vast groups of species covering the entire tree of life (Le and Gascuel, 2008). The difference between these models is usually discussed in terms of the varying protein contents they describe 640 (Keane et al., 2006) or the parameters they infer, inference (Galtier, 2001;Huelsenbeck, 2002;Tuffley and Steel, 1998;Wang et al., 2006;Whelan, 2008;Whelan et al., 2010), although its exact causes remain nebulous. Our results suggest that variation in N e across the tree might 675 be an important causal factor for this temporal heterogeneity. For instance, a change from a low to high N e along a branch might lead to a switch from 'on' to 'off' in a covarion process (Tuffley and Steel, 1998 PMSF (Blanquart and Lartillot, 2008;Lartillot and Philippe, 2004;Si Quang et al., 2008;Wang et al., 2018) only account for spatial heterogeneity along the sequence and cannot capture variation in substitution patterns through the tree.

705
Finally, our findings also suggest that one path towards capturing and understanding protein evolution is to begin explicitly incorporating N e into phylogenetic models. One particularly promising route would be the implementation of 710 mutation-selection models (Halpern and Bruno, 1998;Tamuri et al., 2012;Thorne et al., 2012;Yang and Nielsen, 2008

Methods Models
Existing codon models Standard phylogenetic methods for analysing 735 codon sequences are typically based around the M0 model derived from Yang and Goldman (Goldman and Yang, 1994). This substitution model is defined through the off-diagonal elements of the instantaneous rate matrix, where ω = 740 dN/dS, as follows:  (Sainudiin et al., 2005;Smith, 2003;Zhang, 2000), but there is limited discussion regarding how different estimates of these values into to the two subcategories ω C = dC/dS and 765 ω R = dR/dS in the following instantaneous rate matrix: i and j differ by more than a single nucleotide 1 i and j differ by a single synonymous transversion κ i and j differ by a single synonymous transition ω C i and j differ by a single non-synonymous conservative transversion ω C κ i and j differ by a single non-synonymous conservative transition ω R i and j differ by a single non-synonymous radical transversion ω R κ i and j differ by a single non-synonymous radical transition (2) In order to fully resolve this matrix we need to define the radical matrix M rad , which is a 20x20 binary matrix that labels each amino Any substitution involving a change in category for polarity or volume is considered radical, whereas substitutions altering neither property's 780 category are considered conservative (Sainudiin et al., 2005). In a preliminary study, we examined other conservative and radical classifications, but the above definition tended to provide better fitting models.

785
Under this model we can calculate dR/dC = ω R /ω C , which is a R/C measure describing the relative rates of conservative and radical substitutions. Under the constraint ω R = ω C conservative and radical changes cannot be 790 distinguished by selection and CoRa reduces to the simpler model M0. This nesting means that standard likelihood ratio tests with one degree of freedom may be used to assess the fit of CoRa relative to M0. When CoRa provides 795 no significant improvement on M0 there is no evidence for differences in the rate of conservative and radical substitutions. With a significant improvement in fit, then dR/dC < 1 represents our expectation that radical changes, which are likely 800 to disrupt the protein structure, occur at a lower instantaneous rate than conservative changes, which are more likely to maintain the current structure. A significant improvement in fit coupled with values of dR/dC > 1 represent the case 805 where radical substitutions occur significantly more frequently than conservative substitutions.

Counting-based K r /K c measures and their relationship to the CoRa model
The definition of dR and dC described above  (Minin and Suchard, 2008;Nielsen and Huelsenbeck, 2002) or by counting the number of conservative and radical changes over a very short period of time.
Using the CoRa model the expected number of 830 conservative and radical substitutions per unit time can be computed as: where 1 Radical and 1 Conservative are indicator functions that return 1 when i → j are radical and conservative non-synonymous substitutions, 835 respectively, and 0 otherwise. Expected values calculated using this approach will be referred to as Kr Kc (Q) to indicate they have been derived from the instantaneous rate matrix Q. The second approach to estimating K r /K c is to calculate the 840 the relative numbers of conservative and radical substitutions after a given evolutionary time, t.
This approach is essentially the same as above, but replaces Q with the probability matrix P(t), and fails to correct for cases where more than one 845 substitution occurs at a site. For a given value of t these quantities can be calculated as: For both of these cases, the Kr and Kc measures are comparable to those obtained from amino acid sequences when counting the 850 number of radical and conservative substitutions.
Estimates made using this approach will be referred to as Kr Kc (t).
Defining dR/dC in terms of N e Several studies have proposed that radical and 855 conservative substitutions might predict N e , as radical substitutions are expected to be removed more effectively from large populations (Eyre-Walker et al., 2002). Following the formulation of mutation-selection models, where the substitution 860 rate (Q ij ) is a product of the mutation rate from i to j and the probability of fixation of j relative to wildtype i, the CoRa model describes the relative fixation probabilities through the ω C and ω R parameters. Using the weak mutation model of 865 Golding and Felsenstein (Golding and Felsenstein, 1990) to define these parameters in terms of the selective pressures acting on conservative (s c ) and radical changes (s r ) for diploids we have: of N e on the values of ω C and ω R can also be computed in this manner, with Yang and Nielsen providing details (Yang and Nielsen, 2008).) Our approach allows us to analytically examine how dN/dS might respond for different values of N e , 885 s c and s r .
We performed basic quality control, including removing all sequences with internal stops; CTG 900 codons were masked out in the Candida clade due to a non-standard genetic code. To obtain codon alignments for the data of Salichos et al. (Salichos and Rokas, 2013), amino acid alignments were mapped back to nucleotide 905 sequences using PAL2NAL (Suyama et al., 2006).
The sequence alignments in each data set were analysed independently. The optimal state space (RY, nucleotide, codon or amino acid) using ModelOMatic with default settings (Whelan et al., 2015). starting point to fit the models (Gascuel, 1997).