A Mutation–Selection Model of Protein Evolution under Persistent Positive Selection

Abstract We use first principles of population genetics to model the evolution of proteins under persistent positive selection (PPS). PPS may occur when organisms are subjected to persistent environmental change, during adaptive radiations, or in host–pathogen interactions. Our mutation–selection model indicates protein evolution under PPS is an irreversible Markov process, and thus proteins under PPS show a strongly asymmetrical distribution of selection coefficients among amino acid substitutions. Our model shows the criteria ω>1 (where ω is the ratio of nonsynonymous over synonymous codon substitution rates) to detect positive selection is conservative and indeed arbitrary, because in real proteins many mutations are highly deleterious and are removed by selection even at positively selected sites. We use a penalized-likelihood implementation of the PPS model to successfully detect PPS in plant RuBisCO and influenza HA proteins. By directly estimating selection coefficients at protein sites, our inference procedure bypasses the need for using ω as a surrogate measure of selection and improves our ability to detect molecular adaptation in proteins.


Introduction
Understanding how natural selection acts on molecular sequences has long been a pursuit of evolutionary biology. For example, Kimura (1983), using a model that assumes the genome has an infinite number of sites, showed the relative rate of molecular evolution is approximately given by: where S is the selection coefficient acting on mutations. If new mutations in the genome are positively selected (S > 0) the relative rate of molecular evolution is accelerated (u > 1), whereas the rate is the neutral mutation rate (u ¼ 1) if there is no selection (S ¼ 0), and the rate is decelerated (u < 1) if mutations are negatively selected (S < 0). Equation (1), which is the relative probability of fixation of selected over neutral mutations (Fisher 1930;Wright 1931;Kimura 1962;McCandlish et al. 2015), has important implications for understanding molecular adaptation in proteins. For a sample of protein-coding sequences from various species, the ratio between the number of substitutions at nonsynonymous sites (which are under selection) and at synonymous sites (which are under weak or no selection) should approximately follow the dynamics of equation (1) (Nielsen and Yang 2003). This ratio, commonly known as x ¼ d N =d S , is widely used as a test of molecular adaptation in proteins, with x > 1, x ¼ 1, and x < 1 interpreted as evidence of molecular adaptation (positive selection), neutral evolution, and purifying selection respectively. However, Kimura's relative rate of molecular evolution (eq. 1), based on the infinite-sites model (Kimura 1969;Sawyer and Hartl 1992), assumes all new mutations appear at new sites in the genome. This assumption appears unrealistic for proteins. Nielsen and Yang (2003) have argued that if amino acid fitnesses are reassigned every time a new mutation appears at a site in a protein (so that the selection coefficient, S, is always the same at the site), then equation (1) gives the relationship of S and x under a finite-sites model. However, it is not clear in which condition this fitness reassignment should apply: if an i to j mutation has selection coefficient S, then the reverse j to i mutation should have coefficient ÀS, but Nielsen and Yang's model assumes it reverts to S. Without this assumption it does not appear possible to equate x ¼ S=ð1 À e ÀS Þ. Spielman and Wilke (2015), and dos Reis (2015), used the Fisher-Wright mutation-selection model (Fisher 1930;Wright 1931;Halpern and Bruno 1998) to derive the relationship between x and the selection coefficients acting on codon sites within a finite-sites model. They showed that x 1 when selection coefficients are constant over time, that is, they are not reassigned (dos Reis 2015; Spielman and Wilke 2015); whereas x > 1 can be achieved for a short period of time after selection coefficients undergo a single shift during an adaptive event, for example, when a virus adapts to a new host (dos Reis 2015).
However, the relationship between x and selection coefficients under the more general case of persistent changes in selection over time appears unclear. This case, which we term persistent positive selection (PPS), is important because selection coefficients acting at codon sites may change repeatedly during persistent environmental changes, during adaptive radiations, and in host-pathogen interactions (such as in a virus evading herd immunity in a host population). Thus, understanding how PPS affects x in proteins can inform the development of methods to detect positive selection and give us insight onto the mechanisms of adaptive evolution in general.
Here, we develop a mutation-selection model of codon substitution under PPS. The new model can be used to study the mechanistic relationship between the scaled selection coefficients and x, providing insight into the evolutionary dynamics of proteins under PPS. Furthermore, we develop a penalized-likelihood implementation of the model and successfully use it to detect PPS directly in real proteins bypassing the need to use x as a surrogate measure of selection. Analysis under the new model indicates codon substitution is an irreversible Markov process, leading to a highly asymmetrical distribution of selection coefficients among substitutions in proteins under PPS. More strikingly, the PPS model shows the criteria x > 1 to detect molecular adaptation in proteins is conservative and indeed arbitrary, as we find evidence of PPS at codon sites where x < 1.

The PPS Codon Substitution Model
We develop the new model by integrating the nonhomogeneous selection model of Kimura and Ota (1970) with the mutation-selection codon substitution model of Halpern and Bruno (1998). Consider a population of organisms with haploid genome number N. That is, the number of copies of the genome in the population is N (i.e., the population size is N if the organism is haploid and N=2 if it is diploid). Suppose a site k in a protein-coding gene is fixed for codon i in the population, and the scaled Malthusian fitness of i is F i;k . A new mutant codon j appears at the site and has initial selective The selective advantage then decays exponentially as a function of time (Kimura and Ota 1970), for example, due to gradual environmental change. Kimura and Ota (1970) showed the fixation probability of j is approximately S ij;k =ð1 À e ÀS ij;k Þ Â N À1 where S ij;k is constant and 0 < S ij;k < S Ã ij;k . In other words, the fixation probability of j is the same as that of an allele with intermediate, but constant, selective advantage S ij;k .
It appears other types of decay function lead to the same fixation probability. For example, the same result is obtained in the case of frequency-dependent selection when the fitness of j decays exponentially as a function of the frequency of j in the population (dos Reis 2013). In the case of frequencydependent selection, once j becomes fixed, any new mutant alleles may have high fitnesses because they would be rare. We expect this type of dynamics in, for example, viruses escaping the herd immunity of a host population. Similarly, if the environment gradually shifts between two states, then the selective advantage of j or i would be continuously reset depending on the particular environment. This would then lead to resetting (or reassignment) of the fitnesses of i and j. This persistent change in the selection coefficient is what we term PPS. We formalize codon substitution under the PPS model next.
Let the selection coefficient for the i ! j mutation be S ij;k ¼ F j;k À F i;k þ Z k , where F j;k ; F i;k and Z k ð! 0Þ are constant. Let the selection coefficient for the reverse mutant, j ! i, be S ji;k ¼ F i;k À F j;k þ Z k . In other words, we have partitioned the fitnesses of j and i into two components: a constant component, F j;k and F i;k , representing structural constrains of the protein on the amino acid encoded by the codon; and Z k , the PPS component. Thus, when Z k > 0, the selection coefficient is persistently reset with new mutations.
The substitution rate from i to j at location k, q ij;k , is equal to the neutral mutation rate, l ij , times the number of i alleles in the population, N, times the fixation probability of the j mutant (Kimura 1983;Halpern and Bruno 1998). Assuming synonymous substitutions are neutral, this gives: q ij;k ¼ l ij S ij;k 1Àe ÀS ij;k ; if the substitution is nonsynonymous; l ij otherwise: 8 < : (2)

Irreversibility of Codon Substitution under PPS
Equation (2) describes codon substitution as a continuous Markov process. Polymorphisms are ignored and the population is assumed to switch from i to j instantaneously. This assumption appears reasonable if Nl ij ( 1, for all l ij (Bulmer 1991). The proportion of time location k remains fixed for j (i.e., the stationary frequency of j) is p j;k . A Markov process is said to be reversible in equilibrium if it satisfies the detailedbalance condition p i;k q ij;k ¼ p j;k q ji;k (Grimmet and Stirzaker 2004). When Z k ¼ 0, the model of equation (2) is reversible (Yang and Nielsen 2008). However, when Z k > 0 the process is, in general, irreversible because the detailed balance condition does not hold. When Z k > 0, the stationary frequencies are found by solving the system of equations P j p j;k q ji;k À P j p i;k q ij;k ¼ 0 with the constraint P i p i;k ¼ 1. We calculate the irreversibility index for site k as I k ¼ jp i;k q ij;k À p j;k q ji;k j, where I k > 0 indicates evolution at site k is irreversible, and I k ¼ 0 otherwise (Huelsenbeck et al. 2002).

Identifying Protein Locations under PPS
Given an alignment of protein-coding genes with corresponding phylogeny, the model of equation (2) can be used to Tamuri and dos Reis . doi:10.1093/molbev/msab309 MBE estimate the F i;k and Z k using maximum penalized likelihood. To estimate the F i;k , we use the Dirichlet-based penalty of Tamuri et al. (2014) and for Z k , we use an exponential penalty with parameter k (see Materials and Methods). For each site in the alignment, we compare the null model Z k ¼ 0 (no PPS) against Z k > 0 (PPS) using a likelihood-ratio test. Because of the boundary condition (Z k > 0) in the test and the use of penalized likelihood, the distribution of the likelihood-ratio statistic does not follow the typical v 2 distribution. Thus, we use Cox (1962) simulation approach as used in phylogenetics (Goldman 1993) to obtain the appropriate null distribution (see Materials and Methods).

The Relationship between Selection Coefficients and x
The average substitution rate of codon site k, averaged over time is: This rate can be separated into its nonsynonymous and synonymous components, and where the indicator function I N ¼ 1 if the i to j substitution is nonsynonymous, and ¼ 0 otherwise. For a neutrally evolving sequence (e.g., a pseudogene) the corresponding rates are: where p ð0Þ i is the stationary frequency of i without selection, which is the same for all sites. Thus, the relative nonsynonymous rate is: See dos Reis (2015) for the full derivation. Spielman and Wilke (2015) give a slightly different definition of x k (see also Jones et al. 2016, Youssef et al. 2020.
We note the PPS model is general and has other models as special cases. For example, when Z k 6 ¼ 0 and F i;k ¼ F j;k for all i, j, we have: and the model of equation (2) can be written as q ij;k ¼ l ij x k if the substitution is nonsynonymous and q ij;k ¼ l ij otherwise. In other words, the classic codon models (Muse and Gaut 1994;Yang and Nielsen 1998) are a special case of equation (2) when all codons are assumed to have the same fitness. On the other hand, when Z k ¼ 0 and F i;k 6 ¼ F j;k , the model of equation (2) reduces to the mutation-selection model of Halpern and Bruno (1998).

Detection of PPS in Simulated Data
Extensive simulations on the estimation of F i;k are available in Tamuri et al. (2012Tamuri et al. ( , 2014. Here, our focus is on using simulations to assert whether sites under PPS (Z k > 0) can be identified using Cox's method. We simulate codon alignments (1,000 codons in length) on a 512-taxa phylogeny, under various strengths of PPS, with Z k ¼ 0; 2; 5; and 10. The values of F i;k are drawn from random distributions to produce sharp amino acid profiles as in real proteins (see Materials and Methods). These F i;k and Z k values result in x k values roughly between 0.05 and 6 (eq. 3). When Z k ¼ 0, 6.6% of sites are incorrectly detected to be under PPS, which is slightly higher than the 5% error I threshold (table 1). When the selective advantage is slight (Z k ¼ 2Þ, the method roughly identifies 44% of sites under PPS (table 1). The power of the method is excellent and roughly over 95% when the selective advantage is strong (Z k ! 5). We note the exponential penalty on Z k has a noticeable, albeit slight, effect on the power of the test. When the penalty parameter, k is small, the resulting penalty is diffuse and the penalty is weak. However, as k increases, the penalty becomes stronger with probability density in the exponential moving toward zero. In this case, estimates of Z k ) 0 are more strongly penalized and this translates in a small reduction in the power of the test (table 1). We note the penalized likelihood method used here is essentially the same as posterior mode finding giving our penalties are proper probability densities (Cox and O'Sullivan 1990;Tamuri et al. 2014), and thus the penalties on Z k and F i;k act as prior densities which regularize the parameter estimates (Cox and O'Sullivan 1990).

Detection of PPS in Real Proteins
We tested for PPS sites in three real sequence data sets: the hemagglutinin protein (HA) from human influenza H1N1 virus, the rbcL protein subunit from flowering plants, and the mitochondrial cytochrome b (CYTB) protein from mammals (table 2). Given the multiple sequence alignment, phylogeny, and mutational parameters, we estimated the F i;k and Z k using two penalty strengths, k ¼ 0:001; and 0.05. We then performed the LRT of PPS versus no PPS and used false discovery rate (FDR) at the 5% level to identify sites under PPS. Using the weak penalty, k ¼ 0:001, we detected PPS (Z k > 0) at 65 sites in the plant rbcL and 18 sites in the influenza HA, but we found no PPS sites in mammal CYTB  figure 1A to A 00 . When using the stronger penalty, k ¼ 0:05, the number of sites detected in rbcL and HA are reduced to 50 and 17 sites respectively (table 2). This is not unexpected because, as noted above, stronger penalties push estimates of Z k toward zero affecting the likelihood ratio test.

The Distribution of Selection Coefficients at Sites under PPS Is Asymmetrical
We estimated the distribution of selection coefficients among nonsynonymous substitutions (Tamuri et al. 2014) in the three protein-coding genes analyzed ( fig. 1B to B 00 ). For non-PPS sites (i.e., sites where Z k ¼ 0), the distribution of selection coefficients is symmetrical, with a mode at S ¼ 0, because in this case codon substitution is reversible and the detailed balance condition guarantees the proportions of slightly advantageous and deleterious mutations fixed in the population will be equal over time (Yang and Nielsen 2008). However, among PPS sites in plant rbcL and influenza HA, the distribution is highly skewed with a mode at S > 10 because irreversibility of the substitution process means the detailed balance condition does not apply, and hence there is a persistent excess of advantageous mutations being substituted into the population. For example, for sites with Z k ! 10, the irreversibility index is as high as 0.12, indicating there is a deviation of 12% of substitutions from detailed balance, which is a strong deviation ( fig. 2A). Larger values of Z k are also associated with faster substitution rates ( fig. 2B) and larger x k values ( fig. 2C). For example, for sites with Z k ! 10, the corresponding x k values range from about 1 to 4 ( fig. 2C).

PPS Sites Are under Strong Purifying Constraints
At equilibrium, the average selection coefficient of new mutations at site k is: where P ij ¼ l ij = P j l ij is the probability that the next mutation is j given the site is currently fixed for i (dos Reis 2015). If most new mutations are very deleterious, then the site is under purifying selection and S k < 0; whereas if most new mutations are advantageous, the site is under positive selection and S k > 0. Historically, x k has been used as a proxy for S k , based on the approximation of equation (1) (Nielsen and Yang 2003). Thus calculating S k should provide insight into the relationship between the strength of selection at a site and x k . Figure 1C to C 00 shows the estimated S k for the three data sets plotted against x k . For 43 PPS sites in rbcL and one PPS site in HA, we find that S k < 0. This shows PPS sites are effectively under a mixture of purifying selection against deleterious amino acid substitutions, and diversifying selection in favor of a few amino acids that substitute rapidly among each other. This trend is evidenced when studying the pattern of PPS substitution in the influenza HA protein. The H1N1 influenza virus entered the human population sometime prior to the 1918 influenza pandemic (Taubenberger et al. 2005;dos Reis et al. 2009) and has remained largely as a single lineage since then, except from the introduction of a separate lineage of reassortant H1N1 swine virus in the 2009 pandemic (Smith et al. 2009). Figure 3 shows the pattern of amino acid substitution for the 18 PPS sites in influenza HA between 1918 and 2009. For example, site 3 remained virtually fixed for alanine between 1918 and the late 1990s, and then suffered several back and forth substitutions between alanine and valine between the late 1990s and 2009, whereas site 142 has been characterized by shifts between lysine and asparagine between 1918 and 2009. It is clear from figure 3 that the majority of PPS sites in the HA protein are characterized by back-and-forth substitutions among a fairly reduced set of amino acids.

Discussion
Mutation-selection models of codon substitution have been successfully used to study the distribution of selection coefficients in proteins (Rodrigue et al. 2010;Tamuri et al. 2014), to detect selection shifts during adaptation (Parto and Lartillot 2017), shifting balance (Jones et al. 2016), and to understand protein evolution given structural constraints (Youssef et al. 2020). Previous works have also accommodated a x parameter within the mutation-selection model to detect adaptation at amino acid sites (Yang and Nielsen 2008;Rodrigue and Lartillot 2017;Rodrigue et al. 2021). However in these works x is a separate parameter and not a function of the selection coefficients and thus its population genetics interpretation is not clear (Rodrigue and Lartillot 2017). Here, we extended the mutation-selection framework to the case of PPS without the need for the additional x parameter. Instead, in the new model, x is a function of the selection coefficients and we believe this modeling approach can help gain insight on the nature of protein adaptation.
The new PPS model is flexible as it appears to have performed well for the different modes of selection studied here. For example, rbcL is the major subunit of the RuBisCo enzyme responsible for the fixation of carbon during photosynthesis. The efficiency of RuBisCo is affected by environmental factors and rbcL has been under persistent adaptive pressures during the successful adaptive radiation of angiosperms around the ecoregions of the world (Kapralov and Filatov 2007;Parto and Lartillot 2018). This is akin to the environment change model envisaged by Kimura and Ota (1970). On the other hand, the influenza HA protein is the classic example of positive MBE selection on a pathogen evading its hosts' herd immunity (Fitch et al. 1997), and we showed here the PPS model performed well in detecting this mode of adaptation. We believe the new PPS model, together with previous mutation-selection models that relaxed the assumption of constant fitnesses (Tamuri et al. 2012;Parto and Lartillot 2017), now encompass the major modes of selection in proteins. We would like to note here two features of codingsequence evolution that are ignored in our formulation of the PPS mutation-selection model. First, the model assumes amino acid sites within the protein evolve independently. This is unrealistic because amino acids are linked and their substitution pattern is affected by interactions with other amino acids within the folded protein (Pollock et al. 2012;Youssef et al. 2020). In particular, substitutions toward suboptimal amino acids can be compensated by rapid substitution in another interacting amino acid, so as to reduce contact energies in the folded protein (Pollock et al. 2012).

FIG. 2.
Relationship between Z k and evolutionary parameters for PPS sites in HA and rbcL. (A) Irreversibility index, I k , versus Z k . The index is normalized to give the expected excess number of substitutions from detailed balance. (B) Site substitution rate, r k ¼ À P i p k q ii;k , versus Z k . Note the q ij;k are scaled so that they give the relative rate with respect to a neutral sequence (Tamuri et al. 2014). Thus, if r k ¼ 1, then the site evolves at the same rate as, say, a pseudogene. (C) Nonsynonymous rate, x k versus Z k . The penalty on Z k is k ¼ 0:001 in all cases. Measuring Persistent Positive Selection in Proteins . doi:10.1093/molbev/msab309 MBE How these rapid substitutions affect evolutionary dynamics within PPS and how they should be accommodated within the inference model will require further research (Youssef et al. 2020). Second, the model assumes polymorphism is absent and new mutations either become fixed or lost instantaneously. This assumption appears reasonable for most populations of plants and animals because, in these, the scaled mutation rates, Nl, are much less than one (Lynch and Conery 2003). Even for influenza, a fast-evolving RNA virus, estimates of Nl are in the order of 10 À3 (Zhao and Illingworth 2019). However, levels of standing polymorphism can be substantial in many microorganisms (Lynch and Conery 2003) or for some loci under certain forms of selection (e.g., selection in favor heterozygotes, Hughes and Nei 1988). Incorporating polymorphism within the mutation-selection inference machinery will be challenging, but recent polymorphism-aware phylogenetic approaches may provide a way forward (De Maio et al. 2015).
Perhaps the most important insight from the application of the PPS model to real data is that the criteria x > 1 to detect positive selection in proteins is conservative. As we show here, sites under PPS are also under strong purifying constraints, and, at equilibrium, produce many deleterious mutations that are removed by selection. Because x k is the weighted average over the rate of all possible synonymous substitutions at a site, it follows x k will be reduced if there are many deleterious mutations at the site even if the site is shifting rapidly among a few positively selection amino acids. We believe this insight should be incorporated into the much faster codon substitution models used in phylogenomic analyses, such as the branch-site model (Yang and Nielsen 2002), to improve power in detecting adaptation in proteins.

Maximum Penalized Likelihood Estimation and Likelihood Ratio Test of PPS
The swMutSel model (Tamuri et al. 2012(Tamuri et al. , 2014 is the special case of swMutSel-PPS when Z k ¼ 0. We use swMutSel as a null model (H 0 : Z k ¼ 0) and swMutSel-PPS as the alternative model (H 1 : Z k > 0) in a likelihood-ratio test. The vector of fitnesses at site k, F k ¼ ðF i;k Þ and the PPS component, Z k are estimated by maximizing a penalized likelihood. The penalty on F k is the Dirichlet-based penalty of Tamuri et al. (2014), whereas for Z k we use an exponential penalty PðZ k Þ ¼ e ÀkZ k , where the regularization parameter, k, controls the strength of the penalty. When k ¼ 0, there is no penalty while k > 0 leads to increasingly stronger penalties on the estimation of Z k . During inference, the q ij;k (eq. 2) are scaled in terms of the expected number of neutral substitutions per site (Tamuri et al. 2012). This guarantees all sites are normalized to the same timescale. To speed up computation, the mutational parameters, required to construct l ij , and the branch lengths on the phylogeny are estimated under the FMutSel0 model (Yang and Nielsen 2008) as explained in Tamuri et al. (2014). We note only the differences among the F i;k enter equation (2), thus, the fitness for the most common amino acid at site k is set to zero. Large negative F i;k values are capped to À10 during numerical optimization. We recommend the optimization routine is repeated three times using different parameter start values to ensure convergence to the correct estimates.
Let the maximum penalized log-likelihood for site k be ' 0;k and ' 1;k , under the H 0 and H 1 hypotheses respectively. The test statistic is the difference in log-likelihoods d k ¼ ' 1;k À ' 0;k . If the test statistic is significantly different from zero, this is evidence site k is evolving under PPS. The distribution of the 2d k statistic, when the null hypothesis is true, does not follow a v 2 distribution. There are two reasons for this. First, because Z k ¼ 0 is at the boundary of parameter space, the test statistic would be, asymptotically, distributed as a 50:50 mixture of a v 2 distribution and a 0.5 point probability mass at 0 (Self and Liang 1987;Goldman and Whelan 2000). The second reason is that the penalty on Z k affects the 50:50 proportion because the penalty forces the estimates of Z k toward zero.
Because we do not know what the asymptotic distribution of d k should be, we use Cox's (1962) Monte Carlo simulation to obtain the null distribution of d k . For a given site k in the alignment, we simulate N replicate sites on the phylogeny using the maximum penalized likelihood estimates (MPLEs) of F k under H 0 . The distribution, D k , is determined by the 0;k is the log-likelihood difference for the i-th simulation. If the test statistic from the real data (d k ) is larger than, say, 95% of D k , we reject the null hypothesis H 0 (no PPS) and accept the alternative hypothesis H 1 (PPS) at the a ¼ 0:05 significance level. Cox's approach has been shown to work well in phylogenetic data sets (Goldman 1993). When analyzing an ensemble of sites in a multiple sequence alignment, we correct for multiple testing using a FDR procedure to select candidate PPS sites (Benjamini and Hochberg 1995).

Pad e Approximation to Calculate the Matrix Exponential
Calculation of the likelihood along a branch of length t in the phylogeny requires calculation of P k ðtÞ ¼ exp tQ k , where Q k ¼ ðq ij;k Þ is the substitution matrix (eq. 2). However, because the PPS model is irreversible, the usual Eigen decomposition algorithm used to calculate P k ðtÞ is not stable (Yang 2014 In our model, the calculation of the likelihood at a site involves multiple computations of exp tQ for every branch in the phylogeny. We choose q and j according to the largest branch length t. Because ðtQ=mÞ i ¼ ðt=mÞ i Q i , we calculate all necessary c q ðiÞ and Q i once and cache these in memory throughout the likelihood calculation. Calculating Q i once is more efficient than setting A ¼ Qt and applying the Pad e approximation directly. Instead, we compute B i ¼ ðt=mÞ i Q i ; R qq ðBÞ, and finally ½R qq ðBÞ 2 j for each value of t. We found this matrix exponentiation algorithm is approximately 1.5 times faster than the Taylor series approximation suggested in phylogenetics (Yang 2014), albeit using more memory to store the precalculated matrix powers.

Simulated Data
To test the specificity and sensitivity of the LRT for PPS, we simulated sites on a balanced 512-taxa tree with branch lengths equal to 0.0125 neutral substitutions per site (Tamuri et al. 2014). We simulated sites under a null model with no PPS (H 0 : Z k ¼ 0), and under the alternative model with PPS (H 1 : Z k > 0) with three strengths of selection Z k ¼ f2; 5; 10g, and with 1,000 sites simulated under each model setup. Following Tamuri et al. (2014), amino acid fitnesses for each site were sampled from a bimodal normal distribution with ten randomly selected amino acids chosen to have F $ Nð0; 1Þ and the remaining amino acids to have F $ NðÀ10; 1Þ. This simulation setup was chosen because it leads to simulated data that captures two important features seen in real data: 1) A bimodal distribution of selection coefficients among mutations, and 2) a sharp distribution of amino acid preferences among sites (Tamuri et al. 2012(Tamuri et al. , 2014. Simulated data were then analyzed with the swMutSel software to estimate model parameters. The branch lengths and mutation parameters were fixed to their true values (k ¼ 2, p Ã ¼ 0:25) throughout the analysis and only the sitewise fitnesses ðF k Þ and diversifying selection ðZ k Þ parameters were estimated. For each simulation setup, we calculated the MPLE and the LRT as described above, using N ¼ 100 replicates in Cox's procedure. In all analyses, the Dirichlet penalty on F k has a ¼ 0:01, and three strengths of penalty on Z k were tested, k ¼f0.01, 0.5, 1.0g.
Using the LRT results, we determined the false-positive and false-negative rates. The false-positive rate is calculated by determining the number of tests that incorrectly rejected the null hypothesis (Z k ¼ 0). The true-positive rate is calculated from the number of tests that correctly rejected the null hypothesis (Z k > 0).

Real Sequence Data
We downloaded 3,120 HA protein-coding sequences of human influenza H1 viruses (excluding 2009 pandemic-H1N1 and partial sequences) from the NIAID Influenza Research Database (Squires et al. 2012); we downloaded 3,490 RuBisCO eudicotyledon sequences from a previous study (Stamatakis et al. 2010); and we downloaded CYTB genes of placental mammals from NCBI RefSeq (O'Leary et al. 2016) mitochondria genomes. We reduced the HA and RuBisCO data sets to 466 and 478 sequences respectively by using CD-HIT (Fu et al. 2012) with clustering thresholds of 99.3% and 96% of amino acid sequence identity. The CYTB data were reduced to 418 sequences by keeping one sequence per mammal genus. Sequences were aligned using PRANK (Loytynoja and Goldman 2005), and the alignments used to estimate tree topologies with RAxML under the GTRCAT model (Stamatakis 2014). Because the swMutSel-PPS model is irreversible, trees must be rooted. Thus, outgroups were used to root the trees: avian influenza (HA), monocotyledons (RuBisCO), and monotremes (CYTB). Outgroups were removed and analyses carried out on the rooted ingroup tree (for the PPS model), and the unrooted ingroup tree (for the no PPS model). Sites with residues in fewer than 50 taxa were not analyzed. This corresponds to 31, 23, and 27 sites in the rbcL, HA, and CYTB alignments respectively. We note Z k is not identifiable if a site is conserved for a single amino acid. Such conserved sites have the same likelihood under the H 0 and H 1 hypotheses. MPLE and LRT were carried out as described above using a ¼ 0:01 in the Dirichlet penalty. We note estimates of F i;k are different between the two models Measuring Persistent Positive Selection in Proteins . doi:10.1093/molbev/msab309 MBE