Efficient inference, potential, and limitations of site-specific substitution models

Abstract Natural selection imposes a complex filter on which variants persist in a population resulting in evolutionary patterns that vary greatly along the genome. Some sites evolve close to neutrally, while others are highly conserved, allow only specific states, or only change in concert with other sites. On one hand, such constraints on sequence evolution can be to infer biological function, one the other hand they need to be accounted for in phylogenetic reconstruction. Phylogenetic models often account for this complexity by partitioning sites into a small number of discrete classes with different rates and/or state preferences. Appropriate model complexity is typically determined by model selection procedures. Here, we present an efficient algorithm to estimate more complex models that allow for different preferences at every site and explore the accuracy at which such models can be estimated from simulated data. Our iterative approximate maximum likelihood scheme uses information in the data efficiently and accurately estimates site-specific preferences from large data sets with moderately diverged sequences and known topology. However, the joint estimation of site-specific rates, and site-specific preferences, and phylogenetic branch length can suffer from identifiability problems, while ignoring variation in preferences across sites results in branch length underestimates. Site-specific preferences estimated from large HIV pol alignments show qualitative concordance with intra-host estimates of fitness costs. Analysis of these substitution models suggests near saturation of divergence after a few hundred years. Such saturation can explain the inability to infer deep divergence times of HIV and SIVs using molecular clock approaches and time-dependent rate estimates.


Introduction
Over time, genome sequences change through mutations and are reshuffled by recombination. Modifications to the genomes are filtered by selection for survival such that beneficial variants spread preferentially and those that impair function are purged. As a result, some parts of genomes change rapidly, while other are strongly conserved. In addition to variation of the evolutionary rate, different sites in a genome explore different subsets of the available states. Some positions in a protein, for example, might only allow for hydrophobic amino acids, while others require acidic side chains. Patterns of conservation and variation, possibly involving more than one site, are therefore shaped by functional constraints which in turn allows inference of biological function from genetic variation.
Similarly, phylogenetics aims at reconstructing the relationships and history of homologous sequences from the substitutions that occurred in the past. Modern phylogenetic methods describe this stochastic evolutionary process with probabilistic models of sequence evolution and aim to find phylogenies that either maximize the likelihood of observing the alignment or sample phylogenies from a posterior probability distribution (Felsenstein 2004).
Inferring phylogenies is a computationally challenging problem since the number of phylogenies grows superexponentially with the number of taxa and because the calculation of the likelihood is computationally costly (though linear in the number of taxa). Due to this computational complexity, the most commonly used substitution models are simple caricatures of biological complexity. The simplest substitution models assume that all sites and sequence states are equivalent and evolve at the same rate, that is, they assume an unconstrained non-functional sequence that mutates at random between the different sequence states. Such simple models are clearly inadequate and ignoring rate heterogeneity tends to result in biased estimates of divergence times or otherwise erroneous results (Yang 1996). Most commonly used models account for variation in substitution rates among sites and average properties of the substitution process such as transition/transversion bias or more frequent substitution between similar amino acids (Yang 1994;Price, Dehal, and Arkin 2010;Stamatakis 2014;Nguyen et al. 2015). To avoid over-fitting, these methods typically do not estimate a rate for each site, but treat site-specific rates as random effects that are integrated out (often using discrete approximations of a Gamma distribution (Yang 1996), mixture of multiple unimodal distributions (Mayrose, Friedman, and Pupko 2005), or a small number of fixed rates).
In addition to rate variation different sites in a protein differ in the amino acids they allow, different positions in codons experience different constraints, and secondary structure or protein binding to DNA or RNA imposes additional levels of selection. Such variation can again be accounted for by modeling multiple categories with different preferences for amino acids or nucleotides to which sites can be assigned or which can be integrated out during phylogenetic inference (Lartillot and Philippe 2004;Shapiro, Rambaut, and Drummond 2006;Kainer and Lanfear 2015).
Recent deep mutational scanning experiments have shown that site-specific preferences are mostly conserved between moderately diverged proteins (Doud, Ashenberg, and Bloom 2015). Using such experimentally inferred site-specific models in phylogenetic inference greatly increases the likelihood of the data Bloom (2014). More than two decades ago, Halpern and Bruno (1998) pointed out that ignoring that equilibrium frequencies vary from one position to another will result in underestimation of branch lengths of a phylogeny-possibly dramatically when frequencies are heavily skewed. Hilton and Bloom (2018) recently showed that experimentally measured preference not only improve the phylogenetic fit, but also results in longer branch length estimates. Models with site-specific preferences are also known as mutation-selection balance models (Bruno 1996;Yang and Nielsen 2008) reflecting the intuition that equilibrium frequencies are determined by competition of diversifying processes (like mutation) and selection for an optimal function.
Instead of partitioning the sequence into a small number of categories, we ask under what circumstances it is possible to estimate models that allow different state frequencies at every site in the alignment and what implications such variation has for phylogenetic inference. While biologically plausible, estimating such models from data exacerbates the over-fitting problem and it rarely attempted in practice. In the context of the site-specific models this issue has become known as extensive parametrization or even infinitely many parameters problem Rodrigue (2013) and Spielman and Wilke (2016). With sufficient data, however, site-specific parameters can be accurately estimated (Tamuri, dos Reis, and Goldstein 2012;Scheffler, Murrell, and Pond 2014;Spielman and Wilke 2016). Here, we implement an EM-style algorithm inspired by (Bruno 1996) to infer sitespecific rates and preferences from simulated data, quantify its accuracy and the different sources of bias and noise, and show how divergence time estimates depend on the fidelity with which site-specific model parameters are known. We apply this algorithm to large HIV-1 alignments and explore the consequences for phylogenetic inference.

Efficient inference of site-specific substitution models
Following work by Halpern and Bruno (1998), we parameterize a site-specific general time-reversible (GTR) substitution model at site a from state j to i as: Here, l a is the substitution rate at site a, p a i is the equilibrium frequency of state i at site a, and W ij is a symmetric substitution matrix that we assume to be the same for all sites (in what follows, superscript a will always refer to the position in the sequence, while subscript i; j; n; m refers to the state). The second equation ensures conservation of probability. In addition, we require P i p a i ¼ 1 and P L a¼1 P i6 ¼j W ij p a i p a j ¼ L to normalize the frequencies and fix the scale of W ij .
Extending the approach by Bruno (1996), we show in Section 3 and the supplement that the iterative update rules approximately maximize the likelihood of the data. Here, s a j is the time site a spends in state j across the tree, n a ij is the number of transitions from state j to state i at site a, c is a pseudocount analogous to a Dirichlet prior that in the absence of data will drive the p a i to a flat distribution and the substitution rates to c divided by the total tree length. To make the behavior of these update rules more explicit, we have multiplied numerator and denominator with the parameter that is being updated. This 1, shows that the update rules are multiplicative and therefore ensure positivity, and 2, illustrates that each of these rules are the ratio of the observed number of transitions between states n a ij and the expected number p a i W ij s a j -each appropriately summed over sites or states. These update rules are an example of nonnegative factorization algorithms (Lee and Seung 2001).

Accuracy of model inferences
The accuracy of this iterative solution will depend 1, on the validity of the linear approximation, 2, the accuracy to which n a ij and s a j can be estimated, and 3, the accuracy of the tree reconstruction. Furthermore, the estimation of this extensive number of parameters requires large data sets in which most positions mutate multiple times on the tree-otherwise the model will overfit the data and result in inaccurate estimates.
To assess these sources of error and effect of data set sizes independently, we simulated sequences evolving along trees and explicitly recorded the number of transitions between states at every site (n a ij ) and the time spent in each state (s a j ) for a range of sample sizes, levels of divergence, and models with different degrees of variation between sites, see Section 3. From these simulated data sets, we inferred the site-specific models using different aspects of the simulated data: 1, full knowledge of all transitions and ancestral states, 2, the tree and the tip sequences, or 3, only the tip sequences from which a tree was reconstructed. We quantified the accuracy of the inferences as the squared estimation error v 2 ¼ L À1 P a;i ðp a i À p a i Þ 2 , wherep a i and p a i are the inferred and true equilibrium probabilities, respectively. Lower v 2 corresponds to more accurate estimates.
In our first analysis summarized in Fig. 1 we quantified the accuracy at which p a i and l a can be estimated if ancestral sequences are known. As expected, the average squared error v 2 decreases with the substitution rate and the size of the data set. The error is well predicted by the average number of substitutions per site, which increases with both data set size and the substitution rate. The data in Fig. 1A are shown on double logarithmic scales, such that the approximately straight decrease in v 2 with slope À1 implies that the squared deviation is inversely proportional to the expected number of substitutions. This scaling suggests that accuracy is limited by the inherent stochasticity of the evolutionary process and that the inference uses all available information efficiently.
The simulation data were generated by drawing site-specific rates from a Gamma distribution with a ¼ 1:5 (blue lines) and a ¼ 3 (orange lines). The scaling behavior v 2 $ ðtree lengthÞ À1 is more evident for a ¼ 3 than for a ¼ 1:5. In the latter case of strong rate variation, the average squared error v 2 is dominated by a small number of sites with low rates at which frequencies are estimated poorly and the average tree length does not capture behavior at these sites.
At the largest substitution rates, branch lengths are on the order of 0.5 and the linear approximation underlying the iterative equations is not longer accurate. Nevertheless, the accuracy of thep a i continues to increase. While equilibrium frequencies are insensitive, the substitution rate estimates are affected by linearization and Equation (2) will consistently underestimate l a . This is expected as the linearization ignores cases where the same site changes twice along a branch in very much the same way as Hamming distance will underestimate branch length. The relative substitution rates, however, are accurately estimated (with Pearson correlation coefficients >0.9 as soon as the majority of sites experience several mutations across the tree), see Fig. 1B. Regularization by pseudo-counts c reduces the overfitting problem when the number of substitutions per site is small.
Above, we investigated the accuracy of model inferences when the ancestral states and the tree are known. In practice, however, ancestral sequences are unknown and need to be inferred or summed over (marginalized) which will introduce additional uncertainty. This problem was recently described as the 'Darwinian uncertainty principle' by Gascuel and Steel (2020) showing that there are fundamental limits to the accuracy at which model and ancestral states can be estimated jointly.
To test the influence of tree and ancestral state reconstruction, we reconstructed phylogenetic trees using IQ-tree (Nguyen et al. 2015) with a GTRþR10 model (simulated nucleotide sequences) or FastTree (Price, Dehal, and Arkin 2009) using the default settings (simulated amino-acid sequences). Figure 2 compares different schemes to reconstruct ancestral sequences, infer substitution models, and optimize the tree. The simplest approach is to take the inferred tree as given, reconstruct the ancestral states using a simple evolutionary model (e.g. Jukes-Cantor model) and calculate n a ij and s a i from this reconstruction. This naive approach works well up to root-to-tip distances of about 0.3, beyond which estimates deteriorate, see gains are made when iterating model inference and ancestral reconstruction using the inferred site-specific model ('Iterative model estimation', brown line). The error now continues to decrease up to root-to-tip distances of about one. At this level of divergence, tree reconstruction starts becoming problematic and branch lengths deviate substantially from their true values. Using the true tree instead of the reconstructed tree leads to continuous improvements of accuracy with increasing levels of divergence (yellow line). Similar improvements are achieved by optimizing tree branch lengths along with the model. These improvements are consistent with Gascuel and Steel (2020) in that joint estimation of ancestral states and model is possible for Yule trees in which tree length grows linearly with the number of tips while tree depth increases only logarithmically. In contrast, using the frequencies of different sequence states in the alignment as estimates for p a i is much less accurate due to the correlation induced by shared ancestry (orange lines in Fig. 1A).
We found qualitatively similar patterns for four-letter and twenty-letter alphabets in the overall accuracy of the estimated model and its dependence on the different approximations, compare Fig. 2A and B. For larger alphabets, the breakdown at large root-to-tip distances is less dramatic and sets in at larger values. Overall, we conclude that site-specific frequencies can be estimated accurately when the overall tree length comfortably exceeds ten such that most sites experienced multiple mutations.

Implications for branch length estimates in phylogenies
As previously observed by various authors (Halpern and Bruno 1998;Hilton and Bloom 2018), sites with heavily skewed preferences for specific states result in underestimation of branch lengths if these skews are not modeled appropriately. This is a straightforward consequence of the fact that the probability of observing the same state at random is P i ðp a i Þ 2 , which is increasing sharply with more peaked preferences. Models that do not account for site-specific frequencies will take a large number of sites that agree between two sequences as evidence for their close evolutionary relationship, while it is simply a consequence of resampling the same states with high probability. In other words, this underestimation is the result of incorrect models of saturation and is closely related to the observation that incomplete purifying selection can result in erroneous and apparently time-dependent evolutionary rates Wertheim and Kosakovsky Pond (2011). Figure 3 shows the average branch length (panel A) and the average root-to-tip distance of mid-point rooted trees (panel B) with branch lengths optimized using 1, the true model, 2, using the GTRþR10 model of IQ-tree, and 3, inferred models using different degrees of regularization. While branch lengths are accurately estimated when using the true model, they are systematically underestimated by the GTRþR10 model without site-specific preferences. This problem is particularly severe for the average root-to-tip distance which is dominated by long branches deep in the phylogenies that are prone to underestimation. When ignoring site-specific preferences, the inferred root-to-tip distance is essentially flat for distances greater than 1 (Fig. 3B). This effect is entirely due to skewed equilibrium frequencies, as branch length inference by IQ-tree is accurate if the simulated data had flat p a i ¼ q À1 with the same rate variation. The underestimation of branch lengths is less severe for larger alphabets or if frequencies are not heavily skewed, see Supplementary Fig. S2.
Surprisingly, using the inferred site-specific models only partially rectified the problem of branch length underestimation, despite the fact that these models are close to the true model in terms of low v 2 . The principle contributor to this deviation is inaccuracies in the rate estimates. Combining the true rates with inferred preferences reduces the error in branch length estimation (see Fig. 3, lines labeled ('true rates')).

The effect of model misspecification on divergence estimates
In the previous section, we have observed that model deviations, for example the error made during model inference, can result in substantial errors in branch length estimates. To investigate this effect more systematically, we constructed A B Figure 2. Quantification of errors stemming from tree inference and ancestral reconstruction. Panels A and B show the mean-squared deviation v 2 of inferredp a i from the true p a i for four-letter and twenty-letter alphabets, respectively. Alignment frequencies are poor estimates of the p a i even in very diverse samples, while the different phylogeny aware methods initially improve rapidly with root-to-tip distance. At large root-to-tip distances, ancestral reconstruction becomes less and less certain and estimation of p a i fails (red lines). These errors are gradually eliminated by first summing over ancestral uncertainty (violet), iteratively redoing ancestral reconstruction using the inferred model (brown), and re-optimizing branch lengths using the updated models (or using the true tree, yellow/pink). Data in this figure uses were generated assuming Gamma distributed rate variation with a ¼ 1:5.
mixtures of the true model and a model with flat p a i and/or l a as follows: For a mixing fraction c, we constructed a model with rates where hl a i is the average of the rate. Mixtures of site-specific frequencies are constructed analogously. Figure 4A shows how deviation from the true model affect relative branch length estimates. Deviation in rate estimates result immediately in underestimated branch length, while substantial effects of too flat site-specific preferences only manifest themselves once deviations are of order c ¼ 0:3. If the same preferences are used for every site (c ¼ 1), however, deviations are substantial. Deviations in rate and preferences are approximately additive. These observations are consistent with the finding above that using inferred preferences and true substitution rates result in more accurate branch length estimates than inferring both rates and preferences. Figure 4B shows the degree of mis-estimation when using flat preferences, flat rates, or both as a function of the degree of divergence. The relative error increases approximately linearly with the average evolutionary rate.

Applications to large HIV alignments
Since its zoonosis, HIV-1 has diversified into several different subtypes that differ from each other at about 20 per cent of sites A B Figure 3. Skewed equilibrium concentration results in branch length underestimates. Panel A shows the inferred average branch length as estimated by IQ-tree and TreeTime as a function of the true average branch length. Panel B shows the results of the same optimization for the average root-to-tip distance which is dominated by deep long branches that are more strongly affected. While using the true model results in unbiased branch lengths, inferred site-specific models only partially ameliorate underestimation, in particular with high pseudo-counts c (see main text). Parameters: n ¼ 1,000, a ¼ 1:5.
A B Figure 4. Sensitivity of branch length estimates on model misspecification. Panels A and B show the relative error in total tree length when using a mixture model as defined in Equation (3) for branch length inference. Panel A shows this error as a function of the mixing fraction c for hli ¼ 0:2. Panel B shows the error as a function of the evolutionary rate hli for c ¼ 1. The mixing is applied to the equilibrium frequencies p a i , the rates l a , or both. The models assume an alphabet size q ¼ 4 (nucleotides). in their genome. For each subtype, thousands of sequences are available, which typically differ by about 10 per cent (Los Alamos HIV Sequence Database 2017). This large sample of moderately diverged sequences should be suited to estimate site-specific preference using the iterative inference frame work and quantify how selection shaped the evolution of HIV.
Simple population genetic models predict that the fixation probability f a ij of a mutation from state j to state i at site a should depend on the fitness difference s a ij and the effective population size N as (Kimura 1964) On longer time scales, the fixation rates f a ij correspond to transitions rates of the site-specific GTR model Q a ij . The effective population size N plays the role of a coalescent time scale T c and in general has little to do with census population sizes, in particular in rapidly adapting populations (Neher 2013). Nevertheless, the logarithm of the ratio log f a ij =f a ji % 2Ns a ij is an interpretable quantity related to the average fitness difference between states on time scales longer than the population genetic scale N $ T c . We generalize this notion to multiple states and define a fitness score of state i at position a as the ratio of rates into and out of state i The logarithm of this ratio is expected to be proportional to the fitness difference between state i and the average alternative state at site a, while the common multiplicative factor 2 N is unknown.
We downloaded alignments of HIV-1 pol sequences from the LANL data base, constructed phylogenetic trees, and inferred site-specific GTR models at the nucleotide and amino acid level, see Section 3. We compared 2Ns a i ¼ log ðC in =C out Þ to fitness costs measured using mutation-selection balance models and within host diversity data of HIV (Zanini et al. 2017). However, instead of a linear relationship between s a i and with-in host estimates of fitness costs, we observed a linear relationship between the logarithm of fitness effects measured with-in host and s a i (see Figs 5 and 6) This relationship explains about half the variance of nucleotide effects and about one-third of amino acid effects. Interestingly, this correlation between the intrahost estimates and cross-sectional estimates decreased as soon as regularization increased above 0.05 and we therefore used a weak regularization of c ¼ 0:01.
The fact that the relationship of fitness proxies deviates from the expectation cross À sectional $ with À in host and instead is approximately cross À sectional $ log ðwith À in hostÞ points toward different process that drive cross-sectional and with-in host diversification. While the order of fitness effects with-in host and cross-sectionally seem to be mostly concordant, with-in host variation is much greater than crosssectional variation. With-in host fitness effects are masked and damped at the population level, possibly due to fluctuating selection by diverse host immune systems or epistasis (Shekhar et al. 2013;Zanini et al. 2015). The ratio of in/out rates e 2Ns a i is a very good correlate of alignment diversity (see Supplementary  Fig. S4).
Next, we quantified the effect of ignoring site-specific preferences on branch length estimates. We generated sequence pairs that evolved for a specific time t under the site-specific model inferred from the HIV-1 pol alignment and then estimated a maximum likelihood branch length between these two sequences. This estimate was done using a model with homogeneous equilibrium frequencies set to the average frequencies across sites while maintaining site-specific rate variation. As soon as the length of the simulated branch approaches 0.2, the length inferred by a model without site-specific preferences deviates substantially from the true value and saturates around 0.5 for larger and larger t. As expected, these deviations are even more severe when estimating branch length as simple sequence divergence (p-distance), while using the generating model reproduces the correct branch length. Assuming a typical evolutionary rate of HIV of 0.002 changes per site and year, this analysis suggests that length estimates of branches longer than 100 years start to become inaccurate. Furthermore, this analysis suggests very little signal to estimate the length of branches that are longer than 300 years.

Discussion
Different positions in a genome sequence or a protein change at different rates and explore different subsets of the available states. These site-specific properties are evolutionary conserved and are the basis of common homology search tools. Which states are permissible at which position can be measured with high-throughput using deep mutational scanning techniques (Fowler and Fields 2014) or can be estimated from large alignments of homologous sequences. Both approaches can reveal biological function and organization of the entity coded for by the sequence. Bruno (1996) pointed out that the frequencies of states in columns of an alignment are distorted by phylogenetic correlations and how these phylogenetic correlations can be accounted for. However, estimating these frequencies accurately requires large data sets that have only recently become available.
While accurate estimates of site-specific frequencies require correction of phylogenetic correlations, accurate phylogenetic inference requires sufficiently realistic models of sequence evolution while being computationally tractable and easy to parameterize. A common compromise is to model rate heterogeneity as random effects, while assuming identical site preferences across the sequence. More complex models allow for a small number of data partitions with different preferences (Lartillot and Philippe 2004;Kainer and Lanfear 2015).
Here, we presented method to estimate site-specific preferences using an iterative EM-type approach inspired by Bruno (1996). We showed that site-specific models can be inferred with an accuracy limited by the Poisson statistics of the substitution process using an efficient iterative scheme. For short branch lengths a parsimonious ancestral reconstruction is sufficient, while for more diverged samples iterative inference of the model, the ancestral states, and the branch lengths is necessary. The computational complexity of the inference is dominated by ancestral reconstruction and branch length optimization, which is quadratic in the alphabet size and linear in sample size and sequence length (Felsenstein 2004). We have implemented the algorithm for site-specific model inference and branch length optimization in TreeTime (Sagulenko, Puller, and Neher 2018).
We further explored how model choice affects the maximum likelihood estimates of branch lengths. As has been reported before, branch lengths are underestimated when variation in rate or preferences are not fully accounted for (Halpern and Bruno 1998;Hilton and Bloom 2018). While such underestimation often has a moderate effect on the total length of a tree, it can result in substantial underestimation of root-to-tip distance that are dominated by a few long branches close to the root. While using the correct model results in correct branch length estimates, joint inference of site-specific models and branch lengths is typically unable to recover the true branch lengths and substantial underestimation remains. These points to parameter identifiability problems rooted in the similar effects that skewed equilibrium frequencies, low rates, and shorter branch length have on the likelihood of the data. These problems are related to the inability to jointly infer ancestral states and rate parameters of evolving discrete states (Gascuel and Steel 2020). Model and tree inference alone, however, does not require acurate reconstruction of ancestral states as these can be marginalized.
Branch length estimates using models constructed by mixing the true model and flat Jukes-Cantor type models showed that misspecified equilibrium frequencies result in substantial errors as soon as the model deviates from the true model by 30 per cent or more. This mirrors observations by Hilton and Bloom (2018), who found that preferences measured for influenza HA proteins of type H1 or H3 affect branch length of in the vicinity of the focal sequence, but not globally on the tree. Using site-specific models inferred from HIV alignments, we A B Figure 6. Underestimation of divergence in HIV. Panel A shows the estimated ML branch length for the model used to generate the sequences and a model with constant equilibrium frequencies as a function of true branch length. For comparison, the p-distance between the two simulated sequences is also shown. The top axis shows branch length in units of years assuming a substitution rate of 0.002/year and site. Error bars denote one standard deviation. Panel B shows the distribution of rates across sites for both models. About 20 per cent of sites are essentially invariable, while the rates of the remainder vary by at least ten-fold. The distributions differ slightly in the overall scale since rates have been rescaled such that the average substitution rate in both models is identical.
quantified the error made when ignoring site-specific preferences. While the true models are unknown in this case, this analysis nevertheless suggests that errors are substantial as soon as branch length exceed t ¼ 0.2 ($100 years) and sequence divergence saturates at levels around 0.3. This result is consistent with the discrepancy between molecular clock-based and biogeography-based estimates of the divergence times of different SIV lineages (Worobey et al. 2010): Beyond a few hundred years, there is very little signal to estimate branch length-in particular when the underlying site-specific model parameters need to be estimated themselves. This lack of information is further underscored by the average nucleotide distances between pol sequences of different HIV and SIV strains: HIV-1 subtypes differ at about 10 per cent of sites, distances between pol and HIV-1 and SIVcpz are on the order of 25 per cent and distances between sequences in the HIV-1/HIV-2/SIV compendium alignment are about 35 per cent (all distances ignore sites with >20% gaps). These observations are compatible with evidence for frequent reversion of HIV to a preferred sequence state following immune escape (Leslie et al. 2004;Carlson et al. 2014;Zanini et al. 2015). This potentially large error in branch length estimates can seriously affect deep divergence time estimates. Typically, short branches close to the tips of the tree are used to calibrate molecular clock models. The deep branches, however, tend to be much longer and rate estimation and variation of site-specific preferences will result in saturation effects not accounted for by the model (Hilton and Bloom 2018). This effective variation in rate is related to the effect of transient deleterious mutations that inflate the rate on short time scales (Ho et al. 2005;Wertheim and Kosakovsky Pond 2011): the time it takes purifying selection to prune deleterious mutations is related to the relaxation time scales of GTR models with site-specific preferences. Instead of the qÀ1 degenerate eigenvalues of a Jukes-Cantor model, each site has a spectrum of eigenvalues and different eigenmodes relax at different speeds, generating apparently time-dependent rates.
Estimating site-specific models from sequence alignments faces one fundamental problem: Reliable estimates require observation of many changes at each site which requires many sufficiently diverged sequences. At the same time, preferences at individual sites are expected to change due to epistatic interactions with other sites in the sequence, as for example witnessed by the gradual divergence of experimentally determined preferences (Doud, Ashenberg, and Bloom 2015;Haddox et al. 2018). Hence the approach described in this work is largely restricted to cases like HIV where many moderately diverged sequences ($10%) are available. Outside of this limit there either is not enough data to reliably estimate the large number of coefficients, or epistatic interactions need to be taken into account-probably at the expense of ignoring phylogenetic signals (Morcos et al. 2011).

Model and notation
Most models of sequence evolution express the probability that sequences evolved from sequencer in time t as ðe Q a t Þ s a ;r a ; where Q a is the substitution matrix governing evolution at site a, and s a and r a are the sequence states at position a. The product runs over all L sites a and amounts to assuming that different sites of the sequence evolve independently. In absence of recombination, homologous sequences are related by a tree and the likelihood of observing an alignment A ¼ fs k ; k ¼ 1 . . .ng conditional on the tree T and the substitution model Q a can be written in terms of propagators defined in Equation (6). It is helpful to express this likelihood as product of sequence propagators defined in Equation (6) between sequences at the ends of each branch in the tree (implicitly assuming that evolution on different branches is independent and follows the same time reversible model). Unknown sequences of internal nodes fs 0 g need to be summed over and the likelihood can be expressed as Pðs c s p ; t; QÞ ¼ X fsg e 'ðfs gjT;QÞ ; wheres c ands p are the child and parent sequences of branch k, respectively, and the factor Q a p a s a 0 is the product of the probabilities of the root sequence s a 0 over all positions a. The probabilitiesp a are the equilibrium probabilities of the substitution model at position a. The latter ensures that the likelihood is insensitive to a particular choice of the tree root. This equation defines the log-likelihood ' of a particular internal node assignment fs 0 g which is given by where s a c and s a p are indices corresponding to the child and parent sequence of branch k.
The sum over unknown ancestral sequences can be computed efficiently using standard dynamic programming techniques. Nevertheless, it requires Oðn Â L Â q 2 Þ operations (where q is the size of the alphabet A) and optimizing it with respect to a large number of parameters is costly. Our goal here is to infer site-specific substitution models using a computationally efficient iterative procedure.
Instead of inferring completely independent models for every site in the genome, we follow Halpern and Bruno (1998) and only allow for site-specific rates and equilibrium frequencies while using the same transition matrix for every site. Such a site-specific GTR model, can be parameterized as: where W ij is a symmetric matrix with W ii ¼ 0 and the second equation ensures conservation of probability. In addition, we require P i p a i ¼ 1 and P L a¼1 P i6 ¼j W ij p a i p a j ¼ L to ensure that the average rate per site is l a .

Iterative optimization algorithm
The derivatives of ' with respect to l a , p a i , and W ij need to vanish at the values that maximize '. @'ðAjT; QÞ @X ¼ X fs g e 'ðfsgjT;QÞ @'ðfsgjT; QÞ @X ¼ 0; where X is one of the parameters we vary.

Implementation
We extended our package TreeTime (Sagulenko, Puller, and Neher 2018) to handle site-specific GTR models as defined in Equation (1) by adding an additional class GTR_site_specific that generalizes existing the GTR class. Since these classes have an almost identical interface and can be used interchangeably in other analysis run by TreeTime. Using the new class, TreeTime can generate sequences with site-specific evolutionary models and infer these models from sequence data using the algorithms above. In the future, this could extended to also allow site-specific W ij but this is not implemented as of now.

Generation of simulated data
To generate models and sequence ensembles for which the ground truth is known, we sampled binary trees with n ¼ 100; 300; 1; 000; 3; 000 leaves and Yule tree branch length statistics using betatree (Neher, Kessinger, and Shraiman 2013)  Given these tree, we generated multiple sets of sequences of length L ¼ 1,000 for each tip of the tree. Specifically, we explored the effect of model choice and realization of the evolutionary process by generating data sets as follows: For each tree, we sampled two site-specific models for sequences of length L ¼ 1,000 from following distribution: Site-specific rates l a were sampled iid from a Gamma distribution with parameter a ¼ 1:5 or 3.0. Site-specific preferences p a i (or equilibrium probabilities) were sampled from Dirichlet distributions with parameters a ¼ 1 for alphabets with q ¼ 4 states (nucleotides) or a ¼ 0:2 and a ¼ 0:5 for alphabets with q ¼ 20 (amino acids). These distributions correspond to an average number of effective states of N eff ¼ 2.6 (nucleotides) and 4.7 or 7.8 (amino acids), where the number of effective states is given as ð P i ðp a i Þ 2 Þ À1 . The entries of the transition matrix W ij were sampled from a Dirichlet distribution with a ¼ 2. The average substitution rate of these models was fixed to the required hli. In addition, we generated one set of models (q ¼ 4) with rate variation but uniform p a i and W ij . For each combination of tree and model, two sets of sequences were evolved using the sequence generation function of TreeTime.
In total, this amounts to eighty alignments for each of the data set sizes n ¼ 100; 300; 1; 000; 3; 000 and four different ensembles. For each alignment, we reconstructed phylogenetic trees using IQ-tree (Nguyen et al. 2015) with a GTRþR10 model or FastTree (Price, Dehal, and Arkin 2009) using the default twenty category model for nucleotide and amino-acid sequences, respectively. These trees were used to infer site-specific models from data using TreeTime, see below. The exact workflow is documented in the script src/generate_toy_data.py in the associated git repository at github.org/neherlab/ 2019_Puller_SiteSpecificGTR (last accessed 27 July 2020).

Model inference from simulated data
Simulated data and trees (reconstructed or true) were read in by TreeTime and models reconstructed using functions of TreeTime to infer models with the different approximations discussed in the text. The exact workflow is documented in the script src/reconstruct_toy_data.py in the associated git repository at github.org/neherlab/2019_Puller_SiteSpecificGTR (last accessed 27 July 2020).
Sequences were aligned to the HXB2 reference sequences using mafft (Katoh and Standley 2013), phylogenies were inferred using IQ-tree (Nguyen et al. 2015), and ancestral sequences were inferred using TreeTime (Sagulenko, Puller, and Neher 2018) via the nextstrain's augur pipeline (Hadfield et al. 2018). The different steps were assembled into a pipeline using the workflow manager Snakemake (Koster and Rahmann 2012).
The sequences and trees were then used for site-specific GTR inference as implemented in TreeTime (Sagulenko, Puller, and Neher 2018). The scripts detailing this analysis and producing the figures are available on GitHub in repository github.org/ neherlab/2019_Puller_SiteSpecificGTR (last accessed 27 July 2020).