Modelling structural constraints on protein evolution via side-chain conformational states

Few models of sequence evolution incorporate parameters describing protein structure, despite its high conservation, essential functional role and increasing availability. We present a structurally-aware empirical substitution model for amino acid sequence evolution in which proteins are expressed using an expanded alphabet that relays both amino acid identity and structural information. Each character specifies an amino acid as well a rotamer state: the discrete geometric pattern of permitted side-chain atomic positions. By assigning rotamer states in 251,194 protein structures and identifying 4,508,390 substitutions between closely related sequences, we generate a 55-state model that shows that the evolutionary properties of amino acids depend strongly upon side-chain geometry. The model performs as well as or better than traditional 20-state models for divergence time estimation, tree inference and ancestral state reconstruction. We conclude that the concomitant evolution of sequence and structure is a valuable source of phylogenetic information.


27
The development of evolutionary models is a prerequisite (albeit sometimes an implicit one) for many 28 common bioinformatics tasks such as recognition of homologous sequences, phylogenetic tree estimation, 29 evolutionary hypothesis testing and protein structure prediction (Huelsenbeck and Rannala 1997,  (TRP). (a) In traditional amino acid replacement models their distinct χ 1 rotamer configurations are merged into a single amino acid state. (b) In our model these states are split into three χ 1 configuration-specific states (1, 2, and 3) defined, as in the Dunbrack rotamer library, by the dihedral angle between the first two covalently linked carbons in the side-chain (C α and C β ; see also Sup. Fig. 1).  ALA  ALA  ARG  ARG1  ARG2  ARG3  ASN  ASN1  ASN2  ASN3  ASP  ASP1  ASP2  ASP3  CYS  CYS1  CYS2  CYS3  GLN  GLN1  GLN2  GLN3  GLU  GLU1  GLU2  GLU3  GLY  GLY  HIS  HIS1  HIS2  HIS3  ILE  ILE1  ILE2  ILE3  LEU  LEU1  LEU2  LEU3  LYS  LYS1  LYS2  LYS3  MET  MET1 MET2 MET3  PHE  PHE1  PHE2  PHE3  PRO  PRO1  PRO2  SER  SER1  SER2  SER3  THR  THR1  THR2  THR3  TRP  TRP1  TRP2  TRP3  TYR  TYR1  TYR2  TYR3  VAL  VAL1 VAL2 VAL3 Table 1 interchanging states (see Rotamer state exchangeability analysis). To further quantify the strength of 141 these sub-matrix patterns we use Cramér's V ( V ), a measure of association between two categorical 142 variables (here the χ 1 configurations of amino acids A and A ). Existence of strong association does 143 not guarantee a diagonal pattern (rotameric state conservation); we therefore also consider the diagonal 144 ratio for each sub-matrix, indicating the tendency of rates to lie on each 3×3 sub-matrix's diagonal. V 145 and diagonal ratio are shown in Fig. 3 for the 111 sub-matrices with significant associations. right), indicating a strong preference for conserving side-chain orientation. This exchange pattern might 148 be capturing the effect of local constraints on how freely a bulky aromatic side-chain can be positioned 149 without displacing or clashing with those of neighbouring residues. A similarly strict configuration 150 conservation can be observed for negative-negative and positive-positive exchanges; however, negative-151 positive exchanges have high V but somewhat lower diagonal ratios (Fig. 3b). These sub-matrices show 152 significant association between specific configurations of the exchanging residues but no common pattern, 153 possibly arising from the competing pressures to retain compatible geometries upon substitution but also 154 to displace the charged moiety to a new location following a charge swap. It is also interesting to note 155 that leucine has high diagonal score in exchanges with all the aromatics (aliphatic-aromatic comparisons, show in subsequent sections this provides additional inferential power from the ability to distinguish 178 states and state-interchanges according to χ 1 configuration.

179
Due to RAM55's expanded state-space, the probability of observing any amino acid, given the initial 180 state and a divergence time t, is different in the 55-state model than it is in the 20-states model. For 181 instance, a histidine residue is more likely to be substituted with an asparigine when χ 1 ≈ 60 • than when 182 in one of the other χ 1 configurations. In the RUM20 model the three χ 1 configurations are merged, and 183 thus the amino acid probability distribution at time t corresponds to the weighted average of the three 184 rotamer states. Thus, for each rotamer state, there is a divergence between the probability distributions 185 of the amino acids states at time t using the RAM55 model when compared to that when using RUM20. 186 Indeed, as RUM20 can be arrived at by merging states in RAM55, this divergence constitutes a loss 187 of information regarding the amino acid probability distribution when RAM55 is approximated using 188 RUM20. This can be quantified using the Kullback-Leibler divergence (Kullback and Leibler 1951; see

189
Kullback-Leibler divergence). At t = 0, no loss occurs due to the amino acid sequence being fully known 190 in both models. As t → ∞, both models tend towards the equilibrium amino acid frequencies and the 191 loss tends towards zero. The differences between the two models manifest in between these extremes.
192 Figure 4 shows that average information loss for one state peaks at 0. but not specifically about its exchange rates still produces a worse fit than the full RAM55 model. These 220 results confirm that, when the more complex RAM55 model matches the underlying process generating 221 the input alignments, it is possible to detect an improvement in fit over simpler models. Comparison of the RAM55 model (blue bars) against LG (green; top panel) and against our RUM20 model (orange; bottom panel) in terms of Euclidean distance of their inferred trees from the reference phylogeny used to simulate the alignment. Box plots illustrate distance distribution and median (horizontal lines), scatter plot points represent individual distance values. Tree inference is performed on alignment data sets (1000 sites, 64 taxa, 100 replicates per scaling) simulated using RAM55 and a randomly-generated reference phylogeny, scaled according to the factors on the x-axis.
As a further performance test, we also evaluated whether ML trees inferred under the RAM55 model 223 are closer to the reference phylogeny used during the simulation process than those inferred with other 224 models. For these comparisons we considered both (1) the Euclidean distance (Felsenstein 2004), a 225 metric that accounts for both topological differences between trees and differences in branch lengths, 226 and (2) the lengths of individual branches. Under the former measure, RAM55 infers trees that are at 227 least as close or closer to the reference phylogenies than those inferred by amino acid replacement models 228 such as our RUM20 model or LG.

Model benchmarking: empirical alignments
We assessed RAM55's performance on three empirical amino acid alignments -with 13, 82 and 46 taxa, 243 respectively -for which we can obtain corresponding structural information (see Empirical alignments), 244 and compared goodness-of-fit and inferred phylogenies across models. RAM55 was used in ML analyses, 245 and results compared with those derived using structure-free models such as the 20-state models LG,

246
WAG and our own RUM20, and the 55-state LGbyfreq-exp model which recognizes the frequencies of 247 the 55 states but not their structural information content (see Log-likelihood comparison across models,
249 Figure 7 shows the goodness-of-fit (  where ancestral reference sequences (and structures) are unlikely to be available. Figure 8 shows that 278 our model performs equally or slightly better than LG in terms of of amino acid state reconstruction 279 accuracy, particularly at longer evolutionary distances (see also Sup. Fig. 7 FIG. 8. Amino acid reconstruction accuracy. Amino acid states inferred using joint reconstruction from rotasequence alignments (200 sites) simulated under RAM55 using our 8-taxon reference phylogeny and scaling its branches according to the factors reported on the x-axis (note the non-linear scale used for clarity). The joint reconstruction algorithm is then employed along with RAM55 and the true phylogeny to reconstruct rotasequences at various internal (C) or terminal (A, B) nodes. The y-axis indicates the proportion of amino acid states (i.e. masked rotasequence states) correctly reconstructed for each inferred sequence. Plain boxplots indicate the distribution of percentages of sites correctly reconstructed (y-axis) by this method; the same procedure is then repeated using LG on "masked" alignments (hatched boxplots). Each box-plot contains results from 100 simulation replicates for a given node.
We thus assessed our models' accuracy when joint reconstructing ancestral rotamer states simulated 286 under the model itself and our 8-taxa phylogeny. Figure 9 shows that RAM55 is able to infer the 287 correct ancestral side chain configuration for residues belonging to internal sequences in almost all cases 288 when the ancestral amino acid state is accurately reconstructed (as shown in Fig.8). Similar results are 289 obtained for other alignment lengths and numbers of taxa in reference phylogenies (data not shown).

290
These reconstructed ancestral rotamer states could be used to predict side-chain geometry for homology 291 modelling of ancestral proteins, to assess which configuration better fits the evolutionary data.  Node C FIG. 9. Rotamer state reconstruction accuracy. Rotamer states inferred using joint reconstruction from rotasequences (8 taxa, 200 sites) simulated under RAM55 using our 8-taxon reference phylogeny and scaling its branches according to the factors reported on the x-axis (note the non-linear scale used for clarity). The joint reconstruction algorithm is then employed along with RAM55 and the true phylogeny to reconstruct rotasequences at various internal (ROOT, C, D) or terminal (A, B) nodes. The y-axis indicates the proportion of rotamer states correctly reconstructed for each inferred sequence. Boxplots indicate the distribution of percentages of sites correctly reconstructed (y-axis) by this method. Each boxplot contains results from 100 simulation replicates for a given node.

293
We have created a Dayhoff-like continuous-time Markov model that accounts for structural constraints

303
Using simulated data, we confirmed that our 55-state model captures enough information to detect the 304 χ 1 configuration-aware expanded state space, and observed that it consistently offers detectably better fit 305 to data compared to models that use the traditional 20-state space such as LG, WAG, and our RUM20.

306
Further, RAM55 appears to infer equally or more reliable phylogenies than any of the 20-state models.

307
This argues in favour of its consideration for phylogenetic analysis of protein sequences.Moreover, when 308 applied to empirical data, the model provided a better fit than any of the traditional models evaluated.

309
Our model can also be applied to perform structurally-aware reconstruction of ancestral sequences. could be used to predict side-chain geometry for homology modelling, to assess which configuration better 316 fits the evolutionary data. In this paper we apply our model to empirical data where both amino acid 317 sequences and the corresponding high-quality atomic coordinates. While an increasingly large number 318 of protein sequences can be associated with reliable X-ray crystallography data, and recent advances in  defining χ 1 (N , C α , C β and C γ for most amino acids), to ensure that rotamer state assignments were 339 based on unambiguous electron densities and not modelling artefacts. We also removed non-standard 340 residues, disordered residues and those with peptide bonds exceeding 1.8Å, the last to ensure a continuous Tabulating substitution counts 350 We combined the amino acid sequences and rotamer state sequences to produce sequences in an expanded 351 alphabet (see Table 1), which we refer to as 'rotasequences'. Each rotasequence consists of symbol pairs  Table 1.

377
The 55-state IRM Q is computed from these normalized counts as described by Kosiol and Goldman  i.e. in terms of number of amino acid state substitutions. This additional scaling factor ρ * is defined as: and corresponds to the proportion of rotamer state changes where the amino acid changes, irrespective 395 of χ 1 configuration. Then the 'superscaled' IRM is given by:  only these are considered for further analysis (e.g. Fig. 3).

416
We assessed the strength of association between the χ 1 rotamer configurations of each pair of residues Kullback-Leibler divergence 441 We measured the amount of information lost, regarding the amino acid sequence, when a 20-state model with S 55 and S 20 being the 55-and 20-state spaces, R a the χ 1 configurations of amino acid a, P RUM20 (t) 449 and P RAM55 (t) the probability matrices of the respective models at time t (see eqn. (10) below), and 450 (e.g.) P RAM55 (t,(A,R),(a,r)) the ((A,R),(a,r)) element of P RAM55 (t).

451
Likelihood calculation and maximization over phylogenies 452 We implement our models using maximum likelihood (ML) methods applied to multiple sequence 453 alignments (Felsenstein, 2004). This standard approach searches for the tree T that maximizes the where Q is the IRM of the Markov process. The likelihood of T (including tree topology and branch 457 lengths) given data (alignment) X and IRM Q can then be computed as: where L(T |Q,X i ) corresponds to the likelihood of T given the states observed at site i of X (site 459 independence assumption). L(T |Q,X i ) is computed by applying eqn. (10) to each tree branch and using 460 the pruning algorithm (Felsenstein, 1981). Maximizing L over T provides estimates T and thus the 461 most-likely phylogeny given the observed data and the current substitution model. The '+F' approach 462 to matching the model's state frequencies to the observed data can be implemented by simultaneously 463 maximizing L over these frequencies.

464
It is also generally acknowledged that sites do not evolve at the same rate, due to various evolutionary 465 constraints. The most common way of accounting for this heterogeneity is to assume that rates 466 across sites follow a discretized gamma distribution (Yang, 1994). The shape parameter of the gamma 467 distribution, α, is usually estimated by ML along with T as it is considered specific to each dataset.

468
Models using the gamma distribution to model rate heterogeneity are denoted '+G'.

469
In this study, all ML tree inferences were performed using RAxML-NG ( general kernels that are less efficient. Nevertheless, computation times remain acceptable, tending to be 477 5-10 times longer than using 20-state models (Sup. Fig. 10).

478
Tree generation and alignment simulation 479 We simulated sequence alignments under RAM55 using four randomly generated trees (8, 16, 32 or where θ C and θ D represent the totality of parameters from C and D; d(p,q) and c(p,q) are the distinct 509 and compound states observed for taxon p at site q; and π D d(p,q) and π C c(p,q) are these states' equilibrium 510 frequencies in their respective substitution models. In our application of this approach the distinct model 511 D corresponds to RAM55, whose states can be uniquely compounded into amino acid states (e.g. TRP3

515
As an independent approach to test the contribution of knowledge of rotameric configuration-state 516 substitutions, we generated 55-state models that were expanded versions of the 20-state LG model (Le 517 and Gascuel, 2008). Likelihoods are directly comparable to RAM55's since they share the same state-518 space. This model expansion operation was performed with the introduction of no information about the 519 additional states (LGexp model) or, alternatively, by accounting for just the observed frequencies of these 520 additional states in our dataset (LGbyfreq-exp). In each case, we started from LG's exchangeabilities 521 and reconstructed a raw substitution count matrix by reversing 20-state versions of eqns. (7) and (2).

522
For LGexp, this reconstructed counts matrix N was then expanded into a 55-state counts matrix (N ) where |R A |.|R A | is the product of the dimensions of a sub-matrix inN corresponding to a single cell 525 of N . Eqns.
(2) and (7) are then applied toN to derive the IRM for the LGexp model. This expanded 526 model represents the 'most-uninformed' expression of a 20-state model in a 55-state space, introducing 527 rotamer states but no information about their relative frequencies or replacement rates.

528
Alternatively, for LGbyfreq-exp, N was expanded according to where π (A,R) π (A ,R ) is the product of RAM55's equilibrium frequencies for states (A,R) and (A ,R ).

530
The LGbyfreq-exp expanded model's rates are therefore informed about each rotamer state's frequency,

531
but not the relative rates of replacement between them observed in real protein sequences. We can thus 532 compare all our models in term of their fit for a specific dataset using eqn. (12) with the likelihood 533 term corresponding either simply to L for 55-state models (RAM55, LGexp, LGbyfreq-exp) or to the 534 state-corrected likelihood obtained from eqn. (13) for 20-state models (RUM20, LG, WAG). The latter 535 is referred to as a 'state-corrected AIC score'.

536
Empirical alignments 537 We assessed RAM55's goodness-of-fit and performance on empirical data using rotasequence alignments 538 (available in Sup. Files) that can be masked by removing the rotamer configuration information in order 539 to convert then to amino acid sequences for comparison inferences with 20-state models. Alignments To test whether our RAM55 model allowed us to correctly infer unobserved ancestral states starting 560 from data simulated under the model itself, rotasequence alignments was simulated using RAM55, 561 trees with fixed topology (Sup. Fig. 11) and branch lengths scaled, in turn, according to a set of  these studies reconstruct ancestral amino acid sequences from alignment of present-day proteins that 576 in many cases lack high quality structural information, do not allow a systematic comparison of our 577 reconstructed rotasequences with reference ancestral rotasequences obtained from deposited structures.

578
To overcome this, we employed a "leave-leaves-out" (LLO) approach (Sup. Fig. 12) Fig. 11, scaled according to the factors on the x-axis. RUM20, LG and RAM55 itself are then used to perform inference over the simulated rotasequence alignments (or masked amino acid alignments for RAM20 and LG) and the resulting trees are compared to the original phylogeny in terms of Euclidean distance. RAM55 can infer trees that are closer to the original as shown by distance distributions and medians shifted towards lower values.  Figure 9: Structural data quality distributions across residues. Kernel density estimate (KDE) for the average B factor for each rotamer state (excluding alanine and glycine) across all residues in our unfiltered dataset. Average B factor is computed over the four atoms (N , C α , C β and C γ for most residues) that constitute the dihedral angle defining the χ 1 rotamer configuration. A threshold of B factor < 30 is then applied (dashed line) to ensure only highly reliable atomic coordinates are used to assign rotamer states. Only outlier density plots are labelled, for clarity; the color scheme represents overall plot density distribution at each point along the x-axis. Areas under the curves to the right of the dashed line indicates the proportions of each rotamer state in our alignments that are removed by the B factor filtering. were simulated using RAM55 and the 16-, 32-and 64-taxon trees of Sup. Fig. 11, scaled according to the factors shown on the x-axis. The plots report mean run times (in seconds) for ML inference analysis of these rotasequence alignment data under the RUM20, LG and RAM55 models in RAxML-NG (using masked alignments for the RUM20 and LG analyses). A's sequence. This can be compared to its original sequence (green). (b) The analogous procedure is then followed to reconstruct B's sequence and compare it to the original.