Universal Constraints on Protein Evolution in the Long-Term Evolution Experiment with Escherichia coli

Abstract Although it is well known that abundant proteins evolve slowly across the tree of life, there is little consensus for why this is true. Here, I report that abundant proteins evolve slowly in the hypermutator populations of Lenski’s long-term evolution experiment with Escherichia coli (LTEE). Specifically, the density of all observed mutations per gene, as measured in metagenomic time series covering 60,000 generations of the LTEE, significantly anticorrelates with mRNA abundance, protein abundance, and degree of protein–protein interaction. The same pattern holds for nonsynonymous mutation density. However, synonymous mutation density, measured across the LTEE hypermutator populations, positively correlates with protein abundance. These results show that universal constraints on protein evolution are visible in data spanning three decades of experimental evolution. Therefore, it should be possible to design experiments to answer why abundant proteins evolve slowly.


Introduction
One consequence of the high complexity and intricate functional organization of organisms is that most mutations are deleterious. Natural selection resists the loss of function and fitness caused by mutation accumulation over time (Leiby and Marx 2014;LaBar and Adami 2017;Grant et al. 2021). This process, called purifying selection, maintains the complexity and functional integrity of evolved organisms.
Despite its importance, purifying selection has been little studied in experimental systems (Alvarez-Ponce et al. 2016), in contrast to adaptive evolution (Barrick and Lenski 2013). In two recent papers, my colleagues and I reported evidence for purifying selection in metagenomic time series of Lenski's long-term evolution experiment with Escherichia coli, often called the LTEE for short (Lenski et al. 1991;Good et al. 2017). We considered the molecular evolution of the six hypermutator LTEE populations, which have elevated mutation rates due to evolved defects in DNA repair (Tenaillon et al. 2016;Maddamsetti and Grant 2020a). These populations continue to increase in fitness due to adaptive evolution, even though genome evolution in these populations largely reflects the accumulation of nearly neutral mutations (Couce et al. 2017). In Grant et al. (2021), we reported evidence for purifying selection on aerobic-and anaerobic-specific genes in E. coli. In Maddamsetti and Grant (2020b), we then reported evidence for purifying selection on genes that were found to be essential in the ancestral LTEE strain, REL606, in a transposon mutagenesis screen (Couce et al. 2017).
Here, I report evidence that purifying selection in the LTEE reflects a universal constraint on protein evolution found across the tree of life, namely that highly abundant and highly interacting proteins evolve slowly (Fraser et al. 2002; Significance A universal evolutionary pattern is that highly abundant and highly interacting proteins evolve slowly. This pattern was discovered in analyses that cover millions of years' worth of sequence variation, so it is not clear how long it takes (decades, centuries, millennia) for such patterns to emerge. Here, I report that this universal evolutionary pattern emerges in metagenomic data that cover just 30 years of experimental evolution. et al. 2004;Drummond et al. 2005;Hahn and Kern 2005;Drummond and Wilke 2008;Alvarez-Ponce et al. 2017). Despite the universality and simplicity of this pattern of purifying selection, its proximate causes continue to be debated (Plata et al. 2010;Plata and Vitkup 2018;Razban 2019;Usmanova et al. 2021). A number of compelling hypotheses have been proposed, but consensus has not been reached. The findings reported here will not settle this debate. Nonetheless, an important consequence of my findings is that it may be possible to resolve the causes of this universal pattern by experimental means.

Rationale and Study Design
This study takes a novel approach to study the anticorrelation between protein abundance and evolutionary rates (P al et al. 2001;Drummond et al. 2005Drummond et al. , 2006Drummond and Wilke 2008;Lobkovsky et al. 2010;Yang et al. 2010;Wylie and Shakhnovich 2011;Serohijos et al. 2012;Serohijos and Shakhnovich 2014). In this section, I present the logical structure of the hypotheses and predictions under consideration and explain the methods that I use ( fig. 1).
I assume that the mutation rates in the hypermutator LTEE populations are high enough that the vast majority of observed mutations are nearly neutral hitchhikers, whose dynamics are driven by a relatively small number of highly beneficial mutations (Barrick and Lenski 2009;Levy et al. 2015;Maddamsetti, Lenski, et al. 2015;Tenaillon et al. 2016;Couce et al. 2017;Good et al. 2017;Ba et al. 2019;Maddamsetti and Grant 2020a). This allows us to infer information about mutation rates and biases (Couce et al. 2017;Maddamsetti and Grant 2020a) even under environmental and population-genetic conditions that favor strong positive selection. It follows that the mutations observed across the nonmutator and hypermutator LTEE populations, to a large extent, reflect different parts of the distribution of mutation fitness effects (DFE) per gene.
With this assumption in hand, I start from the hypothesis that purifying selection causes abundant proteins to evolve slowly. This means that the DFE for abundant proteins should contain more deleterious mutations than the DFE for less abundant proteins, all else being equal. It follows that highly abundant proteins should have fewer observed mutations in the hypermutator LTEE populations, because it is unlikely that highly deleterious mutations will reach observable allele frequencies in the LTEE, given the population-genetic conditions of the LTEE (Good et al. 2017). This is the logical basis for using the hypermutator LTEE populations to test for purifying selection on abundant proteins.
The key technical trick is that we do not need to calculate evolutionary rates for the LTEE-in fact, we can completely ignore the phylogenetic structure of each population. Instead, we only need to count the number of observed mutations per gene across all hypermutator populations, and normalize by gene length ( fig. 1). An additional benefit of this approach is that the effects of clonal interference and frequencydependent selection (Maddamsetti, Lenski, et al. 2015;Good et al. 2017) can be ignored, because these phenomena do not affect the density of mutations that are ever observed in the LTEE. By contrast, clonal interference and frequencydependent selection may have significant effects on evolutionary rates (Lang et al. 2013;Serohijos and Shakhnovich 2014;Maddamsetti, Lenski, et al. 2015;Good et al. 2017).
The great advantage of the LTEE, and other evolution experiments with microbes, is the "fossil record" of frozen samples that can be revived for comparison with later samples. The vast majority of mutations in the LTEE lie off the line of descent, but are still accessible from sequencing those frozen population samples (Good et al. 2017). By contrast, analyses of natural sequence data are largely restricted to extant within-population polymorphism and between-species fixations. The use of mutations off the line of descent in the LTEE, along with its multidecade duration, provides sufficient (and ever increasing) statistical power to discern patterns of purifying selection, such as the one discussed in this work.

Correlations between mRNA and Protein Abundance and Mutation Density per Gene in LTEE Populations
I compared the density of observed mutations in the LTEE (Good et al. 2017) with mRNA and protein abundance data for the LTEE ancestral strain, REL606, grown in DM500 media (Caglar et al. 2017). These comparisons are shown in figure 2; note that throughout this section, all significant Spearman correlation coefficients and associated P values are labeled on the figures. In the hypermutator LTEE populations, mRNA abundance during exponential growth significantly anticorrelates with mutation density, whereas protein abundance, at all time points, significantly anticorrelates with mutation density. The same anticorrelation holds, for all time points, when only nonsynonymous (i.e., missense and nonsense) mutations are considered ( I also asked whether the strength of the Spearman correlations between protein abundance and mutation density in the hypermutator populations increased over the course of the LTEE ( fig. 5). In analyses of natural sequence variation, it is understood that the strength of anticorrelation between protein evolutionary rates and protein abundance increases with divergence time among the taxa under consideration (Serohijos et al. 2012). Based on protein biophysics, Serohijos et al. (2012) additionally predicted that the strength of the anticorrelation between evolutionary rates and protein abundance would increase, but at declining rates over time. Even though the differences in measurements, units, and timescales make direct comparisons to those theoretical predictions impossible, it is striking that a similar functional form of the relationship between time and the strength of the rateabundance anticorrelation occurs with the mutations ob- Many studies have reported that highly abundant proteins evolve slowly. If this fact is caused by purifying selection, then mutations in highly abundant proteins, should be more deleterious than mutations in less abundant proteins, on average. This logic leads to the prediction that highly abundant proteins should have fewer observed mutations than less abundant proteins across the hypermutator populations of the LTEE, taking gene length into account. (B) Previous studies inferred evolutionary rates using DNA and protein sequence comparisons across species. (C) This study sums all observed mutations per gene in metagenomic time series of the long-term evolution experiment with Escherichia coli (LTEE), considering nonmutator and hypermutator populations separately. This approach increases statistical power over a rate-based approach and is affected by neither clonal interference nor frequency-dependent selection. To give a concrete example, the top panel in (C) shows the number of observed mutations (stars) in the adhesin gene yeeJ in the nonmutator population Ara À 6 over 60,000 generations. The bottom panel in (C) shows the number of observed mutations (stars) in yeeJ in the hypermutator population Ara þ 6 over the same period. For comparison across genes, the number of observed mutations is normalized by gene length. and B). By contrast, the positive Spearman correlation coefficient between synonymous mutation density and protein abundance remains steady at $0.075 for at least 40,000 generations, ranging from the 20,000-generation mark through 60,000 generations ( fig. 5C).
A limitation of these analyses is that these RNA and protein abundance data come from the ancestral LTEE clone, REL606, and so these patterns may not hold for evolved strains. To address this limitation, I examined RNA abundance data for eleven 50,000 generation LTEE clones, grown to exponential As an additional check for the robustness of these correlations, I compared the density of observed mutations per gene in the LTEE with protein abundance data in the ProteomeVis database (Razban et al. 2018). Although these data only cover 664 out of 4,205 genes analyzed in the LTEE metagenomic data, they still reveal significant anticorrelations between mutation density per gene in the hypermutator populations and protein abundance, when all mutations and nonsynonymous mutations are analyzed (supplementary fig. S9, Supplementary Material online). Corresponding results for synonymous mutations in the hypermutator LTEE populations, and for all mutation types in the nonmutator LTEE populations, are not statistically significant.

Highly Interacting Proteins Evolve Slowly in Hypermutator Populations
Another universal pattern is that highly interacting proteins evolve more slowly than those with fewer interaction partners (Fraser et al. 2002;Hahn et al. 2004;Hahn and Kern 2005;Alvarez-Ponce et al. 2017). I hypothesized that highly interacting proteins would be under strong selection in the LTEE, based on those reports, as well as previous results showing that the E. coli core genome is under positive selection in the LTEE (Maddamsetti et al. 2017), and that global regulators of gene expression show evidence of strong positive selection in both nonmutator and hypermutator LTEE populations (Maddamsetti and Grant 2020b). In particular, I hypothesized that highly interacting proteins should evolve rapidly in the nonmutator LTEE populations due to positive selection, but should evolve slowly in the hypermutator populations during to purifying selection.
I compared the number of protein-protein interactions (PPI) with the density of observed mutations across LTEE populations for every protein-coding gene in the E. coli genome, using three curated data sets of PPI in E. coli (Razban et al. 2018;Cong et al. 2019;Zitnik et al. 2019), which I refer to as the Cong data set, the Zitnik data set, and the Razban data set. These comparisons are shown in figure 6 and supplementary figure 10, Supplementary Material online. I find significant negative correlations between mutation density and PPI degree in the hypermutators (Spearman's rho ¼ À0.056, P ¼ 0.00037 for Cong data set; Spearman's rho ¼ À0.11, P < 10 À11 for Zitnik data set; Spearman's rho ¼ À0.068, P < 10 À4 for Razban data set). However, the weak positive correlations between mutation density and PPI degree in the nonmutators are not significant (supplementary fig. 10, Supplementary Material online).

Discussion
I show that a number of well-known but poorly understood correlations between mRNA abundance, protein abundance, PPI degree, and evolutionary rates across the tree of life are also found in the hypermutator populations of the LTEE. In some cases, I find significant anticorrelation between mutation densities and mRNA abundance in exponential phase, but not during stationary phase. The simplest explanation for this finding is that mRNAs decay more rapidly than the proteins they encode. Protein abundance consistently shows a negative correlation with the density of all observed mutations ( fig. 2) and with nonsynonymous mutation density across all time points (fig. 3).
It is widely believed that these correlations are driven by purifying selection on universal aspects of protein evolution (Drummond et al. 2006;Drummond and Wilke 2009;Serohijos et al. 2012;Serohijos and Shakhnovich 2014), and indeed, this is the most parsimonious explanation for why similar patterns are seen in the LTEE. An intriguing difference, however, is the positive correlation that I find between synonymous mutation density across LTEE hypermutator populations and protein abundance ( fig. 4)-which contrasts with the anticorrelation between the rate of synonymous mutations and gene expression seen in nature (Drummond and Wilke 2008). In part, this may be explained by the differences in the distribution of synonymous mutations observed in the LTEE, and the distribution of synonymous diversity per gene in nature (Maddamsetti, Hatcher, et al. 2015;Maddamsetti and Grant 2020a), although the causes for this difference between natural variation and experiment is still a matter for hypothesis generation (Maddamsetti 2016), data collection, hypothesis testing, and debate.
An important limitation of these results is that the protein and mRNA abundance data for LTEE strains were collected in DM500 and DM4000 media (Caglar et al. 2017;Favate et al. 2021). These media contain much more than the 25 mg/l glucose in the DM25 media used in the LTEE. This represents a technical compromise due to the fact that researchers have not yet succeeded in isolating sufficient mRNA from exponential phase cultures in DM25 for RNA-seq (Jagdish T and Grant N, personal communication). With this caveat in mind, my findings support the conclusion that highly abundant proteins evolve slowly in the hypermutator LTEE populations.
The causes for why highly abundant proteins evolve slowly may emerge from a number of different, and nonmutually exclusive phenomena, so many explanations have been proposed (Razban 2019). These include the protein misfolding avoidance hypothesis (Yang et al. 2010), the protein misinteraction avoidance hypothesis (Levy et al. 2012;Yang et al. 2012), the mRNA folding hypothesis (Park et al. 2013), purifying selection on protein function (Konat e et al. 2019), folding stability (Serohijos et al. 2012;Serohijos and Shakhnovich 2014), and others (Tartaglia et al. 2007;Plata et al. 2010;Kepp and Dasmeh 2014).
Differentiating among these possibilities is difficult, because it is challenging to study the causes of patterns that span millions of years of protein evolution. I do not draw conclusions about the causes of these correlations. Rather, my results show that evolution experiments are reasonable model systems to study the causes of evolutionary rate variation in proteins. A concrete approach would be to recode the genomes of hypermutator strains to modulate the anticipated action of purifying selection per protein, based on the predictions of a particular explanation, and then ask whether those predictions are borne out during experimental evolution. Breakthroughs that allow for the inexpensive recoding of whole bacterial genomes may be needed, but it is plausible that such experiments will be feasible in the future.
Many other experimental directions are possible. First, a better understanding of how chaperones and other molecular mechanisms of protein quality control affect evolutionary rates and fitness (Chen et al. 2017;Alvarez-Ponce et al. 2019;Samhita et al. 2020) is needed. We also need to better understand purifying selection on synonymous mutations (Walsh et al. 2020). Second, studies on how RNA transcription error rates (Li and Lynch 2020) and RNA folding errors affect evolutionary rates would be valuable. Indeed, mRNA accessibility seems to be an important predictor of protein abundance (Terai and Asai 2020)-and RNA chaperones buffer deleterious mutations in LTEE hypermutator strains (Rudan et al. 2015). Third, it would be interesting to experimentally test the hypothesis that protein and RNA chaperones evolve under more and more stringent purifying selection during long-term experimental evolution, which follows from the premise that hypermutator LTEE populations are affected by a mutation load that affects protein folding and stability. Studies on the existence and relevance of phenomena like evolutionary capacitance caused by the contributions that PPI make to folding stability (Dixit and Maslov 2013;Jarzab et al. 2020;Mateus et al. 2020), including cryptic genetic variation hidden by protein and RNA chaperones (Queitsch et al. 2002;Bergman and Siegal 2003;Masel 2005Masel , 2006Masel , 2013Trotter et al. 2014;Geiler-Samerotte et al. 2016;Zheng et al. 2019) during experimental evolution, and the effects of such phenomena on rates of protein evolution may be especially valuable in this regard. Finally, it would be valuable to develop a better understanding of the temperature sensitivity of evolved LTEE populations (Mongold et al. 1996(Mongold et al. , 1999Leiby and Marx 2014), and to collect data on protein evolutionary rates in long-term experiments conducted at elevated temperatures (Bennett et al. 1990;Tenaillon et al. 2012). Much remains to be explored, in regard to how evolution experiments can deepen our understanding of purifying selection on molecular and cellular organization and function.

Materials and Methods
Preprocessed LTEE metagenomic data were downloaded from: https://github.com/benjaminhgood/LTEE-metagenomic. Transcriptomic and proteomic data for REL606, grown in Davis minimal media with 500 mg/l glucose (DM500), were taken from the supplementary tables for Caglar et al. (2017). For robustness, I also analyzed the transcriptomic data for eleven  (Razban et al. 2018) tend to evolve more slowly than those with fewer interactions in the hypermutator LTEE populations.
Universal Constraints on Protein Evolution GBE Genome Biol. Evol. 13(5) doi:10.1093/gbe/evab070 Advance Access publication 15 April 2021 50,000 generation LTEE clones grown in DM4000 media (Favate et al. 2021) available at: https://github.com/shahlab/ LTEE-gene-expression. I analyzed three different data sets of PPI in E. coli. First, I used the PPI network for E. coli K-12 MG1655 in the STRING database (Szklarczyk et al. 2021) as curated by Zitnik et al. (2019). Second, I used the data set of high confidence E. coli PPI interactions reported by Cong et al. (2019), which combines coevolutionary information in large protein multiple sequence alignments with gold-standard protein complexes in E. coli reported in the Ecocyc and Protein Databank (PDB) databases (Berman et al. 2000;Keseler et al. 2013). PPI network statistics were calculated using the SNAP toolkit (Leskovec and Sosi c 2016;Zitnik et al. 2019). Third, additional data on E. coli PPI interactions and protein abundance were downloaded using the web interface to the ProteomeVis database (Razban et al. 2018), available at http://proteomevis. chem.harvard.edu/. Associated metadata for ProteomeVis were downloaded from: https://github.com/rrazban/proteomevis/ blob/master/make_database/proteomevis_inspect.csv.
All statistical analyses involve two-sided tests for Spearman correlation coefficients that are significantly different from zero, using the cor.test function in the R statistical programming language, version 4.0 (R Core Team 2020). Unless stated otherwise, all correlations include genes with no mutations (i.e., zeros are included).

Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.