-
PDF
- Split View
-
Views
-
Cite
Cite
Sergey V Venev, Konstantin B Zeldovich, Thermophilic Adaptation in Prokaryotes Is Constrained by Metabolic Costs of Proteostasis, Molecular Biology and Evolution, Volume 35, Issue 1, January 2018, Pages 211–224, https://doi.org/10.1093/molbev/msx282
- Share Icon Share
Abstract
Prokaryotes evolved to thrive in an extremely diverse set of habitats, and their proteomes bear signatures of environmental conditions. Although correlations between amino acid usage and environmental temperature are well-documented, understanding of the mechanisms of thermal adaptation remains incomplete. Here, we couple the energetic costs of protein folding and protein homeostasis to build a microscopic model explaining both the overall amino acid composition and its temperature trends. Low biosynthesis costs lead to low diversity of physical interactions between amino acid residues, which in turn makes proteins less stable and drives up chaperone activity to maintain appropriate levels of folded, functional proteins. Assuming that the cost of chaperone activity is proportional to the fraction of unfolded client proteins, we simulated thermal adaptation of model proteins subject to minimization of the total cost of amino acid synthesis and chaperone activity. For the first time, we predicted both the proteome-average amino acid abundances and their temperature trends simultaneously, and found strong correlations between model predictions and 402 genomes of bacteria and archaea. The energetic constraint on protein evolution is more apparent in highly expressed proteins, selected by codon adaptation index. We found that in bacteria, highly expressed proteins are similar in composition to thermophilic ones, whereas in archaea no correlation between predicted expression level and thermostability was observed. At the same time, thermal adaptations of highly expressed proteins in bacteria and archaea are nearly identical, suggesting that universal energetic constraints prevail over the phylogenetic differences between these domains of life.
Introduction
Over the 4 billion years of evolution, life has colonized an extreme diversity of physical environments on Earth, ranging from volcanic hot vents in the oceans to permafrost to hypersaline lakes. Adaptations to these conditions allow proteins and nucleic acids to function in a wide range of physical and chemical environments, resulting in specific patterns of nucleotide and amino acid usage (Galtier and Lobry 1997; Kreil and Ouzounis 2001; Zeldovich et al. 2007b; Berezovsky et al. 2007; England et al. 2003; Fukuchi et al. 2003; Sghaier et al. 2013; Sabath et al. 2013). Although the variation of amino acid frequencies across species is relatively constrained for a given genomic composition (Krick et al. 2014; Goncearenco and Berezovsky 2014), amino acid compositions of prokaryotic proteomes are sensitive to the temperature and salinity of their natural environments (Fukuchi et al. 2003; Kreil and Ouzounis 2001). Unraveling the evolutionary origins of amino acids usage in proteomes involves two main questions: first, what are the origins of the generally similar average amino acid usage across multiple highly divergent species, and second, what biological mechanisms drive adaptation of amino acid frequencies to environmental conditions.
Correlations between nucleotide and amino acid frequencies have been revealed simultaneously with the discovery of the genetic code (Sueoka 1961; Jukes et al. 1975; King and Jukes 1969). Subsequent genome-wide studies found that genomic composition strongly affects the patterns of amino acid and codon usage at the organismal level (Kreil and Ouzounis 2001; Knight et al. 2001; Lightfield et al. 2011; Goncearenco and Berezovsky 2014). At the same time, mutational patterns cannot fully explain the genome composition (Rocha et al. 2010). Closely related species adapted to different environments demonstrate variation in amino acid usage unaccounted for by their similar genomic compositions (Singer and Hickey 2003; McDonald 2010; Haney et al. 1999; Fukuchi et al. 2003). Therefore, selection at the level of nucleotide frequencies does not fully explain the variation of amino acid composition, and multiple mechanisms of protein-level selection have been proposed. In unicellular organisms, highly abundant proteins have a biased amino acid composition to decrease the metabolic cost of amino acid biosynthesis (Akashi and Gojobori 2002; Heizer et al. 2006; Seligmann 2003; Heizer et al. 2011). In this class of models, the best explanation of observed amino acid compositions is achieved in a phenomenological approach by Krick et al. (2014), who took into account the metabolic cost of amino acids synthesis and the rates of their chemical degradation.
Thermal adaptation is a particularly well-studied example of environmental adaptation that does not reduce to changing nucleotide frequencies. Maintenance of the pool of functional, properly folded proteins at elevated temperatures imposes constraints on protein structures (Szilagyi and Zavodszky 2000; England et al. 2003), as well as amino acid compositions (Zeldovich et al. 2007b; Singer and Hickey 2003; Kreil and Ouzounis 2001; Haney et al. 1999). The temperature span of life reaches almost 120 K (from −10° to 110°), a change in energy of 0.24 kcal/mol. As this value is comparable to the average effect of a single amino acid substitution in a folded protein kcal/mol (Zeldovich et al. 2007c) and the typical energy of inter-residue van der Waals contacts, thermophilic proteins evolved sequence and structure features to increase their stability. Thermally adapted proteins utilize positive and negative design strategies, stabilizing their native folds and destabilizing unfolded conformations (Berezovsky et al. 2007). Increased fraction of hydrophobic residues contributes to protein core stability, whereas increased fraction of the charged residues enforces native fold uniqueness by destabilizing unfolded conformations. Destabilization of nonnative states can be achieved by an increased fraction of charged residues on protein surface and formation of ionic pairs (Szilagyi and Zavodszky 2000; Zhao et al. 2011). It is known that whereas salt bridge typically stabilizes the protein, longer range ion pairs are often destabilizing (Kumar and Nussinov 2002). Microscopic models of electrostatic effects in protein stability have been extensively developed (Loladze et al. 1999; Loladze and Makhatadze 2008; Strickler et al. 2006; Karshikoff et al. 2015; Sawle and Ghosh 2015), leading to in vitro validation by redesign of electrostatic interactions in ubiquitin and several other proteins. Overall, biophysical models provide a solid understanding of atomistic-level interactions in specific proteins, and, statistically, explain well the global temperature trends of amino acid frequencies in prokaryotes (Berezovsky et al. 2007; Venev and Zeldovich 2015).
Unfortunately, even state of the art models can only explain either the overall proteomic amino acid composition (Seligmann 2003; Heizer et al. 2011; Krick et al. 2014), or its temperature trends (Berezovsky et al. 2007; Venev and Zeldovich 2015), but not both. Here, we couple protein folding and protein homeostasis costs to bridge this gap and build a microscopic model explaining both amino acid composition and its temperature trends. As it is known, chaperone-assisted folding mechanisms evolved to repair misfolded proteins (Hartl et al. 2011), and even a moderate decrease in protein foldability imposes an organismal fitness cost (Drummond and Wilke 2008; Geiler-Samerotte et al. 2011). Chaperones require energy to function, which in turn creates an additional selective pressure on protein foldability, especially in the case of highly abundant proteins (Kepp et al. 2014). As proteostasis consumes up to 80% of total metabolic rate of unicellular free-living organisms (Kepp et al. 2014), adaptation towards energy efficiency is a significant driver of evolution. In fact, while the present work was under review, the Dill group published a kinetic model of proteostasis in Escherichia coli, showing that dynamic sorting of client proteins between chaperone systems is energy efficient for the cell. Specifically, it was found that the “sickest” proteins (ones with a stable misfolded state kinetically accessible from unfolded state) use the most energy-intensive GroEL chaperone (Santra et al. 2017).
Here, we propose that global amino acid composition evolved under the selective pressure of the total energetic cost of proteostasis. Following earlier studies, our model includes the cost of amino acid synthesis and maintenance of their constant concentrations in the presence of chemical degradation (fig. 1). The key new feature, however, comes from considering the energy cost of chaperone assisted protein folding. Protein stability against thermal unfolding depends on the amino acid composition, and amino acid compositions delivering highly foldable proteins require lower energy expenditures on repairing misfolded proteins by chaperones. As detailed below, minimization of the total energy spent on amino acid synthesis and maintenance of folded proteins by chaperones provides an accurate description of both average amino acid frequencies, and their trends with environmental temperature.

Material and energy flux in proteostasis. Amino acid biosynthesis, translation and polypeptide synthesis, and chaperone assisted protein folding consume a significant fraction of energy available to a prokaryote. Maintenance of steady state concentrations of every amino acid bears a known energy cost, with cheaper amino acids preferred in highly expressed proteins (Akashi and Gojobori 2002). We propose that energy cost of chaperone activity depends on amino acid composition of client proteins, as protein foldability is affected by amino acid composition (Dill 1985; Berezovsky et al. 2007; Venev and Zeldovich 2015). Therefore, amino acid composition evolves under the energetic constraint from two distinct processes, amino acid biosynthesis costs and chaperone activity.
Results
Thermal Adaptation in Highly Expressed Proteins Is Similar in Bacteria and Archaea
Although archaea and bacteria have diverged early on during evolution, today they share many of the same environments, with both domains spanning wide temperature ranges. Thermal adaptations in the two domains provide a unique test case for comparing phylogenetically divergent responses to the same physical environment. To quantify thermal adaptation, we performed linear regressions between the frequencies fa of each of the 20 amino acids in the archaeal and bacterial proteomes, and optimum growth temperature (OGT), and used the slopes of the regression as metric of adaptation (supplementary fig. S1A and B, Supplementary Material online). Amino acids with positive slopes are statistically overrepresented in thermophilic proteomes, whereas negative slopes reflect reduced usage of an amino acid in thermophiles.
The correlation between the temperature trends of amino acid frequencies in complete proteomes of bacteria and archaea is not very high, R = 0.48 (fig. 2A). Moreover, bacterial slopes are generally lower than archaeal ones. Therefore, phylogenetic divergence and ensuing biochemical differences had a profound effect on proteome-averaged amino acid usage in the two prokaryotic domains.

Convergence of the archaeal and bacterial trends of thermal adaptation. Slopes of the amino acid frequencies versus OGT regressions are compared between archaea and bacteria. (A) Proteome-wide, the temperature trends of amino acid usage in bacteria and archaea are weakly correlated. (B) Ribosomal proteins of archaea and bacteria have similar patterns of thermal adaptation. (C) Predicted highly expressed proteins (top 10% CAI) in the organisms with CUS show identical patterns of thermal adaptation between bacteria and archaea. (D) Correlation of trends of thermal adaptation in complete proteomes of organisms with CUS is statistically insignificant in the organism bootstrap test (see text for details).
We hypothesized that for highly expressed proteins, energetic constraints on thermal adaptation may prevail over phylogenetic differences, leading to a greater similarity of archaea and bacteria. Highly expressed proteins are known to evolve slowly (Pál et al. 2001; Rocha and Danchin 2004), suggesting a stronger evolutionary constraint, which is partially reflected in more stringent folding requirements (Serohijos et al. 2012; Drummond et al. 2005; Drummond and Wilke 2008).
Ribosomal proteins serve as a particularly well-defined group of highly expressed proteins in both archaea and bacteria (Karlin et al. 2005). At the same time, differences in ribosome structures and sequences between the two domains are significant. Both domains of life exhibit very similar patterns in thermal adaptation of ribosomal proteins (fig. 2B), R = 0.77 (bootstrap to find a similar correlation in the same number of randomly selected proteins yields P < 0.001 (supplementary fig. S2, Supplementary Material online). However, the specific function of ribosomal proteins may have limited their options for thermal adaptation. To identify other types of highly expressed proteins we used a sequence based approach using codon adaptation index (CAI; Sharp and Li 1987) in organisms with apparent codons usage selection (CUS), see Materials and Methods for details. Remarkably, for predicted highly expressed proteins from organisms with CUS, the trends in thermal adaptation are nearly identical. Figure 2C demonstrates R = 0.91, P < 0.001 for all proteins within the top 10% of CAI; excluding ribosomal proteins yields the same R = 0.912 (data not shown). The null hypothesis that the observed correlation can be explained by a random choice of a subset of organisms or proteins is safely rejected (P = 0.015, randomized CUS assignment, P < 0.001, randomized CAI ranking (supplementary fig. S3, Supplementary Material online). A greater similarity between thermal adaptations in highly expressed proteins from archaea and bacteria compared with the whole-proteome case comes mostly from the differences in usage of isoleucine, alanine, lysine, and glutamic acid (A, I, K, E), as shown in figure 2A and C and supplementary figure S1C and D, Supplementary Material online.
At the same time, CUS alone does not imply similarity in thermal adaptation for bacteria and archaea. Trends in thermal adaptation in complete proteomes of bacteria and archaea with CUS (fig. 2D), R = 0.73, appeared more similar than for complete proteomes of all species (fig. 2A). However, correlation of could be achieved with probability 0.153 in a randomized selection of the same number of bacteria and archaea from the full data set (supplementary fig. S4, Supplementary Material online). Therefore, increased proteome-wide similarity of thermal adaptation between bacteria and archaea with CUS is not statistically significant.
This genomewide analysis clearly shows that highly expressed proteins in both archaea and bacteria share a common strategy of thermal adaptation, which becomes obscured at the level of complete proteomes. We propose that the common strategy may involve optimization of energetic costs of proteostasis by balancing amino acid metabolism and chaperone energy expenses, and present the results of the modeling below.
Simulated Amino Acid Frequencies Respond to Chaperone Energy Costs
We designed lattice model proteins of 64 residues each in a wide range of artificial temperatures in units of Miyazawa–Jernigan residue level potential (p.u.; Miyazawa and Jernigan 1999), following equations (8–10) (see Materials and Methods for details). The chaperone-adjusted synthesis cost w (see eq. 7), was varied from w = 0 to w = 0.15. The case of w = 0 corresponds to zero cost of maintaining the amino acid pool, so the energetic cost is completely defined by protein foldability. On the contrary, for large values of w, the energetic costs of amino acid maintenance prevail over energy expenditures by chaperones, reducing the effect of protein foldability on fitness. Proteins designed with no synthesis cost constraint, w = 0, mostly reproduced earlier results (Berezovsky et al. 2007; Venev and Zeldovich 2015). At low simulated temperatures, the folding constraint on protein sequences was weak. Starting from a random sequence with amino acid abundances, the design procedure was able to create well-folding sequences by swapping the residues while retaining the overall amino acid composition (supplementary fig. S5A, Supplementary Material online). As the temperature increased, relative amino acid abundances changed monotonically to allow designed proteins to increase their thermal stability (supplementary fig. S5A, inset, Supplementary Material online). Increasing frequencies of hydrophobic and charged residues extend the energy gap by decreasing the energy of the native state and raising the average decoy energy (Berezovsky et al. 2007).
The outcome of protein design changed significantly if the chaperone-adjusted synthesis cost w was introduced (supplementary fig. S5B, Supplementary Material online). In this case, frequent usage of “expensive” amino acids carried a significant penalty even if they were favorable for protein foldability. At w = 0.05, proteome-averaged frequencies of amino acids already diverged at low temperatures T, and the distribution of amino acid frequencies was largely determined by their relative metabolic maintenance costs, due to the smaller selective pressure on foldability. We then hypothesized that this interplay between the costs of raw materials (amino acid pool) and maintenance of product (chaperone-assisted folding) can explain both the average amino acid composition and its temperature trends.
Simulated Trends Correlate with Biological Data
The amino acid frequencies produced by our model are controlled by two parameters, temperature T and the chaperone-adjusted synthesis cost w. To assess the fit of the model to observed frequencies for given values of w and T, we used the Jensen–Shannon divergence (JSD) between the frequency distributions of the 20 types of amino acid in the simulated and biological data. The JSD between two probability (frequency) distributions is zero if the distributions are identical, and equals one if the two distributions are completely unrelated. Using the Pearson correlation coefficient between amino acid frequencies as the measure of similarity of the distributions produced qualitatively similar results (data not shown). Prokaryotic genomes were separated into mesophilic (20 OGT 50°) and thermophilic (OGT > 50°) groups and average amino acid frequencies from these groups were used for comparison with the simulated data. This analysis has been performed separately for bacteria and archaea. In bacteria, the JSD between simulated and real data reaches minima at specific values of T and w (fig. 3). Our model correctly segregated thermophilic and mesophilic genomes. The value of JSDM reached its minimum at p.u. whereas JSDT reached the minimum at a higher temperature (fig. 3A and B). Both JSDM and JSDT reached their minima at the same value of chaperone-adjusted synthesis cost , so the energetic balance between chaperone activity and costs of amino acid maintenance appeared similar between thermophiles and mesophiles. Remarkably, the value of was the same for archaea and bacteria; this finding was robust upon changes of proteostasis cost cutoff (supplementary table S1, Supplementary Material online).

Simulated frequencies of amino acids compared with the naturally evolved ones for bacteria. (A, B) Jensen–Shannon divergence between amino acid frequencies in simulated data and thermophilic (A) and mesophilic (B) proteomes exhibits clear minima with respect to the temperature T and shaperone-adjusted synthesis costs w. The best match between the model and mesophilic proteomes is achieved at a lower temperature than the best match to the thermophilic ones. (C) Pearson correlation coefficient RA between amino acid frequencies in simulated data and all bacterial proteomes reaches for . (D) In the same temperature range, the temperature trends of amino acid in simulated data are strongly correlated with those in bacterial proteomes, R = 0.64.
The temperature range and the cost successfully describe the complete data set of both mesophiles and thermophiles in terms of amino acid composition and its temperature trends. To simplify comparison with earlier works, figure 3C presents the Pearson correlation coefficient between average amino acid frequencies in 262 bacteria (mesophiles and thermophiles combined), and simulated data. The very high correlation, is similar to the predictions of the current-best phenomenological model (Krick et al. 2014). Figure 3D shows the correlation RD between the amino acid temperature trends (slopes ) in the model and biological data, , in the same 262 bacterial species. Difference quotients of simulated amino acid frequencies in the range were used to calculate the simulated temperature trends. For the real frequencies of amino acids, the slopes were derived from the linear regression over the entire OGT range (supplementary fig. S1A–D, Supplementary Material online). Similar to the values of RA, the value of RD exhibits a clear maximum with respect to both w and T, reaching , similar to earlier findings (Venev and Zeldovich 2015).
Interestingly, the relative temperature range in the model, compared well with the actual temperature range of prokaryotes, thriving between ∼280 and 370 K, a 30% change in absolute temperature. Comparison between simulated data and amino acid frequencies in archaea is presented in supplementary figure S6, Supplementary Material online, and shows similar values of the optimum temperature range and chaperone-adjusted synthesis cost w. Qualitatively similar results are obtained by comparing the simulation with amino acid frequencies derived from predicted highly expressed proteins (top 10% of CAI for organisms with CUS; supplementary figs. S7 and S8, Supplementary Material online).
To prove that these results are not a numerical artifact, we have shuffled the values of amino acid maintenance cost Ca and repeated the simulations. The reshuffling breaks the connection between the biochemistry of amino acids, reflected in their costs Ca, and their physical properties, such as interaction energies ϵab. As shown in supplementary figure S9, Supplementary Material online, the values of JSDA and RD obtained from reshuffled Ca are almost always lower than those for the initial true Ca, yielding . Therefore, we demonstrated that the interplay between amino acid synthesis costs and protein folding is internally consistent, and our model produces correct amino acid frequencies only when the connection between the organism-level biochemical and physical properties of amino acids is preserved.
The predicted temperature trends of amino acid frequencies had a statistically significant correlation with biological data for bacteria, R = 0.64 (supplementary fig. S10B, Supplementary Material online). A much weaker correlation was observed for archaea, R = 0.35 (supplementary fig. S10A, Supplementary Material online). We have then considered only highly expressed proteins (top 10% of CAI for organisms with CUS) from either domain, expecting that the selective pressure of proteostasis is stronger for this group of proteins. In bacteria, consideration of highly expressed proteins did not significantly affect the agreement between simulated and biological data, R = 0.55 (supplementary fig. S10D, Supplementary Material online). However, temperature trends in highly expressed proteins in archaea were strongly correlated with our simulation, R = 0.61 (supplementary fig. S10C, Supplementary Material online). Therefore, in archaea, highly expressed proteins experience a selective pressure that is well-described by the model, while the complete archaeal proteomes apparently evolved under a variety of constraints yet to be identified.
Results of the simulations were generally robust with the respect to changes of the model parameters, such as proteostasis cost cutoff , equation (10) as shown in supplementary table S1, Supplementary Material online. The correlation coefficient RD between the simulated and biological temperature derivatives of amino acid frequencies was not sensitive to . However, at low values of , the selection was not strong enough, leading to the optimum parameter w being different for thermophiles and mesophiles in certain cases. This artifact vanished at , the value ultimately chosen for production simulations.
Consistent with previous findings (Venev and Zeldovich 2015), the temperature trend of leucine frequency was not captured well by the model. Leucine is a very hydrophobic residue, as reflected by the Miyazawa and Jernigan interaction potential. Accordingly, the frequency of leucine rapidly increased with temperature in simulated proteomes, as leucine participates in hydrophobic interactions in the protein core. As it is known (Goncearenco et al. 2014), the frequency of leucine does not increase with temperature in bacteria, although it does so in archaea (supplementary fig. S1A and B, Supplementary Material online). Combined with the fact that leucine is relatively simple to synthesize, and is coded by six different codons, these observations clearly point to the biochemical differences between archaea and bacteria, and to the limitations of current biophysical models. Our choice of the Miyazawa and Jernigan amino acid interaction potential, which overemphasizes attractive forces, may be one of the factors contributing to the leucine being an outlier. Alternative derivations of the knowledge based potential, such as (Thomas and Dill 1996), may alleviate this issue. Aspartic acid is an another known outlier (Goncearenco et al. 2014). This charged amino acid is predicted to increase in frequency as the temperature rises, just as glutamic acid, lysine, and arginine. However, while glutamic acid and lysine consistently increase in frequency in both bacteria and archaea, aspartic acid is surprisingly depleted in natural thermophilic proteomes.
Amino Acid Synthesis Costs Increase with Environmental Temperature
Metabolic costs of amino acid synthesis are negatively correlated with protein expression levels across the three domains of life (Akashi and Gojobori 2002; Swire 2007). In figure 4A and C, we plotted the proteome-averaged Akashi–Gojobori amino acid synthesis cost against environmental temperature for 140 archaea and 262 bacteria, assuming equal expression levels of all proteins. We found a statistically significant positive correlation, confirming that thermal stability requires heavier usage of synthetically “expensive” proteins, in agreement with an earlier observation made on Thermus thermophilus genome (Swire 2007). In contrast with the amino acid synthesis cost, the amino acid maintenance cost, which combines synthesis and decay (Krick et al. 2014), is not significantly correlated with the environmental temperature (OGT; fig. 4B and D). These observations are generally consistent with our simulations. Supplementary figure S11A–H, Supplementary Material online, demonstrates that for the optimum temperature range and chaperone-adjusted synthesis cost , amino acid synthesis costs increased with temperature, whereas amino acid maintenance costs weakly decreased, similar to figure 4.

Temperature trends of the amino acid synthesis and maintenance costs in prokaryotes. Proteome average cost of amino acid synthesis and amino acid maintenance for archaeal species (A, B) and bacterial species (C, D). Marker color represents the genome-wide GC content of each specie; as it is well-established, genome-wide GC content is not correlated with OGT, see also supplementary figure S1A and B, Supplementary Material online.
Highly Expressed Proteins Are Similar to Thermophilic Ones in Bacteria, Not in Archaea
In full agreement with earlier findings by Akashi and Gojobori (2002) and Swire (2007), our analysis found a statistically significant negative correlation between the amino acids synthesis costs and CAI (proxy for expression) in sets of 167 bacterial and 65 archaeal genomes (supplementary fig. S12, Supplementary Material online). At the same time, it has been proposed that amino acid composition of highly expressed proteins is similar to the composition of thermophilic proteins (Cherry 2010). As noticed earlier (Serohijos et al. 2012), this is somewhat contradictory to the findings of Akashi et al and Swire, as thermophilic proteins tend to be more synthetically “expensive”.
To look for the origins of this controversy, we compared the protein expression levels, approximated by CAI, with their thermostability in bacteria and archaea. To assess protein thermostability of a group of proteins, we used the between their average amino acid composition and the average amino acid composition of 92 thermophilic archaea or 66 thermophilic bacteria (OGT > 50°). For each organism, proteins were split into twenty bins according to their CAI, and average amino acid usage of each bin was compared with the thermophilic composition using . For archaea, we did not observe a significant correlation between CAI and JSDT (fig. 5A). However, bacteria exhibited a statistically significant negative correlation between JSDT and CAI bin (fig. 5B): bacterial proteins appeared more similar to thermophilic ones (lower values of JSD) as their CAI increased. As a control, we have reshuffled synonymous codons within each genome, destroying the codon bias and thus CAI metric of each protein, but leaving amino acid composition intact. No correlation was observed in reshuffled data for bacteria (fig. 5B). To rule out the possible effects of binning on the observed correlations, we repeated the analysis using 5 and 50 bins of CAI values, and observed the same trends (data not shown). These results partially support Cherry’s findings and demonstrate the immense flexibility of the 20-dimensional space of protein sequence composition to satisfy multiple physical and phylogenetic constraints.

Similarity between proteins with high CAI and thermophilic proteins by comparing amino acid composition. Proteins in each organism are grouped into 20 bins according to CAI, and amino acid composition within each groups is compared with the average thermophilic composition using Jensen–Shannon divergence, separately for archaea (A), and bacteria (B). In bacteria, the higher is protein expression, the more similar is amino acid composition to a thermophilic one (decreasing JSDT, red line, statistically significant). No such trend was observed for archaea. As a control, codon reshuffling was used to destroy the relation between CAI and amino acid composition of proteins. For both archaea and bacteria, the correlation between JSDT and CAI for reshuffled codons was not significant. Error bars represent the 30% and 70% percentiles of the underlying distributions.
We have also tried to approximate protein thermostability by the fraction of IVYWREL amino acids, which is strongly correlated with OGT at the proteomic level (Zeldovich et al. 2007b). Although a strong negative correlation between IVYWREL and CAI was observed, the same trend persisted upon synonymous codon reshuffling, in both bacteria and archaea (supplementary fig. S13, Supplementary Material online). Therefore, an intrinsic connection between the amino acid and nucleotide frequencies via the genetic code prevents the use of IVYWREL metric to compare protein expression (CAI) and thermostability.
To check if our simulation can capture an increased similarity between highly expressed and thermophilic proteins, we have divided the model proteins into three bins of low, medium, and high abundance, positing that the metabolic cost is proportional to the total amount of protein in each group, equation (11) (see Materials and Methods for details). We have run the simulation for , corresponding to the best global fit of model with experimental data, and found that highly abundant model proteins were indeed consistently closer in amino acid composition to natural thermophilic bacteria, compared with low-abundant model protein (supplementary fig. S14, Supplementary Material online). Although this finding is encouraging, the effect is relatively small, for low abundance versus for high abundance. Further study of this phenomenon warrants development of a more detailed cellular fitness model, as in our approach the contribution of proteins to fitness was exclusively defined by stability, metabolic cost and abundance, without considering essentiality or connectivity in the cell’s metabolic network.
Discussion
Statistically significant correlations between environmental conditions and amino acid usage are well-established, dating back at least to 1982, when amino acid usage was quantitatively linked to environmental temperature by Ponnuswamy et al. (1982). An early physical model of amino acid composition has been proposed by Dill (1985), who derived the ratio of hydrophobic to polar residues conferring highest stability to a globular protein. The interest in the statistical understanding of thermal adaptation increased as microscopic simulations of protein evolution became possible (Taverna and Goldstein 2002; Bloom et al. 2006; Goldstein 2008). Modeling of lattice proteins (Berezovsky et al. 2007; Venev and Zeldovich 2015) showed that although the temperature trends in amino acid frequencies can be explained by purely physical models, the frequencies themselves are weakly correlated with genomic data. This discrepancy suggests that either the physical models are still not precise enough to resolve individual amino acids beyond their rough classification by hydrophobicity, or other factors contribute significantly to amino acid usage.
Complementary to protein folding constraints, metabolic costs and overall energy balance of a cell have been long identified as powerful evolutionary drivers (Pál et al. 2006), as exemplified, for example, by the success of quantitative flux based metabolic models (Varma and Palsson 1994; Price et al. 2004). Akashi and Gojobori (2002) estimated the energy expended on the synthesis of each of the 20 types of amino acid molecules, and found that highly expressed proteins are enriched in “cheap”, easily synthesized amino acids. These findings highlighted the importance of proteostasis as the major cellular process, coupling energy and material fluxes in a cell. The flux models were further advanced by an estimate of the amino acid decay rates within a cell (Krick et al. 2014). By combining the amino acid synthesis cost, decay rate, and sequence entropy into an empiric cost function, Krick et al. (2014) made successful predictions of amino acid frequencies. However, this model did not explicitly address protein folding or other physical considerations, and so is difficult to extend to the study of thermal adaptation.
To bridge this gap, we proposed that proteostasis is not limited to the chemical turnover of amino acid molecules, but, crucially, maintains appropriate levels of functional, correctly folded proteins. Molecular chaperones are an integral part of this process, attempting to refold proteins in an ATP-dependent manner. Invoking quality control systems in response to misfolded proteins causes a fitness penalty proportional to the fraction of misfolded proteins, their expression level and is largely function-independent (Geiler-Samerotte et al. 2011). Moreover, further experiments suggested that it is indeed the metabolic cost of chaperone activity that imposes the fitness penalty, rather than the consequences, for example, toxicity, of the presence of abundant misfolded proteins (Tomala et al. 2014). Chaperone function provides a feedback to the genotype, by accelerating its evolution while serving as a capacitor for otherwise deleterious phenotypic mutations (Bogumil and Dagan 2012; Çetinbaş et al. 2013).
Following Kepp et al. (2014) and in parallel with (Santra et al. 2017), we hypothesized that the energy consumed by chaperones is nonnegligible and must be taken into account together with other metabolic costs. Specifically, we assumed that the total energy cost of proteostasis includes contributions from both amino acid turnover and chaperone activity. The key feature of the model is the statistical dependence between foldability of a protein and its amino acid composition (Dill 1985; Berezovsky et al. 2007; Venev and Zeldovich 2015). Indeed, well-folded proteins typically contain a balanced mix of charged and hydrophobic residues, while intrinsically unfolded proteins do not (Uversky et al. 2000). Statistically, proteins with an unbalanced amino acid composition are less stable and so may require more frequent chaperone intervention. Therefore, we posited that amino acid compositions have evolved to minimize the total energy spent on amino acid homeostasis and chaperone activity, and tested this hypothesis by simulations.
By incorporating protein folding and metabolic cost in a single model, we were able to capture average amino acid composition and its temperature trends simultaneously (fig. 3), significantly improving upon purely physical models (Berezovsky et al. 2007; Venev and Zeldovich 2015). These models are captured in our study as a limiting case w = 0. As demonstrated in figure 3, the predictive power of the model dramatically increases by considering a balance between protein folding requirement and the metabolic cost constraints, . It has been long posited that stringent protein folding requirements are associated with increased fitness (Taverna and Goldstein 2002; Bloom et al. 2006; Zeldovich et al. 2007a; Lobkovsky et al. 2010). More or less explicitly, these works assumed that fitness is related to metabolic rates which in turn depend on the amount of folded, functional enzymes to catalyze chemical reactions. This assumption led to various models where fitness was positively correlated with stability, as in equation (8). In the present model, however, the positive contribution of stability to fitness does not stem from the metabolic rate argument. Rather, it emerges from the energy balance of the cell: more stable proteins require less energy expense in the chaperone system, leaving more ATP and other resources for replication. Therefore, our model suggests a novel and independent biological mechanism leading to the fitness being an increased function of protein stability. Further experimental studies will be needed to elucidate the relative contributions of metabolic versus energetic correlates of protein stability to organism fitness.
While comparing the model findings with experimental data, we have made several novel statistical observations. First, we showed that the trends of thermophilic adaptation of highly expressed proteins are very similar in archaea and bacteria, while no strong correlation is observed at the whole-proteome level (fig. 2). We interpret this finding as manifestation of convergent response to a selective pressure acting on highly expressed proteins irrespective of their phylogenetic history, and suggest the energetics of proteostasis as mechanistic explanation. Second, we clearly demonstrate that proteome-wide amino acid synthesis cost, according to Akashi–Gojobori scale, increases with OGT in both archaea and bacteria (fig. 4). This observation supports our hypothesis of synthetically expensive amino acids being crucial for protein stability and thermal adaptation. At the same time, it is well-established that highly expressed proteins are “cheap” to synthesize (Akashi and Gojobori 2002; Seligmann 2003; Heizer et al. 2006; Raiford et al. 2008 and supplementary fig. S12, Supplementary Material online). Therefore, we find that thermophilic proteins are expensive to synthesize but highly expressed ones are biosynthetically cheap. At the same time, it has been suggested that amino acid composition of highly expressed proteins is similar to that of thermophilic proteins (Cherry 2010), which creates a logical inconsistency. We attempted to address this issue by estimating the expression levels using CAI and correlating it with various composition-based predictors of thermostability in a large set of bacterial and archaeal proteomes. In the bacterial data set, we observed that highly expressed proteins had amino acid compositions more similar to the average composition of thermophilic proteomes. This finding parallels earlier results by Cherry (2010). However, no significant correlation was found in archaea (fig. 5). Apparent inconsistencies in the empirically observed cost-expression-stability triangle require further study, and suggest a surprising flexibility of amino acid usage evolving to satisfy different constraints. Observed statistical differences between archaea and bacteria in the cost-expression-stability space may complicate comparisons of evolutionary simulations (Drummond et al. 2005; Serohijos et al. 2012) with experimental data. Further development of high-throughput experimental methods for characterizing protein expression levels and thermostability, such as limited proteolysis and mass spectrometry, LiP-MS (Leuenberger et al. 2017) will make it possible to transition away from sequence-based predictors, and will stimulate the next generation of predictive, organism-level models of metabolism and selection.
Materials and Methods
Model of Protein Homeostasis and Simulation of Adapted Proteomes
Our model for protein homeostasis costs closely follows Kepp et al. (2014), together with the hypothesis that a specific amino acid composition generates an additional, folding-related energy demand due to chaperone activity required to refold misfolded or unstable proteins. As in Kepp et al. (2014), we assume that the cellular fitness increases with the amount of energy spent on replication. As the total energy supply is limited, reduction of the energetic costs of proteostasis confers fitness advantage on the cell.
The protein design simulation starts with random sequences. At every step, one amino acid mutation is introduced in each sequence, and a change of its score Π is calculated. Mutations that increase Π are always accepted, whereas mutations that decrease Π are accepted according to Metropolis Monte Carlo (MC) scheme with probability , with the design “temperature” . The stability condition (9) is then checked, and if all sequences satisfy it, the proteostasis cost criterion (10) is evaluated. The simulation stops if the criterion (10) is met or the number of iterations exceeds 103.
This expression reduces to equation (10) if abundances ai are all equal to each other. Furthermore, the MC criterion for accepting mutations becomes , so proteins of low abundance are less constrained in their sequences as long as the stability criterion (9) is still satisfied. This feature of the model is supported by our analysis of real data, where temperature trends in bacteria and archaea were similar for highly abundant proteins but less so for complete proteomes. The effective decrease of MC temperature for highly abundant proteins also provides for their slower evolution, established, for example, by Drummond et al. (2005). The distribution of protein stabilities Pnat of all sequences generated in the simulation is shown in supplementary figure S15, Supplementary Material online. As expected, low-abundance proteins have a lower average stability and a wider distribution of stabilities.
Therefore, for each combination of the two input parameters, environmental temperature T and chaperone-adjusted synthesis cost w, we were able to generate simulated proteomes of 105 sequences of length 64 each, optimizing the proteostasis costs (8) or (11) subject to stability condition (9). The GPU-accelerated lattice protein folding library GaleProt (Venev and Zeldovich 2015) was used for massively parallel evaluation of Pnat. Frequencies of amino acids found in the simulated proteomes were used for comparison with genomics data.
Data Sets
We used RefSeq and BioProject databases at NCBI to retrieve 543 completely sequenced, annotated, single-chromosome bacterial genomes with known OGT or a specified environmental temperature. A Python script (Cock et al. 2009) was used to retrieve OGT data from NCBI Entrez. If only a temperature range was specified, the average temperature was used as OGT. Following (Goncearenco et al. 2014), we removed 281 overrepresented species with the values of OGT of 27.5°, 30°, 37° as they represent plant and animal pathogens and experience diverse selective pressures unrelated to environmental temperature. The bacterial data set covers the OGT range of 15–90° and genome-wide GC content (GC) of 30–70% (supplementary fig. S16B, Supplementary Material online). As archaea are much less represented in the BioProject database, we performed a manual literature search for OGT of 617 species of archaea available in GenBank. The search yielded 223 species with known OGT and sufficient annotation (whole genome shotgun assemblies were included if at least 600 protein coding sequences were annotated). Genomes of 83 halophiles have been excluded from analysis, as they experience a strong evolutionary pressure of hypersaline environment (Fukuchi et al. 2003), and appear as outliers on the overall monotonous OGT trends of amino acid usage. The scatter plots in genomic GC-OGT coordinates for archaea (supplementary fig. S16A, Supplementary Material online) reveal a relatively homogeneous coverage, with the GC range 30–70% and OGT 25–110° with a lower coverage at ∼60° OGT, which may be attributed to the lack of corresponding environments. The analyzed data set comprised 262 bacteria and 140 archaea with sufficient annotation and known OGT, accessions numbers and OGT are listed in supplementary tables S2 and S3, Supplementary Material online. Analysis scripts and protocols are available at http://github.com/sergpolly/Thermal_adapt_scripts, last accessed November 1, 2017.
Identification of Highly Abundant Proteins
Protein abundance and expression level are important factors to consider when calculating energetic costs. Unfortunately, for most of prokaryotes with completely sequenced genomes neither protein abundance nor expression have been directly characterized, for example, there are only two archaeal entries in the major protein abundance database, PaxDB (Wang et al. 2015). We used a sequence based approach to identify putatively highly expressed proteins using CAI (Sharp and Li 1987). Ribosomal proteins were used as a reference of highly expressed proteins (Pedersen et al. 1978; Srivastava and Schlessinger 1990) to establish the codon usage pattern. We selected all species with at least 25 annotated ribosomal proteins, and used CAI as a proxy for expression and abundance level.
Previously, it has been shown that CAI has its limitations as a predictor of gene expression (Botzman and Margalit 2011), as in some species the CAI distribution is very narrow and codon usage of ribosomal protein genes is nearly indistinguishable from other genes (supplementary fig. S17B, Supplementary Material online). To address this issue, we selected a group of genomes where at least 85% of ribosomal protein genes are within the 25% of all genes with the highest CAI rank. This empirical criterion selects genomes with wide distributions of CAI and a marked difference in codon usage between ribosomal and other proteins (supplementary fig. S17A, Supplementary Material online), which in turn implies strong codon usage selection (CUS). We assume that in organisms with CUS, CAI can be used as a proxy for gene expression and, statistically, abundance (Sharp and Li 1987; Jansen et al. 2003; Supek and Vlahovicek 2005; Maier et al. 2009). CUS was identified in ∼50% of species used in this study, 167 bacteria out of 262 and 65 archaea out of 140. Our CUS criterion is compatible with the criteria proposed in Botzman and Margalit (2011) (supplementary fig. S18, Supplementary Material online), and preserves a relatively uniform GC-OGT distribution of species (supplementary fig. S16, Supplementary Material online). For CUS organisms, highly expressed genes (abundant proteins) were defined as the genes within the top 10% of CAI values. Thus we avoided using CAI-ranking for individual genes, which in turn mitigates the problem of poor expression versus abundance correlation (Maier et al. 2009).
Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.
Acknowledgments
The authors gratefully acknowledge the help of Alexey Shaytan and NCBI helpdesk for assistance with NCBI databases. This work was supported in part by the Defense Advanced Research Project Agency (DARPA), Prophecy Program (Contracts Nos. HR0011-11-C-0095 and D13AP00041) and by NIH P01 GM109767.
References
Author notes
Associate editor: Banu Ozkan