-
PDF
- Split View
-
Views
-
Cite
Cite
Sushant Kumar, Declan Clarke, Mark Gerstein, Localized structural frustration for evaluating the impact of sequence variants, Nucleic Acids Research, Volume 44, Issue 21, December 2016, gkw927, https://doi.org/10.1093/nar/gkw927
- Share Icon Share
Abstract
Population-scale sequencing is increasingly uncovering large numbers of rare single-nucleotide variants (SNVs) in coding regions of the genome. The rarity of these variants makes it challenging to evaluate their deleteriousness with conventional phenotype–genotype associations. Protein structures provide a way of addressing this challenge. Previous efforts have focused on globally quantifying the impact of SNVs on protein stability. However, local perturbations may severely impact protein functionality without strongly disrupting global stability (e.g. in relation to catalysis or allostery). Here, we describe a workflow in which localized frustration, quantifying unfavorable local interactions, is employed as a metric to investigate such effects. Using this workflow on the Protein Databank, we find that frustration produces many immediately intuitive results: for instance, disease-related SNVs create stronger changes in localized frustration than non-disease related variants, and rare SNVs tend to disrupt local interactions to a larger extent than common variants. Less obviously, we observe that somatic SNVs associated with oncogenes and tumor suppressor genes (TSGs) induce very different changes in frustration. In particular, those associated with TSGs change the frustration more in the core than the surface (by introducing loss-of-function events), whereas those associated with oncogenes manifest the opposite pattern, creating gain-of-function events.
INTRODUCTION
The advent of next-generation sequencing technologies has led to a remarkable increase in genomic variation data at both the exome as well as the whole-genome levels (1,2). These large data sets are playing a pivotal role in advancing efforts toward personalized medicine (3). Non-synonymous coding single nucleotide variants (termed SNVs throughout this study) are of particular interest because of their implications in the context of human health and disease (4–6). As such, considerable effort has been invested in curating disease-related SNVs into various databases, including the Human Gene Mutation Database (HGMD) (5), ClinVar (6) and the Online Database of Mendelian Inheritance in Man (4). Concurrently, initiatives such as The 1000 Genomes Project (7,8), Exome Sequencing Project (9) and Exome Aggregation Consortium (ExAC) (10) have generated large catalogues of SNVs within individuals of diverse phenotypes in general.
As the costs associated with sequencing human genomes and exomes fall, sequencing will become routine in both medical and academic settings (11). Indeed, it may take less than a decade to reach the milestone of a million sequenced genomes (12), resulting in massive data sets of rare SNVs. This exponential growth in the number of newly discovered rare SNVs poses significant challenges in terms of variant interpretation (13). Compounding this challenge is the fact that many of these variants will be unique to single individuals. The extremely low allele frequencies of such ‘hyper-rare’ SNVs render them too rare to draw variant-phenotype associations with confidence – unlike more common variants, the very rarity of these ultra-rare genomic signatures renders phenotypic inference through association studies extremely difficult. Together, these trends underscore a growing and urgent need to evaluate the potential effects of low-allele-frequency variants in unbiased ways using high-throughput methodologies.
Though the majority of variants lie in non-coding regions of the genome, many disease-related variants are present in protein-coding genes. Furthermore, only a limited fraction of non-synonymous SNVs may be mapped to known protein structures. However, immense progress has been made in resolving the three-dimensional structures of many proteins over the last several decades (14). In addition, a large volume of high-resolution data on protein–protein, protein–ligand and protein–nucleic acid complexes is now available. This complementary evolution of sequence and structural databases provides an ideal platform to investigate the functional and structural consequences of benign and disease-related SNVs on protein structures. The integration of variant and structure knowledge bases will lead to a greater understanding of the biophysical mechanisms behind various diseases. In addition to gaining a better understanding of how disease-related SNVs impart deleterious effects, this integration can be utilized to both predict the impacts of poorly understood SNVs (i.e. SNVs that are known to be deleterious, but for which a plausible biophysical or functional rationale is missing) and to prioritize SNVs based on predicted deleteriousness (15–18). We also note that this approach may aid in more intelligent and targeted design of drugs in various therapeutic contexts.
Over the last several decades, many studies have evaluated the impacts of SNVs by examining or predicting changes in thermodynamic stability (19–21). These approaches rely on the fact that SNVs may induce substantial changes in the folding landscape and conformational ensemble. Such changes in global stability are often quantified by calculating the folding free energy change (ΔΔG) after mutating residues (21,22). Importantly, however, many disease-related SNVs introduce local structural changes without appreciably affecting folding free energy or global stability (23,24). Such local perturbations may include disruptions in residue packing or hydrogen bond networks (25,26) and salt bridges (27,28). Examples of the associated effects include disruptions to catalytic centers, changes to ‘hotspot residues’ that are responsible for interaction affinities and specificity, as well as perturbations to key allosteric sites (29–31). Changes to such residues may impart only minimal effects to the protein's overall topology, but may nevertheless drastically influence protein behavior and functionality.
We examine the role of localized perturbations by calculating changes in the localized frustration indices (32,33) of residues impacted by SNVs. Qualitatively, the frustration of a given residue quantifies the degree to which the residue is involved in favorable or unfavorable interactions with neighboring residues in space. The residue change that is introduced by an SNV may result in more (less) unfavorable interactions with neighboring residues, thereby increasing (decreasing) the frustration at that site. SNVs thereby act as agents that may relieve unfavorable interactions or alternatively impair local stability, depending on the nature of the amino acid substitution and the surrounding environment within the protein. Throughout this study, such changes in frustration are designated by ΔF.
The concept of frustration was originally introduced by Wolynes et al. to describe the protein folding landscape (32). The protein folding process is believed to follow a smooth-funneled energy landscape, in which strong energetic conflicts are avoided (34–38). However, despite minimizing configurations that exhibit frustration, local frustration is essential to protein biology and function (39–41). Highly-frustrated local interactions result in micro-states of high potential energy. Such micro-states provide proteins with the avenues needed to carry out essential functions that entail a release of energy and the concomitant shifts in occupied energetic wells. Examples of processes that require these ‘energetic bursts’ include catalysis, allosteric communication, conformational switches and proteinquakes (42), as well as protein–protein interactions (32,43,44).
Ferriero et al. proposed a framework to compute the frustration profile of a given protein (32). The localized frustration index quantifies the contribution of each residue or residue pair to the total energy of the native structure compared to their contribution in a random non-native configuration (see Materials and Methods). A native residue (residue pair) is considered to be minimally frustrated if it contains sufficient extra stabilization energy in its native state. In contrast, a sufficiently destabilizing residue (residue pair) in the protein structure is considered to be maximally frustrated (45). In addition, a residue (residue pair) is considered to be neutral when its stability profile lies between these extremes.
We take a data-driven approach to analyze ΔF profiles from SNVs in a large data set of proteins. SNVs in healthy human populations (The 1000 Genomes and ExAC projects) are highly enriched in benign SNVs. Therefore, SNVs in these data sets are termed ‘benign’ (though we qualify this term by noting that a small subset of these SNVs may actually impart as yet undetected deleterious effects). However, within these data sets, there are various degrees along the continuum of phenotypic effects. While deleterious variants are more enriched among rare SNVs, neutral variants have greater representation among common variants. In addition, we also quantified and compared ΔF profiles introduced by disease-related SNVs (taken from the HGMD database), as well as cancer somatic variants, thereby enabling in-depth analyses of the differential effects between SNVs in driver and passenger genes.
Though the majority of our analyses are consistent with prior studies investigating how SNVs impact protein structures, we provide a distinct rationale through the lens of localized frustration. We observe that large disruptions in local interactions of minimally frustrated core residues distinguishes disease-related SNVs from benign SNVs, as well as SNVs impacting driver and passenger genes in cancer. In contrast, benign SNVs in passenger genes generate larger perturbations in local interactions of minimally frustrated surface residues compared to core residues. Furthermore, comparisons between rare and common SNVs within healthy human populations indicate that rare variants induce larger disruptions in favorable local interactions compared to common variants. Moreover, we also investigated the effects of SNVs impacting conserved and variable regions of proteins, where conservation was measured across different species. For disease-related SNVs, we detected a significant disparity between local perturbations observed due to SNVs impacting conserved regions compared to variable regions of proteins. However, no such disparity was observed for benign SNVs.
We also demonstrate how frustration may provide insights in the context of oncogenes and tumor suppressor genes (TSGs). We find that somatic SNVs in oncogenes disrupt local interactions of surface residues and potentially facilitate cancer progression through the introduction of non-specific regulatory interactions. However, SNVs in TSGs drive cancer progression through larger local perturbations in core residues. These observations suggest that SNVs in TSGs and oncogenes may impart loss-of-function (TSG) and gain-of-function (GOF) effects, respectively.
MATERIALS AND METHODS
SNV data sets
We utilized a comprehensive catalogue of non-synonymous SNVs from various resources. Our SNV data set is divided into two broad categories (benign and disease-related; Figure 1A). The benign set comprises SNVs reported in The 1000 Genomes (1KG) Project (phase 3) (7) and subsets of SNVs curated from The Exome Aggregation Consortium. The disease-related data set includes SNVs from The Human Gene Mutation Database (HGMD) (5) and pan-cancer data set (46) comprising publicly available somatic SNVs from The Cancer Genome Atlas (TCGA) (47), The Catalogue of Somatic Mutations in Cancer (48) and the SNV data set available from Alexandrov et. al (49). In order to avoid redundancy and false positive call sets, we only consider HGMD SNVs annotated as pathological variants (labeled as ‘DM’) in the HGMD data set. Furthermore, we removed HGMD variants in the 1KG and ExAC data sets. Similarly, we also removed known TCGA variants in the original ExAC SNV data sets.

Overview of SNV categories and their relative proportions within the analyzed data pool. (A) Flowchart representing the different categories and origins of the variants analyzed in this study. A given non-synonymous SNV can be classified as benign or disease-related on the basis of its provenance (i.e. whether it is taken from 1000 Genomes, ExAC, HGMD or Pan-cancer variant data sets). Relative proportions of SNVs from various data sets (B) prior to and (C) after mapping SNVs to high-resolution protein databank (PDB) structures.
SNVs from the pan-cancer data set were further sub-classified based on whether they fall in driver or passenger genes. Driver genes were curated from Vogelstein et. al. (50), in which driver and passenger genes are distinguished on the basis of mutational patterns. They define a driver gene as an oncogene if the SNV is recurrent at the same gene loci, whereas TSGs are mutated throughout their length. Similarly, we sub-classified passenger genes into cancer-associated genes (CAGs) and non-cancer associated genes (non-CAGs). CAGs include genes from the cancer gene census (CGC) (51) and a curated list of 4050 genes from a previous study (52). Furthermore, we removed any driver genes present in the CAG data set. The remaining set of genes impacted by pan-cancer SNVs constitutes our non-CAG data set.
Semi-balanced SNV data sets
The limited and uneven structural coverage of the human proteome primarily introduces two sources of potential bias when combined with SNV data sets: (i) some proteins may be over-represented when evaluating the effects of SNVs, and (ii) the sets of proteins that correspond to benign SNVs may differ considerably from those that correspond to deleterious SNVs, thereby complicating direct comparisons between benign and deleterious SNVs.
In order to address the first issue, we selected a non-redundant set of proteins within each data set. The non-redundant set is constructed by ensuring that no protein within the set shares more than 90% sequence identity with any other protein in the set. Using this approach, we find that there are 618, 907, and 303 distinct proteins within the set of high-resolution structures impacted by 1KG, ExAC and HGMD SNVs, respectively. Distributions delineating the number of SNVs within these non-redundant protein sets are given in Supplementary Figures S1–S3.
In order to address the second issue, we analyzed only those structures that fall within the intersection of the different non-redundant data sets. Thus, for each SNV in a structure within this intersection set of non-redundant proteins (which we term the ‘semi-balanced set’), at least one residue overlaps with an ExAC/1KG and HGMD SNV. We utilized this semi-balanced set to demonstrate the utility of localized frustration (for evaluating deleteriousness) in the context of other methods (PolyPhen-2 (53) and SIFT (15)), as described in Results. We also compare ΔF distributions for 1KG, ExAC and HGMD SNVs on the semi-balanced set (Supplementary Figure S4).
Workflow for calculating ΔF
As mentioned, we investigated the impacts of different categories of SNVs on the local stability of protein structures. Quantifying the ΔF value associated with a given SNV at position i within a structure involves three steps (see also detailed formalism below): (i) determining Fnat: this represents a normalized energetic difference between having the wild-type residue at position i (the wild-type residue is generically termed ‘nat’ here) and the mean energy of having all 20 amino acids at position i, where all 20 of these energies are calculated using the wild-type structure; (ii) determining F΄mut: this represents a normalized energetic difference between introducing the mutated residue at position i (the residue resulting from the SNV is generically termed ‘mut’ here) and the mean energy of having all 20 amino acids at position i, where all 20 of these energies are calculated using a model of the mutated structure, which is built using the SNV; and (iii) determining the ΔF value associated with this SNV – this is simply the difference between the values calculated in steps (i) and (ii) above: ΔF = F΄mut - Fnat. Qualitatively, ΔF thus describes the extent to which an amino acid change perturbs the local energetic landscape. We note that, in producing the model of the mutated structure in part (ii), this workflow performs a second-order calculation. As such, it provides values that may more accurately reflect the influence of introducing an SNV.
To map SNVs onto protein structures, the Variant Annotation Tool (VAT) (54) was applied to annotate our curated catalogue of SNVs. This annotation includes the gene and transcript names, residue position in the protein sequence, as well as the original and mutated residue identities. We then integrated the VAT annotation with the Biomart-derived (55) human gene and transcript IDs to map the SNV to specific structures in the protein databank (PDB). We restricted this SNV mapping scheme to high-quality structures with resolution values that were better than 2.8 Angstroms. Following SNV mapping to PDB structures, we generated models of the resultant mutated structures by applying homology modeling using the mutated protein sequence and the wild-type protein structure as input to Modeller (56,57).
Finally, we quantify the frustration index of the mapped residue in the wild-type structure as well as in the mutated model of the protein. Briefly, the residue-level localized frustration index (45) quantifies the degree to which that amino acid favorably contributes to the energy of the system relative to all 20 possible amino acids at that position: |$\ {F_{\rm i}} = \ \frac{{\langle {E_{\rm i}^{T,\ U}} \rangle \ - \ E_{\rm i}^{T,\ N}}}{{\sqrt {1/N\mathop \sum \nolimits_{{\rm k}\ = \ 1}^n ( {E_{\rm i}^{T,\ U} - \ \langle {E_{\rm i}^{T,\ U}} \rangle } )} }}$|, where |$E_{\rm i}^{T,\ N}$|is the total energy of the wild-type protein. This energy is calculated using a function that includes an explicit water interaction term, |$E_{\rm i}^{T,\ N} = \ \mathop \sum \limits_{k \ne i}^n (E_{{\rm contact}}^{i;k} + \ E_{{\rm water}}^{i;k}) + \ E_{{\rm burial}}^i$|. This water-mediated potential (44), describes the energies associated with direct interactions between residues i and k (|$E_{{\rm contact}}^{i;k})$|, as well as those with water-mediated interactions between residues i and k|$(E_{{\rm water}}^{i;k}$|) and an energy term associated with the burial of the residue (|$E_{{\rm burial}}^i)$|. The average energy of the decoy conformations (|$\langle {E_{\rm i}^{T,\ U}} \rangle$|) is generated by mutating the original residue i to each of the alternative possible 19 residues. The AMW potential includes different parameter values for different residues, so the decoy energies calculated vary based on the identity of the mutated residue.
This workflow is computationally tractable when evaluating ΔF values for large numbers of variants. Our benchmark calculations on 10,000 non-synonymous SNVs indicate that we can map, build mutated models and calculate ΔF values in ∼29 h on a single-core processor; specifically, we used an E5-2660 v3 (2.60GHz) processor. This approach is substantially more computationally tractable relative to traditional molecular dynamics simulations. Thus, it can readily be used to evaluate the effects of large numbers of SNVs. We also provide source code for the workflow on our GitHub page (https://github.com/gersteinlab/frustration).
In Figure 2, we demonstrate an example case in which a tryptophan residue at locus 31 within plastocyanin (PDB ID 3CVD) is mutated to tyrosine. For the wild-type structure of this protein, 19 decoy energies are calculated by changing the parameter values that are specific to each amino acid within the potential function (note that, at this stage, the structure is not altered or minimized in any way). In this case, the energy computed using the wild-type residue (ETRP) is substantially lower than the mean value 〈E〉 (rendering a positive value for ΔETRP). Because ΔETRP is greater than 0, the wild-type residue is said to be ‘minimally frustrated’.

The effect of introducing a typical deleterious SNV (ΔF < 0). Each of the two vertical lines represents an energy-level diagram. Each level on this energy scale corresponds to the total energetic value of the protein if the residue position (here, residue ID 31) were to be occupied by distinct amino acids (thus, for instance, in the left vertical line, having isoleucine occupy position 31 results in conferring the highest possible energy to the protein, whereas having valine occupy position 31 results in the lowest possible energy for the protein). The ΔF associated with an SNV is negative if the SNV introduces a destabilizing effect. Shown here is the result of changing residue ID 31 in plastocyanin (pdb ID 3CVD) from the wild-type residue (TRP) to a mutated residue (TYR). The protein in its wild-type form (in green), in which the tryptophan residue at position 31 is substantially more energetically favorable relative to the mean energy 〈E〉 that would result from having any of the possible 20 amino acids at that position. This disparity is designated by (〈E〉 - ETRP)/σE = FTRP > 0. The gray vertical arrow designates the relative energy (ETYR) associated with having a non-WT residue (TYR) at position 31 – this may conceivably be used to calculate FTRP in a similar manner. Relating FTRP to FTYR by using the wild-type structure alone as a basis for estimating the perturbation induced by the W31Y SNV (FTYR – FTRP = Δ|${\rm{\tilde{F}}}$|) would provide a naïve estimate of the perturbation. A more accurate evaluation of the perturbation requires a better estimate for FTYR. This entails a secondary calculation, wherein the entire protein structure (right) is first changed (see Materials and Methods) by generating a model of the mutated structure after the SNV W31Y is introduced. This changed structure redistributes the relative energies for the different amino acids. The new mean and standard deviation associated with the energies of the modeled structure are designated by 〈E〉΄ and σE΄, respectively. In this case, the SNV W31Y results in an energy that is higher than the mean energy (〈E〉΄) of all possible 20 amino acids at that position. This disparity is designated by (〈E〉΄ – E΄TYR)/σE΄ = F΄TYR < 0. Taken together, the negative value associated with the disparity between the F΄TYR and FTRP values (F΄TYR – FTRP = ΔF < 0) indicates that this SNV is locally unfavorable, and greater accuracy is likely obtained using the secondary calculation.
This same protein is known to contain a disease-related SNV at locus 31 (W31Y). To quantify the associated change in frustration, we first introduce tyrosine at locus 31 in silico, and then use Modeler to generate a model of the mutated structure. Thus, we now not only change the residue at locus 31, but also the configuration of the entire protein; the new structure is the model of the mutated protein. In this new energy landscape, the energy associated with the residue at the mutated locus 31 is higher than the mean energy among all 20 amino acids within the modeled structure (ΔE΄TYR < 0), suggesting that the mutated residue is ‘maximally frustrated’. We are primarily interested in the difference between these two states (ΔF). ΔF is proportional to the difference between ΔE΄TYR and ΔETRP. ΔΔE is defined to be the difference between the two energetic disparity measures (ΔΔE = ΔE΄TYR − ΔETRP). Here, ΔΔE is less than 0, suggesting that the frustration value is higher in the mutated structure than that of the wild type.
Downstream analyses
In order to investigate the differential effects of SNVs in various data sets, we ‘bin’ each SNV into distinct categories based on their frustration indices and relative accessible surface areas (RSASA) in the wild-type structure. SNVs are classified into three groups (all in wild-type structures): (i) minimally frustrated (MinFNS); (ii) maximally frustrated (MaxFNS) and (iii) neutral (NeutFNS). MinFNS residues have frustration indices ≥ 0.78, whereas MaxFNS residues have frustration indices less than or equal to −1.0. SNVs falling between these two extremes fall in the NeutFNS category. Moreover, we sub-classify each of these three categories into core and surface residues based on their RSASA value. We calculated the RSASA value for each residue using NACCESS (58). Residues were defined to be in the core if their RSASA value ≤ 25%; surface residues had RSASA values > 25%.
Furthermore, we investigated the differential influence of common and rare variants. SNVs with minor allele frequencies (MAF) ≤ 0.5% were considered to be rare. SNVs were otherwise classified as common. Similarly, we also compared the effects of SNVs influencing the conserved and variable regions of the genome. The distinction between conserved and variable regions was defined using genome evolutionary rate profiling (GERP) scores (59). GERP scores are used to identify functionally constrained genomic elements based on multiple sequence alignments of genomic sequences from diverse species. In our analysis, we defined a genomic position as conserved if its GERP score was greater than 2.0. GERP scores ≤ 2.0 were considered to designate variable genomic loci. For each variant dataset, Table 1 provides a survey of the SNV counts across these different categories, along with the frequencies with which these variants lie in core or surface regions.
Summary statistics on the number of SNVs used in comparative analyses
The deleteriousness of an SNV is a continuous variable, and indeed, this is reflected in the continuous nature of ΔF values. However, there is considerable value in applying a binary classification scheme to newly discovered SNVs, which may be predicted to be benign or deleterious. In order to perform such a binary classification, we applied a simplified decision boundary scheme, wherein we analyzed ΔF distributions for HGMD (disease-related) and ExAC (seemingly benign) variants. The threshold was set with the objectives of (i) minimizing the fraction of HGMD SNVs with ΔF values above the threshold, and (ii) minimizing the fraction of ExAC SNVs with ΔF values below the threshold. Using this approach, we observed that variants with ΔF scores ≤ −1.221 can be considered deleterious. Details of this scheme are provided as part of the Supplementary Material.
RESULTS
Differential effects of benign and disease-related SNVs on ΔF profiles
We performed a comparative analysis to investigate the impacts of benign (1KG and ExAC) and disease-related (HGMD) SNVs on the ΔF profiles of mutated residues in a large number of proteins. As detailed in Materials and Methods, each SNV data set was divided into three distinct categories based on the frustration index of the wild-type residue. Maximally frustrated residues in the wild-type structure exhibit conflicting interactions and unfavorable geometry in their local environment, thereby inducing local destabilization. Conversely, minimally frustrated residues are involved in biophysically favorable local interactions, and thus favorably contribute to the protein's stability.
For each SNV, ΔF was calculated as follows (Figure 2; see also Materials and Methods for more details). For a given SNV, two protein structures are used in our analysis: the wild-type structure (as it exists in the PDB), and a model of the structure as it may exist when the affected residue is mutated (this is modeled by optimizing the structure after introducing the SNV). If a given SNV maps to residue location j within the structure, then within each of these two structures, the frustration index is calculated at residue j (the corresponding values are denoted as Fnat and Fmut for the wild-type and mutated model structures, respectively). Subsequently, we determine the difference between the frustration index of the native residue in the wild-type structure and the mutated residue in the modeled structure (ΔF = Fmut – Fnat).
After calculating the ΔF values in all three categories, the resultant distributions are plotted (further details are given in Materials and Methods). We observed that most SNVs (across all data sets) affecting maximally frustrated residues in the wild-type structure induce small but positive ΔF values. This suggests that changes to maximally frustrated residues alleviate conflicting interactions, thereby resulting in a positive frustration difference (ΔF > 0). In contrast (and as expected), residues that are originally minimally frustrated tend to become more frustrated upon base-substitution, thereby, leading to a negative frustration changes (ΔF < 0) in a majority of cases across each data set. However, we emphasize that losses or gains in favorable interactions are dependent on the type of SNV (benign or disease-related) as well as whether the SNV affects a surface or core residue.
We observed that benign SNVs lead to greater disruptions within minimally frustrated surface residues compared to core residues in the wild-type structure, and this trend is observed when using both ExAC and 1KG data sets (P-value < 2e-16 from two-sample Wilcoxon test; Figure 3A and B). In addition, disease-related SNVs (from HGMD) result in similar frustration changes between core and surface residues. However, SNVs from HGMD that impact minimally frustrated core residues induce stronger perturbations than benign SNVs influencing minimally frustrated core residues (P-value < 2e-16 from two-sample Wilcoxon test; Figure 3C).

Differential effects of ‘benign’ and disease-associated SNVs on the localized frustration of minimally frustrated residues in the wild-type state. Violin plots showing ΔF distributions associated with SNVs affecting the core or surface, with SNVs taken from (A) 1000 Genomes, (B) ExAC and (C) HGMD. Comparisons between ΔF distributions for core and surface residues of the 1000 Genomes and ExAC data sets indicate that favorable interactions of surface residues in the wild-type states are highly disrupted upon mutation relative to core residues. Furthermore, the distributions of ΔF values for HGMD core residues were highly negative compared to 1KG and ExAC variants impacting core residues. The white dots, black boxes and vertical lines represent the medians, interquartile ranges and 95% confidence intervals of the ΔF distributions, respectively.
Differential effects of rare and common SNVs on localized frustration
In population-level studies, SNVs with lower MAF are generally interpreted as being more likely to be deleterious than SNVs with higher MAF values. Thus, within the set of benign SNVs (1KG and ExAC variants), MAF may be used as a proxy for varying degrees of selective constraint. Thus, we compared the ΔF distributions between rare and common SNVs. Consistent with our earlier observations regarding benign SNVs, we found larger disruptions to favorable local interactions in surface residues relative to core residues (Figure 4A). However, this disparity was slightly more pronounced for rare SNVs compared to common SNVs. This observation was consistent for the 1KG (Figure 4A) and ExAC data sets (Figure 4B; P-value < 2e-16 from two-sample Wilcoxon test). Furthermore, using both of these data sets, we observed that greater ΔF values associated with the introduction of SNVs (in either the positive or negative direction) tend to be associated with lower MAF values (Figure 4C, top and bottom panels). This trend is observed for SNVs that occur both on the surface and within the core.

Common and rare SNVs differentially influence ΔF profiles of minimally frustrated core and surface residues. Violin plots showing the ΔF distributions induced by common and rare variants from in the (A) 1000 Genomes and (B) ExAC data sets (shown are the effects on minimally frustrated core and surface residues). Rare variants in both data sets lead to more substantially negative ΔF values compared to common variants. (C) Scatter plots of ΔF values indicate that more extreme ΔF values (in either the positive or negative direction) tend to be associated with lower MAF values (i.e. rarer SNVs). The white dots, black boxes and vertical lines represent the medians, interquartile ranges and 95% confidence intervals of the ΔF distributions, respectively.
Differential effects of benign and disease-related SNVs in different evolutionary contexts
We also examined the local perturbations induced by disease-related and benign SNVs in conserved and variable regions of the genome. We plotted distributions for the ΔF values for the surface and core residues (Figure 5), and observed that benign SNVs originating in both conserved and variable regions had similar effects on minimally frustrated core residues (Figure 5A and B). This observation was true for the surface residues as well. In contrast, disease-related SNVs intersecting conserved and variable genomic regions lead to variable ΔF values for surface residues (P-value = 4.715e-03 from two-sample Wilcoxon test). This disparity is even more pronounced in core residues (Figure 5C; P-value = 6.723e-09 from two-sample Wilcoxon test).

Comparisons between ΔF distributions for ‘benign’ and disease-related SNVs on conserved and variable residues. Violin plots depicting ΔF distributions introduced by (A) 1000 Genomes, (B) ExAC and (C) HGMD variants, respectively. ΔF distributions associated with HGMD SNVs indicate larger disruptions of conserved core residues compared to variable residues. In contrast, for the 1000 Genomes and ExAC data sets, no significant difference in ΔF distributions was observed for conserved and variable core residues (the same was true for surface residues). The white dots, black boxes and vertical lines represent the medians, interquartile ranges and 95% confidence intervals of the ΔF distributions, respectively.
Differential effects of SNVs on driver and passenger genes
One of the most important challenge confronting the cancer genomics community involves discriminating between highly deleterious driver SNVs and the large number of neutral passenger SNVs that naturally arise over the course of tumor progression (60). As part of these efforts, a large number of cancer actionable genes have been curated in recent years. We applied our framework to evaluate the effects that somatic cancer SNVs have on driver genes (50), CAGs (52) and non-CAGs in the context of frustration. We mapped the somatic pan-cancer SNVs in these three distinct gene categories to protein structures, and then evaluated the ΔF distributions in all three.
As with benign SNVs, we observed that somatic SNVs in CAGs and non-CAGs generally lead to greater disruptions in minimally frustrated surface residues relative to core residues (Figure 6; P-value < 2.2e-16 from two-sample Wilcoxon test). This disparity was more pronounced among non-CAGs than CAGs (Figure 6). In contrast, SNVs in driver genes lead to larger disruptions in favorable localized interactions for surface and core residues (Figure 6; P-value < 2.2e-16 from two-sample Wilcoxon test) compared to CAG core and surface residues.

Comparisons between ΔF distributions associated with driver and passenger genes. (Left) Violin plots showing ΔF distributions associated with somatic SNVs affecting non-cancer associated genes (non-CAG), cancer associated genes (CAG) and driver genes encoding core and surface residues. Somatic SNVs affecting core residues of driver genes lead to a more substantially negative ΔF values compared to those in CAG and non-CAG proteins. (Right) In contrast, SNVs in CAGs and non-CAGs disrupt favorable interactions of surface residues to a larger extent relative to core residues. The white dots, black boxes and vertical lines represent the medians, interquartile ranges and 95% confidence intervals of the ΔF distributions, respectively.
Differential effects of SNVs on oncogenes and TSGs
Cancer driver genes are classified as oncogenes and TSGs based on their mutational pattern and their mode of inducing tumorigenesis (50). Oncogenes are marked by recurrent SNVs within the same gene loci across different cancer types, and are believed to drive cancer progression through GOF mechanisms. In contrast, a TSG generally contains protein-truncating mutations or SNVs that are scattered throughout the gene, and they are believed to facilitate cancer progression through loss-of-function (LOF) mechanisms. This line of thinking is guided by the idea that LOF variants often act by destabilizing the protein (Figure 7C, left panel), whereas GOF variants may impact protein–protein interaction interfaces (by reducing specificity for binding partners) or negatively affect auto-regulatory sites on the protein, many of which are on the surface (Figure 7C, right panel).

Differential impacts on ΔF distributions associated with SNVs on driver and passenger genes. Violin plots depicting ΔF distributions associated with (A) SNVs in TSGs and (B) oncogenes. Relative to surface residues, SNVs in TSGs lead to larger disruptions for minimally frustrated core residues. However, SNVs affecting oncogenes are associated with larger ΔF values for the surface compared to core residues. (C) These observations suggest a potential model in which SNVs in TSGs act by disrupting the hydrophobic core of a protein and drive cancer progression through loss-of-function (LOF) mechanisms (left). In contrast, SNVs in oncogenes may facilitate non-specific binding by changing surface residues and drive cancer through GOF events (right). The white dots, black boxes and vertical lines represent the medians, interquartile ranges and 95% confidence intervals of the ΔF distributions, respectively.
To evaluate the extent to which such effects manifest in our set of TSGs and oncogenes, we applied frustration to evaluate changes in local perturbation when SNVs impact these distinct categories of driver genes (Figure 7A and B). We observed that SNVs in TSGs induce stronger perturbations in minimally frustrated core residues relative to surface residues (Figure 7A; P-value = 4.765e-03 from two-sample KS test). In contrast, SNVs in oncogenes induce greater ΔF values within minimally frustrated residues in the surface relative to core residues (Figure 7B; P-value = 1.91e-13 from two-sample KS test). Moreover, SNVs impacting oncogenes lead to larger disruptions in favorable local interactions compared to TSGs for minimally frustrated surface residues (P-value = 2.3e-3 from two-sample KS test). However, SNVs impacting TSGs lead to greater disruptions in favorable local interactions compared to oncogenes affecting driver SNVs in core residues (P-value = 6.753e-13 from two-sample KS test).
Localized frustration as a means of complementing global metrics
As discussed, existing structure-based methods for predicting SNV deleteriousness rely on global metrics of protein stability. These approaches may incorrectly predict known disease-related SNVs to be benign (thereby producing false negatives). We address the extent to which ΔF rescues such false negatives by correctly predicting their deleterious effects. We first identified 626 HGMD SNVs within the semi-balanced set (see Materials and Methods), and predicted the impacts of these SNVs using SIFT, PolyPhen-2 and ΔF values. We use a ΔF threshold of −1.221 to discriminate between SNVs that are predicted to be benign or deleterious (details regarding how this threshold value was established are provided in the supplement).
SIFT produces false negatives for 13.7% of these HGMD SNVs. We find that employing ΔF rescues 46% of these SIFT false negatives (i.e. by correctly predicting deleterious impacts). Similarly, PolyPhen-2 produces false negatives for 10% of the HGMD SNVs. Applying ΔF enables us to rescue 38% of these PolyPhen-2 false negatives. Glucokinase is used as an example to demonstrate specific cases of rescued variants (Supplementary Figure S6). A list of all false negatives rescued by ΔF analysis is provided in Supplementary Material data file. Finally, we also provide the full library of all HGMD SNVs mapped to protein structures (along with their corresponding PolyPhen-2, SIFT and ΔF scores), which may then be used as a resource in related studies.
DISCUSSION
Over the course of the last decade, tremendous improvements in sequencing and structural biology techniques have led to immense growth in genomic variation and three-dimensional structural data sets. This provides us with an ideal platform to investigate the impacts of genomic variants on protein structures. The objective of these studies is to gain mechanistic insights into the origin of various diseases, as well as to design effective drug targets. Prior studies in this direction were limited by a lack of genomic variation and structural data. Moreover, these studies primarily focused on investigating the impacts of SNVs on the global stability of protein structure. However, many experimental studies have indicated a causal role for SNVs inducing local perturbation in various diseases. Here, we repurpose the concept of localized frustration (originally introduced in protein folding studies) to quantify SNV-induced local perturbations. The frustration index of a residue quantifies the presence of favorable/dis-favorable local interactions in the protein structure compared to a random molten globule structure.
Historically, the relative scarcity of genomic variation and structural data have presented challenges to variant interpretation, in that only a small pool of SNVs may be mapped to resolved structures. However, despite addressing sources of bias, limited mapping coverage persists as a major challenge. Nevertheless, a number of recent trends may partially help to mitigate this issue. Significant improvements in crystallographic protocols have enabled near-exponential growth in X-ray structures in the PDB (11). Furthermore, cryo-EM is opening entirely new avenues for revealing the architectures of many proteins that were previously elusive to crystallography, which is expected to expand the structurally-resolved proteome (61). Finally, inferring how a given SNV affects a particular structure is not limited to predictions regarding that protein alone – the protein's tight associations with other molecules may greatly broaden the scope of how that SNV influences other proteins (e.g. the consequences of an SNV within a multi-protein complex may adversely affect all members of the complex, despite the fact that the SNV maps to only one protein).
We employed an extensive catalogue of benign (∼5.7 million) and disease-related (∼0.76 million) SNVs. The benign SNV data set comprised SNVs from the 1KG Project (phase 3) and the ExAC Consortium. In contrast, HGMD SNVs and pan-cancer somatic SNVs constituted our disease-related SNV data set. We mapped ∼0.2 million benign and disease-related SNVs onto ∼10K high-resolution protein structures. Subsequently, we compared the impact of benign and disease-related SNVs on the frustration profiles of minimally frustrated residues in various protein structures. The ΔF distributions indicate that both benign and disease SNVs disrupt minimally frustrated surface residues to similar extents. However, the mechanistic differences between benign and disease SNVs can be attributed to their impacts on the local environment of core residues. Within the core, disease-related SNVs result in more severe perturbations to local interactions relative to those introduced by benign SNVs. These local disruptions are propagated throughout the core and, in turn, drive the deleteriousness of various disease-related SNVs.
Furthermore, we quantified the influence of rare and common SNVs (present in healthy human populations) on the frustration profiles of affected protein residues. We observed that rare SNVs lead to larger local perturbations of minimally frustrated surface residues compared to common SNVs. This observation is intuitively consistent, as one would expect rare SNVs to have greater impacts on protein stability. In addition, we also investigated the differential impacts of SNVs intersecting conserved regions compared to variable genomic regions. The distinction between conserved and variable regions was based on GERP scores, which quantifies a cross-species conservation score on each nucleotide position. This cross-species conservation analysis indicated that there is no disparity between ΔF associated with benign SNVs in conserved and variable regions. This lack of disparity can be attributed to the absence of significant local perturbations induced by benign SNVs, which do not compromise the overall stability of structures. In contrast, disease-related SNVs in conserved and variable regions of the genome exhibit significant differences in ΔF values. This is consistent with prior studies, which indicate that the deleteriousness of an SNV is more pronounced when SNVs impact functionally important conserved regions relative to variable regions of the genome.
In addition to studying disease-related variants in general, tremendous progress in next-generation sequencing has led to unprecedented efforts to characterize cancer genomes. A great deal of effort has been invested in discriminating between driver and passenger SNVs. Driver SNVs are known to play important roles in driving cancer progression. Motivated by this, we examined the influence of SNVs in driver and passenger genes. Specifically, we studied these effects in the context of the local stability of protein structures. Our analysis indicates that SNVs influencing non-CAGs and CAGs lead to greater perturbations of surface residues compared to core residues. In contrast, SNVs that impact driver genes have similar affects on ΔF values in core and surface residues. These observations support our earlier conclusion that the deleteriousness of a given SNV is determined by its ability to perturb the local interactions of core residues. These local perturbations further propagate through the core to completely destabilize the protein structure.
Furthermore, cancer driver genes are often classified as oncogenes or TSGs. SNVs in oncogenes generally lead to cancer progression through GOF mechanisms, whereas SNVs impacting TSGs generally contribute to cancer growth through LOF events. These two distinct modes prompted us to closely inspect SNVs originating in each category, and we compared the ΔF profiles for SNVs across both. We observed that SNVs in oncogenes and TSGs generate greater ΔF values at the surface and core, respectively.
Comprehensive catalogues of SNVs from large-scale genomics projects have clearly established the important roles of disease-related and rare variants in human populations. We foresee further growth in genomic data sets as large-scale genomic consortia (such as the International Cancer Genomics Consortium, The Pan-Cancer Genome Atlas, The UK10K Project and the Mendelian Genomic Program) continue to decipher the mutational landscape of human genomes and exomes. Similarly, advances in electron microscopy, NMR, small angle X-ray scattering and other techniques will further increase the availability of protein structural data. These expanding knowledge bases of genomic variation and structural biology will facilitate integrative studies to gain mechanistic insights into disease progression and to design effective disease therapy regimens. We demonstrate the role of localized frustration as a metric to quantify and investigate the influence of genomic variants on protein structures. The proposed framework is a logical extension to earlier studies that primarily employed global metrics (such as folding free energy changes) to quantify the effects of SNVs. We believe that the combination of these global and local metrics, along with sequence features, will further help to elucidate the mechanisms as well as predict the impacts of genomic variants.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
ACKNOWLEDGEMENTS
The authors acknowledge support from the NIH and from the A.L. Williams Professorship funds. The authors thank Diego Ferreiro for helpful discussion and sharing the original source code used in calculating localized frustration. The authors also acknowledge Suganthi Balasubramanian for providing valuable feedback during manuscript preparation. The authors would like to thank the Exome Aggregation Consortium and the groups that have provided exome variant data for comparison. A full list of contributing groups can be found at http://exac.broadinstitute.org/about.
FUNDING
Funding for open access charge: NIH [2UM1HG006504‐05, U41 HG007497‐03].
Conflict of interest statement. None declared.
REFERENCES
Comments