Fabrizio Pucci, Katrien V Bernaerts, Jean Marc Kwasigroch, Marianne Rooman, Quantification of biases in predictions of protein stability changes upon mutations, Bioinformatics, Volume 34, Issue 21, November 2018, Pages 3659–3665, https://doi.org/10.1093/bioinformatics/bty348
Abstract
Bioinformatics tools that predict protein stability changes upon point mutations have made substantial progress in the last decades and have become accurate and fast enough to make computational mutagenesis experiments feasible, even on a proteome scale. Despite these achievements, they still suffer from important issues that must be solved to further improve their performance and to deepen our insights into protein folding and stability mechanisms. One of these problems is their bias toward the learning datasets which, being dominated by destabilizing mutations, causes predictions to be better for destabilizing than for stabilizing mutations.
We thoroughly analyzed the biases in the prediction of folding free energy changes upon point mutations (ΔΔG) and proposed some unbiased solutions. We started by constructing a dataset of experimentally measured ΔΔGs with an equal number of stabilizing and destabilizing mutations, by collecting mutations for which the structure of both the wild-type and mutant protein is available. On this balanced dataset, we assessed the performances of 15 widely used predictors. After the striking observation that almost all these methods are strongly biased toward destabilizing mutations, especially those that use black-box machine learning, we proposed an elegant way to solve the bias issue by imposing physical symmetries under inverse mutations on the model structure, which we implemented in PoPMuSiCsym. This new predictor constitutes an efficient trade-off between accuracy and absence of biases. Some final considerations and suggestions for further improvement of the predictors are discussed.
Supplementary data are available at Bioinformatics online.
The article https://doi.org/10.1093/bioinformatics/bty340, published alongside this paper, also addresses the problem of biases in protein stability change predictions.
1 Introduction
De novo protein design is well known to be an important challenge in protein science. Its achievement would have a considerable impact on a wide series of academic and industrial applications that range from drug design in medicinal chemistry to the development of multi-component protein nanomaterials (Coluzza, 2017; Huang et al., 2016; Zanghellini, 2014). This goal is far from being reached, even though valuable developments have recently been made. Mutational studies with both experimental and computational techniques have thoroughly deepened our understanding of the mechanisms that drive the folding process. In particular, a lot of computational methods have been developed in the last decades to predict in an efficient way how an amino acid substitution impacts protein stability (Alford et al., 2017; Capriotti et al., 2005; Chen et al., 2013; Cheng et al., 2006; Dehouck et al., 2009, 2011; Giollo et al., 2014; Guerois et al., 2002; Kellogg et al., 2011; Laimer et al., 2016; Masso and Vaisman, 2008, 2014; Pandurangan et al., 2017; Parthiban et al., 2006; Pires et al., 2014a, b; Quan et al., 2016). They allow limiting extensive mutagenesis experiments and thus save time and money.
The most accurate methods among these are structure based. They use the three-dimensional (3D) structure of the wild-type protein as input for predicting how the folding free energy gets modified upon point mutations (ΔΔG). All these methods are based on a large variety of models that range from pure machine learning algorithms to more biophysics-oriented approaches where the energetic contributions are appropriately combined. Their average performances, measured by the root mean square deviation between experimental and predicted ΔΔG values for datasets that contain on the order of two thousand entries, are reported to be between 1.0 and 1.5 kcal/mol [for previous comparisons of the methods’ performances, see Potapov et al. (2009) and Khan and Vihinen (2010)]. These results are astonishing if one considers that, despite the complexity of the problem, some of the above-mentioned tools predict the ΔΔG of one mutation in less than a minute. This opens the way to performing computational mutagenesis experiments at the proteomic scale with reasonable accuracy.
Unfortunately, these methods suffer from different drawbacks. Like all machine learning approaches, they are prone to overfitting problems (Cawley and Talbot, 2010; Hawkins, 2004), and their results therefore tend to be biased toward the training datasets. The analysis and the correction of biases are of primary importance to get more accurate and reliable methods. However, it is a non-trivial task since biases are usually hidden and require careful work on the model structures and on the cleaning of the training datasets.
A known bias in protein stability prediction comes from the fact that the ensemble of experimentally characterized mutations, and as a consequence the training datasets, are dominated by destabilizing mutations. This implies that the predictors tend to be more accurate for destabilizing than for stabilizing mutations, which is a crucial issue given that the latter are the focus of protein design applications. This issue has been reported in a few investigations (Fariselli et al., 2015; Pucci et al., 2015; Thiltgen and Goldstein, 2012), but there is not yet a common, generally accepted way to overcome it. Moreover, biases are not limited to this feature but can involve other characteristics such as the kind of protein or the type of wild-type and mutant amino acids, since not all substitutions are sufficiently sampled in the training dataset.
In this paper, we go further into this investigation, and assess the performances of different predictors on a new dataset of mutations with experimentally characterized ΔΔG values and with known 3D structures of both the wild-type and mutant proteins. This dataset is by construction balanced with respect to stabilizing and destabilizing mutations. We show that imposing physical symmetries on the model structures is an efficient and elegant way to solve the bias problem, as already suggested in a preliminary study (Pucci et al., 2015).
2 Materials and methods
2.1 Folding stability changes upon mutations
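The quantities used throughout the paper can be recalled in standard form. The following is a conventional formulation, with the sign convention (positive ΔΔG for destabilizing mutations) used by most of the predictors discussed below; the tag of the last relation matches the Equation (3) references in the text:

```latex
\begin{align}
\Delta G &= G_{\mathrm{folded}} - G_{\mathrm{unfolded}}, \tag{1}\\
\Delta\Delta G &= \Delta G_{\mathrm{mutant}} - \Delta G_{\mathrm{wild\text{-}type}}, \tag{2}\\
\Delta\Delta G_{\mathrm{inv}} &= -\,\Delta\Delta G_{\mathrm{dir}}. \tag{3}
\end{align}
```

Equation (3) simply expresses that mutating a residue and then mutating it back must leave the folding free energy unchanged, whatever the structural context.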
2.2 Assessing predictors through bias evaluation
In this paper we constructed a new mutation dataset, Ssym, which is balanced with respect to stabilizing and destabilizing mutations (see Section 2.3), and used it for assessing the performance of 15 prediction methods (Section 2.4) and for quantifying their bias in favor of destabilizing mutations. We used the following measures, the former two to estimate the predictors’ accuracy and the latter two the bias:
σ_dir and r_dir are the root mean square deviation and the linear correlation coefficient between the predicted and experimental ΔΔG values for the direct mutations in Ssym, from wild-type to mutant. Note that these mutations belong to the training datasets of the methods tested, so that the predictions are likely to be overfitted, and σ_dir and r_dir to be underestimated and overestimated, respectively.
σ_inv and r_inv are the root mean square deviation and the linear correlation coefficient between the predicted and experimental ΔΔG values for the inverse mutations in Ssym, from mutant to wild-type. These mutations do not belong to the training datasets and thus constitute an independent test set.
r_dir–inv is the linear correlation coefficient between the predicted ΔΔG values of the direct and the inverse mutations. A non-biased prediction that satisfies Equation (3) has r_dir–inv equal to −1.
A previously used criterion to estimate the bias is the parameter δ, defined as (Thiltgen and Goldstein, 2012):

δ = ΔΔG_dir + ΔΔG_inv    (4)
A perfectly non-biased tool should have δ = 0 for every mutation. We used here its average value ⟨δ⟩ taken over all mutations that belong to Ssym.
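The four measures above are straightforward to compute once a predictor has been run on both the direct and inverse versions of each mutation. The sketch below is an illustrative implementation (the function name and dictionary keys are our own, not from the paper):

```python
import numpy as np

def bias_metrics(exp_dir, pred_dir, exp_inv, pred_inv):
    """Accuracy and bias measures for ddG predictions.

    exp_dir / pred_dir: experimental and predicted ddG for the direct
    mutations; exp_inv / pred_inv: the same for the corresponding inverse
    mutations, in the same order, so entry i of every array refers to the
    same wild-type/mutant pair.
    """
    pred_dir = np.asarray(pred_dir, dtype=float)
    pred_inv = np.asarray(pred_inv, dtype=float)
    rmsd = lambda a, b: float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))
    corr = lambda a, b: float(np.corrcoef(a, b)[0, 1])
    return {
        "sigma_dir": rmsd(exp_dir, pred_dir),   # RMSD on direct mutations
        "r_dir": corr(exp_dir, pred_dir),       # correlation on direct mutations
        "sigma_inv": rmsd(exp_inv, pred_inv),   # RMSD on inverse mutations
        "r_inv": corr(exp_inv, pred_inv),       # correlation on inverse mutations
        # Correlation between direct and inverse predictions; an unbiased
        # predictor satisfying Eq. (3) gives exactly -1.
        "r_dir_inv": corr(pred_dir, pred_inv),
        # Average Thiltgen-Goldstein delta of Eq. (4); 0 when unbiased.
        "delta_mean": float(np.mean(pred_dir + pred_inv)),
    }
```

For a perfectly symmetric predictor (pred_inv = −pred_dir), the function returns r_dir_inv = −1 and delta_mean = 0, as expected from Equations (3) and (4).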
2.3 Dataset construction
We created a manually curated dataset, Ssym, by selecting mutations from the ProTherm database (Bava et al., 2004) and checking them on the basis of the original literature. It contains mutations with experimental ΔΔG values for which the 3D structures of both the wild-type and mutant proteins are solved by X-ray crystallography with a resolution of at most 2.5 Å.
Sometimes, different ΔΔG values are available for the same mutation. We selected the value measured under the environmental conditions closest to the standard conditions (pH 7, T = 25°C). Note that ΔΔG values are frequently measured at the melting temperature of the wild-type protein.
We ended up with a dataset of 684 mutations, half of which are direct mutations inserted in 15 wild-type proteins, while the remaining half are inverse mutations inserted in 342 different mutant proteins. These mutations are given in Supplementary Information.
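The selection criteria above (both structures solved at 2.5 Å resolution or better, and, among duplicate measurements, the one closest to standard conditions) can be sketched as follows. The record fields, the function name and the equal weighting of pH units against degrees Celsius are assumptions of this sketch, not details from the paper:

```python
from dataclasses import dataclass

@dataclass
class Measurement:
    mutation: str          # e.g. "1BNI I76A" (hypothetical identifier format)
    ddg: float             # experimental ddG, kcal/mol
    ph: float
    temp_c: float
    wt_resolution: float   # X-ray resolution of the wild-type structure (angstrom)
    mut_resolution: float  # X-ray resolution of the mutant structure (angstrom)

def select(measurements, max_resolution=2.5):
    """Keep mutations whose two structures are solved at <= max_resolution,
    and, when several ddG values exist for one mutation, keep the one
    measured closest to standard conditions (pH 7, 25 C)."""
    by_mut = {}
    for m in measurements:
        if m.wt_resolution > max_resolution or m.mut_resolution > max_resolution:
            continue
        # Distance to standard conditions; weighting pH units and degrees
        # equally is an arbitrary choice made for this illustration.
        dist = abs(m.ph - 7.0) + abs(m.temp_c - 25.0)
        best = by_mut.get(m.mutation)
        if best is None or dist < best[0]:
            by_mut[m.mutation] = (dist, m)
    return [m for _, m in by_mut.values()]
```

Running this over a raw list of measurements yields at most one entry per mutation, each backed by two high-resolution structures, which is what the balanced Ssym construction requires.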
2.4 Prediction methods analyzed
We selected the predictors that are among the most renowned in terms of speed and accuracy. We list them below and briefly explain their characteristics.
PoPMuSiC v2.1 (Dehouck et al., 2009): based on standard statistical potentials, combined with sigmoidal weights that depend on the solvent accessibility of the mutated residues.
SDM (Pandurangan et al., 2017): uses conformationally constrained environment-specific substitution tables to calculate the change in thermodynamic stability between the wild-type and the mutant proteins.
CUPSAT (Parthiban et al., 2006): uses torsion angle potentials and structural environment-specific atom potentials.
Rosetta (Kellogg et al., 2011): generates a 3D structural model of the mutated protein from the wild-type structure, and computes the difference in energy between them, with as energy function the sum of a large series of empirical physics-based energy contributions [Coulomb electrostatic, Lennard-Jones atomic interactions, etc. (Alford et al., 2017)].
FoldX v3.0 (Guerois et al., 2002): uses a full atomistic description of the protein structure and is based on FOLDEF, an empirical force field developed as a linear combination of different empirical energy terms (van der Waals, solvation, electrostatic, hydrogen bonds, etc.).
I-Mutant v3.0 (Capriotti et al., 2005): a tool based on a support vector machine (SVM) that combines protein sequence and structure information.
iSTABLE (Chen et al., 2013): an integrated predictor that combines, using an SVM algorithm, sequence information with predictions from different methods (I-Mutant, AUTO-MUTE, MUPRO, PoPMuSiC and CUPSAT).
NeEMO (Giollo et al., 2014): uses an effective representation of proteins based on residue interaction networks (RINs) and combines the extracted information through a neural network.
AUTO-MUTE (Masso and Vaisman, 2008): uses as main ingredient four-body, knowledge-based, statistical contact potentials that are combined with machine learning tools (random forest and SVM).
STRUM (Quan et al., 2016): combines physics- and knowledge-based energy functions derived from protein structure models obtained by I-TASSER (Roy et al., 2010), through gradient boosting regression.
MAESTRO (Laimer et al., 2016): uses statistical energy functions as main features, and combines them with a multi-agent method that includes a linear regression, an artificial neural network and an SVM.
mCSM (Pires et al., 2014b): a machine learning method that utilizes graph-based distance patterns between atoms as well as the residue type.
DUET (Pires et al., 2014a): a consensus prediction method obtained by combining mCSM and SDM using an SVM algorithm.
MUPRO (Cheng et al., 2006): uses an SVM approach that takes into account sequence information only.
All the tools in this list utilize the 3D structure of the wild-type protein as input, except the last one which is based on the protein sequence only. The first five predictors are based on combinations of energy contributions and do not use machine learning, or use machine learning just to identify the parameters of a pre-established model structure. The last nine predictors are true machine learning methods.
Some predictors require as input the pH at which the change in folding free energy is computed (method 11) or both the pH and the temperature (methods 5–10), while the others do not ask for the specification of the environmental conditions, assuming standard conditions.
2.5 Designing unbiased prediction models
Two approaches can be devised to solve the bias problem and recover predictions that satisfy Equation (3). One solution is to train the model on a balanced dataset that contains, for each mutation, both the direct and inverse versions, from wild-type to mutant and from mutant to wild-type. However, this requires knowing the 3D structure of the mutant proteins, which is only available for a subset of mutations: our dataset contains 684 mutations, whereas the training datasets for which only the wild-type structure is required contain about 3000 mutations. The datasets can be enlarged by including mutant structures obtained through comparative modeling, but this introduces noise into the data. Note that this is the only solution in the case of pure machine learning approaches where the model structure is not established a priori.
When the prediction model is pre-established and not obtained through a black-box machine learning technique, it is possible to identify the terms in the model structure that are responsible for the symmetry breaking and appropriately correct them. This is exactly what we did in Pucci et al. (2015), where the PoPMuSiCsym model, a symmetrized version of PoPMuSiC v2.1, was presented.
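To make the symmetry idea concrete: when both structures are available at prediction time, even a biased scorer can be antisymmetrized by construction. The sketch below is a simple post-hoc variant of this principle, not the authors' actual PoPMuSiCsym implementation, which instead corrects the symmetry-breaking terms inside the model structure itself; the function and parameter names are ours:

```python
def symmetrized_ddg(predict, wt_structure, mut_structure):
    """Antisymmetric combination of a predictor's direct and inverse scores.

    `predict(structure_from, structure_to)` is any ddG predictor. The
    combination below satisfies Eq. (3) exactly, because swapping the two
    structures flips the sign of the result.
    """
    direct = predict(wt_structure, mut_structure)
    inverse = predict(mut_structure, wt_structure)
    return 0.5 * (direct - inverse)
```

The price of this post-hoc trick is that the mutant structure (experimental or modeled) is needed for every prediction, which is precisely the limitation that motivates building the symmetry into the model structure instead.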
3 Results
We tested 15 predictors on a common, balanced dataset of 684 single-site mutations, in order to evaluate their performances and, more importantly, their degree of bias with respect to the symmetry between direct and inverse mutations [Equation (3)]. Table 1 contains the values of the root mean square deviations σ and the linear correlation coefficients r, reported separately for the direct and inverse mutations. The importance of the bias is evaluated by two parameters: the correlation coefficient r_dir–inv between the direct and inverse mutations and the δ parameter defined in Equation (4).
Note: The standard deviations σ_dir and σ_inv and the values of ⟨δ⟩ are in kcal/mol. The methods are ranked according to their performance on the independent test set of inverse mutations, more specifically on the basis of σ_inv. The best values in each column are underlined.
As clearly seen in Table 1 and Figure 1, all the tested methods are biased toward the training dataset, except PoPMuSiCsym which has been explicitly designed to avoid this bias. If we focus on direct mutations, the best performing method is MUPRO, a sequence-based machine learning method, with a σ_dir of 0.95 kcal/mol and an r_dir of about 0.8. Remember, however, that all the direct mutations are part of the methods’ training datasets, and these results are thus likely to be affected by overfitting problems. In contrast, the inverse mutations do not belong to the methods’ training sets and can thus be considered as constituting an independent test set. The best performing predictors on the inverse mutations are PoPMuSiCsym, MAESTRO, FoldX and PoPMuSiC v2.1.
It is important to note that the black-box machine learning techniques suffer in general more from the bias issue than the other methods that use a more physics-based approach. For example, if one overlooks PoPMuSiCsym, the least biased predictor is SDM, which belongs to the physics-based class of predictors, with an r_dir–inv correlation coefficient of about −0.8 and a ⟨δ⟩ value of about −0.3.
However, some physics-based methods are also strongly biased. The point is that such methods can avoid biases only if their model structure is adequately constrained to avoid them. More specifically, the current PoPMuSiC v2.1 version already shows a good performance compared with other predictors, but the implementation of the physical constraints of Equation (7) in PoPMuSiCsym spectacularly improves σ_inv by more than 25% and yields a ⟨δ⟩ value of essentially zero.
Note that despite the symmetry constraints, there are still some outliers in PoPMuSiCsym with respect to the expected symmetry between direct and inverse mutations, as shown in Figure 2. These outliers actually correspond to mutations that cannot be predicted simply from the wild-type structure. Indeed, they provoke significant structural rearrangements to avoid steric clashes, empty cavities or other unfavorable conformations. In these cases, both the wild-type and mutant structures should be considered in the estimation. These issues explain why PoPMuSiCsym does not perfectly satisfy the symmetry relation of Equation (3) despite its symmetric model structure; the correlation r_dir–inv is indeed equal to −0.77 rather than −1.0.
We also analyzed the bias effect separately for core and surface residues. Table 2 reports the results for the best performing methods. In general, the predictions are biased for both surface and core mutations. To correctly interpret these results, we have to remember that mutations in the core have on average a larger effect on the protein structure and stability. In the Ssym dataset, for example, the mean of the absolute values of the ΔΔGs is equal to 1.75 kcal/mol for core mutations and approximately half that (0.96 kcal/mol) for surface mutations. As a consequence, the σ values of the different methods tend to be worse in the core whereas the correlation coefficients tend to be worse on the surface.
Table 2

Method | σ_dir | r_dir | σ_inv | r_inv | r_dir–inv | ⟨δ⟩
---|---|---|---|---|---|---
*Core residues* | | | | | |
PoPMuSiCsym | 1.92 | 0.56 | 1.99 | 0.52 | −0.89 | 0.03
FoldX | 1.50 | 0.64 | 2.27 | 0.52 | −0.37 | −0.60
SDM | 1.75 | 0.62 | 2.52 | 0.49 | −0.90 | −0.56
MAESTRO | 1.55 | 0.49 | 2.57 | 0.47 | −0.58 | −0.82
PoPMuSiC v2.1 | 1.31 | 0.65 | 2.74 | 0.51 | −0.79 | −1.09
*Surface residues* | | | | | |
PoPMuSiCsym | 1.16 | 0.42 | 1.15 | 0.48 | −0.62 | 0.03
PoPMuSiC v2.1 | 1.09 | 0.45 | 1.42 | 0.29 | −0.27 | −0.35
MAESTRO | 1.14 | 0.39 | 1.50 | 0.25 | −0.14 | −0.35
FoldX | 1.61 | 0.60 | 2.00 | 0.18 | −0.39 | −0.35
SDM | 1.72 | 0.18 | 2.02 | 0.16 | −0.63 | −0.08
Note: The standard deviations σ_dir and σ_inv and the values of ⟨δ⟩ are in kcal/mol. The predictors are ranked according to the smallest σ_inv scores, computed on the set of inverse mutations, which constitutes an independent test set with no overlap with the methods’ training datasets. The best values in each column are underlined.
According to our results, the least biased predictors are PoPMuSiCsym and SDM, for both core and surface mutations. But the performance of PoPMuSiCsym is generally better than that of SDM, especially when it is evaluated on the inverse mutation set, which does not overlap with the methods’ training sets. The second best performing predictor on the set of inverse mutations is FoldX on core mutations and PoPMuSiC v2.1 on surface mutations.
The bias was also compared between mutations in which an amino acid is replaced by a much larger or a much smaller amino acid, and mutations in which the wild-type and mutant amino acids have roughly the same size (Table 3). The volume differences can indeed be another source of bias for some of the prediction methods. Here also, PoPMuSiCsym is the least biased predictor and the best performing on the set of inverse mutations, both for mutations with and without significant size difference. The next least biased predictor is SDM, and the next best performing predictors are MAESTRO and SDM for substitutions with large volume changes, and MAESTRO and FoldX for small volume changes.
Table 3

Method | σ_dir | r_dir | σ_inv | r_inv | r_dir–inv | ⟨δ⟩
---|---|---|---|---|---|---
*Large volume changes* | | | | | |
PoPMuSiCsym | 2.04 | 0.53 | 2.13 | 0.52 | −0.73 | 0.07
SDM | 1.78 | 0.59 | 2.77 | 0.36 | −0.67 | −0.60
MAESTRO | 1.63 | 0.61 | 2.77 | 0.47 | −0.54 | −0.72
FoldX | 1.89 | 0.60 | 2.90 | 0.41 | −0.27 | −0.84
PoPMuSiC v2.1 | 1.33 | 0.70 | 2.90 | 0.32 | −0.51 | −1.02
*Small volume changes* | | | | | |
PoPMuSiCsym | 1.40 | 0.42 | 1.41 | 0.40 | −0.78 | 0.02
MAESTRO | 1.26 | 0.40 | 1.82 | 0.25 | −0.22 | −0.53
FoldX | 1.44 | 0.60 | 1.83 | 0.36 | −0.46 | −0.35
PoPMuSiC v2.1 | 1.17 | 0.51 | 1.90 | 0.20 | −0.08 | −0.61
SDM | 1.72 | 0.40 | 2.10 | 0.28 | −0.80 | −0.22
Note: The standard deviations σ_dir and σ_inv and the values of ⟨δ⟩ are in kcal/mol. The predictors are ranked according to the smallest σ_inv scores, computed on the set of inverse mutations, which constitutes an independent test set with no overlap with the methods’ training datasets. The best values in each column are underlined.
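The size-based grouping used in Table 3 can be sketched as follows. The residue volumes are approximate literature (Zamyatnin-type) values, and the 40 Å³ threshold separating "large" from "small" volume changes is an assumption of this sketch, not the paper's actual cutoff:

```python
# Approximate amino-acid volumes in cubic angstroms (Zamyatnin-type values).
VOLUME = {
    "G": 60.1, "A": 88.6, "S": 89.0, "C": 108.5, "D": 111.1,
    "P": 112.7, "N": 114.1, "T": 116.1, "E": 138.4, "V": 140.0,
    "Q": 143.8, "H": 153.2, "M": 162.9, "I": 166.7, "L": 166.7,
    "K": 168.6, "R": 173.4, "F": 189.9, "Y": 193.6, "W": 227.8,
}

def volume_class(wt_aa, mut_aa, threshold=40.0):
    """Classify a substitution by its absolute residue volume change.

    `threshold` (in cubic angstroms) is an assumed cutoff chosen for
    illustration only.
    """
    diff = abs(VOLUME[wt_aa] - VOLUME[mut_aa])
    return "large" if diff > threshold else "small"
```

For instance, a glycine-to-tryptophan substitution falls in the large-change group, while an isoleucine-to-leucine swap, with identical volumes, falls in the small-change group.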
4 Discussion
In this paper, we thoroughly investigated the symmetry breaking issue and extensively discussed the fact that computational methods tend to predict the mutations more often as destabilizing than as stabilizing since the training datasets are dominated by destabilizing residue substitutions. Even though this problem was already described in the literature (Thiltgen and Goldstein, 2012; Fariselli et al., 2015; Pucci et al., 2015), a quantitative measure of the violation of the symmetry between the direct and the inverse substitutions by existing predictors was lacking. This gap has been filled in this paper, in which we quantified and discussed the performance and biases of 15 of the most efficient available tools. Our results can be summarized as follows:
All tested methods are biased toward destabilizing mutations. As a proof of this statement, we observed that the prediction error on the set of inverse mutations (75% of which are stabilizing) is larger by a factor of about two than the prediction error on the set of direct mutations (dominated by destabilizing mutations, which represent 75% of the dataset entries). Indeed, σ_dir ranges between 0.94 and 1.75 kcal/mol, whereas σ_inv ranges between 2.09 and 2.88 kcal/mol. This effect is amplified for substitutions in the core with respect to surface mutations.
Predictors that use black-box machine learning techniques tend to be more biased than the others. Indeed, four of the top five prediction tools, PoPMuSiCsym, PoPMuSiC v2.1, FoldX and SDM, use biophysics-oriented models that combine energy contributions in a coherent way. Only the fifth tool, MAESTRO, uses statistical potentials and other biophysical features combined through several kinds of machine learning methods.
Imposing biophysical constraints on the model structure (when accessible) is an elegant and simple way to completely solve the bias problem. Indeed, from the analysis of the different folding free energy contributions, it is quite simple to avoid all the terms that violate the symmetry. Relying on symmetry principles in the construction of a model is a common and well-known strategy in physics, which also pays off here, as shown by the spectacular improvement of the ⟨δ⟩ and r_dir–inv values of PoPMuSiCsym.
Besides the necessity of getting rid of the symmetry biases, other issues need to be tackled to improve the protein stability prediction methods:
We would like to draw attention to the training datasets. Most predictors use S2648 (Dehouck et al., 2009) or Q3421 (Quan et al., 2016) as training sets. These sets are manually curated and based on data coming from the ProTherm database (Bava et al., 2004), which has not been updated for more than 5 years. As many experimental data have been published since then, especially from deep mutational scanning experiments (Fowler and Fields, 2014), it would be extremely useful to collect them into a new, extended and manually curated database.
The bias toward destabilizing mutations in the usual learning sets should be taken into account in the evaluation of the methods’ performances. A possibility is to systematically test new methods on Ssym, the dataset described in this paper that contains both the direct and inverse versions of each mutation and is thus by construction balanced with respect to stabilizing and destabilizing mutations.
The predictors possibly also suffer from other hidden biases. For example, some types of mutations could be insufficiently sampled in the learning set, with the consequence that the predictor could learn incorrect trends. We would like to stress once more that testing predictors in cross validation is insufficient to correctly evaluate them with respect to the learning dataset biases.
We would also like to underline the issues related to the addition of more and more features to the predictors. On the one hand, it allows taking into account the huge complexity of the problem, but on the other hand it increases the risk of overfitting and bias. Moreover, when features are added on top of other features, for example in the case of metapredictors, the performances are difficult to evaluate in genuine cross validation and should be carefully analyzed.
The improvement that the above analyses are expected to bring is crucial in view of addressing even more challenging issues, such as the prediction of the changes in folding free energy upon multiple mutations. Indeed, even though it remains costly, the wide screening of single-site mutations can be performed experimentally in a reasonable time, via techniques such as deep mutational scanning (Fowler and Fields, 2014). Computational methods capable of predicting only point mutations could thus become less impactful in the protein design field in the near future, and attention should shift toward the development of predictors able to estimate the effect of multiple mutations. Such predictions would moreover be more likely to fulfill the requirements of improving protein stability in biotechnological applications, which are frequently impossible to satisfy by single-point mutations alone, but require combinations of mutations to achieve, for example, high energetic stabilization while leaving the solubility and activity of the protein unaltered.
Acknowledgements
We thank Nathalie Ruth for running some of the prediction programs.
Funding
The Belgian Fund for Scientific Research (FNRS) is acknowledged for financial support through a PDR research project. F.P. is a Postdoctoral Researcher and M.R. a Research Director at the FNRS.
Conflict of Interest: none declared.
References