Quantification of biases in predictions of protein–protein binding affinity changes upon mutations

Abstract Understanding the impact of mutations on protein–protein binding affinity is a key objective for a wide range of biotechnological applications and for shedding light on disease-causing mutations, which are often located at protein–protein interfaces. Over the past decade, many computational methods using physics-based and/or machine learning approaches have been developed to predict how protein binding affinity changes upon mutations. They all claim to achieve astonishing accuracy on both training and test sets, with performances on standard benchmarks such as SKEMPI 2.0 that seem overly optimistic. Here we benchmarked eight well-known and well-used predictors and identified their biases and dataset dependencies, using not only SKEMPI 2.0 as a test set but also deep mutagenesis data on the severe acute respiratory syndrome coronavirus 2 spike protein in complex with the human angiotensin-converting enzyme 2. We showed that, even though most of the tested methods reach a significant degree of robustness and accuracy, they suffer from limited generalizability properties and struggle to predict unseen mutations. Interestingly, the generalizability problems are more severe for pure machine learning approaches, while physics-based methods are less affected by this issue. Moreover, undesirable prediction biases toward specific mutation properties, the most marked being toward destabilizing mutations, are also observed and should be carefully considered by method developers. We conclude from our analyses that there is room for improvement in the prediction models and suggest ways to check, assess and improve their generalizability and robustness.


INTRODUCTION
Proteins interact with each other to form complexes that perform a wide range of biological functions in the intra-and extracellular media, and are involved in key processes such as signal transduction, cell growth and proliferation and cell apoptosis.It is therefore of fundamental interest to understand how amino acid substitutions impact on the ability of proteins to bind to their interacting partners.Such insights would shed light on pathogenic mechanisms since aberrant protein-protein interactions (PPIs) caused by deleterious variants are often central to Mendelian disorders and complex diseases such as cancer [1][2][3][4].From a biotechnological perspective, it would improve the design of drugs that modulate PPIs, as targeting these is an established strategy in the treatment of disease [5,6].
There are several experimental methods for estimating the impact of mutations on PPIs.Biophysical methods such as isothermal titration calorimetry allow in-depth estimation of protein binding thermodynamics [7]; in contrast, high-throughput screening assays such as yeast-two-hybrid systems only allow identification of binary PPIs but have the advantage of being applicable at a large scale [8].However, given that all experimental approaches remain challenging, costly and time-intensive, there is room for computational methods which provide effective alternatives to predict and achieve better understanding of PPIs.
Over the last decade, many studies have been dedicated to the development of bioinformatics tools to predict the impact of mutations on protein-protein binding affinity ( G b ), which is the thermodynamic descriptor of PPIs [9][10][11][12][13][14][15][16][17][18][19][20][21].These tools are mainly based on structural features derived from experimentally characterized protein complexes and/or evolutionary data.These features are usually combined using standard machine learning techniques, but deep learning algorithms are starting to be used in predictor construction [20].
The first attempts to predict protein-protein binding affinity changes upon mutations ( G b ) were based on physical energy functions [22], with predictors such as Rosetta [9] (2002), FOLDEF [10] (2002) and DComplex [11] (2004).The lack of sufficiently large and standardized datasets of experimental G b values prevented them from being trained directly on such data.For this reason, some of them (e.g.DComplex) were completely unsupervised, while others (e.g.Rosetta and FOLDEF) were trained on experimental values of protein stability changes upon mutations ( G) reported in the ProTherm [23] dataset, with the assumption that physical properties of intraprotein interactions are transposable to interprotein interactions at the interface.In this case, experimental data were used only to parameterize the energy functions and to weight their individual contributions.Now, the SKEMPI dataset [24,25] fills this gap.It is considered as the gold standard for training and testing G b predictors.Its first release in 2012, SKEMPI 1.0 [24], collected, curated, selected and standardized entries from literature searches and from already existing datasets (ASEdb [26], PINT [27] and [28]).This first release allowed the development of a generation of G b predictors such as BeAtMuSiC [12] (2013), mCSM [13] (2014), MutaBind [14] (2016) and BindProfX [15] (2017).The large amount of collected experimental values enabled a more extensive use of machine learning methods (e.g. in mCSM), as well as leveraging other nonphysical information to predict energy values.For instance, evolutionary information was extracted from homologous structures (in BindProfX) and sequences (in MutaBind).
While these tools achieve good prediction accuracy on their respective training sets, the extent to which these results are generalizable to unseen data is one of the open issues in the field.Indeed, like all supervised machine learning methods, they are likely to suffer from undesirable biases toward the learning set, which often hinder the generalization of their predictions.One example of this problem is the bias toward destabilizing values of the folding free energy change upon mutations ( G), which has been thoroughly analyzed in a series of investigations [33][34][35].In summary, it has been shown that training protein stability predictors on the common experimental datasets that are dominated by destabilizing mutations leads to much better performance on destabilizing than on stabilizing mutations.
Although prediction biases have been studied for predictors of stability changes caused by mutations, they have not been for protein-protein affinity changes; yet having accurate and unbiased prediction tools of G b values is crucial for a wide range of biotechnological applications.In this paper, we have systematically quantified possible biases in state-of-the-art protein-protein G b prediction methods.More precisely, we evaluated their predictions on a set of mutations with experimentally measured G b values taken from [25], and on high-throughput data on the binding between the human angiotensin-converting enzyme 2 (ACE2) and the receptor binding domain (RBD) of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) spike protein taken from [36].After an analysis of the methods' performances, we suggest strategies to limit and correct possible biases and thus to further improve the methods' generalizability and scores.

Protein-protein binding affinity change upon mutations
The thermodynamic protein-protein binding affinity G b is a measure of the strength of a PPI and is defined using the Gibbs free energy: where R is the Boltzmann constant, T the absolute temperature (in K) and K D the equilibrium dissociation constant of the PPI.We use the convention that the stronger the interaction, the more negative the value of G b , and express it in kcal/mol.Under the action of a mutation, we define the binding affinity change as where wt refers to the wild-type complex and mt to the mutant.Thus, positive G b values correspond to mutations that destabilize the complex and negative values to stabilizing mutations.Since binding affinity is a thermodynamic state function, mutating from a wild-type complex to a mutant complex and then mutating back results in no change in G b , which is expressed by the following equation: We will refer to this property as the symmetry property.
In what follows, we will call 'direct mutation a mutation that goes from the wild-type to the mutant complex.Conversely, we will call 'reverse mutation' a mutation that goes from the mutant to the wild-type complex.Note that the terms wild-type, mutant, direct and reverse are defined with respect to the proteins that are part of our datasets and do not necessarily have a biological interpretation.

Defining protein-protein interfaces
The relative solvent accessibility (RSA) of a residue in a threedimensional (3D) structure is defined as the ratio (in %) of its solvent-accessible surface area in the structure and in an extended tripeptide Gly-X-Gly [37].We calculated them using our in-house software MuSiC [38] (which uses an extension of the DSSP algorithm [39]), available on the dezyme.comwebsite.We distinguished between interactant-RSA (iRSA) and complex-RSA (cRSA), which correspond to the RSA calculated from the structure containing solely the considered interactant and from the structure containing the complex with both interactants, respectively.We defined the RSA change upon binding as RSA := iRSA − cRSA; it measures how much the PPI changes the solvent accessibility of a residue.A residue is considered to be in the protein-protein interface if its RSA is greater than 5%.

Datasets of binding affinity changes upon mutations
We considered two datasets.The first is based on the SKEMPI sets [24,25], containing mutations in different protein-protein complexes of known 3D structure available in the Protein Data Bank (PDB) [40], whose G b values have been measured experimentally using biophysical methods, performed by various laboratories.The number of characterized mutations in each protein typically ranges from a few to a few dozen, and reaches in rare cases a few hundred [41,42].These datasets yield relatively accurate G b values but have the disadvantage of being unsystematic and of ref lecting the specific interests of the authors in the choice of proteins and mutations.
The SKEMPI 2.0 dataset [25] contains 7085 entries and is the most comprehensive, well-curated and diverse dataset of its kind.First, we discarded entries without G b value and entries describing multiple mutations.We then aggregated all redundant entries (with the same mutation in the same PDB structure) by taking their average G b value.To withdraw the dependency on the quality of the structures, we also dropped all mutations in low-resolution X-ray structures (resolution > 2.5Å) and in structures obtained by nuclear magnetic resonance spectroscopy.This defines our first benchmark dataset called S2536 which contains 2536 mutations in 205 different PDB structures.
The second dataset we considered contains affinity values obtained through deep mutagenesis experiments that systematically characterized all possible mutations in the RBD of the SARS-CoV-2 spike glycoprotein in interaction with the human ACE2 receptor [36].This dataset has the advantage of being systematic and therefore less biased.However, the measured values are not exact G b but close correlates.From this set, we first discarded the mutations of the few residues located in the N-and C-terminal tails of the spike protein, as they are absent from the reference PDB structure 6M0J.We then identified the ACE2-RBD interface residues, of which there are 20, using the above RSA criterion.We focused on all 380 possible mutations of these 20 residues, to define our second benchmark dataset C380.
For both the S2536 and C380 datasets, considered by definition as direct mutations, we constructed the datasets of reverse mutations using the symmetry property Eq. ( 3) to assign a G b value to each reverse mutation.When the distinction is required, we append the suffix -D to the name for a dataset of direct mutations, the suffix -R for a dataset of reverse mutations and the suffix -DR for a dataset of both direct and reverse mutations (e.g.S2536-D, S2536-R and S2536-DR).

Protein 3D structures
For predicting direct mutations in the S2536 set, we used the PDB structures of the protein complexes that have been collected in the SKEMPI 2.0 database, as they were curated to be as close as possible to the protein complexes on which the measurements were made.For direct mutations in the C380 set, we used the experimental 3D structure of the ACE2-RBD complex with PDB ID 6M0J [43], as referenced in [36].
For reverse mutations, we modeled the mutant complexes using the comparative modeling software MODELLER [44] with default parameters and the wild-type structures as templates.MODELLER reconstructs the side chain of the mutated residue, then slightly rearranges the backbone and the side chains of the complex to avoid steric clashes and to optimize atomic interactions with the new mutated residue.Since the template and mutant structures differ by only one mutation, the resulting model remains very close to the initial structure.

Prediction methods tested
We benchmarked the eight best-known, available and widely used G b predictors published in recent years.We brief ly describe their characteristics.mCSM-PPI2 [16] is a machine learning predictor that uses graph-based structural signatures of the inter-residue interaction network, evolutionary information, complex network metrics and energy terms.
MutaBind2 [17] uses seven features including protein-solvent interactions, evolutionary conservation and physics-based thermodynamic stability.
BeAtMuSiC [12] is our in-house predictor.It estimates the G b as a linear combination of the stability changes upon mutations ( G) of the protein complex and of the individual interactants, computed by the PoPMuSiC predictor [45].It uses statistical energy functions for G estimation, derived from the Boltzmann law which relates the frequency of occurrence of a structural pattern to its free energy.
SSIPe [18] combines protein interface profiles obtained from structure and sequence homology searches with physics-based energy functions.SAAMBE-3D [19] is a machine learning-based predictor that utilizes 33 knowledge-based features representing the physical environment surrounding the mutation site.
NetTree [20] is a deep learning method based on convolutional neural networks and algebraic topology features.It uses elementand site-specific persistent homology to represent the structure of a protein complex and to translate it into topological features.
FoldX [46] is a purely physics-based method that uses empirical energy functions to predict G b as described in the FOLDEF paper [10].Its energy terms are defined by theoretical models (e.g. the van der Waals potential energy function), which are parameterized and weighted using empirical data.
BindProfX [15] combines the FoldX prediction and a profile score based on structural interface alignments obtained by the iAlign software [47].The profile score exploits evolutionary information by comparing the frequencies of occurrence of the wildtype and the mutant amino acids in structurally similar interfaces.BindProfX is only applicable to protein dimers; when applied to higher order multimers, we use the FoldX term only.
These predictors can be classified into three groups based on the nature of their approach: mCSM-PPI2, MutaBind2, SAAMBE-3D and NetTree are machine learning predictors whose features are extracted from protein structures, physics and evolution; SSIPe and BindProfX linearly combine an evolutionary term and a physics-based energy term using G b data to optimize their models; BeAtMuSiC and FoldX are pure physics-based predictors.
In terms of training set, we have the following classification: NetTree was trained on antigen-antibody interaction data from the AB-Bind dataset [29] which is partially included in the SKEMPI 2.0 dataset; FoldX was trained on G data from ProTherm [23], however note that it has been updated several times since its first publication [10] in 2002 and it is unclear whether or not the current version (v5) [48] has used G b data for parameterization; BeAtMuSiC was also trained on G values, with only two parameters to balance interprotein and intraprotein contributions adjusted using SKEMPI 1.0 G b values; BindProfX was trained on SKEMPI 1.0 entries; all other predictors were trained on SKEMPI 2.0.Finally, mCSM-PPI2 and MutaBind2 included reverse mutations in addition to direct mutations in their training datasets.

An upper bound to the accuracy of predictors
Binding affinity change values collected from the literature and available in S2536 are derived from experiments performed using different techniques and under different environmental conditions such as pH, temperature or solvent additives.These differences add to the experimental error and usually lead to different G b values for the same mutation in the same protein complex.Furthermore, although SKEMPI 2.0 is particularly well curated, curation errors cannot be avoided, as illustrated by the error corrections between SKEMPI 1.0 and SKEMPI 2.0 (see Supplementary Section 1).The uncertainty on G b values places an upper bound on the precision of the predictions, which cannot exceed the accuracy of the experimental data.
An analytical method for estimating the upper bound on the Pearson correlation coefficient (ρ), which measures the strength of the linear relation between predicted and target values, and the lower bound on the root mean squared error (RMSE), which is a measure of the average error of a prediction, has recently been proposed [49,50].These bounds are expressed as where σ 2 DB is the variance of G b values in the whole dataset and σ 2 is the mean of the individual variances for redundant entries.We estimated the values of these bounds using the 116 redundant clusters with at least three entries among all single mutations from the SKEMPI 2.0 dataset.
We obtained: sup(ρ) = 0.89 and inf(RMSE) = 0.70 kcal/mol.Note, however, that these bounds are probably overestimated and underestimated, respectively, due to an underestimation of σ 2 .Indeed, only independent, uncorrelated, G b measures of a given mutation can yield a correct estimation of the variance, which seems to not be always the case.
The performances of the tested predictors presented in the following sections can be compared with these 'optimal' values.It should be stressed that an accuracy better than these bounds suggests that the predictor is overfitted toward the dataset.A good prediction should thus have a Pearson correlation significantly above zero but below the upper bound of 0.89.It is also expected to have an RMSE value above the lower bound of 0.70 kcal/mol.To give the reader an intuitive idea of the scale of the RMSE, we note that a predictor that consistently predicts G b to be zero would obtain RMSE values of 2.3 and 1.8 kcal/mol on S2536 and C380, respectively.

Biases in the S2536 dataset
As mentioned by the SKEMPI authors [24,25], mutations characterized and reported in the literature are not systematic but ref lect the interests of the experimenters.The collected data have therefore biases toward specific residues, mutation types, spatial locations, proteins and protein families.These biases can lead to overoptimistic assessments of the predictors, even when strict cross-validation methods are used.Indeed, if training and test sets are subject to the same biases, a predictor can learn and replicate them, increasing both its apparent performance and generalization error.This can lead to a gap between the performances estimated from either a biased test set or a set of systematic mutations, raising concerns about the reliability of predictors.In this section we have quantified and discussed some of the biases in the S2536 mutations set.
First we note the imbalance in terms of mutation types.The occurrences of the 380 possible mutation types in S2536 are shown in Figure 1A.Half of the mutations are toward alanine, 222 mutation types occur less than five times and 92 mutation types are not represented.This tendency is related to the prevalence of experimental alanine-scanning data in S2536.It may weaken the predictions of underrepresented mutation types.
Another notable imbalance is toward mutations located at protein-protein interfaces: 78% of S2536 entries are mutations of the 9% of residues located at the interface.Although interface residues are usually more critical for the interaction, noninterface regions can also be important and their effects risk being overlooked by the predictors.
Finally, the G b distribution is largely shifted toward positive values, as shown in Figure 1B.It has a mean value of 1.11 kcal/mol and a standard deviation of 1.99 kcal/mol with clear prevalence of destabilizing mutations.This imbalance is not surprising as experimentally studied complexes are often optimized for high binding affinity by evolution.However, it tends to cause predictors to systematically output destabilizing G b values even for neutral and stabilizing mutations, thus preventing the symmetry property (Eq.( 3)) from being satisfied.This issue, which is particularly problematic for, e.g.rational protein design, has been identified and widely investigated in the context of stability changes upon mutations [33][34][35][51][52][53].In the next sections, we will examine this in the context of changes in binding affinity.
Note that these imbalances were observed in S2536, but also occur in all single-site mutations of the SKEMPI 2.0 dataset (see Supplementary Section 2).

Performances on SKEMPI 2.0
We tested the performances of the eight selected predictors described in Methods (Section 2) on the direct and reverse mutations of the S2536 benchmark dataset.For that purpose, we used the Pearson correlation coefficient between predicted and experimental G b values (ρ) as performance metric.The results are represented in Figures 2-3 and Table 1.Other metrics such as the RMSE and the Spearman rank correlation (r) lead to the similar conclusions (as shown in Table 1 and https://github.com/3BioCompBio/DDGb_bias).
This benchmark, though informative, should be considered with caution, as the extent of cross-validation differs according to the predictor.The main issue is that each of the benchmarked predictors is trained on a different subset of S2536, with various covering ratios (CR) with respect to the subset of direct (S2536-D) and reverse (S2536-R) mutations (Table 2).For instance, the training set of mCSM-PPI2 contains 99% of the S2536-D mutations, while that of NetTree only 10%.Furthermore, mCSM-PPI2 is trained on almost all reverse mutations of S2536-R and MutaBind2, on the fraction necessary to balance the number of stabilizing and destabilizing mutations.
The best-performing predictors on the direct mutation set S2536-D are mCSM-PPI2, MutaBind2 and SAAMBE-3D with Pearson correlations ρ of 0.91, 0.90 and 0.88, respectively.These values exceed or are very close to the upper bound of 0.89 (Eq.( 4)), which suggests some overfitting toward the training set.They are followed by BindProfX, SSIPe, FoldX, BeAtMuSiC and NetTree.
We observe that the performance of all predictors but SSIPe and BindProfX significantly drops when tested on reverse S2536-R mutations.The magnitude of the drop indicates how much each predictor is biased toward direct mutations, which are mostly destabilizing.mCSM-PPI2 and MutaBind2 perform the best on S2536-R, which is expected since they have reverse mutations in their training set; the performance of mCSM-PPI2 drops less than that of MutaBind2, probably because the latter has seen only a part of the reverse mutations during training.Surprisingly, SSIPe and BindProfX are the most robust toward reverse mutations, with almost no drop in performance, although they do not use reverse mutations in training; their robustness is therefore not acquired by training but rather by the symmetry properties of the model.In contrast, BeAtMuSiC, SAAMBE-3D and NetTree basically fail to predict the G b of reverse mutations.Note the particularly huge drop in performance of SAAMBE-3D, whose Pearson correlation decreases from 0.88 to 0.11; this predictor appears thus to be heavily biased toward destabilizing mutations.
This first benchmark shows that a bias toward destabilizing mutations is present in the context of G b predictions.Note that the drop in performance observed when passing from direct to reverse mutations can partly be attributed to this bias but also to the presence of a larger proportion of mutations in S2536-R than in S2536-D which are unseen during training.
For the six methods trained on G b data (mCSM-PPI2, MutaBind2, SSIPe, SAAMBE-3D, NetTree and BindProfX), the covering ratio CR between training and benchmark datasets accurately predicts the performances of the predictors.Indeed, we found an almost linear relationship between the CR of the six predictors and their Pearson correlation ρ on the S2536-D set, with a coefficient of determination R 2 as high as 0.91 (Figure 4).
While this observation does not prove that these predictors are dataset specific and overfitted, it raises some concerns about their ability to generalize to mutations outside the training set.Therefore, further investigation based on a dataset of more systematic and unseen mutations is required: this is the topic of the next subsection.

Performances on SARS-CoV-2 mutations
The C380 dataset has two major advantages over S2536: it is unknown to the eight benchmarked predictors and it is systematic in terms of mutation types.This makes it a better dataset to  As shown in Figure 2, the performances of all predictors but NetTree drop from S2536 to C380, with no score higher than 0.6  The performance comparison between direct and reverse mutations of C380-D and C380-R confirms the conclusions of the previous section: all predictors suffer, to a different extent, from a bias toward destabilizing mutations.A way to quantify this bias for a given predictor is to compute the symmetry violation defined by Eq. ( 3) by computing the shift δ defined as averaged over all C380 dataset entries.While some f luctuations in δ are expected and acceptable, a systematic deviation of the mean shift δ from zero quantifies the asymmetry of a predictor and its bias toward stabilizing or destabilizing mutations.A perfect unbiased value for δ is zero; its 'worst-case' value can be estimated as twice the average G b value in the dataset of direct mutations, which is 1.24 kcal/mol in C380.We thus estimated the 'worst-case' δ-value to be about 2.5 kcal/mol.We show in Figure 5 the distributions of δ-values for the eight predictors on C380.Analogous δ-values distributions are depicted for S2536 in Supplementary Figure S-5.We observe that all predictors have a statistically significant shift toward destabilizing mutations, with a vanishing p-value, but amplitude of the shift widely varies.The most symmetric predictors are, as expected, those that perform best on reverse mutations: MutaBind2 with δ = 0.28 kcal/mol followed by mCSM-PPI2 with δ = 0.47 kcal/mol.This confirms that the usage of reverse mutations for training can largely reduce the asymmetry of the predictions.More biased predictions are observed for FoldX, SSIPe, BeAtMuSiC, BindProfX and SAAMBE-3D, with δ = 1.20, 1.28, 1.44, 1.49 and 1.62 kcal/mol, respectively.These values indicate a bias toward destabilizing mutations, which is, however, still significantly lower than the 'worst-case' bias.This means that such predictions are still able to distinguish the tendency between a set of mostly stabilizing and mostly destabilizing mutations.In contrast, NetTree obtains δ = 4.05 kcal/mol, which is largely above the 'worst-case' bias and ref lects its inability to distinguish stabilizing from destabilizing mutations.This particularly large δ -value can partly be explained by NetTree's tendency to predict very large G b values of about 2 kcal/mol, much higher than average experimental values.
In summary, this benchmark represents a fair and objective way to evaluate the performance of the predictors, since C380 is unknown to all.It confirms the presence of biases toward destabilizing mutations in the state-of-the-art G b predictors and highlights the two predictors mCSM-PPI2 and MutaBind2 that are the least affected by this bias.

Performances and biases toward mutation properties
We investigated the predictors' performances on subsets of S2536-D containing mutations sharing similar properties, i.e. mutation type, mutation location and type of complex, in order to highlight the predictors' strengths and weaknesses.As the standard deviations σ of the experimental G b values widely differ according to the subset, we used the normalized RMSE defined as nRMSE := RMSE/σ to assess the predictions.The results are shown in Figure 6.All observations discussed below are statistically significant with almost vanishing P-values (< 0.0001).
We first analyzed separately the subset of mutations toward alanine and the subset of other mutations.As seen in Figure 6A, no substantial differences are observed between these two subsets, except that MutaBind2 and SAAMBE-3D perform slightly better on the latter subset.This might be explained by actual strengths/weaknesses of the predictors or could suggest a mild overfitting, since it is easier to memorize G b values on underrepresented mutation types.
Most predictors are slightly weaker on mutations outside the protein-protein interface (Figure 6B).This is foreseeable, since effects on binding affinity of non-interface mutations are indirect an thus more difficult to predict.MutaBind2, BindProfX, SAAMBE-3D and NetTree suffer from the largest increase in nRMSE.In contrast, BeAtMuSiC and FoldX present similar performances on both subsets.SSIPe shows a surprisingly small drop in performance on mutations outside the interface, although it explicitly claims to be only able to predict interface mutations.
When comparing mutations in dimers to mutations in higher order multimers (Figure 6C), we observe that mCSM-PPI2, BeAtMuSiC and FoldX are the most stable and that MutaBind2, SSIPe, SAAMBE-3D and BindProfX show the largest performance drop.SSIPe's poor performance on higher order multimers is not surprising as it explicitly announces not to predict such mutations.BindProfX's drop is related to the fact that its predictions on higher order multimers are taken from FoldX (see Methods).Paradoxically, mCSM-PPI2 does not require specifying which chains make up the two interactants, although higher order multimers have several protein-protein interfaces and so there is an ambiguity.In spite of this, it maintains the same performance on both subsets, which could suggest overfitting toward its training dataset.In contrast, MutaBind2 asks the chains included in the interactants, but has the largest performance drop on higher order multimers.
We also assessed the performances on other S2536-D subsets, partitioned by secondary structure, solvent exposure in the complex and interface sub-regions [54] (definitions in Supplementary Section 4), but no relevant observations where found.Results are available at https://github.com/3BioCompBio/DDGb_bias).

Strategies for avoiding biased predictions
To ensure the generalizability of the predictions, k-fold crossvalidation procedures should be carefully performed, avoiding blindly splitting the training set.Indeed, when separating a dataset into folds, a direct mutation and its corresponding reverse mutation should end up in the same fold to avoid that information from one mutation inf luences the prediction of the other.As the S2536 dataset contains multiple homologous complexes differing by only a few mutations, random cross-validations can also lead to information leaks from training to testing sets and provide overoptimistic results.Thus, mutations on homologous complexes should also be kept in the same fold [24].
However, dataset biases can be learned by the predictors even if a strict cross-validation procedure is used.To illustrate this, we started by noticing that half of the mutations from S2536-D are toward alanine (X → A) and thus that half of the mutations from S2536-R are from alanine (A → X).Knowing moreover that S2536-D and S2536-R contain mostly destabilizing and mostly stabilizing mutations, respectively, the sign of G b can be often correctly guessed for X → A and A → X mutations while holding no predictive power.In other words, predictors can learn imbalances and cross correlations between mutations' properties from S2536, which improves its performances in cross-validation while also increasing its generalization error.
As a proof of this phenomenon, we created a 'perfectly biased' predictor, which estimates G b as the mean of the experimental G b values of the same mutation type in the training set (or zero if the mutation type was never encountered).This predictor manages to obtain a Pearson correlation ρ = 0.46 on S2536-DR in 10-fold cross-validation.When applying the same predictor (trained on S2536-DR) on mutation type-balanced, interface-only entries from C380-DR, the Pearson correlation falls to ρ = 0.35, and completely vanishes when dropping the interface filter and applying the predictor to the whole dataset of mutations on the RBD-ACE2 complex (-DR) with ρ = 0.04.The same phenomenon also happens, with however slightly smaller correlations, when considering direct mutations only.We indeed found ρ = 0.34 in 10fold cross-validation on S2536-D, ρ = 0.27 on C380-D and ρ = 0.05 on RBD-ACE2 (-D).Note that these scores are only an underestimation of how dataset-dependent cross correlations from S2536 can impact predictions; we have indeed only considered mutation type-related biases.
As extensively discussed above, asymmetric predictions are another type of unwanted bias.One easy way to avoid it is to symmetrize the prediction results.Indeed, the prediction shift δ vanishes when redefining the prediction of a mutation wt → mt as with, as a consequence, δ = G b wt→mt + G b mt→wt = 0.This operation requires both wild-type and mutant structures, but does not introduce any internal modifications to the predictor itself.Some but not all mutant structures have been resolved experimentally; we listed in the https://github.com/3BioCompBio/DDGb_bias repository the pairs of resolved wild-type and mutant structures from SKEMPI 2.0 that are separated by a single mutation (more details in Supplementary Section 5).Alternatively, the unavailable mutant structures can be modeled with homology modeling techniques using the wild-type structure as a template.Symmetrized versions of all tested predictors were obtained using Eq.(7).For predictors that suffer from a strong bias toward destabilizing mutations, the Pearson correlation coefficient of the symmetrized version falls somewhere between their scores on direct and on reverse mutations.In contrast, the least asymmetric predictors, mCSM-PPI2, MutaBind2, BindProfX, FoldX and SSIPe, show a significantly improved score on the reverse datasets S2536-R and C380-R, as well as on the combined datasets S2536-DR and C380-DR, and similar or only slightly lower performance on the direct datasets S2536-D and C380-D (Supplementary Section 3).This shows that the overall performances of some predictors can be improved while also increasing their symmetry without introducing any internal changes to the model.
As seen in the previous subsections, an alternative strategy to reduce the asymmetry of the predictions consists in using reverse mutations for training.Among the tested predictors, MutaBind2 and mCSM-PPI2 apply this technique and reach good symmetry properties.This practice increases the generalizability and robustness of predictors.However, the symmetrization of the training set has to be done carefully.Indeed, due to the presence of wild-type/mutant pairs in SKEMPI 2.0, adding the reverse of all mutations, as done in mCSM-PPI2, leads to redundant entries that should be avoided, as they are a source of biases.

Predictors' computational efficiency
Computational time efficiency is another characteristic to consider when choosing a prediction method, especially when a large set of mutations has to be analyzed, as for example in the study of variants impact on the interactome [2].In terms of speed, BeAtMuSiC and SAAMBE-3D are fast enough to enable large-scale computational mutagenesis experiments; indeed, they are able to predict all possible single-site mutations in a protein complex in a few to a few tens of seconds.While FoldX is significantly slower, it still can perform all mutations in a small protein complex in about a few hours.In contrast, mCSM-PPI2, MutaBind2, SSIPe, NetTree and BindProfX are time-consuming and require tens of seconds to tens of minutes to run a single mutation.This prevents their use for large-scale applications.

CONCLUSIONS
In the last decade, the computational prediction of how mutations impact protein-protein binding affinity have experienced substantial improvements.Due to the large amount of experimental mutagenesis data generated and the development of new machine learning algorithms and accurate force fields, many G b predictors that reach good performance have been developed and used in biotechnological and biopharmaceutical applications.
However, as clearly illustrated in our benchmarking analyses, the predictive power of a method is not necessarily well represented by its scores on its training dataset even if a strict cross-validation procedure is used.This makes the validation process particularly challenging.Here we identified two main issues, which are the predictors' systematic asymmetry and their lack of generalization on mutations outside their training set.They are discussed below.
Lack of generalization.A major challenge in G b predictions is to distinguish between statistical relations that are datasetdependent and the 'true' ones that have a biological meaning.We would like to stress that, while physics-and evolution-based methods are at least partly equipped to tackle this problem, pure machine learning methods struggle to make this distinction.This can explain the particularly large performance drop on unknown mutations observed for most purely machine learning methods such as SAAMBE-3D and the good generalizability properties observed for methods that are totally or partly physics-based, such as BeAtMuSiC, BindProfX and FoldX.
The generalizability of a predictor must be tested on independent sets of mutations outside the training set.Sets of systematic mutations obtained by deep mutagenesis experiments, such as C380, have the advantage of not being impacted by literature biases.They are thus appropriate for validating and benchmarking predictions, even though their G b values are less accurate than those obtained by individual thermodynamic experiments.
Symmetry properties.Symmetry properties should be carefully checked when constructing a prediction model.One way to assess them is on the basis of the shift δ (Eq.( 6)).As a general rule, the symmetry of a predictor can be achieved by (1) using symmetric data during training by including all or a fraction of reverse mutations, as done in mCSM-PPI2 and MutaBind2; (2) enforcing symmetry in the predictor's mathematical model, as in [33]; and (3) applying symmetry-correction methods through, e.g. the symmetrization defined in Eq. (7).Method (1) is a good practice which, as we showed, can increase the generalizability of the predictions.Method (2) can help the predictor to be symmetric, but it is only applicable when the mathematical expression of the model is known.Method (3) is the easiest to implement, but is efficient only if the predictor is already robust to symmetry.
There are additional challenges that need to be addressed.First, further data on binding affinity and interactions need to be collected.Accurate G b thermodynamics data have not been systematically collected for the past 5 years, after SKEMPI 2.0's release.Also, deep mutagenesis data of binding affinity are currently generated at a high rate but need to be collected, curated and harmonized.Secondly, the interpretation of G b prediction models is an issue that we do not explore in this paper and that is not sufficiently discussed in the literature.Indeed, performance is not the only criterion for evaluating a prediction model.Insights into model interpretation can help gaining physical understanding of molecular recognition and protein-protein binding mechanisms.
Finally, there is a need for more independent assessments.We invite the community to set up blind challenges for the prediction of changes in protein-protein binding affinity upon mutations, similar to what has been done during the 26 th critical assessment of predicted interactions (CAPRI) experiment [55].These community-wide blind challenges provide important insights into whether and how different predictors achieve the targeted accuracy, and help drive the development of new methods.

Key Points
• Predicting the impact of mutations on protein-protein binding affinity has seen substantial progress over the past decade, but still faces challenging issues.• Although many predictors achieve good performance on their training set, even in cross validation, they usually struggle to generalize to unseen data.• Most predictors are biased, especially toward mutations that destabilize protein-protein complexes, as their training sets are dominated by them.• Further strategies to limit biases are proposed to improve prediction performance.• Current machine learning-based approaches suffer more from training set overfitting issues than physicsbased methods which generally demonstrate better generalizability properties.

Figure 1 .
Figure 1.Characteristics of the S2536 dataset.(A) Number of occurrences of mutation types; (B) Distribution of the experimental G b values (in kcal/mol).

Figure 2 .
Figure 2. Pearson correlations ρ between experimental and predicted G b values on direct (in blue) and reverse (in orange) mutations of S2536 (left) and C380 (right).

Figure 4 .
Figure 4. Relation between the covering ratio CR and the Pearson correlation ρ between predicted and experimental G b values on the S2536-D set for six benchmarked predictors.The linear regression line (dashed) and coefficient of determination (R 2 ) are indicated.

Figure 5 .
Figure 5. Distribution of the shift δ (in kcal/mol) for the eight benchmarked predictors calculated for mutations from C380.The vertical blue dashed lines indicate δ = 0 and the vertical red dashed lines, the value of δ .

Table 1 :
Performances of the eight benchmarked predictors measured by the Pearson correlation (ρ), the Spearman rank correlation (r) and RMSE on the datasets S2536-D, S2536-R, C380-D and C380-R

Table 2 :
Year R, we further explored the predictors' bias toward destabilizing mutations; by comparing performances on mutations from S2536 and C380, we estimated the dataset dependence of the predictors.Predicted values and performance metrics on C380 are available on https://github.com/3BioCompBio/DDGb_biasandpredictionsare graphically represented in Supplementary FigureS-4.