Abstract

Motivation

Studies have shown that the accuracy of random forest (RF)-based scoring functions (SFs), such as RF-Score-v3, increases with more training samples, whereas that of classical SFs, such as X-Score, does not. Nevertheless, the impact of the similarity between training and test samples on this matter has not been studied in a systematic manner. It is therefore unclear how these SFs would perform when only trained on protein-ligand complexes that are highly dissimilar or highly similar to the test set. It is also unclear whether SFs based on machine learning algorithms other than RF can also improve accuracy with increasing training set size and to what extent they learn from dissimilar or similar training complexes.

Results

We present a systematic study to investigate how the accuracy of classical and machine-learning SFs varies with protein-ligand complex similarities between training and test sets. We considered three types of similarity metrics, based on the comparison of either protein structures, protein sequences or ligand structures. Regardless of the similarity metric, we found that incorporating a larger proportion of similar complexes into the training set did not make classical SFs more accurate. In contrast, RF-Score-v3 was able to outperform X-Score even when trained on just 32% of the most dissimilar complexes, showing that its superior performance owes considerably to learning from training complexes that are dissimilar to those in the test set. In addition, we generated the first SF employing Extreme Gradient Boosting (XGBoost), XGB-Score, and observed that it also improves with training set size while outperforming the rest of the SFs. Given the continuous growth of training datasets, the development of machine-learning SFs has become very appealing.

Availability and implementation

https://github.com/HongjianLi/MLSF

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

A key question in structural bioinformatics is how to predict the binding affinity of protein-ligand complexes accurately. Such prediction is performed by a scoring function (SF), which mathematically relates the X-ray crystal structures of protein-ligand complexes to their binding affinities. A plethora of SFs have been devised, which have been methodologically classified into two major categories: classical SFs and machine-learning SFs (Ain et al., 2015). Classical SFs are those relying on expert knowledge. These SFs normally assume a predetermined theory-motivated functional form for the relationship between the variables characterizing the complex and its binding affinity. In practice, this takes the form of linear regression using a small number of expert-selected structural features. On the other hand, machine-learning SFs circumvent the linearity assumption by not imposing a particular functional form, but instead learning it from data (Ballester and Mitchell, 2010). These SFs are thus capable of implicitly capturing non-linear binding interactions that are difficult to characterize explicitly. A notable shift towards machine-learning SFs has been seen in recent years, as these SFs yield more accurate prediction of binding affinity across targets than classical SFs. A range of machine-learning algorithms have been employed, such as random forest (RF) (Ballester and Mitchell, 2011; Li et al., 2015a,b; Ballester et al., 2014; Li et al., 2016; Zilian and Sotriffer, 2013), support vector machine (Li et al., 2011; Ballester, 2012; Sun et al., 2016; Zhan et al., 2014), neural network (NN) (Durrant and McCammon, 2010, 2011; Durrant et al., 2013, 2015) or deep NN (Cang and Wei, 2017; Jiménez et al., 2018; Stepniewska-Dziubinska et al., 2018; Imrie et al., 2018).

It has recently been claimed (Li and Yang, 2017) that the superior performance of machine-learning SFs is exclusively due to the presence of training complexes with similar proteins to those in the test set. The authors utilized the PDBbind v2007 refined set (Cheng et al., 2009) and employed both structural similarity and sequence similarity to quantify the degree of similarity between the proteins in the training set and those in the test set. With different similarity cutoff thresholds, they designed 26 nested training sets with structural similarity and 28 nested training sets with sequence similarity. These two series of nested training sets were ordered in two directions, either ranging from small sets of highly dissimilar proteins to large sets that also include highly similar proteins, or ranging from small sets of highly similar proteins to large sets that also include highly dissimilar proteins. Either way, a larger dataset includes all the complexes from the smaller datasets. Trained and evaluated on such purposely designed data partitions were X-Score (Cheng et al., 2009) and the first version of RF-Score (Ballester and Mitchell, 2010), which were selected as representatives of classical and machine-learning SFs, respectively.

Using the same definition of similarity, the same data partitions and the same SFs, we re-analyzed the question of how protein structural and sequence similarity impacts the accuracy of machine-learning SFs for binding affinity prediction (Li et al., 2018). We mostly reached different conclusions than those in (Li and Yang, 2017). For instance, we found that the performance of machine-learning SFs is not at all exclusively due to learning from the most similar training samples. As a byproduct of this reanalysis, we found that the accuracy of X-Score does not grow with an increasing volume of training complexes whose proteins are similar to those in the test set, whereas that of RF-Score does. The accuracies of two other classical SFs, Cyscore (Cao and Li, 2014) and AutoDock Vina (Trott and Olson, 2010), have been shown to remain unaltered with increasing training set size in previous studies (Li et al., 2014; Li et al., 2015a,b). This constitutes a fundamental limitation of classical SFs: imposing an additive functional form on the SF results in early stagnation in accuracy. However, it is still uncertain whether the accuracies of other classical SFs are affected in the same way as X-Score with respect to training-test protein similarities. Furthermore, it is not clear whether SFs based on machine learning algorithms other than RF can also increase their accuracy with additional training data in this context. Thus, we will investigate how the accuracy of Extreme Gradient Boosting (XGBoost), a state-of-the-art machine learning technique (Chen and Guestrin, 2016), varies with training set size. Another open question is to what extent these SFs learn from training complexes with highly dissimilar proteins. Nested training sets can also be sorted in the opposite direction, i.e. from small sets of highly similar proteins to large sets that also include highly dissimilar proteins. This is required to better understand how well these SFs can exploit the most relevant data. Lastly, nested training sets have so far only been generated using protein similarity metrics. However, similarity between complexes also depends on how similar their ligands are, not only their proteins. Thus, the question of how similarities between training and test ligands affect SF performance also remains to be addressed. Here we present a systematic study to investigate all these questions. In addition, we present XGB-Score, the first SF based on XGBoost.

2 Materials and methods

The 195 diverse protein-ligand complexes in the PDBbind v2007 core set were kept for testing, which is a common practice, and the other 1105 complexes in the refined set were subdivided into multiple nested training sets according to their pairwise structural and sequence similarity cutoffs. The nested training sets used in this study were the same as those in Li and Yang (2017), except for including two more cutoffs leading to additional training sets of even fewer complexes, in order to evaluate the performance of the considered SFs when trained on no more than 60 complexes. More precisely, 28 structural similarity cutoffs from 0.3 to 1.0 and 30 sequence similarity cutoffs from 0.2 to 1.0 were employed to define nested training sets. The complete set of cutoff values can be found in Supplementary Tables S1 and S2. In addition to the two protein similarity metrics (Li and Yang, 2017), a third metric was employed to account for the similarity between ligands. Each ligand is described by its Morgan fingerprint counts and physico-chemical properties, as this combination was found to have the highest predictive value in a related problem (Sidorov et al., 2018). These three metrics take values between 0 and 1, with a higher value indicating a higher similarity (a more detailed description can be found in Supplementary Material).
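To make the data partitioning concrete, the sketch below builds nested training sets from a per-complex similarity score to the test set. It is a minimal illustration only: it uses Tanimoto similarity between Morgan count fingerprints, whereas the ligand metric used in this study additionally folds in physico-chemical properties (the actual implementation is available from the repository listed in the Availability section), and the function names and the "set at cutoff c contains all complexes with similarity ≤ c" convention are illustrative assumptions.

```python
# Illustrative sketch (not the exact metric used in this study): (i) a per-complex
# similarity score to the test set from Morgan count fingerprints and (ii) nested
# training sets defined by increasing similarity cutoffs.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_count_fp(smiles, radius=2):
    """Count-based Morgan fingerprint of a ligand given as a SMILES string."""
    return AllChem.GetMorganFingerprint(Chem.MolFromSmiles(smiles), radius)

def similarity_to_test_set(train_smiles, test_smiles):
    """For each training ligand, the maximum Tanimoto similarity to any test ligand."""
    test_fps = [morgan_count_fp(s) for s in test_smiles]
    return [max(DataStructs.TanimotoSimilarity(morgan_count_fp(s), t) for t in test_fps)
            for s in train_smiles]

def nested_training_sets(train_ids, sims, cutoffs):
    """Nested sets: the set at cutoff c holds every training complex whose similarity
    to the test set is <= c, so each larger set contains all smaller ones."""
    return {c: [i for i, s in zip(train_ids, sims) if s <= c] for c in sorted(cutoffs)}
```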

Three regression algorithms were evaluated and compared, upon which the considered SFs were built. Multiple linear regression (MLR) attempts to model the relationship between two or more explanatory variables (e.g. molecular descriptors) and a response variable (e.g. binding affinity) by fitting a linear equation to observed data. The use of MLR is characteristic of classical SFs. In machine learning, RF is an ensemble learning method that operates by constructing a multitude of decision trees at training time and outputting the mean prediction of the individual trees (Breiman, 2001). XGBoost is an implementation of gradient boosted decision trees designed for speed and performance (Chen and Guestrin, 2016). Among other successful applications, XGBoost has been found to excel at ligand-based modeling of protein target binding (Sheridan et al., 2016).
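As a minimal illustration of these three algorithms, the sketch below fits an MLR, an RF and an XGBoost regressor with off-the-shelf implementations on placeholder random data; the array shapes and hyperparameters are illustrative assumptions, not the settings used in this study.

```python
# Minimal sketch of the three regression algorithms compared in this study,
# fitted on placeholder data; hyperparameters and shapes are illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X_train, y_train = rng.random((1105, 42)), rng.random(1105) * 10  # descriptors, affinities (pKd)
X_test = rng.random((195, 42))

models = {
    "MLR": LinearRegression(),                                      # linear, as in classical SFs
    "RF": RandomForestRegressor(n_estimators=500, random_state=0),  # ensemble of bagged trees
    "XGB": XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=6, random_state=0),
}

predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)                # learn the descriptor -> affinity mapping
    predictions[name] = model.predict(X_test)  # predicted affinities for the test complexes
```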

Building upon either MLR, RF or XGBoost, nine SFs were evaluated and compared. Three of them, X-Score (Cheng et al., 2009), AutoDock Vina (Trott and Olson, 2010) and Cyscore (Cao and Li, 2014), fall into the category of classical SFs, which assume an additive functional form and use MLR to correlate the binding poses with the binding affinities. X-Score employs four empirical descriptors (VDW, H-Bond, Hydrophobic, Rotor). Similarly, Vina employs five descriptors (Gauss1, Gauss2, Repulsion, Hydrophobic, HBonding), and Cyscore employs four descriptors (hydrophobic, vdw, hbond, entropy). The next three SFs, RF-Score (Ballester and Mitchell, 2010), RF-Score-v3 (Li et al., 2015a,b) and a new XGBoost-based SF (presented here and denoted as XGB-Score), are machine-learning SFs employing an adaptive functional form that is inferred from the data. Lastly, RF variants of the three classical SFs, denoted as RF::X-Score, RF::Vina and RF::Cyscore, were also implemented and assessed to see how their performance would compare to their respective linear regression counterparts. RF-Score (Ballester and Mitchell, 2010) used 36 intermolecular atomic distance counts as descriptors, whereas RF-Score-v3 (Li et al., 2015a,b) and XGB-Score added six more descriptors from AutoDock Vina (Trott and Olson, 2010). Since these machine-learning SFs are stochastic, 10 instances were built for each cutoff (each using a different random seed) and their average performance is reported.
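For readers unfamiliar with the RF-Score descriptors, the sketch below computes the 36 intermolecular atomic distance counts as occurrence counts of protein-ligand element pairs within a 12 Å cutoff, following the original RF-Score publication. It assumes atom coordinates and element symbols have already been parsed from the complex; structure-file parsing and the six extra Vina terms used by RF-Score-v3 and XGB-Score are omitted, and the function name is illustrative.

```python
# Sketch of RF-Score-style descriptors: occurrence counts of protein-ligand
# element pairs within 12 A (4 protein x 9 ligand element types = 36 features).
import numpy as np

PROTEIN_ELEMENTS = ["C", "N", "O", "S"]
LIGAND_ELEMENTS = ["C", "N", "O", "F", "P", "S", "Cl", "Br", "I"]

def distance_count_features(prot_xyz, prot_elem, lig_xyz, lig_elem, cutoff=12.0):
    """36-dimensional vector of protein-ligand element-pair contact counts."""
    features = np.zeros(len(PROTEIN_ELEMENTS) * len(LIGAND_ELEMENTS))
    # pairwise Euclidean distances between all protein and all ligand heavy atoms
    dists = np.linalg.norm(prot_xyz[:, None, :] - lig_xyz[None, :, :], axis=-1)
    for i, j in zip(*np.where(dists <= cutoff)):
        try:
            p = PROTEIN_ELEMENTS.index(prot_elem[i])
            q = LIGAND_ELEMENTS.index(lig_elem[j])
        except ValueError:
            continue  # skip element types outside the 4 x 9 table
        features[p * len(LIGAND_ELEMENTS) + q] += 1
    return features
```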

These comparative assessments have been performed on large sets of protein-ligand complexes with curated crystal structures and binding measurements (Cheng et al., 2009). These datasets have been preferred for studying properties of SFs because they permit the most direct assessment of performance against experimental data, thus minimizing the number of confounding factors in the assessment. As usual, the scoring power of the considered SFs was evaluated by the Pearson correlation coefficient (Rp), the Spearman correlation coefficient (Rs), and the root-mean-square error (RMSE) between the predicted and experimental binding affinities. Higher values of Rp and Rs and lower values of RMSE indicate better predictive performance.
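These three metrics can be computed directly from the predicted and measured affinities; a short sketch (the function name is illustrative):

```python
# Scoring-power metrics used throughout this study, given predicted and
# experimentally measured binding affinities (pKd units).
import numpy as np
from scipy.stats import pearsonr, spearmanr

def scoring_power(y_pred, y_true):
    rp = pearsonr(y_pred, y_true)[0]    # Pearson correlation coefficient
    rs = spearmanr(y_pred, y_true)[0]   # Spearman rank correlation coefficient
    rmse = np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))
    return rp, rs, rmse
```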

3 Results

Figure 1 shows the scoring power (in terms of Rp) of the nine considered SFs trained on nested datasets ranging from small sets of dissimilar proteins to large sets that also include similar proteins. The subfigures are grouped by the employed similarity metric into two columns, with structural similarity on the left and sequence similarity on the right. The curves for RF-Score, RF-Score-v3 and XGB-Score are shown in each plot to better appreciate their performance gains over classical SFs. In the case of structural similarity, the performance of not only X-Score but also Vina and Cyscore stagnated with as few as 116 training complexes (only 10% of the full 1105 training complexes), regardless of whether the other 90% of more similar complexes were incorporated for training. The same is true in the case of sequence similarity, where their Rp values leveled off with as few as 181 training complexes. These results show that classical SFs are unable to exploit large volumes of structural and interaction data, contrary to the common belief that adding more similar training data should help increase their performance (Pham and Jain, 2008). In contrast, RF-Score-v3 and XGB-Score kept learning and progressively outperformed all three classical SFs with just ∼500 training complexes containing proteins dissimilar to those in the test set. This superiority is even more apparent in the comparison with Vina (the two subfigures in the middle row), where RF-Score-v3 and XGB-Score required merely 43 highly dissimilar training complexes to surpass Vina. This result implies that highly dissimilar training samples may be valuable when properly combined with an appropriate learning algorithm, and that such dissimilar training data could actually contribute a substantial part of the outstanding performance of machine-learning SFs. When the entire 1105 complexes were used for training, the performance gap between machine-learning and classical SFs became substantial, i.e. XGB-Score, RF-Score-v3, X-Score, Vina and Cyscore produced an average-over-10-instances Rp value of 0.806, 0.800, 0.643, 0.596 and 0.657, respectively. The best overall performance was 0.815 by XGB-Score. Note that X-Score and Cyscore are the most accurate classical SFs on this benchmark (Cao and Li, 2014; Cheng et al., 2009), with 15 other classical SFs obtaining much lower Rp values ranging from 0.22 to 0.57 (Cheng et al., 2009).

Fig. 1. Training sets with increasingly similar training instances based on sequence/structure similarity cutoffs. Test set performance (in terms of Rp) of nine SFs. Each SF is trained with nested datasets at different similarity cutoffs, ranging from small sets of highly dissimilar proteins to large sets that also include highly similar proteins. Each point represents the average test Rp of 10 runs of the SF with that training set, each run using a different random seed (the standard deviation of these Rp values is too small to be appreciated). Left column: structural similarity. Right column: sequence similarity. Top row: comparing X-Score to RF-Score, RF-Score-v3 and XGB-Score. Middle row: comparing Vina to RF-Score, RF-Score-v3 and XGB-Score. Bottom row: comparing Cyscore to RF-Score, RF-Score-v3 and XGB-Score. In all plots, the curves for RF-Score, RF-Score-v3 and XGB-Score are shown for comparative purposes.

The comparison of classical SFs with their respective machine-learning variants is revealing. In each case, we merely substitute RF for MLR, while retaining exactly the same descriptors and datasets. In this way, any performance difference will necessarily come from the algorithmic replacement. The performance curves of these RF variants, denoted as RF::X-Score, RF::Vina and RF::Cyscore, are plotted in green in all the figures. Results show that these RF variants might not perform as well as RF-Score or RF-Score-v3, and sometimes performed even worse than their respective classical counterparts when given insufficient training complexes (fewer than roughly 900), but still managed to keep improving performance with more training data and eventually overtook the classical SFs when given the full 1105 complexes for training. Although the above conclusions are drawn from the results of predictive performance in terms of Rp correlation, an analogous phenomenon can be observed when inspecting Rs correlation (Supplementary Fig. S1).

Both Rp and Rs reflect the degree to which predicted binding affinities are correlated with known experimental affinities. The RMSE metric offers a complementary view of predictive performance. Figure 2 preserves the same plotting layout as Figure 1 but shows RMSE instead of Rp. It is commonly expected that more training samples (especially those similar to the test set) lead to higher predictive accuracy and a lower error rate (Pham and Jain, 2008). However, we do not observe this outcome for classical SFs. As in the case of Rp, here the RMSE curves of classical SFs stopped descending and stayed flat after reaching ∼400 training complexes. Incidentally, just about 400 dissimilar training complexes were all that RF-Score-v3 and XGB-Score required to generate an RMSE value lower than that of any of the three classical SFs. Likewise, when the training set was finally expanded to cover the whole 1105 complexes, the performance gap became large. All the assessed machine-learning SFs, including RF-Score and the RF implementations of the classical SFs, demonstrated an attractive capability of continuously reducing prediction errors as more similar samples were added for training. This is indeed a crucial characteristic of machine-learning SFs: with larger volumes of structural and interaction data available in the future, the performance gap will continue to broaden, rendering classical SFs increasingly less attractive.

Fig. 2. Training sets with increasingly similar training instances based on sequence/structure similarity cutoffs. Test set performance (in terms of RMSE) of nine SFs. Each SF is trained with nested datasets at different similarity cutoffs, ranging from small sets of highly dissimilar proteins to large sets that also include highly similar proteins. Each point represents the average test RMSE of 10 runs of the SF with that training set, each run using a different random seed (the standard deviation of these RMSE values is too small to be appreciated). Left column: structural similarity. Right column: sequence similarity. Top row: comparing X-Score to RF-Score, RF-Score-v3 and XGB-Score. Middle row: comparing Vina to RF-Score, RF-Score-v3 and XGB-Score. Bottom row: comparing Cyscore to RF-Score, RF-Score-v3 and XGB-Score. In all plots, the curves for RF-Score, RF-Score-v3 and XGB-Score are shown for comparative purposes.

Both Figures 1 and 2 illustrate how well the considered SFs exploited training complexes formed by proteins that were initially highly dissimilar to those in the test set and then gradually expanded to incorporate similar proteins as well. How would the SFs compare if the training set were first constructed with the most similar proteins only and progressively grown to also include dissimilar ones? This opposite direction in the generation of nested training sets has only been considered with the first version of RF-Score, so this comparison remained to be made. To this end, we repeated the experiments and the results are shown in Figure 3. It is noteworthy that under no circumstances did any of the three classical SFs (dark blue line in each plot) outperform any of the six machine-learning SFs. With just the ∼320 training complexes with the most similar proteins, the Rp curve for RF-Score-v3 already rises to 0.767, suggesting that the amount of knowledge RF-Score-v3 learns from similar training data is substantially greater than that from dissimilar training data (RF-Score-v3 obtained an Rp value of 0.639 when trained on the 330 most dissimilar complexes; see Fig. 1). A simple algorithmic substitution of RF for linear regression (the green curves), which requires only a minimal effort without even changing the descriptors, has already led to considerably better performance. Utilizing XGBoost (light blue curves) resulted in even better performance in most cases. Analogous conclusions can be drawn when inspecting Rs and RMSE (Supplementary Figs S2 and S3).

Fig. 3. Training sets with increasingly dissimilar training instances based on sequence/structure similarity cutoffs. Test set performance (in terms of Rp) of nine SFs. Each SF is trained with nested datasets at different similarity cutoffs, ranging from small sets of highly similar proteins to large sets that also include highly dissimilar proteins. Each point represents the average test Rp of 10 runs of the SF with that training set, each run using a different random seed (the standard deviation of these Rp values is too small to be appreciated). Left column: structural similarity. Right column: sequence similarity. Top row: comparing X-Score to RF-Score, RF-Score-v3 and XGB-Score. Middle row: comparing Vina to RF-Score, RF-Score-v3 and XGB-Score. Bottom row: comparing Cyscore to RF-Score, RF-Score-v3 and XGB-Score. In all plots, the curves for RF-Score, RF-Score-v3 and XGB-Score are shown for comparative purposes.

The similarity between a pair of protein-ligand complexes has so far been calculated exclusively from the two proteins, without considering the ligands. Alternatively, one could look at the question of how test set performance varies with the similarity between the ligands in the training and test complexes. Figure 4 shows the results of this experiment in terms of test RMSE (Supplementary Figs S4 and S5 show the corresponding Rs and Rp plots). As with protein similarity, classical SFs stagnated early, unlike machine-learning SFs. Thus, just about 400 training complexes with the most dissimilar ligands were required by RF-Score-v3 and XGB-Score to achieve an RMSE value lower than that of any of the three classical SFs (left column in Fig. 4). When the training set was first constructed with the most similar ligands only and progressively grown to also include dissimilar ligands, under no circumstances did any of the three classical SFs outperform any of the six machine-learning SFs (right column in Fig. 4).

Fig. 4. Training sets with increasingly similar training instances based on ligand similarity cutoffs. Test set performance (in terms of RMSE) of nine SFs. Left column: Each SF is trained with nested datasets at different similarity cutoffs, ranging from small sets of highly dissimilar ligands to large sets that also include highly similar ligands. Each point represents the average test RMSE of 10 runs of the SF with that training set, each run using a different random seed (the standard deviation of these RMSE values is too small to be appreciated). Right column: Each SF is trained on nested datasets of increasing ligand dissimilarity, from small sets of highly similar ligands to large sets that also include highly dissimilar ligands. Top row: comparing X-Score to RF-Score, RF-Score-v3 and XGB-Score. Middle row: comparing Vina to RF-Score, RF-Score-v3 and XGB-Score. Bottom row: comparing Cyscore to RF-Score, RF-Score-v3 and XGB-Score. In all plots, the curves for RF-Score, RF-Score-v3 and XGB-Score are shown for comparative purposes.

The main difference with respect to using protein similarities to generate training sets is that XGB-Score performs substantially worse in the range of 100–300 training complexes when using ligand similarity (compare plots in Fig. 2 with those in Fig. 4’s left column). XGB generates a sequence of regression trees, where each tree is built to correct the prediction of previous trees. The performance drop suggests that this optimization procedure is less effective when the training set is more diverse. Indeed, highly dissimilar ligands are more diverse than highly dissimilar proteins, as the latter are constrained by their polymeric structure and small repertoire of amino acid residues. In contrast, small-molecule ligands do not have these constraints, which results in a combinatorial explosion in their numbers and thus their diversity.
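To clarify the sequential correction mechanism mentioned above, the stripped-down boosting loop below fits each new regression tree to the residuals of the current ensemble under a squared-error objective. XGBoost adds regularized objectives and many engineering refinements on top of this basic scheme, so this is an illustration of the principle rather than the actual XGB-Score training procedure; all names are illustrative.

```python
# Stripped-down gradient boosting for squared-error regression: each new tree
# fits the residuals (negative gradients) of the current ensemble prediction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    prediction = np.full(len(y), y.mean())     # start from the mean affinity
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction             # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return y.mean(), trees

def predict(base, trees, X, learning_rate=0.1):
    """Sum the shrunken contributions of all trees on top of the base prediction."""
    return base + learning_rate * np.sum([t.predict(X) for t in trees], axis=0)
```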

Finally, we compare the test set performance of the best machine-learning SF (XGB-Score) with that of the best classical SF (Cyscore), both exploiting the full training set. The two complexes with the largest errors by Cyscore are 7cpa (carboxypeptidase A in complex with a phosphonate) and 2f01 (streptavidin in complex with epi-biotin), whose binding affinities this SF underestimates by 6.2 and 6.0 pKd units, respectively. In comparison, XGB-Score underestimates them by just 0.3 (2f01) and 4.5 (7cpa) pKd units. To see how common this situation is, we present plots of predicted versus measured binding affinity for each SF (Fig. 5, top).

Fig. 5. Predictive performance of the best machine-learning SF (XGB-Score) against that of the best classical SF (Cyscore) using the entire 1105 training complexes. Top left: Cyscore-predicted versus measured binding affinity of the 195 test complexes. Top right: same plot with XGB-Score predictions instead. Bottom: absolute error of both SFs per test complex, each complex colored by its measured affinity.

Both SFs underestimate the binding affinities of practically all tightly bound complexes (those with >10 pKd units) and overestimate those of weakly bound complexes (those with <4 pKd units). However, such errors are substantially more acute in Cyscore, with an RMSE of 1.86, than in XGB-Score, with an RMSE of 1.43 (the bottom plot in Fig. 5 also shows the error per test complex). To shed light on why XGB-Score works better than Cyscore, we have to revisit Figure 2 (bottom right plot). When trained with the 181 complexes with the most dissimilar proteins, both SFs obtain practically the same RMSE as Cyscore trained with the full training set, with similar numbers of complexes with overestimated and underestimated affinities (Supplementary Fig. S6). However, by training XGB with the remaining 924 complexes, which include highly similar proteins, these errors are substantially reduced (Fig. 5, top right). Therefore, the reason why XGB-Score works better than Cyscore is that XGB is able to learn from the most similar complexes. The latter suggests that the relationship between binding affinity and protein-ligand features is non-linear and hence inadequately captured by linear regression.

4 Discussion

Accurate prediction of protein-ligand binding affinity has vital applications in molecular recognition and structure-based drug design. Intensive efforts have been put into the development of comprehensive benchmarks as well as SFs that utilize novel features (or descriptors), or novel definitions of common features, in an attempt to advance predictive performance. X-Score (Cheng et al., 2009) is unquestionably a well-established SF. It employs a set of expert-selected empirical features (e.g. van der Waals force, hydrogen bonding, etc.) and a linear regression model to calibrate the coefficients of a predetermined additive functional form. In this line of work, innovation lies in proposing new features rather than new regression methods. Following this trend are a number of SFs developed subsequently, such as AutoDock Vina (Trott and Olson, 2010) and Cyscore (Cao and Li, 2014), which still rely on linearity assumptions despite introducing new and improved computations of common descriptors. These SFs are commonly referred to as classical SFs. In this study we selected X-Score, Vina and Cyscore as references for comparison because they are among the best performing classical SFs.

Alternatively, following another trend, many newly developed SFs innovate by utilizing machine-learning techniques to automatically learn the functional form from the data. RF-Score (Ballester and Mitchell, 2010) was the first machine-learning SF that exhibited substantially better accuracy than classical SFs. Subsequently, performance gains have been observed for many existing SFs after embracing machine-learning methods, to name a few, Vina (Trott and Olson, 2010), Cyscore (Cao and Li, 2014) and SFCscore (Zilian and Sotriffer, 2013). In fact, a growing body of recent research in the assessment of binding affinity prediction has established the motivation and advantage of preferring machine-learning SFs to classical ones (Ain et al., 2015). Although the performance of machine-learning SFs has been recurrently shown to increase with more training data, the similarity between the training and test data is seldom studied, and thus it remains unknown to what extent such training-test similarity or dissimilarity impacts the performance of classical and machine-learning SFs.

To address this issue, a benchmark using nested training sets with different protein similarity cutoffs was introduced (Li and Yang, 2017). In this way, a larger set includes all the complexes from a smaller set plus some new complexes with similar or dissimilar proteins to those in the test set. Using this benchmark, its authors claimed that the remarkable performance of machine-learning SFs is exclusively due to learning from the most similar training samples (Li and Yang, 2017), but this has been shown not to be the case (Li et al., 2018). As a side product of that study (Li et al., 2018), X-Score was found to suffer from early stagnation of performance even when given additional similar complexes for training. This fortuitous finding motivated us to evaluate additional SFs in the same manner on the same data partitions.

In this study, we have seen that not only X-Score, but also Vina and Cyscore, suffered from early stagnation in performance. In other words, incorporating a larger quantity of similar complexes into the training set did not make these classical SFs more accurate. In fact, the finding that classical SFs are unable to exploit large volumes of structural and interaction data is not yet widely appreciated. Their performance fluctuated noticeably when initially trained on highly similar samples and then on more dissimilar samples (Fig. 3). Strikingly, these SFs performed even worse when trained on the ∼500 most similar samples than when trained on the ∼500 most dissimilar samples (comparing Figs 1 and 3), suggesting that linear regression can hardly capture the highly non-linear relationship between a binding pose and its affinity. As we have emphasized in related studies, assuming an additive functional form and depending on linear regression constitute an essential constraint of classical SFs. Similar results are observed when using ligand similarity to generate the nested training sets.

On the other hand, RF-Score-v3 and XGB-Score outperformed X-Score, Vina and Cyscore even when the 689 complexes with the most structurally similar proteins were not included in the training set (Figs 1 and 2). In other words, machine-learning SFs only required a small part of the full training set to surpass classical SFs. Training complexes with proteins highly dissimilar to those in the test set can be valuable if appropriately exploited by a machine-learning SF. It is surprising to observe that the performance of RF-Score peaked at about 520 complexes rather than the full 1105 complexes (see Fig. 3, right column). This early peak also appears when considering nested training sets sorted by protein structure similarity (see Fig. 3, left column) or other machine-learning SFs. In particular, RF-Score-v3 and XGB-Score also present this peak at ∼550 complexes, but it represents a lower performance than using all training data. The latter suggests that machine-learning SFs can only reach their utmost performance when allied with the most informative features. Overall, this peak seems to be due to a certain tradeoff between the size of the training set and its relevance to the test set.

To see whether these conclusions would be generalizable to machine-learning SFs other than those based on RF, we employed XGBoost and implemented the first XGBoost-based SF, XGB-Score. Results with this second machine-learning algorithm suggest that the conclusions will be generalizable to any algorithm able to learn effectively from large volumes of protein-ligand complexes. Indeed, XGBoost being a cutting-edge machine-learning algorithm, XGB-Score even outperformed RF-Score-v3 in most cases (see Figs 1–3).

When trained with the full training set, both XGB-Score and Cyscore underestimate the binding affinities of tightly bound complexes and also overestimate those of weakly bound complexes. However, such errors are substantially more acute with Cyscore. This is due to XGB being able to keep learning and further improve with the most similar complexes, unlike Cyscore.

The protein structure and sequence similarity metrics introduced by Li and Yang (2017) have been used to characterize the learning differences between classical and machine-learning SFs. This is of course not the only way in which one can relate training and test complexes, which opens several directions for future research. For example, these protein similarity metrics are of a global nature, i.e. they consider the whole protein structure when calculating structural similarity and the whole protein sequence when calculating sequence similarity. Since the binding of a ligand to its intended protein is mostly determined by the local environment of the binding pocket, binding site similarity metrics could also be used to analyze this question.

5 Conclusions

We present a systematic study of binding affinity prediction using three classical SFs and six machine-learning SFs, evaluated on multiple series of nested datasets based on protein structure and sequence similarity as well as ligand similarity, in two similarity directions, seeking evidence of how well the SFs learn from data. In doing so, we have presented a new SF, XGB-Score, which applies XGBoost to this problem for the first time and outperforms all other evaluated SFs. We show that classical SFs are unable to exploit large volumes of structural and interaction data, whereas machine-learning SFs assimilate training data instances better. This large performance gap will hence widen as more data becomes available. Substituting machine-learning techniques for linear regression helps to create more accurate SFs across targets.

Acknowledgements

We thank Yang Li and Jianyi Yang for providing us the X-Score prediction values and its four energetic terms of the 1300 protein-ligand complexes in the PDBbind v2007 refined set.

Funding

This study was supported by the Vice-Chancellor’s One-off Discretionary Fund, Faculty of Social Science Postdoctoral Fellowship Scheme and Institute of Future Cities, The Chinese University of Hong Kong as well as an ANR Tremplin-ERC [grant number ANR-17-ERC2-0003-01 (P.J.B.)].

Conflict of Interest: none declared.

References

Ain, Q.U. et al. (2015) Machine-learning scoring functions to improve structure-based binding affinity prediction and virtual screening. Wiley Interdiscip. Rev. Comput. Mol. Sci., 5, 405–424.

Ballester, P.J. et al. (2014) Does a more precise chemical description of protein-ligand complexes lead to more accurate prediction of binding affinity? J. Chem. Inf. Model., 54, 944–955.

Ballester, P.J. (2012) Machine learning scoring functions based on random forest and support vector regression. Lect. Notes Bioinformatics, 7632, 14–25.

Ballester, P.J., Mitchell, J.B.O. (2010) A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking. Bioinformatics, 26, 1169–1175.

Ballester, P.J., Mitchell, J.B.O. (2011) Comments on 'leave-cluster-out cross-validation is appropriate for scoring functions derived from diverse protein data sets': significance for the validation of scoring functions. J. Chem. Inf. Model., 51, 1739–1741.

Breiman, L. (2001) Random forests. Mach. Learn., 45, 5–32.

Cang, Z., Wei, G.-W. (2017) TopologyNet: topology based deep convolutional and multi-task neural networks for biomolecular property predictions. PLOS Comput. Biol., 13, e1005690.

Cao, Y., Li, L. (2014) Improved protein-ligand binding affinity prediction by using a curvature dependent surface area model. Bioinformatics, 30, 1674–1680.

Chen, T., Guestrin, C. (2016) XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '16. ACM Press, New York, NY, USA, pp. 785–794.

Cheng, T. et al. (2009) Comparative assessment of scoring functions on a diverse test set. J. Chem. Inf. Model., 49, 1079–1093.

Durrant, J.D. et al. (2013) Comparing neural-network scoring functions and the state of the art: applications to common library screening. J. Chem. Inf. Model., 53, 1726–1735.

Durrant, J.D. et al. (2015) Neural-network scoring functions identify structurally novel estrogen-receptor ligands. J. Chem. Inf. Model., 55, 1953–1961.

Durrant, J.D., McCammon, J.A. (2010) NNScore: a neural-network-based scoring function for the characterization of protein-ligand complexes. J. Chem. Inf. Model., 50, 1865–1871.

Durrant, J.D., McCammon, J.A. (2011) NNScore 2.0: a neural-network receptor-ligand scoring function. J. Chem. Inf. Model., 51, 2897–2903.

Imrie, F. et al. (2018) Protein family-specific models using deep neural networks and transfer learning improve virtual screening and highlight the need for more data. J. Chem. Inf. Model., 58, 2319–2330.

Jiménez, J. et al. (2018) KDEEP: protein-ligand absolute binding affinity prediction via 3D-convolutional neural networks. J. Chem. Inf. Model., 58, 287–296.

Li, H. et al. (2016) Correcting the impact of docking pose generation error on binding affinity prediction. BMC Bioinformatics, 17, 308.

Li, H. et al. (2015a) Improving AutoDock Vina using random forest: the growing accuracy of binding affinity prediction by the effective exploitation of larger data sets. Mol. Inform., 34, 115–126.

Li, H. et al. (2015b) Low-quality structural and interaction data improves binding affinity prediction via random forest. Molecules, 20, 10947–10962.

Li, H. et al. (2014) Substituting random forest for multiple linear regression improves binding affinity prediction of scoring functions: Cyscore as a case study. BMC Bioinformatics.

Li, H. et al. (2018) The impact of protein structure and sequence similarity on the accuracy of machine-learning scoring functions for binding affinity prediction. Biomolecules, 8.

Li, L. et al. (2011) Target-specific support vector machine scoring in structure-based virtual screening: computational validation, in vitro testing in kinases, and effects on lung cancer cell proliferation. J. Chem. Inf. Model., 51, 755–759.

Li, Y., Yang, J. (2017) Structural and sequence similarity makes a significant impact on machine-learning-based scoring functions for protein-ligand interactions. J. Chem. Inf. Model., 57, 1007–1012.

Pham, T.A., Jain, A.N. (2008) Customizing scoring functions for docking. J. Comput. Aided Mol. Des., 22, 269–286.

Sheridan, R.P. et al. (2016) Extreme gradient boosting as a method for quantitative structure-activity relationships. J. Chem. Inf. Model., 56, 2353–2360.

Sidorov, P. et al. (2018) Predicting synergism of cancer drug combinations using NCI-ALMANAC data. bioRxiv, 504076.

Stepniewska-Dziubinska, M.M. et al. (2018) Development and evaluation of a deep learning model for protein-ligand binding affinity prediction. Bioinformatics, 34, 3666–3674.

Sun, H. et al. (2016) Constructing and validating high-performance MIEC-SVM models in virtual screening for kinases: a better way for actives discovery. Sci. Rep., 6, 24817.

Trott, O., Olson, A.J. (2010) AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comput. Chem., 31, 455–461.

Zhan, W. et al. (2014) Integrating docking scores, interaction profiles and molecular descriptors to improve the accuracy of molecular docking: toward the discovery of novel Akt1 inhibitors. Eur. J. Med. Chem., 75, 11–20.

Zilian, D., Sotriffer, C.A. (2013) SFCscore(RF): a random forest-based scoring function for improved affinity prediction of protein-ligand complexes. J. Chem. Inf. Model., 53, 1923–1933.
