Cunliang Geng, Yong Jung, Nicolas Renaud, Vasant Honavar, Alexandre M J J Bonvin, Li C Xue, iScore: a novel graph kernel-based function for scoring protein–protein docking models, Bioinformatics, Volume 36, Issue 1, January 2020, Pages 112–121, https://doi.org/10.1093/bioinformatics/btz496
Abstract
Protein complexes play critical roles in many aspects of biological functions. Three-dimensional (3D) structures of protein complexes are critical for gaining insights into structural bases of interactions and their roles in the biomolecular pathways that orchestrate key cellular processes. Because of the expense and effort associated with experimental determinations of 3D protein complex structures, computational docking has evolved as a valuable tool to predict 3D structures of biomolecular complexes. Despite recent progress, reliably distinguishing near-native docking conformations from a large number of candidate conformations, the so-called scoring problem, remains a major challenge.
Here we present iScore, a novel approach to scoring docked conformations that combines HADDOCK energy terms with a score obtained using a graph representation of the protein–protein interfaces and a measure of evolutionary conservation. It achieves a scoring performance competitive with, or superior to, that of state-of-the-art scoring functions on two independent datasets: (i) docking software-specific models and (ii) the CAPRI score set generated by a wide variety of docking approaches (i.e. docking software-non-specific). iScore ranks among the top scoring approaches on the CAPRI score set (13 targets) when compared with the 37 scoring groups in CAPRI. The results demonstrate the utility of combining evolutionary, topological and energetic information for scoring docked conformations. This work represents the first successful application of graph kernels to protein interfaces for effective discrimination of near-native and non-native conformations of protein complexes.
The iScore code is freely available from GitHub: https://github.com/DeepRank/iScore (DOI: 10.5281/zenodo.2630567), and the docking models used are available from SBGrid: https://data.sbgrid.org/dataset/684.
Supplementary data are available at Bioinformatics online.
1 Introduction
Protein–protein interactions (PPIs) play a crucial role in most cellular processes and activities, such as signal transduction, immune response and enzyme catalysis. Getting insight into the three-dimensional (3D) structures of these protein–protein complexes is fundamental to understanding their functions and mechanisms (Aloy and Russell, 2006; Kiel et al., 2008). Due to the prohibitive cost and effort involved in the experimental determination of the structure of protein complexes (Shoemaker and Panchenko, 2007), computational modelling, and in particular docking, has established itself as a valuable complementary approach for obtaining insights into the structural basis of protein interactions, interfaces and complexes (Halperin et al., 2002; Huang, 2014; Melquiond et al., 2012; Rodrigues and Bonvin, 2014; Soni and Madhusudhan, 2017; Stein et al., 2011; Vangone et al., 2017).
Computational docking typically involves two steps (Halperin et al., 2002; Huang, 2014; Rodrigues and Bonvin, 2014; Soni and Madhusudhan, 2017): sampling, i.e. the search of the interaction space between two molecules to generate as many near-native models as possible; and scoring, i.e. the identification of near-native models out of the pool of sampled conformations. As shown in the community-wide Critical Assessment of PRediction of Interactions (CAPRI) (Lensink and Wodak, 2010, 2013; Lensink et al., 2007, 2017), scoring is still a major challenge in the field. There is thus still plenty of room to improve the scoring functions used in protein–protein docking (Moal et al., 2013; Vangone et al., 2017).
Scoring functions can be classified into three types: (i) physical energy term-based, (ii) statistical potential-based and (iii) machine learning-based. Physical energy-based scoring functions are usually a weighted linear combination of multiple energetic terms. These are widely used in many docking programs such as HADDOCK (Dominguez et al., 2003; Vangone et al., 2016), SwarmDock (Torchala et al., 2013), pyDock (Cheng et al., 2007; Grosdidier et al., 2007; Jiménez-García et al., 2013), ZDock (Pierce et al., 2014; Pierce and Weng, 2007) and ATTRACT (Zacharias, 2003). Taking HADDOCK as an example, its scoring function consists of intermolecular electrostatic and van der Waals energy terms combined with an empirical desolvation potential (Fernández-Recio et al., 2004), as well as a buried surface area (BSA)-based term depending on the stage of the protocol (Vangone et al., 2016). Statistical potential-based scoring functions, such as 3D-Dock (Moont et al., 1999), DFIRE (Zhou and Zhou, 2002) and SIPPER (Pons et al., 2011b), typically convert distance-dependent pairwise atom–atom or residue–residue contact distributions into potentials through Boltzmann inversion. Unlike classical scoring functions that consist of linear combinations of energy terms, or of simple geometric and physicochemical features (Bourquard et al., 2011; Fink et al., 2011; Moal et al., 2017), a machine learning approach can discover complex nonlinear combinations of features of protein–protein interfaces to train a classifier that labels a docking model as near-native or not. Simple machine learning algorithms work with fixed-dimensional feature vectors. Because interfaces of different docking models can vary widely in size, shape and the arrangement of their interfacial residues, most machine learning-based scoring functions use global features of the entire interface, for example the total interaction energy and the BSA. Such an approach, however, fails to effectively exploit the details of the spatial arrangement of interfacial residues and atoms.
Graphs, in which the nodes encode the amino acid residues or atoms and the intermolecular contacts between them are encoded by the edges, offer a natural and information-rich representation of protein–protein interfaces. Unlike the global interface feature vectors described above, a graph has a residue- or atom-level resolution and naturally encodes the topological information of interacting residues/atoms (Bunke and Riesen, 2011; Vento, 2015). Furthermore, the size of a graph is not fixed and can vary depending on the size of the interface.
Such graph-based descriptions have been used previously in several scoring functions (Chang et al., 2008; Khashan et al., 2012; Pons et al., 2011a), mostly through graph (or network) topology-based metrics. Chang et al. (2008) exploited node degrees (measuring the number of direct contacts of a node) and clustering coefficients (measuring how likely a node and its neighbours are to form a clique) to score docking models. Similarly, Pons et al. (2011a) used closeness (measuring how far a node is from the rest of the nodes in a network) and betweenness (measuring how important a node is as a connector in a network) in scoring, with the intuition that residues with high centralities in a network tend to be key functional residues. Unlike the network topology-based approaches, the SPIDER (Khashan et al., 2012) scoring function uses a graph to represent the interface at residue level, with nodes labeled by their amino acid identity. It ranks the docking models by counting the frequency of native motifs in the interface graph. However, all of these approaches fail to fully exploit the rich features of protein interfaces.
Against this background, we represent the interface with a labeled graph, in which the nodes encode the interface residues, the edges encode residue–residue contacts, and the nodes are annotated with evolutionary conservation profiles. We treat the scoring problem as a binary classification problem. By calculating the similarity of the interface graph of a docking model with the positive (native) and negative (non-native) interface graphs in the training set, we predict the likelihood that the query interface graph belongs to the positive or the negative class (Fig. 1). We make use of a novel graph kernel to compute the pairwise similarity between the graph representations of protein–protein interfaces. We call the resulting graph kernel-based scoring function GraphRank.

Fig. 1. Schematic workflow of our graph kernel-based scoring method. Docking models for a protein–protein complex are first represented as graphs by treating the interface residues as graph nodes and the intermolecular contacts they form as graph edges. Interface features are added to the graph as node or edge labels (only PSSM profiles as node labels in this case). Then, each of the interface graphs of the docking models is compared to the interface graphs of both the positive (native) structure and negative (non-native) models. This graph comparison generates a similarity matrix for the docking models, with the number of rows and columns corresponding to the number of docking models and the total number of positive and negative graphs, respectively. Next, the support vector machine takes the graph kernel matrix as input and predicts decision values that are used as the GraphRank score. The final scoring function, iScore, is a linear combination of the GraphRank score and HADDOCK energetic terms (van der Waals, electrostatic and desolvation energies). The weights of this linear combination are optimized using a genetic algorithm (GA) over the BM4 HADDOCK dataset
GraphRank exploits a random walk graph kernel (RWGK) (Vishwanathan et al., 2010), which has previously been used for protein function prediction (Borgwardt et al., 2005), to calculate the similarity between two interface graphs. By simultaneously conducting random walks on two graphs, the RWGK measures their similarity by aggregating the similarity of the sets of random walks on the two graphs. Unlike previous graph-based scoring functions, the RWGK allows GraphRank to fully exploit various node and edge labels and to explicitly specify the starting and ending probabilities of the random walks. GraphRank has two major advantages over classical machine learning-based scoring functions. First, GraphRank uses a more detailed representation of protein interfaces than the fixed-dimensional feature vectors used by classical machine learning approaches: it exploits residue-level attributes and network topology. Second, GraphRank uses the full profile of interface conservation as node labels, i.e. each node is represented by a 20 × 1 vector of conservation scores extracted from the Position-Specific Scoring Matrix (PSSM). Residue conservation plays an important role in protein–protein recognition (Andreani and Guerois, 2014; de Oliveira and Deane, 2017; Hopf et al., 2014), and hence different types of conservation information have been used in several existing scoring functions (Andreani et al., 2013; Tress et al., 2005; Xue et al., 2014). The PSSM is a multiple-sequence-alignment (MSA)-based conservation matrix. Each of its values is a log-likelihood ratio between the observed probability of one type of amino acid appearing at a specific position in the MSA and the expected probability of that amino acid type appearing in a random sequence. Each position in a protein can thus be represented as a 20 × 1 PSSM profile, which captures the conservation characteristics of each amino acid type at that position.
For GraphRank we designed a specific random walk graph kernel to compare interface graphs. A graph similarity matrix was calculated from a balanced dataset of native and non-native structures from the protein–protein docking benchmark version 4.0 (BM4) (Hwang et al., 2010) and used to train a support vector machine (SVM) classifier. GraphRank, the resulting scoring function, uses only residue conservation as node labels and as the basis for the starting and ending probabilities of the random walks. We further combined the GraphRank score with intermolecular energies, resulting in our final scoring function, iScore. We benchmarked the iScore and GraphRank scoring functions on two independent sets of docking models for two different purposes: (i) four sets of docking software-specific models, compared against their respective scoring functions, and (ii) the CAPRI score set, a docking software-non-specific set in which models from different docking programs are mixed together. We also compared our performance with that of IRaPPA (Moal et al., 2017), one of the latest state-of-the-art machine learning-based scoring functions. The results of our experiments on both benchmark sets show that iScore achieves scoring performance that is competitive with, or superior to, that of the state-of-the-art scoring functions. These results represent the first successful application of graph kernels to protein interfaces for effective discrimination of near-native and non-native conformations of protein complexes.
2 Materials and methods
2.1 Constructing interface graph and random walk graph kernel
2.1.1 Representing protein–protein interfaces as labeled bipartite graphs
A residue is defined as an interface residue if any of its atoms is within 6 Å of any atom of another residue in the partner protein. This is a commonly used interface definition (Xue et al., 2015), and a similar cutoff (5.5 Å), for example, has been shown to work well for contact-based binding affinity prediction (Vangone and Bonvin, 2015). We represent the interface of a native complex or a docking model as a bipartite graph (Fig. 1), in which each node is an interface residue and each edge connects two residues from different partners that have at least one atom–atom distance within 6 Å. We further label each graph node with the residue conservation profile from the Position-Specific Scoring Matrix (PSSM). Each node is thus represented by a 20 × 1 vector of PSSM values. Our current implementation uses a single type of node, namely residues labeled with their PSSM profiles, and a single type of edge, namely those that encode inter-residue contacts. However, our framework admits multiple types of node and edge labels.
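To make the construction concrete, the sketch below builds such a labeled bipartite interface graph from per-residue atom coordinates. The input data structures (`chain_A`/`chain_B` as maps from chain-unique residue identifiers to atom-coordinate arrays, and `pssm_A`/`pssm_B` as maps from residue identifiers to 20-dimensional PSSM rows) are illustrative assumptions, not part of the iScore code.

```python
import itertools
import numpy as np

CONTACT_CUTOFF = 6.0  # Å: any atom-atom distance below this defines a residue contact

def min_atom_distance(coords_a, coords_b):
    """Smallest atom-atom distance between two residues (each an n_atoms x 3 array)."""
    diff = coords_a[:, None, :] - coords_b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min()

def build_interface_graph(chain_A, chain_B, pssm_A, pssm_B):
    """Return the labeled bipartite interface graph as (nodes, edges).

    nodes: residue id -> 20-dimensional PSSM label
    edges: list of (residue in A, residue in B) contact pairs
    """
    edges = [
        (res_a, res_b)
        for res_a, res_b in itertools.product(chain_A, chain_B)
        if min_atom_distance(chain_A[res_a], chain_B[res_b]) <= CONTACT_CUTOFF
    ]
    # only residues involved in at least one cross-interface contact become nodes
    nodes = {res_a: pssm_A[res_a] for res_a, _ in edges}
    nodes.update({res_b: pssm_B[res_b] for _, res_b in edges})
    return nodes, edges
```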
The PSSM was calculated through PSI-BLAST (Altschul, 1997) of BLAST 2.7.1+. The parameters of the BLAST substitution matrix, word size, gap open cost and gap extend cost were automatically set based on the length of protein sequence using the recommended values in the BLAST user guide (https://www.ncbi.nlm.nih.gov/books/NBK279684/) (see Supplementary Table S1). Other parameters were: Number of iterations set to 3 and the e-value threshold to 0.0001. The BLAST database used was the nr database (the non-redundant BLAST curated protein sequence database), version of February 4, 2018.
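For reference, a PSSM like the one described above can be produced with a psiblast call along the following lines; the file names and database path are placeholders, and the length-dependent substitution-matrix, word-size and gap-cost settings mentioned above are omitted for brevity.

```python
import subprocess

def compute_pssm(fasta="query.fasta", db="nr", out_pssm="query.pssm"):
    """Run PSI-BLAST and write the ASCII PSSM (sketch; paths are placeholders)."""
    subprocess.run(
        [
            "psiblast",
            "-query", fasta,
            "-db", db,
            "-num_iterations", "3",       # 3 PSI-BLAST iterations, as in the text
            "-evalue", "0.0001",          # e-value threshold used above
            "-out_ascii_pssm", out_pssm,  # position-specific scoring matrix output
        ],
        check=True,
    )
```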
2.1.2 Random walk graph kernel for interface graphs
We define a random walk graph kernel (RWGK) to measure the similarity of two interface graphs. Given two labeled graphs, a RWGK first applies simultaneous random walks on the two graphs with the same walk length (the number of edges) and then calculates the similarity between those two random walks. The RWGK score is then the weighted sum of the walk similarity varying the walk length from 0 to infinity (Ghosh et al., 2018).
The simultaneous random walks on graphs $G_1$ and $G_2$ are equivalent to a random walk on their direct product graph $G_\times = G_1 \times G_2$. In other words, each walk on the direct product graph corresponds to two walks on the two individual graphs, allowing the calculation of a similarity score between them. When the walk length is 1, these similarity scores are the elements of the weight matrix $W_\times$ of $G_\times$; for walks of length $n$ they are given by $W_\times^{n}$. The similarity between graphs $G_1$ and $G_2$ is thus the weighted sum of these walk similarities over all walk lengths,

$$K(G_1, G_2) = \sum_{n=0}^{\infty} \mu(n)\, \mathbf{q}_\times^{\top} W_\times^{n}\, \mathbf{p}_\times,$$

where $\mathbf{p}_\times$ and $\mathbf{q}_\times$ are the starting and ending probability vectors of the random walks on $G_\times$, and $\mu(n)$ is the weight given to walks of length $n$.
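A minimal numerical sketch of a geometric random walk kernel on the direct product graph is shown below; it illustrates the weighted sum over walk lengths but not the PSSM-based node-label similarities and start/end probabilities used in the actual iScore kernel. The weighting μ(n) = λⁿ, the truncation of the infinite sum and the variable names are illustrative choices.

```python
import numpy as np

def random_walk_kernel(A1, A2, p1, p2, q1, q2, lam=0.1, max_len=20):
    """Geometric random walk kernel between two graphs (sketch).

    A1, A2:  adjacency/weight matrices of the two interface graphs
    p1, p2:  starting probabilities of the walks on each graph
    q1, q2:  ending probabilities of the walks on each graph
    lam:     decay factor, mu(n) = lam**n, down-weighting longer walks
    max_len: truncation of the formally infinite sum over walk lengths
    """
    W = np.kron(A1, A2)   # weight matrix of the direct product graph G1 x G2
    p = np.kron(p1, p2)   # starting probabilities on the product graph
    q = np.kron(q1, q2)   # ending probabilities on the product graph

    kernel, walk = 0.0, p.copy()
    for n in range(max_len + 1):
        kernel += (lam ** n) * (q @ walk)  # weighted similarity of walks of length n
        walk = W @ walk                    # advance the simultaneous walks by one step
    return kernel
```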
2.1.3 Support vector machine (SVM) algorithm
SVM maps arbitrary data objects (vectors, sequences, graphs, etc.) into a kernel-induced feature space where it searches for a hyperplane that maximizes (or approximately maximizes) the separation between classes (Vapnik, 2013). We used the SVM implementation from the LIBSVM package (Chang and Lin, 2011) to train a scoring function, taking the $N \times N$ graph kernel matrix of the training dataset as input ($N$ is the number of training graphs). Given a test case (an interface graph of a docking model in our case), we calculate the kernel vector consisting of the similarities of this query graph with all the training graphs. The trained SVM-based scoring model uses this vector of similarities, together with the labels of the training graphs, to predict the likelihood that the query graph corresponds to a near-native conformation.
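The sketch below illustrates this scheme with scikit-learn's SVC, whose precomputed-kernel mode mirrors LIBSVM's C-SVC; the variable names are illustrative, and the sign convention (lower score = more native-like) anticipates the GraphRank definition in Section 2.3.3.

```python
import numpy as np
from sklearn.svm import SVC

def train_graph_svm(K_train, y_train):
    """Train a C-SVC on a precomputed N x N graph-kernel matrix.

    y_train: +1 for native (positive) graphs, -1 for non-native (negative) graphs.
    """
    clf = SVC(kernel="precomputed")
    clf.fit(K_train, y_train)
    return clf

def score_query(clf, k_query):
    """Score one query graph from its kernel vector against the N training graphs."""
    decision = clf.decision_function(np.asarray(k_query).reshape(1, -1))[0]
    return -decision  # negative decision value: lower means more native-like
```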
2.2 Evaluation metrics to compare scoring functions
Each scoring function has its own default protocol for selecting top models. To avoid subjectivity in the selection of top models in our comparisons, we used the success rate at cluster level to evaluate the scoring functions on the BM5 dataset. We defined a cluster as a hit if at least one of the top four models in that cluster is of acceptable or better quality. The success rate on top N clusters was defined as the number of cases (complexes) with at least one hit out of the N clusters divided by the total number of complexes considered.
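The cluster-level success rate described above can be computed as in the following sketch; the data layout (per-complex lists of score-ranked clusters, each a score-ranked list of model quality labels) is an assumption made for illustration.

```python
def success_rate(ranked_clusters_per_complex, top_n):
    """Fraction of complexes with at least one 'hit' cluster among the top N clusters.

    A cluster is a hit if any of its top 4 models is of acceptable or better quality.
    """
    hits = 0
    for clusters in ranked_clusters_per_complex.values():
        for cluster in clusters[:top_n]:
            if any(quality != "incorrect" for quality in cluster[:4]):
                hits += 1
                break
    return hits / len(ranked_clusters_per_complex)
```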
The quality of the docking models was evaluated using standard CAPRI criteria based on the interface or ligand Root Mean Squared Deviations (i-RMSDs and l-RMSDs, respectively) and fraction of native contacts (Fnat) [for details refer to Figure 1 of Lensink et al. (2007)]. They were classified as incorrect (i-RMSD > 4 Å or Fnat < 0.1), acceptable (2 Å < i-RMSD ≤ 4 Å and Fnat ≥ 0.1), medium (1 Å < i-RMSD ≤ 2 Å and Fnat ≥ 0.3) or high (i-RMSD ≤ 1 Å and Fnat ≥ 0.5) quality (Lensink et al., 2007).
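For completeness, the CAPRI quality thresholds quoted above translate into a simple cascading classification; the l-RMSD criteria of the full CAPRI protocol are omitted here for brevity.

```python
def capri_quality(i_rmsd, fnat):
    """Classify a docking model from its i-RMSD (Å) and fraction of native contacts."""
    if i_rmsd <= 1.0 and fnat >= 0.5:
        return "high"
    if i_rmsd <= 2.0 and fnat >= 0.3:
        return "medium"
    if i_rmsd <= 4.0 and fnat >= 0.1:
        return "acceptable"
    return "incorrect"
```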
Table 2. Comparison of GraphRank and iScore with the CAPRI best performing group per target on the CAPRI score set

| CAPRI target | GraphRank | iScore | CAPRI best | # Total models | # Near-native |
|---|---|---|---|---|---|
| T29 | 4 | 4 | 9/5** | 1979 | 166 |
| T30 | 0 | 0 | 0 | 1148 | 2 |
| T32 | 4/1** | 4/1** | 2 | 599 | 15 |
| T35 | 0 | 0 | 1 | 497 | 3 |
| T37 | 2/1** | 4/2** | 6/1*** | 1364 | 97 |
| T39 | 0 | 0 | 0 | 1295 | 4 |
| T40 | 4/3** | 4/1*** | 10/10*** | 1987 | 535 |
| T41 | 8 | 10/2** | 10/2*** | 1101 | 347 |
| T46 | 3 | 4 | 4 | 1570 | 24 |
| T47 | 8/5***/3** | 10/6***/4** | 10/10*** | 1015 | 608 |
| T50 | 0 | 4/3** | 7/6** | 1447 | 133 |
| T53 | 5/1** | 5/1** | 8/3** | 1360 | 122 |
| T54 | 0 | 0 | 0 | 1304 | 19 |
| Total | 8/1***/4** | 9/2***/5** | 10/4***/3** | | |
Note: For each target, 10 models are selected and evaluated. Values are labeled in green/red when the performance of our scoring functions is better/worse than that of the CAPRI best scoring group. The scoring performance for each target is reported as the number of acceptable or better models (hits), followed by the number of high-quality (***) or medium-quality (**) models. For example, 8/2** means that there are 8 hits in total among the top 10 models, 2 of which are medium-quality models. The overall performance of each method on all 13 targets (the last row) is reported in a similar way. For example, 9/2***/5** means that a scoring function is successful on 9 targets, 2 of which have at least one *** model and 5 of which have at least one ** model in the top 10. Note that the CAPRI best column combines results from 37 different groups (refer to Table 3 for a per-group comparison and Supplementary Table S3 for a per-target comparison).
2.3 Training on docking benchmark 4 docking models
2.3.1 Training dataset for GraphRank
The dataset for training was based on protein–protein complexes from the protein–protein docking benchmark version 4.0 (BM4), considering only dimers and excluding antibody complexes, resulting in a set of 117 non-redundant protein–protein complexes. Docking models for those complexes had been generated previously by running HADDOCK in its ab initio mode using center of mass restraints (Karaca et al., 2013). The crystal structures of these 117 complexes (the ‘native’ structures) form our positive training set. The average number of nodes and edges in the corresponding graphs for this native set are 68 ± 25 and 119 ± 55, respectively. To create a balanced training set, we randomly selected 117 non-native (wrong) models from the pool of HADDOCK models with i-RMSD ≥ 10 Å and number of graph nodes ≥5 as our negative training set. The average number of nodes and edges in the non-native set are 48 ± 14 and 70 ± 23, respectively. In total, we thus have 234 (=117*2) structures as our training set.
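As an illustration, the sketch below shows one way the negative set described above could be drawn; the field names (`irmsd`, `n_nodes`) and the fixed random seed are assumptions for the example, not part of the published protocol.

```python
import random

def sample_negatives(models, n_negatives=117, seed=0):
    """Draw non-native training models (sketch).

    Keep clearly wrong models (i-RMSD >= 10 Å) whose interface graphs have at
    least 5 nodes, then randomly sample 117 of them as the negative set.
    """
    pool = [m for m in models if m["irmsd"] >= 10.0 and m["n_nodes"] >= 5]
    rng = random.Random(seed)
    return rng.sample(pool, n_negatives)
```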
2.3.2 Training dataset for iScore
For the training of iScore we selected BM4 complexes for which HADDOCK, running in ab initio mode using center of mass restraints, generated at least one good model in the final water refinement stage. This resulted in 63 cases for which at least one docking model with acceptable or better quality was present in the final set of 400 water-refined models. This dataset is denoted in the following as the BM4 HADDOCK dataset.
2.3.3 Training the graph kernel based scoring function (GraphRank)
We applied the commonly used SVM classifier C-SVC from LIBSVM (Chang and Lin, 2011) to train our scoring function. We precomputed the random walk graph kernel matrix ($234 \times 234$) for the training data and used it as input to the SVM classifier. The SVM outputs a predicted decision value for a test case (the decision value from LIBSVM is defined as $d \cdot \lVert \mathbf{w} \rVert$, where $d$ is the signed distance from a point to the classification hyperplane and $\mathbf{w}$ is the SVM weight vector that defines this hyperplane). To be consistent with the energy terms that we later incorporate into iScore (the lower the energy, the better a model), we use the negative decision value from the SVM as the final GraphRank score. The resulting optimized SVM classifier is denoted the ‘GraphRank’ scoring function.
2.3.4 Integrating GraphRank score with energetic terms (iScore)
We combined the GraphRank score with three energetic terms from HADDOCK to train a simple linear scoring function named iScore.
The HADDOCK energetic terms used are:
Evdw, the intermolecular van der Waals energy described by a 12-6 Lennard-Jones potential;
Eelec, the intermolecular electrostatic energy described by a Coulomb potential;
Edesolv, an empirical desolvation energy term.
The van der Waals and electrostatic energies are calculated with an 8.5 Å distance cut-off using the OPLS united-atom force field (Jorgensen and Tirado-Rives, 1988).
The weights of the linear combination of the GraphRank score and the three energetic terms were optimized with a genetic algorithm (GA) on the BM4 HADDOCK dataset (see Fig. 1). The parameters of the GA optimization were: population size = 800, maximum number of generations = 100, crossover rate = 0.8 and stopping tolerance = 0.001. The GA converged quickly, stopping at the 51st generation. The GA optimization was repeated 30 times and the median values were used as the final weights.
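As a toy illustration of the combination (not the authors' actual weights, which come from the GA optimization described above), iScore can be thought of as a simple weighted sum in which every term follows a lower-is-better convention:

```python
# Hypothetical weights for illustration only; the real values are the medians
# of the GA-optimized weights and are not reproduced here.
WEIGHTS = {"graphrank": 1.0, "evdw": 0.01, "eelec": 0.001, "edesolv": 0.01}

def iscore(graphrank, evdw, eelec, edesolv, w=WEIGHTS):
    """Linear combination of the GraphRank score and HADDOCK energy terms."""
    return (w["graphrank"] * graphrank
            + w["evdw"] * evdw
            + w["eelec"] * eelec
            + w["edesolv"] * edesolv)
```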
2.4 Validation and comparison with state-of-the-art scoring functions
2.4.1 Validation on models from different docking programs
We validated iScore’s performance on docking models from four different docking programs: HADDOCK (Dominguez et al., 2003; van Zundert et al., 2016), SwarmDock (Torchala et al., 2013), pyDock (Cheng et al., 2007; Grosdidier et al., 2007; Jiménez-García et al., 2013) and ZDock (Pierce et al., 2014; Pierce and Weng, 2007). These models were used to evaluate our scoring functions and to compare them with the original scoring functions of the respective docking programs. The protein–protein complexes used for testing are the new entries of the protein–protein docking benchmark version 5.0 (BM5) (Vreven et al., 2015), on which none of the docking software listed above has been previously trained. These new cases are not present in BM4, our training set, and are hence non-redundant with it. Antibody complexes were excluded. The HADDOCK docking models for the new BM5 cases were generated using predicted interface residues from CPORT (de Vries and Bonvin, 2011), as reported in the BM5 paper (Vreven et al., 2015). The docking models for ZDock, pyDock and SwarmDock were taken from the work of Moal et al. (2017). In total, we could use 9, 18, 14 and 10 complexes for HADDOCK, SwarmDock, pyDock and ZDock, respectively, for which at least one near-native model was present in the set of generated models, with the number of models per case varying from 125 to 500.
Calculating HADDOCK energetic terms. We used HADDOCK to calculate the intermolecular energies for the docking models from other docking programs. For this, the missing atoms of the models were built according to the OPLS force field topology with standard HADDOCK scripts using CNS (Brünger et al., 1998). A short energy minimization (EM) was then performed with the following settings: 50 steps of conjugate gradient EM, van der Waals interactions truncated below the distance of 0.5 Å, and dielectric constant set to 1.
Removing docking models containing clashes. Docking models originating from rigid-body docking programs, such as ZDock and pyDock, often contain clashes that a short EM cannot resolve. We removed those clashing models from the test dataset following the CAPRI assessment procedure: A clash is defined by a pair of heavy atoms between protein partners with a distance below 3 Å. We discarded all models with more than 0.1 clashes per Å2 of buried surface.
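A minimal sketch of this clash filter is shown below; the inputs (heavy-atom coordinate arrays of the two partners and a precomputed buried surface area) and the function names are assumptions for illustration.

```python
import numpy as np

CLASH_DISTANCE = 3.0      # Å: heavy-atom pair between partners below this is a clash
MAX_CLASH_DENSITY = 0.1   # allowed clashes per Å^2 of buried surface

def has_too_many_clashes(receptor_atoms, ligand_atoms, buried_surface_area):
    """True if the model exceeds 0.1 clashes per Å^2 of buried surface."""
    diff = receptor_atoms[:, None, :] - ligand_atoms[None, :, :]
    distances = np.sqrt((diff ** 2).sum(axis=-1))
    n_clashes = int((distances < CLASH_DISTANCE).sum())
    return n_clashes / buried_surface_area > MAX_CLASH_DENSITY
```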
Clustering. The remaining docking models for each case were clustered with the fraction of common contacts (FCC) method (Rodrigues et al., 2012) using a 0.6 cut-off and requiring a minimum number of 4 members per cluster.
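The sketch below gives a rough, greedy contact-overlap clustering in the spirit of the FCC method; it is not the published FCC implementation (Rodrigues et al., 2012), and the contact representation (one set of interface residue pairs per model) is an assumption for illustration.

```python
def fcc(contacts_i, contacts_j):
    """Fraction of the contacts of model i that are also found in model j."""
    return len(contacts_i & contacts_j) / len(contacts_i) if contacts_i else 0.0

def cluster_models(contacts, cutoff=0.6, min_size=4):
    """Greedy clustering of models by contact overlap (illustrative sketch)."""
    unassigned = set(range(len(contacts)))
    clusters = []
    while unassigned:
        # pick the model with the most remaining neighbours above the cutoff as centre
        centre = max(
            unassigned,
            key=lambda i: sum(fcc(contacts[i], contacts[j]) >= cutoff for j in unassigned),
        )
        members = [j for j in unassigned if fcc(contacts[centre], contacts[j]) >= cutoff]
        unassigned -= set(members)
        if len(members) >= min_size:   # keep only clusters with at least 4 members
            clusters.append(members)
    return clusters
```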
Comparison with IRaPPA on models from different docking programs. We compared our performance with that of IRaPPA on models of the new BM5 complexes from SwarmDock, pyDock and ZDock (Moal et al., 2017). The authors of IRaPPA kindly provided us with their selection of top 10 models (one model from each of the top 10 clusters), which allowed us to compare our results with IRaPPA at the model level. iScore’s default protocol for selecting the top 10 models is to take the top 2 models from each of the top 5 clusters for each target, when applicable. If fewer than 5 clusters are present, iScore selects top models evenly from each cluster. In cases where the models are too diverse to be clustered (e.g. only 1 cluster with 4 models while the large majority of models do not cluster), iScore selects all models from the available clusters and then takes the remaining models from the unclustered ones. A sketch of this selection logic is given below.
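The simplified version below illustrates the idea; the exact tie-breaking and edge-case handling of the iScore implementation are not reproduced here, so treat it as an illustrative sketch.

```python
def select_top10(clusters, unclustered, n_total=10, n_clusters=5):
    """Pick 10 models: top models from the top clusters, topped up with unclustered ones.

    clusters:    list of clusters (each a score-ordered list of model ids), best first
    unclustered: score-ordered list of models that did not cluster
    """
    clusters = clusters[:n_clusters]
    per_cluster = max(2, n_total // max(len(clusters), 1))  # 2 per cluster when 5 clusters
    selected = []
    for cluster in clusters:
        selected.extend(cluster[:per_cluster])
    for model in unclustered:            # fill any remaining slots with unclustered models
        if len(selected) >= n_total:
            break
        selected.append(model)
    return selected[:n_total]
```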
2.4.2 Validation on the CAPRI score set
The CAPRI score set consists of models collected from CAPRI participants and used in the scoring experiment of CAPRI (Lensink and Wodak, 2014). During the CAPRI scoring competitions, each scoring group is asked to select the top 10 models. We tested our scoring functions on this dataset and compared their performance with the various scoring functions used in the CAPRI challenge. Docking models with clashes were removed as described above. Both dimers and multimers were considered here. We used 13 cases from the CAPRI score set, with the number of models ranging between 497 and 1987. Following the CAPRI assessment protocol, we considered only 10 models for assessment. iScore’s default model selection protocol was used, i.e. simply selecting the top 2 models of the top 5 clusters for each target.
3 Results
3.1 Training and optimization
We first trained a graph kernel-based scoring function, called GraphRank, using an SVM classifier. GraphRank ranks docking models based on their similarity to the native and non-native structures used in training, where similarity is measured in terms of interface topology and conservation profiles. The lower the GraphRank score, the more similar the docking model is to native complexes.
The success rates of the HADDOCK score, GraphRank score and iScore on the BM4 HADDOCK dataset (63 cases) are shown in Figure 2. GraphRank scores are obtained by leave-one-complex-out cross-validation, i.e. all models of one complex are kept as test data and ranked after training GraphRank on the remaining complexes, and this process is repeated for all complexes. Compared with the energy-based HADDOCK score, the graph- and conservation-based GraphRank score achieves higher success rates. It is also evident that adding the energetic features in iScore further improves the scoring, reaching a success rate of 62% on the top 5 clusters compared with 59% for GraphRank.

Fig. 2. Success rate of the HADDOCK score, GraphRank and iScore on the BM4 HADDOCK training dataset over the top N clusters of models
3.2 Benchmarking on docking software specific docking models and their respective scoring functions
Sampling and scoring are typically not independent components: they are often interrelated, since a specific scoring method might depend on the sampling strategy followed and on the representation of the system. We benchmarked here the performance of iScore and GraphRank, which are trained on HADDOCK models, on docking software-specific models and compared their performance with that of each software's respective scoring function.
For this, models of the new protein–protein complexes of Docking Benchmark 5 (Vreven et al., 2015) were generated using four widely used docking programs: HADDOCK (Dominguez et al., 2003; van Zundert et al., 2016), SwarmDock (Torchala et al., 2013), pyDock (Cheng et al., 2007; Grosdidier et al., 2007; Jiménez-García et al., 2013) and ZDock (Pierce et al., 2014; Pierce and Weng, 2007). The numbers of available complexes with near-native docking models for these four programs are 9, 18, 14 and 10, respectively, with the number of docking models per complex varying from 125 to 500. The scoring performance was assessed after clustering the docking models using the clustering procedure described in Section 2.
iScore outperforms the HADDOCK, ZDock and pyDock scoring functions and competes with that of SwarmDock on their respective docking program-specific models (Fig. 3). On the HADDOCK models (Fig. 3A), iScore shows the same performance as GraphRank, both outperforming HADDOCK on the top 2 to top 4 clusters and reaching a 33% success rate for the top 5 clusters. For all other model sets, iScore outperforms GraphRank. It shows a better scoring performance than the original scoring functions of pyDock (Fig. 3C) and ZDock (Fig. 3D), while the original SwarmDock scoring function remains the best in terms of scoring performance (Fig. 3B). iScore reaches success rates of 36% and 60% (top 5 clusters) on the pyDock and ZDock models, respectively, a clear improvement over their original scoring functions.

Fig. 3. Success rates measured at cluster level on four sets of docking program-specific models for the newly added protein–protein complexes in BM5. GraphRank and iScore are compared with the scoring functions of HADDOCK (A), SwarmDock (B), pyDock (C) and ZDock (D) on the docking models of the corresponding docking program
The scoring performance of iScore competes with that of IRaPPA (Moal et al., 2017), a state-of-the-art machine learning-based scoring function, on BM5 docking models generated by SwarmDock, pyDock and ZDock (Moal et al., 2017) (for per-complex comparisons see Supplementary Table S2, and for overall performance see Table 1). IRaPPA identifies at least one hit for more cases than GraphRank and iScore, while iScore identifies higher-quality models for more cases than IRaPPA (Table 1). Specifically, iScore is successful in its top 10 models for 10, 6 and 6 cases for the SwarmDock, pyDock and ZDock models, respectively, versus 12, 10 and 8 for IRaPPA. However, iScore appears to be more effective at selecting high- or medium-quality models than IRaPPA: iScore and IRaPPA obtain 2 and 1 high-quality cases on the SwarmDock models, respectively, and 5 and 3 medium-quality cases on the ZDock models, respectively. Considering that iScore was trained exclusively on BM4 HADDOCK models using a small number of features (1 for GraphRank, 4 for iScore), it performs well compared with IRaPPA, which exploits 91 features and was trained separately on the docking models generated by SwarmDock, ZDock and pyDock.
Table 1. Comparison of GraphRank and iScore with IRaPPA on docking program-specific models of BM5 protein–protein complexes

| Docking models | # Complexes | GraphRank | iScore | IRaPPA |
|---|---|---|---|---|
| SwarmDock | 18 | 7/1***/6** | 10/2***/6** | 12/1***/6** |
| pyDock | 14 | 5/3** | 6/3** | 10/3** |
| ZDock | 10 | 4/3** | 6/5** | 8/3** |
Note: 10 models are selected and evaluated. The scoring performance for each complex is reported as the number of acceptable or better models (hits), followed by the number of high (indicated with ***) or medium quality models (**). The overall performance of each method on all complexes is reported here. For example, 7/1***/6** means that a scoring function is successful in 7 complexes, 1 complex out of the 7 complexes has at least a *** model and 6 out of 7 have at least a ** model in the top 10.
3.3 iScore ranks among the top scorers on the CAPRI score set
The scoring set from the CAPRI scoring experiments (Lensink and Wodak, 2014) is a valuable resource for evaluating scoring functions. CAPRI is a community-wide experiment for evaluating docking programs (started in 2001) (Janin, 2002) and scoring functions (from 2005 on). The CAPRI score set consists of 15 targets, 13 of which have near-native docking models. Each target has a mixture of 500–2000 models from the various docking programs used in the CAPRI prediction challenges (Table 2). This represents an ideal set for evaluating scoring functions independently of docking programs.
We benchmarked iScore and GraphRank on the models from the CAPRI score set and compared their performance with the reported performance of the various scoring functions/groups that participated in the CAPRI scoring experiments. Following the CAPRI assessment protocol, we selected only the top 10 ranked models for assessing the performance of iScore and GraphRank. This was done by selecting the top two models from each of the top five clusters for each target.
The scoring performance of iScore and GraphRank on the 13 CAPRI targets containing near-native models is summarized in Table 2, together with the performance of the best scoring function/group in CAPRI for each target. Details of the performance of the various scoring functions compared on these targets are available in Supplementary Table S3. Again, iScore outperforms GraphRank (Table 2), demonstrating the synergistic effect of conservation information and intermolecular energies in differentiating near-native models from docking artefacts. Further, iScore selected near-native models in its top 10 for 9 out of 13 targets, with 2 targets having high-quality models and 5 having medium-quality models. For comparison, selecting for each target the best CAPRI scoring function/group results in 10 out of 13 correctly predicted targets, with 4 and 3 targets having at least one high-quality and one medium-quality model, respectively.
Overall, iScore ranks among the top scorers on these 13 CAPRI scoring targets (Table 3). In total, 37 scoring functions/groups were assessed (Supplementary Table S3), but only those that participated in at least 5 targets are shown in Table 3. When considering only the commonly submitted targets (Supplementary Table S3), iScore still competes with the Weng group (8/2***/4** versus 8/3***/2**), the Bonvin group (8/2***/4** versus 8/2***/3**) and the Bates group (8/2***/4** versus 8/1***/4**). It should be noted that the CAPRI scoring groups, e.g. the Weng and Bonvin groups, selected their 10 models with the help of human expertise, whereas our selections were generated by iScore and GraphRank alone, without manual intervention. Furthermore, the results clearly demonstrate the importance of the PSSM feature: GraphRank, using only the PSSM feature, already performs quite well (ranked in 4th position).
Table 3. Rankings of GraphRank and iScore in comparison with the scorer groups on the CAPRI score set

| Scoring function/group | Performance | # Submitted targets |
|---|---|---|
| iScore | 9/2***/5** | 13 |
| Weng | 8/3***/2** | 9 |
| Bonvin | 8/2***/3** | 9 |
| Bates | 8/1***/4** | 10 |
| GraphRank | 8/1***/4** | 13 |
| Zou | 7/4***/1** | 9 |
| Wang | 6/2***/3** | 6 |
| Fernandez-Recio | 5/2***/3** | 8 |
| Elber | 5/1***/1** | 5 |
| Wolfson | 4/1*** | 5 |
| Camacho | 3/2***/1** | 5 |
| … and many others | | |
Note: In total, 37 scorer groups were assessed (Supplementary Table S3), but only groups that submitted predictions for at least 5 of the 13 CAPRI targets are shown here. The scoring functions/groups are ordered by their performance. The number of targets with submitted predictions is shown for each function/group.
4 Discussion
We have developed a novel graph kernel-based scoring function, iScore, for scoring and ranking docking models of protein–protein complexes. In benchmarks on docking models from four different docking programs, iScore shows success rates competitive with or better than those of the original scoring functions of these docking programs. Further, validation on CAPRI targets and comparison with the CAPRI scorer groups highlight the high performance of iScore, which achieves the top success rate with acceptable or better models selected for 9 out of 13 CAPRI targets. It is worth noting that both GraphRank and iScore were trained on a rather small dataset, using a very limited set of features: only one for GraphRank and four for iScore. We can therefore expect to further improve the performance of iScore by increasing the size of the training set and enriching the node and edge labels of the interface graphs. Our iScore software, with MPI (Message Passing Interface) and GPU support, can be freely downloaded from https://github.com/DeepRank/iScore. Currently, it takes about 15 min to rank 1619 models of a recent CAPRI target (a 6-domain protein, with domains ranging from 83 to 112 amino acids) using 12 CPU cores (data for this CAPRI round not yet published).
The use of a graph kernel on labeled graphs in iScore provides a novel way to score docking models. SPIDER (Khashan et al., 2012) is also a graph-based scoring function but is drastically different from GraphRank, and hence also from iScore. SPIDER identifies common interface residue patterns (i.e. interfacial graph motifs) in native complexes and ranks a docking model by counting the frequency of these interfacial graph motifs. First, GraphRank is based on graph kernel functions that calculate the interface similarities between a docking model and the training complexes, whereas SPIDER is based on frequent graph mining to identify interfacial graph motifs. Second, and importantly, the graphs used in SPIDER carry only node labels encoding amino acid identity, while our GraphRank framework can potentially exploit not only the properties of individual interface residues through node labels, but also the features of contacts between residues through edge labels. While we have used only node labels in this work (residue conservation profiles), the concept can easily be extended to add labels to the graph edges, for example in the form of residue–residue interaction energies and coevolution information. Third, iScore uses a multi-scale representation of docked interfaces by combining atom-level energy terms with residue-level graph similarities, which allows it to account for subtle differences in 3D space, interaction topology and residue conservation at the same time.
Both conservation profiles and intermolecular energies are important features for scoring of PPIs. Our scoring function GraphRank, using only conservation profiles of the interface residues as features, already shows a promising scoring performance. Physical energies have been widely used and identified as important features in state-of-the-art scoring functions and are complementary to evolutionary information. Considering the successful applications of intermolecular energies in existing scoring functions, in this work we simply combined three intermolecular energetic terms from HADDOCK with the conservation profiles-based GraphRank score. The resulting scoring function iScore outperforms GraphRank, indicating the significance of considering both evolutionary and energetic information in characterizing PPIs.
When comparing the performance of iScore on models from different docking programs for the new BM5 cases, we observe that iScore improves the ranking over the original scoring functions of the rigid-body docking programs (pyDock and ZDock), whereas it does not clearly outperform the scoring functions of flexible docking programs such as HADDOCK and SwarmDock, which generate more optimized interfaces (Fig. 3). This might be related to the structural quality of the docking models. Models from flexible docking are already optimized to relieve steric clashes, whereas rigid-body programs usually lack such an optimization step, leading to unnatural interactions (clashes) within the structures. To improve the structural quality of the docking models, we did apply a short energy minimization before calculating the intermolecular energies. With higher structural quality, as for the models coming out of SwarmDock and HADDOCK, the impact of this short minimization is smaller, and the resulting improvement of iScore over the original scoring functions is correspondingly less.
Note that the current version of iScore does not work on antibody-antigen complexes, because PSSMs do not capture interface conservation in such complexes. Incorporation of antibody-antigen specific features into iScore is a topic of our ongoing work.
By introducing labeled graphs and a graph kernel in our scoring function iScore, we pave the way for exploring more detailed features in the graph representation of protein–protein complexes. Natural extensions of this work will be to include edge labels, for example residue–residue interaction energies and co-evolution. Considering that graphs are natural representations of biomolecules, this general framework should also be useful for other important topics related to macromolecular interactions, such as binding affinity prediction, hot-spot prediction and the rational design of protein interfaces.
Acknowledgements
We thank Dr Iain H. Moal (EBI Hinxton, UK) for providing IRaPPA ranking results and docking models of SwarmDock, pyDock and ZDock. We thank Dr Yasser EL-Manzalawy from Penn State University and MSc. Mick Walter from Utrecht University for helpful discussions. The content is solely the responsibility of the authors and does not necessarily represent the official views of the sponsors.
Funding
This work was supported in part by the European H2020 e-Infrastructure grant BioExcel (grant no. 675728). C.G. acknowledges financial support from the China Scholarship Council (grant no. 201406220132). L.X. acknowledges financial support from the Netherlands Organisation for Scientific Research (Veni grant 722.014.005) and an Accelerating Scientific Discovery (ASDI) grant from the Netherlands eScience Center (grant no. 027016G04). The work of V.H. was supported in part by the National Center for Advancing Translational Sciences, National Institutes of Health through the grant UL1 TR000127 and TR002014, by the National Science Foundation, through the grants 1518732, 1640834 and 1636795, the Pennsylvania State University’s Institute for Cyberscience and the Center for Big Data Analytics and Discovery Informatics, the Edward Frymoyer Endowed Professorship in Information Sciences and Technology at Pennsylvania State University and the Sudha Murty Distinguished Visiting Chair in Neurocomputing and Data Science funded by the Pratiksha Trust at the Indian Institute of Science. YJ was supported in part by a research assistantship funded by the Center for Big Data Analytics and Discovery Informatics at Pennsylvania State University.
Conflict of Interest: none declared.
References