ISPRED4: interaction sites PREDiction in protein structures with a refining grammar model

Motivation: The identification of protein‐protein interaction (PPI) sites is an important step towards the characterization of protein functional integration in the cell complexity. Experimental methods are costly and time‐consuming and computational tools for predicting PPI sites can fill the gaps of PPI present knowledge. Results: We present ISPRED4, an improved structure‐based predictor of PPI sites on unbound monomer surfaces. ISPRED4 relies on machine‐learning methods and it incorporates features extracted from protein sequence and structure. Cross‐validation experiments are carried out on a new dataset that includes 151 high‐resolution protein complexes and indicate that ISPRED4 achieves a per‐residue Matthew Correlation Coefficient of 0.48 and an overall accuracy of 0.85. Benchmarking results show that ISPRED4 is one of the top‐performing PPI site predictors developed so far. Contact: gigi@biocomp.unibo.it Availability and Implementation: ISPRED4 and datasets used in this study are available at http://ispred4.biocomp.unibo.it.


Introduction
In the context of the cell macromolecular crowding, protein-protein interactions (PPIs) are the major drivers of all the biological processes. The knowledge of the specific protein surface residues involved in PPI is then crucial for characterizing the role of each protein in complex metabolic and signaling pathways and in sustaining cell physiology. Furthermore, PPI is an invaluable source of information for elucidating complex disease networks, for improving protein-docking studies and for drug design (Sudha et al., 2014).
Experimental determination of how proteins interact is expensive in terms of both time and costs. Theoretical methods can complement PPI experimental knowledge [for recent reviews see Xue et al. 2015;Esmaielbeiki et al. (2016), and references therein].
Routinely, PPI computational methods can (i) identify pairs of interacting proteins in complex interaction networks; (ii) identify pairwise residue contacts between two query proteins known to interact; (iii) predict on a target protein surface residues in interaction with one or more (not necessarily known) binding partner/s (predict PP interface/s). For this last type of prediction, several computational tools are available. Depending on the exploited information, they group into three different categories: template-, sequence-, and structure-based predictors (Aumentado-Armstrong et al., 2015;Esmaielbeiki et al., 2016) Template-based approaches (Jordan et al., 2012;Xue et al., 2011;Zhang et al., 2011) perform the prediction searching for similar structures (sequence homologues or structural neighbours). Interface residues are then transferred from the template to the query after structural alignment. These methods typically achieve good performances (Esmaielbeiki et al., 2016), but require the availability of reliable structural templates for a given query protein.
Sequence-based predictors (Chen and Li, 2010;Gallet et al., 2000;Ofran and Rost, 2007;Res et al., 2005;Yan et al., 2004) extract descriptors derived only from the protein sequence. These descriptors, or features, are subsequently processed using machinelearning algorithms. Typical features include: evolutionary information in the form of sequence profiles or position-specific scoring matrices; residue conservation computed on multiple sequence alignments (MSAs); residue interface propensity; residue physicochemical properties (Aumentado-Armstrong et al., 2015). Although these approaches can potentially be applied to all known protein sequences, their prediction performance is limited by the amount of information that can be derived from the protein sequence alone (Esmaielbeiki et al., 2016).
We present ISPRED4, an improved structure-based predictor of PPI sites on unbound monomer surfaces in the absence of interaction partners. Building on top of our previously released ISPRED3 (Savojardo et al., 2012), the proposed method exploits a richer set of 46 features derived from both sequence and structure, including local and global co-evolutionary analyses. Features are elaborated to predict interface residues on an input protein surface with a hybrid procedure. The method is based on the combination of SVMs and grammar-restrained conditional random fields (GRHCRF) (Fariselli et al., 2009).
When benchmarked towards other available methods for the same task and on the same dataset, ISPRED4 scores better than the other state-of the-art approaches.

Datasets
The dataset, adopted in this study to train/test ISPRED4, derives from the Docking Benchmark dataset version 5 (DBv5) (Vreven et al., 2015). The original DBv5 dataset contains 230 protein complexes whose key feature is the availability of both bound and unbound structures of the interacting proteins. Unbound monomers are distributed over 420 different PDB entries. Starting from this initial set of entries, we retained only the subset of protein complexes that met the following criteria: • Both bound and corresponding unbound structures were obtained by means of X-Ray crystallography. • Interfaces calculated from the bound structure could be successfully mapped to unbound structures unambiguously.
After this filtering procedure, we retained 151 bound protein complexes. For each complex, the corresponding unbound structures were collected, leading to 314 different monomer chains. In what follows, this dataset is referred to as DBv5Sel and it is available at http://ispred4.biocomp.unibo.it.
Two alternative definitions of PPI sites on a protein surface are routinely adopted: 1. Any surface residue whose distance (all-atom or Ca-Ca) from one residue in the partner molecule is below a pre-defined threshold (in the range of 5-8 Å ). 2. Any surface residue undergoing a reduction of its Accessible Surface Area (ASA) upon complex formation.
Several studies showed that the choice of interface definition has only a limited impact on the performance of a predictor (De Vries and Bonvin, 2008;Savojardo et al., 2012). In this paper, we adopt the second definition. In particular, surface residues are those with Relative Solvent Accessibility (RSA) ! 20% in the unbound monomer and interacting residues are those undergoing a decrease of the Absolute Solvent Accessibility (ASA) ! 1 Å 2 in the corresponding complex.
Per-residue ASA values were computed with the DSSP program (Kabsch and Sander, 1983); RSA values are then obtained by normalizing with respect to the residue-specific maximal accessibility values as previously described (Rost and Sander, 1993). The ASA differences were computed comparing each unbound chain with the corresponding chain taken from the bound complex. When the same monomer is part of different complexes, the union of the interacting residues was considered.
The performance benchmark was performed on a blind test set derived from past CAPRI experiments. In particular, we considered the targets released in CAPRI rounds from 1 to 29 (in total 67 targets). We filtered out target chains sharing sequence identity !30% with respect to any sequence of our training dataset. Furthermore, also inter-target redundancy was reduced to 30% sequence identity. The final CAPRI-Blind dataset comprises 22 different bound structures including 29 protein chains. The above-described procedure was adopted to extract surface residues and PPI sites. Summary statistics of both DBv5Sel and CAPRI-Blind datasets are reported in Table 1.

Input feature encoding
Several feature descriptors were extracted and adopted to perform the interface/non-interface classification task. The complete feature set consists of 10 different groups of descriptors encoded in a 46-dimensional real vector for each input surface residue. Table 2 reports a summary of the different descriptor sets used in this study.

Evolutionary information
Evolutionary information for each position of the primary chain sequence was extracted in the form of a sequence profile. For a given protein chain sequence of length L, the PSI-BLAST (Altschul et al., 1997) program was used to search the Uniprot Reference Cluster 90 database (Suzek et al., 2015) for similar sequences and the corresponding 20-by-L profile matrix built during the search was extracted as additional PSI-BLAST output (option -ascii_pssm). The matrix computed by PSI-BLAST already incorporates a datadependent pseudocount model (Altschul et al., 1997;Tatusov et al., 1994). Finally, for each surface residue i, a 20-dimensional vector v i was computed by averaging sequence profile entries over the surface structural context of the residue i, i.e.
where C(i) is the surface context of the residue i, defined as the set of surface residues whose Ca-Ca distance from i is below 12Å and P k is the 20-dimensional profile vector of residue k.

Residue conservation
Given the sequence profile derived from PSI-BLAST, a conservation score c i was computed for each surface residue position i using the normalized Shannon's entropy as previously described by Sander and Schneider (1991): where K ¼ 20 (i.e. the 20 different residue types) and P ij is the frequency of residue type j at position i extracted from the sequence profile.

Residue interface propensity
The propensity p k of each residue type to be in interaction sites was scored using the following log-ratio formula: where f I ðkÞ is the frequency of residue of type k in interaction sites and f S ðkÞ is the frequency of residue type k in the surface. Both frequencies and propensities scores were computed on the training set and kept fixed when encoding the testing set, for each crossvalidation iteration.

Residue physico-chemical properties
The 10 orthogonal properties introduced by Kidera et al. (1985) were used to represent the physico-chemical nature of each residue. As described by the authors, the 10 properties were derived with multivariate statistical analysis of a set of 188 different physical properties of naturally occurring amino acids. This allows representing each residue type with a small number of parameters that contain a sufficient amount of information (Kidera et al., 1985). Each residue in the surface was represented according to its type with a 10-dimensional real vector.

Residue co-evolution scores
Different methods are available to extract residue co-evolutionary indexes starting from a MSA. In this work, we adopted sparse inverse covariance estimation as implemented in the PSICOV method (Jones et al., 2012) as well as Mutual Information (MI). MSAs were generated running the HHBlits aligner (Remmert et al., 2011) against the UniprotKB database (we used the clustered Uniprot release 2016/02 at 20% sequence identity available at the HHSuite FTP site). Hence, PSICOV and MI co-evolutionary scores were computed from the alignments.
MI is defined as: where K ¼ 20 (i.e. the 20 different residue types), f i (a) and f j (b) are the frequencies of residue type a and b at positions i and j in the MSA, respectively, and f ij (a,b) is the frequency of residue pair ab at positions ij. Two different descriptors were computed for each surface residue, namely the average co-evolution of the residue with respect to (i) its surface context (co c i ) and (ii) the rest of surface residues (co r i Þ. More formally, for a given surface residue i: sði; kÞ (5) sði; kÞ (6) where C(i) is the surface context (see Section 2.2.1) of the residue i, l s is the total number of surface residues and sði; kÞ is either the PSICOV or MI co-evolutionary score between residues i and k. The rationale behind this approach is to characterize the extent of co-evolutionary correlations of a given surface residue on both its structural proximity and the rest of the protein surface. Similar coevolutionary proximity-based indexes, based on MI, were also adopted in literature for catalytic site identification (Aguilar et al., 2012;Buslje et al., 2010).

Geometrical descriptors
Protrusion (Pintar et al., 2002) and depth (Pintar et al., 2003) indexes were computed for each surface residue using the Protein Structure And Interaction Analyzer (PSAIA) toolkit (Mihel et al., 2008). In particular, four descriptors were used to account for protrusion (total average, side-chain average, minimum and maximum) and three descriptors were used for depth index (total average, sidechain average and maximum). Minimum depth was not considered since it is always zero for surface residues.
Furthermore, secondary structure assignment were obtained using the DSSP program (Kabsch and Sander, 1983) and mapped to three different classes: helix (H, G, I), strand (E, B) and coil (T, S). Finally, for each surface residue, we computed three descriptors, representing the frequency of helical, strand and coil residues in the surface context of residue i, respectively,

Residue average B-factor
The average B-factor b i for the surface residue i was computed by averaging over individual B-factors of its atoms, as reported in the PDB file.

Difference between predicted and real RSA
The idea of using predicted solvent accessibility to discover PPI sites is originally due to Porollo and Meller (2007). In their work, authors showed that RSA predictions starting from protein primary sequences, obtained using their SABLE predictor (Adamczak et al., 2004), tend to be biased towards the level of solvent exposure observed in protein complexes. As a consequence, the difference between the predicted and the real (i.e. taken from the unbound structure) RSA values (here referred to as dRSA) could be used as fingerprint of PPI sites (Porollo and Meller, 2007). Subsequently, this feature was also included in a previous version of ISPRED (Savojardo et al., 2012). To capture this trend, in this work the dRSA descriptor was computed as follows: where C(i) is the surface context (see above) of the residue i, rsa p ðkÞ and rsa o ðkÞ are, respectively, predicted and observed RSA value for the residue k. RSA predictions from sequence were obtained using the SABLE predictor (Adamczak et al., 2004).

The SVM classifier
In this work, we adopt SVMs to predict whether a residue located at the surface of a query protein chain is part of a PPI interface or not. A SVM model was trained on protein chains in the DBv5Sel dataset whose surface residues were represented with the 46-dimensional feature vectors described in the previous section. The SVM model was trained/tested using the LIBSVM implementation (Chang et al., 2011) with a radial basis function (RBF) kernel. The penalty factor C (which controls the trade-off between margin size and training error) and the RBF c hyper-parameters were both optimized using a grid-search procedure. The optimal C and c were selected in the sets 2 À5 ; 2 À3 ; . . . ; 2 13 Â Ã and 2 À11 ; 2 À9 ; . . . ; 2 5 Â Ã , respectively. To assess the contribution of the different features in predicting protein-protein interfaces, we also trained/tested several SVM models using only specific subsets of the 46 input features. In all cases, the same grid-search procedure was applied to optimize hyperparameters C and c.

Grammar-based correction
Standard classification approaches like SVMs treat residues as independent from each other's, classifying them as interacting (label I) or not interacting (label N). As a consequence, possible correlations between neighbouring residues on the protein surface sequence are neglected. On the contrary, sequence-labeling methods like CRFs or HMMs are able to capture correlations between neighbouring labels and to introduce global constraints on the labeling of the whole protein surface sequence (Li et al., 2007;Liu et al., 2009;Savojardo et al., 2012).
In this paper, we combine the SVM classifier described in Section 2.3.1, with a Grammatical-Restrained Hidden Conditional Random Field (GRHCRF) model (Fariselli et al., 2009). GRHCRF is a discriminative framework that has been developed to address bio-sequence labeling tasks (Indio et al., 2013;Savojardo et al., 2011Savojardo et al., , 2013. In analogy with HMMs, GRHCRFs can be represented with a Finite State Automaton (FSA). The structure of the FSA casts the grammar modeling the constraints of the specific labeling problem at hand. The main difference with respect to standard CRFs (Lafferty et al., 2001) is the explicit inclusion of hidden variables represented by the states of the FSA.
More formally, the labelling task is to map an observation sequence x ¼ x 1 ; x 2 ; . . . ; x L ð Þ(x i is a feature-vector) into a label sequence y ¼ y 1 ; y 2 ; . . . ; y L ð Þ , where each y i belongs to a finite alphabet Y. The labelling process is mediated by a layer of 'hidden' states h ¼ h 1 ; h 2 ; . . . ; h L ð Þ . Each state h i is member of a finite set H. A one-to-many mapping is established between labels and states: each hidden state h belongs to the subset H y ðH y & HÞ of states associated to a given label y. Furthermore, the set of allowed hidden-state transitions is controlled by means of a user-defined FSA. In other words, only state paths allowed by the FSA are taken into consideration by the model (Fariselli et al., 2009).
A GRHCRF is a model of the conditional probability distribution of label sequences given observations, defined as where Z(x,y) and Z(x) are partition functions, defined as follows: The partition function Z x; y ð Þ is obtained by keeping fixed the label sequence and summing over all possible corresponding hidden-state sequences; Z x ð Þ is computed by summing over all possible label and state sequences (Fariselli et al., 2009). Wðh j ; h jÀ1 ; y j ; xÞ are called ISPRED4: interaction site PREDiction potential functions and score the transition of the model from state h jÀ1 to state h j at time j, assuming the usual log-linear form (Fariselli et al., 2009;Lafferty et al., 2001): where the indicator function 1 hjÀ1!hj 2FSA f g (which is 1 only when the condition in the brackets is met, otherwise it takes the value of 0), is used to enforce the grammatical constraints. The notation :; : h i defines the standard dot product, and x hj and k hj;hjÀ1 are model parameters, scoring, respectively, the compatibility between the state and the observation at position j and the transition from state h jÀ1 to state h j . These parameters are optimized by maximizing the log-likelihood over training data. Once the model is trained, an input observation sequence is labelled efficiently using posterior Viterbi decoding (Fariselli et al., 2009). For the specific application described in this paper, the observations are defined as sequences of two-dimensional feature-vectors. The feature-vector components are the probability of being, or not being in an interaction site as computed by the SVM (using the -b option of LIBSVM).
The grammar for the GRHCRF model used here is depicted in Figure 1. Two kinds of states are defined: interaction states, labelled as Ix, and non-interaction states, labelled as Nx. The grammar models the proximity relationships of interface residues along the sequence. GRHCRF introduces correlations among the different SVM predictions by filtering out single isolated spots and by reinforcing locally coherent predictions.

Performance evaluation at the residue level
Standard scoring measures were adopted to score the method at the level of binary residue classification. In what follows, let TP, FP, TN and FN be true positives, false positives, true negatives and false negatives, respectively (we consider the interaction site label as the positive class). The following scoring measures were adopted to score interface residue predictions: Recall (true positive rate) of the positive class [Rec(I)], defined as: Precision of the positive class [Pre(I)], defined as: The F1-score of the positive class [F1(I)], defined as: The classification accuracy [Q 2 ], defined as: The Matthews Correlation Coefficient [MCC], defined as:

Performance evaluation at the patch level
In order to better characterize predicted PPI sites, we also evaluated the performance in identifying interaction patches, i.e. clusters of neighbouring PPI sites identified on the protein surface. More formally, let I T and I P be, respectively, the sets of true and predicted PPI sites in the protein. In each set, we built a graph by connecting pairs of residues whose Ca-Ca distance is below 12Å . We defined the interaction patches as the connected components of the graphs.
In analogy with the Segment OVerlap (SOV) measure, previously introduced to score secondary structure prediction at the segment level (Zemla et al., 1999), we defined the Patch OVerlap (POV) index to measure the overlap between true and predicted interaction patches: where jpj indicate the cardinality of the patch and the quantity d p; q ð Þis defined as: d p; q ð Þ¼min p j j=2; q j j=2; p \ q j j; p j j þ q j j À 2 Â p \ q j j f g : (18)

Cross-validation
Primary sequences were extracted for each unbound chain in the DBv5Sel dataset from the corresponding PDB entry. Sequences were pairwise aligned using the blastp program (Altschul et al., 1997) and clusters of similar sequences were computed. Two chain sequences were put into the same cluster if blastp detected a pairwise sequence identity ! 25% (no coverage threshold was adopted in order to also consider local sequence similarity). A 10-fold cross-validation split was then generated by randomly assigning each cluster to one of the different cross-validation subsets. By adopting this procedure, we ensured that even local sequence identity was confined into the same cross-validation subset, avoiding any possible training/testing bias.

PPI site prediction with different descriptors
The discriminative power of the different descriptors used in this study is evaluated by training and testing our method on different combinations of descriptor sets. Table 3 lists the 10-fold crossvalidation results on the DBv5Sel dataset for each descriptor set. At this stage, the goal is to evaluate the effect of the different input features. Results are listed without applying the grammar correction described in Section 2.3.2, directly using the SVM classification outputs (see Section 2.3 for details), The baseline predictor is defined as the one taking in input only the 20 sequence profile descriptors. The first two rows of Table 3 show that averaging of the sequence profile over the surface context (see Section 2.2.1) improves the performance. Indeed, when the individual profile vector of each surface residue is used (first row) instead of the average (second row), the MCC drops from 0.33 to 0.27. For this reason, in the following experiments, we adopted the average profile as the baseline feature. All other classifiers were hence generated adding to this baseline predictor one descriptor set at a time.
As reported in Table 3, all descriptors positively contribute to the prediction performance. With the exception of the residue interface propensity, all descriptors improve the MCC when compared with the baseline. The most informative descriptors are secondary structure and residue physico-chemical properties, both improving MCC by 5 points with respect to the baseline (from 0.33 to 0.38). The two different coevolution scores (MI and PSICOV) result in very similar performances, with a slight increase of MCC, when PSICOV is adopted. For this reason, PSICOV score is used in all the subsequent experiments. As expected, structure-based descriptors (secondary structure, dRSA, depth and protrusion indexes, B-factor) are on average more informative than sequence-based ones (residue properties, co-evolution and conservation scores, propensity).
To better highlight this trend, we also trained and tested two classifiers adding to the baseline, respectively, the subsets of sequence (i.e. 34 features comprising PROF, PROP, CONS, PSICOV and RPROP) and structural (i.e. 32 features comprising PROF, BFACT, DPX, CX, dRSA and SS) descriptors. Performances of these two classifiers are reported in Table 3, in rows SEQUENCE and STRUCTURE, respectively. As expected, the structural features perform better than the sequence ones. The combination of sequence and structure descriptors further improves over top-scoring individual feature sets.

Scoring ISPRED4 performances on the full descriptor set
In the last two rows of Table 3, we also list the cross-validation performances of ISPRED4 trained/tested using the full descriptor set. For the sake of comparison, we report scoring measures obtained with and without the GRHCRF-based grammar correction described as in Section 2.3.2.
When all the features are included, the prediction performance improves up to 0.46 of MCC (compare values in Table 3). This suggests that the contribution of each descriptor is non-redundant with respect to the others.
Furthermore, it is evident from the last row (Table 3) that the grammar correction further improves the prediction performance. In particular, precision increases up to 0.78, at the cost of only a slight decrease in recall, leading to an overall MCC value of 0.48. Given the size and distribution of interaction residues in our dataset, this result is among the best reported in literature (Aumentado-Armstrong et al., 2015;Esmaielbeiki et al., 2016).
The beneficial effect of the grammar correction is more evident when ISPRED4 is scored at the level of interaction patches. Indeed, when the grammar correction is enforced, the POV index measuring the overlap between predicted and real patches improves from 0.41 up to 0.57. This indicates that predictions after the GRHCRF-based correction are more similar to the real PPI patches.

Comparison with other predictors
We compared the ISPRED4 predictor (including both the SVM and the GRHCRF) with other available state-of-the-art approaches on both the DBv5Sel and the CAPRI-Blind datasets. The benchmarked methods are our previously released ISPRED3 (Savojardo et al. 2012), PredUS (Zhang et al., 2011), PrISE (Jordan et al., 2012), SPPIDER (Porollo and Meller, 2007), ProMate (Li et al., 2008), cons-PPISP (Chen and Zhou, 2005) and metaPPI (Huang and Schroeder, 2008). All the methods are available as web servers and are among the best performing approaches developed so far for predicting PPI sites (Aumentado-Armstrong et al., 2015;Xue et al. 2015).
Residue-and patch-level prediction performances evaluated on both datasets DBv5Sel and CAPRI-Blind are reported in Table 4, sorted by MCC values computed at the residue level.
It is worth noticing that only ISPRED4 is tested in real crossvalidation (as described above), since we cannot control the overlap between the benchmark sets and the training sets adopted to develop the released web-servers.
Results listed in Table 4 indicate that ISPRED4 scores with the highest MCC and POV values on both testing and blind datasets. The true positive rate (recall) is lower than that of other predictors, indicating that ISPRED4 labels as 'interacting' less residues than other methods, however, with a higher likelihood to be correct. Indeed, ISPRED4 is endowed with the highest precision with respect to other predictors (Table 4).

Conclusion
The availability of accurate methods to identify PPI sites is crucial to characterize protein function and identify functionally important residues on protein surfaces. Moreover, PPI site predictions can be also incorporated into protein-docking methods to speed-up the search procedure in the conformational space (Zhou and Qin, 2007). In this paper, we present ISPRED4, an improved structurebased predictor of PPI sites. As a classification method, ISPRED4 adopts a combination of SVMs and probabilistic sequence labelling. We performed cross-validation experiments on a newly generated dataset derived from the Docking Benchmark v5 (Vreven et al., 2015) and consisting of 151 high-resolution protein complexes. Note: *PROF (no context) ¼ the individual profile vector of each residue is used instead of averaging over the surface context. **SEQUENCE ¼ the subset of 34 sequence-based features comprising the baseline PROF as well as PROP, CONS, PSICOV and RPROP. ***STRUCTURE ¼ the subset of 32 structure-based features comprising the baseline PROF as well as BFACT, DPX, CX, dRSA and SS.
Furthermore, comparative benchmarks were also performed on a blind test set derived from previous CAPRI experiments.
We trained and tested ISPRED4 on unbound complexes, which is a more stringent evaluation than those usually carried out for this task. Nonetheless, when compared with other state-of-the-art approaches, ISPRED4 out-performs other top-performing methods, on both residue-and patch-level performance measures, scoring as one of the best tools developed so far for PPI site prediction.

Funding
This work was partially supported by: COST BMBS Action TD1101 and Action BM1405 (European Union RTD Framework Program, to R.C.); FARB UNIBO 2012 (to R.C.) Conflict of Interest: none declared.   Huang and Schroeder (2008).