Balachandran Manavalan, Jooyoung Lee, SVMQA: support-vector-machine-based protein single-model quality assessment, Bioinformatics, Volume 33, Issue 16, August 2017, Pages 2496–2503, https://doi.org/10.1093/bioinformatics/btx222
Abstract
The accurate ranking of predicted structural models and the selection of the best model from a given candidate pool remain open problems in the field of structural bioinformatics. The quality assessment (QA) methods used to address these problems can be grouped into two categories: consensus methods and single-model methods. Consensus methods generally perform better and attain higher correlation between predicted and true quality measures. However, they frequently fail to assign proper quality scores to native-like structures that are distinct from the rest of the pool. Conversely, single-model methods do not suffer from this drawback and are better suited for real-life applications where many models from various sources may not be readily available.
In this study, we developed a support-vector-machine-based single-model global quality assessment (SVMQA) method. For a given protein model, the SVMQA method predicts TM-score and GDT_TS score based on a feature vector containing statistical potential energy terms and consistency-based terms between the actual structural features (extracted from the three-dimensional coordinates) and predicted values (from primary sequence). We trained SVMQA using CASP8, CASP9 and CASP10 targets and determined the machine parameters by 10-fold cross-validation. We evaluated the performance of our SVMQA method on various benchmarking datasets. Results show that SVMQA outperformed the existing best single-model QA methods both in ranking provided protein models and in selecting the best model from the pool. According to the CASP12 assessment, SVMQA was the best method in selecting good-quality models from decoys in terms of GDTloss.
The SVMQA method can be freely downloaded from http://lee.kias.re.kr/SVMQA/SVMQA_eval.tar.gz.
Supplementary data are available at Bioinformatics online.
1 Introduction
The three-dimensional (3D) structure of a protein is essential for understanding the biomolecule’s functions in detail (Baker and Sali, 2001). The completion of many genome sequencing projects has produced a massive amount of protein sequence data (Lander et al., 2001), but it is estimated that less than 1% of these protein sequences have native 3D structures deposited in the Protein Data Bank (PDB) (Rigden, 2009). Common experimental techniques such as X-ray crystallography, NMR and electron microscopy are expensive and often time-consuming ways of determining the 3D structures of uncharacterized protein sequences. As a result, a huge gap exists between the number of known protein sequences and the number of experimentally solved 3D structures, and this sequence-structure gap is projected to widen further in the coming years (Wong, 2016). Currently, successful protein structure prediction is the only practical way of closing this gap. Owing to advances in computing power, numerous alternative 3D models for a given protein sequence can be generated with little computational burden. However, properly ranking these predicted models and selecting the best one from the pool remain challenging problems in structural bioinformatics (Kihara et al., 2009).
The two major steps of protein structure prediction are model sampling and model ranking. The first step generates a large number of plausible 3D models for a given target; the second step ranks these structural models so that the best model can be selected in a consistent manner. High-quality 3D models generated in this way are useful in a wide range of biological applications, including ligand docking and functional annotation. Generally, there are two different approaches to evaluating the global quality of a predicted model: the single-model approach uses only the given model (Cao et al., 2014; Manavalan et al., 2014; Uziela and Wallner, 2016; Yang and Zhou, 2008a,b; Zhang and Zhang, 2010; Zhou and Skolnick, 2011), while the consensus approach uses multiple models (McGuffin, 2008, 2009; McGuffin and Roche, 2010; Roche et al., 2014; Skwark and Elofsson, 2013). Each has its own strengths and weaknesses. In previous Critical Assessment of Techniques for protein Structure Prediction (CASP) experiments, the best consensus method performed better than the best single-model method; however, when poor-quality models dominated the model pool, single-model methods performed better. Moreover, the computational cost of consensus methods increases as the square of the number of models, which makes them slow to apply to large model pools. Conversely, single-model methods do not suffer from these drawbacks and are better suited for real-life applications where many models from various sources may not be readily available. Single-model methods can be grouped into three categories: (i) physics-based potential functions, (ii) statistical potential functions and (iii) machine-learning-based functions. Machine learning (ML) algorithms such as support vector machines (SVM), neural networks and random forests (RF) evaluate model quality according to learned ‘rules’ (Ginalski et al., 2003; Manavalan et al., 2014; Uziela and Wallner, 2016; Wang et al., 2009). Various features extracted from the sequence and structure of a protein are used as input to these machines, from which the model quality is obtained. The major advantage of ML methods is that they can consider a large number of features simultaneously, often capturing hidden relationships among them that are hard to deduce with statistical potentials alone.
In this study, we present SVMQA, a support-vector-machine-based single-model quality assessment method that combines two independent predictors, SVMQA_TM and SVMQA_GDT. As input, both predictors use statistical potential energy-based terms and consistency-based terms (between values predicted from the primary sequence and structural features extracted from the 3D structure) to predict the global quality assessment (QA) score (TM-score or GDT_TS score). We trained both predictors in a similar way, with slight variations in input features and objective values. The first predictor was trained with TM-score as the objective value (SVMQA_TM); TM-score measures the global fold similarity between two structures and is less sensitive to local structural variations (Xu and Zhang, 2010; Zhang and Skolnick, 2004). The second predictor was trained with the Global Distance Test Total Score (GDT_TS) as the objective value (SVMQA_GDT); GDT_TS is calculated from the largest sets of alpha-carbon atoms that fall within pre-defined distance cut-offs of their positions in the experimental structure, and it is one of the primary metrics used for 3D model evaluation by CASP assessors (Kryshtafovych et al., 2015). We applied SVMQA to various benchmarking datasets, and the results show that SVMQA performed significantly better than other single-model methods both in ranking protein 3D models and in selecting the best model from the pool. Moreover, SVMQA was blindly tested in the CASP12 experiment. According to the assessors, SVMQA was the best among the 42 participating QA methods at selecting good-quality models from decoys (http://predictioncenter.org/casp12/qa_diff2best.cgi).
2 Materials and methods
2.1 Dataset
In this work, we considered CASP8-10 single-domain targets and individual domains from the multi-domain targets. The final dataset contained 164, 146 and 119 domains from CASP8, CASP9 and CASP10, respectively. All server models were downloaded from the CASP website (http://predictioncenter.org/download_area/). To train the SVM with TM-score as the objective value, we compared each model with the corresponding experimental structure and measured its global similarity in terms of TM-score (Xu and Zhang, 2010; Zhang and Skolnick, 2004). Prior to training, we screened out clearly poor models by sorting the models of each target according to TM-score and keeping only the top 60%. In addition, we excluded targets whose average TM-score was less than 0.3. The final dataset contained a total of 390 domains (153 from CASP8, 134 from CASP9 and 103 from CASP10). These targets were divided into seven groups based on their average TM-score (X < average TM-score ≤ X + 0.1, with X ranging from 0.3 to 0.9 in steps of 0.1). For each group, we randomly selected 80% of the targets for the training dataset and used the remaining targets as the testing dataset. The final training and testing datasets contained 312 and 78 domains, respectively.
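As an illustration of this protocol, the Python sketch below screens each target's models and performs the stratified 80/20 split; the GDT_TS-based dataset described next follows the same procedure. The function and variable names (screen_target, split_targets, target_means) are illustrative, not taken from the paper's own scripts.

```python
import random
import numpy as np

TM_MIN = 0.3         # targets whose remaining models average below this are discarded
KEEP_FRACTION = 0.6  # keep only the top 60% of models per target

def screen_target(tm_scores):
    """Sort a target's models by TM-score, keep the top 60%, and return None
    if the kept models average below the 0.3 threshold."""
    kept = np.sort(np.asarray(tm_scores, dtype=float))[::-1]
    kept = kept[: max(1, int(round(KEEP_FRACTION * len(kept))))]
    return kept if kept.mean() >= TM_MIN else None

def split_targets(target_means, seed=0):
    """Bin targets into 7 groups by average TM-score (0.3-1.0 in steps of 0.1)
    and randomly assign 80% of each bin to training, 20% to testing."""
    rng = random.Random(seed)
    train, test = [], []
    for lo in (0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
        group = sorted(t for t, m in target_means.items() if lo < m <= lo + 0.1)
        rng.shuffle(group)
        n_train = int(round(0.8 * len(group)))
        train.extend(group[:n_train])
        test.extend(group[n_train:])
    return train, test
```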
To train the SVM with GDT_TS score as the objective value, GDT_TS scores were taken from the CASP website. GDT_TS, the standard CASP evaluation measure, lies in the range [0, 100]; we normalized it to [0, 1] for ML. Using a protocol similar to the one described above, we kept only the top 60% of models based on GDT_TS score and excluded targets whose average GDT_TS score was less than 0.3. The final dataset contained 385 domains (157 from CASP8, 131 from CASP9 and 97 from CASP10), five targets fewer than the final dataset obtained based on TM-score. These 385 domains were likewise divided into seven groups based on the average GDT_TS score. For each group, we randomly selected 80% of the targets for the training dataset and used the remaining targets as the testing dataset. The final training and testing datasets contained 312 and 73 domains, respectively.
2.2 Feature generation
The aim of the current experiment was to train an SVM to accurately map input features extracted from a 3D model to its TM-score/GDT_TS score; this is a regression problem. The most crucial part of this task is extracting a set of relevant features. In this study, we considered a total of 19 features (8 potential-energy-based terms and 11 consistency-based terms between predicted and actual values of the model).
2.2.1 Potential energy-based terms
We considered various potential energy-based terms typically used for QA. RWplus is a pairwise distance-dependent atomic statistical potential that uses an ideal random-walk chain as the reference state (Zhang and Zhang, 2010). OPUS-PSP includes an orientation-dependent energy (ODE) and a Lennard-Jones repulsive energy (Lu et al., 2008); we used both OPUS-PSP and ODE as separate input features. dDFIRE is based on distance-dependent pairwise energy terms (DFIRE) and the orientation between atoms (polar–polar, polar–non-polar) involved in dipole–dipole interactions (Yang and Zhou, 2008a,b); we used both dDFIRE and DFIRE as separate input features. GOAP includes DFIRE and angle-dependent terms (AG); we used GOAP, DFIRE and GOAP_AG as three separate input features (Zhou and Skolnick, 2011). It should be noted that DFIRE is already included in both dDFIRE and GOAP, but because a slightly different cut-off (rcut) value is used in the two DFIREs (Yang and Zhou, 2008a,b; Zhou and Skolnick, 2011), we indicated their origins separately by naming them DFIREdDFIRE and DFIREGOAP. In total, we used 8 energy terms (dDFIRE, DFIREdDFIRE, RWplus, OPUS-PSP, ODE, GOAP, DFIREGOAP and GOAP_AG) as input features. These energy scores were normalized into the range [0, 1] according to the formula described in our previous study (Manavalan et al., 2014).
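The exact normalization formula is given in the earlier RFMQA study; purely as an illustration, the sketch below assumes a per-target min-max scaling in which the lowest (best) energy maps to 1 and the highest to 0, so that all energy features share a "higher is better" orientation.

```python
import numpy as np

def normalize_energies(energies):
    """Illustrative per-target min-max normalization of one energy term.
    energies: 1-D array of that term's values over all models of a target."""
    e = np.asarray(energies, dtype=float)
    span = e.max() - e.min()
    if span == 0.0:                 # all models score identically for this term
        return np.ones_like(e)
    return (e.max() - e) / span     # lower (better) energy -> value closer to 1
```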
2.2.2 Consistency between predicted and actual values
A total of 11 features were extracted from the consistency-based terms between actual and predicted values (see Supplementary data). To the best of our knowledge, among the 19 features used in this study, three (ODE, GOAP_AG and SA_score) had never been used before, while the other 16 have been used previously in existing ML-based methods (Cao et al., 2016; Jing et al., 2016; Manavalan et al., 2014; Wang et al., 2009).
2.3 Benchmark datasets
We used four datasets to benchmark SVMQA. The first dataset was the full set of I-TASSER decoys (Zhang and Zhang, 2010), downloaded from http://zhanglab.ccmb.med.umich.edu/decoys/decoy2.html. The second dataset was the full set of 3DRobot decoys, which consisted of four subdivided decoy sets: 3DRobot_set, On_Rosetta_set, On_Modeller_set and On_I-TASSER_set. 3DRobot_set contained 200 non-homologous proteins, each with 300 structural decoys whose RMSD ranged from 0 to 12 Å (Deng et al., 2016). The remaining three subsets (On_Rosetta_set (20 proteins), On_Modeller_set (58 proteins) and On_I-TASSER_set (56 proteins)) were generated by 3DRobot, with proteins taken from the respective original decoy papers (John and Sali, 2003; Simons et al., 1999; Zhang and Zhang, 2010). These decoys were downloaded from http://zhanglab.ccmb.med.umich.edu/3DRobot/decoys/. The third dataset consisted of our in-house server models generated during CASP11. The final dataset consisted of the CASP11 server models, taken from http://www.predictioncenter.org/download_area/CASP11/server_predictions/. Finally, SVMQA was blindly tested in CASP12.
2.4 Evaluation parameters
Model accuracy was evaluated using three complementary measures: (i) Pearson’s correlation coefficient (CCrank), Spearman’s rank correlation (ρrank) and Kendall’s tau correlation (τrank) between the actual ranking and the predicted ranking; (ii) average TM-score or GDT_TS loss; and (iii) Z-score. The definitions of these metrics are given in the Supplementary data.
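As a rough illustration of how these measures can be computed per target (the authoritative definitions are in the Supplementary data), the sketch below assumes that the correlations are taken between true and predicted quality scores, that the loss is the gap between the best available model and the model ranked first by the QA method, and that the Z-score is that of the selected model within the decoy pool.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

def evaluate_target(true_scores, predicted_scores):
    """Per-target evaluation: rank correlations, loss of the selected model,
    and Z-score of the selected model within the decoy pool."""
    t = np.asarray(true_scores, dtype=float)
    p = np.asarray(predicted_scores, dtype=float)
    cc_rank = pearsonr(t, p)[0]
    rho_rank = spearmanr(t, p)[0]
    tau_rank = kendalltau(t, p)[0]
    picked = int(np.argmax(p))                  # model ranked first by the QA method
    loss = t.max() - t[picked]                  # TMloss or GDTloss for this target
    z_score = (t[picked] - t.mean()) / t.std()  # Z-score of the selected model
    return cc_rank, rho_rank, tau_rank, loss, z_score
```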
2.5 Construction of SVMQA
A detailed description of the SVM used in this study is provided in the Supplementary data. In this section, we describe the parameter optimization and feature selection processes of SVMQA. To construct SVMQA, we used 19 input features to predict the QA score of a given 3D model in the range [0, 1]. To estimate the importance of each feature, we employed the RF ML method; a detailed description of how the importance of an input feature is estimated is given in our previous studies (Lee and Lee, 2013; Lee et al., 2015; Manavalan et al., 2014). We then carried out 10-fold cross-validation on the training dataset. For each round of cross-validation, we built 500 trees with the number of variables at each node chosen randomly from 1 to 9. The ensemble average of the feature importance score (FIS) over all trees (10-fold cross-validation) is shown in Figure 1. The results show that GOAP_AG, %B and ASA_Cor made significant contributions when the RF was trained with either GDT_TS or TM-score as the objective value. Additionally, ASA_Cos made a significant contribution when the RF was trained with GDT_TS. Overall, the FISs were similar between the two RF machines.
Figure 1 caption: The input features and their feature importance scores (FISs), calculated separately using GDT_TS score and TM-score as the objective value.
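A minimal sketch of this importance estimation with scikit-learn is shown below. It uses impurity-based importances as a stand-in for the FIS described in the studies cited above, and, since scikit-learn cannot pick the number of candidate variables randomly at each node, it averages over fixed max_features values from 1 to 9; X and y denote an assumed (n_models x 19) feature matrix and the corresponding objective values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def average_feature_importance(X, y, n_splits=10):
    """Average feature importances over 10 CV folds and max_features 1..9,
    with 500 trees per forest (a sketch, not the original FIS procedure)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    importances = []
    for train_idx, _ in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        for mtry in range(1, 10):  # candidate numbers of variables per split
            rf = RandomForestRegressor(n_estimators=500, max_features=mtry,
                                       random_state=0, n_jobs=-1)
            rf.fit(X[train_idx], y[train_idx])
            importances.append(rf.feature_importances_)
    return np.mean(importances, axis=0)  # one averaged score per input feature
```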
Based on Figure 1, we generated different feature sets by varying the FIS cut-off (0.01 ≤ cut-off ≤ 0.30 with a step size of 0.01) and used the features whose FIS exceeded the cut-off as the SVM input for predicting either TM-score or GDT_TS score. For each feature set, we optimized the ML parameters (C, γ, ε) by 10-fold cross-validation on the training dataset (see Section 2.1) and selected the parameters that gave the highest average CCrank between the actual and predicted rankings over all targets in the training set. These parameters were then applied to the testing dataset (see Section 2.1) to check their transferability. Based on the degree of consistency between the training and testing results, we selected the final set of parameters and features separately for SVMQA_TM and SVMQA_GDT. SVMQA_TM uses all 19 features, while SVMQA_GDT uses only 15 (the features indicated by filled circles in Figure 1).
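A minimal sketch of this parameter search is shown below, assuming an RBF-kernel ε-SVR, an illustrative (C, γ, ε) grid and a scoring function that averages the per-target Pearson correlation over the held-out targets of each cross-validation fold; the helper names and grid values are assumptions, not the actual protocol.

```python
import itertools
import numpy as np
from scipy.stats import pearsonr
from sklearn.svm import SVR

def mean_cc_rank(model, fold_targets):
    """fold_targets: list of (X_target, y_target) pairs for held-out targets."""
    return float(np.mean([pearsonr(y, model.predict(X))[0] for X, y in fold_targets]))

def grid_search(train_folds):
    """train_folds: list of (train_X, train_y, held_out_targets) per CV fold.
    Returns the (C, gamma, epsilon) triple with the best average per-target CCrank."""
    best_params, best_score = None, -np.inf
    for C, gamma, eps in itertools.product([1, 10, 100], [0.01, 0.1, 1.0], [0.01, 0.1]):
        fold_scores = []
        for train_X, train_y, held_out in train_folds:
            svr = SVR(kernel='rbf', C=C, gamma=gamma, epsilon=eps).fit(train_X, train_y)
            fold_scores.append(mean_cc_rank(svr, held_out))
        score = float(np.mean(fold_scores))
        if score > best_score:
            best_params, best_score = (C, gamma, eps), score
    return best_params, best_score
```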
Finally, the SVMQA score is defined as the average of the SVMQA_TM and SVMQA_GDT scores. To the best of our knowledge, SVMQA is the only single-model QA method that uses two separate predictors; other single-model QA methods such as QApro, QAcon, ModelEvaluator, MQAPrank, ProQ2 and RFMQA use a single predictor for either GDT_TS score or TM-score (Cao et al., 2016; Cao and Cheng, 2016; Jing et al., 2016; Manavalan et al., 2014; Ray et al., 2012; Uziela and Wallner, 2016; Wang et al., 2009).
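A minimal sketch of this final averaging step, assuming svm_tm and svm_gdt are the two trained predictors from the search above (names illustrative):

```python
def svmqa_score(features_tm, features_gdt, svm_tm, svm_gdt):
    """SVMQA score of one model: the average of its two predicted quality scores."""
    tm_pred = svm_tm.predict([features_tm])[0]     # predicted TM-score in [0, 1]
    gdt_pred = svm_gdt.predict([features_gdt])[0]  # predicted (normalized) GDT_TS
    return 0.5 * (tm_pred + gdt_pred)
```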
3 Results and discussions
SVMQA was constructed using the dataset described in Section 2.1. In this section, we describe the performance of SVMQA on various benchmarking datasets.
3.1 Performance of SVMQA on I-TASSER decoys
We evaluated the performance of SVMQA on the I-TASSER set, which consists of 56 non-homologous targets, each with about 300–500 models (Zhang and Zhang, 2010). This set has previously been used to test the ability of QA methods (such as GOAP, RWplus and OPUS-PSP) to identify near-native structures and to evaluate the TM-score-energy correlation (Lu et al., 2008; Manavalan et al., 2014; Zhang and Zhang, 2010; Zhou and Skolnick, 2011). Table 1 lists the performance of SVMQA along with those of other energy-based methods. SVMQA is ranked at the top on all metrics tested, including TMloss. Using a P-value threshold of 0.01, we observed a significant difference between SVMQA and three energy-based methods (GOAP, RWplus and OPUS-PSP) in terms of CCrank, and a significant difference between SVMQA and OPUS-PSP in terms of average TMloss. These results suggest that SVMQA is better than the energy-based methods both at ranking the structural models and at selecting good models on the I-TASSER dataset.
Table 1. Performance of SVMQA and other statistical potential energy-based methods on the I-TASSER dataset
| Method | CCrank | ρrank | τrank | Avg. TMloss | P-value (TMloss) | P-value (CCrank) | ΣZ-score |
|---|---|---|---|---|---|---|---|
| SVMQA | 0.551 | 0.458 | 0.322 | 0.088 | – | – | 51.703 |
| dDFIRE | 0.525 | 0.434 | 0.304 | 0.100 | 0.536 | 0.0774 | 42.478 |
| DFIREGOAP | 0.520 | 0.425 | 0.298 | 0.101 | 0.620 | 0.0497 | 43.129 |
| RWplus | 0.488 | 0.416 | 0.291 | 0.101 | 0.459 | 0.000997 | 40.646 |
| GOAP | 0.477 | 0.392 | 0.272 | 0.111 | 0.144 | 9.8E-05 | 34.696 |
| OPUS-PSP | 0.282 | 0.286 | 0.195 | 0.130 | 0.00331 | 1.5E-10 | 20.470 |
The first column gives the method name. The second, third and fourth columns give the average Pearson’s correlation coefficient (CCrank), average Spearman’s correlation (ρrank) and average Kendall’s tau correlation (τrank) between the actual and predicted rankings. The fifth column gives the average TMloss. The sixth and seventh columns give the P-values (pairwise Wilcoxon signed-rank test) for the difference in TMloss and in CCrank, respectively, between SVMQA and the other methods; a P-value ≤ 0.01 indicates a statistically significant difference between SVMQA and the selected method (shown in boldface). The final column gives the sum of Z-scores for the first-ranked model of each method.
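The significance values reported here and in the following tables can, in principle, be reproduced with a pairwise Wilcoxon signed-rank test on per-target values; a minimal SciPy-based sketch (not the evaluation script used for the tables) is given below.

```python
from scipy.stats import wilcoxon

def compare_methods(values_method_a, values_method_b):
    """Pairwise Wilcoxon signed-rank test on per-target values (e.g. TMloss or
    CCrank) of two methods, aligned by target; P <= 0.01 is taken as significant."""
    stat, p_value = wilcoxon(values_method_a, values_method_b)
    return p_value
```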
3.2 Performance of SVMQA on 3DRobot decoys
We evaluated the performance of SVMQA and other energy-based methods on the 3DRobot decoys, which contain 334 non-homologous targets (Deng et al., 2016). Table 2 shows that, again, SVMQA is ranked at the top on all metrics used. We note that the difference in TMloss between SVMQA and OPUS-PSP is not significant (P-value 0.335); however, we observed a significant difference in CCrank between SVMQA and all five energy-based methods. The combined performance of SVMQA on the I-TASSER and 3DRobot sets shows that combining statistical potential energy-based terms with consistency-based terms between predicted and calculated values of 3D models improves QA performance.
Table 2. Performance of SVMQA and other statistical potential energy-based methods on the 3DRobot decoys
| Method | CCrank | ρrank | τrank | Avg. TMloss | P-value (TMloss) | P-value (CCrank) | ΣZ-score |
|---|---|---|---|---|---|---|---|
| SVMQA | 0.910 | 0.882 | 0.713 | 0.035 | – | – | 597.423 |
| OPUS-PSP | 0.807 | 0.752 | 0.570 | 0.036 | 0.3349 | 6.4E-54 | 594.003 |
| GOAP | 0.883 | 0.849 | 0.671 | 0.052 | 0.00333 | 1.6E-18 | 567.896 |
| RWplus | 0.834 | 0.806 | 0.624 | 0.071 | 0.00039 | 8.4E-50 | 533.271 |
| DFIREGOAP | 0.840 | 0.808 | 0.627 | 0.074 | 5.9E-05 | 1.8E-47 | 528.946 |
| dDFIRE | 0.785 | 0.763 | 0.585 | 0.087 | 0.00016 | 1.6E-54 | 510.039 |
The legend is the same as in Table 1.
3.3 Performance of SVMQA on in-house CASP11 models
We evaluated the performance of SVMQA on the protein 3D models generated by nns (our in-house server) during CASP11. The modeling protocol is described in detail elsewhere (Joo et al., 2015a,b; Joung et al., 2015). We applied SVMQA to select the best model from the decoys (TMSVMQA) and compared it with the best model selected by our in-house single-model QA (ihQA) method (Joo et al., 2014, 2015a,b) and with the nns1 model, which was submitted as the best model (model 1) by our group during the CASP11 experiment. For this experiment, we considered 52 single-domain targets from CASP11. The pairwise comparison between TMSVMQA and the TM-score of the model selected by ihQA (TMihQA) is shown in Supplementary Figure S1A. For five targets (T0766, T0769, T0782, T0803 and T0822), model selection by SVMQA was significantly better than that by ihQA, while for the remaining targets the results were similar. During CASP11, in addition to ihQA, we also applied a consensus method to select the top model (nns1). Supplementary Figure S1B shows the pairwise comparison between TMSVMQA and TMnns1; the SVMQA models for two targets (T0769 and T0829) were significantly better than the nns1 models. The total sums of the selected models' TM-scores were 33.631 (SVMQA), 33.04 (nns1) and 32.30 (ihQA), suggesting that SVMQA model selection would have led to better nns1 models.
3.4 Performance of SVMQA on CASP11 targets
We evaluated the performance of SVMQA on CASP11 targets. For this purpose, we used the 88 targets of both Stage1 and Stage2 that were used in the official CASP11 assessment. The average runtime for a single Stage2 target with an average length of 265 amino acids was 9 min 54 s (feature calculation: 9 min 51 s; SVMQA calculation: 3 s) using 1 CPU core (Intel Xeon E5540 @ 2.53 GHz); SANN and PSI-BLAST were the two most time-consuming steps of the calculation. We compared the performance of SVMQA with the five best-performing single-model QA methods [ProQ2 (Ray et al., 2012), ProQ2-refine (Uziela and Wallner, 2016), MULTICOM-NOVEL (Cao and Cheng, 2016), MULTICOM-CLUSTER (Cao et al., 2014) and VoroMQA] according to the CASP11 assessment (Kryshtafovych et al., 2015), as well as with our quasi-single-model method RFMQA (Manavalan et al., 2014) and with a consensus method, Wallner (Larsson et al., 2009; Ray et al., 2012). All values for these seven methods are from the official CASP11 assessment (http://predictioncenter.org/casp11/qa_analysis.cgi). All GDT_TS scores shown in this work range between 0 and 100.
3.4.1 Performance of various methods on Stage1 CASP11 targets
Table 3 shows the performance of SVMQA, RFMQA and the other top five single-model QA methods on Stage1 of CASP11, sorted by GDTloss. For comparison, the best consensus method, Wallner, is also included. We note that ProQ2 outperformed SVMQA in ΣZ-score by a rather large margin of 13.398, while in the other three metrics (ρrank, τrank and GDTloss) SVMQA was better or similar. Six targets (T0768, T0771, T0775, T0780, T0796 and T0818) were responsible for this ranking variation. A pairwise comparison between ProQ2 and SVMQA in terms of GDTloss and Z-score is shown in the Supplementary data.
Table 3. Performance of SVMQA and other top QA methods on Stage1 of CASP11
| Method | CCrank | ρrank | τrank | Avg. GDTloss | P-value (GDTloss) | P-value (CCrank) | ΣZ-score |
|---|---|---|---|---|---|---|---|
| Wallner | 0.759 | 0.717 | 0.581 | 5.322 | 0.002 | 8.5E-14 | 144.260 |
| SVMQA | 0.640 | 0.563 | 0.429 | 7.874 | – | – | 112.772 |
| ProQ2 | 0.647 | 0.524 | 0.394 | 8.136 | 0.571 | 0.830 | 126.170 |
| ProQ2-refine | 0.654 | 0.544 | 0.413 | 8.555 | 0.892 | 0.204 | 120.303 |
| RFMQA | 0.609 | 0.497 | 0.376 | 9.028 | 0.330 | 0.0425 | 108.889 |
| MULTICOM-NOVEL | 0.636 | 0.534 | 0.407 | 9.082 | 0.982 | 0.377 | 117.343 |
| MULTICOM-CLUSTER | 0.648 | 0.511 | 0.387 | 9.470 | 0.366 | 0.892 | 118.111 |
| VoroMQA | 0.563 | 0.444 | 0.334 | 10.761 | 0.149 | 0.0001 | 108.579 |
The legend is the same as in Table 1, except that the fifth column gives the average GDTloss (instead of TMloss), and the sixth and seventh columns give the P-values (pairwise Wilcoxon signed-rank test) for GDTloss and CCrank, respectively. For comparison, we also included Wallner, a consensus method that was the best QA method of CASP11.
3.4.2 Performance of various methods on Stage2 CASP11 targets
We evaluated the performance of SVMQA and the other top single-model QA methods on the Stage2 CASP11 dataset; for comparison, the best consensus method, Wallner, is again included. Among the single-model methods, ProQ2 is ranked first and SVMQA second based on the average GDTloss, but the difference between the two is not significant (P-value 0.588), demonstrating that the model selection ability of SVMQA is close to the state of the art (Table 4). The pairwise comparison between the models selected by these two methods is shown in Figure 2A. The SVMQA model for T0795 was significantly worse, which lowered SVMQA's ranking; when this target is excluded from the analysis, SVMQA (5.839) is better than ProQ2 (6.271) in terms of GDTloss.
Table 4. Performance of SVMQA and other top QA methods on Stage2 of CASP11
| Method | CCrank | ρrank | τrank | Avg. GDTloss | P-value (GDTloss) | P-value (CCrank) | ΣZ-score |
|---|---|---|---|---|---|---|---|
| Wallner | 0.614 | 0.566 | 0.426 | 4.869 | 0.247 | 6.2E-08 | 99.779 |
| ProQ2 | 0.368 | 0.363 | 0.256 | 6.340 | 0.588 | 0.00167 | 84.055 |
| SVMQA | 0.428 | 0.417 | 0.294 | 6.524 | – | – | 84.206 |
| ProQ2-refine | 0.366 | 0.373 | 0.264 | 6.754 | 0.776 | 0.00098 | 87.168 |
| MULTICOM-NOVEL | 0.389 | 0.389 | 0.277 | 6.888 | 0.792 | 0.00983 | 88.250 |
| RFMQA | 0.370 | 0.352 | 0.246 | 6.953 | 0.352 | 0.0049 | 76.366 |
| MULTICOM-CLUSTER | 0.405 | 0.397 | 0.280 | 7.058 | 0.838 | 0.355 | 83.001 |
| VoroMQA | 0.412 | 0.394 | 0.277 | 7.307 | 0.964 | 0.165 | 91.363 |
The legend is the same as in Table 3.
Figure 2 caption: Pairwise comparisons of SVMQA and ProQ2. (A) GDT_TS score of the model selected by SVMQA (GDT_TSSVMQA) versus ProQ2 (GDT_TSProQ2) on Stage2 CASP11 targets. (B) GDT_TS score of each model plotted against its SVMQA score for Stage2 CASP11 target T0795.
We examined the result for T0795 in detail to understand the reason for SVMQA’s apparent failure. Figure 2B shows GDT_TS score versus SVMQA score. Four of the 8 energy-based terms (DFIREdDFIRE, RWplus, ODE and DFIREGOAP) favored myprotein-me_TS2 as the best model (GDT_TS 11.58), which apparently influenced the SVMQA ranking: SVMQA picked myprotein-me_TS2 as the best model and myprotein-me_TS1 (GDT_TS 65.44) as the second-best model. When we examined the landscape of all 19 features (see Supplementary Figs S3 and S4), we observed that five of the 8 energy-based terms heavily favored myprotein-me_TS2 over myprotein-me_TS1, while the consistency-based terms mostly favored the opposite. It therefore appears that the current version of SVMQA was trained to favor energy-based terms over consistency-based terms for this target. Another possibility is that the procedure of screening out low-TM-score/GDT_TS decoys prior to training did not properly discriminate low-quality 3D models; had the low-GDT_TS models of T0795 been screened out, SVMQA could easily have identified myprotein-me_TS1 as the best model.
Using a P-value threshold of 0.01, the difference in CCrank between SVMQA and four other methods (ProQ2, ProQ2-refine, MULTICOM-NOVEL and RFMQA) is significant, and the ρrank and τrank of SVMQA are higher than those of the other single-model QA methods. These results show that SVMQA ranked the structural models significantly better than the top three CASP11 predictors (ProQ2, ProQ2-refine and MULTICOM-NOVEL). In terms of model selection, SVMQA is similar to ProQ2 and better than the other single-model QA methods.
We note that SVMQA outperformed our previous method RFMQA both in ranking structural models and in model selection. The differences between the two methods are as follows: (i) the training dataset used for SVMQA (CASP8-10 domain targets) was larger than that used for RFMQA (CASP8-9 domain targets); (ii) SVMQA uses 19 input features, whereas RFMQA used only 9 of them; (iii) the objective function differed: SVMQA was optimized based on CCrank, while RFMQA was optimized based on TMloss; and (iv) SVMQA uses two separate predictors, for TM-score and for GDT_TS score, whereas RFMQA used a single predictor for TM-score. Consequently, by employing two separate predictors, SVMQA selects the best model from the decoys more robustly than RFMQA.
3.4.3 Domain based performance of various methods on Stage2 CASP11 targets
We evaluated the performance of the QA methods separately on single-domain and multi-domain targets. Figure 3 shows that SVMQA performed considerably better on single-domain targets than the other methods did, producing the lowest GDTloss (5.618) and the highest sum of Z-scores (55.411). This improved performance may be due to the fact that domain-based targets were used to develop the SVMQA machine. For multi-domain targets, VoroMQA performed better than the other single-model methods, with the lowest GDTloss (4.788) and the highest sum of Z-scores (43.996). Notably, the performance of SVMQA on multi-domain targets was similar to those of MULTICOM-NOVEL, MULTICOM-CLUSTER and RFMQA in terms of GDTloss. This result indicates that although SVMQA was trained on single-domain targets, it performed reasonably well on multi-domain models.
Figure 3 caption: Performance of single-model QA methods on single-domain and multi-domain Stage2 CASP11 targets. The x-axis gives the method name; the y-axis shows (A) the average GDTloss and (B) the sum of Z-scores.
3.5 Performance of SVMQA in CASP12 experiment
In addition to SVMQA, 41 other QA methods participated in the CASP12 blind prediction, including single-model, quasi-single-model and consensus methods. In this blind test, SVMQA was the best QA method at selecting good-quality models from decoys in terms of GDTloss (http://predictioncenter.org/casp12/qa_diff2best.cgi), according to the CASP assessors. This is the first time in CASP that single-model methods (SVMQA, ProQ3 and MESHI_SERVER) outperformed the best consensus method; as mentioned earlier, in previous experiments the best consensus method had always outperformed the best single-model method. We believe this is because previous CASP experiments contained more easy targets than hard targets, whereas in CASP12 there were as many hard targets as easy ones.
Figure 4 shows the three best and three worst examples of model selection by SVMQA in CASP12 Stage2. In the first three examples (A–C), the SVMQA models were better than the models chosen by the other top four QA methods (the single-model methods ProQ3 and Meshi-server, the quasi-single-model method ModFOLD6_rank and the consensus method Wallner); in the other three examples (D–F), SVMQA performed relatively poorly. The SVMQA models and the best models of the six targets are shown in magenta and cyan, respectively. In the case of T0884, the SVMQA model was slightly worse than the Wallner and ModFOLD6_rank models; the orientation of the C-terminal α-helical segment (Figure 4D, marked by a circle) differed between the SVMQA model and the best model. In the case of T0885, the SVMQA model was worse than those selected by Meshi-server and ProQ3; this is also an α-helical protein, and the arrangement of helices from the N- to the C-terminus differed slightly between the SVMQA model and the Meshi-server model, which was the best model. T0900 (a single-domain, free-modeling target) was the only target for which the SVMQA model was significantly worse than all four other QA models: the SVMQA model was much larger (radius of gyration = 16.436 Å) than the other four QA models (12.55 Å < radius of gyration < 13.26 Å), and its β-sheet arrangement was quite different from that of the best model.
Figure 4 caption: Examples of good and bad predictions by SVMQA on Stage2 CASP12 targets. The x-axis and y-axis are the SVMQA score and the actual GDT_TS score of the models, respectively. The SVMQA model and the best model of the decoys are shown in magenta and cyan, respectively; the superposed 3D models use the same colors. The models selected by ProQ3, Meshi-server, ModFOLD6_rank and Wallner are shown as green open squares, black open inverted triangles, red open triangles and blue open circles, respectively.
4 Conclusion
In this study, we introduced a novel single-model QA method, which we call SVMQA. SVMQA predicts a given model’s global QA score as the average of the predicted TM-score and GDT_TS score by combining two separate predictors, SVMQA_TM and SVMQA_GDT. We constructed SVMQA using 19 input features comprising 8 energy-based terms and 11 consistency-based terms between values predicted from the sequence of the target protein and values calculated from the 3D structure of the model. SVMQA is used both to rank protein 3D models and to select the best model from a given pool.
Benchmarking tests of SVMQA were very promising. When tested on the I-TASSER and 3DRobot decoys, SVMQA outperformed all five energy-based methods tested (dDFIRE, RWplus, OPUS-PSP, GOAP and DFIREGOAP) both in ranking the structural models and in selecting the best model. When tested on our in-house server models generated during CASP11, SVMQA model selection was better than both our CASP11 in-house QA procedure and the actual CASP11 submission (model 1 of our server nns). We also tested SVMQA on both Stage1 and Stage2 targets of CASP11, where it performed as either the best or the second-best single-model QA method. In particular, for Stage2 CASP11 targets, SVMQA outperformed the three best single-model QA methods (ProQ2, ProQ2-refine and MULTICOM-NOVEL) in ranking the 3D models, while in selecting the best model it was similar to ProQ2.
Based on these successes, SVMQA was integrated into our modeling pipeline for CASP12 targets. According to the CASP12 assessment, SVMQA was the best method at selecting good-quality models from given decoys in terms of GDTloss, even when compared with consensus methods. Overall, SVMQA ranked structural models significantly better, and selected models better, than the other single-model methods considered in this study. CASP12 marks the first CASP in which a single-model QA method outperformed the best consensus method, likely because CASP12 contained as many hard targets as easy ones, which favors single-model methods. These results indicate that SVMQA can contribute to successful 3D modeling of difficult target proteins through model selection.
Acknowledgements
The authors thank Korea Institute for Advanced Study for providing computing resources (KIAS Center for Advanced Computation Linux Cluster) for this work. We thank Andrew Brooks for his critical reading of and suggestions for the manuscript.
Funding
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2008-0061987).
Conflict of Interest: none declared.
References



