Ludovica Montanucci, Piero Fariselli, Pier Luigi Martelli, Rita Casadio, Predicting protein thermostability changes from sequence upon multiple mutations, Bioinformatics, Volume 24, Issue 13, July 2008, Pages i190–i195, https://doi.org/10.1093/bioinformatics/btn166
Abstract
Motivation: A basic question in protein science is to what extent mutations affect protein thermostability. This knowledge would be particularly relevant for engineering thermostable enzymes. In several experimental approaches, this issue has been serendipitously addressed. It would therefore be convenient to provide a computational method that predicts when a given protein mutant is more thermostable than its corresponding wild-type.
Results: We present a new method based on support vector machines that is able to predict whether a set of mutations (including insertions and deletions) can enhance the thermostability of a given protein sequence. When trained and tested on a redundancy-reduced dataset, our predictor achieves 88% accuracy and a correlation coefficient equal to 0.75. Our predictor also correctly classifies 12 out of 14 experimentally characterized protein mutants with enhanced thermostability. Finally, it correctly detects all 11 mutated proteins whose increase in stability temperature is ≥10°C.
Availability: The dataset and the list of protein clusters adopted for the SVM cross-validation are available at the web site http://lipid.biocomp.unibo.it/~ludovica/thermo-meso-MUT.
Contact: [email protected]
1 INTRODUCTION
Predicting the thermostability of a biomolecule, given its sequence, is one of the big challenges of protein biochemistry and biotechnology (Bommarius et al., 2006; Razvi and Scholtz, 2006). In this respect, it is highly relevant to develop a tool that can score protein sequences in order to screen thermostable mutants among a plethora of alternative mutated sequences (Hoppe and Shomburg, 2005). The accumulation of genomic data, including data from thermophilic organisms, allows for a comprehensive investigation of nucleotide and amino acid sequences with the aim of discovering universal determinants of thermophilic life. Many studies have attempted to correlate thermostability with both the genome and proteome compositions. At the DNA level, differences in codon usage between thermophilic and mesophilic organisms have been described (Lobry and Chessel, 2003; Lobry and Necsulea, 2006; Lynn et al., 2002; Singer and Hickey, 2002; Takami et al., 2004). Recently, a codon frequency index was shown to highlight robust determinants of thermostability capable of discriminating thermophilic from mesophilic genomes (Montanucci et al., 2007). When residue composition in proteomes and/or protein sequences and structures was analyzed, the increased frequency of charged residues and ion pairs was recognized as the most remarkable feature of thermostable proteins (Farias and Bonato, 2003; Kreil and Ouzounis, 2001; Shure and Claverie, 2003; Szilágyi and Závodsky, 2000; Zhang and Fang, 2006a). However, many other compositional features may influence protein thermostability (see Zhou et al., 2008, for a recent review, and references therein) and the molecular determinants of thermal resistance at the protein level still remain elusive. Recently, the fraction of a set of protein residues (I, V, Y, W, R, E, L) in the proteome was correlated with the optimal growth temperature of the corresponding organism (Zeldovich et al., 2007).
In this article, we address the problem of screening mutations that affect protein thermostability and develop a novel method able to sort out thermostable protein variants at the sequence level.
As discussed above, several methods have been described for discriminating between thermophilic and mesophilic proteins. In the present work, our approach is different in that we aim at predicting whether a set of mutations (including deletions and insertions) can enhance the thermostability of a given protein. For this reason, and differently from previous implementations (Zhang and Fang, 2006b), we trained an SVM method on the compositional difference computed for 2328 pairs of mesophilic and thermophilic proteins that share high pairwise sequence identity (≥70%).
2 METHODS
2.1 Training dataset
Our training/testing dataset consists of 2328 pairs of protein sequences with the property that one member belongs to a thermophilic microbial organism and the other to a mesophilic one. Since we are interested in detecting small differences in composition, we considered only protein pairs sharing a sequence identity ≥70%. The corresponding pairwise alignment coverage is >80% for the vast majority of the cases (∼90%).
This dataset was derived from an extensive all-against-all BLAST search among the proteomes of 112 prokaryotes (12 of which are thermophilic) belonging to different genera, comprising Archaea and Bacteria. From the outputs of the BLAST runs, only aligned sequence pairs comprising a mesophilic and a thermophilic protein were selected, and only pairs sharing at least 70% sequence identity were retained. The final number of pairs is 2328, involving 378 thermophilic and 1015 mesophilic proteins.
2.2 Definition of training and testing sets
One of the major problems in developing and evaluating predictors is avoiding similarity between training and testing sets; random splitting is usually not a correct procedure (Appendix 1). For this reason we clustered proteins using a very conservative method. We considered protein sequences as graph nodes. Two nodes are linked by an edge if the local identity between the two corresponding sequences is >30%. The connected components of the resulting graph define our clusters. In this way, a cluster may contain proteins that by themselves do not share a sequence identity >30% but that are connected through a path of similar proteins. This procedure grouped the 1393 protein sequences of the dataset into 184 non-overlapping clusters, containing the 2328 pairs of interest. Also, each cluster contains proteins that are ≤30% identical to those of all the other clusters.
Although apparently too restrictive, this clustering procedure is safer for defining training and testing sets. We then used a ‘leave-one-cluster-out’ training procedure by predicting all the proteins of one cluster with the model trained on the remainder of the data set.
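The clustering and validation scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy identity values and the protein labels are all hypothetical, and a union-find structure is used here simply as one convenient way to obtain connected components.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Group nodes into clusters: two nodes share a cluster if a path of edges joins them."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the representative, shortening the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    clusters = defaultdict(set)
    for n in nodes:
        clusters[find(n)].add(n)
    return list(clusters.values())


def leave_one_cluster_out(clusters, pairs):
    """Yield (train_pairs, test_pairs), holding out one cluster per round."""
    for held_out in clusters:
        test = [p for p in pairs if p[0] in held_out or p[1] in held_out]
        train = [p for p in pairs if p[0] not in held_out and p[1] not in held_out]
        if test:
            yield train, test


# Hypothetical toy data: local identities (%) between five proteins.
identity = {("A", "B"): 75, ("B", "C"): 35, ("A", "C"): 20, ("D", "E"): 80}
proteins = ["A", "B", "C", "D", "E"]
edges = [pair for pair, pct in identity.items() if pct > 30]  # edge if identity > 30%
clusters = connected_components(proteins, edges)
pairs = [("A", "B"), ("D", "E")]  # thermophilic/mesophilic pairs (>= 70% identity)
splits = list(leave_one_cluster_out(clusters, pairs))
```

In the toy data, A and C end up in the same cluster through B even though their direct identity is only 20%, which is the conservative behavior the text describes.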
In order to demonstrate the need for a similarity-clustered validation scheme, we also generated a random splitting of the data set and trained and tested the method accordingly. From Table A1 of Appendix 1, it is evident that when random splitting is adopted, the SVM performance increases due to the higher level of homology between the training and testing sequences.
2.3 A test set of experimentally investigated protein mutations
To validate our method in real-world applications, we collected an experimental dataset from the literature. This set consists of mesophilic proteins that have been experimentally mutated, yielding proteins with an increased optimal functional temperature (or melting temperature, Tm). It comprises 14 mutants derived from 10 different wild-type proteins (for details, see Table 1). Among the data reported in the literature, we did not consider examples that: (i) specify neither the optimal nor the melting temperature increment; (ii) describe proteins thermally stabilized by means of chemical post-translational modifications instead of residue mutations (such as Annaluru et al., 2007; Siddiqui and Cavicchioli, 2005; Li et al., 2007; Minagawa et al., 2007; Ruller et al., 2007; Salazar et al., 2003; Stephens et al., 2007).
Protein name | Length | Wild-type temp. (°C) | Mutant name | Mutant temp. (°C) | Mutated residues |
---|---|---|---|---|---|
Shble | 124 | Tm: 67.4 | HTS | Tm: 85.1 | G18E,D32V,L63Q,G98V |
Dmeh | 54 | Tm: 49 | UVF | Tm: 99 | 39 mutations |
Dmeh | 54 | Tm: 49 | UMC | Tm: 99 | 40 mutations |
β-GUS | 603 | 45 | TR3337 | 65 | Q493R,T509A,M532T,N550S,G559S,N566S |
BsCSP | 67 | Tm: 53.8 | mt1 | Tm: 69.7 | A46K,S48R |
BsCSP | 67 | Tm: 53.8 | mt2 | Tm: 83.7 | M1R,E3K,K65I |
EcHPH | 341 | 51 | hph5 | 67 | D20G,A118V,S225P,Q226L,T246A |
PTDH | 355 | 39 | 12x | 59.7 | V71I,E130K,Q132R,Q137R,I150F,Q215L,R275Q,L276Q,I313L,V315A,A319E,A325V |
PTDH | 355 | 39 | opt14 | 64.4 | V71I,E130K,Q132K,Q137H,I150F,Q215L,R275L,L276C,I313L,V315A,A319E,A325V,A146S,F198M |
CbADH | 452 | Tm: 65.5 | Q100P | ΔTm: +11.5 | Q100P |
FAOX | 372 | 37 | FAOX_TE | 45 | T60A,A188G,M244L,N257S,L261M |
pDAO | 347 | 45 | F42C | 55 | F42C |
PhyA | 467 | 55 | mt18 | ΔTm: +7 | A58E,P65S,Q191R,T271R |
PhyA | 467 | 55 | mt24 | ΔTm: >+7 | A58E,P65S,Q191R,T271R,E228K,S149P,F131L |
The first two columns report the wild-type protein name and the protein length. The fourth column reports the name of the mutant. Columns 3 and 5 report the optimal functional temperatures of the wild-type and the mutated sequence, respectively; Tm, when present, refers to the melting temperature. Column 6 reports the mutated residues. In the case of Dmeh, the two mutants have 39 and 40 mutated residues, respectively (Shah et al., 2007). The considered proteins are: Shble: bleomycin-binding protein from the mesophilic bacterium Streptoalloteichus hindustanus (Brouns et al., 2005); Dmeh: Drosophila melanogaster engrailed homeodomain (Shah et al., 2007); β-GUS: β-glucuronidase (Xiong et al., 2007); BsCSP: cold shock protein from Bacillus subtilis (Max et al., 2007); EcHPH: Escherichia coli hygromycin B phosphotransferase (Nakamura et al., 2005); PTDH: phosphite dehydrogenase from Pseudomonas stutzeri (Johannes et al., 2005; McLachlan et al., 2007); CbADH: Clostridium beijerinckii alcohol dehydrogenase (Goihberg et al., 2007); FAOX: fructosyl-amino acid oxidase from Corynebacterium sp. (Sakaue and Kajiyama, 2003); pDAO: porcine kidney D-amino acid oxidase (Bakke et al., 2006); PhyA: 3-phytase A from Aspergillus niger (Zhang and Lei, 2007).
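For mutants that carry only substitutions, the residue-composition difference between mutant and wild type can be read directly off the mutation list, without the full sequence. The sketch below illustrates this for the Shble HTS mutant of Table 1. It assumes that composition is expressed as a fraction of the sequence length and that each listed mutation affects a distinct position; the function name is hypothetical.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def substitution_difference(mutations, length):
    """Composition difference (mutant minus wild type) for a list of substitutions.

    Assumes composition = residue count / sequence length, and that the
    substitutions leave the sequence length unchanged.
    """
    diff = dict.fromkeys(AMINO_ACIDS, 0.0)
    for mut in mutations.split(","):
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mut.strip())
        if match is None:
            raise ValueError(f"not a substitution: {mut!r}")
        wild, _position, new = match.groups()
        diff[wild] -= 1.0 / length  # one fewer wild-type residue
        diff[new] += 1.0 / length   # one more mutant residue
    return diff


# Shble HTS mutant from Table 1: four substitutions in a 124-residue protein.
hts = substitution_difference("G18E,D32V,L63Q,G98V", 124)
# Express the non-zero entries as net residue counts for readability.
changed = {aa: round(v * 124) for aa, v in hts.items() if abs(v) > 1e-12}
```

Only six of the 20 components are non-zero here, which shows how sparse the input vector becomes when a mutant differs from its wild type by a handful of residues.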
2.4 Support vector machines
Two support vector machines (SVMs) were trained with linear kernel functions, using the libsvm package (http://www.csie.ntu.edu.tw/~cjlin/libsvm). They differ in the input encoding adopted. The first SVM (L20) takes as input 20-valued vectors containing the difference of the residue composition in each pair of the dataset. The second SVM (L400) takes as input 400-valued vectors containing the difference of the dipeptide composition in each pair of the dataset.
For each encoding type (L20 and L400) and for each protein pair in the data set, two different input vectors were derived: a vector encoding the composition (residue or dipeptide) difference between the thermophilic sequence and the mesophilic one (positive set), and a vector encoding the composition (residue or dipeptide) difference between the mesophilic sequence and the thermophilic one (negative set). In this way, a total of 4656 input vectors were defined. This encoding procedure ensures that positive and negative examples are balanced.
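The two encodings can be sketched as below. This is an illustration under stated assumptions, not the authors' code: compositions are taken as fractions (counts divided by the number of residues or of overlapping dipeptides), the function names are hypothetical, and the two short sequences are invented for the example.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def residue_composition(seq):
    """Fraction of each of the 20 residues in seq (normalization is an assumption)."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each of the 400 overlapping dipeptides in seq."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        dp = seq[i:i + 2]
        if dp in counts:  # skip dipeptides containing non-standard letters
            counts[dp] += 1
    total = max(len(seq) - 1, 1)
    return [counts[dp] / total for dp in DIPEPTIDES]


def encode_pair(thermo, meso, composition):
    """Return the two labeled examples for one pair: the difference and its mirror."""
    t, m = composition(thermo), composition(meso)
    positive = [a - b for a, b in zip(t, m)]  # thermophilic minus mesophilic
    negative = [b - a for a, b in zip(t, m)]  # mesophilic minus thermophilic
    return [(positive, +1), (negative, -1)]


# Hypothetical sequences of different length, to show that no alignment is needed.
l20_examples = encode_pair("MKKEEV", "MKAEG", residue_composition)
l400_examples = encode_pair("MKKEEV", "MKAEG", dipeptide_composition)
```

Because each pair contributes one positive and one mirrored negative vector, 2328 pairs yield 2 × 2328 = 4656 examples, and since compositions are computed on whole sequences, insertions and deletions need no special handling.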
For each trained SVM we evaluated the performance using different values for the C parameter in the range of 0.1–100 000 (see the libsvm package). The linear kernel SVM showed a high degree of robustness and the accuracy was not significantly affected by a specific value of the C parameter. The results presented below are computed using values for the C parameter equal to 10 000 and 1000 for L20 and L400, respectively.
Finally, a combined SVM was derived by taking the average values of the probabilities given by L20 and L400.
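The combination step amounts to a simple average, sketched below. The 0.5 decision threshold and the function names are assumptions made for illustration, and the probability values in the example are invented.

```python
def combined_probability(p_l20, p_l400):
    """Average the probabilities of the 'more thermostable' class from the two SVMs."""
    return (p_l20 + p_l400) / 2.0


def classify(p_l20, p_l400, threshold=0.5):
    """Return +1 (more thermostable) or -1; the 0.5 threshold is an assumption."""
    return 1 if combined_probability(p_l20, p_l400) > threshold else -1


# Hypothetical case where the two models disagree: L400 is confident, L20 is not.
label = classify(0.4, 0.8)
```

When the two models disagree, the more confident one decides the outcome, which is one way a combined predictor can outperform both of its components.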
2.5 Evaluating the predictor performances
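The scoring indexes reported in Section 3 can be computed from a confusion matrix, as sketched below. The sketch assumes the standard definitions: accuracy is the fraction of correct predictions, the correlation is the Matthews correlation coefficient, and the sensitivity of a class is the fraction of its examples that are correctly predicted. The function name is hypothetical, and the counts in the example are not taken from the paper; they are chosen only to be consistent with the rounded L20 sensitivities of Table 2 on 2328 positive and 2328 negative examples.

```python
import math


def scores(tp, fn, tn, fp):
    """Accuracy, Matthews correlation and per-class sensitivity from confusion counts."""
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    correlation = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "correlation": correlation,
        "sensitivity_plus": tp / (tp + fn),   # correctly predicted positives
        "sensitivity_minus": tn / (tn + fp),  # correctly predicted negatives
    }


# Hypothetical counts: about 87% of 2328 positives and 86% of 2328 negatives correct.
result = scores(tp=2025, fn=303, tn=2002, fp=326)
```

With these counts the accuracy and correlation come out close to the 86% and 0.73 listed for L20 in Table 2, which suggests that the reported indexes are mutually consistent under these definitions.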








3 RESULTS
3.1 Scoring the SVM predictors
The scoring indexes of the different methods were evaluated with the ‘leave-one-cluster-out’ procedure and are reported in Table 2. It is evident that both SVM L20 and SVM L400 perform quite well and that accuracies are almost indistinguishable in spite of the fact that L400 has far more detailed input to work with. This indicates that the most relevant information is contained in the composition difference of two highly homologous thermophilic and mesophilic proteins (≥70%) and that more detailed information of the difference in dipeptide composition does not significantly increase the performance. However, the two predictors are able to extract slightly different features as indicated by the finding that the combined SVM predictor outperforms the single L20 and L400, reaching an accuracy of 88% and a correlation coefficient of 0.75. The scoring improvement of the combination of two different methods is not unexpected since it is theoretically founded when they capture different features (Sollich and Krogh, 1996). The same picture holds when the ROC curves reported in Figure 1 are considered and where the true positive rate [Sensitivity(+)] is plotted versus the false positive rate [1-Sensitivity(−)].

ROC curve of the three predictors. Solid gray line: L20 SVM predictor; dotted black line: L400 SVM predictor; solid black line: combined predictor.
Method | Accuracy (%) | Correlation | Sensitivity + (%) | Sensitivity − (%) | Specificity + (%) | Specificity − (%) |
---|---|---|---|---|---|---|
L20 | 86 | 0.73 | 87 | 86 | 86 | 87 |
L400 | 85 | 0.70 | 85 | 85 | 85 | 85 |
Combined | 88 | 0.75 | 88 | 88 | 88 | 88 |
The performances are evaluated using the leave-one-cluster-out procedure on the training dataset. The symbols + and − indicate the direction of increased and decreased thermostability, respectively.
We should also consider that the predictions are very well balanced, reaching the same sensitivity and specificity values for the two classes.
3.2 Robustness of the performance
The SVM performance can be affected by the encoding procedure given the different protein identity values (70–92%) and different protein lengths (35–1512 residues) in the dataset of protein pairs. In Figures 2 and 3, we show the accuracy of the combined SVM as a function of protein sequence identity (Fig. 2) and length (Fig. 3) of the pairs, respectively. It is evident that the performance is quite independent of both the identity value and protein length of the pair.

The accuracy of the combined SVM method is plotted with respect to the sequence identity, grouped into bins of identity, in the pair. Bars indicate the frequency of pairs in the training set with a given identity value.

The accuracy of the combined SVM method is plotted with respect to the protein length in the pair. For each pair the maximum protein length was chosen. Bars indicate the frequency of pairs in the training set with a given protein length.
An important issue when implementing and testing a predictive method is the possibility to compute a reliability score of the prediction. This helps in scoring the performance of the method; furthermore, it can also be used to sort out the set of mutations that are more likely to increase protein thermostability in a rational computer-aided protein design.
In this respect, we tested the behavior of our method by computing its accuracy as a function of the reliability measure [see Section 2, Equation (8)]. From the data shown in Figure 4 it can be concluded that for about half of the dataset the method accuracy is >95%.

The accuracy of the combined SVM method is plotted with respect to the reliability index. Bars represent the fraction of the database with a given value of reliability index.
3.3 A blind test on experimentally validated data
In order to validate our methods on a real-world application we further tested them on an experimentally verified dataset.
To be rigorous, we checked whether the sequences of the experimental set were included in the training set. To this end, the 10 wild-type sequences of the experimental set were aligned with the BLAST program against all the sequences of the training set. In nine cases, BLAST gave no hits. Only the Bacillus subtilis CSP (BsCSP) retrieved BLAST hits, with nine proteins in the training set (seven mesophilic and two thermophilic). Since these proteins were all included in a single cluster, the predictions for the two BsCSP mutants were carried out using the SVM model trained without the cluster containing all the BsCSP ‘homologues’.
The protein pairs in the experimental set carry, on average, far fewer mutations than the pairs in the training set. Despite this, the results reported in Table 3 show that the performance of our methods on the experimental set is similar to that obtained on the training/testing dataset. It is also worth noting that the combined SVM method correctly predicts all 11 experimental mutants whose increase in thermostability (ΔT) is ≥10°C (Table 3).
Protein | Mutant | ΔT (°C) | N° muts | L20 | L400 | Combined |
---|---|---|---|---|---|---|
Dmeh | UVF | 50 | 39 | Yes | Yes | Yes |
Dmeh | UMC | 50 | 40 | Yes | Yes | Yes |
BsCSP | mt2 | 29.9 | 3 | Yes | Yes | Yes |
PTDH | opt14 | 25.4 | 14 | Yes | Yes | Yes |
PTDH | 12x | 20.7 | 12 | Yes | Yes | Yes |
β-GUS | TR3337 | 20 | 6 | No | Yes | Yes |
Shble | HTS | 17.7 | 4 | Yes | Yes | Yes |
EcHPH | hph5 | 16 | 5 | Yes | Yes | Yes |
BsCSP | mt1 | 15.9 | 2 | Yes | Yes | Yes |
CbADH | Q100P | 11.5 | 1 | Yes | No | Yes |
pDAO | F42C | 10 | 1 | No | Yes | Yes |
FAOX | TE | 8 | 5 | Yes | No | No |
PhyA | mt24 | >7 | 4 | Yes | Yes | Yes |
PhyA | mt18 | 7 | 7 | Yes | No | No |
Accuracy for all the mutations (%) | | | | 12/14 (86) | 11/14 (79) | 12/14 (86) |
Accuracy for the subset with ΔT≥10°C (%) | | | | 9/11 (82) | 10/11 (91) | 11/11 (100) |
Protein is the short name of the wild-type protein (refer to Table 1 for details); Mutant is the name of the mutated sequence; ΔT is the experimentally measured increase in the optimal (or melting) temperature; N° muts is the number of mutations. The last three columns report whether the prediction of each of the three methods is correct (Yes) or incorrect (No).
3.4 Analysis of the dominant SVM parameters


The values of the components of the hyperplane vector of SVM L20 are plotted as bars. The average compositional differences obtained by averaging all the training examples are plotted as dots connected by a line.
4 CONCLUSIONS
Several papers have so far addressed the problem of characterizing determinants of thermostability. This is possible at the genome and at the proteome level, provided that determinants are statistically robust enough (Montanucci et al., 2007; Zeldovich et al., 2007, and references therein). Other works have also attempted to discriminate whether a given sequence might belong to a thermophilic or a mesophilic organism (Zhou et al., 2008, for a recent review, and references therein). To perform this task, the information derived from entire genomes, proteomes or sets of thermophilic and mesophilic sequences was exploited. The problem was also tackled by means of SVM and other machine learning approaches (Zhang and Fang, 2006b).
In this article, we address a different issue: we try to derive automatic rules to predict when a set of mutations (including deletions and insertions) can enhance protein thermostability, which is novel. For this reason, a direct comparison with previous works is not possible, given the different inputs and goals. For our purpose, we explicitly selected the subset of protein sequences from thermophilic and mesophilic organisms that share a high degree of similarity, and this subset was used for training/testing with a similarity-clustered procedure.
A careful analysis of the support vectors of our method highlights that residues contributing the most to protein thermostability in the protein set are the same derived by previous approaches on different protein sets, corroborating our observations.
Finally, our methods were tested on experimentally characterized, previously unseen protein mutants with enhanced thermostability. The results indicate that our best combined SVM predictor correctly classifies 12 mutated proteins out of 14. Furthermore, it correctly detects all 11 mutated proteins whose experimentally characterized increase in melting/optimal functional temperature is ≥10°C (Table 3).
ACKNOWLEDGEMENTS
Funding: R.C. acknowledges the receipt of the following grants: FIRB 2003 LIBI–International Laboratory of Bioinformatics and the support to the Bologna node of the Biosapiens Network of Excellence project within the European Union's VI Framework Programme (contract number LSGH-CT-2003-503265).
Conflict of Interest: none declared.
REFERENCES
APPENDIX 1
A.1 Results on a random splitting of training and testing sets
A three-fold cross-validation was carried out using random splitting of the 4656 examples in the training set. The examples in the dataset were randomly split into three sets. The training/testing splitting was therefore carried out regardless of the redundancy and sequence similarity among the considered sequences. At each cross-validation run, 3104 examples were used for training and 1552 for testing. The performances of the obtained classifiers are shown in Table A1. When these results are compared to those shown in Table 2, it is evident that SVM scoring is enhanced when redundancy between the training and testing sets is retained.
Method | Accuracy (%) | Correlation | Sensitivity + (%) | Sensitivity − (%) | Specificity + (%) | Specificity − (%) |
---|---|---|---|---|---|---|
L20 | 93 | 0.86 | 93 | 93 | 93 | 93 |
L400 | 97 | 0.94 | 97 | 97 | 97 | 97 |
L20 is the SVM trained with the residue composition. L400 is trained with the difference in dipeptide composition of the sequences. Symbols + and − indicate increased and decreased thermostability classes, respectively.