Abstract

Motivation: A basic question in protein science is to which extent mutations affect protein thermostability. This knowledge would be particularly relevant for engineering thermostable enzymes. In several experimental approaches, this issue has been serendipitously addressed. It would be therefore convenient providing a computational method that predicts when a given protein mutant is more thermostable than its corresponding wild-type.

Results: We present a new method based on support vector machines that is able to predict whether a set of mutations (including insertion and deletions) can enhance the thermostability of a given protein sequence. When trained and tested on a redundancy-reduced dataset, our predictor achieves 88% accuracy and a correlation coefficient equal to 0.75. Our predictor also correctly classifies 12 out of 14 experimentally characterized protein mutants with enhanced thermostability. Finally, it correctly detects all the 11 mutated proteins whose increase in stability temperature is >10°C.

Availability: The dataset and the list of protein clusters adopted for the SVM cross-validation are available at the web site http://lipid.biocomp.unibo.it/~ludovica/thermo-meso-MUT.

Contact:  [email protected]

1 INTRODUCTION

Predicting the thermostability of a biomolecule, given its sequence, is one of the big challenges of protein biochemistry and biotechnology (Bommarius et al., 2006; Razvi and Scholtz, 2006). In this respect, it is of great relevance developing a tool that can score protein sequences in order to screen thermostable mutants among a plethora of alternative mutated sequences (Hoppe and Shomburg, 2005). The accumulation of genomic data, comprising thermophilic organisms, allows for a comprehensive investigation of nucleotidic and amino acidic sequences with the aim of discovering universal determinants of thermophilic life. Many studies have attempted to correlate thermostability to both the genome and proteome compositions. At the DNA level, differences in the codon usage between thermophilic and mesophilic organisms have been described (Lobry and Chessel, 2003; Lobry and Necsulea, 2006; Lynn et al., 2002; Singer and Hickey, 2002; Takami et al., 2004). Recently, a codon frequency index could highlight robust determinants of thermostability capable of discriminating thermophilic from mesophilic genomes (Montanucci et al., 2007). When residue composition in proteomes and/or protein sequences and structures were analyzed, the increased frequency in charged residues and ion pairs was recognized as the most remarkable feature of thermostable proteins (Farias and Bonato, 2003; Kreil and Ouzounis, 2001; Shure and Claverie, 2003; Szilágyi and Závodsky, 2000; Zhang and Fang, 2006a). However, many other compositional features may influence protein thermostability (see Zhou et al., 2008, for a recent review, and references therein) and molecular determinants of thermal resistance at the protein level still remain elusive. Recently, the fraction of a set of protein residues (I, V, Y, W, R, E, L) in the proteome was correlated with the optimal growth temperature of the correspondent organism (Zeldovich et al., 2007).

In this article, we address the problem of screening mutations that affect protein thermostability and develop a novel method able to sort out thermostable protein variants at the sequence level.

As discussed above, several methods have been described for discriminating among thermophilic and mesophilic proteins. In the present work, our approach is different in that we aim at predicting whether a set of mutations (including deletions and insertions) can enhance the thermostability of a given protein. For this reason, and differently from previous implementations (Zhang and Fang, 2006b) we trained a SVM method on the compositional difference computed for 2328 pairs of mesophilic and thermophilic proteins that share high pairwise sequence identity (≥70%).

2 METHODS

2.1 Training dataset

Our training/testing dataset consists of 2328 pairs of protein sequences with the property that one member belongs to a thermophilic microbial organism and the other to a mesophilic one. Since we are interested in detecting small differences in composition, we considered only protein pairs sharing a sequence identity ≥70%. The corresponding pairwise alignment coverage is >80% for the vast majority of the cases (∼90%).

This dataset was derived from an extensive all-against-all BLAST search among the proteomes of 112 prokaryotes (12 of which are thermophilic) belonging to different genera, comprising Archaea and Bacteria. From the outputs of the BLAST runs only aligned sequence pairs comprising a mesophilic and a thermophilic protein were selected. Only protein pairs sharing at least 70% of sequence identity were retained. The final number of pairs is 2328, including 378 thermophilic and 1015 mesophilic proteins.

2.2 Definition of training and testing sets

One of the major problems in developing and evaluating predictors is avoiding the similarity between training and testing sets; usually random splitting is not a correct procedure ( Appendix 1). For this reason we clustered proteins using a very conservative method. We considered protein sequences as graph nodes. Two nodes are linked by an edge if the local identity between the two corresponding sequences is >30%. The graph is then a forest and the connected components define our clusters. In this way, a cluster may contain proteins that by themselves do not share a sequence identity >30% but that are connected through a path of similar proteins. This procedure grouped the 1393 protein sequences of the dataset into 184 non-overlapping clusters, containing the 2328 pairs of interest. Also, each cluster contains proteins that are <30% identical to those of all the other clusters.

Although apparently too restrictive, this clustering procedure is safer for defining training and testing sets. We then used a ‘leave-one-cluster-out’ training procedure by predicting all the proteins of one cluster with the model trained on the remainder of the data set.

In order to prove the necessity for a similarity-clustered validation scheme, we generated also a random splitting of the data set and we trained and tested the method accordingly. From Table A1 of  Appendix 1, it is evident that when random splitting is adopted, the SVM performance increases due to the higher level of homology between the training and testing sequences.

2.3 A test set of experimentally investigated protein mutations

To validate our method with real-world applications, we collected an experimental dataset derived from the literature. This set consists of mesophilic proteins that have been experimentally mutated resulting in proteins that show an increased optimal functional temperature (or melting temperature Tm). This experimental set consists of 14 mutants derived from 10 different wild-type proteins (for details, see Table 1). Among the data reported in the literature we did not consider examples that: (i) do not specify the optimal (nor the melting) temperature increment; (ii) describe proteins thermally stabilized by means of chemical post-translational modifications instead of residue mutations (such as Annaluru et al., 2007; Siddiqui and Cavicchioli, 2005; Li et al., 2007; Minagawa et al., 2007; Ruller et al., 2007; Salazar et al., 2003; Stephens et al., 2007).

Table 1.

Experimental dataset description

Protein nameLengthTemp. (C)Mutant nameTemp. (C)Mutated residues
Shble124T  m: 67.4HTST  m: 85.1G18E,D32V,L63Q,G98V
UVFT  m: 9939 mutations
Dmeh54T  m: 49
UMCT  m: 9940 mutations
β-GUS60345TR333765Q493R,T509A,M532T,N550S,G559S,N566S
mt1T  m: 69.7A46K,S48R
BsCSP67T  m: 53.8
mt2T  m: 83.7M1R,E3K,K65I
EcHPH34151hph567D20G,A118V,S225P,Q226L,T246A
12x59.7V71I,E130K,Q132R,Q137R,I150F,Q215L,R275Q,L276Q,I313L,V315A,A319E,A325V
PTDH35539
opt1464.4V71I,E130K,Q132K,Q137H,I150F,Q215L,R275L,L276C,I313L,V315A,A319E,A325V,A146S,F198M
CbADH452T  m: 65.5Q100PΔTm: +11.5Q100P
FAOX37237FAOX_TE45T60A,A188G,M244L,N257S,L261M
PDAO34745F42C55F42C
mt18ΔTm: +7,A58E,P65S,Q191R,T271R
PhyA46755
mt24ΔTm: >+7A58E,P65S,Q191R,T271R,E228K,S149P,F131L
Protein nameLengthTemp. (C)Mutant nameTemp. (C)Mutated residues
Shble124T  m: 67.4HTST  m: 85.1G18E,D32V,L63Q,G98V
UVFT  m: 9939 mutations
Dmeh54T  m: 49
UMCT  m: 9940 mutations
β-GUS60345TR333765Q493R,T509A,M532T,N550S,G559S,N566S
mt1T  m: 69.7A46K,S48R
BsCSP67T  m: 53.8
mt2T  m: 83.7M1R,E3K,K65I
EcHPH34151hph567D20G,A118V,S225P,Q226L,T246A
12x59.7V71I,E130K,Q132R,Q137R,I150F,Q215L,R275Q,L276Q,I313L,V315A,A319E,A325V
PTDH35539
opt1464.4V71I,E130K,Q132K,Q137H,I150F,Q215L,R275L,L276C,I313L,V315A,A319E,A325V,A146S,F198M
CbADH452T  m: 65.5Q100PΔTm: +11.5Q100P
FAOX37237FAOX_TE45T60A,A188G,M244L,N257S,L261M
PDAO34745F42C55F42C
mt18ΔTm: +7,A58E,P65S,Q191R,T271R
PhyA46755
mt24ΔTm: >+7A58E,P65S,Q191R,T271R,E228K,S149P,F131L

In the first two columns the wild-type protein name and the protein length are reported. In the fourth column, the name of the mutant is reported. Columns 3 and 5 report optimal functional temperatures of the wild-type and the mutated sequence, respectively; Tm when present refers to the melting temperature; column 6 reports the mutated residues. In the case of Dmeh, the two mutants have 39 and 40 mutated residues, respectively (Shah et al., 2007). The considered proteins are: Shble: bleomycin-binding protein from the mesophilic bacterium Streptoalloteichus hindustanus (Brouns et al., 2005); Dmeh: Drosophila melanogaster engrailed homeodomain (Shah et al., 2007); β-GUS: β-glucuronidase (Xiong et al., 2007); BsCSP: cold shock proteins from Bacillus subtilis (Max et al., 2007); EcHPH: Escherichia coli hygromycin B phosphotransferase (Nakamura et al., 2005); PTDH: phosphite dehydrogenase from Pseudomonas stutzeri (Johannes et al., 2005; McLachlan et al., 2007) CbADH: Clostridium beijerinckii alcohol dehydrogenase (Goihberg et al., 2007); FAOX: fructosyl-amino acid oxidase from Corynebacterium sp. (Sakaue and Kajiyama, 2003); pDAO: porcine kidney D-amino acid oxidase (Bakke et al., 2006); PhyA: 3-phytase A from Aspergillus niger (Zhang and Lei, 2007).

Table 1.

Experimental dataset description

Protein nameLengthTemp. (C)Mutant nameTemp. (C)Mutated residues
Shble124T  m: 67.4HTST  m: 85.1G18E,D32V,L63Q,G98V
UVFT  m: 9939 mutations
Dmeh54T  m: 49
UMCT  m: 9940 mutations
β-GUS60345TR333765Q493R,T509A,M532T,N550S,G559S,N566S
mt1T  m: 69.7A46K,S48R
BsCSP67T  m: 53.8
mt2T  m: 83.7M1R,E3K,K65I
EcHPH34151hph567D20G,A118V,S225P,Q226L,T246A
12x59.7V71I,E130K,Q132R,Q137R,I150F,Q215L,R275Q,L276Q,I313L,V315A,A319E,A325V
PTDH35539
opt1464.4V71I,E130K,Q132K,Q137H,I150F,Q215L,R275L,L276C,I313L,V315A,A319E,A325V,A146S,F198M
CbADH452T  m: 65.5Q100PΔTm: +11.5Q100P
FAOX37237FAOX_TE45T60A,A188G,M244L,N257S,L261M
PDAO34745F42C55F42C
mt18ΔTm: +7,A58E,P65S,Q191R,T271R
PhyA46755
mt24ΔTm: >+7A58E,P65S,Q191R,T271R,E228K,S149P,F131L
Protein nameLengthTemp. (C)Mutant nameTemp. (C)Mutated residues
Shble124T  m: 67.4HTST  m: 85.1G18E,D32V,L63Q,G98V
UVFT  m: 9939 mutations
Dmeh54T  m: 49
UMCT  m: 9940 mutations
β-GUS60345TR333765Q493R,T509A,M532T,N550S,G559S,N566S
mt1T  m: 69.7A46K,S48R
BsCSP67T  m: 53.8
mt2T  m: 83.7M1R,E3K,K65I
EcHPH34151hph567D20G,A118V,S225P,Q226L,T246A
12x59.7V71I,E130K,Q132R,Q137R,I150F,Q215L,R275Q,L276Q,I313L,V315A,A319E,A325V
PTDH35539
opt1464.4V71I,E130K,Q132K,Q137H,I150F,Q215L,R275L,L276C,I313L,V315A,A319E,A325V,A146S,F198M
CbADH452T  m: 65.5Q100PΔTm: +11.5Q100P
FAOX37237FAOX_TE45T60A,A188G,M244L,N257S,L261M
PDAO34745F42C55F42C
mt18ΔTm: +7,A58E,P65S,Q191R,T271R
PhyA46755
mt24ΔTm: >+7A58E,P65S,Q191R,T271R,E228K,S149P,F131L

In the first two columns the wild-type protein name and the protein length are reported. In the fourth column, the name of the mutant is reported. Columns 3 and 5 report optimal functional temperatures of the wild-type and the mutated sequence, respectively; Tm when present refers to the melting temperature; column 6 reports the mutated residues. In the case of Dmeh, the two mutants have 39 and 40 mutated residues, respectively (Shah et al., 2007). The considered proteins are: Shble: bleomycin-binding protein from the mesophilic bacterium Streptoalloteichus hindustanus (Brouns et al., 2005); Dmeh: Drosophila melanogaster engrailed homeodomain (Shah et al., 2007); β-GUS: β-glucuronidase (Xiong et al., 2007); BsCSP: cold shock proteins from Bacillus subtilis (Max et al., 2007); EcHPH: Escherichia coli hygromycin B phosphotransferase (Nakamura et al., 2005); PTDH: phosphite dehydrogenase from Pseudomonas stutzeri (Johannes et al., 2005; McLachlan et al., 2007) CbADH: Clostridium beijerinckii alcohol dehydrogenase (Goihberg et al., 2007); FAOX: fructosyl-amino acid oxidase from Corynebacterium sp. (Sakaue and Kajiyama, 2003); pDAO: porcine kidney D-amino acid oxidase (Bakke et al., 2006); PhyA: 3-phytase A from Aspergillus niger (Zhang and Lei, 2007).

2.4 Support vector machines

Two support vector machines (SVMs) were trained with linear kernel functions, using the libsvm package (http://www.csie.ntu.edu.tw/~cjlin/libsvm). They differ in the input encoding adopted. The first SVM (L20) takes as input 20-valued vectors containing the difference of the residue composition in each pair of the dataset. The second SVM (L400) takes as input 400-valued vectors containing the difference of the dipeptide composition in each pair of the dataset.

For each encoding type (L20 and L400) and for each protein pair in the data set, two different input vector sets were derived: a vector set encoding the composition (residue or dipeptide) difference between the thermophilic sequence and the mesophilic one (positive set), and a vector set encoding the composition (residue or dipeptide) difference between the mesophilic sequence and the thermophilic one (negative set). By this a total of 4656 input vectors were defined. This encoding procedure ensures balancing of the positive and negative examples.

For each trained SVM we evaluated the performance using different values for the C parameter in the range of 0.1–100 000 (see the libsvm package). The linear kernel SVM showed a high degree of robustness and the accuracy was not significantly affected by a specific value of the C parameter. The results presented below are computed using values for the C parameter equal to 10 000 and 1000 for L20 and L400, respectively.

Finally, a combined SVM was derived by taking the average values of the probabilities given by L20 and L400.

2.5 Evaluating the predictor performances

The evaluation of the performance of the two-class classifiers was determined using the following statistical indexes. The two classes are indicated as positive (+: increased thermostability) and negative (−: decreased thermostability). The overall accuracy of the classifier is defined as:
(1)
where P is the total number of correct classifications and N is the total number of examples. The correlation coefficient C is defined as:
(2)
where p and n are the number of correctly classified examples of the positive and negative class, respectively; o is the number of examples of the negative class predicted as belonging to the positive one, and u is the number of examples of the positive class predicted as belonging to the negative one. The normalization factor D is given by:
(3)
The probability of correct prediction, or specificity, for the two classes is defined as:
(4)
(5)
The coverage, or sensitivity, for the two classes is defined as:
(6)
(7)
We define the reliability score for each SVM as:
(8)
where P(i) is the probability assigned by a SVM for the i-th input vector. It ranges from 0 to 9.

3 RESULTS

3.1 Scoring the SVM predictors

The scoring indexes of the different methods were evaluated with the ‘leave-one-cluster-out’ procedure and are reported in Table 2. It is evident that both SVM L20 and SVM L400 perform quite well and that accuracies are almost indistinguishable in spite of the fact that L400 has far more detailed input to work with. This indicates that the most relevant information is contained in the composition difference of two highly homologous thermophilic and mesophilic proteins (≥70%) and that more detailed information of the difference in dipeptide composition does not significantly increase the performance. However, the two predictors are able to extract slightly different features as indicated by the finding that the combined SVM predictor outperforms the single L20 and L400, reaching an accuracy of 88% and a correlation coefficient of 0.75. The scoring improvement of the combination of two different methods is not unexpected since it is theoretically founded when they capture different features (Sollich and Krogh, 1996). The same picture holds when the ROC curves reported in Figure 1 are considered and where the true positive rate [Sensitivity(+)] is plotted versus the false positive rate [1-Sensitivity(−)].

ROC curve of the three predictors. Solid gray line: L20 SVM predictor; dotted black line: L400 SVM predictor; solid black line: combined predictor.
Fig. 1.

ROC curve of the three predictors. Solid gray line: L20 SVM predictor; dotted black line: L400 SVM predictor; solid black line: combined predictor.

Table 2.

Performances of the two SVMs and the Combined SVM method

MethodAccuracy (%)CorrelationSensitivity (%)
Specificity(%)
++
L20860.7387868687
L400850.7085858585
Combined880.7588888888
MethodAccuracy (%)CorrelationSensitivity (%)
Specificity(%)
++
L20860.7387868687
L400850.7085858585
Combined880.7588888888

The performances are evaluated using the leave-one-cluster-out procedure on the training dataset. The symbol + and − indicate the direction of increased an decreased thermostability, respectively.

Table 2.

Performances of the two SVMs and the Combined SVM method

MethodAccuracy (%)CorrelationSensitivity (%)
Specificity(%)
++
L20860.7387868687
L400850.7085858585
Combined880.7588888888
MethodAccuracy (%)CorrelationSensitivity (%)
Specificity(%)
++
L20860.7387868687
L400850.7085858585
Combined880.7588888888

The performances are evaluated using the leave-one-cluster-out procedure on the training dataset. The symbol + and − indicate the direction of increased an decreased thermostability, respectively.

We should also consider that the predictions are very well balanced, reaching the same sensitivity and specificity values for the two classes.

3.2 Robustness of the performance

The SVM performance can be affected by the encoding procedure given the different protein identity values (70–92%) and different protein lengths (35–1512 residues) in the dataset of protein pairs. In Figures 2 and 3, we show the accuracy of the combined SVM as a function of protein sequence identity (Fig. 2) and length (Fig. 3) of the pairs, respectively. It is evident that the performance is quite independent of both the identity value and protein length of the pair.

The accuracy of the combined SVM method is plotted with respect to the sequence identity, grouped into bins of identity, in the pair. Bars indicate the frequency of pairs in the training set with a given identity value.
Fig. 2.

The accuracy of the combined SVM method is plotted with respect to the sequence identity, grouped into bins of identity, in the pair. Bars indicate the frequency of pairs in the training set with a given identity value.

The accuracy of the combined SVM method is plotted with respect to the protein length in the pair. For each pair the maximum protein length was chosen. Bars indicate the frequency of pairs in the training set with a given protein length.
Fig. 3.

The accuracy of the combined SVM method is plotted with respect to the protein length in the pair. For each pair the maximum protein length was chosen. Bars indicate the frequency of pairs in the training set with a given protein length.

An important issue when implementing and testing a predictive method is the possibility to compute a reliability score of the prediction. This helps in scoring the performance of the method; furthermore, it can also be used to sort out the set of mutations that are more likely to increase protein thermostability in a rational computer-aided protein design.

In this respect, we tested the behavior of our method by computing its accuracy as function of the reliability measure [see Section 2, Equation (8)]. From the data shown in Figure 4 it can be concluded that for about half of the dataset the method accuracy is >95%.

The accuracy of the combined SVM method is plotted with respect to the reliability index. Bars represent the fraction of the database with a given value of reliability index.
Fig. 4.

The accuracy of the combined SVM method is plotted with respect to the reliability index. Bars represent the fraction of the database with a given value of reliability index.

3.3 A blind test on experimentally validated data

In order to validate our methods on a real-world application we further tested them on an experimentally verified dataset.

For sake of precision we checked if the sequences of the experimental set were included in the training set. For this reason, the 10 wild-type sequences of the experimental set were aligned with the BLAST program against all the sequences of the training set. In nine cases, BLAST gave no hits. Only the Bacillus subtilis CSP (BsCSP) retrieved BLAST hits with nine (seven of which mesophilic and two thermophilic) proteins in the training set. Since these proteins were all included in a unique cluster, the predictions for the two BsCSP mutants were carried out using the SVM model trained without the cluster containing all the BsCSP ‘homologues’.

Protein pairs included in the experimental sets are endowed with an average number of mutations that is very small with respect to the pairs included in the training set. Despite this fact, the results reported in Table 3 show that the performances of our methods on the experimental set are similar to those obtained on the training/testing dataset. It is also worth noticing that the combined SVM method correctly predicts all the 11 experimental mutants whose thermostability is endowed with a ΔT value >10C (Table 3).

Table 3.

SVM performances for the experimental dataset

ProteinMutantΔTCN    mutsL20L400Combined
DmehUVF5039YesYesYes
DmehUMC5040YesYesYes
BsCSPmt229.93YesYesYes
PTDHopt1425.414YesYesYes
PTDH12x20.712YesYesYes
β-GUSTR3337206NoYesYes
ShbleHTS17.74YesYesYes
EcHPHhph5165YesYesYes
BsCSPmt115.92YesYesYes
CbADHQ100P11.51YesNoYes
pDAOF42C101NoYesYes
FAOXTE85YesNoNo
PhyAmt24>74YesYesYes
PhyAmt1877YesNoNo
Accuracy for all the mutations (%)12/14 (86)11/14 (79)12/14 (86)
Accuracy for the subset with ΔT≥10C (%)9/11 (82)10/11 (91)11/11 (100)
ProteinMutantΔTCN    mutsL20L400Combined
DmehUVF5039YesYesYes
DmehUMC5040YesYesYes
BsCSPmt229.93YesYesYes
PTDHopt1425.414YesYesYes
PTDH12x20.712YesYesYes
β-GUSTR3337206NoYesYes
ShbleHTS17.74YesYesYes
EcHPHhph5165YesYesYes
BsCSPmt115.92YesYesYes
CbADHQ100P11.51YesNoYes
pDAOF42C101NoYesYes
FAOXTE85YesNoNo
PhyAmt24>74YesYesYes
PhyAmt1877YesNoNo
Accuracy for all the mutations (%)12/14 (86)11/14 (79)12/14 (86)
Accuracy for the subset with ΔT≥10C (%)9/11 (82)10/11 (91)11/11 (100)

Protein is the short name of the wild-type protein (refer to Table 1 for details); Mutant is the name of the mutated sequence; ΔT is the experimentally measured increase in the optimal (or melting) temperature; N.muts is the number of mutations. The correct (yes) or incorrect (no) predictions of the three methods are reported in the last three columns.

Table 3.

SVM performances for the experimental dataset

ProteinMutantΔTCN    mutsL20L400Combined
DmehUVF5039YesYesYes
DmehUMC5040YesYesYes
BsCSPmt229.93YesYesYes
PTDHopt1425.414YesYesYes
PTDH12x20.712YesYesYes
β-GUSTR3337206NoYesYes
ShbleHTS17.74YesYesYes
EcHPHhph5165YesYesYes
BsCSPmt115.92YesYesYes
CbADHQ100P11.51YesNoYes
pDAOF42C101NoYesYes
FAOXTE85YesNoNo
PhyAmt24>74YesYesYes
PhyAmt1877YesNoNo
Accuracy for all the mutations (%)12/14 (86)11/14 (79)12/14 (86)
Accuracy for the subset with ΔT≥10C (%)9/11 (82)10/11 (91)11/11 (100)
ProteinMutantΔTCN    mutsL20L400Combined
DmehUVF5039YesYesYes
DmehUMC5040YesYesYes
BsCSPmt229.93YesYesYes
PTDHopt1425.414YesYesYes
PTDH12x20.712YesYesYes
β-GUSTR3337206NoYesYes
ShbleHTS17.74YesYesYes
EcHPHhph5165YesYesYes
BsCSPmt115.92YesYesYes
CbADHQ100P11.51YesNoYes
pDAOF42C101NoYesYes
FAOXTE85YesNoNo
PhyAmt24>74YesYesYes
PhyAmt1877YesNoNo
Accuracy for all the mutations (%)12/14 (86)11/14 (79)12/14 (86)
Accuracy for the subset with ΔT≥10C (%)9/11 (82)10/11 (91)11/11 (100)

Protein is the short name of the wild-type protein (refer to Table 1 for details); Mutant is the name of the mutated sequence; ΔT is the experimentally measured increase in the optimal (or melting) temperature; N.muts is the number of mutations. The correct (yes) or incorrect (no) predictions of the three methods are reported in the last three columns.

3.4 Analysis of the dominant SVM parameters

In order to get a better insight into the determinants of protein thermostability, we analyzed the support vectors selected by the SVMs during the training phase. In particular, since our kernel is linear, we easily obtained the hyper-plane vector W used by SVM for classification (Burges, 1998). The components of the vector normal to the discriminative hyper-plane can be computed as
(9)
where j runs from 1 to 20 and i runs over all the SVM support vectors; the αi are the positive and negative Lagrange multipliers and the Xi are the support vectors, respectively. In practice, given our choice of input vectors, the positive components of W imply a thermophilic propensity while the negative ones indicate a mesophilic character. The 20 components (one for each residue) of the W hyper-plane vector are plotted in Figure 5. These components can be contrasted with the average residue composition difference among thermophilic and mesophilic protein pairs as computed on the whole dataset (Figure 5, line and dots). It can be noticed that although the majority of the residues have the same sign in the two types of plots, the relevance of the single components is different. The SVM selection of the best hyper-plane induces a different and weighted choice of the components with respect to a simple statistical classifier (as highlighted in Figure 5). In particular, we can observe that residues that have been previously identified as robust determinants of thermostability (Montanucci et al., 2007, and references therein) are also highlighted by the SVM method. These are: (i) the increased abundance of glutamic acid (E) and lysine (K) and (ii) the decreased abundance of glutamine (Q) in thermostable chains. Besides glutamic acid (E) and lysine (K), the combined SVM classifier also highlights other residues as peculiar of a thermophilic character : alanine (A), isoleucine (I), leucine (L), methionine (M), proline (P), valine (V), tryptophan (W), tyrosine (Y), arginine (R) and histidine (H). In turn, the remaining residues (C, F, G, N, Q, S, T, D) are highlighted as indicative of a mesophilic character. It should be stressed, however, that none of these components alone is statistically significant to infer thermostability versus mesophilicity. Indeed, the distribution of the SVM components of the hyperplane vector is different from that of the simple average composition difference between thermophilic and mesophilic sequences in the data set (Figure 5, line and dots). The most unexpectedly relevant component highlighted by SVM is the tryptophan difference. The relative low abundance of tryptophan in protein sequences may have hampered this finding by previously adopted statistical classifiers. However, this feature is in agreement with more recent data indicating that a fraction of a set of protein residues (I, V, Y, W, R, E, L), including tryptophan, in the proteome is correlated with the optimal growth temperature of the correspondent organism (Zeldovich et al., 2007).
The values of the components of the hyperplane vector of SVM L20 are plotted as bars. The average compositional differences obtained by averaging all the training examples are plotted as dots connected by a line.
Fig. 5.

The values of the components of the hyperplane vector of SVM L20 are plotted as bars. The average compositional differences obtained by averaging all the training examples are plotted as dots connected by a line.

4 CONCLUSIONS

Several papers addressed so far the problem of characterizing determinants of thermostability. This is possible at the genome and at the proteome level, provided that determinants are statistically robust enough (Montanucci et al., 2007; Zeldovich et al., 2007, references therein). Other works have also attempted to discriminate whether a given sequence might belong to a thermophilic or a mesophilic organism (Zhou et al., 2008, for a recent review, and references therein). To perform this task, the information derived from entire genomes, proteomes or sets of thermophilic and mesophilic sequences was exploited. The problem was also tackled by means of SVM and others machine learning approaches (Zhang and Fang, 2006b).

In this article, we address a different issue since we try to derive automatic rules to predict when a set of mutations (including deletions and insertions) can enhance protein thermostability and this is novel. For this reason, a direct comparison with previous works is not possible, given the different inputs and goals. To our purpose, we explicitly sorted out the subset of protein sequences among thermophilic and mesophilic organisms that share a high degree of similarity and this was adopted for training/testing with a similarity-clustered procedure.

A careful analysis of the support vectors of our method highlights that residues contributing the most to protein thermostability in the protein set are the same derived by previous approaches on different protein sets, corroborating our observations.

Finally, our methods are tested on experimentally determined and never-seen-before protein mutants with enhanced thermostability. The results indicate that our best combined SVM predictor correctly classifies 12 mutated proteins out of 14. Furthermore, it correctly detects all the 11 mutated proteins that are endowed by an increase of the melting/optimal functional temperature, as experimentally characterized, of >10C (Table 3).

ACKNOWLEDGEMENTS

Funding: R.C. acknowledges the receipt of the following grants: FIRB 2003 LIBI–International Laboratory of Bioinformatics and the support to the Bologna node of the Biosapiens Network of Excellence project within the European Union's VI Framework Programme (contract number LSGH-CT-2003-503265).

Conflict of Interest: none declared.

REFERENCES

Annaluru
N
et al.
,
Thermostabilization of Pichia stipitis xylitol dehydrogenase by mutation of structural zinc-binding loop
J. Biotechnol.
,
2007
, vol.
129
(pg.
717
-
722
)
Bakke
M
et al.
,
Thermostabilization of porcine kidney D-amino acid oxidase by a single amino acid substitution
Biotechnol. Bioeng.,
,
2006
, vol.
93
(pg.
1023
-
1027
)
Bommarius
AS
et al.
,
High-throughput screening for enhanced protein stability
Curr. Opin. Biotechnol
,
2006
, vol.
17
(pg.
606
-
610
)
Brouns
SJ
et al.
,
Engineering a selectable marker for hyperthermophiles
J. Biol. Chem.,
,
2005
, vol.
280
(pg.
11422
-
11431
)
Burges
CJC
A Tutorial on Support Vector Machines for Pattern Recognition
,
1998
Boston
Kluwer Academic Publishers
Farias
ST
Bonato
MCM
,
Preferred amino acids and thermostability
Genet. Mol. Res.
,
2003
, vol.
2
(pg.
383
-
393
)
Goihberg
E
et al.
,
A single proline substitution is critical for the thermostabilization ofClostridium beijerinckiialcohol dehydrogenase
Proteins
,
2007
, vol.
66
(pg.
196
-
204
)
Hoppe
C
Shomburg
D
,
Prediction of protein thermostability with direction and distance-dependent knowledge-based potential
Prot. Sci.
,
2005
, vol.
14
(pg.
2682
-
2692
)
Johannes
TW
et al.
,
Directed evolution of a thermostable phosphite dehydrogenase for NAD(P)H regeneration
Appl. Environ. Microbiol.,
,
2005
, vol.
71
(pg.
5728
-
5734
)
Kreil
DP
Ouzounis
CA
,
Identification of thermophilic species by the amino acid composition deduced from their genomes
Nucleic Acids Res.
,
2001
, vol.
29
(pg.
1608
-
1615
)
Li
Y
et al.
,
A diverse family of thermostable cytochrome P450s created by recombination of stabilizing fragments
Nat. Biothecnol.
,
2007
, vol.
25
(pg.
1051
-
1016
)
Lobry
JR
Chessel
D
,
Internal correspondence analysis of codon and amino-acid usage in thermophilic bacteria
J. Appl. Genet.
,
2003
, vol.
44
(pg.
235
-
261
)
Lobry
JR
Necsulea
A
,
Synonymous codon usage and its potential link with optimal growth temperature in prokaryotes
Gene
,
2006
, vol.
385
(pg.
128
-
136
)
Lynn
DJ
et al.
,
Synonymous codon usage is subjected to selection in thermophilic bacteria
Nucleic Acids Res.
,
2002
, vol.
30
(pg.
4272
-
4277
)
Max
KE
et al.
,
Optimized variants of the cold shock protein from in vitro selection: structural basis of their high thermostability
J. Mol. Biol.,
,
2007
, vol.
369
(pg.
1087
-
1097
)
McLachlan
MJ
et al.
,
Further improvement of phosphite dehydrogenase thermostability by saturation mutagenesis
Biotechnol. Bioeng.
,
2007
, vol.
99
(pg.
268
-
274
)
Minagawa
H
et al.
,
Improving the thermal stability of lactate oxidase by directed evolution
Cell. Mol. Life Sci.
,
2007
, vol.
64
(pg.
77
-
81
)
Montanucci
L
et al.
,
Robust determinants of thermostability highlighted by a codon frequency index capable of discriminating thermophilic from mesophilic genomes
J. Proteome Res.
,
2007
, vol.
6
(pg.
2502
-
2508
)
Nakamura
A
et al.
,
In vivo directed evolution for thermostabilization ofEscherichia colihygromycin B phosphotransferase and the use of the gene as a selection marker in the host-vector system of Thermus thermophilus
J. Biosci. Bioeng.
,
2005
, vol.
100
(pg.
158
-
163
)
Razvi
A
Scholtz
JM
,
Lessons in stability from thermophilic proteins
Prot. Sci
,
2006
, vol.
15
(pg.
1569
-
1578
)
Ruller
R
et al.
,
Thermostable variants of the recombinant xylanase a from Bacillus subtilis produced by directed evolution show reduced heat capacity
Proteins
,
2007
, vol.
70
(pg.
1280
-
1293
)
Sakaue
R
Kajiyama
N
,
Thermostabilization of bacterial fructosyl-amino acid oxidase by directed evolution
Appl. Environ. Microbiol.
,
2003
, vol.
69
(pg.
139
-
145
)
Salazar
O
et al.
,
Thermostabilization of a cytochrome P450 peroxygenase
Chembiochem.,
,
2003
, vol.
4
(pg.
891
-
893
)
Shah
PS
et al.
,
Full-sequence computational design and solution structure of a thermostable protein variant
J. Mol. Biol.
,
2007
, vol.
372
(pg.
1
-
6
)
Shure
K
Claverie
JM
,
Genomic correlates of hypertermostability: an update
J. Biol. Chem.
,
2003
, vol.
278
(pg.
17198
-
17202
)
Siddiqui
KS
Cavicchioli
R
,
Improved thermal stability and activity in the cold-adapted lipase B fromCandida antarcticafollowing chemical modification with oxidized polysaccharides
Extremophiles
,
2005
, vol.
9
(pg.
471
-
476
)
Singer,G.A.C. and Hickey
DA
,
Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content
Gene
,
2002
, vol.
317
(pg.
39
-
47
)
Sollich
P
Krogh
A
,
Learning with ensembles: how overfitting can be useful
Advances in Neural Information Processing Systems 8
,
1996
Cambridge, MA
MIT Press
(pg.
190
-
196
)
Stephens
DE
et al.
,
Directed evolution of the thermostable xylanase from Thermomyces lanuginosus
J. Biotechnol.
,
2007
, vol.
127
(pg.
348
-
354
)
Szilágyi
A
Závodsky
P
,
Structural differences between mesophilic, moderately thermophilic and extremely thermophilic protein subunits:results of a comprehensive survey
Structure,
,
2000
, vol.
8
(pg.
493
-
503
)
Takami
H
et al.
,
Thermoadaptation trait revealed by the genome sequence of thermophilic Geobacillus kaustophilus
Nucleic Acids Res.
,
2004
, vol.
32
(pg.
6292
-
6303
)
Xiong
AS
et al.
,
Concurrent mutations in six amino acids in β-glucuronidase improve its thermostability.
Prot. Eng. Design Select.,
,
2007
, vol.
20
(pg.
319
-
325
)
Zeldovich
KB
et al.
,
Protein and DNA sequence determinants of thermophilic adaptation
PLoS Comput. Biol.
,
2007
, vol.
3
pg.
e5
Zhang
G
Fang
B
,
Study on the discrimination of thermophilic and mesophilic proteins based on dipeptide composition
Chinese J. Biothechnol.,
,
2006
, vol.
22
(pg.
293
-
928
)
Zhang
G
Fang
B
,
Support vector machine for discrimination of thermophilic and mesophilic proteins based on amino acid composition
Prot. Pept. Lett.,
,
2006
, vol.
13
(pg.
965
-
970
)
Zhang
W
Lei
XG
,
Cumulative improvements of thermostability and pH-activity profile ofAspergillus nigerPhyA phytase by site-directed mutagenesis
Appl. Microbiol. Biotechnol.
,
2007
, vol.
77
(pg.
1033
-
1040
)
Zhou
XX
et al.
,
Differences in amino acids composition and coupling patterns between mesophilic and thermophilic proteins
Amino Acids,
,
2008
, vol.
34
(pg.
25
-
33
)

APPENDIX 1

A.1 Results on a random splitting of training and testing sets

A three-fold cross validation was carried out using random splitting of the 4656 examples in the training set. The examples in the dataset were randomly split into three sets. The training/testing splitting was therefore carried out regardless of the redundancy and sequence similarity among the considered sequences. At each cross-validation run, 3104 examples were used for training and 1552 for the test. The performances of the obtained classifier are shown in Table A1. When these results are compared to those shown in Table 1, it is evident that SVM scoring is enhanced when redundancy among the training and testing set is retained.

Table A1.

Performances obtained with random splitting of the cross-validation sets

MethodAccuracy (%)CorrelationSensitivity (%)
Specificity (%)
++
L20930.8693939393
L400970.9497979797
MethodAccuracy (%)CorrelationSensitivity (%)
Specificity (%)
++
L20930.8693939393
L400970.9497979797

L20 is the SVM trained with the residue composition. L400 is trained with the difference in dipeptide composition of the sequences. Symbols + and − indicate increased and decreased thermostability classes, respectively.

Table A1.

Performances obtained with random splitting of the cross-validation sets

MethodAccuracy (%)CorrelationSensitivity (%)
Specificity (%)
++
L20930.8693939393
L400970.9497979797
MethodAccuracy (%)CorrelationSensitivity (%)
Specificity (%)
++
L20930.8693939393
L400970.9497979797

L20 is the SVM trained with the residue composition. L400 is trained with the difference in dipeptide composition of the sequences. Symbols + and − indicate increased and decreased thermostability classes, respectively.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.