pSumo-CD: Predicting sumoylation sites in proteins with covariance discriminant algorithm by incorporating...

Motivation: Sumoylation is a posttranslational modification (PTM) process in which a small ubiquitin-related modifier (SUMO) is attached by covalent bonds to a substrate protein. It is critical to many different biological processes, such as genome replication, gene expression, and the localization and stabilization of proteins; unfortunately, it is also involved in many major disorders, including Alzheimer's and Parkinson's diseases. Therefore, for both basic research and drug development, it is important to identify the sumoylation sites in proteins. Results: To address this problem, we developed a predictor called pSumo-CD by incorporating sequence-coupled information into the general pseudo amino acid composition (PseAAC) and introducing the covariance discriminant (CD) algorithm, in which a bias-adjustment term has been incorporated to automatically correct the errors caused by the imbalance of the training data. Rigorous cross-validations indicated that the new predictor remarkably outperformed the existing state-of-the-art prediction method for the same purpose. Availability: For the convenience of most experimental scientists, a user-friendly web-server for pSumo-CD has been established at http://www.jci-bioinfo.cn/pSumo-CD, by which users can easily obtain their desired results without the need to go through the complicated mathematical equations involved. Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
In vivo, protein post-translational modification (PTM or PTLM) is an important biological mechanism for expanding the genetic code as well as regulating cellular physiology. Sumoylation, the covalent attachment of Small Ubiquitin-like Modifier (SUMO) proteins to a substrate, is a type of PTM that plays an important role in subcellular transport, transcription, DNA repair and signal transduction. Recent research has indicated that sumoylation can promote the binding ability of proteins, and that the binding function of some proteins, such as claspin, is sumoylation-dependent. Many diseases and disorders, such as Alzheimer's and Parkinson's diseases, have been found to be closely related to sumoylation. Therefore, identification of sumoylation sites in proteins is important not only for in-depth understanding of many important biological processes but also for developing effective drugs.
Identification of sumoylation sites with biological and chemical approaches is laborious and slow, particularly because sumoylation is reversible and unstable. With the avalanche of protein sequences generated in the postgenomic era, computational methods are in high demand as a complementary tool to purely experimental methods.
Considerable efforts have already been made in this regard, including the method SUMOsp developed by Xue et al. (Xue et al., 2006), the method SUMOsp2.0 by Ren et al. (Ren et al., 2009), and the method GPS-SUMO by Zhao et al. (Zhao et al., 2014); all of them were developed by using the group-based phosphorylation scoring algorithm. Meanwhile, based on SVM (support vector machine), the methods SUMOPre and SUMOhydro were developed by Xu et al. (Xu et al., 2008) and Chen et al. (Chen et al., 2012), respectively. Very recently, based on linear discriminant analysis, Xu et al. (Xu et al., 2016) proposed the method SUMO_LDA. Each of these methods has contributed to stimulating the development of this important area. In particular, the most recent method SUMO_LDA (Xu et al., 2016), which was established by combining three different types of sequence features into the general pseudo amino acid composition (PseAAC) (Chou, 2011), achieved quite decent success rates. Considering, however, the importance of the topic and the urgent demand for more powerful computational tools in this area, further efforts aimed at predicting protein sumoylation sites are definitely needed.
The present study was initiated in an attempt to develop a new and more powerful predictor by (1) incorporating the vectorized sequence-coupling model into the general form of pseudo amino acid composition, and (2) installing the covariance discriminant operation engine into the prediction system. Rigorous cross-validations indicated that the new predictor significantly outperformed the existing state-of-the-art predictor (Xu et al., 2016) in both overall accuracy (> 10%) and stability (> 16%).
According to Chou's 5-step rule (Chou, 2011), which has been followed by many investigators in a series of recent publications (Chen et al., 2016a;Chen et al., 2016c;Jia et al., 2016a;Jia et al., 2016b;Jia et al., 2016c;Jia et al., 2016d;Liu et al., 2016a;Liu et al., 2016b;Liu et al., 2016d;Liu et al., 2016e;Qiu et al., 2016a;Xiao et al., 2016), to develop a new prediction method that can be widely used, we should consider the following five points: (1) a good benchmark dataset used to train or test the new model; (2) an effective mathematical formulation to ...

Materials and Methods

Benchmark Datasets
The benchmark dataset used in this study was derived from the same 510 proteins as used by Xu et al. (Xu et al., 2016). The complete amino acid sequences of these proteins can be obtained from UniProt (Apweiler et al., 2004). In the last decade or so, various consensus motifs for SUMO have been suggested. Regardless of their many differences in detail, they have one thing in common: they all contain the amino acid residue Lys (K). To make the description mathematically more rigorous and clear, Chou's scheme (Chou, 2001c) was adopted to formulate peptide samples, as done recently by many authors in studying nitrotyrosine sites (Xu et al., 2014), methylation sites (Qiu et al., 2014), and protein-protein binding sites (Jia et al., 2015b). According to Chou's scheme, a potential sumoylation site-containing peptide sample can be generally expressed by
$$\mathbf{P}_{\xi}(\mathbb{K}) = R_{-\xi} R_{-(\xi-1)} \cdots R_{-2} R_{-1}\, \mathbb{K}\, R_{+1} R_{+2} \cdots R_{+(\xi-1)} R_{+\xi} \tag{1}$$
where the symbol $\mathbb{K}$ denotes the single amino acid code K, the subscript $\xi$ is an integer, $R_{-\xi}$ represents the $\xi$-th upstream amino acid residue from the center, $R_{+\xi}$ the $\xi$-th downstream amino acid residue, and so forth. The $(2\xi + 1)$-tuple peptide sample $\mathbf{P}_{\xi}(\mathbb{K})$ can be further classified into the following two categories:
$$\mathbf{P}_{\xi}(\mathbb{K}) \in \begin{cases} \mathbf{P}_{\xi}^{+}(\mathbb{K}), & \text{if its center is a sumoylation site} \\ \mathbf{P}_{\xi}^{-}(\mathbb{K}), & \text{otherwise} \end{cases} \tag{2}$$
where $\mathbf{P}_{\xi}^{+}(\mathbb{K})$ denotes a true sumoylation segment, $\mathbf{P}_{\xi}^{-}(\mathbb{K})$ a false sumoylation segment, and the symbol $\in$ means "a member of" in set theory.
In the literature, a benchmark dataset usually consists of a training dataset and a testing dataset: the former is used for training a model, while the latter is used for testing it. But as pointed out in a comprehensive review (Chou and Shen, 2007a), there is no need to artificially separate a benchmark dataset into two parts if the prediction model is examined by the jackknife test or subsampling (K-fold) cross-validation, since the outcome thus obtained is actually a combination of many different independent dataset tests. Thus, the benchmark dataset for the current study can be formulated as
$$\mathbb{S}_{\xi} = \mathbb{S}_{\xi}^{+} \cup \mathbb{S}_{\xi}^{-} \tag{3}$$
where the positive subset $\mathbb{S}_{\xi}^{+}$ contains only the true sumoylation samples $\mathbf{P}_{\xi}^{+}(\mathbb{K})$, and the negative subset $\mathbb{S}_{\xi}^{-}$ contains only the false sumoylation segments $\mathbf{P}_{\xi}^{-}(\mathbb{K})$ (see Eq.2), while $\cup$ denotes the "union" symbol in set theory.
Since the length of the peptide sample $\mathbf{P}_{\xi}(\mathbb{K})$ is $2\xi + 1$ (see Eq.1), benchmark datasets with different $\xi$ values will contain peptide segments with different numbers of amino acid residues. In the study carried out recently by Xu et al. (Xu et al., 2016), however, they selected $\xi = 10$; i.e., the length of the peptide sample is
$$2\xi + 1 = 21 \tag{4}$$
To facilitate comparison with their method, in this study we also assign the value 10 to the window parameter $\xi$. Thus, Eq.1 and Eq.3 can be reduced to
$$\begin{cases} \mathbf{P}(\mathbb{K}) = R_{-10} R_{-9} \cdots R_{-2} R_{-1}\, \mathbb{K}\, R_{+1} R_{+2} \cdots R_{+9} R_{+10} \\ \mathbb{S} = \mathbb{S}^{+} \cup \mathbb{S}^{-} \end{cases} \tag{5}$$
The detailed procedures for constructing the benchmark dataset $\mathbb{S}$ are as follows.
(1) As done in (Chou, 2001c), the 21-tuple peptide window was slid along each of the aforementioned 510 protein sequences, and only those peptide segments that have K (Lys or lysine) at the center were collected (see Eq.1).
(2) If the central K had fewer than 10 residues on its upstream or downstream side (i.e., it lay within the first 10 or the last 10 positions of a protein sequence of length $L$, the latter meaning a position greater than $L - 10$), the lacking amino acids were filled with the dummy residue X.
(3) The peptide segment samples thus obtained were put into the positive subset if their centers have been experimentally annotated as sumoylation sites; otherwise, into the negative subset.
(4) The peptide samples thus obtained were subjected to a screening procedure to winnow those that had ≥ 40% pairwise sequence identity to any other sample in the same subset. By following the above procedures, we finally obtained 755 positive samples and 9,944 negative samples. For readers' convenience, their detailed sequences are given in Online Supporting Information S1 and Online Supporting Information S2, respectively.
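Steps (1) to (3) above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names `extract_windows` and `split_by_annotation`, the toy sequence, and the annotated position are hypothetical, and the 40% identity screening of step (4) is not implemented here.

```python
def extract_windows(sequence, xi=10, pad="X"):
    """Return (position, peptide) pairs for every K in `sequence`.

    `position` is 1-based; each peptide has length 2*xi + 1 with K at its
    center, and missing flanking residues are filled with the dummy `pad`.
    """
    padded = pad * xi + sequence + pad * xi
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Residue i of `sequence` sits at index i + xi in `padded`,
            # so the window with xi residues on each side starts at index i.
            windows.append((i + 1, padded[i:i + 2 * xi + 1]))
    return windows


def split_by_annotation(windows, sumo_positions):
    """Put a window in the positive list if its center K is annotated."""
    positive = [pep for pos, pep in windows if pos in sumo_positions]
    negative = [pep for pos, pep in windows if pos not in sumo_positions]
    return positive, negative


# Toy sequence (hypothetical, not from the benchmark); xi=2 keeps it short.
toy = "MKAAKLK"
wins = extract_windows(toy, xi=2)
pos, neg = split_by_annotation(wins, sumo_positions={5})
print(wins)  # [(2, 'XMKAA'), (5, 'AAKLK'), (7, 'KLKXX')]
```

With `xi=10` the same function yields the 21-tuple samples of Eq.5; padding the whole sequence once avoids treating the two termini as special cases.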

Incorporating sequence-coupled information into general pseudo amino acid composition
With the avalanche of biological sequences generated in the post-genomic age, one of the most important problems in computational biology is how to formulate a biological sequence with a discrete model or a vector, yet still retain considerable sequence pattern or order information. This is because all the existing machine-learning algorithms can only handle vector but not sequence samples, as elaborated in (Chou, 2015).
To address this problem, the pseudo amino acid composition (Chou, 2001a;Chou, 2005b) or PseAAC was proposed. Ever since the concept of pseudo amino acid composition or Chou's PseAAC (Cao et al., 2013;Du et al., 2012;Lin and Lapointe, 2013) was proposed, it has rapidly penetrated into nearly all the areas of computational proteomics (see, e.g., (Ahmad et al., 2016;Dehzangi et al., 2015;Kabir and Hayat, 2016;Khan et al., 2015;Kumar et al., 2015;Mondal and Pai, 2014;Tang et al., 2016;Wang et al., 2015) as well as a long list of references cited in (Chen et al., 2015b;Du et al., 2014)) and many biomedicine and drug development areas (Zhong and Zhou, 2014;Zhou, 2015;Zhou and Zhong, 2016). Because it has been widely and increasingly used, three powerful open access software packages, called 'PseAAC-Builder' (Du et al., 2012), 'propy' (Cao et al., 2013), and 'PseAAC-General' (Du et al., 2014), were recently established: the former two are for generating various modes of Chou's special PseAAC, while the third is for those of Chou's general PseAAC (Chou, 2011), including not only all the special modes of feature vectors for proteins but also the higher level feature vectors such as the "Functional Domain" mode (see Eqs.9-10 of (Chou, 2011)), the "Gene Ontology" mode (see Eqs.11-12 of (Chou, 2011)), and the "Sequential Evolution" or "PSSM" mode (see Eqs.13-14 of (Chou, 2011)). Inspired by the successes of using PseAAC to deal with protein/peptide sequences, three web-servers (Chen et al., 2014c;Chen et al., 2015c;Liu et al., 2015c) were developed for generating various feature vectors for DNA/RNA sequences as well. In particular, a powerful web-server called Pse-in-One (Liu et al., 2015d) has recently been developed that can be used to generate any desired feature vectors for protein/peptide and DNA/RNA sequences according to the needs of users' studies.
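According to the general PseAAC (Chou, 2011), the peptide sequence in Eq.5 can be formulated as
$$\mathbf{P} = \mathbf{P}^{+} - \mathbf{P}^{-} \tag{6}$$
where
$$\mathbf{P}^{+} = \left[ p_{-10}^{+}(R_{-10}|R_{-9}),\; p_{-9}^{+}(R_{-9}|R_{-8}),\; \cdots,\; p_{-2}^{+}(R_{-2}|R_{-1}),\; p_{-1}^{+}(R_{-1}),\; p_{+1}^{+}(R_{+1}),\; p_{+2}^{+}(R_{+2}|R_{+1}),\; \cdots,\; p_{+10}^{+}(R_{+10}|R_{+9}) \right]^{\mathbf{T}} \tag{7}$$
and
$$\mathbf{P}^{-} = \left[ p_{-10}^{-}(R_{-10}|R_{-9}),\; p_{-9}^{-}(R_{-9}|R_{-8}),\; \cdots,\; p_{-2}^{-}(R_{-2}|R_{-1}),\; p_{-1}^{-}(R_{-1}),\; p_{+1}^{-}(R_{+1}),\; p_{+2}^{-}(R_{+2}|R_{+1}),\; \cdots,\; p_{+10}^{-}(R_{+10}|R_{+9}) \right]^{\mathbf{T}} \tag{8}$$
In Eq.7, $p_{-10}^{+}(R_{-10}|R_{-9})$ is the conditional probability of amino acid $R_{-10}$ occurring at the left 1st position (see Eq.5) given that its closest right neighbor is $R_{-9}$, $p_{-9}^{+}(R_{-9}|R_{-8})$ is the conditional probability of amino acid $R_{-9}$ occurring at the left 2nd position given that its closest right neighbor is $R_{-8}$, and so forth. Note that in Eq.7, only $p_{-1}^{+}(R_{-1})$ and $p_{+1}^{+}(R_{+1})$ are non-conditional probabilities, since the right neighbor of $R_{-1}$ and the left neighbor of $R_{+1}$ are always K (Lys). All these probability values can be easily derived from the positive samples in Supporting Information S1. Likewise, the components in Eq.8 are the same as those in Eq.7 except that they are derived from the negative samples in Supporting Information S2.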
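The conditional probabilities described above can be estimated by simple counting over a set of aligned peptide samples. The sketch below is a hypothetical illustration under stated assumptions: the function names and toy peptides are invented, upstream positions are conditioned on their right neighbor and downstream positions on their left neighbor, unseen contexts are assigned probability 0, and the final feature vector is assumed to be the element-wise difference between the positive-set and negative-set probabilities.

```python
from collections import Counter


def fit_coupled_tables(peptides):
    """Count single residues and adjacent residue pairs per window index."""
    length = len(peptides[0])
    single = [Counter(p[i] for p in peptides) for i in range(length)]
    # pair[i] counts (residue at index i, residue at index i + 1)
    pair = [Counter((p[i], p[i + 1]) for p in peptides)
            for i in range(length - 1)]
    return {"n": len(peptides), "length": length,
            "single": single, "pair": pair}


def coupled_vector(peptide, tables):
    """Return the 2*xi coupled probabilities for one peptide (center skipped)."""
    center = tables["length"] // 2
    single, pair, n = tables["single"], tables["pair"], tables["n"]
    vec = []
    for i in range(tables["length"]):
        if i == center:
            continue
        if i in (center - 1, center + 1):
            # The neighbor toward the center is always K: plain frequency.
            vec.append(single[i][peptide[i]] / n)
            continue
        # Condition on the neighbor that lies closer to the center.
        j = i + 1 if i < center else i - 1
        lo, hi = min(i, j), max(i, j)
        joint = pair[lo][(peptide[lo], peptide[hi])]
        given = single[j][peptide[j]]
        vec.append(joint / given if given else 0.0)
    return vec


def feature_vector(peptide, pos_tables, neg_tables):
    """Assumed form: positive-set probabilities minus negative-set ones."""
    plus = coupled_vector(peptide, pos_tables)
    minus = coupled_vector(peptide, neg_tables)
    return [a - b for a, b in zip(plus, minus)]


# Toy 5-residue peptides (xi = 2); hypothetical data, not from S1 or S2.
pos_tables = fit_coupled_tables(["ACKDE", "ACKDF", "GCKHE"])
neg_tables = fit_coupled_tables(["ACKDE", "TTKTT"])
features = feature_vector("ACKDE", pos_tables, neg_tables)
```

For 21-residue samples the same functions return a 20-component vector, one component per non-central position.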
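To make the feature vector as defined in Eq.6 easier to express in the CD algorithm, let us define
$$\mathbb{P} = \left[ \Psi_{1}\; \Psi_{2}\; \cdots\; \Psi_{u}\; \cdots\; \Psi_{20} \right]^{\mathbf{T}} \tag{9}$$
where $\mathbf{T}$ is the transposing operator, and
$$\Psi_{u} = \begin{cases} p_{u-11}^{+}(R_{u-11}|R_{u-10}) - p_{u-11}^{-}(R_{u-11}|R_{u-10}), & 1 \le u \le 9 \\ p_{-1}^{+}(R_{-1}) - p_{-1}^{-}(R_{-1}), & u = 10 \\ p_{+1}^{+}(R_{+1}) - p_{+1}^{-}(R_{+1}), & u = 11 \\ p_{u-10}^{+}(R_{u-10}|R_{u-11}) - p_{u-10}^{-}(R_{u-10}|R_{u-11}), & 12 \le u \le 20 \end{cases} \tag{10}$$
Suppose the standard feature vectors for the peptide samples in $\mathbb{S}^{+}$ and $\mathbb{S}^{-}$ are, respectively, expressed by
$$\overline{\mathbb{P}^{+}} = \left[ \overline{\Psi}_{1}^{+}\; \overline{\Psi}_{2}^{+}\; \cdots\; \overline{\Psi}_{20}^{+} \right]^{\mathbf{T}}, \qquad \overline{\mathbb{P}^{-}} = \left[ \overline{\Psi}_{1}^{-}\; \overline{\Psi}_{2}^{-}\; \cdots\; \overline{\Psi}_{20}^{-} \right]^{\mathbf{T}} \tag{11}$$
where
$$\overline{\Psi}_{u}^{+} = \frac{1}{N^{+}} \sum_{k=1}^{N^{+}} \Psi_{k,u}^{+}, \qquad \overline{\Psi}_{u}^{-} = \frac{1}{N^{-}} \sum_{k=1}^{N^{-}} \Psi_{k,u}^{-} \qquad (u = 1, 2, \cdots, 20) \tag{12}$$
Here $\Psi_{k,u}^{+}$ is the $u$-th component of the feature vector for the $k$-th peptide sample in the positive dataset $\mathbb{S}^{+}$, $\Psi_{k,u}^{-}$ that for the $k$-th peptide sample in the negative dataset $\mathbb{S}^{-}$, $N^{+}$ the total number of peptide samples in $\mathbb{S}^{+}$, and $N^{-}$ the total number of peptide samples in $\mathbb{S}^{-}$. Thus, whether a query peptide sequence sample $\mathbb{P}$ belongs to the sumoylation site subset $\mathbb{S}^{+}$ or the non-sumoylation site subset $\mathbb{S}^{-}$ will be judged by
$$\mathbb{P} \in \mathbb{S}^{\mathrm{Sgn}(\delta)} \tag{13}$$
where $\mathrm{Sgn}(\delta)$ is the value of $\delta$ ($+$ or $-$) that minimizes $\mathbb{D}(\mathbb{P}, \overline{\mathbb{P}^{\delta}})$, which according to the CD algorithm is defined as (Chou and Maggiora, 1998)
$$\mathbb{D}(\mathbb{P}, \overline{\mathbb{P}^{\delta}}) = D_{\mathrm{M}}^{2}(\mathbb{P}, \overline{\mathbb{P}^{\delta}}) + \ln|\mathbb{C}_{\delta}| - 2\ln\left[\mathcal{P}(\mathbb{S}^{\delta})\right] + \Lambda\ln(2\pi) \tag{14}$$
In Eq.14, $\Lambda$ is the dimension of the feature vector; it is constant, and hence the term $\Lambda\ln(2\pi)$ can be ignored in this study. $\mathcal{P}(\mathbb{S}^{\delta})$ is the prior probability of subset $\mathbb{S}^{\delta}$. For the current study, we have
$$\mathcal{P}(\mathbb{S}^{+}) = \frac{N^{+}}{N^{+} + N^{-}} = \frac{755}{10{,}699} \approx 0.0706, \qquad \mathcal{P}(\mathbb{S}^{-}) = \frac{N^{-}}{N^{+} + N^{-}} = \frac{9{,}944}{10{,}699} \approx 0.9294 \tag{15}$$
where $N^{+} = 755$ and $N^{-} = 9{,}944$ (see Supporting Information S1 and S2). $\mathbb{C}_{\delta}$ in Eq.14 is the covariance matrix of the subset $\mathbb{S}^{\delta}$, as given by
$$\mathbb{C}_{\delta} = \begin{bmatrix} c_{1,1}^{\delta} & c_{1,2}^{\delta} & \cdots & c_{1,20}^{\delta} \\ c_{2,1}^{\delta} & c_{2,2}^{\delta} & \cdots & c_{2,20}^{\delta} \\ \vdots & \vdots & \ddots & \vdots \\ c_{20,1}^{\delta} & c_{20,2}^{\delta} & \cdots & c_{20,20}^{\delta} \end{bmatrix} \tag{16}$$
and the elements therein are given by
$$c_{i,j}^{\delta} = \frac{1}{N^{\delta} - 1} \sum_{k=1}^{N^{\delta}} \left[ \Psi_{k,i}^{\delta} - \overline{\Psi}_{i}^{\delta} \right] \left[ \Psi_{k,j}^{\delta} - \overline{\Psi}_{j}^{\delta} \right] \qquad (i, j = 1, 2, \cdots, 20) \tag{17}$$
Its determinant is denoted by $|\mathbb{C}_{\delta}|$ and its inverse matrix by $\mathbb{C}_{\delta}^{-1}$. Thus, the Mahalanobis distance (Mahalanobis, 1936) is given by (Chou, 1995b;Chou and Zhang, 1994)
$$D_{\mathrm{M}}^{2}(\mathbb{P}, \overline{\mathbb{P}^{\delta}}) = \left( \mathbb{P} - \overline{\mathbb{P}^{\delta}} \right)^{\mathbf{T}} \mathbb{C}_{\delta}^{-1} \left( \mathbb{P} - \overline{\mathbb{P}^{\delta}} \right) \tag{18}$$
It is important to point out, however, that the 20 components in Eq.9 are defined by the probability distribution (see Eq.10) and hence they are constrained by some sort of condition (Chou, 1995b). Therefore, of the 20 components, only 19 are completely independent. As a consequence, the covariance matrix $\mathbb{C}_{\delta}$ of Eq.16 must be singular (Chou and Zhang, 1994), making the $\mathbb{D}$ function (Eq.14) divergent and meaningless. To avoid such a situation, the dimension-reducing procedure (Chou and Zhang, 1994) was adopted. The concrete procedure is as follows.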
Instead of a 20-D space, a peptide sample is defined in a 19-D space by leaving out one of its 20 components. The remaining 19 components will be completely independent; therefore, the corresponding covariance matrix $\mathbb{C}_{\delta}$ will no longer be singular. However, a question might be raised: which one of the 20 components should be left out? The answer is any one. Will removing a different component yield a different outcome? The answer is no. The reason is that, according to Chou's invariance theorem, the outcome of the Mahalanobis distance will remain the same regardless of which one of the components is left out. For the rigorous proof of Chou's invariance theorem, see Appendix A of (Chou, 1995b). For a brief introduction to Chou's invariance theorem, see the Wikipedia article at https://en.wikipedia.org/wiki/Chou's_invariance_theorem. Accordingly, in practical calculation, instead of the 20 × 20 matrix version as shown in Eq.16, we should use its 19 × 19 matrix version for $\mathbb{C}_{\delta}$. Furthermore, the 2nd term in the $\mathbb{D}$ function of Eq.14 can be formulated as (Chou and Elrod, 1999a;Chou and Elrod, 1999b)
$$\ln|\mathbb{C}_{\delta}| = \begin{cases} \ln(\lambda_{1} \lambda_{2} \cdots \lambda_{19} \lambda_{20}) & \text{for the } 20 \times 20 \text{ version} \\ \ln(\lambda_{2} \lambda_{3} \lambda_{4} \cdots \lambda_{19}) & \text{for the } 19 \times 19 \text{ version} \end{cases} \tag{19}$$
where $0 = \lambda_{1} < \lambda_{2} < \lambda_{3} < \cdots$ are the eigenvalues of the covariance matrix $\mathbb{C}_{\delta}$. Note that in the early studies using the CD algorithm to predict protein structural classes (Chou, 1995b;Chou and Zhang, 1994;Chou, 1995a), only the Mahalanobis distance term (Eq.18) was used to calculate the $\mathbb{D}$ function of Eq.14.
In a series of subsequent studies on predicting protein structural classes (Chou, 1999;Chou and Maggiora, 1998;Liu and Chou, 1998;Zhou, 1998;Zhou and Assa-Munt, 2001), protein subcellular localization (Chou, 2000a;Chou and Elrod, 1998;Chou and Elrod, 1999b;Zhou and Doctor, 2003), membrane protein types (Chou and Elrod, 1999a), GPCR types (Chou, 2005a;Chou and Elrod, 2002;Elrod and Chou, 2002), enzyme family classes (Chou and Elrod, 2003), and nucleosome positioning (Chen et al., 2012a), the 2nd term $\ln|\mathbb{C}_{\delta}|$ in Eq.14 was included as well, and the prediction quality was remarkably improved. In this study, we further include the 3rd term $2\ln[\mathcal{P}(\mathbb{S}^{\delta})]$ of Eq.14. The reason is that in the current case the prior probability of subset $\mathbb{S}^{+}$ is much smaller than that of subset $\mathbb{S}^{-}$ (see Eq.15). In other words, when the CD algorithm is used to study a system in which the subsets of the benchmark dataset are not very skewed, it is acceptable to use just the 1st and 2nd terms of Eq.14. But when the benchmark dataset is very unbalanced or highly skewed, the 3rd term must be taken into account. Otherwise, if $\mathbb{S}^{+}$ is much smaller than $\mathbb{S}^{-}$ ($N^{+} \ll N^{-}$), many positive samples might be incorrectly predicted as negative, and vice versa. To fix such a bias problem when using operation engines other than the CD algorithm, special treatments, such as dataset optimization (Xiao et al., 2015), Monte Carlo sampling (Jia et al., 2016a), and the fusion approach (Jia et al., 2016b;Jia et al., 2016d;Qiu et al., 2016b), were needed. The advantage of the CD algorithm is that it automatically includes a function to deal with the skewed dataset problem; this function was simply ignored by the aforementioned investigators, who therefore missed the contribution from the 3rd term of Eq.14.
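To make the role of the three terms concrete, the following pure-Python sketch scores a query against each subset with a Mahalanobis term, a log-determinant term, and a prior term, and assigns the query to the subset with the smaller score. It is a minimal illustration, not the pSumo-CD implementation: the function names and the 2-D toy data are hypothetical, the prior term is assumed to enter as $-2\ln$ of the subset's prior probability (the standard Gaussian-discriminant form), and the covariance matrix is assumed to be non-singular already (i.e., any dependent component has been dropped beforehand).

```python
import math


def mean_vector(samples):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]


def covariance(samples, mean):
    """Sample covariance matrix with the 1/(N - 1) normalization."""
    n, d = len(samples), len(mean)
    return [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (n - 1)
             for j in range(d)] for i in range(d)]


def solve_and_logdet(matrix, rhs):
    """Gaussian elimination with partial pivoting: (C^-1 rhs, ln|det C|)."""
    d = len(rhs)
    a = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    logdet = 0.0
    for col in range(d):
        piv = max(range(col, d), key=lambda r: abs(a[r][col]))
        if abs(a[piv][col]) < 1e-12:
            raise ValueError("singular covariance matrix")
        a[col], a[piv] = a[piv], a[col]
        logdet += math.log(abs(a[col][col]))
        for r in range(col + 1, d):
            f = a[r][col] / a[col][col]
            for c in range(col, d + 1):
                a[r][c] -= f * a[col][c]
    x = [0.0] * d
    for r in reversed(range(d)):
        x[r] = (a[r][d] - sum(a[r][c] * x[c] for c in range(r + 1, d))) / a[r][r]
    return x, logdet


def cd_score(query, samples, prior):
    """Mahalanobis term + ln|C| - 2 ln(prior); the constant term is dropped."""
    mean = mean_vector(samples)
    diff = [q - m for q, m in zip(query, mean)]
    x, logdet = solve_and_logdet(covariance(samples, mean), diff)
    mahalanobis_sq = sum(di * xi for di, xi in zip(diff, x))
    return mahalanobis_sq + logdet - 2.0 * math.log(prior)


def classify(query, positives, negatives):
    """Return '+' or '-' for the subset with the smaller discriminant score."""
    total = len(positives) + len(negatives)
    d_pos = cd_score(query, positives, len(positives) / total)
    d_neg = cd_score(query, negatives, len(negatives) / total)
    return "+" if d_pos < d_neg else "-"


# Hypothetical 2-D toy clusters, far apart so the outcome is unambiguous.
toy_pos = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
toy_neg = [[x + 5.0, y + 5.0] for x, y in toy_pos]
label_a = classify([0.4, 0.6], toy_pos, toy_neg)
label_b = classify([5.2, 5.4], toy_pos, toy_neg)
```

Solving the linear system directly, rather than forming the inverse matrix, gives the quadratic form and the log-determinant in one elimination pass; for real 19-D data a numerical library would normally be preferred.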

Results and discussion
As pointed out in the Introduction section, one of the keys in establishing a useful predictor is how to properly evaluate its anticipated success rates.
To realize this, we need to consider two things: one is what metrics or scales should be used to quantitatively measure its prediction quality; the other is what validation method should be adopted to calculate or derive the metrics values. Below, let us address the two problems.

A set of four metrics
The following four metrics are usually used in the literature to measure the quality of binary classification: (1) overall accuracy or Acc; (2) Matthews correlation coefficient or MCC; (3) sensitivity or Sn; and (4) specificity or Sp (see, e.g., (Chen et al., 2007)). Unfortunately, the conventional formulations of the four are not intuitive, and most experimental scientists find them difficult to understand, particularly that of MCC. Interestingly, by using Chou's symbols and derivation in studying signal peptides (Chou, 2001b), the aforementioned four metrics can be easily converted into the following set of equations (Chen et al., 2013;Xu et al., 2013):
$$\begin{cases} \mathrm{Sn} = 1 - \dfrac{N_{-}^{+}}{N^{+}}, & 0 \le \mathrm{Sn} \le 1 \\[2ex] \mathrm{Sp} = 1 - \dfrac{N_{+}^{-}}{N^{-}}, & 0 \le \mathrm{Sp} \le 1 \\[2ex] \mathrm{Acc} = 1 - \dfrac{N_{-}^{+} + N_{+}^{-}}{N^{+} + N^{-}}, & 0 \le \mathrm{Acc} \le 1 \\[2ex] \mathrm{MCC} = \dfrac{1 - \left( \dfrac{N_{-}^{+}}{N^{+}} + \dfrac{N_{+}^{-}}{N^{-}} \right)}{\sqrt{\left( 1 + \dfrac{N_{+}^{-} - N_{-}^{+}}{N^{+}} \right) \left( 1 + \dfrac{N_{-}^{+} - N_{+}^{-}}{N^{-}} \right)}}, & -1 \le \mathrm{MCC} \le 1 \end{cases} \tag{20}$$
where $N^{+}$ represents the total number of sumoylation sites investigated, whereas $N_{-}^{+}$ is the number of true sumoylation sites incorrectly predicted to be non-sumoylation sites; $N^{-}$ is the total number of non-sumoylation sites investigated, whereas $N_{+}^{-}$ is the number of non-sumoylation sites incorrectly predicted to be sumoylation sites. According to Eq.20, the following is clear. When $N_{-}^{+} = 0$, meaning that none of the true sumoylation sites are incorrectly predicted to be non-sumoylation sites, we have the sensitivity Sn = 1. When $N_{-}^{+} = N^{+}$, meaning that all the sumoylation sites are incorrectly predicted to be non-sumoylation sites, we have the sensitivity Sn = 0. Likewise, when $N_{+}^{-} = 0$, meaning that none of the non-sumoylation sites are incorrectly predicted to be sumoylation sites, we have the specificity Sp = 1; whereas when $N_{+}^{-} = N^{-}$, meaning that all the non-sumoylation sites are incorrectly predicted to be sumoylation sites, we have the specificity Sp = 0.
When $N_{-}^{+} = N_{+}^{-} = 0$, meaning that none of the sumoylation sites in the positive dataset and none of the non-sumoylation sites in the negative dataset are incorrectly predicted, we have the overall accuracy Acc = 1 and MCC = 1; when $N_{-}^{+} = N^{+}$ and $N_{+}^{-} = N^{-}$, meaning that all the sumoylation sites in the positive dataset and all the non-sumoylation sites in the negative dataset are incorrectly predicted, we have the overall accuracy Acc = 0 and MCC = −1; whereas when $N_{-}^{+} = N^{+}/2$ and $N_{+}^{-} = N^{-}/2$, we have Acc = 0.5 and MCC = 0, meaning no better than a random guess. Therefore, using Eq.20 makes the meanings of sensitivity, specificity, overall accuracy, and Matthews correlation coefficient much more intuitive and easier to understand, particularly the meaning of MCC, as concurred recently by many investigators (see, e.g., (Chen et al., 2016a;Chen et al., 2015a;Chen et al., 2016b;Chen et al., 2014a;Chen et al., 2014b;Chen et al., 2016c;Ding et al., 2014;Jia et al., 2015a;Jia et al., 2015b;Jia et al., 2016a;Liu et al., 2015a;Liu et al., 2016b;Liu et al., 2015b;Liu et al., 2016c;Liu et al., 2016d;Liu et al., 2015e;Qiu et al., 2016a;Qiu et al., 2016b;Xiao et al., 2015;Xiao et al., 2016)).
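The limiting cases discussed above are easy to check numerically. The sketch below assumes the usual form of the four metrics in this notation (Sn and Sp as one minus the respective error fraction, Acc as one minus the pooled error fraction, and MCC as the normalized difference); the function name `chou_metrics`, its argument names, and the example counts are hypothetical.

```python
import math


def chou_metrics(n_pos, n_pos_missed, n_neg, n_neg_wrong):
    """Return (Sn, Sp, Acc, MCC) from the four counts.

    n_pos        -- total true sumoylation sites investigated
    n_pos_missed -- true sites incorrectly predicted as non-sites
    n_neg        -- total non-sumoylation sites investigated
    n_neg_wrong  -- non-sites incorrectly predicted as sites
    """
    sn = 1 - n_pos_missed / n_pos
    sp = 1 - n_neg_wrong / n_neg
    acc = 1 - (n_pos_missed + n_neg_wrong) / (n_pos + n_neg)
    numerator = 1 - (n_pos_missed / n_pos + n_neg_wrong / n_neg)
    denominator = math.sqrt(
        (1 + (n_neg_wrong - n_pos_missed) / n_pos)
        * (1 + (n_pos_missed - n_neg_wrong) / n_neg)
    )
    # MCC is undefined when every sample is assigned to a single class.
    mcc = numerator / denominator if denominator else float("nan")
    return sn, sp, acc, mcc


perfect = chou_metrics(755, 0, 9944, 0)
all_wrong = chou_metrics(755, 755, 9944, 9944)
half_wrong = chou_metrics(100, 50, 200, 100)
```

Substituting TP = `n_pos - n_pos_missed`, FN = `n_pos_missed`, TN = `n_neg - n_neg_wrong` and FP = `n_neg_wrong` recovers the conventional confusion-matrix expressions, so the two formulations give identical values.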

Note, however, that the set of equations defined in Eq.20 is valid only for single-label systems. For multi-label systems, which have emerged more frequently in systems biology (Chou et al., 2012;Lin et al., 2013;Xiao et al., 2011) and systems medicine (Xiao et al., 2013), a completely different set of metrics is needed, as elaborated in (Chou, 2013).

Cross-Validation
With a set of well-defined metrics for measuring the quality of a predictor, the next question is what kind of validation method should be used to score these metrics. In predictive analytics, the following three cross-validation methods are often used: (1) independent dataset test, (2) subsampling (or K-fold cross-validation) test, and (3) jackknife test (Chou and Zhang, 1995). Of these three, however, the jackknife test is deemed the least arbitrary, since it always yields a unique outcome for a given benchmark dataset, as elucidated in (Chou, 2011). Accordingly, the jackknife test has been widely recognized and increasingly used by investigators to examine the quality of various predictors (see, e.g., (Ahmad et al., 2016;Cai et al., 2003;Chou and Cai, 2003;Chou and Cai, 2005;Dehzangi et al., 2015;Fan et al., 2015;Ju et al., 2016;Kabir and Hayat, 2016;Khan et al., 2015;Kumar et al., 2015;Mondal and Pai, 2014;Shen et al., 2007;Tang et al., 2016;Zhou, 1998;Zhou and Assa-Munt, 2001;Zhou and Doctor, 2003)). However, to reduce the computational time, in this study we adopted 10-fold cross-validation, as done by most investigators using SVM and random forest algorithms as the prediction engine. Besides, doing so also facilitates the comparison, since the results reported by Xu et al. (Xu et al., 2016) were also derived from 10-fold cross-validation.
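The difference between the subsampling test and the jackknife test can be seen in a short index-splitting sketch. It is a generic illustration, not the procedure used to produce the reported results: the function name `k_fold_indices` and the fixed random seed are assumptions, and stratification by class is not attempted.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for K-fold cross-validation."""
    indices = list(range(n_samples))
    # The partition depends on the shuffle, which is why K-fold outcomes
    # are not unique for a given benchmark dataset.
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(25, k=10))
# Setting k equal to the number of samples gives the jackknife
# (leave-one-out) test, whose partition does not depend on the shuffle.
jackknife = list(k_fold_indices(5, k=5))
```

Each sample is used for testing exactly once, so the error counts accumulated over the K test folds can be passed directly to the metric formulas.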

Result analysis and comparison
The success rates achieved by the new predictor pSumo-CD via the 10-fold cross-validation test on the same 510 proteins used by Xu et al. (Xu et al., 2016) are given in Table 1, where, to facilitate comparison, the corresponding rates obtained by the predictor SUMO_LDA (Xu et al., 2016) are also listed. As we can see from the table, the Acc of the new predictor pSumo-CD is 97.88%, which is about 11% higher than that of SUMO_LDA (Xu et al., 2016). The MCC of pSumo-CD is 0.846, about 16% higher than that of SUMO_LDA. It is instructive to point out that, of the four metrics defined in Eq.20, the most important are Acc and MCC (Chen et al., 2016a;Chen et al., 2016c): the former reflects the overall accuracy of a predictor, while the latter reflects its stability in practical applications. The metrics Sn and Sp are used to measure a predictor from two opposite angles. When, and only when, both Sn and Sp of predictor A are higher than those of predictor B can we say that A is better than B (Liu et al., 2016d). In other words, Sn and Sp actually constrain each other (Chou, 1993). Therefore, it is meaningless to use only one of the two for comparing the quality of two predictors. A meaningful comparison in this regard should consider the rates of both Sn and Sp, or even better, the rate of their combination, which is none other than the score of MCC as shown in Table 1.
Table 1 footnotes: (a) The scores here were generated by the 10-fold cross-validation on the 510 proteins as used by Xu et al. (Xu et al., 2016). (b) The predictor developed by Xu et al. (Xu et al., 2016). (c) The predictor proposed in this paper. (d) See Eq.20 for the metrics definition.
Why can the proposed method increase the prediction quality so substantially? First, a special term in the CD algorithm, namely $2\ln[\mathcal{P}(\mathbb{S}^{\delta})]$ of Eq.14, has been taken into account in the current study. The term has the function of adjusting the errors caused by the bias in a highly unbalanced benchmark dataset. Second, the amino-acid-coupled effects around the sumoylation sites have been taken into account via the conditional probability approach as formulated in Eqs.6-10. As a result, the cluster of the true-sumoylation samples (Fig.1) can be more distinctly separated from that of the false-sumoylation samples, leading to better success rates in discriminating them from each other. Similar remarkable successes have also been observed in predicting beta-turns (Zhang and Chou, 1997), alpha-turns (Chou, 1997), tight turns and their types in proteins (Chou, 2000b), specificity of GalNAc-transferase (Chou, 1995c), HIV protease cleavage sites (Chou, 1993;Zhang and Chou, 1993), as well as signal peptide cleavage sites (Chou, 2001d;Chou and Shen, 2007b;Shen and Chou, 2007).

Figure 1.
Histograms to show the clusters of the true and false sumoylation peptide samples that are expressed by the general PseAAC of Eq.10. Each of the components is marked on the horizontal axis, and its average value on the vertical axis. The red histogram is for the mean value derived from the positive subset, and the blue for that from the negative subset. See the text for further explanation.

Web server and user guide
To enhance the value of its practical applications, the web-server for pSumo-CD has been established at http://www.jci-bioinfo.cn/pSumo-CD. Furthermore, to maximize the convenience for the majority of experimental scientists, a step-by-step guide is provided below.
Step 1. Open the web-server at http://www.jci-bioinfo.cn/pSumo-CD, and you will see its top page on your computer screen, as shown in Fig.2. Click on the Read Me button to see a brief introduction to the pSumo-CD predictor.
Step 2. Either type or copy/paste the query protein sequences into the input box at the center of Fig.2. The input sequence should be in the FASTA format. For the examples of sequences in FASTA format, click the Example button right above the input box.
Step 3. Click on the Submit button to see the predicted result. For example, if you use the three query protein sequences in the Example window as the input, about 20 seconds after submission you will see the following on your computer screen: (1) The 1st query protein (O95644) contains 51 K residues, of which residues 277, 293, and 703 are highlighted in red, meaning that they are predicted to be sumoylation sites. (2) The 2nd query protein (B7ZR65) contains 25 K residues, of which residues 55, 253, and 365 are predicted to be sumoylation sites. (3) The 3rd query protein (P03496) contains 12 K residues, of which only the one at position 70 is highlighted in red, meaning that it is predicted to be a sumoylation site. All the (51 + 25 + 12) = 88 predicted outcomes are fully consistent with experimental observations except for the following two cases: residue 55 in the 2nd query protein was over-predicted (false positive) and residue 219 in the 3rd query protein was missed (false negative).
Step 4. As shown on the lower panel of Fig.2, you may also choose the batch prediction by entering your e-mail address and your desired batch input file (in FASTA format of course) via the Browse button. To see the sample of batch input file, click on the button Batch-example.
Step 5. Click the Supporting Information button to download the benchmark dataset used in this study.
Step 6. Click on the Citation button to find the relevant papers that play the key roles in developing the new prediction method.

Conclusion
The pSumo-CD predictor is a new bioinformatics tool for identifying the sumoylation sites in proteins. Compared with the existing state-of-the-art predictor in this area, its prediction quality is much better, with remarkably higher overall accuracy and stability. For the convenience of most experimental scientists, we have provided its web-server and a step-by-step guide, by which users can easily obtain their desired results without the need to go through the detailed mathematics. The mathematical details are included in this paper for the integrity of the new prediction method, and also because some interesting techniques, such as incorporating the sequence-coupled approach into the general PseAAC, introducing the prior probability term in the CD algorithm to adjust the bias errors caused by an unbalanced training dataset, and applying Chou's invariance theorem to overcome the divergence problem, may be of use in developing other tools in computational biology.
We anticipate that pSumo-CD will become a very useful high throughput tool for both basic research and drug development in the areas relevant to the protein sumoylation.