GPS-SUMO: a tool for the prediction of sumoylation sites and SUMO-interaction motifs

Small ubiquitin-like modifiers (SUMOs) regulate a variety of cellular processes through two distinct mechanisms, including covalent sumoylation and non-covalent SUMO interaction. The complexity of SUMO regulations has greatly hampered the large-scale identification of SUMO substrates or interaction partners on a proteome-wide level. In this work, we developed a new tool called GPS-SUMO for the prediction of both sumoylation sites and SUMO-interaction motifs (SIMs) in proteins. To obtain an accurate performance, a new generation group-based prediction system (GPS) algorithm integrated with Particle Swarm Optimization approach was applied. By critical evaluation and comparison, GPS-SUMO was demonstrated to be substantially superior against other existing tools and methods. With the help of GPS-SUMO, it is now possible to further investigate the relationship between sumoylation and SUMO interaction processes. A web service of GPS-SUMO was implemented in PHP + JavaScript and freely available at http://sumosp.biocuckoo.org.


INTRODUCTION
By covalently modifying specific lysine residues in protein substrates, or by non-covalently interacting with proteins, small ubiquitin-like modifiers (SUMOs) play an essential role in the regulation of a variety of biological processes, including gene expression, DNA repair, chromosome assembly, and cellular signaling (1)(2)(3)(4). Along with the accumulating research on its biological functions, there are abundant evidences that the aberrance of SUMO regulation is highly associated with various diseases, such as neurodegenerative diseases (5,6), congenital heart defects (7), diabetes (8) and cancers (9). Therefore, the identification of SUMO modifi-cation sites and SUMO-interaction motifs (SIMs) in proteins is fundamental for understanding the biological functions and regulatory mechanisms of SUMOs, and provides potential targets for further diagnostic and therapeutic consideration.
The process of proteins being covalently modified by SUMOs is called as sumoylation, which is one of the most important and ubiquitous post-translational modifications (PTMs) of proteins (10,11). Previously, experimental studies suggested that most of sumoylation sites follow a canonical consensus motif of -K-X-E ( , a hydrophobic amino acid, such as A, I, L, M, P, F, V or W; X, any amino acid residue) (12,13). However, our collective experimental data shows that approximately 40% (400 out of 983 sites) of known sumoylation sites do not conform to the above motif (Supplementary Table S1 and 3). In this regard, the current understanding of sumoylation recognition is still inadequate.
Recently, it was reported SUMOs can non-covalently interact with other proteins through targeting specific SIMs, which were also called as SUMO-binding motifs (SBMs) (14- 16). For example, the SUMO interaction of Daxx modulates its sumoylation and is critical for targeting Daxx to PML oncogenic domains (PODs) for the transcriptional repression (17). Also, the non-covalent interaction of SUMO-2 and CoREST1, but not the sumoylation, is essential for organizing the transcriptional corepressor complex of LSD1/CoREST1/HDAC (18 (22). Although nearly ten types of SIMs were experimentally identified, each one can only recall a small proportion of known SIMs, and no one can present a major profile for SIMs.
Because of the complicated features, systematic analysis of sumoylation and SUMO interaction is still a great challenge. In contrast with labor-intensive and time-consuming experimental identifications, in silico prediction of sumoylation sites and SIMs in proteins can greatly narrow down the number of candidates, and generate helpful information for further verification. In the past decade, our group has made great efforts in developing a series of tools for predicting protein sumoylation sites (25,26). In early 2006, we released an online service of SUMOsp based on the first-generation Group-based Prediction System (GPS) algorithm (25). Subsequently, a matrix mutation (MaM) method was integrated into SUMOsp to upgrade the performance and a new version of SUMOsp 2.0 was introduced (26). In addition, other researchers have also constructed several reliable tools for the prediction of sumoylation sites, including SUMOplot (http://www.abgent.com/sumoplot), seeSUMO (27), SUMOpre (28) and SUMOhydro (29). However, these tools only focused on the sumoylation prediction, while a predictor for SIMs is still need to be developed.
In this work, by improving the prediction algorithm and adding the novel SIM prediction feature, we developed an updated version of SUMOsp and renamed it as GPS-SUMO. From the scientific literature, we manually collected 983 sumoylation sites in 545 proteins and 137 known SIMs in 80 proteins as the non-redundant data sets, respectively. Subsequently, the fourth-generation GPS algorithm integrated with the PSO (30,31) method was employed for training and predicting. For convenience, a user-friendly web interface was developed using PHP + JavaScript, and is freely available at http://sumosp.biocuckoo.org.

IMPLEMENTATION
By searching the scientific literature (published before September 2013) in the PubMed with the keywords of 'SUMO', 'sumoylation' and 'sumoylated', we updated our previous training data set. The latest data set contained 1059 sumoylation sites in 594 proteins and 151 SIMs in 88 proteins. An online database of these experimentally ver-ified sites was then developed and the intact annotations from UniProt and NCBI were integrated. To avoid overestimation of the prediction accuracy, the redundant sites should be removed, and the CD-HIT (32) with a threshold of 40% sequence identity was used to single out homologous proteins. If two proteins were found to be modified at the same position and to have more than 40% sequence identity, only one of the two proteins was preserved. In particular, 71 sumoylation sites were randomly picked from the latest collected data set to construct an additional testing data set. Due to the data limitations, an additional testing data set for SUMO interaction was not constructed. Finally, 912 sumoylation sites in 510 protein (Supplementary Table  S1) and 137 SIMs in 80 proteins (Supplementary Table S2) were retained as the non-redundant training data sets.
In GPS-SUMO, the fourth-generation GPS algorithm was applied to predict the potential sumoylation sites and SIMs. As previously reported, consecutive hydrophobic residues are crucial for the non-covalent SUMO interaction (14,15). Based on the experimental observations, we summarized a hydrophobic motif of [IVL]{3,5} from the experimentally verified SIMs. Only 5 (∼3.6%) of 137 known SIMs did not follow this pattern (Supplementary Table S2). This motif was then integrated into the GPS algorithm to filter potentially false positive hits for the SIMs prediction. To enhance the prediction accuracy, the GPS algorithm adopted an additional procedure with four sequential training steps: k-means clustering, motif length selection (MLS), weight training (WT) and matrix mutation (MaM) (33). In the previous version (33), WT and MaM were implemented in a random mutation algorithm that required a long time for training and frequently resulted in a non-convergent result. Therefore, in the fourth-generation GPS algorithm, the Particle Swarm Optimization (30,31) algorithm was integrated to accelerate the training processes of WT and MaM steps and greatly improve the prediction performance. Before describing the PSO improvement, we re-illustrated the WT process as shown in Equation (1): The w i was the scoring weight and w i represented the numeric changes in the scoring weight after the training process. The WT process was aimed at finding a set of w i with the optimal performance. Similarly, the MaM process can also be described as shown in Equation (2): The S(a,b) was the optimal substitute score for an amino acid pair of a and b, while S(a,b) was the substitute score in the BLOSUM62 matrix. The ΔS(a,b) represented the numeric changes in the substitution score for an amino acid pair of a and b. Thus, the MaM approach sought a set of ΔS(a,b) that maximized the prediction performance. In the first step of PSO, a population array of particles with random current positions and velocities was initialized. For each particle, the leave-one-out (LOO) validation was used to evaluate the prediction performance in the current position. In this case, the randomly generated w i and S(a,b) were directly assigned to the current position. Next, the particles in the population communicated with each other through a neighborhood structure with a ring topology.
Guided by the best position found in a specific neighborhood, a particle moved to a new position that much closer to the globally optimal one. By iteratively performing the above steps until a convergence criterion was met, an optimized solution for WT and MaM was achieved and a more accurate predictive model was obtained in GPS-SUMO. Also, the training efficiency of GPS-SUMO was considerably improved. More details on the fourth-generation GPS algorithm were provided in the Supplementary Methods.
To construct the online service of GPS-SUMO, we developed an easy-to-use web interface using PHP and JavaScript. For convenience, we tested the online service on a variety of internet browsers, including Internet Explorer, Mozilla Firefox and Google Chrome. To support the largescale prediction, stand-alone packages were also developed in Java SE 6 and supported by Windows, Linux and Mac OS.

RESULTS
For the preparation of training data sets, we took known sumoylation sites as the positive dataset, while all other non-sumoylated lysines in the same substrates were taken as the negative dataset. Similarly, the experimentally identified SIMs were regarded as positive data, while all other unidentified SIMs following the [IVL]{3,5} motif in the same proteins were taken as negative data. More details were shown in Supplementary Methods. Totally, the nonredundant training data set of sumoylation included 912 positive sites and 24491 negative sites (Supplementary Table  S1), and ∼60% of all known sites are consensus sites (Supplementary Figure S1A). In SUMOsp 2.0, we only classified known sumoylation sites into two clusters, including the consensus group with sites following the -K-X-E motif and the non-consensus group (26). In this work, we further used the k-means clustering approach to group non-consensus sites into two clusters, and demonstrated that such a procedure considerably improved the prediction performance. More clusters will result in fewer number of non-consensus sites in each cluster, and make the predictive model unstable. The sequence logos were illustrated by Weblogo (34) for each cluster (Supplementary Figure S1). For the two non-consensus groups, one cluster considerably enriches sites following the -K-X-D motif (Supplementary Figure S1B), whereas the sequence profile is ambiguous for the other cluster (Supplementary Figure S1C). For SUMO interaction, the training data set contained 137 positive sites and 1699 negative sites.
To evaluate the prediction performance of GPS-SUMO, we performed the self-consistency validation (Self), LOO validation and n-fold cross-validations (n-fold) of the training data set. For each validation, the sensitivity (23), specificity (Sp), accuracy (Ac), Mathew correlation coefficient (MCC) and precision (Pr) were calculated. The receiver operating characteristic (ROC) curves were drawn. Also, the values of area under the curve (AUC) were calculated. For sumoylation site prediction, the AUC values of Self, LOO, 4-, 6-, 8-and 10-fold validations were 0.8779, 0.8754, 0.8673, 0.8687, 0.8726 and 0.8746, respectively ( Figure 1A). Also in the SIM prediction, the AUC scores were calculated as 0.9737 for Self, 0.9729 for LOO, 0.9510 for 4-fold, 0.9529 for 6-fold, 0.9625 for 8-fold and 0.9633 for 10-fold validations ( Figure 1B). The similar results of different validations demonstrated that GPS-SUMO is a stable and robust prediction tool. Based on the above evaluation, the three thresholds of high, medium and low stringency were chosen for GPS-SUMO. In order to balance the prediction performance, the medium stringency was selected as the default threshold.
In order to demonstrate the superiority of GPS-SUMO, we compared the prediction performance of GPS-SUMO with SUMOsp 2.0 and other existing tools or methods. Because SUMOsp 2.0 was demonstrated to be better than SUMOplot and SUMOpre (26), and SUMOhydro (29) did not support the batch prediction for multiple sequences, here we only compared GPS-SUMO with a recently released predictor of seeSUMO (27). To avoid any bias, the same training data set used in GPS-SUMO was adopted in SUMOsp 2.0 and seeSUMO. For comparison, the LOO validation was carried out, and the ROC curves were plotted in Figure 1C. In our results, the AUC values were calculated as 0.8754 for GPS-SUMO, 0.8467 for SUMOsp, 0.8297 for the seeSUMO with the Support Vector Machines (SVMs) algorithm and 0.8007 for the seeSUMO with the Random Forest (RF) algorithm, respectively ( Figure 1C). Thus, the GPS-SUMO exhibited superior than other tools, and the PSO algorithm did increase the prediction accuracy. For SUMO interaction, a comparison was carried out between the GPS-SUMO and other experimentally verified motifs (Table 1). We fixed the Sp value of the GPS-SUMO to com-  pare the Sn and Ac scores with these motifs. Obviously, the GPS-SUMO generate much better performance than the motifs (Table 1).
To further evaluate the accuracy of GPS-SUMO, an additional testing set was used. This test set contained 71 positive sumoylation sites and 1376 negative sites from the most recently collected data set (Supplementary Table S3). Due to the data limitation, an additional data set for SUMO interaction prediction was not available. In Figure 1D, the ROC curve of the LOO validation was drawn and the AUC was calculated as 0.8629. Thus, our results suggested that the GPS-SUMO is still robust and accurate for the prediction of new data.

USAGE
The online service of GPS-SUMO can predict both sumoylation sites and SIMs in a convenient manner (Figure 2A). In the console panel, a drop-down list was provided for selecting different prediction type. Also, an example button was provided. Two threshold panels for sumoylation and SUMO interaction prediction are located in the bottom left corner. Different prediction thresholds can be easily selected from the corresponding drop-down lists.

Input description
In the web server page of GPS-SUMO, users can input one or multiple protein sequences with FASTA format in the text-box or upload a FASTA format file via the file selection dialog box. In order to guarantee a safe run of our web server, a maximal file size of 2M is allowed to be uploaded

Output description
All of the prediction results are presented in a tabular form containing the information of FASTA title, modified position, modified peptide, predicted score, prediction cutoff and modified type ( Figure 2B). In the position column, the precise sumoylation sites as well as the SIMs are shown. Also, the predicted peptide for sumoylation or SUMO interaction is displayed in the peptide column with the sumoylation site or SIM shown in red. The cluster score and its corresponding cutoff are presented in the score and cutoff column, respectively. In the last column, the category of the SUMO regulation is indicated.

DISCUSSION
GPS-SUMO is a novel web server that can be used to predict both covalent sumoylation sites and non-covalent SIMs. With a new generation GPS algorithm integrated with PSO method, the sumoylation site prediction performance was greatly improved. Also, by integrating a core hydrophobic motif of [IVL]{3,5}, GPS-SUMO achieves a precise prediction on SIMs. Taken together, we propose that GPS-SUMO will prove to be a highly useful web server in SUMO modification research. Moreover, the capability for predicting both sumoylation sites and SIMs in GPS-SUMO makes it potentially valuable for the investigation of the relationship between sumoylation and SUMO interaction.
During the training process, we classified all known sumoylation sites into two groups including the consensus group and non-consensus group based on the canonical motif of -K-X-E. Although our algorithm generated satisfying performance on the consensus group, the performance for non-consensus group still remained to be improved. Therefore, in the subsequent release, a much more accurate GPS model will be applied in the prediction of the non-consensus sumoylation sites. As previously mentioned, the SIM has a distinct hydrophobic core. In this regard, a scoring strategy reflecting amino acid hydrophobicity will be further integrated to improve the accuracy of SIM prediction in the near future.
Previous studies revealed that serine and threonine residues are abundant in proximity to the hydrophobic core of a subset of SIMs, and raised a concern that whether phosphorylation plays an important role in the regulation of SUMO-interacting activity (15,35). Indeed, the E3 ligase PIASx␣ is phosphorylated in vivo at serines adjacent to the hydrophobic core of the SIM, and the PIASx␣ phosphorylation modulates its interacting to SUMO1 but not to SUMO2 (15). By searching several public databases of the phosphorylation, including Phospho.ELM (36), Phospho-SitePlus (37) and PhosphoGRID (38), we totally obtained 1016 non-redundant phosphorylation sites in 80 known SUMO-interacting proteins. Then we looked through the flanking regions such as (5,5), (10,10) and (15,15) with at least one phosphorylation site around known SIMs and negative peptides following the [IVL]{3,5} motif (Table 2). Using the Chi-Squared Test (Right-Tail) (39), we observed that phosphorylation sites are significantly over-represented in adjacent regions of known SIMs against negative peptides ( Table 2, P-value << 0.01). Totally, there were 20 (14.6%) and 27 (19.7%) known SIMs with at least one phosphorylation sites in the adjacent regions of (5, 5) and (10,10), respectively (Table 2). When the flanking region was extended to (15,15), the number of SIMs with flanking phosphorylation sites was only moderately increased, while the level of significance was decreased. In this regard, although the phosphorylation information was not integrated into GPS-SUMO, at least a known phosphorylation site in the (10, 10) flanking region of the SIM can be a good indicator for further identifying phospho-regulated SIMs.