MoRFchibi SYSTEM: software tools for the identification of MoRFs in protein sequences

Molecular recognition features, MoRFs, are short segments within longer disordered protein regions that bind to globular protein domains in a process known as disorder-to-order transition. MoRFs have been found to play a significant role in signaling and regulatory processes in cells. High-confidence computational identification of MoRFs remains an important challenge. In this work, we introduce MoRFchibi SYSTEM, which contains three MoRF predictors: MoRFCHiBi, a basic predictor best suited as a component in other applications; MoRFCHiBi_Light, ideal for high-throughput predictions; and MoRFCHiBi_Web, slower than the other two but best for high-accuracy predictions. Results show that MoRFchibi SYSTEM provides more than double the precision of other predictors. MoRFchibi SYSTEM is available in three different forms: as an HTML web server, a RESTful web server and downloadable software at: http://www.chibi.ubc.ca/faculty/joerg-gsponer/gsponer-lab/software/morf_chibi/


INTRODUCTION
Protein-protein interactions (PPIs) play essential roles in most biological processes in cells. Work in the last two decades has revealed that intrinsically disordered protein regions (IDRs) mediate many interactions, as their structural flexibility enables them to fit the binding surfaces of their target domains (1). Currently, IDR binding sites are classified under two overlapping categories: short linear motifs (SLiMs) (2) and molecular recognition features or elements (MoRFs) (3). SLiMs are defined as conserved, short (3-10 amino acids) linear motifs that can mediate PPIs and other types of interactions (2). Importantly, SLiMs are not only found in IDRs; about 20% of known SLiMs are located in globular protein domains (2). MoRFs, on the other hand, are strictly located within IDRs. Additionally, MoRFs undergo disorder-to-order transitions upon binding to partners (3-7). Based on the structure they adopt upon binding, MoRFs are sub-categorized into three basic groups: α-MoRFs (form α-helices upon binding), β-MoRFs (form β-strands) and ι-MoRFs (form irregular structures) (8). While most MoRFs are shorter than 25 residues, some MoRFs are 50 or more residues long. MoRFs are found in proteins that are involved in diverse cellular processes in all three domains of life (8).
High-accuracy computational identification of MoRFs remains a significant challenge in computational biology. A number of MoRF identification tools are currently available, including ANCHOR (9), MoRFpred (10), fMoRFpred (8), MFSPSSMpred (11), DISOPRED3 (12), MoRF CHiBi (13) and MoRF CHiBi Web (14). ANCHOR predicts MoRFs by estimating interaction energies between residues. MoRFpred and fMoRFpred utilize SVM models (and multiple sequence alignment for MoRFpred) in their predictions. MFSPSSMpred and DISOPRED3 each predict MoRFs using an SVM model with an RBF kernel. MoRF CHiBi utilizes two SVM models, with sigmoid and RBF kernels, to predict MoRFs relying on local physicochemical sequence properties. MoRF CHiBi Web predictions are generated by hierarchically combining the scores of MoRF CHiBi with those of IDR predictions and conservation assessments using Bayes rule. While the prediction precisions of the first five general MoRF predictors are about equal, MoRF CHiBi Web provides more than twice that precision. Other tools only target categories of MoRFs, including α-MoRF-Pred-I (15) and α-MoRF-Pred-II (16), which identify α-MoRFs, and retro-MoRF (17), which targets MoRFs with high sequence similarity to already known MoRFs or their reversed sequences. Furthermore, the recently developed DisoRDPbind method has an extended target space that covers intrinsically disordered regions involved in interactions with any type of partner, including protein, RNA or DNA (18).
In this work, we introduce MoRFchibi SYSTEM, a series of MoRF predictors that serve different purposes and users. MoRFchibi SYSTEM provides these predictors in three forms: as an HTML web server, a RESTful web server and downloadable software.
MoRF CHiBi relies on two SVM modules to predict MoRFs based solely on local physicochemical sequence properties. MoRF CHiBi is the least accurate choice in MoRFchibi SYSTEM but the fastest: it processes more than 11 000 residues per minute (please see the 'Benchmarking' section and (13)).
MoRF CHiBi Light utilizes Bayes rule to combine MoRF CHiBi scores with disorder scores generated by ESpritz (19). MoRF CHiBi Light is significantly more accurate than MoRF CHiBi, and among MoRFchibi SYSTEM predictors it is the most accurate in targeting longer MoRF sequences (MoRFs with more than 30 residues, see the 'Benchmarking' section). MoRF CHiBi Light processes more than 10 500 residues per minute.
MoRF CHiBi Web predictions are the most accurate in the MoRFchibi SYSTEM (please see the 'Benchmarking' section). They are generated by supplementing MoRF CHiBi with disorder and conservation information. As functional elements, MoRFs are more conserved than other parts of IDRs (20,21). Therefore, an initial conservation score (ICS) is assembled by combining three values from the PSI-BLAST (22) position-specific scoring matrices (PSSMs) using Bayes rule. Then, a MoRF conservation score (MCS) is obtained by processing ICS with intrinsic disorder predictions (IDP) (14). MoRF DC is then computed by combining the MCS and intrinsic disorder predictions using Bayes rule. Finally, Bayes rule is used again to generate MoRF CHiBi Web from MoRF DC and MoRF CHiBi. MoRF CHiBi Web processes ∼500 residues per minute.
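The hierarchical use of Bayes rule described above can be illustrated with a minimal sketch. It assumes that each component score is a per-residue probability obtained under a uniform prior of 0.5 and that the components are conditionally independent, which yields the familiar product form of Bayes rule. The exact formulation used by MoRFchibi SYSTEM is given in (13,14) and may differ. The function names and score values below are hypothetical, and the ICS-to-MCS step is not modelled because its details are not given here.

```python
def bayes_combine(p, q):
    """Combine two probabilities for the same event.

    Assumes conditional independence and a uniform 0.5 prior
    (assumptions of this sketch, not taken from the paper).
    """
    num = p * q
    den = num + (1.0 - p) * (1.0 - q)
    return 0.5 if den == 0.0 else num / den


def combine_tracks(a, b):
    """Apply bayes_combine residue by residue to two score tracks."""
    if len(a) != len(b):
        raise ValueError("score tracks must have the same length")
    return [bayes_combine(p, q) for p, q in zip(a, b)]


# Hypothetical per-residue scores for a five-residue stretch.
mc = [0.40, 0.55, 0.70, 0.65, 0.45]   # MoRF CHiBi propensity
idp = [0.30, 0.60, 0.80, 0.75, 0.50]  # intrinsic disorder prediction
mcs = [0.50, 0.55, 0.65, 0.60, 0.50]  # MoRF conservation score

mcl = combine_tracks(mc, idp)   # analogue of MoRF CHiBi Light
mdc = combine_tracks(mcs, idp)  # analogue of MoRF DC
mcw = combine_tracks(mdc, mc)   # analogue of MoRF CHiBi Web

for i, (l, w) in enumerate(zip(mcl, mcw), start=1):
    print(f"residue {i}: MCL={l:.3f} MCW={w:.3f}")
```

Under these assumptions a score of 0.5 is neutral, so a component that carries no information leaves the other score unchanged, while two agreeing scores reinforce each other.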

Datasets
One major challenge in the development of MoRF predictors is the scarcity of experimentally verified MoRFs that can be used for training and testing. To overcome this problem, the authors of MoRFpred (10) implemented an approach similar to that introduced by Mohan et al. (3), who searched the Protein Data Bank (23) for short peptides (potential MoRFs) that are in complex with longer protein partners (presumably globular domains). Disfani et al. (10) collected 885 sequences, each annotated with a single 6-25 residue long MoRF, and divided these sequences into a training set, TRAINING HT, and a test set, TEST HT (HT for high-throughput collection), such that sequences in TRAINING HT share <30% identity with those in TEST HT. TRAINING HT contains 421 sequences with 245 984 residues, 5396 of them in MoRFs, and TEST HT contains 464 sequences with 296 362 residues, 5779 of them in MoRFs.
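The residue counts above imply a strong class imbalance, which matters when interpreting precision later on. The short calculation below uses only the numbers quoted in the text; the dictionary layout and function name are illustrative.

```python
# Residue counts for the two high-throughput sets, as quoted in the text.
datasets = {
    "TRAINING HT": {"sequences": 421, "residues": 245_984, "morf_residues": 5_396},
    "TEST HT": {"sequences": 464, "residues": 296_362, "morf_residues": 5_779},
}


def morf_fraction(entry):
    """Fraction of residues annotated as MoRF residues."""
    return entry["morf_residues"] / entry["residues"]


total_sequences = sum(d["sequences"] for d in datasets.values())

for name, entry in datasets.items():
    print(f"{name}: {100 * morf_fraction(entry):.2f}% of residues are in MoRFs")
print(f"total sequences: {total_sequences}")
```

In both sets roughly 2% of residues are annotated as MoRF residues, so even a small false positive rate produces many more false than true positive residues.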
Although the large number of sequences in TEST HT provides more robustness in the evaluation, this set is not ideal for three reasons: most of its MoRFs are not experimentally validated to be disordered in isolation, it includes many homologous (redundant) sequences, and each sequence is annotated with only a single MoRF (under-annotated). Therefore, we assembled a second test set, TEST EXP53. First, we joined four test sets that have previously been collected by the authors of ANCHOR (9), MoRFpred (10) and DISOPRED3 (12). MoRFs in these sets have been experimentally validated for their disordered character in isolation. Then we filtered out sequences with more than 30% identity to TRAINING HT, as well as redundant sequences at a 30% identity cut-off. TEST EXP53 has 53 sequences with a total of 2432 MoRF residues, which we further divided into 729 from short MoRF sections (up to 30 residues) and 1703 from long MoRF sections (more than 30 residues). Importantly, in contrast to TEST HT, where each sequence is annotated with a single MoRF even if more may be present, sequences in TEST EXP53 are annotated with all known MoRFs.
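The two filtering steps can be sketched as a greedy procedure. The text does not state which tool or identity definition was used, so the sketch takes the identity measure as a parameter. The `toy_identity` function below is a placeholder that compares positions without alignment and is not suitable for real data, and all sequence names and sequences are hypothetical.

```python
def toy_identity(a, b):
    """Placeholder identity: matching positions over the longer length.

    Real pipelines use alignment-based identity; this stand-in only
    keeps the example self-contained.
    """
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def filter_test_set(candidates, training, identity=toy_identity, cutoff=0.30):
    """Keep candidates that are neither similar to training data nor redundant."""
    kept = {}
    for name, seq in candidates.items():
        # Step 1: drop sequences too similar to any training sequence.
        if any(identity(seq, t) > cutoff for t in training.values()):
            continue
        # Step 2: drop sequences redundant with an already kept sequence.
        if any(identity(seq, k) > cutoff for k in kept.values()):
            continue
        kept[name] = seq
    return kept


training = {"t1": "AAAAAAAAAA"}
candidates = {
    "c1": "AAAAACCCCC",  # 50% identical to t1, removed
    "c2": "DDDDDEEEEE",  # kept
    "c3": "DDDDDEEEKK",  # 80% identical to c2, removed as redundant
    "c4": "GGGGGHHHHH",  # kept
}
kept = filter_test_set(candidates, training)
print(sorted(kept))
```

A greedy filter depends on the order in which candidates are visited, so a different order can retain a different representative of each redundant group.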
We also used a third test set, TEST EXP9, to compare the prediction quality of the MoRFchibi SYSTEM predictors with that of MFSPSSMpred and DISOPRED3. These two SVM-RBF predictors are trained on an extended set of MoRFs, including most of those found in our TEST HT and TEST EXP53 sets. The nine sequences of TEST EXP9, collected by the authors of DISOPRED3, are not homologous to any sequence used in the training of DISOPRED3, MFSPSSMpred or the predictors of MoRFchibi SYSTEM. MoRFs in TEST EXP9 have been experimentally validated to be disordered in the unbound state. TEST EXP9 includes 12 MoRFs with 163 MoRF residues.

BENCHMARKING
In the following, we will first summarize the comparison between the predictions made with MoRFchibi SYSTEM and other available servers. Details of this comparison can be found in Malhis et al. (14). Then, we will provide recommendations for the user of MoRFchibi SYSTEM based on results from this comparison.
Using TEST HT and TEST EXP53, we evaluated MoRFchibi SYSTEM predictions and compared them with those made by the most frequently used MoRF predictors in the field: MoRFpred, fMoRFpred and ANCHOR (Tables 1-3). Then, we used the much smaller TEST EXP9 set to compare performances with those of MFSPSSMpred and DISOPRED3 (Table 4). We compared the area under the curve (AUC, in Table 1), the prediction specificity at given sensitivities (Tables 2 and 4) and the precision as a function of different sensitivities (Table 3).
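The three evaluation measures named above can be computed from per-residue scores and binary MoRF labels as in the following minimal sketch. It is written in plain Python for clarity rather than speed, the function names and toy data are illustrative, and it is not the evaluation code used to produce the tables.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_at_sensitivity(scores, labels, sensitivity):
    """Lower the threshold until the requested sensitivity is reached.

    Returns (tp, fp, tn, fn) at the highest threshold that recovers at
    least the requested fraction of positives.
    """
    n_pos = sum(labels)
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= thr)
        if tp / n_pos >= sensitivity:
            fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= thr)
            return tp, fp, len(labels) - n_pos - fp, n_pos - tp
    raise ValueError("sensitivity must be at most 1.0")


# Toy data: eight residues, four of them MoRF residues (label 1).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

tp, fp, tn, fn = confusion_at_sensitivity(scores, labels, 0.75)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
print(roc_auc(scores, labels), specificity, precision)
```

Fixing the sensitivity before comparing specificity or precision means that all predictors are compared at the same fraction of recovered MoRF residues, regardless of how their raw scores are scaled.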
These comparisons reveal that all three MoRFchibi SYSTEM predictors perform better than other methods regardless of which evaluation metric is used. Importantly, MoRF CHiBi Web generated less than half the false positive rate for the same true positive rate at any practical threshold value (see (14)). The comparison (Tables 1-3) also reveals that MoRFchibi SYSTEM predictors, MoRFpred, fMoRFpred and ANCHOR identify short MoRFs better than long ones. This may be expected, as all these predictors were trained on datasets that contain only short MoRFs. The results on TEST EXP53 further reveal a limited contribution of conservation information to the identification of long MoRFs. MoRF CHiBi Web, which uses conservation information, does not perform as well in the identification of long MoRFs as MoRF CHiBi Light, which may suggest that the percentage of conserved residues in long MoRFs is lower than that in short MoRFs.
For MoRF predictors that are based on machine learning, over-scoring MoRFs that are very similar to those used in training can lead to novel MoRFs being masked by the over-scored training MoRFs. With only one of the four sub-components of MoRF CHiBi Web directly trained on its training data (13), MoRF CHiBi Web provides high scoring consistency compared to single-module predictors. To measure this consistency, we compared the MoRF CHiBi Web performance on its training set, TRAINING HT, to that on TEST HT. Results show only a small difference in MoRF CHiBi Web performance between the two sets (an AUC of 0.825 for TRAINING HT versus 0.806 for TEST HT).
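The statement that the false positive rate is less than halved at a fixed true positive rate can be related to precision through the identity precision = TPR·π / (TPR·π + FPR·(1 − π)), where π is the fraction of MoRF residues. The sketch below evaluates this identity with π taken from the TEST HT residue counts; the TPR and FPR values are hypothetical and are not results from the tables.

```python
def precision_from_rates(tpr, fpr, prevalence):
    """Precision implied by TPR, FPR and the fraction of positive residues."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


prevalence = 5779 / 296362  # MoRF residue fraction in TEST HT (from the text)
tpr = 0.4                   # hypothetical operating point

baseline = precision_from_rates(tpr, 0.10, prevalence)  # hypothetical FPR
halved = precision_from_rates(tpr, 0.05, prevalence)    # half that FPR
ratio = halved / baseline
print(f"precision {baseline:.3f} -> {halved:.3f} (ratio {ratio:.2f})")
```

Because MoRF residues are rare, false positives dominate the denominator, so halving the false positive rate at a fixed true positive rate raises precision by a factor that approaches two.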
Based on these results and the processing speeds (see above) of the different MoRFchibi SYSTEM predictors, the following recommendations for users can be made: MoRF CHiBi Web is the most accurate in MoRFchibi SYSTEM and outperforms previously developed predictors significantly (significance assessed with t-test; all P-values are available on the server's webpage). However, it is rather slow because the calculation of conservation scores requires a time-consuming multiple sequence alignment step. Thus, it is most appropriate for low-throughput, high-accuracy MoRF predictions. It is particularly strong in the search for short (<30 residues) MoRFs.
MoRF CHiBi Light is not far behind MoRF CHiBi Web in terms of its prediction performance. However, it is much faster and, therefore, most appropriate for high-throughput MoRF predictions. It shows a small advantage over MoRF CHiBi Web in the search for long (>30 residues) MoRFs (Tables 1-3).
MoRF CHiBi is the least accurate among the three MoRFchibi SYSTEM predictors but still superior to the other available predictors. As its predictions are solely based on information learned from a training set of MoRFs, it is least likely to interfere with other parts when integrated into multi-unit bioinformatics tools. It is also the fastest in MoRFchibi SYSTEM.
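The trade-off between speed and accuracy can be made concrete with a rough run-time estimate based on the processing rates quoted in the Introduction. The rates for MoRF CHiBi and MoRF CHiBi Light are stated as lower bounds, so the corresponding times are upper bounds. Actual run times depend on hardware and sequence composition, and the function name is illustrative.

```python
# Processing rates quoted in the text, in residues per minute.
RESIDUES_PER_MINUTE = {
    "MoRF CHiBi": 11_000,        # "more than 11 000"
    "MoRF CHiBi Light": 10_500,  # "more than 10 500"
    "MoRF CHiBi Web": 500,       # "about 500"
}


def estimated_minutes(n_residues, predictor):
    """Rough run time in minutes for a given number of residues."""
    return n_residues / RESIDUES_PER_MINUTE[predictor]


n_residues = 296_362  # size of TEST HT, used here only as an example workload
for predictor in RESIDUES_PER_MINUTE:
    minutes = estimated_minutes(n_residues, predictor)
    print(f"{predictor}: about {minutes:.0f} min ({minutes / 60:.1f} h)")
```

For a workload of this size the two faster predictors finish within about half an hour, whereas MoRF CHiBi Web needs on the order of ten hours, which is why it is recommended for low-throughput use.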

Input
The input for MoRFchibi SYSTEM is the primary amino acid sequence in FASTA format. To balance the priorities of different users, requests to the HTML and the RESTful web servers are limited to a single sequence each. However, there is no limit on the number of sequences that can be processed in each run of the downloadable software.
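Because each server request accepts a single sequence, it can be useful to check a FASTA file locally before submission. The sketch below is a generic client-side check and is not part of MoRFchibi SYSTEM. The accepted residue alphabet is an assumption (the 20 standard amino acids), since the text does not state which codes the servers accept, and the example header and sequence are hypothetical.

```python
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # assumed alphabet


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            if header is None:
                raise ValueError("sequence data found before the first '>' header")
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def single_query(text):
    """Return the only record in the text, or raise if it is unsuitable."""
    records = parse_fasta(text)
    if len(records) != 1:
        raise ValueError(f"expected exactly one sequence, found {len(records)}")
    header, seq = records[0]
    unknown = set(seq) - STANDARD_RESIDUES
    if not seq or unknown:
        raise ValueError(f"empty sequence or unexpected characters: {sorted(unknown)}")
    return header, seq


header, seq = single_query(">demo hypothetical protein\nMKTAYIAK\nQRQISFVK\n")
print(header, len(seq))
```

Files with several sequences can be split with `parse_fasta` and submitted one record at a time, or processed in a single run of the downloadable software.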

Output
The output is presented in two different forms: a downloadable text table and an interactive graphic chart. Six propensity scores are generated for each residue in the query sequence (see (14)).
Each of these scores is normalized to approximately fit a Gaussian probability density function specified by the normal distribution N(0.5, 0.01) and is limited to the range [0, 1], as described in the article (14). In addition, the downloadable release includes two high-throughput options: one generates only the MC scores, and the other generates the MC and the MCL scores.
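One simple way to obtain scores with this target distribution is a linear rescaling followed by clipping, sketched below. The sketch assumes that the second parameter of N(0.5, 0.01) is a variance, giving a standard deviation of 0.1, and that the mean and standard deviation of the raw scores are estimated beforehand on a reference set. Both are assumptions of this example. The actual normalization is described in (14) and may differ.

```python
def normalize_score(raw, ref_mean, ref_sd, target_mean=0.5, target_sd=0.1):
    """Rescale a raw score linearly and clip the result to [0, 1].

    ref_mean and ref_sd describe the raw score distribution on a
    reference set (hypothetical inputs in this sketch).
    """
    if ref_sd <= 0:
        raise ValueError("ref_sd must be positive")
    z = (raw - ref_mean) / ref_sd
    return min(1.0, max(0.0, target_mean + target_sd * z))


# Hypothetical raw scores with reference mean 2.0 and standard deviation 1.0.
raw_scores = [2.0, 3.0, -1.0, 9.0, -20.0]
normalized = [normalize_score(r, ref_mean=2.0, ref_sd=1.0) for r in raw_scores]
print(normalized)
```

With this scaling, a residue must lie five reference standard deviations from the mean before clipping takes effect, so the bounds at 0 and 1 affect only extreme values.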

Usage example
The CD3E human protein P07766 has a disordered region at its C-terminus (residues 153-207) (24). This IDR includes a MoRF that covers residues 180-202 (PDB: 1A81 B and PDB: 2ROL B). MoRF propensity scores generated by MoRF CHiBi (MC), which are based on the local physicochemical properties of the sequence, correctly identify this MoRF region (Figure 2, green curve). However, MoRF CHiBi scores for residues 80-117 and 142-164 are similarly high. Combining disorder predictions and conservation information in the MoRF DC (MDC, purple curve) provides high prediction scores for the region 170-200, which is longer than the actual MoRF. The integration of the MoRF CHiBi and MoRF DC scores in MoRF CHiBi Web (MCW, red curve) provides the best result with a clearly distinct peak in the score chart between residues 180-202, which is where the MoRF is located.
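Peaks such as the one described above can be extracted from a per-residue score track by collecting runs of consecutive residues whose score reaches a chosen threshold. The sketch below is a generic post-processing step. The threshold, the minimum length and the toy scores are hypothetical and are not recommendations from the paper or values for P07766.

```python
def regions_above(scores, threshold, min_length=1):
    """Return 1-based inclusive (start, end) runs with score >= threshold."""
    regions, start = [], None
    for i, score in enumerate(scores, start=1):
        if score >= threshold:
            if start is None:
                start = i  # a new run begins at residue i
        elif start is not None:
            if i - start >= min_length:  # run covers residues start..i-1
                regions.append((start, i - 1))
            start = None
    if start is not None and len(scores) - start + 1 >= min_length:
        regions.append((start, len(scores)))  # run extends to the last residue
    return regions


# Toy track: two runs of high scores separated by low-scoring residues.
toy_scores = [0.2] * 5 + [0.8] * 4 + [0.3] * 3 + [0.9] * 2
print(regions_above(toy_scores, threshold=0.7))
print(regions_above(toy_scores, threshold=0.7, min_length=3))
```

A minimum length can be used to suppress isolated high-scoring residues, at the cost of missing very short candidate regions.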

The CHiBi server overview
Once a query sequence is submitted to either the HTML or the RESTful web server, a job object is created and a URL pointing to its future results is returned to the client. To prevent the server from being dominated by a large number of query sequences from a single 'client' (defined below), each server utilizes a two-tier queue structure (Figure 3). Jobs are inserted into the first-in first-out server queue while the job at the top of the queue is being processed by the MoRFchibi SYSTEM software. Each client can place up to two jobs in the server queue; if more sequences are submitted by a single client, the extra jobs are placed temporarily in that client's private queue. Once a client's job at the top of the server queue is completed, it is released from the queue and the job at the top of that client's queue (if one exists) is moved by the job manager to the tail of the server queue. Client queues are located on the server; thus, once the links to the future result pages are secured, users can safely disconnect from the server.
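The two-tier queue behaviour described above can be modelled in a few lines. The sketch below follows the rules stated in the text (a first-in first-out server queue, at most two jobs per client in that queue, and overflow held in per-client queues). It is an illustrative model, not the server's actual implementation, and the class, method and job names are hypothetical.

```python
from collections import deque


class TwoTierQueue:
    """Toy model of a shared FIFO server queue with per-client overflow queues."""

    def __init__(self, max_per_client=2):
        self.max_per_client = max_per_client
        self.server_queue = deque()  # (client, job) pairs in FIFO order
        self.client_queues = {}      # client -> deque of waiting jobs
        self.completed = []

    def _jobs_in_server_queue(self, client):
        return sum(1 for c, _ in self.server_queue if c == client)

    def submit(self, client, job):
        """Queue a job on the server, or in the client's private queue if full."""
        if self._jobs_in_server_queue(client) < self.max_per_client:
            self.server_queue.append((client, job))
        else:
            self.client_queues.setdefault(client, deque()).append(job)

    def complete_next(self):
        """Finish the job at the head and promote that client's next job."""
        client, job = self.server_queue.popleft()
        self.completed.append(job)
        waiting = self.client_queues.get(client)
        if waiting:
            self.server_queue.append((client, waiting.popleft()))
        return job


queue = TwoTierQueue()
for job in ("red-1", "red-2", "red-3"):
    queue.submit("red", job)
queue.submit("blue", "blue-1")

order = [queue.complete_next() for _ in range(4)]
print(order)
```

In this example the third 'red' job waits in the private queue and re-enters the server queue behind the 'blue' job, so a client that submits many sequences cannot delay other clients by more than two jobs at a time.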
Two main differences exist between the HTML and the RESTful servers: first, clients in the HTML server are browser sessions, whereas they are IP addresses in the RESTful web server. Second, in the HTML server, client queues

Figure 3. An example of the MoRFchibi SYSTEM two-tier queue structure. Eight jobs are in the server queue, two from each client. The 'red' client's job at the top of the server queue [P04777] is being processed by the MoRFchibi SYSTEM software. Jobs in the 'completed Jobs' section (top right) can be accessed through their associated URL links. Once [P04777] is completed, it will be released from the server queue and the job at the top of the 'red' client queue [Q98XH7] will be moved by the Job Manager to the tail of the server queue.