Sequence analysis
catRAPID signature: identification of ribonucleoproteins and RNA-binding regions

Motivation: Recent technological advances revealed that an unexpectedly large number of proteins interact with transcripts even if the RNA-binding domains are not annotated. We introduce catRAPID signature to identify ribonucleoproteins based on physico-chemical features instead of sequence similarity searches. The algorithm, trained on human proteins and tested on model organisms, calculates the overall RNA-binding propensity and then predicts the RNA-binding regions. catRAPID signature outperforms other algorithms in the identification of RNA-binding proteins and the detection of non-classical RNA-binding regions. Results are visualized on a webpage and can be downloaded or forwarded to catRAPID omics for predictions of RNA targets.
Availability and implementation: catRAPID signature can be accessed at http://s.tartaglialab.com/new_submission/signature.
Contact: gian.tartaglia@crg.es or gian@tartaglialab.com
Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
RNA-binding proteins (RBPs) use RNA-binding domains (RDs) to recognize target RNAs and to regulate co-/post-transcriptional processes. Examples of classical RDs include the RNA-recognition motif (RRM), the double-stranded RNA-binding domain (dsRRM), the K-homology (KH) domain, the RGG box and the Pumilio/FBF (PUM) domain (Lunde et al., 2007). In addition to classical RDs, recent experimental studies on HeLa (Castello et al., 2012), HEK293 (Baltz et al., 2012) and mESC (Kwon et al., 2013) cells indicate that a number of RNA-interacting proteins contain non-classical RDs (ncRDs) for which annotation is not yet available. Discovery of new RDs is a challenging task: domain-detection tools, such as HMMER (Finn et al., 2011) and BLAST (Camacho et al., 2009), rely on sequence similarity searches to identify annotated RDs and fail to recognize newly discovered RBPs. Similarly, other methods such as RNApred (Kumar et al., 2011) predict RNA-binding ability using features of annotated RDs that might be different in ncRDs. Alternatives to identify RNA-binding regions include BindN+ (Wang et al., 2010), PPRInt (Kumar et al., 2008) and RNAbindR+ (Walia et al., 2014), but these algorithms have been trained to identify single amino acids and not contiguous regions. catRAPID signature overcomes these limitations by (i) predicting the propensity of a protein to interact with RNA and (ii) identifying RNA-binding regions through physico-chemical properties instead of sequence patterns. The algorithm extends both the catRAPID approach (Bellucci et al., 2011), which predicts protein-RNA interactions, and the cleverSuite algorithm (Klus et al., 2014), which classifies protein groups using physico-chemical features.

Algorithm and performances
To build catRAPID signature we exploited a number of physico-chemical properties reported in our previous publication (Klus et al., 2014):
• We used each physico-chemical property [e.g. structural disorder (Castello et al., 2012)] to build a signature, or profile, containing position-specific information arranged in sequential order from the N- to the C-terminus;
• We computed the Pearson correlation coefficient between signatures of annotated human RDs and same-length regions taken from RNA-binding proteins as well as negative controls (Supplementary Table S1 and online Documentation);
• We identified a number of discriminating physico-chemical properties, their associated RDs and correlation cutoffs (Supplementary Table S2 and online Documentation).
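The profile-and-correlation steps listed above can be illustrated with a minimal Python sketch. It is not the catRAPID signature implementation: the Kyte-Doolittle hydropathy scale stands in for the physico-chemical properties actually used, and the function names (`profile`, `pearson`, `scan`) are hypothetical.

```python
from math import sqrt

# Kyte-Doolittle hydropathy scale, used here only as an example of a
# per-residue physico-chemical property (an assumption for illustration).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def profile(sequence, scale=HYDROPATHY):
    """Turn a sequence into a position-specific profile (N- to C-terminus)."""
    return [scale[residue] for residue in sequence]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no correlation signal
    return sxy / sqrt(sxx * syy)


def scan(protein_profile, domain_profile):
    """Correlate a domain signature with every same-length protein window."""
    width = len(domain_profile)
    return [
        pearson(protein_profile[start:start + width], domain_profile)
        for start in range(len(protein_profile) - width + 1)
    ]


# Toy example: the protein contains the "domain" at position 3 (0-based).
scores = scan(profile("AAVGKRLFSTW"), profile("GKRLF"))
best_window = scores.index(max(scores))
```

In this toy case the best-scoring window is the one that matches the domain signature exactly. Real signatures would be built from annotated RDs and compared against both RBPs and negative controls.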
For each protein, we calculated the fraction of residues with correlation coefficients above the cutoffs that are associated with physico-chemical properties and RDs (Table S2; online Documentation), which we then used to train catRAPID signature. Using a Support Vector Machine with RBF-kernel (online Documentation), we built a method for the (i) identification of ribonucleoproteins and (ii) prediction of RNA-binding regions:
i. catRAPID signature shows an AUC = 0.76 for discrimination of 950 RBPs from 950 negative cases (10-fold cross-validation; Supplementary Fig. S1, Table S1). On an independent test set (Table S4). In addition, we observed high performance on a protein dataset whose RNA-binding sites have been determined through X-ray and NMR (Supplementary Fig. S3 and online Documentation).
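As a rough illustration of the feature and of the performance measure mentioned above, the following Python sketch computes (a) the fraction of residues covered by windows whose correlation exceeds a cutoff and (b) a rank-based AUC. Both functions are hypothetical helpers, not part of the server. In particular, counting a residue as soon as one high-scoring window covers it is an assumption, and the SVM itself is not shown.

```python
def covered_fraction(window_scores, window_len, cutoff):
    """Fraction of residues lying in at least one window scoring above cutoff.

    window_scores[i] is the correlation of the window starting at residue i.
    Assumption: a residue counts once any covering window exceeds the cutoff.
    """
    if not window_scores:
        return 0.0
    n_residues = len(window_scores) + window_len - 1
    covered = [False] * n_residues
    for start, score in enumerate(window_scores):
        if score > cutoff:
            for position in range(start, start + window_len):
                covered[position] = True
    return sum(covered) / n_residues


def auc(positive_scores, negative_scores):
    """Area under the ROC curve as the probability that a random positive
    outscores a random negative (ties count one half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Toy example: one window of length 3 passes the cutoff in a 6-residue protein.
feature = covered_fraction([0.1, 0.9, 0.2, 0.1], window_len=3, cutoff=0.5)
toy_auc = auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.1])
```

One such fraction per property and RD pair would give a feature vector for each protein. A classifier such as an RBF-kernel SVM could then be trained on those vectors and scored with the AUC.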

Server description and example
The input of the server is a FASTA sequence. To illustrate the output with an example, we studied the RNA-binding ability of the Fragile X Mental Retardation Protein (FMRP). catRAPID signature predicts that FMRP binds to RNA (overall interaction score = 0.85; Fig. 1A; Fig. S4) and correctly identifies two peaks corresponding to the KH domains and one peak in the RGG box (Ascano et al., 2012) [Fig. 1A, B and C; 'classical' score = 0.73]. In addition, catRAPID signature indicates that the N-terminus (amino acids 1-215; Fig. 1B) has RNA-binding ability ('putative' score = 0.74), which is in agreement with very recent evidence revealing the presence of a novel KH domain (Myrick et al., 2015). Comparing experimental targets [number of PAR-CLIP binding sites ≥ 1] (Ascano et al., 2012) with transcriptome-wide predictions for the FMRP N-terminus [amino acids 1-215; Fig. 1D] (Agostini et al., 2013), we observed a significant enrichment in predicted interaction propensities (P-value < 10⁻⁹, calculated with the Kolmogorov-Smirnov test on 105 × 10³ transcripts, of which 7 × 10³ are positives), which suggests that the N-terminus contributes to the RNA-binding ability of the full-length FMRP.
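The enrichment test mentioned above compares the distribution of predicted interaction propensities between experimental targets and the remaining transcripts. A minimal sketch of the two-sample Kolmogorov-Smirnov statistic is given below. The function name and the toy numbers are hypothetical. In practice a statistics library (for example `scipy.stats.ks_2samp`) would also provide the P-value.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute
    difference between the two empirical cumulative distributions."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    n_a, n_b = len(a), len(b)
    i = j = 0
    largest_gap = 0.0
    while i < n_a and j < n_b:
        value = min(a[i], b[j])
        # Advance past every observation equal to the current value so that
        # ties are handled in both samples before comparing the ECDFs.
        while i < n_a and a[i] == value:
            i += 1
        while j < n_b and b[j] == value:
            j += 1
        largest_gap = max(largest_gap, abs(i / n_a - j / n_b))
    return largest_gap


# Toy propensities (made up for illustration): targets are shifted upward.
targets = [3, 4, 5, 6]
background = [1, 2, 3, 4]
d = ks_statistic(background, targets)
```

A larger statistic means the two propensity distributions are further apart. The statistic alone does not say which sample is shifted upward, so the direction of the enrichment has to be checked separately.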

Conclusions
As newly discovered RDs are not annotated, traditional domain-detection tools fail to identify them. catRAPID signature addresses this limitation by detecting binding regions through physico-chemical features. Our algorithm will be helpful to investigate components of ribonucleoprotein complexes and to identify RNA-binding regions.