RNAct: Protein–RNA interaction predictions for model organisms with supporting experimental data

Abstract Protein–RNA interactions are implicated in a number of physiological roles as well as diseases, with molecular mechanisms ranging from defects in RNA splicing, localization and translation to the formation of aggregates. Currently, ∼1400 human proteins have experimental evidence of RNA-binding activity. However, only ∼250 of these proteins currently have experimental data on their target RNAs from various sequencing-based methods such as eCLIP. To bridge this gap, we used an established, computationally expensive protein–RNA interaction prediction method, catRAPID, to populate a large database, RNAct. RNAct allows easy lookup of known and predicted interactions and enables global views of the human, mouse and yeast protein–RNA interactomes, expanding them in a genome-wide manner far beyond experimental data (http://rnact.crg.eu).


INTRODUCTION
RNA-binding proteins (RBPs) are key in RNA splicing, processing, export, localization and regulation of translation and are implicated in a number of pathologies in humans. Examples include heterogeneous and life-threatening genetic disorders, such as amyotrophic lateral sclerosis (1), spinocerebellar ataxia and retinitis pigmentosa, among others (2,3). Human proteins encoded by 1393 genes currently have experimental evidence of RNA-binding activity (4–6). These proteins contain one or more RNA-binding regions, either in the form of canonical globular domains or of more recently discovered, intrinsically disordered RNA interaction regions (7,8). Additionally, protein-protein interaction interfaces and enzymatic active sites are sometimes employed for RNA binding (4,9). Protein-RNA interactions form an intricate network, and RNAs play structural roles in many types of phase-separated biological condensates, such as stress granules (10).
However, the number of RBPs for which the identity of their interaction partners is known is much lower. Two hundred fifty Homo sapiens RBPs currently have high-throughput experimental data on the identity of their target RNAs (11,12), obtained mostly by various sequencing-based methods such as eCLIP, iCLIP, HITS-CLIP, PAR-CLIP and RIP-seq. Much smaller datasets are available for Mus musculus (38 RBPs (12)), Drosophila melanogaster (29 RBPs from RIP-seq (13)) and Saccharomyces cerevisiae (69 RBPs from RIP-Chip (14)). A comprehensive collection of CLIP data is available in the recently expanded POSTAR database (12), previously called CLIPdb, which also includes motif-based target predictions for a set of human and mouse RBPs (88 and 82, respectively).
To bridge the gap between the 1393 known RBPs and the 250 for which we have experimental knowledge of interaction partners, we used an established, experimentally validated (15,16) protein-RNA interaction prediction method, catRAPID (17–19), to generate proteome- and transcriptome-wide sets of interaction predictions. Our database now covers the H. sapiens, M. musculus and S. cerevisiae genomes and contains a total of 5.87 billion pairwise interaction predictions. This reflects nearly 120 years of computation time on the Centre for Genomic Regulation's high-performance computing cluster, and for the first time provides predictions for all possible protein-RNA pairs in these species.
RNAct makes available our genome-wide protein-RNA interaction predictions and combines them with powerful and intuitive search functionality, including pairwise search for sets of proteins and RNAs. The display is enriched with useful annotation, including transcript support level (TSL) and APPRIS classification for isoforms and RNA subcellular localization from the RNALocate database. Known RBPs as well as interactions confirmed by large-scale experiments from the ENCODE project are clearly highlighted.

Transcriptomes
Transcriptomes were obtained from GENCODE (for human and mouse) (21) and Ensembl (for yeast) (22). GENCODE 'basic' RNAs are a representative subset prioritizing full-length protein-coding transcripts over partial or non-coding transcripts for a given gene. The GENCODE release used for human is Release 27 (genome assembly GRCh38.p10), and both the 'basic' (98 608 transcripts with successful interaction predictions) and 'non-basic' (100 722 transcripts) subsets were obtained for full coverage of the human GENCODE transcriptome. These sets are kept separate for performance reasons, and the protein view currently does not show non-basic human RNAs (except in the pairwise search). For mouse, GENCODE release M16 (genome assembly GRCm38.p5) was used, retaining only the 'basic' subset (76 532 transcripts, ∼58% of the mouse GENCODE transcriptome) due to resource and computation time constraints. For yeast, all coding and non-coding transcripts from the Ensembl 92 release (April 2018) were included (7029 transcripts with successful interaction predictions).
All FASTA sequence files used are available for download in the RNAct Download section. A small number of these sequences were excluded from RNAct due to limitations of the catRAPID algorithm: short or extreme length (proteins ≤50 aa or >14 507 aa, RNAs ≤50 nt or >28 227 nt), or unsuccessful RNA secondary structure prediction using the ViennaRNA package which catRAPID relies on (23).
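The length-based exclusion criteria can be expressed as a simple filter. The following is a minimal sketch using only the limits quoted above; the constant and function names are hypothetical, and the secondary structure check performed with ViennaRNA is not reproduced here.

```python
# Length limits quoted in the text. Names are illustrative, not taken from RNAct.
PROTEIN_MIN_EXCLUSIVE = 50      # proteins of <=50 aa are excluded
PROTEIN_MAX_INCLUSIVE = 14_507  # proteins of >14 507 aa are excluded
RNA_MIN_EXCLUSIVE = 50          # RNAs of <=50 nt are excluded
RNA_MAX_INCLUSIVE = 28_227      # RNAs of >28 227 nt are excluded


def protein_length_ok(sequence: str) -> bool:
    """Return True if the protein length falls inside the accepted range."""
    return PROTEIN_MIN_EXCLUSIVE < len(sequence) <= PROTEIN_MAX_INCLUSIVE


def rna_length_ok(sequence: str) -> bool:
    """Return True if the RNA length falls inside the accepted range."""
    return RNA_MIN_EXCLUSIVE < len(sequence) <= RNA_MAX_INCLUSIVE
```

A sequence that passes this filter could still be excluded if RNA secondary structure prediction fails for it.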

Interaction predictions (catRAPID maximum fragment score)
To compute the interaction propensity scores, we used the catRAPID approach (17) with the fragmentation procedure (18,19) and normalized for sequence lengths (19). For each protein-RNA pair, the fragments with the maximum interaction propensity score are used to assess overall binding ability (Figure 1A). The catRAPID score shows a receiver operating characteristic (ROC) area under the curve (AUC) of 0.78 with high-confidence eCLIP data (212 256 interactions with human GENCODE 'basic' RNAs, replicated in at least one cell line studied in ENCODE and in all replicates in each).
When including all eCLIP interactions regardless of replication (723 881 interactions for GENCODE 'basic' RNAs), this AUC is still 0.76. Normalizing the prediction score by sequence lengths, similarly to a previous work (19), we found that the predictive performance decreases slightly (to an AUC of 0.71 on the high-confidence interactions, and of 0.70 on all). This indicates a size effect, potentially due to the RNase digestion step in CLIP protocols. We stress that the method was trained on X-ray and NMR data, and that its performance on the experimental CLIP data reflects its predictive power (Figure 1B). RNAct displays the length-normalized prediction scores, with raw catRAPID scores available for download upon request.
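Two ideas in this section lend themselves to a short illustration: taking the maximum score over all fragment pairs, and reading the AUC as the probability that a true interaction outscores a background pair. The sketch below is illustrative only. The fixed-size sliding windows, the default window sizes and the `score_fn` argument are assumptions standing in for the catRAPID fragmentation procedure and scoring function, which are described in the cited references and not reproduced here.

```python
from typing import Callable, Iterator, Sequence


def windows(seq: str, size: int, step: int) -> Iterator[str]:
    """Yield fixed-size windows (a toy stand-in for catRAPID fragmentation).

    A sequence no longer than `size` is yielded whole. Residues beyond the
    last full window are ignored in this simplified version.
    """
    if len(seq) <= size:
        yield seq
        return
    for start in range(0, len(seq) - size + 1, step):
        yield seq[start:start + size]


def max_fragment_score(
    protein: str,
    rna: str,
    score_fn: Callable[[str, str], float],  # hypothetical scoring function
    size: int = 50,
    step: int = 25,
) -> float:
    """Score every protein/RNA fragment pair and keep the maximum."""
    return max(
        score_fn(p, r)
        for p in windows(protein, size, step)
        for r in windows(rna, size, step)
    )


def roc_auc(positives: Sequence[float], background: Sequence[float]) -> float:
    """AUC as the fraction of (positive, background) pairs ranked correctly.

    Ties count as half. This is quadratic in the input sizes and is meant
    for small illustrative inputs only.
    """
    wins = 0.0
    for p in positives:
        for b in background:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(positives) * len(background))
```

Under this reading, an AUC of 0.78 means that a randomly chosen high-confidence eCLIP interaction receives a higher score than a randomly chosen background pair about 78% of the time.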

Experimental interaction data (ENCODE eCLIP)
Experimental interaction data covering 119 human RBPs using eCLIP in the HepG2 and K562 cell lines (170 total experiments) were obtained from the ENCODE Project in narrowPeak format (11,24,25). This represents the largest single dataset of experimental protein-RNA interaction data currently available. Additional experimentally determined interactions covering 69 RBPs in yeast using RIP-Chip were obtained from a compilation by Mittal et al. (14).

Protein and RNA annotation
A very recent census of proteins with experimental evidence of RNA-binding activity in human (1393 known RBP genes), mouse (1914 known RBPs) and yeast (1273 known RBPs) was used to flag proteins as known RBPs in RNAct (4). Additionally, an older census of 1542 RBPs, which used features such as domain composition and known roles of proteins, was used to flag a further 658 human RNAct proteins as known RBPs (3). Overall, 5097 proteins in RNAct are flagged as 'Known RBPs', with 2031 of these being human.
In addition to annotated, known RBPs, we obtained predictions of RNA-binding activity from SONAR (26) (1923 predicted human RBPs) and catRAPID signature (27). catRAPID signature was used with a threshold score of 0.735, equivalent to a z-normalized value of 1 (one standard deviation above the mean) for the score distribution for known human RBPs from Hentze et al. (4), resulting in 1268 predicted human RBPs. Overall, 2779 human proteins in RNAct are flagged as 'Predicted RBPs', 1721 of these being novel (not 'known').
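The threshold described above corresponds to the mean of a reference score distribution plus one standard deviation. A minimal sketch of this calculation follows; the function names are hypothetical, and the use of the population standard deviation and of an inclusive comparison are assumptions, since the text does not specify either.

```python
from statistics import mean, pstdev
from typing import Dict, Iterable, Set


def z_threshold(reference_scores: Iterable[float], z: float = 1.0) -> float:
    """Return the raw score equivalent to a given z-score.

    Assumes the population standard deviation; the source does not state
    whether the population or the sample estimate was used.
    """
    scores = list(reference_scores)
    return mean(scores) + z * pstdev(scores)


def flag_predicted_rbps(scores: Dict[str, float], threshold: float) -> Set[str]:
    """Return the proteins whose score reaches the threshold (inclusive, by assumption)."""
    return {protein for protein, score in scores.items() if score >= threshold}
```

With the reported threshold, `flag_predicted_rbps(scores, 0.735)` would return the proteins counted as predicted RBPs by catRAPID signature, given a mapping of protein identifiers to their scores.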
RNA subcellular localization was obtained from the RNALocate database with very minor curation, removing a handful of ambiguous or non-subcellular terms (28). Basic protein annotation including gene symbols, full protein names and sequence length was obtained from UniProt. RNA annotation including transcript symbols (e.g. 'TARDBP-201'), length, biotype (e.g. 'protein coding', 'lincRNA'), GENCODE 'basic' status and TSL was obtained from GENCODE and Ensembl. Principal (primary) and alternative isoform classifications were obtained from APPRIS (29).

Technical aspects
RNAct is implemented in PHP on an Apache server using a MariaDB SQL backend, storing ∼450 GB of pre-sorted tables. The interaction predictions were calculated over several months on a shared set of 80 HP BL460c nodes with two Intel Xeon E5-2680 2.70 GHz CPUs and 120 GB of usable DDR3-1600 memory each, using 8 cores per cluster job. These are part of the CRG's high-performance computing cluster. The open-source Bootstrap library was used to ensure correct display on devices of any screen size, including mobile devices. Several icons were included from Font Awesome and the Noun Project (please see the About section of the website for attributions). RNAct collects no data on its users.

Search functionality
RNAct is built for extreme ease and speed of real-world use. The landing page (Search) contains a single search box which allows entry of any protein or RNA identifier (e.g. 'tdp43' or 'hotair'). Unless the term is highly ambiguous (e.g. 'ataxin'), most searches resolve to a single gene symbol, giving a choice of species and protein or RNA on the disambiguation page that follows (Figure 2). Table 1 shows a list of realistic search terms that are resolved successfully by RNAct. This is achieved by 'guessing' the identifier type, moving outwards from specific to more ambiguous options, if necessary. There is no built-in limit to the number of search results returned, allowing searches for e.g. 'RNA binding', 'vault RNA' or 'lysine demethylase'.
This design minimizes tedious input elements (e.g. a species dropdown box) and instead facilitates discovery and comparison across protein families and species. Matching fields are highlighted in green, which allows intuitive selection of the intended match (e.g. the RNA transcript in question when searching for 'ENST00000237536') while leaving room for additional useful choices (e.g. the corresponding protein for transcript 'ENST00000237536'). The search box is available in the top right of every page and is easily navigated to by pressing the tab key.
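The idea of 'guessing' the identifier type from specific to more ambiguous options can be sketched as an ordered list of pattern checks. The example below is not the RNAct implementation (which is written in PHP and not described in detail here). The type labels, the set of patterns and their order are all assumptions made for illustration; the UniProt accession pattern follows the format published by UniProt.

```python
import re
from typing import List, Pattern, Tuple

# Ordered from most specific to most ambiguous. Hypothetical, for illustration.
RESOLVERS: List[Tuple[str, Pattern[str]]] = [
    ("ensembl_transcript", re.compile(r"ENS[A-Z]*T\d{11}(\.\d+)?")),
    ("ensembl_gene", re.compile(r"ENS[A-Z]*G\d{11}(\.\d+)?")),
    (
        "uniprot_accession",
        re.compile(
            r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
            r"|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
        ),
    ),
    ("transcript_symbol", re.compile(r"[A-Z0-9.\-]+-\d{3}")),
    ("gene_symbol", re.compile(r"[A-Z0-9.\-]+")),
]


def guess_identifier_type(term: str) -> str:
    """Return the first (most specific) identifier type whose pattern matches.

    Matching is case-insensitive because the term is upper-cased first.
    Anything that matches no pattern is treated as a free-text query.
    """
    candidate = term.strip().upper()
    for name, pattern in RESOLVERS:
        if pattern.fullmatch(candidate):
            return name
    return "free_text"
```

In this sketch a term such as 'lysine demethylase' falls through to the free-text case, which would correspond to a search over names and descriptions that can return many results.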

Protein view
Once a protein of interest is selected, the Protein view (Figure 3) shows a list of RNA interaction partners prioritized by prediction score. Alternatively, the view can be sorted by experimental results simply by clicking on the experimental columns. The length, GENCODE 'basic' status, APPRIS classification and TSL (22) for each transcript are shown, allowing isoform quality assessment. Links out to Ensembl and UniProt for additional transcript and protein information respectively are provided (with an arrow symbol).

RNA view
Once an RNA is selected, the RNA view shows a list of predicted protein interaction partners prioritized by prediction score. Alternatively, the view can be sorted by experimental results simply by clicking on the experimental columns. Interactions with experimental evidence are highlighted (14,24), as are known (3,4) and predicted (26,27) RBPs. Links out to Ensembl and UniProt for additional information are provided.

Advanced pairwise search
A common use case for RNAct is the prediction of interactions within a set of proteins and RNAs, allowing the rapid prioritization of candidates for validation, and the analysis of specific pathways or systems. The Pairwise search feature allows entry of a set of proteins and a set of RNAs, either in multiple lines or separated by commas, and allows any identifier types which the Search function can resolve, including ambiguous queries (e.g. for 'lysine demethylase'). The only limitation is the total number of pairs queried, which is currently capped at 10 000 (allowing entry of e.g. 100 proteins and 100 RNAs).
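The input handling described above (entries separated by commas or line breaks, with a cap on the number of pairs) can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, duplicate entries are removed here although the text does not say whether RNAct does so, and identifier resolution is omitted, so an ambiguous query that expands to several proteins would count as one entry in this sketch.

```python
import re
from itertools import product
from typing import List, Tuple

MAX_PAIRS = 10_000  # limit quoted in the text


def parse_entries(text: str) -> List[str]:
    """Split on commas or line breaks, trim whitespace, drop empties and duplicates."""
    items = (item.strip() for item in re.split(r"[,\n]", text))
    # dict.fromkeys removes duplicates while preserving input order
    return list(dict.fromkeys(item for item in items if item))


def build_pairs(protein_text: str, rna_text: str) -> List[Tuple[str, str]]:
    """Return every protein/RNA combination, enforcing the pair limit."""
    proteins = parse_entries(protein_text)
    rnas = parse_entries(rna_text)
    n_pairs = len(proteins) * len(rnas)
    if n_pairs > MAX_PAIRS:
        raise ValueError(f"{n_pairs} pairs requested; the limit is {MAX_PAIRS}")
    return list(product(proteins, rnas))
```

Because the limit applies to the product of the two set sizes, 100 proteins against 100 RNAs is accepted, as is 10 proteins against 1000 RNAs.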

Browse proteins or RNAs
These views list all proteins or RNAs contained in RNAct, i.e. the human, mouse and yeast reference proteomes and transcriptomes. In the Browse Proteins view, proteins are listed in order of availability of experimental interaction data (e.g. from eCLIP), evidence of RNA-binding activity (known or predicted RBPs), species and gene symbol. This allows the easy retrieval of known RBPs, particularly those with experimental interaction data. In the Browse RNAs view, transcripts are sorted by species, gene symbol, GENCODE 'basic' status, APPRIS classification, TSL and descending transcript length. This means that the best-supported transcript for a given gene will appear first.
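The Browse RNAs ordering amounts to a multi-key sort. A minimal sketch is given below, assuming a hypothetical `Transcript` record. The ranking of APPRIS labels, the placement of transcripts without an APPRIS label or TSL at the end, and the field names are assumptions; the only external fact relied on is that a lower TSL number denotes better support.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Transcript:
    """Hypothetical record holding the annotation fields used for sorting."""
    species: str
    gene_symbol: str
    transcript_id: str
    is_basic: bool
    appris: Optional[str]  # 'principal', 'alternative' or None
    tsl: Optional[int]     # 1 (best supported) to 5, or None if unavailable
    length: int


APPRIS_RANK = {"principal": 0, "alternative": 1}  # unlabelled transcripts rank last


def browse_key(t: Transcript) -> Tuple[str, str, bool, int, int, int]:
    """Sort key: species, gene, basic first, APPRIS, TSL, then longest first."""
    return (
        t.species,
        t.gene_symbol,
        not t.is_basic,                    # False sorts before True
        APPRIS_RANK.get(t.appris, 2),
        t.tsl if t.tsl is not None else 6,  # missing TSL sorts after TSL5
        -t.length,                          # descending length
    )
```

Calling `sorted(transcripts, key=browse_key)` then places the best-supported transcript of each gene first within its species.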

Download
All RNAct protein-RNA interaction prediction data for human, mouse and yeast are available from the Download page. For human, the predictions are split into two sets for performance reasons: GENCODE 'basic' transcripts (covering a representative subset of 98 608 RNAs), and 'non-basic' transcripts making up the rest of the transcriptome. Both files can be concatenated for a full view of the human protein-RNA interactome, covering 20 778 proteins and 199 330 RNA transcripts. For mouse, only the GENCODE 'basic' transcripts are currently available, while the full annotated transcriptome is available for yeast. The RNAct predictions are licenced under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Licence (CC BY-NC-SA 4.0). A complete set of supporting tables containing protein and RNA annotations, identifier mappings used internally for searching, and the experimental data used (e.g. eCLIP) is available on the Download page as well. We intend to complete and add predictions for additional species such as C. elegans and D. melanogaster.
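As a consistency check on the figures above, the two human transcript subsets sum to the stated total, and multiplying by the number of proteins gives the size of the combined human set. The short sketch below uses only numbers quoted in the text; the function names are hypothetical, and the product is an upper bound that assumes every protein-RNA pair has a prediction.

```python
# Counts quoted in the text.
HUMAN_BASIC_TRANSCRIPTS = 98_608
HUMAN_NON_BASIC_TRANSCRIPTS = 100_722
HUMAN_PROTEINS = 20_778


def human_transcript_total() -> int:
    """Total human transcripts after concatenating both download files."""
    return HUMAN_BASIC_TRANSCRIPTS + HUMAN_NON_BASIC_TRANSCRIPTS


def human_pair_upper_bound() -> int:
    """Upper bound on human protein-RNA pairs (every protein with every RNA)."""
    return HUMAN_PROTEINS * human_transcript_total()
```

The sum reproduces the 199 330 transcripts stated above, and the product is roughly 4.14 billion pairs, so human accounts for the majority of the 5.87 billion predictions in the database.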

About
The About page gives more details on the algorithm and datasets used, provides literature references and answers what we expect to be frequently asked questions, including contact details.

DISCUSSION
RNAct provides an easy-to-use view of protein-RNA interactions in model organisms. It is intended to grow, both in terms of the number of species covered (currently human, mouse and yeast) and in terms of the experimental datasets provided. We hope our database will be particularly useful for studying gene regulatory events and networks at the post-transcriptional level (30). In addition to protein-centric datasets, recently published interactomes for the MALAT1, NEAT1 and NORAD long non-coding RNAs (lncRNAs) from a mass spectrometry-based method make it likely that additional RNA-centric datasets will be published in the near future (31). We are actively implementing features such as flagging interactions which are experimentally validated at low throughput, and allowing users to add articles supporting a given interaction. Interactions supported by the presence of an RNA-binding domain and its corresponding motifs are also intended to be highlighted in future (32). Additionally, we are considering reporting the predicted binding regions for each interaction from catRAPID, similar to a CLIP binding profile, although this would require us to upgrade our server infrastructure due to the terabytes of data involved for all pairwise interactions. In summary, RNAct provides easy access to genome-scale protein-RNA interaction predictions with useful supporting annotation and experimental interaction evidence.

Nucleic Acids Research, 2019, Vol. 47, Database issue, D604

Figure 1. (A) Interaction propensity scores for the background (sampled from slightly over 2 billion human protein-RNA pairs; light red) and positive set (212 256 high-confidence protein-RNA interactions revealed by eCLIP; cyan). The z-score reported in the results pages is computed on the right-skewed blue distribution, with the solid cyan line indicating the mean and the dashed line indicating a z-score of 1 (one standard deviation above the mean). (B) The area under the ROC curve of 0.78 (0.72 upon length normalization) indicates the predictive performance of the catRAPID method on recent high-confidence experimental eCLIP data from the ENCODE project.

Figure 2. Search results (disambiguation page). This page allows selection of the protein or RNA of interest across the 3 species currently in RNAct.

Figure 3. The Protein view. This page shows a list of potential RNA interaction partners prioritized by catRAPID length-normalized prediction score. Alternatively, the page can be sorted by eCLIP experimental results by clicking on the 'P-value' or 'fold change' columns. Useful information on the protein of interest, such as whether it is a known or predicted RBP and whether experimental interaction data (e.g. from eCLIP experiments) exists for it, is shown at the top of this view, and transcript annotation and quality information are shown as badges for each RNA. Links out to Ensembl and UniProt are provided. Other links lead to the protein's or RNA's view within RNAct.

Table 1. Examples of realistic search terms successfully resolved by RNAct