EuRBPDB: a comprehensive resource for annotation, functional and oncological investigation of eukaryotic RNA binding proteins (RBPs)

Abstract RNA binding proteins (RBPs) are a large protein family that plays important roles at almost all levels of gene regulation through interacting with RNAs, and contributes to numerous biological processes. However, a complete list of eukaryotic RBPs, including human RBPs, is still unavailable. Here, we systematically identified RBPs in 162 eukaryotic species based on both computational analysis of RNA binding domains (RBDs) and large-scale RNA binding proteomic data, and established a comprehensive eukaryotic RBP database, EuRBPDB (http://EuRBPDB.syshospital.org). We identified a total of 311 571 RBPs with RBDs (corresponding to 6368 ortholog groups) and 3651 non-canonical RBPs without known RBDs. EuRBPDB provides detailed annotations for each RBP, including basic information and functional annotation. Moreover, we systematically investigated RBPs in the context of cancer biology based on published literature, PPI networks and large-scale omics data. To facilitate the exploration of the clinical relevance of RBPs, we additionally designed a cancer web interface to systematically and interactively display the biological features of RBPs in various types of cancers. EuRBPDB has a user-friendly web interface with browse and search functions, as well as a data download function. We expect that EuRBPDB will be a widely used resource and platform for both the RNA biology and cancer biology communities.


INTRODUCTION
RNA binding proteins (RBPs) are involved in the regulation of the metabolism, transportation, translation and function of both coding and non-coding RNAs through direct RNA-protein interaction (1). RBPs ensure the smooth flow of genetic information from DNA to RNA, and ultimately to proteins, making them essential and instrumental for all physiological and pathological processes (1). Numerous diseases, including cancer, metabolic disorders and neuropathies, are caused by aberrant expression or function of RBPs (2-4).
Comprehensive identification and annotation of all RBPs are primary and crucial steps for characterization of their functions. To date, several RBP databases exist for a few eukaryotes, but these databases only collected a small number of well-characterized RBPs from one or a few species. For example, RBPDB is a database focusing on the collection of experimentally validated RBPs and RNA binding domains (RBDs), and it contains only 1171 RBPs from human, mouse, fly and worm (5). ATtRACT is a manually curated database that compiles information for only 370 well-characterized RBPs from 39 species (6). Clearly, the RBP repertoire collected by these existing databases is far from complete for any species, human included.
RBPs bind to RNA via structurally well-defined RBDs, such as the DEAD box helicase domain and the RNA recognition motif (RRM) (7,8). Here, we annotated proteins containing an RBD as canonical RBPs. Additionally, many studies have suggested the existence of complex protein-RNA interactions that do not require canonical RBDs (9,10) and instead occur through other structures such as intrinsically disordered regions (IDRs) (11). It is thus challenging to identify non-canonical RBPs without known RBDs in a high-throughput and unbiased manner. Recent advances in RNA binding proteome (RBPome) technology significantly facilitate the large-scale identification of non-canonical RBPs (12-18), including the capture of the polyadenylated RNA interactome (11,16-21), click chemistry-based capture of the RNA interactome (13), and orthogonal organic phase separation (OOPS) of RBPs (14,15,19). These methods crosslink the RBPs with RNA using UV, then apply different strategies to extract total RBPs from cells or tissues. The purified total RBPs are then analyzed by mass spectrometry (MS) to define the RBPome. These RBPome technologies have been applied to many eukaryotes, including human (11,15,16,19-21), mouse (12) and fly (18), and identified a large number of novel canonical and non-canonical RBPs. It should be noted that, as experimental methods, none of the RBPome technologies is capable of capturing the complete set of RBPs, due to the limitations of the total RBP purification strategies and of MS technology (12-19). Moreover, most of the present RBPome studies applied stringent filtering to control false positives, at the cost of more false negatives and lower sensitivity.
Given the rapid progress of the RNA biology field (1), there is a great need for a comprehensive eukaryotic RBP database to explore the annotation, expression and function of RBPs. To address this, we collected a full list of RBDs from Pfam (22), together with published RBPome datasets from six eukaryotes (human, mouse, zebrafish, yeast, fly and worm) (Supplementary Table S1). In parallel, we predicted RBPs based on RBDs using HMMER (23) from the genomes of 162 eukaryotes. Upon integration, we established what is currently the most comprehensive database of eukaryotic RBPs, EuRBPDB (Figure 1). EuRBPDB contains a total of 315 222 RBPs, with detailed annotations for each RBP. Moreover, given the crucial role of RBPs in cancer biology, and to help users explore the clinical relevance of RBPs, we separately built a Cancer web interface to display integrated cancer-associated omics datasets. The database has a user-friendly interface to interactively display and search the detailed annotations. EuRBPDB will therefore greatly promote the investigation and understanding of RNA biology.

Identification and annotation of RBPs
All protein sequences of 162 eukaryotes were downloaded from the Ensembl database (24) (release 96, http://www.ensembl.org/). Proteins were annotated as canonical RBPs if they contain one or more domains known to directly interact with RNA. RBPs were identified by searching proteins for sequence homologs of known RBDs using probabilistic models known as profile hidden Markov models [4]. The present RBD list was curated based on the comprehensive RBD list established by Gerstberger et al. (25). After careful examination, we found that eight RBDs (RRM6, KH 3, MRL1, Ribosomal S3 N, Lactamase B2, tRNA synt 2b, RnaseH, tRNA anti) have been removed by Pfam, and thus they were eliminated from our list. Finally, we obtained a total of 791 RBDs (available for download from http://EuRBPDB.syshospital.org/data/download/791 RBDs.PFam.gz). We extracted RBD HMM profiles from the Protein families (Pfam) database (Pfam HMM profiles, release v32) (22), and applied the hmmsearch program in the HMMER (v3.2.1) package (23) to search all of the eukaryotic protein sequences against the RBD HMM profiles to identify RBPs. Proteins with an E-value below 0.0001 were considered bona fide canonical RBPs. In total, we identified 311 571 canonical RBPs from 162 eukaryotic species. In parallel, we manually collected large-scale RBPome datasets of human, mouse, zebrafish, yeast, fly and worm from 21 published works (Supplementary Table S1). Human non-canonical RBPs were required to be detected in at least two RBPome datasets. For other species, the RBPs detected by any RBPome dataset were included in EuRBPDB. As a result, we obtained 3651 non-canonical RBPs from six species. Finally, EuRBPDB collected a total of 315 222 RBPs, representing the largest eukaryotic RBP database currently available.
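The two selection rules above (E-value below 0.0001 for canonical RBPs, detection in at least two RBPome datasets for human non-canonical RBPs) can be sketched as follows. This is a minimal illustration, not the pipeline used to build EuRBPDB: it assumes the hmmsearch results were written with the `--tblout` option, that the cutoff is applied to the full-sequence E-value (the text does not say which E-value was used), and the function names `canonical_rbps` and `human_noncanonical` are hypothetical.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-4  # threshold stated in the text (0.0001)


def canonical_rbps(tblout_lines, cutoff=E_VALUE_CUTOFF):
    """Map protein ID -> set of RBD names whose E-value is below the cutoff.

    Expects lines in HMMER's --tblout layout: target name, target accession,
    query name, query accession, full-sequence E-value, ...
    Using the full-sequence E-value is an assumption of this sketch.
    """
    hits = defaultdict(set)
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        protein, domain, evalue = fields[0], fields[2], float(fields[4])
        if evalue < cutoff:
            hits[protein].add(domain)
    return dict(hits)


def human_noncanonical(rbpome_datasets, canonical, min_datasets=2):
    """Proteins seen in at least `min_datasets` RBPome datasets and lacking an RBD hit."""
    counts = defaultdict(int)
    for dataset in rbpome_datasets:
        for protein in set(dataset):  # count each protein once per dataset
            counts[protein] += 1
    return {p for p, n in counts.items() if n >= min_datasets and p not in canonical}
```

For the other five species, where a single RBPome detection suffices, the same helper could be called with `min_datasets=1`.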
EuRBPDB records four lines of evidence of RNA binding for each RBP, namely (i) literature supporting RNA-binding capacity, (ii) RNA-binding domains, (iii) RBPome detection and (iv) RNA-binding sites detected by CLIP-Seq. We graded RBPs supported by only one of these four lines of evidence as 'putative' in the Description section on the Basic information subpage. The basic information, GO and phenotype annotation of RBPs were obtained from the NCBI, GeneCards and Ensembl databases. The protein-protein interaction (PPI) information was parsed from the STRING database (26). The pathway annotation was obtained from the KEGG database (27). Expression data were obtained from GTEx (28) and SRA.
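The grading rule can be written as a small check over the four evidence types. This is only a sketch: the dictionary keys and the function names are hypothetical, and the text does not state what label is shown for RBPs with two or more lines of evidence, so the code reports only whether an entry is 'putative'.

```python
# The four lines of evidence named in the text; the key names are illustrative.
EVIDENCE_TYPES = ("literature", "rbd", "rbpome", "clip_seq")


def count_evidence(evidence):
    """Number of the four evidence types that are present for one RBP."""
    return sum(1 for kind in EVIDENCE_TYPES if evidence.get(kind))


def is_putative(evidence):
    """True when exactly one of the four evidence types supports the RBP."""
    return count_evidence(evidence) == 1
```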

Classification of eukaryotic RBP family
We characterized and classified canonical RBPs by their sequence-specific RBDs. Each RBP family was named after its RBD if its members contain only one type of RBD. If an RBP contains multiple types of RBDs, it was assigned to each of the corresponding families. All non-canonical RBPs were classified into a single non-canonical RBP family. In total, we obtained 686 RBP families.
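A minimal sketch of this classification rule is given below, assuming each protein has already been mapped to the set of RBD types it contains. The function name and the label used for the non-canonical family are illustrative, not taken from EuRBPDB.

```python
from collections import defaultdict

NON_CANONICAL = "non-canonical"  # illustrative label for the catch-all family


def assign_families(protein_to_rbds):
    """Group proteins into families named after their RBDs.

    A protein with several RBD types is placed in every matching family;
    a protein with no RBD goes to the non-canonical family.
    """
    families = defaultdict(set)
    for protein, rbds in protein_to_rbds.items():
        if not rbds:
            families[NON_CANONICAL].add(protein)
        for rbd in rbds:
            families[rbd].add(protein)
    return dict(families)
```

Because multi-domain proteins are counted in each family, the sum of family sizes can exceed the total number of RBPs.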

Orthologs and paralogs
The reciprocal best hit (RBH) method (29) was used to predict the putative orthologs of RBPs among different species. We performed an all-against-all BLASTP (v2.7.1+) search between the proteins of two genomes with strict cutoffs (E-value ≤ 1e-6, coverage ≥ 50%, identity ≥ 30%) and annotated the reciprocal best hit pairs as orthologs. Paralogs were predicted by the BLAST score ratio (BSR) (30) approach. A BLASTP search was conducted within each genome with the same parameters as in the ortholog search. The BSR value cutoff was set to 0.4.
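The two procedures can be sketched on pre-parsed BLASTP hits as shown below. Several details are assumptions of this illustration rather than statements from the text: the best hit is chosen by bit score, the BSR is computed as the bit score of a hit divided by the query's self-hit bit score, and a pair is accepted when its BSR is at least 0.4. The `Hit` record and function names are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    evalue: float
    coverage: float  # percent
    identity: float  # percent
    bitscore: float


def passes(hit):
    """Cutoffs stated in the text: E-value <= 1e-6, coverage >= 50%, identity >= 30%."""
    return hit.evalue <= 1e-6 and hit.coverage >= 50 and hit.identity >= 30


def best_hits(hits):
    """Best-scoring subject per query among hits that pass the cutoffs."""
    best = {}
    for h in hits:
        if passes(h) and (h.query not in best or h.bitscore > best[h.query].bitscore):
            best[h.query] = h
    return {q: h.subject for q, h in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Ortholog pairs (a, b) where a's best hit is b and b's best hit is a."""
    ab, ba = best_hits(a_vs_b), best_hits(b_vs_a)
    return sorted((a, b) for a, b in ab.items() if ba.get(b) == a)


def paralogs(self_hits, cutoff=0.4):
    """Within-genome pairs whose BLAST score ratio reaches the cutoff."""
    self_score = {h.query: h.bitscore for h in self_hits if h.query == h.subject}
    pairs = set()
    for h in self_hits:
        if h.query == h.subject or not passes(h) or h.query not in self_score:
            continue
        if h.bitscore / self_score[h.query] >= cutoff:
            pairs.add(tuple(sorted((h.query, h.subject))))
    return sorted(pairs)
```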
Nucleic Acids Research, 2020, Vol. 48, Database issue D309
Figure 1. A system-level overview of the EuRBPDB core framework. A total of 315 222 RBPs, including 311 571 canonical RBPs and 3651 non-canonical RBPs, were identified by combination of computational RBP searching with RBPome profiling. All RBPs were annotated by information retrieved from public database, like NCBI, Ensembl, STRING, KEGG and GeneCards. Cancer-relevant RBPs were identified by literature mining and systematic TCGA data analysis. All the results generated by EuRBPDB were deposited in MySQL relational databases and displayed in the web pages. All species photos were downloaded from Ensembl database (24).

Cellular effects of drugs to RBP expression
Two L1000 assay level-5 datasets (GSE92742 and GSE70138) (33) generated by the Library of Integrated Cellular Signatures (LINCS) project were downloaded from GEO. These datasets contain over 1 600 000 subdatasets measuring the effects of 30 744 drugs on the RNA profiles of 44 cell lines. L1000 assay datasets were parsed and displayed with the cmapR (v1.0.1) and ggplot2 (v3.1.0) R packages, as suggested by the LINCS project. Expression of RBPs is displayed as z-scores.

RNA binding sites of RBPs
A total of 227 eCLIP isogenic replicated datasets generated from K562 (120 RBPs) and HepG2 (103 RBPs) cell lines and human adrenal gland tissues (two RBPs) were retrieved from the ENCODE database (https://www.encodeproject.org/). Peak and bam files of each dataset were downloaded. We used intersectBed from the bedtools package (v2.27.1) (34) to annotate each peak, and coverageBed from bedtools to retrieve the RPM value of each peak.
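The text does not spell out how the RPM value is derived from the coverageBed read counts. Assuming the usual reads-per-million definition (reads overlapping a peak, scaled by the total number of mapped reads in the bam file), the calculation would look like the sketch below; the function name is hypothetical.

```python
def rpm(reads_in_peak, total_mapped_reads):
    """Reads per million mapped reads for one peak (standard definition, assumed here)."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return reads_in_peak * 1_000_000 / total_mapped_reads
```

Under this definition, a peak covered by 50 reads in a library of 10 million mapped reads has an RPM of 5.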

Literature analysis of RBP
Literature mining was conducted in GenCLiP 3 (http://ci.smu.edu.cn/genclip3/). In brief, Entrez IDs of all RBPs were submitted to GenCLiP 3. Keywords in the function module of GenCLiP 3 were set as 'cancer or tumor' to search for cancer-associated literature, and 'RNA binding or RNA-binding' to search for literature on RNA binding. GenCLiP 3 was run in GeneRIF mode to search for cancer-associated literature, and in MEDLINE mode to search for literature on RNA binding. The search returns the PubMed IDs of all publications that study the RBPs in cancers or their RNA-binding capacity. The information on each publication was retrieved from PubMed based on its PubMed ID. RBPs reported in at least three cancer-relevant studies were considered to be cancer-associated.
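Once the PubMed IDs have been retrieved, the three-study threshold amounts to a simple filter. The sketch below assumes the search results have been collected into a mapping from each RBP to its list of PubMed IDs; the function name and data layout are illustrative.

```python
def cancer_associated(rbp_to_pmids, min_studies=3):
    """RBPs supported by at least `min_studies` distinct cancer-relevant PubMed IDs."""
    return {
        rbp
        for rbp, pmids in rbp_to_pmids.items()
        if len(set(pmids)) >= min_studies  # de-duplicate repeated PubMed IDs
    }
```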

The web-based exploration of RBPs
EuRBPDB provides genome-wide identification of RBPs in a large number of eukaryotic species based on HMMER searching results combined with RBPome dataset analyses. In total, 315 222 RBPs, including 311 571 canonical RBPs corresponding to 6368 ortholog groups and 3651 non-canonical RBPs, were identified in 162 eukaryotic species. With the systematic annotation of these RBPs, we designed a user-friendly web interface for users to query the database conveniently and interactively. Users can either browse the entire RBP list of any of the 162 eukaryotes collected in the database, or search for any RBP in any eukaryote of interest. EuRBPDB provides two different ways to browse the data: by species, or by family as defined by RBDs. On the 'Species' page, the 162 species are classified into 12 categories according to Ensembl taxonomy. To browse the RBP list of a species, users just need to click the species image of interest, and retrieve the detailed RBP information through the following steps: families → family gene list → single gene annotation. On the 'Family' page, EuRBPDB lists all 686 RBP families from 162 eukaryotes. RBP families are ordered by family size in descending order. By clicking the family name, users will get all RBPs in this family grouped by species. Users can also obtain the detailed information of an RBP through the following steps: species → gene list → single gene annotation.
Users can search for a specific RBP of interest using the quick search box at the top right corner of the navigation bar on any page; the search returns all RBPs in any species that match the search criteria. To browse the detailed information of a specific RBP, users can specify both the species and the RBP name/ID on the 'Search' page. Both the search and browse functions direct users to the detailed information page of a specific RBP. This page comprises two subpages, namely the 'Basic Information' subpage and the 'Cancer Related Information' subpage (currently only for human RBPs). Both subpages consist of a number of information sections constructed from data collected from other published databases. New sections can readily be added to these subpages, and thus it is easy and convenient to update EuRBPDB regularly. In the Basic information subpage, EuRBPDB provides basic information including gene structure (Gene Model section), evidence for RNA binding (RBDs, RBPome, RPI and Literatures sections), expression (Expression section), and functional annotation (PPI, Pathway and Gene Ontology sections etc.). The 'Cancer Related Information' subpage is introduced in the following sections.

Cancer web interface
RBPs contribute extensively and significantly to numerous processes in cancer biology. To facilitate RBP research in cancer, EuRBPDB provides cancer-associated annotation of RBPs in the Cancer web interface. Through systematic literature mining using GenCLiP 3 (http://ci.smu.edu.cn/genclip3/), we found that a total of 727 RBPs are reported to be associated with human cancers (reported by at least three publications). Among them, 144 RBPs were frequently investigated (reported by >20 publications). Moreover, we conducted differential expression, somatic mutation, CNV and survival analyses based on TCGA data to comprehensively reveal the alterations of RBPs in human cancers. As a result, we identified 1361 RBPs showing aberrant expression in at least one cancer type, 2900 RBPs harboring nonsense and/or missense mutations (1761 of them mutated in RBD regions), 2851 RBPs having genomic deletions or amplifications, and 2897 RBPs exhibiting significant survival correlation.
Mutational analysis of RBDs showed that certain cancer types, such as Pheochromocytoma and Paraganglioma (PCPG) and PAAD, have higher mutation rates in RBD regions than others (Supplementary Figure S1A). This result is congruent with the findings that the expression and functions of RBPs have cell-type specificity (12,18). On the other hand, certain RBDs, such as the MMR_HSR1 and RRM_1 domains, have higher mutation rates across human cancers (Supplementary Figure S1B). Notably, mutations in the RRM_1 domain of RBM10 have been suggested to play an important role in the development and progression of lung adenocarcinomas (35)(36)(37), highlighting that our analysis is capable of identifying functional mutations in cancer-associated RBPs. Together, these results warrant further investigation of the functional significance of candidate RBPs and RBDs in cancer biology.
It is conceivable that a larger number of mutated genes in a given RBP PPI network will result in a higher degree of network dysregulation. A bar plot showing the number of aberrant RBP PPI networks (defined as networks in which more than 30% of the genes are mutated) in each cancer is provided in the Cancer interface. To help users explore the number of mutated genes of each RBP PPI network in each cancer type, we added a bar plot under the PPI network figure in the Basic information subpage of each RBP.
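The 30% criterion can be expressed as a short check on the fraction of network members that carry a mutation in a given cancer type. This is a sketch under the assumption that each PPI network is available as a list of gene identifiers; the function names are hypothetical.

```python
def mutated_fraction(network_genes, mutated_genes):
    """Fraction of genes in one RBP PPI network that are mutated in a cancer type."""
    genes = set(network_genes)
    if not genes:
        raise ValueError("network must contain at least one gene")
    return len(genes & set(mutated_genes)) / len(genes)


def is_aberrant(network_genes, mutated_genes, threshold=0.30):
    """True when more than 30% of the network's genes are mutated."""
    return mutated_fraction(network_genes, mutated_genes) > threshold
```

Counting the networks for which `is_aberrant` is true, per cancer type, gives the quantity plotted in the bar plot described above.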
Most RBPs with cancer-associated alterations have hitherto not been reported to be associated with any cancer, providing a valuable and novel resource for cancer researchers. EuRBPDB provides an overview of the cancer-associated RBPs on the 'Cancer' page, as well as the list of published and novel cancer-associated RBPs deposited in EuRBPDB. By clicking the 'Details' link of each RBP, users are redirected to the detailed information page of that RBP, which includes the Cancer Related Information subpage. There are six sections in this subpage, showing the literature investigating the selected RBP (Literatures), a differential expression boxplot (Differential Expression), mutations in the RBP (mutation), copy number variation (CNV), survival analysis (survival), as well as the expression changes across 44 different cell lines under the treatment of ∼2000 drugs (33).

RBPredictor web-server for the annotation of eukaryotic RBPs
A web-based tool, RBPredictor, was further developed to help users determine whether a protein of interest (from any eukaryote) is a putative canonical RBP. The prediction is based on the RBD set used in this study: the hmmsearch program in the HMMER (v3.2.1) package is run to determine whether the submitted protein sequence is a putative RBP (25). On the 'RBPredictor' page, users only need to input one or multiple protein sequences in FASTA format, or submit a FASTA file with protein sequences. If an input protein is identified as a putative RBP, RBPredictor also lists all potential RBDs that the protein harbors.

DISCUSSION AND CONCLUSIONS
In this study, we systematically identified eukaryotic RBPs by integrating both large-scale RBPome experimental data and computational RBD identification data. We identified a total of 311 571 high-confidence canonical RBPs corresponding to 6368 ortholog groups in 162 eukaryotes, and 3651 non-canonical RBPs without known RBDs in six eukaryotes (human, mouse, zebrafish, fly, worm and yeast). Currently, all non-canonical RBPs are grouped into a single non-canonical RBP protein family. The 311 571 canonical RBPs form 686 protein families. Apart from a few large RBP families, such as RRM_1 (33 193 RBPs, 597 ortholog groups), zf-met (20 101 RBPs, 589 ortholog groups), zf-C2H2 (16 879 RBPs, 507 ortholog groups) and MMR_HSR1 (22 986 RBPs, 467 ortholog groups), most RBP families contain a small number of ortholog groups (median: 4) (Supplementary Figure S2). 2961 RBPs were identified in human with high confidence, including 1836 canonical RBPs and 1135 non-canonical RBPs, significantly expanding the human RBP repertoire. Moreover, most human RBPs were found to have cancer-related alterations. We systematically annotated all eukaryotic RBPs in this study, and constructed the most comprehensive eukaryotic RBP database, EuRBPDB. Through the integration of various large-scale omics data (such as CLIP-Seq, RNA-Seq and L1000 assay), EuRBPDB provides a comprehensive platform to explore the function and cancer relevance of RBPs. Users can readily obtain basic, functional and cancer-relevant information on any RBP of interest from EuRBPDB (Figure 2). EuRBPDB also provides the RBPredictor web-server, which enables users to easily and rapidly determine whether a eukaryotic protein not included in EuRBPDB is an RBP. EuRBPDB provides a framework to systematically identify eukaryotic RBPs based on RBD searching and RBPome data.
Identification of RBPs through RBD matching is a highly effective and accurate approach (25). However, recent RBPome studies showed that a large number of proteins without canonical RBDs also bind RNA, and many of them bind RNA through IDRs (11). Therefore, it is clearly insufficient to identify RBPs based on RBD searching alone. On the other hand, RBPome methods are likewise incapable of detecting all RBPs because of (i) the context-dependent RNA binding capacity of many RBPs (1); (ii) the restricted expression pattern of RBPs, since RBPome experiments were performed in only a few cell types; (iii) technical limitations of the purification strategies for total RBPs (14,15,19); and (iv) the low sensitivity of MS technology. We also find that only about half of human canonical RBPs can be detected by different RBPome methods (Supplementary Figure S3, Supplementary Table S1). Thus, at present, a comprehensive way to acquire a more complete RBP repertoire is to combine computational RBP searching with RBPome profiling.
To verify the reliability of the RBP dataset we generated, we cross-checked it against all current RBP databases. The results showed that EuRBPDB identified the vast majority of the RBPs (ranging from 90.1% to 100%) collected by other databases across different species (Figure 3), validating the accuracy and consistency of our work. Furthermore, we used GO annotation to evaluate the robustness and accuracy of our human RBP list. Indeed, we found that 95.3% of canonical RBPs and 73.8% of non-canonical RBPs were annotated with RNA-related GO terms, such as 'RNA-binding', 'RNA modification' and 'endoribonuclease activity'. These results together highlight that our RBP identification approach has high accuracy and robust performance.
Many databases have been established to aid the research of RNA biology (5,6,38). However, currently no comprehensive RBP database is available for all species. All existing RBP databases, such as RBPDB (5), ATtRACT (6), SpliceAid-F (38), POSTAR2 (39) and starBase (40), focus on the collection and integration of the structure, RBDs, RBP binding sites or disease correlation of a small number of well-characterized RBPs in a limited range of eukaryotes. Compared with these RBP databases, EuRBPDB provides the largest eukaryotic RBP repertoire (315 222 RBPs, forming 6368 ortholog groups), the most comprehensive functional and cancer-associated annotation, and an intuitive and easy-to-use web interface. Therefore, EuRBPDB provides a powerful platform to decode RBP function and regulatory mechanisms.

FUTURE DIRECTIONS
EuRBPDB is a comprehensive eukaryotic RBP database, characterizing the RBPs of 162 eukaryotes genome-wide. With the ever-increasing amount of RBPome and eukaryotic genome data, we will continue to update and maintain the RBP repertoire and annotation regularly. We will also integrate additional omics datasets (e.g. CLIP-seq, RNA-Seq) from public databases like Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA) to further improve our understanding of the function and regulatory mechanisms of RBPs.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.