RBPMetaDB: a comprehensive annotation of mouse RNA-Seq datasets with perturbations of RNA-binding proteins

Abstract
RNA-binding proteins (RBPs) may play a critical role in gene regulation in various diseases or biological processes by controlling post-transcriptional events such as polyadenylation, splicing and mRNA stabilization via binding activities to RNA molecules. Owing to the importance of RBPs in gene regulation, a great number of studies have been conducted, resulting in a large number of RNA-Seq datasets. However, these datasets usually do not have structured metadata, which limits their potentially wide use. To bridge this gap, the metadata of a comprehensive set of publicly available mouse RNA-Seq datasets with perturbed RBPs were collected and integrated into a database called RBPMetaDB. This database contains 292 mouse RNA-Seq datasets for a comprehensive list of 187 RBPs. These RBPs account for only ∼10% of all known RBPs annotated in Gene Ontology, indicating that most are still unexplored using high-throughput sequencing. This negative information provides a great pool of candidate RBPs for biologists to conduct future experimental studies. In addition, we found that DNA-binding activities are significantly enriched among RBPs in RBPMetaDB, suggesting that prior studies of these DNA- and RNA-binding factors focused more on their DNA-binding activities than on their RNA-binding activities. This result reveals the opportunity to efficiently reuse these data for investigation of the roles of their RNA-binding activities. A web application has also been implemented to enable easy access and wide use of RBPMetaDB. It is expected that RBPMetaDB will be a great resource for improving understanding of the biological roles of RBPs.
Database URL: http://rbpmetadb.yubiolab.org


Introduction
A lack of fully structured metadata limits the wide use of valuable RNA-Seq datasets in public repositories such as Gene Expression Omnibus (GEO) (1) and ArrayExpress (2). To fill this gap, manual curation has been shown to be an effective way to collect data resources (3) and has been applied to develop and maintain metadata databases (4). For example, microarray and RNA-Seq datasets have been curated for the downstream analyses in Expression Atlas (5) and in epidermal development (6). We previously launched two databases, RNASeqMetaDB (7) and SFMetaDB (8), to facilitate access to the metadata of publicly available mouse RNA-Seq datasets with perturbed disease-related genes and splicing factors, respectively. Here, we present a new database, RBPMetaDB, for the metadata of RNA-Seq datasets with perturbed RNA-binding proteins (RBPs).
RBPs play a critical role in multiple cellular processes in eukaryotes. RBPs bind to double- or single-stranded RNA molecules and are potential key factors in biological processes, such as pre-mRNA splicing, RNA methylation and protein translation (9). Besides influencing each of these processes, RBPs also provide a link between them (10). The perturbation of these intricate networks can destroy the coordination of complex post-transcriptional events and lead to disease (11).
According to recent genomic data and evidence derived from animal models, RBPs play a crucial role in the pathogenesis of many complex human diseases, including neurological disorders (12), Mendelian diseases (13) and cancer (14). These diseases have been demonstrated to have strong associations with aberrant functions or expression of RBPs, which can impact many different genes and pathways. Some diseases can be caused by loss of function of RBPs, such as Fragile X syndrome, paraneoplastic neurologic syndromes and spinal muscular atrophy (9). For example, Fragile X syndrome can be caused by a deficiency of the fragile X mental retardation gene (FMR1) (15). Alternatively, some diseases can be caused by gain of function of RBPs, including myotonic dystrophy, Fragile X tremor ataxia syndrome and oculopharyngeal muscular dystrophy (OPMD) (9). For instance, OPMD is caused by the accumulation of aggregates of mutant forms of the protein PABPN1 in the nuclei of skeletal muscle fibers (16). In addition, a deficiency of PABPN1 can induce progressive muscle weakness in muscular dystrophy (17).
To investigate the functions of RBPs in biological processes or diseases such as the ones mentioned above, a large number of studies have been conducted, resulting in exponential growth of RBP-related papers in recent years (Figure 1). For example, >1000 papers were published in 2017 alone. Among the studies on RBPs, a large number of RNA-Seq datasets have been generated in loss- or gain-of-function experiments and are publicly available from online repositories like GEO (1). However, because GEO does not have a stringent requirement for metadata of the submitted datasets, the metadata are non-uniformly maintained across different datasets, resulting in inconsistent dataset annotation and sometimes ambiguity. Such a deficiency makes it difficult to identify useful datasets with high precision and recall, which limits the wide use of the datasets.
To address this challenge, we curated RNA-Seq datasets from GEO and ArrayExpress with one or more RBPs being perturbed, e.g. by knock-out, knock-down or overexpression. Important dataset annotations such as genotypes and PubMed references were manually curated to ensure high accuracy. Curated datasets can be used in gene expression analysis (18,19) and alternative splicing analysis (20,21) for biological hypothesis generation (11) via a signature comparison approach (22). To facilitate the use of our curated datasets, the metadata information of these datasets was imported into a database called RBPMetaDB. It should be mentioned that our database differs greatly from Expression Atlas in the sense that the latter is not mainly about datasets where specific genes are perturbed and so is not guaranteed to be complete in this aspect.
Figure 1. Number of papers related to RBPs per year.
In this paper, we describe our main curation methods used in constructing RBPMetaDB and the statistics of the database. To demonstrate the use of RBPMetaDB, a number of promising candidate RBPs have been identified by comparing RBPs with RNA-Seq datasets and all the RBPs annotated in Gene Ontology (GO). In addition, a web application has been developed to host the database to broaden the use of curated metadata and the original raw datasets among biomedical communities.

Metadata curation of GEO/ArrayExpress RNA-Seq datasets and RBPMetaDB web application deployment
To collect RNA-Seq datasets for RBPs comprehensively, we first extracted 1587 mouse RBPs annotated in GO (accession GO:0003723) (23). The official symbol of each RBP was then searched against GEO and against ArrayExpress, the latter using the query (<official_symbol> AND organism:"Mus musculus" AND exptype:"sequencing assay" AND exptype:"rna assay"). These queries resulted in 1194 unique datasets in mice. Due to the limitations of the search functions of GEO and ArrayExpress, many of these datasets do not have perturbed RBPs despite the official symbols of some RBPs being mentioned in the titles or descriptions of the datasets. To retain the datasets with perturbations of RBPs, we manually curated each dataset (24) and retained datasets with biological replicates per comparison condition, with at least one RBP being knocked out, knocked down or overexpressed (along with the corresponding wild-type or control samples) in mice. For the datasets that do not have associated PubMed IDs on GEO and ArrayExpress, we manually added the PubMed IDs.
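As an illustration of the search step, the following minimal sketch fills the ArrayExpress query quoted above with one official symbol at a time. Only the query template comes from the text; the function names and the example symbols are assumptions made for illustration, and no request is actually sent to ArrayExpress.

```python
# Template copied from the ArrayExpress query quoted in the text.
# Function names and example symbols are illustrative assumptions.
ARRAYEXPRESS_TEMPLATE = (
    '({symbol} AND organism:"Mus musculus" '
    'AND exptype:"sequencing assay" AND exptype:"rna assay")'
)


def build_arrayexpress_query(official_symbol: str) -> str:
    """Fill the query template with one RBP official symbol."""
    return ARRAYEXPRESS_TEMPLATE.format(symbol=official_symbol.strip())


def build_queries(symbols):
    """Return one query per distinct symbol, preserving input order."""
    queries = {}
    for symbol in symbols:
        key = symbol.strip()
        if key and key not in queries:
            queries[key] = build_arrayexpress_query(key)
    return queries


if __name__ == "__main__":
    # Duplicate symbols are queried only once.
    for symbol, query in build_queries(["Mettl3", "Ezh2", "Mettl3"]).items():
        print(symbol, "->", query)
```

Each generated string would still have to be submitted to the repository, and every returned dataset checked by hand, as described above.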
To facilitate access to these datasets, we launched a database called RBPMetaDB (http://rbpmetadb.yubiolab.org). RBPMetaDB is implemented using Flask (http://flask.pocoo.org), a microframework for web development in Python. The MySQL database is used for data storage. The website of RBPMetaDB is freely available, and it presents the GEO/ArrayExpress accession numbers, descriptions, number of samples, associated curated RBPs, perturbation and PubMed references for each RNA-Seq dataset.
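The fields listed above suggest a simple one-row-per-dataset layout. The sketch below is not the RBPMetaDB schema: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server, and the table name, column names, column types and the example row are all hypothetical.

```python
import sqlite3

# Columns mirror the fields named in the text; names and types are assumed.
SCHEMA = """
CREATE TABLE dataset (
    accession    TEXT PRIMARY KEY,   -- GEO/ArrayExpress accession number
    description  TEXT NOT NULL,
    n_samples    INTEGER NOT NULL,
    rbps         TEXT NOT NULL,      -- curated perturbed RBP(s)
    perturbation TEXT NOT NULL,      -- e.g. knock-out, knock-down
    pubmed_id    TEXT                -- may be missing
)
"""


def make_demo_db():
    """Create an in-memory database holding one placeholder row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT INTO dataset VALUES (?, ?, ?, ?, ?, ?)",
        ("ACC_EXAMPLE", "Placeholder description", 6, "Mettl3", "knock-out", None),
    )
    conn.commit()
    return conn


def datasets_for_rbp(conn, symbol):
    """Return (accession, perturbation) pairs for one perturbed RBP."""
    cur = conn.execute(
        "SELECT accession, perturbation FROM dataset "
        "WHERE rbps = ? ORDER BY accession",
        (symbol,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    print(datasets_for_rbp(make_demo_db(), "Mettl3"))
```

Storing the RBPs as a single text column is a simplification; datasets with several perturbed RBPs would more naturally use a separate link table.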

Domain structure analysis of RBPs
Protein domain structure analysis of RBPs was performed to identify critical RBPs for future studies. First, all RBPs annotated to the 'RNA binding' GO term (GO:0003723) were retrieved using the R package GO.db (25). Using the UniProt annotation of the Pfam families assigned to the RBP protein domains (26), the number of RBPs with each Pfam family was calculated twice: once for the RBPs with curated RNA-Seq datasets and once for all RBPs. To focus on RNA binding, the same counts were computed for Pfam families with RNA-binding activity, where RNA-related Pfam families were searched using the RESTful interface of the Pfam database. Figure 3 plots the number of RBPs with Pfam families specific to RNA binding for the RBPs with RNA-Seq data and all the RBPs. By comparing the domain families of the RBPs with RNA-Seq datasets to those of all the RBPs, the RBPs in relatively less-studied domain families can be identified as promising candidates for future RBP studies.
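The counting step can be sketched as follows. This is a minimal illustration rather than the analysis code used for Figure 3: the RBP names and their family assignments are made up, and only the Pfam accessions themselves appear in the text.

```python
from collections import Counter


def count_rbps_per_family(rbp_to_families, rbps):
    """Count how many of the given RBPs carry each Pfam family."""
    counts = Counter()
    for rbp in rbps:
        # set() so a family repeated within one protein is counted once
        for family in set(rbp_to_families.get(rbp, ())):
            counts[family] += 1
    return counts


def compare_families(rbp_to_families, rbps_with_data):
    """Return (family, RBPs with RNA-Seq data, all RBPs), largest first."""
    all_counts = count_rbps_per_family(rbp_to_families, rbp_to_families)
    data_counts = count_rbps_per_family(rbp_to_families, rbps_with_data)
    rows = ((fam, data_counts.get(fam, 0), n) for fam, n in all_counts.items())
    return sorted(rows, key=lambda row: (-row[2], row[0]))


# Hypothetical toy annotation: RBP name -> Pfam accessions.
TOY_ANNOTATION = {
    "RbpA": ["PF00270"],
    "RbpB": ["PF00270", "PF04408"],
    "RbpC": ["PF01423"],
    "RbpD": ["PF01423"],
    "RbpE": ["PF00013", "PF00013"],
}
TOY_WITH_DATA = {"RbpA", "RbpE"}

if __name__ == "__main__":
    table = compare_families(TOY_ANNOTATION, TOY_WITH_DATA)
    for family, with_data, total in table:
        print(family, with_data, total)
    # Families with no studied member are candidate targets.
    print([fam for fam, with_data, _ in table if with_data == 0])
```

Families that are large overall but have a zero in the middle column correspond to the less-studied domain families described above.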

Data statistics
RBPMetaDB has 292 RNA-Seq datasets with 187 perturbed RBPs (Supplementary Table S1), which account for only ~10% of all annotated RBPs. Among these 187 RBPs, over 30% have more than one corresponding RNA-Seq dataset. Approximately 90% of datasets in RBPMetaDB have only one perturbed RBP, meaning that most studies are small-scale and well-focused. Also, RBPs with RNA-Seq data tend to have DNA-binding activity. To systematically examine the DNA-binding activity of RBPs, the GO term 'DNA binding' (GO:0003677) was used to extract the genes with DNA-binding activity. By overlapping with DNA-binding proteins, 66 RBPs with RNA-Seq datasets and 207 RBPs without RNA-Seq datasets were shown to have DNA-binding activity. Taking the total 1587 RBPs as background, Fisher's exact test showed an enrichment of DNA-binding activity in RBPs with datasets compared with RBPs without datasets (P-value < 5.8 × 10^-14). Specifically, for RBPs in RBPMetaDB, the ratio of RBPs with DNA-binding activity to those without is 0.68 (66 over 97). In contrast, the corresponding ratio for RBPs that do not have RNA-Seq datasets is only 0.17 (207 over 1219). This large difference suggests that many datasets in RBPMetaDB were collected to study DNA-binding activity rather than RNA-binding activity, and these datasets are likely to be underanalyzed for RNA-binding activity, providing a cost-effective opportunity to reanalyze them to study the related RNA biology. For example, Ezh2 is the most-studied gene, with 35 RNA-Seq datasets in RBPMetaDB. However, most studies of EZH2, as a catalytic subunit of polycomb repressive complex 2, focus on its capacity for mono-, di- and tri-methylation of histone H3 on lysine K27 (H3K27me1/2/3) (27). Figure 2a shows that the main RBP perturbation type of all the datasets in RBPMetaDB is knock-out (~67%). The rest are knock-down (~18%), overexpression (~9%), knock-in (~3.5%) and other (~2.8%, e.g. treated with inhibitors or point mutation).
Figure 2b shows that the US and Europe dominate the generation of RNA-Seq datasets for studying RBPs, with contributions of 60.1% and 23.4% of all the datasets, respectively. In addition, Figure 2c shows an increasing number of papers published about the RNA-Seq datasets in RBPMetaDB from 2010 to 2017. This increasing research interest worldwide will stimulate more investigation on RBPs.
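To make the enrichment test in this section concrete, the sketch below computes a one-sided Fisher's exact test from the hypergeometric distribution using only the Python standard library. It is an illustration under stated assumptions, not the original analysis: the four counts are those quoted in the text, reading 97 and 1219 as the numbers of RBPs without DNA-binding activity is an assumption, and the text does not say whether a one- or two-sided test was used.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns (sample odds ratio, P(X >= a)) under the hypergeometric null.
    """
    row1 = a + b            # e.g. RBPs with RNA-Seq datasets
    col1 = a + c            # e.g. RBPs with DNA-binding activity
    total = a + b + c + d
    upper = min(row1, col1)
    # Exact integer arithmetic; converted to a float only by the division.
    tail = sum(
        comb(col1, k) * comb(total - col1, row1 - k)
        for k in range(a, upper + 1)
    )
    p_value = tail / comb(total, row1)
    odds_ratio = (a * d) / (b * c)
    return odds_ratio, p_value


if __name__ == "__main__":
    # Rows: with / without RNA-Seq datasets.
    # Columns: with / without DNA-binding activity (assumed reading).
    odds_ratio, p_value = fisher_exact_greater(66, 97, 207, 1219)
    print(f"odds ratio = {odds_ratio:.2f}, one-sided P = {p_value:.3g}")
```

The odds ratio here is simply the quotient of the two ratios reported above (66/97 divided by 207/1219), so a value well above 1 reflects the same enrichment.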

Comparison of RBPs using protein domain analysis
Protein domains, as conserved protein structural units, typically characterize certain functional aspects of a protein, and proteins sharing similar domains tend to share similar functions. Since RBPs bind to RNAs, they should have RNA-binding domains. We therefore extracted the domain family information of all the RBPs according to Pfam domain family annotation (28). Several additional families are fairly well studied, including DEAD (DEAD/DEAH box helicase, PF00270), KH_1 (KH domain, PF00013), dsrm (double-stranded RNA-binding motif, PF00035) and HA2 (helicase-associated domain, PF04408). However, none of the RBPs with domains from two highly dominant domain families, LSM (PF01423) and GTP_EFTU_D2 (PF03144), has related RNA-Seq datasets yet, and they may be good candidates for future high-throughput sequencing studies. Moreover, among the RBPs without related RNA-Seq datasets, 140 RBPs already have one or more mouse models (Supplementary Table S1) on the International Mouse Strain Resource (IMSR) (29). For example, the gene Cleavage Stimulation Factor Subunit 2 Tau Variant (Cstf2t) has been demonstrated to be an important stage-specific regulator of Crem mRNA processing that controls Crem polyadenylation in mouse testis. Loss of Cstf2t leads to an overall decrease of the Crem mRNAs generated from internal promoters in Cstf2t-/- mice (30,31). Therefore, these 140 RBPs can be promising candidates for RNA-Seq studies in the future.

Web interface
To facilitate the use of RBPMetaDB, a user-friendly website has been launched. The website allows users to access all the key information related to the curated RNA-Seq datasets, including the GEO/ArrayExpress accession numbers, dataset titles, numbers of samples, associated RBPs, perturbation types and PubMed IDs (Figure 4). The contents in these fields are linked to the corresponding entries in GEO/ArrayExpress, metadata information for each dataset, MGI gene symbol and PubMed, respectively. In the table view of the website, the first 10 entries are shown by default, but the user can easily select the number of entries to be displayed from a pop-up menu on the left side (Label A). Each table has six columns about the metadata in RBPMetaDB (Label B), and all columns can be sorted in ascending or descending order by clicking column headers. The search boxes at the bottom of all the fields support field-specific search by regular expression (Label C). For example, to search for multiple gene symbols in the 'RNA binding proteins' column, one can specify the gene symbols joined by '|'. By searching a gene of interest, users can find all RNA-Seq datasets with the gene perturbed. A search for METTL3, an important enzyme involved in the post-transcriptional methylation of internal adenosine residues in eukaryotic mRNA (32), shows that RBPMetaDB greatly outperforms GEO in terms of search efficiency. When the keyword 'Mettl3' is searched on RBPMetaDB, it returns six highly accurate mouse RNA-Seq datasets from Mettl3 loss- or gain-of-function studies (Figure 5a). GEO returns 35 mouse RNA-Seq datasets with the query of 'Mettl3' in dataset titles and descriptions (Figure 5b), but it is impossible to directly identify which RNA-Seq datasets are from loss- or gain-of-function experiments of Mettl3. In contrast, RBPMetaDB does not return irrelevant datasets for a given RBP, and it returns more accurate results than GEO.
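The field-specific regular-expression search can be mimicked offline. The sketch below is only an analogy to the website's search box, written with Python's re module: the rows and accession labels are placeholders, and the website's exact matching rules (for example case sensitivity) are not specified in the text.

```python
import re


def search_column(rows, column, pattern):
    """Return the rows whose value in `column` matches the regex."""
    regex = re.compile(pattern)
    return [row for row in rows if regex.search(row[column])]


# Placeholder rows; accession labels are not real dataset identifiers.
ROWS = [
    {"accession": "DATASET_1", "RNA binding proteins": "Mettl3"},
    {"accession": "DATASET_2", "RNA binding proteins": "Ezh2"},
    {"accession": "DATASET_3", "RNA binding proteins": "Mettl14"},
    {"accession": "DATASET_4", "RNA binding proteins": "Pabpn1"},
]

if __name__ == "__main__":
    # '|' means "either symbol"; ^ and $ force a whole-field match.
    hits = search_column(ROWS, "RNA binding proteins", r"^(Mettl3|Ezh2)$")
    print([row["accession"] for row in hits])
```

Anchoring with `^` and `$` matters when one symbol is a prefix of another: the unanchored pattern `Mettl1` would also match `Mettl14`.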

Conclusions
RBPMetaDB provides the first comprehensive, manually curated database of mouse RNA-Seq datasets with specific RBPs being perturbed. At the time of writing, it consists of the annotation of 292 mouse RNA-Seq datasets. These datasets provide valuable information for studying RNA-binding activity of RBPs. To keep RBPMetaDB updated, every six months, we will extract an updated list of RBPs from GO annotation to search GEO and ArrayExpress for newly released mouse RNA-Seq datasets and curate them. The results will be added to RBPMetaDB. RBPMetaDB will provide a valuable resource for many different research communities to understand how RBPs are involved in a variety of biological or disease processes.

Supplementary data
Supplementary data are available at Database Online.