RMBase: a resource for decoding the landscape of RNA modifications from high-throughput sequencing data

Although more than 100 different types of RNA modifications have been characterized across all living organisms, surprisingly little is known about the modified positions and their functions. Recently, various high-throughput modification sequencing methods have been developed to identify diverse post-transcriptional modifications of RNA molecules. In this study, we developed a novel resource, RMBase (RNA Modification Base, http://mirlab.sysu.edu.cn/rmbase/), to decode the genome-wide landscape of RNA modifications identified from high-throughput modification data generated by 18 independent studies. The current release of RMBase includes ∼9500 pseudouridine (Ψ) modifications generated from Pseudo-seq and CeU-seq sequencing data, ∼1000 5-methylcytosines (m5C) predicted from Aza-IP data, ∼124 200 N6-Methyladenosine (m6A) modifications discovered from m6A-seq and ∼1210 2′-O-methylations (2′-O-Me) identified from RiboMeth-seq data and public resources. Moreover, RMBase provides a comprehensive listing of other experimentally supported types of RNA modifications by integrating various resources. It provides web interfaces to show thousands of relationships between RNA modification sites and microRNA target sites. It can also be used to illustrate the disease-related SNPs residing in the modification sites/regions. RMBase provides a genome browser and a web-based modTool to query, annotate and visualize various RNA modifications. This database will help expand our understanding of potential functions of RNA modifications.

Although more than 100 types of RNA modifications have been described so far, most of them were thought to be abundant in tRNAs, rRNAs and snRNAs, but rare in mRNAs and in regulatory non-coding RNAs (ncRNAs).
Many novel functional roles of RNA modifications have been revealed by functional experiments in recent years. For example, m6A has been predicted to affect protein translation and localization (1–5), mRNA stability (18) and stem cell pluripotency (19,20). Pseudouridylation of nonsense codons suppresses translation termination both in vitro and in vivo, suggesting that RNA modification may provide a new way to expand the genetic code (21). Importantly, many modification enzymes are dysregulated or genetically mutated in many disease types (1). For example, genetic mutations in pseudouridine synthases cause mitochondrial myopathy and sideroblastic anemia (MLASA) (22) and dyskeratosis congenita (23). However, the relationships between genetic variants identified from genome-wide association studies (GWAS) and modification sites identified by the above-mentioned high-throughput methods remain unexplored.
Nucleic Acids Research, 2016, Vol. 44, Database issue D260
In this study, we developed RMBase to facilitate the annotation, visualization, analysis and discovery of RNA modification sites from large-scale modification sequencing data (Figure 1). In RMBase, we performed a large-scale integration of public RNA modification sites generated by high-throughput sequencing technology, and provided the RNA epigenetic map for the cell types that are presently available (Table 1). RMBase provides web interfaces to show the relationships between miRNA targets and RNA modifications. Furthermore, by integrating GWAS data into the database, RMBase can be used to illustrate clinically relevant RNA modification sites. Because it integrates more than 100 types of RNA modifications, RMBase is expected to help researchers investigate the potential functions and mechanisms of RNA modifications.

Identification and annotation of m6A modification sites
To obtain high-resolution m6A modification sites, we predicted exact m6A positions from MeRIP-Seq or m6A-seq peaks by searching for consensus DRACH motifs (where D denotes A, G or U, R denotes A or G and H denotes A, C or U), as described in previous studies (17,33). All of these exact m6A positions were annotated as described above.
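The motif search can be illustrated with a minimal Python sketch. It is not the pipeline used to build RMBase; the function name `find_m6a_candidates` and the `peak_start` offset are assumptions introduced only for this example, and strand handling is left out.

```python
import re

# DRACH on the RNA alphabet: D = A/G/U, R = A/G, then A, C, H = A/C/U.
# The zero-width lookahead lets overlapping motif occurrences be reported.
DRACH = re.compile(r"(?=([AGU][AG]AC[ACU]))")

def find_m6a_candidates(peak_seq, peak_start=0):
    """Return positions of the central A of every DRACH motif in a peak.

    peak_seq   -- peak sequence, read in the transcript's 5'->3' direction
    peak_start -- hypothetical offset of the first base of the peak
    """
    seq = peak_seq.upper().replace("T", "U")  # accept DNA-style input
    # The candidate m6A is the third base of the five-base motif.
    return [peak_start + m.start() + 2 for m in DRACH.finditer(seq)]
```

For instance, `find_m6a_candidates("CCGGACUAA", peak_start=100)` reports the A of the single GGACU motif, at offset 104. The lookahead matters because two motifs can share a base (the H of one can be the D of the next), which a plain non-overlapping scan would miss.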

Identification of disease-related SNPs in modification sites
As described in our previous study (34), disease/phenotype-associated SNPs were curated from published GWAS data provided by the NHGRI GWAS Catalog (35), Johnson and O'Donnell (36), dbGaP (37) and GAD (38). Additional SNPs in linkage disequilibrium (LD) with reported disease-related loci were selected with the criterion of an r² value over 0.5 in at least one of the four populations (CEU, CHB, JPT and YRI) in the genotype data of the HapMap project (release 28) (39). For each SNP, the rs ID was lifted to dbSNP build 141 based on the 'RsMergeArch.bcp' and 'SNPHistory.bcp' tables from dbSNP, and genomic coordinates were lifted to the hg19 assembly using the UCSC LiftOver tool. All of these disease-related SNPs or LD SNPs were intersected with the modification regions, defined as each modification site extended by an additional 10 nt in both the 5′ and 3′ directions. Modification regions were defined according to the binding length of modification synthases (1,4); for example, Fibrillarin (FBL, the methyltransferase) binds to complementary regions of at least 10 nt (40).
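The intersection of SNPs with the extended modification regions can be sketched as follows. This is an illustrative example under stated assumptions, not the code behind RMBase: the tuple layouts, the function name and the identifiers in the usage note are hypothetical, positions are taken to be 1-based on the same assembly, and strand is ignored.

```python
from collections import defaultdict

FLANK = 10  # nt added on each side of a modification site, as in the text

def snps_in_modification_regions(mod_sites, snps, flank=FLANK):
    """Pair each SNP with the modification regions that contain it.

    mod_sites -- iterable of (chrom, position, mod_id)
    snps      -- iterable of (chrom, position, rs_id)
    Returns a sorted list of (rs_id, mod_id, offset), where offset is the
    SNP position minus the modification position.
    """
    by_chrom = defaultdict(list)
    for chrom, pos, mod_id in mod_sites:
        by_chrom[chrom].append((pos, mod_id))

    hits = []
    for chrom, snp_pos, rs_id in snps:
        # Simple scan per chromosome; an interval index would scale better.
        for mod_pos, mod_id in by_chrom.get(chrom, ()):
            offset = snp_pos - mod_pos
            if -flank <= offset <= flank:
                hits.append((rs_id, mod_id, offset))
    return sorted(hits)
```

With a site at chr1:1000, a SNP at chr1:1010 is reported with offset 10, whereas a SNP at chr1:1011 falls just outside the 21-nt region and is dropped.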

Association analysis of miRNA targets with RNA modification sites
All miRNA-target interactions for human and mouse were downloaded from our starBase platform (41,42). All miRNA target sites were intersected with RNA modification sites to identify modifications that may influence miRNA-target interaction.
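One way to carry out such an intersection is a binary search over sorted modification positions, sketched below. The data layout, the function name and the inclusive 1-based coordinates are assumptions made for this example; the source does not describe how the intersection was implemented.

```python
from bisect import bisect_left, bisect_right

def modifications_in_targets(target_sites, mod_positions):
    """Map each miRNA target site to the modification positions inside it.

    target_sites  -- list of (chrom, start, end, strand, mirna); start and
                     end are 1-based and inclusive (assumed for this sketch)
    mod_positions -- dict mapping (chrom, strand) to a sorted list of
                     1-based modification positions
    """
    overlaps = {}
    for chrom, start, end, strand, mirna in target_sites:
        positions = mod_positions.get((chrom, strand), [])
        lo = bisect_left(positions, start)   # first position >= start
        hi = bisect_right(positions, end)    # one past the last position <= end
        if lo < hi:
            overlaps[(mirna, chrom, start, end, strand)] = positions[lo:hi]
    return overlaps
```

Keying the positions by chromosome and strand means a target site on the minus strand is only matched against modifications on the same strand. Target sites that contain no modification are simply absent from the result.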

Database and web interface implementation
All data sets were processed and stored in a MySQL Database Management System. The database query and user interface were developed using PHP and JavaScript. The query result table is based on jQueryUI and DataTables, a highly flexible tool for sorting and filtering the search results.

RMBase genome browser
We constructed the RMBase Genome Browser to provide an integrated view of reference sequences, modification sequencing data, aligned sequencing reads, RNA modification sites, protein-coding genes, ncRNA genes and transcripts. The RMBase Browser is built on JBrowse (43), a fast genome browser with smooth scrolling and zooming.

The genome-wide landscape of various RNA modification types
We integrated 139 025 RNA modification sites generated by 18 independent studies to profile the genome-wide modification landscape of more than 100 types of RNA modifications (Table 1). To provide more useful information, we generated extensive annotations and analyses for all RNA modification sites. Therefore, RMBase can be used to show that the numbers of modification sites of distinct modification types varied from several to thousands, and that the genomic context distributions of modification sites differed between modification types.

Figure 1. System overview of RMBase core framework. We integrated a large set of RNA modification sites generated by 18 independent studies to profile the comprehensive genome-wide modification landscape of more than 100 types of RNA modifications. Integrative analysis of RNA modification sites has shown extensive post-transcriptional modification of RNA. Our combined analysis of RNA modification data with GWAS and miRNA target data identified thousands of miRNA targets and disease-related SNPs residing in the modification sites. High-throughput modification sequencing data were mapped to genomes and displayed in the genome browser. All results generated by RMBase are deposited in MySQL relational databases and displayed in the visual browser and web page.

Annotating the association between RNA modifications and miRNA target sites
To help users investigate the association between RNA modifications and miRNA target sites, we collected all miRNA target sites supported by CLIP-Seq experiments from the starBase database (41) and associated these data with all RNA modification sites from RMBase. RMBase allows users to retrieve all the RNA modification sites located within miRNA binding sites reported so far.

Predicting GWAS-associated modification sites
Although GWAS have revealed a significant number of genetic variants related to diseases or phenotypes, a considerable portion of these identified loci have not been functionally explained to date (44). To help users explore whether some modifications may be the underlying cause of diseases or phenotypes, we collected a total of 87 677 unique disease-related SNPs from four public GWAS data sources.
In addition, we performed LD analysis to extract SNPs in high LD with disease-related SNPs, using a threshold of r² > 0.5 in at least one population from the HapMap CEU, CHB, JPT and YRI genotype data, which yielded a total of 895 968 disease-related or LD SNPs (34). By comparing the genomic coordinates of SNPs with all modification sites in human, RMBase can be used to illustrate the disease-related SNPs that map to modification sites.
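The LD selection rule (r² > 0.5 in at least one of the four HapMap populations) can be expressed as a short filter. The sketch below is illustrative only: the record layout `(population, snp_a, snp_b, r2)` and the function name are assumptions, not the HapMap file format or the authors' code.

```python
POPULATIONS = ("CEU", "CHB", "JPT", "YRI")
R2_THRESHOLD = 0.5

def select_ld_snps(ld_records, disease_snps, threshold=R2_THRESHOLD):
    """Return disease SNPs plus SNPs in LD with them in any population.

    ld_records   -- iterable of (population, snp_a, snp_b, r2)
    disease_snps -- set of rs IDs reported by GWAS
    """
    selected = set(disease_snps)
    for population, snp_a, snp_b, r2 in ld_records:
        # Keep only pairs strictly above the threshold in a listed population.
        if population not in POPULATIONS or r2 <= threshold:
            continue
        # LD is symmetric, so either member of the pair may be the GWAS SNP.
        if snp_a in disease_snps:
            selected.add(snp_b)
        if snp_b in disease_snps:
            selected.add(snp_a)
    return selected
```

Because every record is tested independently, a pair that fails the threshold in one population is still selected if it passes in another, which is what "at least one population" requires. Only SNPs directly linked to a GWAS SNP are added; LD partners of LD partners are not followed.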

The web-based exploration of different types of RNA modification sites
We provide five web interfaces (Pseudouridine/Ψ, m6A, m5C, 2′-O-Me and otherType) that display RNA modification sites of the various modification types. For each type of RNA modification, users can select the species in the query page. In the result page, the basic information on the modification sites is displayed in a data table that includes 10 distinct fields describing the details of each site. For each interface, the number of RNA modification sites is indicated in the bottom-left corner of the table. Users can click on a column title to sort RNA modification sites by features such as chromosome, genome position, the number of supporting experiments, modId, gene name or gene type. Users can enter a keyword in the search box to filter the results. Users can click on a modId within the table to open a detailed page that provides further information about that RNA modification site. The detailed information for a modification site includes a description of the site, the list of supporting experiments and the sequence extended by an additional 20 nt in both the 5′ and 3′ directions from the modification site. The 'PubMed ID' section enables retrieval of the primary articles that yielded the modification data; clicking the ID link opens the NCBI PubMed website. The interfaces for modSNP and modMirTar are organized similarly to the above-mentioned interfaces and additionally show disease-related SNP and miRNA-target interaction information. Users can explore the relationships between modification sites and SNPs or miRNA target sites in a similar way.

Visualization of various modification sequencing data using the RMBase genome browser
To facilitate visualization of the various modification sequencing data sets and exploration of RNA modification sites, we provide the RMBase genome browser, which is built on JBrowse (43) (Figure 2). In the query page of the browser, users can enter a genomic region or gene name of interest in the 'search term' box and select the corresponding genome assembly to gain an integrated view of various genomic features. Information on RNA modification sites, aligned reads generated by modification sequencing methods and gene annotations from Ensembl is provided. Figure 2 illustrates the visualization of the genomic context of the 'PseudoU site 871' modification site located within the MALAT1 lncRNA using the RMBase Browser. Users can click the '+' or '−' button at the top to zoom in or out on the center of the annotation tracks window. Users can open the track selection panel by clicking the 'Select Tracks' button located in the upper-left corner and choose different types of modification data sets derived from various cell lines or treatments. To explore RNA modification sites on a particular gene, users can type its gene symbol in the position textbox and then click the 'GO' button to update the display and see which modification sites are located within the gene.

Associating other data with modification sites using the web-based modTool server
We provide the web-based modTool, which offers a simple and user-friendly interface to annotate modification sites in genomic regions uploaded by the user. The user is required to select an organism and then upload genomic regions in the browser extensible data (BED) format. After the data have been submitted, a typical run of the modTool program may take several seconds or minutes to finish. The output of this program mainly consists of a data table that includes 10 distinct fields describing the details of the hits. The results include the query name, modification positions on the genome, modification type, the number of supporting experiments or studies, gene name, gene type (e.g. protein-coding or ncRNA) and gene region (CDS, 3′ UTR, exon, 5′ UTR, intron, intergenic). Users can reorder any columns in the result table, which makes it convenient to view and compare data in a user-defined layout. Moreover, keyword search is supported to narrow down the results. Only 200 hit entries are displayed in the table; users can obtain all results in text format by clicking on the 'export' button.
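The kind of lookup that modTool performs on uploaded regions can be approximated by the following sketch. It is not the modTool implementation: the function names and the `(chrom, position, mod_type)` layout of the modification list are assumptions for illustration, and only the first four BED columns are read.

```python
def parse_bed(lines):
    """Yield (chrom, start, end, name) from BED text; start is 0-based and
    end is exclusive, following the BED convention."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments and header lines
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
        yield chrom, start, end, name

def annotate_regions(bed_lines, mod_sites):
    """List (query_name, chrom, position, mod_type) for every modification
    site that falls in an uploaded region.

    mod_sites -- list of (chrom, position, mod_type) with 1-based positions
    """
    hits = []
    for chrom, start, end, name in parse_bed(bed_lines):
        for m_chrom, pos, mod_type in mod_sites:
            # A 1-based position p lies in a BED interval when start < p <= end.
            if m_chrom == chrom and start < pos <= end:
                hits.append((name, chrom, pos, mod_type))
    return hits
```

The comparison `start < pos <= end` reconciles the two coordinate systems: BED starts are 0-based and ends are exclusive, whereas the modification positions in this sketch are 1-based. Regions without a name column are labelled with their coordinates so that each hit can still be traced back to its query.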

DISCUSSION AND CONCLUSIONS
By integrating a large set of RNA modification sites derived from all available high-throughput modification sequencing methods (Pseudo-seq, CeU-seq, Aza-IP, MeRIP-Seq, m6A-seq, RiboMeth-seq) and public resources, RMBase reveals extensive post-transcriptional modification of RNA in mammals and yeast.
In comparison with other databases related to RNA modifications, including MODOMICS (26), RNAMDB (45) and MeT-DB (46), which collect modification sites identified by traditional experimental methods or contain one modification type only, the advantages of RMBase are as follows: (i) RMBase provides the annotation and analysis of various public modification sequencing data generated by Pseudo-seq, CeU-seq, Aza-IP, m6A-seq and RiboMeth-seq, which are the newest high-throughput technologies for the transcriptome-wide identification of RNA modification sites in both animals and plants. disease-related SNPs resided in the modification sites. These results will help to reveal the underlying causes and mechanisms of diseases or phenotypes identified from GWAS studies. (v) RMBase also illustrates relationships between RNA modification sites and miRNA target sites. (vi) RMBase includes a genome browser that gives a quick overview of a particular region in the genome and allows various types of features to be visually correlated (Figure 2). This browser provides an integrated view of modification sequencing data, RNA modification sites, protein-coding genes and ncRNA genes (Figure 2). (vii) RMBase provides comprehensive annotation of various types of RNA modifications (Figure 1) and a new web-based tool, modTool, to annotate modification sites in genomic regions uploaded by the user. (viii) RMBase provides a variety of interfaces and graphic visualizations to facilitate analysis of the massive and heterogeneous modification data in normal tissues and cancer cells.
Overall, RMBase allows an integrative analysis of various high-throughput modification data that provide insights into the epigenetic regulation of the transcriptome. As genome-wide high-throughput sequencing data for RNA modifications become more and more available, RMBase will help researchers further investigate these data and discover potential functional roles of RNA modifications hidden in these data.

FUTURE DIRECTIONS
With the development of new high-throughput modification sequencing methods, more and more single-nucleotide-resolution modification sequencing data will become available. We have built an automatic pipeline that runs on our high-performance computer servers to annotate, analyze and merge all high-throughput modification data sets, and then import these data into our local MySQL database. We will continue to maintain and update the database every two months or whenever new high-throughput modification data sets are released in public databases. We will continue to expand the storage space and improve the server performance of RMBase for storing and analyzing these new data, and we will develop or integrate new tools to decode the landscape of RNA modifications.

AVAILABILITY
RMBase is freely available at http://mirlab.sysu.edu.cn/rmbase/. The RMBase data files can be downloaded and used in accordance with the GNU Public License and the license of primary data sources.