MeT-DB: a database of transcriptome methylation in mammalian cells

Methyltranscriptome is an exciting new area that studies the mechanisms and functions of methylation in transcripts. The MethylTranscriptome DataBase (MeT-DB, http://compgenomics.utsa.edu/methylation/) is the first comprehensive resource for N6-methyladenosine (m6A) in mammalian transcriptome. It includes a database that records publicaly available data sets from methylated RNA immunoprecipitation sequencing (MeRIP-Seq), a recently developed technology for interrogating m6A methyltranscriptome. MeT-DB includes ∼300k m6A methylation sites in 74 MeRIP-Seq samples from 22 different experimental conditions predicted by exomePeak and MACS2 algorithms. To explore this rich information, MeT-DB also provides a genome browser to query and visualize context-specific m6A methylation under different conditions. MeT-DB also includes the binding site data of microRNA, splicing factor and RNA binding proteins in the browser window for comparison with m6A sites and for exploring the potential functions of m6A. Analysis of differential m6A methylation and the related differential gene expression under two conditions is also available in the browser. A global perspective of the genome-wide distribution of m6A methylation in all the data is provided in circular ideograms, which also act as a navigation portal. The query results and the entire data set can be exported to assist publication and additional analysis.


INTRODUCTION
The past decades have witnessed the great progress in DNA methylation research (1,2). Yet, research on methylation in transcripts is still at its very early stage. Thus far, N6-methyladenosine (m 6 A) is known to be the most abound form of methylation in transcripts. The recent surge of interest in transcriptome-wide m 6 A methylation stemmed largely from two seminal studies (3,4), where a new enrichment-based high-throughput method called MeRIP-Seq, or methylated RNA immunoprecipitation sequencing, was introduced to identify for the first time the presence of over 12 000 m 6 A sites in about 20% of mammalian genes. Transcriptomewide analysis also revealed that m 6 A sites reside in the highly evolutionarily conserved regions with a consensus sequence motif of RRACH that is also conserved between human and mouse, suggesting the possibility of conserved regulatory functions of m 6 A. A series of subsequent papers that used MeRIP-Seq and other high-throughput methods have started to produce evidences and new hypotheses on involvement of m 6 A in regulating many functions (5).
Currently, analysis of m 6 A distribution in mRNA showed that m 6 A is enriched in the 3 untranslated region (UTR) and near stop codon. Among 3 UTRs that contain an m 6 A site, two-thirds also have at least one TargetScan-predicted miRNA binding site, suggesting that m 6 A might be involved in miRNA-mediated gene silencing. Existing studies also suggested that m 6 A recruits different RNA binding proteins (RBP) to mediate its own methylation cycle and to regulate various other cellular functions. For instance, transcriptome m 6 A methylation is known to be catalyzed by the methyltransferase complex that includes methyltransferase-like 3 (METTL3), Wilms tumor 1-associated protein (WTAP) and METTL14 (6). In-D198 Nucleic Acids Research, 2015, Vol. 43, Database issue terestingly, WTAP is a pre-mRNA-splicing regulator and a recent PAR-CLIP analysis also showed that METTL3 binding sites reside mostly in intronic regions of pre-mRNA (7). These evidences suggest a likely involvement of m 6 A in splicing but the splicing factors and the mechanism by which it regulates splicing is unclear. In addition to splicing, m 6 A has also been shown to promote mRNA degradation through recruiting the human YTH domain family 2 protein to the methylated mRNA and this recruitment redirects mRNA to mRNA decay compartment as opposed to ribosome for translation (8). In contrast, m 6 A was also suggested to stabilize mRNA, where loss of m 6 A in an mRNA is coupled with decrease in its expression level. The degree and context by which m 6 A would choose to meditate mRNA degradation over stability is completely unknown. Recent studies also showed that m 6 A methylation can be reversed by two endogenous enzymes, obesityassociated protein (FTO) and ␣-ketoglutarate-dependent dioxygenase alkB homolog 5 (ALKBH5) (9). However, the functional differences between the two demethylases and whether there exists additional demethylase are still elusive. Interestingly, knockout FTO in mouse midbrain impairs dopamine dopaminergic activity, which suggested that m 6 A might indirectly control neuronal activity and behavioral responses. These exciting advancements in transcriptome m 6 A methylation have inspired a new research area now known as methyltranscriptome. Continuing effort in deciphering m 6 A functions under various contexts represents a grand challenge to the biology community in coming years (10).
Along with the increased amount of m 6 A MeRIP-Seq data sets, there is a strong need for an integrated database that facilitates the exploration of these data sets. To meet this need, we created the first integrated MethylTranscriptome DataBase or MeT-DB (http://compgenomics.utsa. edu/methylation/). MeT-DB includes a comprehensive collection of m 6 A sites predicted from all published MeRIP-Seq data sets. To enhance its utility, MeT-DB also integrated m 6 A sites with the binding sites of miRNA, splicing factors and several RBP. It also includes gene expression data for every MeRIP-Seq data samples. The web-based interface also includes an internal genome browser with browsing and query functionalities optimized for exploration and analyses of MeRIP-Seq data. Particularly, the browser provides the analysis of differential methylation and differential expression of m 6 A methylated genes under two conditions recorded in the database. MeT-DB will help shed lights on the biological functions of m 6 A methyltranscriptome.

MATERIALS AND METHODS
A schematic diagram illustrating data sets and functionality of MeT-DB is shown in Figure 1. Currently, MeT-DB includes MeRIP-Seq and other related data for human (UCSC hg19) and mouse (UCSC mm9).

MeRIP-Seq data
The MeRIP-Seq technology combines the essential technologies of methylated DNA immunoprecipitation (MeDIP), immunoprecipitation of RNA-binding proteins (RIP) and RNA sequencing (RNA-Seq) to enable highresolution detection of transcriptome-wide methylation. MeRIP-Seq immunoprecipitates heavily fragmented, methylated transcripts [including mRNAs and long non-coding RNAs (lncRNAs)] and then sequences the pull-down RNA fragments for computational processing. Two types of samples, the immunoprecipitated (IP) and the control, are obtained for each experiment, where an IP sample includes sequenced reads from the methylated fragments that are immunoprecipitated and a control sample includes the sequenced reads from fragments in the cell transcriptome, which is essentially the RNA-Seq measurement of transcript abundance. The methylation sties, or peaks, are detected by examining the enrichment of the reads in IP samples relative to those in input samples (11).
MeT-DB includes a total of 74 m 6 A MeRIP-Seq IP/input samples (Supplementary Table S1) from 22 experimental conditions (GSE29714, GSE37005, GSE48037, GSE47217, GSE46705; Supplementary Table S2) collected from five independent studies (3,4,8,12,13) and the data sets were downloaded from NCBI Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo/) and/or Short Read Archive (SRA; http://www.ncbi.nlm.nih.gov/sra). The data set (GSE46880) from (14) was excluded because it has no input samples and thus does not conform completely to the MeRIP-Seq protocol. Methylation site prediction, or MeRIP-Seq peak calling, was performed on the data sets from each of the 22 conditions according to the following two processing pipelines. The first pipeline was proposed in (3), where for each condition, MeRIP-Seq IP and input sample files are first concatenated into a single IP and input fastq file, respectively. Then, reads in the IP/input files are aligned to the genome by Tophat2 (version 2.0.11) (15), which implicitly calls Bowtie2 (version 2.1.0) with default options. Afterwards, peak calling is performed on the input and IP BAM files by MACS2 (version 2.0.10.20131216) (16), a chromatin immunoprecipitation sequencing (ChIP-Seq) peak calling algorithm, with default significance level 0.05. The second pipeline is based on a specialized MeRIP-Seq peak calling tool, exomePeak R/Bioconductor package (11). exomePeak adopts an exome-based approach, which performs peak calling on the pooled exons of a gene instead of on the genome as MACS2 does to account for peaks spanning exon-exon junctions. As a result, exomePeak can also report the strand information of predicted peaks. More accurate background estimation is also implemented in ex-omePeak by using only reads from the exonic regions. ex-omePeak has been shown to achieve higher specificity than MACS2 in predicting methylation sites from MeRIP-Seq data (11). In this second pipeline, each IP/input sample is first aligned to the reference genome by Tophat2, and then exomePeak is applied to call peaks using all the samples from a condition with default options. The called peaks from biological replicates are merged internally by exome-Peak package.
For each predicted RNA methylation site, its chromosomal location including start/end position, strand information (only for exomePeak result) and fold enrichment are reported. In addition, annotation information is generated, including the gene identifiers of transcript (UCSC MeT-DB includes all public available mammalian MeRIP-Seq data sets for RNA methylation sites predicted using exomePeak and MACS2 algorithms. Other related data including binding sites of miRNAs, splicing factors, and RNA binding proteins as well as gene expression profiles were also integrated. MeT-DB also includes an internal genome browser to facilitate data query, visualization, analysis and export. ID, Entrez Gene ID, Gene Symbol, RefSeq ID) as well as the information about whether the m 6 A site is located on 5'UTR/Coding DNA Sequences (CDS)/3'UTR or mRNA/lncRNA, where a lncRNA is defined as a transcript with exonicregions longer than 200nt and without any known CDS defined in the transcript database.

Gene expression
Genome-wide expression levels for each of the 22 conditions were estimated from MeRIP-Seq input samples. Reads were first aligned to the corresponding genome (hg19 or mm9) with Tophat2 (15), and then Cufflinks (v2.1.1) (17) was employed to estimate the gene expression profiles with default settings and by using RefSeq gene annotations (http: //cufflinks.cbcb.umd.edu/igenomes.html).

miRNA target sites
To facilitate exploration of m 6 A and miRNA regulation, predicted miRNA target sites from TargetScan (version 6.2) (18) and miRanda (August 2010 release) (19) as well as Star-Base (version 2.0) (20) recorded sites that are detected by PAR-CLIP experiments were included.

Splicing factor and RNA binding protein binding sites
Six-hundred-and-fifty-five and one-hundred-and-twentyfive binding sties of human and mouse splicing factors, respectively, were obtained from SpliceAidF (v1.1 03/2013) (21). Each site includes the name of the binding splicing factor and the genome location of the binding site. Similarly, PAR-CLIP and HITS-CLIP detected binding locations of 24 human and 5 mouse RBPs were retrieved from StarBase.

Gene annotation and cancer genes
We use UCSC hg19 and mm9/refFlat table for transcript annotation, which includes RefSeq ID, symbols and other positional information. For human genes, we also downloaded HUGO annotation (August 2014) to extract all official gene symbols, as well as their associated 'previous Name' or 'Synonyms' to map all possible gene symbol input. m 6 A has also been implicated to be involved in cancer (22)(23)(24). To help further investigate this association, cancer genes and tumor suppressor genes were also imported to the database. Seven-hundred-and-sixty-one human and sixhundred-and-twenty-eight mouse tumor suppressor genes were obtained from TSGene database (25). In addition, 522 cancer genes were downloaded from COSMIC (v70) (26).

Database implementation, circular ideogram, web interface and availability
All data sets were processed and stored in a MySQL Database Management System (version 5.1.69) installed on an X86-64 server with Redhat Linux GNU OS. The database consists of 101 tables that comprise ∼1.74 billion records.
Circular ideograms at both genome and chromosome levels were created to display the distribution of exomePeak predicted sites. To generate an ideogram image, the site location data were parsed and formatted into data of segments, where for the chromosome image, a segment is 1 million bp in length, whereas for the genome image, a segment includes 5 million bp. The formatted site location data were then passed, along with the formatted locations of cancer gene data, to Circos (27) for image generation, where the methylation fold enrichments were scaled logarithmically. At last, individual links to display the data in the browser D200 Nucleic Acids Research, 2015, Vol. 43, Database issue were added to different sections and cancer genes in the ideogram.
A web application including an internal genome browser was designed to provide user-friendly access to database searching and data visualization. The database query, genome browser and user interface were developed using PHP 5.3.3, JavaScript, WZ Jsgraphics library v. 3.05, jQuery JavaScript library v1.8.3, XHTML 1.0 and SVG 1.1. The web application is publicly accessible at http: //compgenomics.utsa.edu/methylation/ and the Frequent Asked Questions (FAQ) section of the website provides detailed description of its functions and features.

Predicted m 6 A sites
A large number of m 6 A sites were identified by exomePeak and MACS2 from these 22 MeRIP-Seq data sets, and the result is summarized in Table 1 and Supplementary Table  S3- . This is expected because MACS2 reports a single peak that spans multiple exons as multiple peaks. However, for the same data set, the numbers of peaks reported by ex-omePeak and MACS2 are significantly correlated (Pearson correlation 0.6743622; P-value = 0.0005777; Figure 2B). MEME suite (28) was also employed to predict the sequence motif of exomePeak detected m 6 A sites. Despite large variation in the numbers of detected m 6 A sites across different MeRIP-Seq conditions, the most enriched consensus sequences are all similar to the previously reported RRACH motif of m 6 A (3,4) (Supplementary Table S7), suggesting that the exomePeak detected m 6 A sites should have high specificity.

Data browser
A web interface with an internal browser was developed for integrative exploration of the m 6 A sites predicted by ex-omePeak and MACS2 under different conditions (Supplemental Figure S1). The interface displays query results in a browser, which includes multiple tracks for visualizing the data. The gene tracks show the coordinates and structures of UCSC RefSeq genes. The gene symbols are displayed at the top of each gene track. The gene structure including UTRs, exons, introns and the strand direction is displayed according to the conventions of the UCSC genome browser. The strand information is also color-coded (blue for the positive strand and red for the negative strand) in addition to the arrows that indicate the direction from 5 to 3 of a gene. The site prediction tracks include the predicted sites and their degree of enrichment for each of the conditions as well as aggregated from all conditions. Each track displays bars spanning the predicted sites, whose heights are proportional to their methylation enrichments and whose color define their strand information consistent with that for gene tracks. The browser also includes tracks that display the miRNA target sites data from miRanda, TargetScan and StarBase, the splicing factor binding data from SpliceAidF, and StarBase CLIP data for RBP sites. Vertical lines are displayed at the location of all these sites and the name of the binding miRNA/splicing factor/RBP is shown when moving the cursor over the bar. In addition to tracks that display different data, the browser also provides a series of functions including navigation, track selection, data export, import data and linking to UCSC Genome Browser to display the present region of the genome. An alignment line that follows the cursor was also implemented to help align the same location in different tracks. The user can find more detailed description of the browser functions in the FAQ/Tutorial page of the website.

Circular ideogram
The circular ideograms display exomePeak predicted m 6 A sites and the methylation enrichment at genome and chromosome levels (Supplementary Figure S2). The genomelevel ideogram is clickable by chromosome, which will link to a chromosome ideogram, where the user can select a region within the chromosome or click a predefined cancer gene placed on the circular ideogram to display the associated data in the browser.

Query the database
The database can be queried in three different ways. First, it is possible to search by 'Gene Symbol', 'RefSeq ID' or 'Locus' by entering this information in the search box appeared in the homepage as well as the 'Browse' page. Second, the user can query through the circular ideogram by selecting a region of interest or a cancer gene. Third, the user can also select from the set of pre-linked genes of interest in the 'Genes to explore' section of the homepage. Clicking on a gene will trigger the display of default tracks in the browser.

Differential methylation and differential expression
The users can also investigate differential methylation as well as differential gene expression, from which the relationship between the context specific methylation and gene expression can be observed. To perform this analysis, the user can select the exomePeak/MACS2 predictions tracks from any two conditions by checking the checkboxes next to the two tracks. By clicking the 'Compare' button, the differential m 6 A methylation of the sites (methylation enrichment value from Condition 1 subtracts that of Condition 2) in the displayed region between the two selected tracks (the first checked checkbox counting from top will be considered as Condition 1 and the other is Condition 2 experiment) and the corresponding differential expression will be calculated (log2-transformed expression level of Condition 1 subtracted that of Condition 2) and displayed. For the differential methylation track, bars are placed at the differential methylation sites, where an up-right bar represents hypermethylation (more methylation enrichment in condition one), a down-ward bar represents hypo-methylation (less methylation enrichment in condition one) and the height of the bar is proportion to the degree of differential methylation (absolute logarithm fold peak enrichment). For the differential expression track, a block that spans the entire region of a differential expressed gene is placed, whose color  indicates the state of differential expression (up-regulation in red, down-regulation in green and no-change in black).

Custom track data upload
User can upload one track of his/her own peak calling result in BED format (either common-delimited or tab-delimited. For example, chr17, 7559000, 7560000, 2). The input text-box is at the upper-right corner, next to UCSC logo. The custom track will be displayed below aggregated/consensus tracks, and only peaks within current viewing genomic regions will be displayed. The uploaded data will not be stored in the server, and user can remove the track by pressing the 'Remove' button below the input text-box.

Data export
There are three different ways that the user can export the data. First, predicted sites by exomePeak and MACS2 from all tracks in the displayed genomic region of the browser can be exported by clicking the 'Export Data' icon at the upper-right corner of the browser. Second, peak prediction results of a user defined gene set can be exported as a common-separated value (CSV) file from the 'Download' page. Third, peak calling results for every sample by exome-Peak or MACS2 can be downloaded from Table 3 in the 'Download' page.

Utility of MeT-DB: a working example
If the user is interested in investigating any potential influence of m 6 A on a tumor suppressor gene, BRCA1, a query can be initiated by directly clicking the name BRCA1 listed in the 'Genes to explore' section of the homepage or by entering 'BRCA1' or its gene coordinate in the search box. The query results from the database including all the predicted m 6 A peaks in all BRCA1 isoforms by exomePeak and MACS2 as well as the binding sites of BRCA1 related miRNAs, splicing factors and RBPs are displayed in the genome browser (Supplementary Figure S1). The gene track shows that BRCA1 has six different isoforms. Examining the common peaks predicted by exomePeak and MACS2 from the aggregated prediction tracks revealed two clusters of m 6 A sites, one in the stop codon and 3'UTR and the other in an exon of four (4) BRCA1 isoforms. In the BRCA1 3'UTR, miRanda, TargetScan and StarBase all reported multiple target sites from a large number of miR-NAs, some of which are inside the m 6 A sites. The user can further examine the tracks of splicing factor and RBP sites, which reveals a binding site of splicing factor Hu Antigen R in the vicinity of m 6 A sites in the 3'UTR. For the second m 6 A site cluster, the user might be interested in investigating the conditions that lead to this isoform specific m 6 A methylation by inspecting the 'predictions per sample' tracks. In this case, the m 6 A site cluster is predicted by both algorithms to exist in sample 'HEK293T S1', 'HEK293T S1', 'HeLa2 ctrl' and 'HeLa2 KO M3', where the first two came from HEK293T cells, while the latter two came from HeLa cells. The user can also inspect the reads densities in the corresponding input and IP samples of 'HEK293T S1' by turning on the plus icon at the left side of the sample label 'HEK293T S1'. Reads enrichment in the IP sample relative to the input sample is visible at both predicted site clusters. If the user wishes to further assess the differential m 6 A methylation and the differential gene expression between sample 'HEK293T S1' and 'u2os ctrl', the analysis can be performed by first checking the checkboxes to the right of these two tracks and then click the 'Compare' button. The 'Differential Methylation' and 'Differential Expression' tracks will appear below the section of sample tracks. In this case, we see up-right bars at both site clusters in BRCA1 in the 'Differential Methylation' track, indicating hyper-methylation in sample 'HEK293T S1' relative to 'u2os ctrl' at these sites. In contrast, the 'Differential expression' track shows that BRCA1 is down-expressed (green block) in 'HEK293T S1' relative to 'u2os ctrl' with a log-fold change of -3.38. These displayed m 6 A peaks in the browser can be exported as a CSV file by clicking the 'Export Data' button at the top of the browser. If the user is also interested in gaining a global perspective of the methylation surrounding BRCA1, the user can navigate to the circular ideogram of chromosome 17 (Supplementary Figure S2).

DISCUSSION
MeT-DB is the first attempt to establish a resource of m 6 A methylation in mammalian transcriptome. MeT-DB corroborates the widespread existence of m 6 A in transcriptome by bringing together predicted m 6 A sites from all existing MeRIP-Seq data sets from different studies. It includes a build-in genome browser to facilitate visualization and comparison of m 6 A methylations in different contexts. As it provides wealth information related to m 6 A, it is a valuable resource for both experimental and computational biologists who aim to improve our understanding of the biological mechanisms and functions of m 6 A. As m 6 A MeRIP-Seq and other RNA methylation related technologies are being applied to a broader set of species, cell lines, tissues and conditions, the MeT-DB database will be continuously updated and improved. The MeT-DB database currently covers only m 6 A methylation. In the foreseeable future, data from other abundant posttranscriptional RNA methylation, such as 5-methycytocine (m 5 C), will be included as they become available.