MicroRNAs (miRNAs) are small RNAs ∼22 nt in length that are involved in the regulation of a variety of physiological and pathological processes. Advances in high-throughput small RNA sequencing (smRNA-seq), one of the next-generation sequencing applications, have reshaped the miRNA research landscape. In this study, we established an integrative database, the YM500 (http://ngs.ym.edu.tw/ym500/), containing analysis pipelines and analysis results for 609 human and mice smRNA-seq results, including public data from the Gene Expression Omnibus (GEO) and some private sources. YM500 collects analysis results for miRNA quantification, for isomiR identification (incl. RNA editing), for arm switching discovery, and, more importantly, for novel miRNA predictions. Wetlab validation on >100 miRNAs confirmed high correlation between miRNA profiling and RT-qPCR results (R = 0.84). This database allows researchers to search these four different types of analysis results via our interactive web interface. YM500 allows researchers to define the criteria of isomiRs, and also integrates the information of dbSNP to help researchers distinguish isomiRs from SNPs. A user-friendly interface is provided to integrate miRNA-related information and existing evidence from hundreds of sequencing datasets. The identified novel miRNAs and isomiRs hold the potential for both basic research and biotech applications.
MicroRNAs (miRNAs) are a family of small RNAs, ∼22 nt in length, that act as key post-transcriptional regulators of gene expression, modulating the translational efficiency and/or the stability of target mRNAs. Small RNA sequencing (smRNA-seq), one of the next-generation sequencing (NGS) applications, enables detection and profiling of miRNAs with particularly high levels of sensitivity and accuracy (1). Furthermore, smRNA-seq allows discovery of previously uncharacterized miRNA species and has revealed unexpected complexity among miRNAs. These smRNA-seq discoveries include not only novel miRNAs but also a series of miRNA variants, termed isomiRs (2). In recent years, smRNA-seq datasets have grown rapidly and have been deposited in public databases such as the Gene Expression Omnibus (GEO) (3) and ArrayExpress (4). Exploration of these massive datasets, however, remains a daunting challenge, and an integrative meta-analysis of all smRNA-seq datasets has not yet been well performed.
For the detection and identification of novel miRNAs, smRNA-seq is very promising because it is not as time-consuming and expensive as small RNA cloning methods (5). Many sequencing software tools have been developed to identify novel miRNAs. Li et al. evaluated eight software tools, namely miRDeep (6), miRanalyzer (7), miRTRAP (8), MIReNa (9), mirTools (10), DSAP (11), miRNAkey (12) and mireap (13), based on their common features and key algorithms, and recommended the best tools in predicting novel miRNAs for different data types (5).
IsomiRs are commonly reported in deep sequencing studies (14–27) and are unlikely to be due simply to degradation or sequencing errors (25,28). These variants have been reported to be biologically relevant and functionally cooperative partners of canonical miRNAs (14,21). The variations present in IsomiRs can be grouped into three types: editing (nucleotide substitution), trimming and addition (29). The latter two types cause 5′ and 3′ end-length heterogeneity of miRNAs. Editing is a consequence of adenosine or cytidine deaminase activities and causes nucleotide changes at different positions of the mature miRNAs (15,16,28,30–33). It has previously been shown that several miRNAs, edited in the seed sequence and with an increased level of editing throughout development, result in diversifying target recognition (16). Trimming results in the shorter mature miRNAs compared with the canonical ones. The 3′-to-5′ exoribonuclease Nibbler has also been reported to control 3′ end processing of miRNAs in Drosophila (34,35). Non-template nucleotide additions at the 3′ end of miRNAs have been reported as the common form of miRNA enzymatic modification (18,21,28) and can influence miRNA stability (36) and the efficiency of target repression (37). It has further been revealed that the frequency of 3′ addition to specific miRNAs changes with differentiation of human embryonic stem cells (18). Several enzymes, such as MTPAP, PAPD4, PAPD5, ZCCHC6, ZCCHC11 and TUT1, have been reported to govern 3′ nucleotide addition to miRNAs (18,36–38).
The miR/miR* nomenclature has been used to represent the dominant and minor mature products of precursor miRNAs. However, several studies have reported that the arm that makes the dominant product can change in different tissues, stages and species (14,39–43). Such changes have been called ‘arm switching’ and are likely to be general (39). For example, Grimson et al. have reported an instance of developmental arm switching between the embryonic and adult stages of sponges for miR-2015 (43). Cloonan et al. also listed several miRNAs whose dominant arm changes over different human tissues (14). Therefore, in release 19 of the miRBase database (miRBase R19), the nomenclature for mature miRNAs now designates them as -5p and -3p, rather than miR/miR*, in all species.
In this study, we present the YM500 database, which includes pipelines for miRNA quantification, isomiR identification, arm switching discovery and novel miRNA prediction from smRNA-Seq data. YM500 contains the results of meta-analysis from hundreds of public smRNA-Seq datasets and dozens of in-house ones. YM500 aims to provide researchers with integrated miRNA-related information with various graphical visualization pages from hundreds of sequencing datasets via a user-friendly web interface.
MATERIALS AND METHODS
Data collection and pre-processing
As shown under Pre-processing in Figure 1, there are 609 Illumina smRNA-Seq datasets, including 468 human and 141 mouse ones, in YM500. 34 out of 609 are in-house datasets, and the others are from the GEO public repository. All in-house data were deposited in the GEO database with an accession number of GSE39841. Detailed information for the datasets is provided in Supplementary Table S1. 3′ adaptors of the FASTQ raw data were trimmed, and the trimmed data were collapsed by the FASTX toolkit (http://hannonlab.cshl.edu/fastx_toolkit). A QC report, including length distribution, box plot of phred quality scores, etc., was then generated. For the public datasets containing only sequences and read counts, we transformed them into FASTA format and examined their length distribution. It has been suggested that if the smRNA-seq reads are abundant in 19∼24 nt, they are good for miRNA analysis (44). Thus, datasets which did not fit this criterion were discarded. For all datasets, reads shorter than 17 nt or longer than 30 nt were filtered out.
Prediction of novel miRNAs and their downstream targets
It has been noted that there is no evidence that the miRNAs may represent fragments of mRNAs or other known RNA types (45). As shown in the Novel miRNA Prediction module in Figure 1, before predicting novel miRNAs, we used Bowtie (46) with options: -v 0 -f –norc to filter out reads that map to known miRNA precursors (miRBase) (45), other functional RNAs (Rfam) (47) and mRNAs (RefSeq) (48). Then, we adopted three prediction tools, namely miRDeep2, miReap and miRanalyzer, for predicting novel miRNAs. All of the prediction results were merged and unified to remove redundant records. According to our experience, there are still some predicted novel miRNAs that were mapped to known transcripts. We further applied BLAST to remove reads which were fully mapped to RefSeq or Rfam with identity >90%. To get a more reliable result, we also used miReNa as a second filter to filter out those that do not satisfy numerical Criteria I–V to describe a pre-miRNA in miReNa. About two-thirds of the putative miRNAs could not fulfil miReNa criteria and were filtered out. Finally, we filtered out the putative miRNAs located in exon regions (defined by RefSeq). These filtrations aimed to reduce the false positive rate. Filtration results are shown in Supplementary Figure S1. Most of the putative novel miRNAs that were predicted by multiple algorithms were preserved. There are 90 and 637 putative novel miRNAs predicted by three algorithms and any two algorithms, respectively (Supplementary Figure S1). Second structures of the stem-loop miRNA precursors were predicted by RNAfold (49). Furthermore, the target genes of these putative miRNAs were predicted by two well-known algorithms: TargetScan (50) and miRanDa (51). All information was stored in a local MySQL server.
miRNA quantification, isomiR identification and arm switching discovery
As shown in Figure 1, predicted novel miRNAs were combined with known miRNAs from miRBase R19. All pre-processed datasets were mapped to the combined miRNA list using Bowtie with options: -a -v 1 -S -f –norc, and then alignment results were produced in a BAM file format by SAMtools (52). The BAM files were processed by in-house JAVA software for miRNA quantification and isomiR identification. These analysis results were then stored in a local MySQL database. The arm switching events between two groups of samples were determined by two criteria: one was that the averaged RPKM (Reads Per Kilobase of transcript per Million mapped reads) value of a miRNA must be larger than 100. The second was that the ratios of two arms (5p/3p) in two groups must be significantly different according to the Student T-test with a P-value < 0.05 performed with the R language.
Validation of miRNA expression via stem-loop real-time PCR
Quantification of mature miRNAs was validated by a stem-loop real-time RT-qPCR system performed as previously described (53). Samples ranging from 100 ng to 1 μg of total RNA were used to perform reverse transcription (RT) using the RevertAidTM Reverse transcriptase kit (K1622; Fermentas, Glen Burnie, Maryland, USA) as directed by the manufacturer. Real-time PCR reactions were performed using MaximaTM SYBR Green qPCR Master Mix (K0222; Fermentas, Glen Burnie, Maryland, USA), and the specific products were detected and analysed using the StepOneTM sequence detector (Applied Biosystems, USA). Primers were designed on the basis of the sequenced miRNAs by using FastPCR (54). The miRNA expression data were normalized against U6 small nuclear RNA.
YM500 provides four interactive query interfaces (Expression, Novel miRNAs, isomiRs and arm-switching) and various graphical visualization pages to present the analysis results of hundreds of smRNA-Seq datasets.
YM500 allows expression visualization according to a user’s customized selections. This feature helps users to select miRNAs according to ID lists or miRNA cluster definitions (Supplementary Figure S2A). For a query regarding a single miRNA, we provide the histogram expression in all samples and the expression profiles by tissue type of the specific miRNA. For a query regarding multiple miRNAs, samples can be selected according to the annotation (Supplementary Figure S2B). A recheck page (Supplementary Figure S2C) helps users to select the specific miRNA and samples for heatmap visualization (Figure 2A). A download link for the normalized expression data of selected samples and miRNAs is also provided for further analysis. Differential expression of 114 miRNAs was confirmed by RT-qPCR. The Pearson's correlation coefficient, R, between NGS analysis and RT-qPCR results was 0.84 (Figure 2B).
Whenever researchers identify potentially novel miRNAs, they can search for existing evidence of the novel miRNAs among the hundreds of samples in YM500. Novel miRNAs can be searched for according to the exact sequence of a mature miRNA or genomic location. Figure 3A illustrates the provided mature and precursor miRNA information, including the prediction algorithms, the predicted target genes, the expression profiles, the RNA secondary structures and the hyperlinks to three commonly used genome browsers. A density plot of reads that mapped to the stem-loop precursor indicates the percentage of reads overlapping the loci (Figure 3B). This provides an overview of length heterogeneity for a miRNA. Figure 3C shows a view of the deep sequencing reads and illustrates all sequences that map to the same novel miRNA. In the same figure, we also provide read numbers and numbers of datasets/samples in which a novel miRNA sequence was found. Reads can be filtered according to the number of mismatches to the hairpin sequence, the read count and the number of datasets/samples. Furthermore, each specific sequence (for example, the sequence in the rectangle in Figure 3C) has a hyperlink to a page (shown in Figure 3D) which contains the detailed information for the sequence, including the expression histogram, the raw counts and the RPM (Reads Per Million mapped reads) in each sample.
This section helps researchers to find the isomiRs of the known miRNAs in miRBase. As shown in Figure 4A, users can define the criteria of isomiRs according to number of mismatch, number of read counts, number of expressed samples and isomiR types (trimming or addition at 5′/3′ end). Figure 4B illustrates the information of edited sites, which are determined by the editing rate in Figure 4A. Editing information is also compared to dbSNP (Build 135). As shown in Figure 4B, both editing and an SNP were found at the 23rd base of hsa-miR-211-5p. However, the type of nucleotide alteration is different, indicating that such nucleotide substitution is a putative editing event rather than a common variant. All of the isomiRs defined in Figure 4A are detailed in Figure 4C, which indicates the mismatch sites, number of reads and number of samples containing the isomiR. Similar to the view of NGS data in the Novel miRNAs part (Figure 3C), each sequence in Figure 4C has a hyperlink to a page, shown in Figure 3D, which summarizes the details of the specific isomiR. Figure 4D shows the summary information of all isomiRs of a specific mature miRNA by a sequence logo format (where the height of each character is proportional to the total read counts of a miRNA). Figure 4E is similar to Figure 4D, but the height of each character is normalized to the read counts in each base and indicates the editing rate in each base. Similar to Figure 3B, Figure 4F is a density plot showing the distribution of reads in a mature miRNA and illustrates the length heterogeneity.
As far as arm switching is concerned, YM500 provides two ways to investigate this phenomenon. YM500 allows users to select a specific precursor miRNA and profiles the expression of two arms between samples and tissues (Figure 5A). This helps researcher to quickly view arm switching events in a specific miRNA species. Another method of illustration allows users to select two groups of samples, according to annotations for the database, and YM500 will identify precursor miRNAs whose dominant expression switches from one arm to the other between the two groups (Figure 5B).
YM500 integrates the analysis results of miRNA profiling/quantification, isomiR detection, novel miRNA prediction and arm switching identification. The reliability of in silico data were tested via experimental validation on in-house samples or by paper survey. For quantification, our miRNA profiling results were highly consistent to those of RT-qPCR, with a Person’s correlation coefficient 0.84 (Figure 2D). For novel miRNA discovery, we used multi-algorithms to explore as many putative miRNAs as possible and adopted several filtration steps to reduce false positive discoveries. The definition of ‘novel miRNA’ in YM500 is with respect to miRBase and we are collecting the information of ‘novel miRNA’ claimed by other references as another source of evidences but this task is undergoing. A dozen of miRNAs have been validated in our lab for their existence and expression patterns by RT-qPCR (Supplementary Table S2). Using ‘NM_hsa_1300’ as an example (Figure 3), wetlab evidence such as the RT-qPCR melting curve proved its existence (Supplementary Figure S3). However, RT-qPCR can prove the ‘existence’ of putative miRNAs. We use Ago1/2-mediated RNA-immunoprecipitation (RNA-IP) plus further sequencing as another line of evidence, and found that >1000 novel miRNAs in YM500 are indeed associated with the RNA-induced silencing complex (RISC) (unpublished data). For isomiRs, a U-to-G substitution in the ninth base of mmu-let-7a-5p (32) is discovered in YM500. Such substitution events (putative editing event) may result in a significant increase in stability of down-regulated targets (32). The second example is that we also discovered three reported A-to-I putative editing events in three miRNAs, which were the 7th, 8th and 9th bases of the mature product, and edited in 25%, 18% and 11% of the reads in (55), indicating that additional variability is tolerated in the functionally important seed region.
As for an example of arm switching, for hsa-mir-154 (Figure 5A), Cloonan et al. have reported that the expression of the two arms would be switched in different tissues (14). They demonstrated the switch between expression dominance from the 3p arm (ovary) to the 5p arm (brain and placenta). It has been shown that alternative mature miRNAs produced from the same precursor have different targeting properties and therefore different biological functions (56). Hence, the changes in arm choice of hsa-mir-154 might have significant functional consequences. The expression of a specific arm may dominate some tissue-specific functions. Besides, hsa-mir-144 has been reported as a cancer/disease marker in several studies (57–59) but our arm switching results show that it might have significant tissue-specificity. It may need to be further investigated the mechanisms of hsa-mir-144 which are related to cancer and tissue-specificity.
The advantage of using smRNA-Seq for miRNA researches is that smRNA-Seq can discover novel miRNAs and isomiRs. At this point, the number and functions of isomiRs remain unclear, but it has been reported that they might be quite prevalent in creatures. YM500 allows researchers to define the criteria of isomiRs, as well as providing existing evidence for various isomiRs from hundreds of smRNA-Seq datasets. A representative example of expression patterns and raw read counts in each smRNA-seq dataset is shown in Figure 4. We defined a candidate isomiR by the criteria shown in Figure 4A (allow one mismatch; the read count and sample count are at least 100 and 5, respectively). At the same time, the editing event is defined by the editing rate at least 1% of the total reads (pass the criteria) mapped to hsa-miR-211-5p. (Please note that all of criteria could also be customer-defined). In this case, 13 sites of the canonical miRNAs have isomiRs with mismatch, but there is only one putative editing event in the 23rd base with 834 supporting reads in more than a dozen of datasets (Figure 4B and C). Besides, there are two distinct isomiRs that have the same editing site (the rectangle in Figure 4C). We also provide the detail information of each isomiR via hyperlinks on web interface. These isomiRs still need extensive wetlab validations to prove their existence and functionality. Especially, the editing events need genomic evidences from the same sample to rule out novel SNPs or mutations. However, YM500 is a screening tool to help researchers reduce the numbers of candidate isomiRs and could be severed as a line of evidences for their existence. We believe these customer-defined criteria and selections would help researchers to exclude most sequencing errors and other artifacts.
The same holds true for novel miRNAs. When a small RNA is consistently detected in various datasets, it does constitute autonomous evidence that it is prevalent and thus the likely result of a specific biogenesis (6). As shown in Figure 3, for a putative novel miRNA, YM500 provides the expression profiles, the prediction algorithms, the sequences, the counts mapping to the miRNA, the secondary structure, etc. This information helps researchers to evaluate the prediction result. For example, according to the suggestions of Kozomara and Griffiths-Jones (45), the pattern of reads focusing on the mature region (Figure 3B) supports a high-confidence miRNA annotation with multiple reads (10–20 as cutoffs) from independent datasets. In contrast, the pattern of reads overlapping the sequence of a putative miRNA does not support the annotation of a miRNA, with multiple offset reads distributed across the locus (45). Besides, most reads mapping to a given mature miRNA annotation should have the same 5′ end whereas the 3′ end may be significantly more variable (45). YM500 helps researchers to check these characteristics via the presentation page (Figure 3). For isomiRs and novel miRNAs, YM500 provides existing evidence from hundreds of smRNA-Seq datasets. Whenever researchers find interesting results (such as some specific isomiRs and novel miRNAs) in their own datasets, they can validate their results in YM500.
In comparison to other databases related to smRNA-seq, including miRBase, deepBase (60) and the isomiRs database of Lee et al. (24), YM500 analyse miRNA-related information from many dimensions, and datasets included in meta-analysis are more abundant. The deepBase database contains 185 smRNA-Seq datasets for seven organisms and is also a platform for annotating and discovering small and long non-coding RNAs (miRNAs, siRNAs, piRNAs, etc.) from next-generation sequencing data. The isomiRs database of Lee et al. contains only 18 smRNA-Seq datasets for two organisms and lists only the isomiR information in a single sample per query. Although miRBase covers much more species, miRBase does not include expression profiles of known miRNAs, RNA editing, arm switching or miRNA prediction results. For novel miRNA, YM500 adopts four different algorithms and provides target prediction, expression profiles of novel miRNA and hyperlinks to genome browsers. deepBase lists only the results of novel miRNA predicted by miRDeep, and no other additional information is provided. Besides, there are only 79, 9 and 5 human smRNA-seq datasets in miRBase, deepBase and the isomiRs database of Lee et al., respectively. YM500 contains 468 human datasets (from public databases or in-house), which is the most comprehensive collection so far. Another unique part of YM500 is that our interactive web interface and customer-defined criteria also help researchers to retrieve these four types of analysis results. Finally, to our best knowledge, there is no other resource providing arm switching information. Comparing with these databases, YM500 provides a flexible web interface, more enhanced resolution and novel findings owing to the integrated pipelines for miRNA (including miRNA expression, IsomiRs, novel miRNAs and arm-switching) and the large number of smRNA-Seq datasets of various tissue/cell types.
Supplementary Data are available at NAR Online: Supplementary Tables 1 and 2 and Supplementary Figures 1–3.
Funding for open access charge: National Science Council [NSC98-2320-B-010-020-MY3, NSC99-2320-B-350-003-MY3, NSC100-2627-B-010-007, NSC100-2314-B-075-066-MY2 and NSC101-2320-B-010-059-MY3]; Taipei Veterans General Hospital [V101E2-008 and Cancer Excellence Center Plan, DOH101-TD-C-111-007]; National Research Program for Biopharmaceuticals [DOH101-TD-PB-111-TM007]; National Yang-Ming University via the Ministry of Education, Aim for the Top University Plan; UST-UCSD International Center for Excellence in Advanced Bioengineering sponsored by the Taiwan NSC I-RiCE Program [NSC100-2911-I-009-101, in part].
Conflict of interest statement. None declared.