miRBase is the primary online repository for all microRNA sequences and annotation. The current release (miRBase 16) contains over 15 000 microRNA gene loci in over 140 species, and over 17 000 distinct mature microRNA sequences. Deep-sequencing technologies have delivered a sharp rise in the rate of novel microRNA discovery. We have mapped reads from short RNA deep-sequencing experiments to microRNAs in miRBase and developed web interfaces to view these mappings. The user can view all read data associated with a given microRNA annotation, filter reads by experiment and count, and search for microRNAs by tissue- and stage-specific expression. These data can be used as a proxy for relative expression levels of microRNA sequences, provide detailed evidence for microRNA annotations and alternative isoforms of mature microRNAs, and allow us to revisit previous annotations. miRBase is available online at: http://www.mirbase.org/ .
miRBase is the primary online repository for microRNA sequences and annotations. The main aims of miRBase are:
(1) to curate a consistent nomenclature scheme by which novel microRNAs are named;
(2) to act as the central repository for all published microRNA sequences, and to facilitate online searching and bulk download of all microRNA data;
(3) to provide human-readable and computer-parsable annotation of microRNA sequences (for example, functional data, references, genome mappings);
(4) to provide access to the primary evidence that supports microRNA annotations; and
(5) to link to and aggregate microRNA target predictions and validations.
The miRBase database was established in 2002 (then called the microRNA Registry) to provide microRNA researchers with stable and unique gene names for their novel microRNA discoveries (aim 1) and an archive of all microRNA sequences (aim 2) ( 1–3 ). Official gene names assigned by miRBase should be used in the published version of articles that describe their identification; gene names are assigned in confidence for in-press manuscripts ( 1 ). The incorporation of short RNA deep-sequencing data into miRBase as evidence for microRNA annotation (aim 4) is described in this update. Expansion of textual and functional annotation (aim 3), and an aggregation service for the growing number of microRNA target predictions and validations (aim 5) are the subject of future work.
From its inception, miRBase was designed to be a focused resource that could make and facilitate significant contributions in a rapidly growing field. We currently capture data types from user submissions and from publications that describe novel microRNAs, for example, the experimental method used to identify the sequence. Alongside the microRNA name, miRBase assigns a stable accession number to each stem–loop and mature sequence to allow tracking of improved annotations between releases. The primary transcripts of almost all microRNAs remain unannotated, but we aim to develop a mechanism to include annotations as they are determined. Where genome assemblies are available, microRNAs are mapped to their locations, clusters of microRNAs are highlighted, and overlaps with annotated protein-coding genes are described. Families of microRNAs are constructed, and links are provided to entries in other databases, to predicted targets, and to the primary literature. miRBase provides several methods to access the sequence data: by browsing, sequence similarity, genomic coordinate intervals, keyword search and bulk download. MicroRNA gene nomenclature and details of the data types available in miRBase have been discussed previously ( 1–3 ). We focus here on growth of the database, recent developments to integrate deep-sequencing data, and future plans.
miRBase is a vital tool for microRNA research. To maintain the usefulness of the resource, we must develop tools that keep pace with the increased rates of microRNA discovery facilitated by next-generation technologies such as deep sequencing. The number of microRNAs deposited in miRBase has risen approximately exponentially ( Figure 1 ). In the last 3 years alone, the number of microRNA sequences in the database has almost trebled. At the time of writing, miRBase (release 16) contains over 15 000 microRNA loci, expressing over 17 000 distinct mature sequences, from 142 species.
Current semi-automated procedures for building miRBase entries from submitted data and from supplementary data to publications have been sufficient to keep pace with historical rates of microRNA identification. Since 2007, almost all of the growth of miRBase has been driven by deep-sequencing experiments. Each experiment may discover tens or hundreds of novel microRNAs. Many novel sequences are specific to a studied tissue or stage and are not conserved between species. Each experiment may also provide evidence for a large number of pre-existing miRBase entries. The computational and curatorial challenge involved in dealing with the increased data volume is significant. Our initial developments and improvements to incorporate data from deep-sequencing experiments into miRBase involve a sequence and read visualization tool and a pipeline for rapid mapping. These improvements provide the infrastructure to track increased rates of sequence deposition in the coming years. However, the number of publications that describe microRNA experiments and functional analysis is also increasing exponentially ( Figure 1 ). For example, in 2009 alone, over 2300 articles in PubMed reference the keyword ‘microRNA’ (including 477 reviews), only 8 years after the first use of the term in the literature. Manually curating the functional annotation contained in this corpus is currently impossible. Recent improvements in text-mining approaches for biomedical literature [reviewed in refs ( 4 , 5 )] provide some hope that automated textual annotation of microRNAs may become feasible in the medium term.
INCORPORATING DATA FROM DEEP-SEQUENCING EXPERIMENTS
As discussed, the majority of the evidence supporting microRNA annotations now comes from deep-sequencing experiments. We have developed an interface on the miRBase web site to view reads from RNA deep-sequencing data mapped to microRNA loci. Briefly, we identify entries in the Gene Expression Omnibus ( http://www.ncbi.nlm.nih.gov/geo/ ) whose short-read data correspond to references that describe novel microRNAs. We extract the read sequences and counts from the GEO entry and map the reads to the set of miRBase hairpin sequences for a given organism using Bowtie ( 6 ), allowing at most two mismatches between the read and the hairpin sequence. The first deep-sequencing data sets incorporated into miRBase are from human, Drosophila melanogaster , Arabidopsis thaliana , rice and three nematode genomes. Additional data sets will be added rapidly. Ideally, all microRNA sequences submitted to miRBase will be linked to GEO submissions, or followed by the submission of the deep-sequencing data for inclusion in miRBase.
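The mapping rule described above (each read placed on a hairpin with at most two mismatches) can be illustrated with a minimal sketch. This is not Bowtie and not the miRBase pipeline: it is a naive, ungapped scan written only to make the mismatch threshold concrete. The names `Hit`, `hamming` and `map_reads` and the toy sequences are assumptions introduced for illustration.

```python
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    read: str        # read sequence
    count: int       # number of times the read was sequenced
    start: int       # 0-based offset of the read within the hairpin
    mismatches: int  # mismatches between read and hairpin at that offset


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_reads(hairpin: str, reads: Dict[str, int],
              max_mismatches: int = 2) -> List[Hit]:
    """Place each read at its best ungapped position on the hairpin.

    Reads whose best position has more than `max_mismatches`
    mismatches are left unmapped. Ties keep the leftmost position.
    """
    hits: List[Hit] = []
    for read, count in reads.items():
        best = None
        for start in range(len(hairpin) - len(read) + 1):
            mm = hamming(read, hairpin[start:start + len(read)])
            if mm <= max_mismatches and (best is None or mm < best.mismatches):
                best = Hit(read, count, start, mm)
        if best is not None:
            hits.append(best)
    return hits


# Toy data (not real microRNA sequences).
toy_hairpin = "AAAACCCCGGGGUUUUACGU"
toy_reads = {"CCCCGGGG": 10, "CCCCGGGA": 2, "UUUUUUUU": 5}
toy_hits = map_reads(toy_hairpin, toy_reads)
```

In this toy case the first two reads map at offset 4 (with zero and one mismatch, respectively) and the third read is discarded because its best placement needs three mismatches. A production aligner such as Bowtie uses an index rather than a linear scan, so this sketch reflects only the acceptance rule, not the performance characteristics.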
The view of deep-sequencing data can be accessed from miRBase stem–loop and mature pages ( Figure 2 ). Reads can be filtered by the number of mismatches to the hairpin sequence, the read count and by experiment. Each experiment is annotated with species, tissue, stage and methodology information. These tags enable the user to search for experiments by tissue expression on the miRBase search page ( http://www.mirbase.org/search.shtml ). Read counts for mature microRNAs are commonly used as a proxy for relative expression levels. As the number of deep-sequencing experiments increases, these data provide extensive information about the expression profiles of microRNAs across tissues, stages and organisms. A good example of the power of these data can be seen in the small RNA data made available through the modENCODE project for D. melanogaster ( 7 ). The patterns of read mappings also provide valuable evidence for relative abundance of mature sequences from different arms, for isoforms of mature microRNA sequences, and for the confidence in a given microRNA annotation. For example, Figure 2 shows a number of isoforms of mature microRNAs derived from both arms of the dme-mir-317 hairpin precursor. The dominant mature sequence from the 3′-arm is 21 nt in length, whereas the mature sequence annotated in miRBase 16 is extended to 24 nt. The 5′-mature sequence is unannotated in miRBase 16. The miRBase view of read data therefore provides the user with a clear picture of the profile of short RNAs generated from a microRNA locus, and facilitates the correction of errors and omissions in future releases.
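The three filters described above (mismatches, read count and experiment) and the use of read counts as an expression proxy can be sketched as follows. This is an illustrative model only, not the code behind the miRBase web interface; the `MappedRead` record, the function names and the toy values are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set


@dataclass(frozen=True)
class MappedRead:
    sequence: str    # read sequence
    experiment: str  # identifier of the source experiment
    count: int       # number of times the read was sequenced
    mismatches: int  # mismatches to the hairpin sequence


def filter_reads(reads: Iterable[MappedRead],
                 max_mismatches: int = 2,
                 min_count: int = 1,
                 experiments: Optional[Set[str]] = None) -> List[MappedRead]:
    """Keep reads passing the mismatch, count and experiment filters."""
    return [
        r for r in reads
        if r.mismatches <= max_mismatches
        and r.count >= min_count
        and (experiments is None or r.experiment in experiments)
    ]


def counts_by_experiment(reads: Iterable[MappedRead]) -> Dict[str, int]:
    """Sum read counts per experiment as a crude expression proxy."""
    totals: Dict[str, int] = defaultdict(int)
    for r in reads:
        totals[r.experiment] += r.count
    return dict(totals)


# Toy records (hypothetical sequences and experiment names).
toy = [
    MappedRead("ACGUACGUACGUACGUACGUA", "exp_A", 500, 0),
    MappedRead("ACGUACGUACGUACGUACGUACGU", "exp_A", 3, 0),
    MappedRead("ACGUACGUACGUACGUACGUC", "exp_B", 40, 1),
]
exact_and_abundant = filter_reads(toy, max_mismatches=0, min_count=10)
per_experiment = counts_by_experiment(toy)
```

Raw totals are only comparable between experiments of similar sequencing depth; in practice counts would usually be normalised (for example, to the total number of mapped reads per experiment) before being compared.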
DEEP-SEQUENCING DATA AIDS DISCRIMINATION BETWEEN MICRORNAS AND OTHER RNA SPECIES AND FRAGMENTS
Guidelines for microRNA annotation were established in 2003 ( 8 ), requiring evidence of expression of a ∼22 nt sequence (for example, cloning, sequencing or northern blot), together with evidence for a microRNA precursor structure (predicted stem–loop flanking the mature sequence). Updated annotation criteria were recently suggested to distinguish microRNAs from other classes of short RNAs in plants ( 9 ). These standards have proved extremely powerful in maintaining a clean data set of microRNA sequences for the community.
The increased rates of microRNA detection afforded by deep-sequencing technologies provide challenges to the level of confidence required to annotate a sequence as a microRNA. A typical RNA deep-sequencing experiment will identify millions of short sequences. Increased coverage results in detection of sequences of ever-lower abundance. It therefore becomes more and more challenging to distinguish true microRNAs from fragments of other transcripts, other short RNAs and spurious transcription. The eukaryotic genome also contains millions of predicted hairpins, so a flanking stem–loop structure should be considered necessary but not sufficient to annotate a sequence as a microRNA. If poorly analysed, a single data set thus has the potential to generate a large number of dubious annotations, swamping the real microRNAs. However, correct interpretation of RNA deep-sequencing data provides several additional signals to help distinguish microRNAs from other sequences. A number of recent publications have attempted to define and use criteria based on patterns of mapped reads ( 10–14 ), and a consensus set of guidelines is starting to emerge:
(1) Multiple reads (10–20 are commonly used cutoffs) support the presence of the mature microRNA (preferably from multiple independent experiments);
(2) The reads map to an extended sequence region (e.g. an assembled contig), and the sequence flanking the putative mature microRNA folds to form a microRNA precursor-like hairpin with strong pairing between the mature microRNA and the opposite arm. Reads that map very many times to a genome sequence should be discarded;
(3) Mapped reads do not overlap other annotated transcripts (i.e. there is no evidence that the short reads may represent fragments of mRNAs or other known RNA types);
(4) Reads mapping to a locus support consistent processing of the 5′-end of the mature sequence (for example, the majority of reads overlapping a given mature microRNA annotation should have the same 5′-end; the 3′-end may be significantly more variable); and
(5) Ideally, reads will support the presence of mature sequences from both arms of the predicted hairpin (so-called miR and miR* sequences), and the putative mature sequences should base-pair with the correct 3′-overhang.
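Two of the guidelines listed above lend themselves to a simple computation: the minimum read support and the consistency of 5′-end processing. The sketch below is one possible reading of those two criteria, not an implementation used by miRBase or by the cited studies. The function names are assumptions, as are the default thresholds (10 reads, taken from the lower end of the range quoted above, and a strict majority sharing one 5′-end). Hairpin folding, transcript overlap and miR* evidence are not covered.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Each read is summarised as (start position on the hairpin, read count).
ReadStart = Tuple[int, int]


def five_prime_consistency(reads: Iterable[ReadStart]) -> float:
    """Fraction of all reads that share the most common 5'-end."""
    starts: Counter = Counter()
    for start, count in reads:
        starts[start] += count
    total = sum(starts.values())
    if total == 0:
        return 0.0
    return max(starts.values()) / total


def passes_read_criteria(reads: Iterable[ReadStart],
                         min_reads: int = 10,
                         min_fraction: float = 0.5) -> bool:
    """Check total read support and 5'-end consistency.

    Returns True only if at least `min_reads` reads overlap the locus
    and more than `min_fraction` of them begin at the same position.
    """
    reads_list: List[ReadStart] = list(reads)
    total = sum(count for _, count in reads_list)
    return (total >= min_reads
            and five_prime_consistency(reads_list) > min_fraction)


# Toy profiles (hypothetical positions and counts).
consistent_locus = [(40, 900), (41, 60), (39, 40)]        # one dominant 5'-end
scattered_locus = [(5, 4), (9, 4), (14, 4), (20, 4)]      # offset reads
```

With these toy profiles, `consistent_locus` passes (90% of reads share one 5′-end) and `scattered_locus` fails (no start position accounts for more than a quarter of the reads), which mirrors the qualitative contrast between a well-processed hairpin and a locus covered by offset fragments.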
Consistent 5′-end processing (point 4 above), and observation of miR and miR* sequences (point 5) appear to be crucial for discrimination between high-confidence microRNAs and fragments of other RNAs in deep-sequencing data ( 10–12 ). Figure 2 A shows the miRBase view of deep-sequencing reads mapping to the dme-mir-317 precursor region. The pattern of reads clearly supports a high-confidence microRNA annotation, with over 65 000 reads from 14 experiments ( 15 ) supporting the 5′-end of a mature sequence derived from the 3′-arm of the hairpin, and over 5000 reads supporting the miR* sequence (unannotated in miRBase 16). In contrast, the pattern of reads overlapping the ath-MIR2935 sequence does not support the annotation of a microRNA ( Figure 2 B), with multiple offset reads distributed across the locus. In addition, the Arabidopsis reads shown are isolated from different Argonaute complexes: the majority of reads are associated not with the microRNA AGO1 complex but with AGO4 ( 16 ). These data suggest that MIR2935 should be removed from the miRBase microRNA catalogue in future releases.
There are three main areas of future development of miRBase currently planned. We invite feedback and comments on these areas.
Improvement of community contribution of and access to microRNA data
miRBase is a community resource. The majority of the primary sequence data is submitted by users. We plan to improve methods and interfaces for both data access and data submission. For example, web services will allow programmatic access to all miRBase data, and batch search and download tools will be made available. We also plan to allow users to add and update textual annotation in a wiki interface, following the example of the Rfam database of RNA families ( 17 ), which has successfully developed a community annotation project using Wikipedia ( 18 ).
Expansion to include microRNA candidates and predictions
As discussed above, deep-sequencing experiments provide the majority of novel microRNA annotations. Next-generation experimental techniques also provide low-level evidence for comparatively large numbers of lower confidence sequences, often published as candidate microRNAs, which fall below the current standards and are therefore not present in miRBase. We plan to extend the database to include different classes of data as supplements, with associated confidence levels for each annotation, for example:
High-confidence microRNA genes, with support for the presence of both miR and miR*, sequenced in many copies from many independent samples, as described above;
Predicted homologues of known microRNAs in related organisms;
Low abundance sequences, cloned or sequenced in only a few copies, without miR* evidence;
Sequenced short RNAs that are absent from assembled genomes; and
Computational microRNA predictions.
High-confidence annotations will form the default microRNA set that is most relevant to the majority of users. Lower-confidence annotations will be provided as supplements to those data. Their inclusion will encourage validation of low-confidence sequences and promote the availability of experimental resources (for example, off-the-shelf microarrays).
A microRNA target aggregation service
From 2006, miRBase provided a set of microRNA target predictions, branded miRBase targets, in collaboration with the Enright lab ( 2 ). In 2009, the miRBase targets resource was devolved back into the Enright lab at the EBI, and rebranded microCosm. In its place, we will develop an aggregation service that integrates microRNA target predictions from all the popular target prediction sites [for example, TargetScan ( 19 ), PicTar ( 20 ), microCosm, DIANA-microT ( 21 ), microrna.org ( 22 )], together with lists of validated microRNA target sites [currently curated by resources such as TarBase ( 23 ), miRecords ( 24 ) and miRTarBase]. This will initially involve displaying the top hits from each algorithm on existing miRBase sequence pages. In the longer term, an interface to search for hits from the aggregated algorithms will be developed, and consensus predictions from subsets of the algorithms will be provided. All results will link to the original data sources. To eliminate any long-term curation burden, we will accept depositions of microRNA target predictions and validated targets from users and target prediction groups.
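The idea of consensus predictions from subsets of algorithms can be made concrete with a small sketch. The planned service is not yet specified, so everything here is an assumption for illustration: the `consensus` function, the representation of a prediction as a (microRNA, target) pair, the rule of counting supporting sources, and the placeholder source and gene names.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]  # (microRNA name, target gene name)


def consensus(predictions: Dict[str, Set[Pair]],
              min_sources: int = 2) -> Dict[Pair, Set[str]]:
    """Return pairs predicted by at least `min_sources` sources.

    `predictions` maps a source name to the set of pairs it predicts.
    The result maps each retained pair to the sources supporting it,
    so that every consensus hit can link back to its origins.
    """
    support: Dict[Pair, Set[str]] = defaultdict(set)
    for source, pairs in predictions.items():
        for pair in pairs:
            support[pair].add(source)
    return {pair: sources for pair, sources in support.items()
            if len(sources) >= min_sources}


# Placeholder data (hypothetical sources, microRNAs and genes).
toy_predictions = {
    "source_1": {("mir-A", "gene-1"), ("mir-A", "gene-2")},
    "source_2": {("mir-A", "gene-1"), ("mir-B", "gene-3")},
    "source_3": {("mir-A", "gene-1"), ("mir-A", "gene-2")},
}
agreed_by_two = consensus(toy_predictions, min_sources=2)
agreed_by_all = consensus(toy_predictions, min_sources=3)
```

Keeping the set of supporting sources alongside each pair, rather than a bare count, is what allows every aggregated result to point back to its original data source.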
The miRBase database is available online and free to all without restriction at: http://www.mirbase.org/ and for download in various formats (including FASTA sequences, GFF genome coordinates and MySQL database dumps) from ftp://mirbase.org/pub/mirbase/CURRENT/ . Nomenclature queries, feedback and comments are welcomed at firstname.lastname@example.org.
miRBase is funded by the Biotechnology and Biological Sciences Research Council (BB/G022623/1). Funding for open access charge: The BBSRC (BB/G022623/1).
Conflict of interest statement . None declared.
We thank the Wellcome Trust Sanger Institute for previous hosting and support, Simon Moxon for insight into deep-sequencing data in plants and Antonio Marco and Matthew Ronshaugen for helpful comments on the article.