Guillem Ylla, Tianyuan Liu, Ana Conesa, MirCure: a tool for quality control, filter and curation of microRNAs of animals and plants, Bioinformatics, Volume 36, Issue Supplement_2, December 2020, Pages i618–i624, https://doi.org/10.1093/bioinformatics/btaa889
Abstract
microRNAs (miRNAs) are essential components of gene expression regulation at the post-transcriptional level. miRNAs have a well-defined molecular structure, which has facilitated the development of computational and high-throughput approaches to predict miRNA genes. However, due to their short size, miRNAs have often been incorrectly annotated in both plants and animals. Consequently, published miRNA annotations and miRNA databases are enriched for false miRNAs, jeopardizing their utility as molecular information resources. To address this problem, we developed MirCure, a new software tool for quality control, filtering and curation of miRNA candidates. MirCure is an easy-to-use tool with a graphical interface that allows both scoring of miRNA reliability and browsing of supporting evidence by manual curators.
Given a list of miRNA candidates, MirCure evaluates a number of miRNA-specific features based on gene expression, biogenesis and conservation data, and generates a score that can be used to discard poorly supported miRNA annotations. MirCure can also curate and adjust the annotation of the 5p and 3p arms based on user-provided small RNA-seq data. We evaluated MirCure on a set of manually curated animal and plant miRNAs and demonstrated high accuracy. Moreover, we show that MirCure can be used to revisit previous bona fide miRNA annotations to improve miRNA databases.
The MirCure software and all the additional scripts used in this project are publicly available at https://github.com/ConesaLab/MirCure. A Docker image of MirCure is available at https://hub.docker.com/r/conesalab/mircure.
Supplementary data are available at Bioinformatics online.
1 Introduction
Since their discovery in 1993 (Lee et al., 1993; Wightman et al., 1993), microRNAs (miRNAs) have emerged as important regulators of gene expression at the post-transcriptional level. miRNAs are short (∼22 nt) non-coding RNAs that bind to target messenger RNAs (mRNAs) to block their translation and/or trigger their degradation. Owing to their precise and powerful mechanism of action, miRNAs regulate a myriad of biological processes and are involved in numerous diseases (Rupaimoole and Slack, 2017).
miRNA biogenesis begins with the transcription of the miRNA primary transcript, a long RNA containing one or more miRNA precursors, each of which folds into a hairpin-like structure. These hairpins are cleaved by endonucleases (Drosha in animals and Dicer-like 1 in plants) to release the miRNA precursor. In plants, the loop of the miRNA precursor is cleaved again by Dicer-like 1 and the resulting RNA duplex is exported to the cytoplasm. In animals, the nuclear miRNA precursor is first exported and then the hairpin loop is cleaved by Dicer. Both pathways result in a cytoplasmic ∼22 nt RNA duplex with a 2 nt overhang at the 3′ end of each RNA strand (Axtell and Meyers, 2018; Budak and Akpinar, 2015; Ha and Kim, 2014; Millar and Waterhouse, 2005). This RNA duplex is recognized by the RISC complex, in which the Argonaute proteins select one of the two strands while the other is degraded. The strand chosen by RISC is called the ‘mature arm’ or ‘mature miRNA’, while the other is called the ‘star arm’ and is often represented as ‘miRNA*’. The mature miRNA is complementary to its target mRNAs at a small seed sequence (6–8 nt), which mediates the binding to the target.
Prediction of miRNA genes is a common step in genome annotation pipelines. Frequently used strategies to predict miRNAs in a newly sequenced genome employ ab initio, homology and high-throughput sequencing-based methods, and more recently, machine-learning approaches. However, miRNA annotation is challenging due to the miRNA’s short length and low conservation outside the 6–8 nt that compose the seed region.
The ab initio miRNA prediction methods predict miRNA genes without any data other than the genome sequence. These methods are usually based on the identification of hairpin structures from nucleotide sequences and on the use of machine learning (Allmer and Yousef, 2012). Software implementing these methods includes MirPred (Brameier and Wiuf, 2007), MirPred-Random Forest (Jiang et al., 2007), MirPred-SVM (Ng and Mishra, 2007), Triplet-SVM (Xue et al., 2005) and MiRPara (Wu et al., 2011). Ab initio methods are useful for identifying novel miRNAs. However, they tend to have high false positive rates, as hairpin-like structures are also present in other kinds of non-coding RNAs (Sacar and Allmer, 2014).
The homology-based methods, such as MirScan (Kaufman and Miska, 2010), MapMi (Guerra-Assunção and Enright, 2010), miROrtho (Gerlach et al., 2009) and miRNAminer (Artzi et al., 2008), rely on sequence conservation with previously annotated miRNA genes. Some of these methods also incorporate the prediction of secondary structures to increase their accuracy. However, these methods are limited by their inability to annotate novel miRNAs and by the accuracy of the database used as a reference.
With the expansion of high-throughput sequencing technologies, new methods for miRNA predictions emerged based on small RNA-seq. These methods typically rely on the mapping of ∼22 nt sequencing reads to genome regions and are aided by other sorts of evidence, such as conservation with close species and expression decrease upon Dicer knock-down (Berezikov et al., 2006; Cristino et al., 2011; Reddy et al., 2009; Ruby et al., 2006). The most commonly used software for miRNA annotation based on sequencing data is mirDeep (Friedländer et al., 2008) and its variants mirDeep* (An et al., 2013), mirDeep2 (Friedländer et al., 2012), mirDeep-P (Yang and Li, 2011) and miRPlant (An et al., 2014). These methods combine small RNA-seq data, secondary structure predictions and probabilistic models to identify miRNA candidates. Even though these tools are generally better than homology and ab initio methods, they still suffer from high false positive rates (Axtell and Meyers, 2018). Therefore, new tools are constantly being developed to improve the accuracy and precision of miRNA annotations (Lei and Sun, 2014; Paicu et al., 2017).
Among miRNA databases, miRBase (Kozomara and Griffiths-Jones, 2011) is perhaps the most widely used. However, miRBase does not curate miRNAs, and acts more as a repository giving the responsibility of miRNA curation to submitters. Consequently, it has been reported that only 16% of metazoan miRNAs in miRBase are robustly supported (Fromm et al., 2015). Even for humans, only about a third of the miRNA genes from miRBase have solid evidence (Fromm et al., 2015).
This heterogeneity among miRNA prediction tools results in large differences in the number, identity and quality of annotated miRNAs. Often, the miRNA annotations published for a given species by different authors, or available in different databases, barely overlap. This creates uncertainty when designing miRNA experiments, analyzing small RNA expression data or training machine-learning models for prediction. Currently, the few sources of reliable miRNA annotations are those that went through an extensive manual curation process. This process involves running different algorithms and collecting different pieces of information that are then put together for human assessment, which is a tedious and time-consuming effort. To assist researchers in the selection of bona fide miRNAs from a list of putative miRNA annotations, we developed MirCure. MirCure is a Shiny tool for quality control and filtering of both existing miRNA annotations and de novo miRNA predictions. MirCure computes a quality score based on the automatic evaluation of different sources of miRNA evidence, and this score can be used to accept, reject or revisit miRNA calls. Moreover, MirCure generates a visual report for each miRNA candidate in which all the relevant miRNA biogenesis criteria are represented, thereby greatly facilitating the curation process. We validated MirCure on a set of extensively curated miRNA annotations for both animals and plants. To our knowledge, MirCure is the first tool to perform automatic quality control of miRNA annotations based on small RNA-seq and strict biogenesis criteria.
2 Materials and methods
2.1 MirCure software
The MirCure toolset consists of three software components: the MirCure R application and two supporting scripts that convert source files to the formats MirCure requires. MirCure is written in R with a Shiny interface and runs on both desktop computers and HPC clusters. To facilitate installation, MirCure is available as an R package that automatically installs all R dependencies, as well as a Docker image (see Supplementary Material for installation and usage instructions). MirCure requires miRNA genome annotation data provided as three gff3 files, describing the precursor annotations and the annotations of the two miRNA arms. The gff3 files corresponding to the miRNA arms may use 5′/3′ or mature/star as annotation styles. When arms are annotated as 5′/3′, MirCure considers as mature the arm with the highest number of small RNA-seq reads. Additionally, the PreapareMirbase script is used to transform miRBase annotations to a MirCure-compatible format, which includes the prediction of the often-missing star arm annotations. Finally, we include mirDeep_2_mirCure.R to convert the output of mirDeep2 (Friedländer et al., 2012) to the MirCure format. MirCure also requires as input the genome sequence (fasta) of the organism and a bam file with small RNA-seq data mapped to this genome sequence.
The MirCure software runs the pipeline described in Figure 1, which implements state-of-the-art guidelines for robust miRNA annotation in animals and plants by combining expression evidence (i.e. small RNA-seq data) and biogenesis information (Ambros et al., 2003; Fromm et al., 2015; Meyers et al., 2008; Tarver et al., 2013; Taylor et al., 2014; Voinnet, 2009). This pipeline first adjusts miRNA annotations based on small RNA-seq data, and then evaluates three different aspects of miRNA quality: secondary structure, gene expression and conservation. For each of these aspects, a score is calculated and relevant graphical outputs are generated. The final quality score assigned to each miRNA candidate is a function of the subscores calculated in the three MirCure steps (Table 1). The weight of each subscore in the final score calculation differs between animals and plants because of the differences in their miRNA biogenesis.

Fig. 1. Schematic representation of the MirCure pipeline. After loading the input files and retrieving the sequences from the genome, MirCure adjusts the annotation of the two miRNA arms based on small RNA-seq reads. Then, the application calculates the precursor sequence fold and evaluates its structure. In the next steps, MirCure checks the expression of different miRNA regions, evaluates the conservation of the mature arm, and integrates all the collected information into a final score. Results are reported graphically.
Table 1. Features evaluated by MirCure
Feature | Condition | Score range or penalty value (animals) | Score range or penalty value (plants) |
---|---|---|---|
Structure | | | |
Hairpin-like | Lack of loop | −5 | −5 |
Arms complementarity | <16 nt complementary | −5 | −5 |
Endonuclease cleavage | 2 nt overhang on the 3′ end | [0, 1] x2 | [0, 1] x2 |
Arm length | Arm length <20 or >26 nt | −2 x2 | 0 |
Expression | | | |
miRNA arms | Zero reads in both arms | −5 | −5 |
 | Reads in both miRNA arms | [0, 2.5] | [0, 2.5] |
Loop | Reads mapping in the loop | [−2, −0.75] | [−2, −0.75] |
Flanking regions | Reads mapping in the two flanking nucleotides | [−2, −0.25] | [−2, −0.25] |
Read homogeneity at 5′ end | No homogeneity | −0.5 | −0.5 |
Conservation | | | |
Mature arm conservation | No putative orthologs found | −0.25 | −0.5 |
 | Putative orthologs found | [0, 1] | [0, 2] |
Note: Ranges are given in square brackets; penalties are the negative values (indicated in red in the original table). ‘x2’: the score or penalty applies to each of the two miRNA arms. More details are provided as Supplementary Table S1.
2.2 Score calculation
MirCure analyzes miRNA annotations according to each of the three evaluation criteria listed in Table 1 and described in detail in Supplementary Table S1, creating a subscore for each of them. The relative weight of each subscore reflects the importance of the feature toward identifying bona fide miRNAs and is adjusted based on area under the curve (AUC) values.
MirCure curates the miRNA arm annotations based on small RNA-seq read coverage. For this, small RNA-seq data mapped against the reference genome are used to calculate the per-nucleotide read coverage of the extended miRNA precursors (by default, the precursor plus 11 flanking nucleotides). When MirCure detects that any of the four nucleotides flanking the two annotated miRNA arms has a read coverage >80% of the mean coverage of the arm, it extends the arm annotation to include those nucleotides. Conversely, when the ends of an annotated arm are not supported by a read coverage above 80% of the arm coverage, MirCure shortens the arm annotation. This step fixes a common error in miRNA annotations, where the miRNA arms are misannotated by a few nucleotides. Both the user-provided annotation and the MirCure-adjusted ends are evaluated in the next step, and the model with the highest score is returned. MirCure then evaluates the length of the two miRNA arms, considering that bona fide miRNAs in animals should be between 20 and 26 nt long. For each arm longer than 26 nt or shorter than 20 nt, MirCure gives a score penalty (−2) to the miRNA annotation.
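The arm adjustment and length check described above can be illustrated with a minimal Python sketch. It is not the MirCure implementation (which is written in R). The function names, the half-open coordinate convention and the limit of two positions moved per end (`max_shift`) are assumptions made for illustration, as is the choice to compute the 80% threshold once from the originally annotated arm.

```python
from statistics import mean

def adjust_arm(coverage, start, end, ratio=0.8, max_shift=2):
    """Extend or trim an arm annotated as coverage[start:end] (half-open).

    The threshold is `ratio` times the mean coverage of the annotated arm.
    `max_shift` (positions moved per end) is an assumed, illustrative limit.
    """
    threshold = ratio * mean(coverage[start:end])
    new_start, new_end = start, end

    # 5' end: extend outward while the neighbouring base is well covered.
    moved = 0
    while moved < max_shift and new_start > 0 and coverage[new_start - 1] > threshold:
        new_start -= 1
        moved += 1
    # If nothing was added, trim inward while the terminal base is poorly covered.
    if new_start == start:
        trimmed = 0
        while trimmed < max_shift and new_end - new_start > 1 and coverage[new_start] <= threshold:
            new_start += 1
            trimmed += 1

    # 3' end: same logic, mirrored.
    moved = 0
    while moved < max_shift and new_end < len(coverage) and coverage[new_end] > threshold:
        new_end += 1
        moved += 1
    if new_end == end:
        trimmed = 0
        while trimmed < max_shift and new_end - new_start > 1 and coverage[new_end - 1] <= threshold:
            new_end -= 1
            trimmed += 1

    return new_start, new_end

def arm_length_penalty(length, organism):
    """-2 for an animal arm shorter than 20 nt or longer than 26 nt, else 0 (Table 1)."""
    if organism == "animal" and (length < 20 or length > 26):
        return -2
    return 0

# Toy coverage vector: the annotation misses one covered base on the 5' side
# and includes one poorly covered base on the 3' side.
toy_coverage = [0, 0, 90, 100, 100, 100, 100, 100, 5, 0]
print(adjust_arm(toy_coverage, 3, 9))
```

In this toy case the arm is shifted by one nucleotide at each end, which is the kind of small misannotation the adjustment step is meant to correct.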
MirCure checks for evidence of miRNA biogenesis signatures. The secondary structure of the extended precursor miRNAs is predicted using the RNAfold tool from the ViennaRNA package (Lorenz et al., 2011), implemented in R within the LncFinder package (Han et al., 2019). Based on the predicted secondary structure, MirCure calculates the number of matching and non-matching bases in the miRNA:miRNA* duplex, and the presence of the endonuclease (Dicer and/or Drosha) cleavage signature, namely a 2 nt overhang on the 3′ sequence at both ends. If the number of complementary nucleotides between the two arms is lower than 16, the miRNA candidate gets a score penalty (−5). A hard score penalty (−5) is also applied to those precursors that do not display a hairpin structure with a loop region. MirCure also checks the endonuclease cleavage sites and gives a score of +1, 0.75, 0.5 or 0, decreasing from an optimal through a sub-optimal to an incorrect cleavage signature.
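As an illustration of the complementarity and hairpin checks, the following Python sketch works on a dot-bracket string, the notation in which RNAfold reports secondary structures. It is a simplified stand-in, not MirCure's code: the function names are hypothetical, arm coordinates are half-open indices into the structure string, and treating a fold as hairpin-like only when it contains exactly one hairpin loop is an assumption of this sketch. The cleavage overhang score is left out.

```python
import re

def pair_table(structure):
    """Map each paired position of a dot-bracket string to its partner."""
    stack, partner = [], {}
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            j = stack.pop()
            partner[i], partner[j] = j, i
    return partner

def structure_penalties(structure, arm5, arm3, min_pairs=16):
    """Return (paired bases between the arms, total structure penalty)."""
    partner = pair_table(structure)
    # Count 5' arm bases whose partner lies inside the 3' arm.
    paired = sum(
        1 for i in range(arm5[0], arm5[1])
        if arm3[0] <= partner.get(i, -1) < arm3[1]
    )
    # A hairpin loop is an unpaired stretch closed by a single base pair.
    hairpin_loops = len(re.findall(r"\(\.+\)", structure))
    penalty = 0
    if hairpin_loops != 1:   # assumed reading of "no hairpin with a loop"
        penalty -= 5
    if paired < min_pairs:   # fewer than 16 complementary nucleotides
        penalty -= 5
    return paired, penalty

# Idealised 50 nt hairpin: 22 paired bases per arm and a 6 nt loop.
perfect = "(" * 22 + "." * 6 + ")" * 22
print(structure_penalties(perfect, (0, 22), (28, 50)))
```

A real precursor fold contains bulges and mismatches, so the paired count is usually lower than the arm length. The sketch only shows how the two −5 penalties of Table 1 could be derived from the fold.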
MirCure assesses the expression evidence of the miRNA candidates. Using the provided small RNA-seq data, MirCure calculates the per-nucleotide read coverage of the two miRNA arms, the loop region and the flanking regions of the precursor. Bona fide miRNAs are expected to have RNA-seq reads properly supporting both arms, while a low proportion of reads, if any, should be detected in the loop and flanking regions. Additionally, due to the high cleavage precision of the endonucleases, real miRNAs are expected to display read homogeneity at the 5′ end of each arm. Therefore, most of the reads mapping to a miRNA should start at the same nucleotide, and the read coverage plot should be ‘box-shaped’. Following this rationale, MirCure penalizes (−5) miRNA annotations without any RNA-seq reads and rewards those with expression in both arms (+2 if more than 5 reads in each arm, and +2.5 if more than 100 reads). MirCure considers read homogeneity at the miRNA 5′ end to be good when the mean read coverage of the two most 5′ nucleotides of the arm is within 10% of the mean coverage of the whole arm sequence; otherwise, it gives a penalty (−0.5) to the score. The coverage of the two flanking nucleotides of each arm is also considered and can result in penalties (up to −2 for a flanking-region coverage higher than 40% of the mean arm coverage). Similarly, the higher the coverage in the loop region, the higher the penalty (up to −2 for a loop read coverage higher than 20% of the reads of the precursor).
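The arm-read score and the 5′ homogeneity rule stated above can be written down as a short Python sketch. It is illustrative only: the function names are hypothetical, the score of 0 for arms with 5 reads or fewer is an assumption (the text gives only the range [0, 2.5]), and the +2.5 threshold is assumed to apply to each arm. The graded loop and flanking-region penalties are omitted because their intermediate steps are given only in Supplementary Table S1.

```python
from statistics import mean

def arm_read_score(reads_5p, reads_3p):
    """Score the read support of the two arms from their read counts."""
    if reads_5p == 0 and reads_3p == 0:
        return -5.0                      # no expression evidence at all
    lowest = min(reads_5p, reads_3p)
    if lowest > 100:
        return 2.5                       # more than 100 reads in each arm
    if lowest > 5:
        return 2.0                       # more than 5 reads in each arm
    return 0.0                           # assumption: weak support scores 0

def homogeneity_penalty(arm_coverage, tolerance=0.10):
    """Penalise an arm whose two most 5' bases deviate from the arm mean.

    `arm_coverage` is the per-nucleotide coverage of one arm, ordered 5' to 3'.
    """
    arm_mean = mean(arm_coverage)
    if arm_mean == 0:
        return -0.5
    start_mean = mean(arm_coverage[:2])
    if abs(start_mean - arm_mean) <= tolerance * arm_mean:
        return 0.0
    return -0.5

# A box-shaped arm versus an arm with a ragged 5' end.
box_shaped = [100] * 22
ragged = [10, 40] + [100] * 20
print(arm_read_score(150, 120), homogeneity_penalty(box_shaped))
print(arm_read_score(150, 20), homogeneity_penalty(ragged))
```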
MirCure checks the conservation of each mature miRNA. If miRNAs were provided as 5′/3′, MirCure uses the arm with the highest expression as the mature miRNA. MirCure takes this mature miRNA sequence as a query to search miRNA sequence databases for possible orthologs with a shared seed. The putative orthologous miRNAs are aligned to the miRNA candidate with the ClustalOmega algorithm implemented in the msa R package (Bodenhofer et al., 2015). As the default miRNA database, MirCure uses miRBase, as it includes multiple species, although the user can use any mature miRNA database provided in fasta format. MirCure gives a small penalty (−0.25 in animals and −0.5 in plants) to miRNAs without any putative orthologous miRNA, and gives a score of up to +2 in plants and +1 in animals to miRNAs with more than 20 putative orthologs. Note that this conservation check is less relevant for the filtering of bona fide miRNAs than the structure and expression checks. However, the information generated in this step is a useful resource to name miRNA candidates based on the names of their orthologs in other species.
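A rough Python sketch of the seed-based ortholog search and the conservation score follows. Several details are assumptions rather than MirCure's documented behaviour: the seed is taken as nucleotides 2 to 8 of the mature sequence, the score is scaled linearly up to its maximum at 21 orthologs (the text states only the endpoints), and the database entries are made-up toy sequences. The alignment step with ClustalOmega is not reproduced.

```python
def seed(mature, start=1, end=8):
    """Seed of a mature miRNA, assumed here to be nucleotides 2-8 (RNA alphabet)."""
    return mature[start:end].upper().replace("T", "U")

def putative_orthologs(query, database):
    """Names of database entries that share the query's seed."""
    query_seed = seed(query)
    return [name for name, sequence in database.items() if seed(sequence) == query_seed]

def conservation_score(n_orthologs, organism):
    """Penalty when no ortholog is found, otherwise a score that grows with n."""
    max_score, penalty = (2.0, -0.5) if organism == "plant" else (1.0, -0.25)
    if n_orthologs == 0:
        return penalty
    # Assumed linear scaling, reaching the maximum at more than 20 orthologs.
    return max_score * min(n_orthologs, 21) / 21

# Toy database (hypothetical entries, DNA or RNA alphabet accepted).
toy_db = {
    "toy-1": "UGAGGUAGUAGGUUGUGUGGUU",
    "toy-2": "UGGAAUGUAAAGAAGUAUGUAU",
    "toy-3": "TGAGGTAGTAGGTTGTATGGTT",
}
hits = putative_orthologs("UGAGGUAGUAGGUUGUAUAGUU", toy_db)
print(hits, conservation_score(len(hits), "animal"))
```

The list of hits is also what a curator would inspect when naming a candidate after its orthologs.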
Finally, MirCure gathers all the information generated for each miRNA candidate to compute a final score as the sum of the scores and penalties of each section. By default, MirCure selects all the miRNAs with a structure score equal to or higher than 1.25 and an expression score equal to or higher than 2, or with a total score higher than 3.75. Candidates with scores between 2.5 and 3 should be checked manually, while candidates with scores lower than 2.5 should be rejected.
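The default decision rule can be summarised in a few lines of Python. This is a sketch with a hypothetical function name. The text does not say how totals between 3 and 3.75 are handled when the structure and expression criteria are not met, so the sketch assumes that every non-selected candidate with a total of at least 2.5 is sent to manual review.

```python
def classify(structure, expression, conservation, total_threshold=3.75):
    """Return 'accept', 'check' or 'reject' from the three section scores."""
    total = structure + expression + conservation
    if (structure >= 1.25 and expression >= 2) or total > total_threshold:
        return "accept"
    if total >= 2.5:          # assumption: everything from 2.5 up to the cut-off
        return "check"
    return "reject"

print(classify(2.0, 2.5, 1.0))    # strong structure and expression support
print(classify(1.0, 1.5, 0.25))   # borderline total of 2.75
print(classify(-5.0, 2.5, 1.0))   # no hairpin loop
```

Because the hard penalties are −5, a single failed structural or expression requirement is normally enough to push a candidate below the rejection cut-off.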
2.3 MirCure validation
The miRBase database is the most extensive public repository of miRNAs. However, miRBase entries are not curated, and different studies have shown that a large proportion of its miRNA annotations are not supported (Fromm et al., 2015). In contrast, MirGeneDB is a smaller database of high-quality animal miRNA annotations. MirGeneDB entries are manually curated following criteria similar to those implemented in MirCure (Fromm et al., 2015, 2020). To test the accuracy of the MirCure automatic filtering of animal miRNAs, we ran MirCure on the miRBase (release 22) annotations of human (Homo sapiens) and mouse (Mus musculus), and the results were compared to MirGeneDB, which was taken as the source of true miRNA annotations. Similarly, we tested the MirCure performance on filtering plant miRNAs for two species: corn (Zea mays) and Arabidopsis thaliana. In this case, the lists of MirCure bona fide miRNAs were compared to the manually curated plant miRNAs from Taylor et al. (2014).
Small RNA-seq data used as MirCure input were obtained from public databases. Supplementary Table S2 lists the accession numbers and relevant metadata of these datasets. Downloaded fastq files were pre-processed with DNApi (Tsuji and Weng, 2016) to detect adapters and with cutadapt (Martin, 2011) to remove low-quality nucleotides. The clean reads were mapped against the genome with Bowtie2 (Langmead and Salzberg, 2012) using the parameters -L 18 -N 0. The sam files were converted to bam with samtools (Li et al., 2009), and the multiple bam files were merged into a single bam file for each species. We also tested MirCure on the de novo miRNA annotations predicted by mirDeep2. For this purpose, we ran mirDeep2 on the mouse and human genomes using the previously collected small RNA-seq datasets. The concordance between MirCure-filtered miRNAs and mirDeep2 predictions was checked for genomic overlaps of the annotated precursors using the GenomicAlignments R package (Lawrence et al., 2013).
3 Results
3.1 The MirCure app
MirCure is a Shiny R application with a graphical user interface accessible through RStudio or any web browser. The user interface consists of a main dashboard that contains basic usage information, a sidebar with the different functions and parameters, and several upper tabs to access results (Fig. 2A). As input, MirCure requires three types of files: (i) miRNA candidate annotations to be evaluated (gff3 format), (ii) a reference genome (fasta) and (iii) small RNA-seq data mapped to the genome (bam file). On the sidebar, the user needs to indicate the organism type, animal or plant, as this determines how the quality score is computed. Additionally, the sidebar functions allow users to tune the number of flanking bases to retrieve from the genome sequence for secondary structure predictions (default is 11), indicate the miRNA annotation style, and change the score threshold for calling bona fide miRNAs. The MirCure interface is interactive, and users must run each of the MirCure sections sequentially using the buttons on the sidebar. After each section is completed, its output is available at the corresponding tab on the top bar.

Fig. 2. Overview of the MirCure application. (A) MirCure graphical user interface. (B) MirCure graphic output displaying a poorly supported miRNA annotation.
The last button of the MirCure sidebar integrates the output of each section and returns an extensive report for each miRNA candidate that includes the RNA-seq read coverage of each nucleotide, the number of reads mapped to different parts of the miRNA (5′ arm, 3′ arm and precursor), the secondary structure of the precursor, the presence of the endonuclease (Dicer and Drosha in animals and Dicer-like 1 in plants) cleavage signature, the conservation of the mature arm in other species and a quality score (Fig. 2B). By default, MirCure pre-selects as bona fide miRNAs those candidates with a score above a user-defined threshold, with default values chosen to optimize accuracy (see below). The user can further curate the miRNAs based on the information generated and displayed in the MirCure graphical interface, and select or unselect miRNA candidates as correct annotations. Furthermore, the user can take advantage of the miRNA conservation information to rename the miRNA candidate using the editing option supported by MirCure. The confirmed miRNA annotations can be exported as gff and fasta files, and the graphical support generated for each of them is available for download as pdf. Additionally, MirCure generates summary statistics of the miRNA quality score distribution, which can be used to adjust thresholds if necessary.
MirCure can be run on any average desktop or laptop computer and is compatible with most Linux-based computational clusters. For the analysis described here, MirCure was run on a MacOS laptop with a 4-core 1.4 GHz CPU, 8 GB of RAM and an SSD drive. On this setup, using a 13.72 GB bam file and 1916 miRNA candidates, MirCure required an average of 7.15 s per miRNA candidate.
3.2 The MirCure analysis criteria
MirCure analysis is based on three quality criteria that generate both a score and a graphical output. These criteria recapitulate state-of-the-art guidelines for the control of miRNA quality. Figure 3 shows a representation of the MirCure output for each of these three assessment criteria for ‘good’ (high scoring) and ‘bad’ (low scoring) miRNA candidates. Figure 3A represents the evaluation of structure criteria where bona fide miRNAs are expected to show 2 nt overhang on the 3′ end of both miRNA extremes, and display hairpin structures. Figure 3B depicts the gene expression-based features. Good supporting data shows sharp alignment of reads on the mature arm and low expression at the loop and flanking regions.

Fig. 3. Graphical representation of MirCure evaluation criteria for high scoring and low scoring miRNAs. (A) Structural criteria. (B) Gene expression-based evidence.
3.3 Validation of MirCure
We evaluated the accuracy of the MirCure algorithm by analyzing miRNA annotations from two species of animals, human and mouse, and two species of plants, corn and A. thaliana. As a source of true miRNAs, we used the manually curated miRNAs from MirGeneDB and Taylor et al. for animals and plants, respectively. Additionally, we demonstrate the use of MirCure for the validation of de novo miRNA annotations predicted with mirDeep2.
3.3.1 MirCure results on the curation of miRBase annotations
For each of the four chosen species, human, mouse, corn and Arabidopsis, we retrieved 820, 596, 240 and 402 million RNA-seq reads, respectively, from which an average of 91.96% of reads were mapped to their respective genomes (Supplementary Table S2). We obtained current miRNA annotations for the four species from miRBase release 22. The script Mir-base_2_MirCure.R was used to process miRBase gff3 files to create one separate gff3 file for each miRNA arm. The processed miRBase files and the mapped reads to the respective genomes were used as input for MirCure.
We obtained the MirCure default threshold value using the above-described plant and animal data. For this, we ran MirCure with different threshold values and compared the resulting classification of miRNAs against the manually curated miRNA annotations. False positive, true positive, false negative and true negative rates were computed, and receiver operating characteristic (ROC) curves were obtained for each of the four species (Fig. 4). ROC curves were similar for all species, with an AUC between 0.80 and 0.90 (Table 2), indicating a high performance of the MirCure filtering approach. The optimal threshold in humans was 3.75, while for the other species it was set to 2. Using these thresholds, MirCure performed well in terms of accuracy, specificity and precision (Table 2). Globally, ∼88.56% of the animal and ∼91% of the plant miRNAs that passed MirCure quality control were in the corresponding manually curated lists and were therefore considered true positives. A small number of miRNA genes were selected by MirCure but were not present in the bona fide reference lists, and were noted as putative false positives. After careful manual analysis of the supporting information of these putative false positives, we concluded that in most cases these miRNA genes had strong evidence for being considered bona fide miRNAs (Supplementary Table S3 and Fig. S1), suggesting that these miRNA gene annotations might have been missed by the curation effort, which would result in an overestimation of our false positive rates. However, a large number of miRNA genes were present in the curated set but were not retained by MirCure as high-quality calls, compromising sensitivity (Supplementary Table S4). We also analyzed those miRNAs and concluded that in most cases the lack of support by gene expression was the cause of the low score (Supplementary Fig. S1).
This result is expected, since only a limited number of small RNA-seq datasets were used in our analysis and they did not cover all the cell types that express miRNAs. From this analysis, we concluded that MirCure has high accuracy for detecting bona fide miRNAs from a list of potential candidates, provided that suitable small RNA-seq data are available. Our results also confirm the high proportion of false calls that populate the miRBase database.
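The threshold sweep behind the ROC curves can be sketched in Python as follows. This is a generic ROC and trapezoidal AUC computation on made-up scores and labels, not the authors' script. The function names are illustrative, and a candidate is assumed to be selected when its score is at or above the threshold.

```python
def roc_points(scores, labels):
    """(false positive rate, true positive rate) at every observed threshold.

    `labels` holds 1 for candidates present in the curated reference set
    and 0 otherwise.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

# Toy example: eight candidates, four of them in the curated set.
toy_scores = [5.0, 4.5, 4.0, 3.0, 2.0, 1.0, -3.0, -5.0]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 0]
print(auc(roc_points(toy_scores, toy_labels)))
```

The threshold that best balances the two rates on such a curve is the kind of value reported above as the default cut-off.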

Fig. 4. ROC curves of MirCure benchmarking. (A) miRBase database. (B) mirDeep2 predictions.
Table 2. Performance results of the MirCure automatic filtering of miRBase miRNAs using MirGeneDB as ground truth
Species | TP | TN | FN | FP | Acc. (%) | Sens. (%) | Spec. (%) | Prec. (%) | AUC |
---|---|---|---|---|---|---|---|---|---|
Human | 233 | 1385 | 273 | 27 | 84.4 | 46.1 | 98.1 | 89.6 | 0.80 |
Mouse | 238 | 825 | 163 | 34 | 84.4 | 59.4 | 96.0 | 87.5 | 0.90 |
Corn | 97 | 14 | 57 | 2 | 65.3 | 63.0 | 87.5 | 98.0 | 0.85 |
Arabidopsis | 71 | 171 | 68 | 14 | 74.7 | 51.1 | 92.4 | 83.5 | 0.83 |
TP, putative true positives; TN, putative true negatives; FN, putative false negatives; FP, putative false positives; Acc., accuracy; Sens., sensitivity; Spec., specificity, Prec., precision; AUC, area under the curve.
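The percentages in Table 2 follow from the four counts through the standard definitions of accuracy, sensitivity, specificity and precision. The short Python sketch below recomputes them from the tabulated counts. The function name is illustrative, and the results should agree with the table to within rounding.

```python
def metrics(tp, tn, fn, fp):
    """Standard confusion-matrix metrics, expressed as percentages."""
    return {
        "accuracy": 100 * (tp + tn) / (tp + tn + fn + fp),
        "sensitivity": 100 * tp / (tp + fn),
        "specificity": 100 * tn / (tn + fp),
        "precision": 100 * tp / (tp + fp),
    }

# Counts (TP, TN, FN, FP) copied from Table 2.
table2 = {
    "Human": (233, 1385, 273, 27),
    "Mouse": (238, 825, 163, 34),
    "Corn": (97, 14, 57, 2),
    "Arabidopsis": (71, 171, 68, 14),
}

for species, counts in table2.items():
    rounded = {name: round(value, 1) for name, value in metrics(*counts).items()}
    print(species, rounded)
```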
3.3.2. MirCure results for filtering of de novo miRNA predictions
We evaluated the usefulness of MirCure for curating de novo miRNA annotations. De novo miRNAs in the human and mouse genomes were predicted using mirDeep2, resulting in 793 and 518 miRNAs, respectively. In this case, the ROC curves were again similar for both species (Fig. 4B), with AUCs of 0.85 and 0.82, respectively (Table 3). Within the mirDeep2 predictions, MirCure automatically identified 160 and 119 as bona fide miRNAs, of which 136 and 110 are in MirGeneDB (Table 3). Further manual analysis of the putative false positives identified 12 and 9 miRNAs that MirCure selected with strong evidence scores but that are absent from MirGeneDB because they were newly predicted by mirDeep2. We concluded that these were most likely genuine miRNAs incorrectly counted as false positives.
Table 3. Performance results of the MirCure automatic filtering of mirDeep2 miRNAs using MirGeneDB as ground truth
Species | TP | TN | FN | FP | Acc. (%) | Sens. (%) | Spec. (%) | Prec. (%) | AUC |
---|---|---|---|---|---|---|---|---|---|
Human | 136 | 427 | 206 | 24 | 71.0 | 39.8 | 94.7 | 85.0 | 0.85 |
Mouse | 110 | 208 | 191 | 9 | 61.4 | 36.5 | 95.8 | 92.4 | 0.82 |
TP, putative true positives; TN, putative true negatives; FN, putative false negatives; FP, putative false positives; Acc., accuracy; Sens., sensitivity; Spec., specificity; Prec., precision; AUC, area under the curve.
4 Discussion
Here, we present MirCure, an easy-to-use computational tool that generates quality control reports of miRNA annotations and assists in their curation. MirCure returns, for each evaluated miRNA, an extensive report supported by figures that compile all the information necessary to proceed with manual evaluation of miRNA predictions. Additionally, MirCure includes automatic adjustment of miRNA gene models and an automatic filtering step, thereby facilitating the selection of bona fide miRNAs. The backbone of the MirCure algorithm, namely the list of bona fide criteria that miRNAs must fulfill, is similar to those applied in manual curation efforts in plants and animals (Fromm et al., 2020; Taylor et al., 2014; Ylla et al., 2016). In this sense, the main goal of MirCure is to offer computational support for the application of these well-established curation criteria. By transforming these criteria into a measurable score, the curation can proceed automatically or with reduced human intervention. The default score threshold for the automatic filtering has been optimized on the species analyzed here and might need to be tuned depending on the data provided by the user. Thus, we encourage users to manually check the miRNA ranking around the default threshold and adjust it if necessary.
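Checking the ranking around a score threshold can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MirCure's implementation: the candidate names, the scores, the threshold of 6.0 and the review window of 1.0 are all hypothetical.

```python
def split_by_threshold(scores, threshold, window):
    """Split candidates into accepted, rejected and borderline lists.

    scores    : dict mapping candidate name -> numeric score (hypothetical values)
    threshold : candidates scoring >= threshold are accepted
    window    : candidates within +/- window of the threshold are also
                flagged as borderline for manual review
    """
    accepted = sorted(n for n, s in scores.items() if s >= threshold)
    rejected = sorted(n for n, s in scores.items() if s < threshold)
    borderline = sorted(
        (n for n, s in scores.items() if abs(s - threshold) <= window),
        key=lambda n: scores[n],
        reverse=True,  # highest-scoring borderline candidates first
    )
    return accepted, rejected, borderline

# Hypothetical candidates and scores, for illustration only
scores = {"cand-1": 9.0, "cand-2": 6.5, "cand-3": 5.5, "cand-4": 2.0}
accepted, rejected, borderline = split_by_threshold(scores, threshold=6.0, window=1.0)
print("accepted:", accepted)
print("rejected:", rejected)
print("review manually:", borderline)
```

Candidates in the borderline list are the ones worth inspecting by hand before deciding whether to raise or lower the threshold for a given dataset.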
MirCure takes advantage of the accumulated knowledge about miRNA biogenesis and seeks traces of this process within the miRNA annotations. These biogenesis fingerprints include, among others, the hairpin-like secondary structure of the precursor miRNA, in which the two miRNA arms, mature and star, are partially complementary with a 2-nt overhang at the 3′ end. These biogenesis fingerprints, however, are meaningless if they are not supported by transcriptomic evidence of the two putative miRNA arms. Therefore, MirCure uses high-throughput small RNA-seq data as the source of transcriptomic information. MirCure requires that the miRNA model be supported by ∼22 nt-long small RNA-seq reads mapping on both miRNA arms and that it display a 2-nt overhang at the 3′ end in the predicted secondary structure. Consequently, MirCure cannot validate miRNAs that are absent from the user-provided small RNA-seq libraries. This implies that MirCure is limited to the validation of miRNAs expressed in the tissues provided by the user and therefore acts as a curation tool for specific datasets rather than a genome-wide annotation engine. However, the amount of small RNA-seq evidence that MirCure can accept is limited only by the user’s computational resources, so extensive sequencing information has the potential to allow for extended curation. As miRNA-seq assays are generally accessible, the inclusion of these data in miRNA annotation efforts is a reasonable requirement that has the advantage of significantly improving annotation accuracy.
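The read-support requirement on both arms can be illustrated with a short Python sketch. It assumes that reads have already been mapped to the precursor and are given as (start, end) intervals in 1-based inclusive coordinates. The length range of 20 to 24 nt, the positional slack of 2 nt, the minimum read count and all names are illustrative assumptions, not MirCure's actual parameters.

```python
def arm_support(reads, arm, min_len=20, max_len=24, slack=2):
    """Count reads of miRNA-like length that fall within an arm.

    reads : list of (start, end) tuples, 1-based inclusive
    arm   : (start, end) of the annotated arm on the precursor
    slack : tolerated overhang of a read beyond either arm boundary (assumed)
    """
    arm_start, arm_end = arm
    count = 0
    for start, end in reads:
        length = end - start + 1
        inside = arm_start - slack <= start and end <= arm_end + slack
        if min_len <= length <= max_len and inside:
            count += 1
    return count

def both_arms_supported(reads, arm_5p, arm_3p, min_reads=1):
    """Return True only if each arm has at least min_reads supporting reads."""
    return (arm_support(reads, arm_5p) >= min_reads
            and arm_support(reads, arm_3p) >= min_reads)

# Hypothetical precursor with a 5p arm at 5-26 and a 3p arm at 45-66
arm_5p, arm_3p = (5, 26), (45, 66)
reads = [(5, 26), (6, 27), (45, 66), (30, 80)]  # the last read is too long
print(arm_support(reads, arm_5p), arm_support(reads, arm_3p))
print(both_arms_supported(reads, arm_5p, arm_3p))
```

A candidate whose reads all map to a single arm would fail this check, which mirrors why miRNAs not expressed in the supplied libraries cannot be validated.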
To evaluate MirCure's automatic bona fide miRNA calls, we compared its results to those of two studies in which the authors manually curated and filtered miRBase miRNAs for animals and plants. We observed that most of the miRNAs automatically validated by MirCure among the human and Arabidopsis miRBase entries (89% and 83%, respectively) were also present in the list of manually confirmed miRNAs. Furthermore, when we manually explored the ∼10% of putative false positives, we found that most of them scored similarly to MirGeneDB genes and had enough evidence to be confidently considered miRNAs. Although we did not validate these miRNAs experimentally, this result suggests that MirCure, by facilitating automatic collection of multiple information sources under a single framework, might be in a better position to collect critical evidence otherwise missed by curators.
Overall, we have shown that MirCure is a useful tool to check the quality of miRNA annotations from different sources, such as existing databases and de novo annotations. By combining automatic scoring and filtering with graphical output of the supporting evidence, MirCure facilitates the miRNA curation process both in plant and animal species.
Funding
This work was partially supported by University of Florida start-up funds allocated to A.C.
Conflict of Interest: none declared.
Data availability
Small RNA-seq data used for MirCure input was obtained from public databases and is freely available. Supplementary Table S2 lists the accession numbers and relevant metadata of these datasets.