Computational prediction of the localization of microRNAs within their pre-miRNA

MicroRNAs (miRNAs) are short RNA species derived from hairpin-forming miRNA precursors (pre-miRNA) and acting as key posttranscriptional regulators. Most computational tools labeled as miRNA predictors are in fact pre-miRNA predictors and provide no information about the putative miRNA location within the pre-miRNA. Sequence and structural features that determine the location of the miRNA, and the extent to which these properties vary from species to species, are poorly understood. We have developed miRdup, a computational predictor for the identification of the most likely miRNA location within a given pre-miRNA or the validation of a candidate miRNA. MiRdup is based on a random forest classifier trained with experimentally validated miRNAs from miRbase, with features that characterize the miRNA–miRNA* duplex. Because we observed that miRNAs have sequence and structural properties that differ between species, mostly in terms of duplex stability, we trained various clade-specific miRdup models and obtained increased accuracy. MiRdup self-trains on the most recent version of miRbase and is easy to use. Combined with existing pre-miRNA predictors, it will be valuable for both de novo mapping of miRNAs and filtering of large sets of candidate miRNAs obtained from transcriptome sequencing projects. MiRdup is open source under the GPLv3 and available at http://www.cs.mcgill.ca/∼blanchem/mirdup/.


INTRODUCTION
MicroRNAs (miRNAs) are short (generally 19-24 nucleotides) noncoding single-stranded RNA molecules that are involved in posttranscriptional regulation by targeting messenger RNAs (1)(2)(3). In animals, miRNAs expression is a multistep process (4): (i) transcription of the primary miRNAs (pri-miRNAs) by RNA polymerase II, (ii) cleavage of the pri-miRNA by Drosha and the RNAse III enzyme to isolate long hairpins called miRNA precursors (pre-miRNAs) and (iii) extraction by Dicer of the miRNA-miRNA* duplex from the pre-miRNA. In plants, Drosha and Dicer are replaced by Dicer Like 1 (5). The miRNA* is the complementary region of the miRNA on the other arm of the hairpin with a shift of 2 nt in the 5 0 direction (6). After separation of the two strands of the duplex, the miRNA is mature and ready to be attached to the RISC complement. It then targets mRNAs by perfect or imperfect complementarity (7). In some cases, both miRNA and miRNA* are functional (8).
Although experimental techniques for unambiguous identification of miRNAs exist (35), they remain slow and expensive. Sequencing of short RNAs followed by mapping to a reference genome has become an approach of choice (36)(37)(38), but many small RNA molecules are unlikely to be miRNAs, while many true miRNAs are likely to be expressed only under rare circumstances not easily covered experimentally. For those reasons, computational prediction of miRNAs continues to play an important role in genomics.
Most miRNA prediction approaches rely, at least in part, on the specific hairpin shape of the secondary structure of the pre-miRNA (39). These include ProMir (40,41), TripletSVM (42), miRabela (43), miPred (44), SSCprofiler (45), microPred (46), HHMMiR (47), SplamiR (48), miRFinder (49), MiRenSVM, the only tools that handle multiloop hairpins (50), and many others. All these tools are trained on known miRNAs stored in MiRbase (51), a repository of miRNAs (mostly) experimentally validated. The prediction of the hairpin can be combined with comparative genomics approaches that posit that, in addition to their typical secondary structure, pre-miRNAs exhibit high sequence and structure conservation across species (52,53). However, most computational approaches labeled as miRNA predictors are actually pre-miRNA predictors, in the sense that they identify candidate genomic regions that may form pre-miRNAs but rarely attempt to determine the position of the miRNA itself within them.
Computationally predicted pre-miRNAs are often combined with high-throughput short-RNA sequencing data, in an attempt to determine which of the large number of expressed small RNAs may indeed be miRNAs. This kind of approach is challenging, though, as short reads may be incorrectly mapped, or may come from degradation products from the pre-miRNA, especially from the miRNA*, or from other types of RNA molecules. Predictions from deep sequencing can be obtained by considering the abundance and distribution of reads mapped to a candidate pre-miRNA, where read stacks and Dicer products mapped on a reference inform about the location of the miRNA. This strategy is used by miRdeep (6,54), miRdeep* (55), MIReNA (56) and miRanalyzer (57,58). However, lowly expressed miRNA, often lineage-specific (59) or condition-specific (60) ones, will be difficult to detect because Dicer products and the miRNA* are completely degraded.
To the best of our knowledge, only six mature miRNA predictors have been proposed to date. MIRcheck (28) identifies 20-nt regions of a given plant pre-miRNA using a predetermined set of rules and constraints. MiRalign (61) finds miRNAs positions by aligning pre-miRNAs with miRbase, thereby preventing from finding new miRNAs. ProMir (40) identifies human pre-miRNAs and their mature miRNAs by combining sequence and structural features in a paired hidden Markov model. MatureBayes (62) identifies 22-nt regions that are likely mature miRNA candidates based on sequence and secondary structure information using a Naive Bayes classifier. MaturePred (63) locates fixed-length miRNAs in plants based on miRNA-miRNAs* features and a support vector machine predictor. Finally, MiRmat (64) seeks Drosha and Dicer processing sites in vertebrates using a random forest predictor.
Although the recent research activity related to miRNA prediction shows the importance of the problem, existing tools have severe limitations. First, most tools are trained specifically on data from certain phyla [e.g. plants (28), humans (42,44) or viruses (33)], which limits their applicability. Second, most mature miRNA prediction tools seek mature miRNA of a fixed length, although in most species miRNAs lengths vary from 19 to 24 nt. Third, tools are typically trained once, at the time of publication, based on the training data available at that time. This means that they do not benefit from the rapid increase in the quality and quantity of experimentally verified miRNAs available. Finally, accessibility remains an issue, with ProMir 2 being unavailable and MaturePred, MiRalign and MiRmat being only available as web servers, which limits that usability for large-scale analyses.
In this article, we introduce miRdup (miRNA duplex), a tool for the validation of a candidate mature miRNA or the prediction of the precise position and length of the mature miRNA within a candidate pre-miRNA, based on a combination of sequence and structural features. We trained models separately on data from five lineages (mammals, fishes, arthropods, nematodes and plants), which increases species specificity and allows the discovery of features that distinguishes miRNAs from different species. The algorithm works on both single hairpin and multiloop pre-miRNAs. Finally, miRdup automatically downloads and trains on the latest miRbase release, to ensure it benefits from the most up-to-date data.

Datasets
MiRNAs and pre-miRNAs sequences were downloaded from miRbase (http://www.mirbase.org/) (52) release 19, which contains 19 823 unique mature miRNAs/pre-miRNAs pairs. We note that until recently, miRNAs and miRNA* used to be annotated separately in mirBase and were thought to be functionally distinct, with the former playing a functional role and the latter being a non-functional by-product. This view has changed now owing to reports of functional activity of miRNAs* (65), and miRbase has stopped distinguishing between the miRNAs and miRNAs* (miRbase blog, 27 April 2011). We chose to follow this direction by considering all miRNAs and miRNAs* as functional, labeling them as either 3 prime or 5 prime depending on their location on the pre-miRNA hairpin. We note, however, that for >78% of cases, only one miRNA is annotated in a given pre-miRNA, with the complementary region not being annotated as functional.
For the purpose of training classifiers, negative sets of non-miRNAs were generated as follows. For each positive example (pair of miRNA and pre-miRNA), a negative example was generated by randomly relocating the miRNA along the same pre-miRNA sequence, preserving the miRNA's length, but excluding the exact position of the true miRNA or of any other known miRNAs. Note

Feature vectors and training
Each training example was represented as a set of 100 features listed in Supplementary Table S1. The minimum free energy (MFE) and the secondary structure of the pre-miRNAs and the miRNA-miRNA* candidate duplexes were obtained with RNAfold and RNAduplex, from Vienna package (66), using default parameters. To perform the ranking of attributes and classifier training and evaluation, we used Weka and its libraries (67). All classifiers were trained using 10-fold cross-validation. Attributes ranking was performed using information gain evaluator (InfoGain evaluator) (68) with the Ranker search method (69) in Weka with default parameters and 10-fold cross validation. Ranker ranks attributes by their individual evaluations in conjunction with other attribute evaluators like ReliefF (70), GainRatio (71) and Entropy (72).
MiRdup uses a random forest classifier [a combination of decision tree predictors trained on a random subset of features sampled independently (73)], combined with the Adaboost M1 method (74). Adaboost is a machine learning meta-algorithm that is used in combination of many other machine learning algorithms to improve their performance (75). The random forest was trained with an unlimited maximum depth of the trees and 50 generated trees (Weka options: -I 50 -K 0 -S 1). Adaboost used 10 iterations, reweighting and a weight threshold of 100 (Weka options: -P 100 -S 1 -I 10). The other classifiers considered were (i) a support-vector machine (SVM) classifier (76) The efficiency of a given classifier was measured as a function of its number of true positive (TP), false positive (FN), true negative (TN) and false negative (FN) predictions. A classifier performance is typically measured by its sensitivity Sn ¼ TP=ðTP+FNÞ and specificity Sp ¼ TN=ðTN+FPÞ, as well as by its total prediction accuracy ACC ¼ ðTP+TNÞ=ðTP+TN+FP+FNÞ

MiRNA prediction
MiRdup can be used in two modes. In the validation mode, miRdup takes as input a pre-miRNA sequence and the position of a candidate miRNA, and returns a score that reflects the likelihood that the candidate is a true miRNA. In the prediction mode, the only input to miRdup is a pre-miRNA sequence, and it evaluates all possible miRNAs and reports the most likely miRNAcontaining duplex. For each candidate, starting position p and length 16 l 30 on a pre-miRNA of length n, miRdup calculates score (p, l) using the random forest classifier, as described above. Although candidate miRNAs could simply be ranked based on these scores, we found that the following post-processing approach produced more accurate predictions. We first calculate, for each starting position p, the consensus scores for starting position score S(p) and ending position E(p): We then identify the position p and length l that results in the largest combined start and end position scores:

RESULTS AND DISCUSSION
We developed miRdup, a classifier for the mature miRNA validation and identification in a given pre-miRNA sequence (see 'Materials and Methods' section). In the former case, miRdup assigns a score to a given candidate mature miRNAs within its pre-miRNA sequence. In the latter, it determines the most likely position of a mature miRNA within a given pre-miRNA sequence. MiRdup is based on a random forest binary classifier using a set of sequence and structural features of the candidate miRNA-miRNA* duplex. By training mirDup on lineage-specific subsets of miRbase, one obtains classifiers that can take advantage of miRNA features that are specific to that clade, which helps improve the accuracy of predictions. Here, we report on the accuracy of miRdup predictions in various settings, and contrast sequence and structure features that are informative for five selected clades: Mammals (mostly primates, rodents and carnivores), plants (mostly crucifers, maize and rice), fish (mostly zebrafish and fugu), arthropods (insects and crustaceans, etc.) and nematodes (Caenorhabditis and Pristionchus pacificus, etc.).

Evaluation of individual predictive features
We evaluated a set of sequence and structural features (Supplementary Table S1), summarized in Table 1, which may potentially help characterize the position of the miRNA on the pre-miRNA hairpin. They were chosen based on previous studies focusing on miRNA prediction (42,44) and on the many properties that could characterize the duplex. These include numerical features describing the position and length of particular structural elements in the putative miRNA, such as bulges and bases pairs, or distance of the miRNA from the start/end of the hairpin (Figure 1). We also included summary statistics on the primary miRNA sequence (e.g. mononucleotide and dinucleotide frequencies) and the predicted secondary structure of the miRNA/miRNA* duplex [frequency of base pairs types (G-U, C-G or A-U), frequency of local sequence/structure triplets (A sequence/structure triplet corresponds to a nucleotide coupled by the sequence of presence/absence of base-pairing at that position and the two flanking positions. For example, 'A(.(' represents a case where a nucleotide A is in a bulge surrounded by two base pairs, and 'U.(.' means that a U is paired but its two neighbors are not.) (42) and MFE of the duplex].
We note that we also considered adding structural features based on ensembles of structures rather than MFE structures. However, these features did not prove more informative than their MFE-based counterparts and were not retained.
Features vary in their power to distinguish positive from negative examples. Identifying and removing uninformative features is often important to avoid overfitting and improve computational time (80), although this problem is less of an issue for algorithms based on decision trees and forests of random decision trees (81) than for SVMs (82). Features were ranked based on the information gain they provide (Table 2). We observe that the most influential features are those related to structural aspects of the miRNA-miRNA* duplex (number of base pairs, MFE, number/size of bulges, position of miRNA in the pre-miRNA hairpin loop). On the opposite, primary sequence features and triplet frequencies showed little discriminative power. We note that because our positive and negative examples were size-matched, miRNA length was not considered informative.

Mature miRNAs exhibit species-specific properties
We then assessed the power of each feature at distinguishing true miRNA from negative examples in specific lineages. Figure 2A-H shows distribution of feature values for some of those that vary significantly between lineages based on Kolmogorov-Smirnov test (P-value < 0.05 for at least one of the comparisons between lineage-specific distribution and the distribution obtained from all MirBase). The length of miRNAs varies significantly between species, where plant miRNA are generally 21 nt long and almost never 23 nt, while animal miRNAs have a broader, more regular miRNA length distribution with a mode at 22 nt (Figure 2A). Plant  miRNAs also stand out with duplexes that are on average more stable (lower free energy) than animals' (Figure 2B), while arthropods and, to a lesser extent, nematodes are often less stable. This is also reflected in various structural properties such as the presence of fewer and shorter bulges ( Figure 2C, D). In fact, >13% of plant miRNAs have no bulge at all (100% base-paired positions, Figure 2E) and >33% have at least 10 consecutive base pairs starting at positions 0 (start) or 1 ( Figure 2F), two properties that are much more rare in animals. Sixty percent to 90% of animal miRNAs are located within 10 bp of the terminal loop of the pre-miRNA ( Figure 2G), whereas plant miRNAs are often found much further, in agreement with the fact that plants usually have longer precursors (83). The GC content of miRNAs exhibits significant variations between species ( Figure 2H), with fish miRNAs being notably less GC-rich than those of other species. Finally, we noted a remarkable nucleotide composition bias at the first position of the miRNA with 40% (in mammals) to 60% (in fish) of miRNAs starting with a U nucleotide ( Figure 2I). Feature ranking was then repeated on each set of species separately. While certain structural features such as those relating to the number of base pairings ranked consistently high for all lineages, others, in agreement with the results presented in Table 2, are ranked differently for different species (Table 2 and Supplementary Tables S1). In particular, the distance to and overlap with the terminal loop showed decreased informativity in plants, while the MFE and the number of base pairs in the duplex were more informative in plants than animals.
To illustrate an important use of miRdup, we used it to reanalyze a set of 1670 miRNA predicted by MiRdeep2 (54) from short-RNA sequencing data in human cancer lines (SRA SRR029124). MiRdup-mammals validated only 755 (45%) of these candidate miRNAs. There are multiple lines of evidence that suggest that the candidates that were rejected by miRdup were indeed MiRdeep2 false positives. First, only 3% of the candidate miRNAs that were rejected by miRdup overlapped annotated miRNAs from MiRbase, whereas this fraction was of 47% among candidate miRNAs that were validated by miRdup. Second, we observe that among the miRNAs predicted by MiRdeep2 and validated by miRdup, a large proportion (46.5%) overlap highly conserved sequences among mammals [based on PhastCons highly conserved elements (87)], whereas this proportion drops to only 19.2% among MiRdeep2 miRNA predictions that were rejected by miRdup. These results suggest that the candidate miRNAs rejected by miRdup are either non-functional, or are atypical, unannotated and poorly conserved miRNAs. Finally, we also reanalyzed the pool of pre-miRNAs and their mature miRNAs predicted and published by the authors of miRdeep (6), and miRdeep2 (54). MiRdup confirmed 89% (201 on 226) and 84% (98 on 117) of the identified miRNAs, respectively.

Prediction of a miRNA position within a pre-miRNA
MiRdup can be used to predict the most likely miRNA duplex location, i.e. the most likely miRNA in 5 prime (5p) and 3 prime (3p), within a given pre-miRNA. Given a pre-miRNA sequence and a trained classifier for the  Table 2. binary decision problem, miRdup computes prediction scores for every possible combination of miRNA length (16-30 nt) and starting position and then identifies the pair of starting and ending positions, located within 16-30 nt of each other, for which the total evidence is highest (see 'Materials and Methods' section). We finally return the predicted miRNA and its miRNA*. Figure 4 shows an example of the prediction made for a typical pre-miRNA, Drosophila melanogaster's dme-mir-10.
To estimate the accuracy of miRdup at locating miRNAs within pre-miRNAs, we calculated the minimum distance between the true and predicted miRNA/miRNA*, for both the start and end positions ( Figure 5 and Supplementary Figure S1). When trained and evaluated on data from all five lineages combined, miRdup made perfect predictions of start and end positions in 28.7% and 20.18% of the cases, respectively, and was within 3 nt in 68.9% and 68.3% of cases, respectively. This is significantly better than MatureBayes, miRalign and ProMir 1, the only miRNA predictors we were able to compare with. When evaluated on the same data set, MatureBayes yields only 18.8% and 13.3% exact miRNA duplex start and end position predictions, while MiRalign yields 18.8% and 7.9%, MaturePred 10.34% and 9.14% and ProMir1 6.13% and 8.01%. The results also indicate that $10% of the predictions are off by >10 nt with miRdup, versus $20% for the best of the competitors ( Figure 5).
Results obtained using the appropriate lineage-specific version of miRdup generally improve on the multilineage  The highest accuracy for each column is in bold. For cases where a predictor is applied to data from the lineage it is trained on, the numbers reported are obtained by 10-fold cross-validation.

The miRdup program
MiRdup is distributed as a java program making use of libraries from the Weka (67) and ViennaRNA (66) packages. The workflow is schematized in Figure 6. MiRdup can either be trained on a user-provided dataset of known miRNAs and pre-miRNAs, or can automatically download the latest version of mirBase and be trained on all of it or on a lineage-specific subset. For example, if 'ruminantia' is specified as clade of interest, the predictor will be trained only on Bos taurus and Ovis aries, which are (currently) the only two species present in miRbase in this clade. The set of negative examples is constructed on the fly by randomizing the position of miRNAs on the pre-miRNA. A MFE secondary structure is obtained for each pre-miRNAs, and features are calculated. Finally, the random forest predictor is trained. MiRdup can be run in two modes. In the first, miRdup takes as input a pre-miRNA sequence (with or without predicted secondary structure) and a candidate miRNA position, and assigns a score reflecting the likelihood that the candidate is a real miRNA. In the second case, miRdup evaluates every possible combination of miRNA position and length, and reports the most likely pair.  Figure 5. Cumulative distribution of the minimum distance between the true and predicted miRNAs or miRNAs* starts (up) and ends (down), i.e. the proportion of cases where the prediction is within x bases of the true start/end positions. Multilineage miRdup predictions are compared with MatureBayes (57), MiRalign (61), MaturePred (63) and PromiR1 (40) for all experimentally validated pre-miRNAs from miRbase, except for MaturePred, where our analysis was limited to only 2400 miRNAs submitted owing to web server constraints. For MatureBayes and Promir, a small number of queries were rejected by the web server and were thus excluded from the results. We only show distances of up to 10 nt, but in some rare cases, errors are substantially larger (up to 250 nt). Results for lineage-specific miRdup compared with MatureBayes for mammals, arthropods, nematods, fish and plants are shown in Supplementary Figure S1. on the complete mirBase database requires <80 min, and the miRNA prediction phase takes $10 s for a given pre-miRNA of 100 nt.

CONCLUSIONS
Although the structural properties of pre-miRNAs are well characterized (88) and have largely been exploited for their predictions (89), the sequence and structure properties that allow Dicer to recognize the exact position of the mature miRNA remains poorly understood (90). For this reason, computational approaches for the identification of miRNAs within pre-miRNA are rare and relatively inaccurate. Such predictors are, however, of great importance. First, working hand in hand with pre-miRNA predictors, they are essential for the de novo computational miRNA annotation of new genomes. Second, they play an important role even for miRNA annotation projects that have the benefit of short-RNA sequencing data. Indeed, from our experience, the classical approach of identifying likely miRNAs by retaining only reads that map to a genomic regions with strong pre-miRNA potential (as predicted by miPred (44) or HHMMiR (47), for example) still yields tens of thousands predictions. Considering only candidates overlapping pre-miRNAs predicted by more than one tool can reduce this number, but the consequences on sensitivity and specificity are hard to quantify. A more reasonable number of predictions can be obtained by more recent tools such as miRDeep (6,54), although even it often produced unlikely miRNA predictions. MiRdup then offers the opportunity to discard these likely false positives while retaining a high sensitivity.
MiRdup is a flexible, accurate, fast and user-friendly tool for the localization of mature miRNAs in pre-miRNA. It complements a wide array of computational tools that aim to identify pre-miRNAs and should be used as a posttreatment of predicted hairpins or to validate the miRNA function of short RNA reads mapped to a reference genome. MiRdup's speed and flexibility let it to be trained on data from specific lineages, which allows it to take advantage of species-specific miRNA properties. Because it is automatically trained on the latest version of mirBase, it remains up-to-date and can take advantage of increasingly large and accurate sets of miRNA annotations. The multilineage version of MiRdup outperforms the only other miRNA predictor available for download, matureBayes (62). The lineage-specific version is even more accurate, as it is able to take advantage of features such as the presence of a uracyl at the first position of the vast majority of fish miRNAs, or the increased stability of the miRNA-miRNA* duplex in plants.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online: Supplementary Table 1