Motivation: MicroRNAs (miRNA) are essential 21–22 nt regulatory RNAs produced from larger hairpin-like precursors. Local sequence alignment tools such as BLAST are able to identify new members of known miRNA families, but not all of them. We set out to estimate how many new miRNAs could be recovered using a profile-based strategy such as that implemented in the ERPIN program.
Results: We constructed alignments for 18 miRNA families and performed ERPIN searches on animal genomes. Results were compared to those of a WU-BLAST search at the same E-value cutoff. The two combined approaches produced 265 new miRNA candidates that were not found in miRNA databases. About 17% of hits were ERPIN specific. They showed better structural characteristics than BLAST-specific hits and included interesting candidates such as members of the miR-17 cluster in Tetraodon. Profile-based RNA detection will be an important complement of similarity search programs in the completion of miRNA collections.
MicroRNAs (miRNAs) are small non-coding RNAs regulating gene expression through mRNA degradation or translation inhibition. miRNAs were first discovered in the nematode Caenorhabditis elegans and then identified in numerous plant and animal species. Experimental data have shown that they regulate functionally important pathways related to development (Reinhart et al., 2000), cell death (Brennecke et al., 2003), cancer (Calin et al., 2004) and neurological diseases (Dostie et al., 2003). The 21–22-nt-long mature miRNAs are synthesized from a longer 70–100-nt precursor (pre-miRNA) forming a long hairpin structure that contains the mature miRNA in either of its arms.
Although the first miRNAs were discovered using experimental methods (Lee and Ambros, 2001; Lagos-Quintana et al., 2001) many more were predicted through various computational screens, such as comparative genomics, that can detect entirely new RNA families (Lai et al., 2003; Lim et al., 2003a, b). Release 3.1 of the microRNA registry (Griffiths-Jones, 2004) contains 899 different miRNA precursors identified in Metazoa (Drosophila melanogaster, C. elegans, Caenorhabditis briggsae, Homo sapiens, Mus musculus and Rattus norvegicus) and plants (Arabidopsis thaliana and Oryza sativa). The microRNA registry contains experimentally validated miRNAs or close homologues identified using local alignment programs such as BLAST (Altschul et al., 1990). However, sequence alignment alone may fail to identify miRNAs that diverged too far apart in their primary sequence while retaining their base-paired structure. We expected that a more sensitive approach exploiting information from both miRNA structure and sequence could significantly improve detection of homologous miRNA genes. We here report the prediction of new members of known miRNA families using such a computational approach.
Two major computational techniques can exploit both RNA structure and sequence alignments for non-coding RNA (ncRNA) searches. Stochastic Context Free Grammars, or SCFGs (Sakakibara et al., 1994; Eddy and Durbin, 1994), provide a fully probabilistic description of RNA sequences and structure, albeit at a high computational cost which can be prohibitive at genomic scales. Alternatively, the ERPIN program represents RNA alignments as weight matrices or profiles (Gautheret and Lambert, 2001) and identifies matching sequences using a combined dynamic programming/profile scan algorithm. ERPIN profiles therefore captures both primary and secondary structure information, which is particularly well adapted to miRNA precursor identification. In its latest version, ERPIN also compensates for the incompleteness of training sets using pseudo-counts and performs accurate E-value calculations. Here we use ERPIN to identify new microRNA precursors in animal genomes. We then compare ERPIN predictions to those made by the WU-BLAST program (Gish, 1996–2004, http://blast.wustl.edu) and show that current efforts to complete miRNA collections would benefit from using a profile-based approach.
MATERIALS AND METHODS
miRNA training sets
Precursor sequences were downloaded from the miRNA registry 2.2, containing 593 sequences (513 animal + 50 plant) (Griffiths-Jones, 2004). Due to the diversity of miRNAs, it is not possible to build a single profile describing all families. However, precursor sequences within a single family of related miRNAs (e.g. let-7) are reasonably conserved and can be aligned easily. To obtain families of homologous miRNAs, we started with a CLUSTALW (Thompson et al., 1994) alignment of all 513 animal miRNA precursors in the miRNA registry. From a visual inspection of the CLUSTAL-generated tree, we extracted the 18 most salient miRNA clusters, ranging in size from 6 to 27 sequences. Precursor sequences contained within each cluster were not necessarily homologous, but they were close enough in terms of sequence and structure to produce a useful signature for the subsequent profile search. Some known miRNA families are clearly grouped together in these clusters, such as the let-7 miRNAs, the miR-17 family or the miR-2/miR-13 family. Average identity within clusters was 77%. As no automated method is available to perform accurate structure-based RNA alignments, we realigned each cluster with CLUSTALW and deduced a consensus secondary structure using the ALIFOLD program (Hofacker et al., 2002) with default parameters and a covariance weight empirically set to 1 or 2. The 18 training sets are available through the ERPIN server Web site (Lambert et al., 2004, http://tagc.univ-mrs.fr/erpin/
The novelty of miRNA candidates was assessed by comparing ERPIN or BLAST hits to the latest version of the miRNA registry (Griffiths-Jones, 2004) (at time of submission: version 3.1, containing 828 animal sequences) and to the 117 precursors identified by Tanzer and Stadler (2004) using a similarity search procedure.
The search database contained the following genomic sequences: C.elegans, C.briggsae, D.melanogaster, Human (Goldenpath 8.33), Mouse (Goldenpath 30.16), Rat (Goldenpath 11.2) and Zebrafish (version 3) from Ensembl (ftp://ftp.emsembl.org); Chimpanzee, Pig and Rabbit genomes from Ensembl traces (ftp://ftp.ensembl.org/pub/traces/); Chicken, Ciona intestinalis (release 1.0), Fugu rubripes (version 3.0), Sea urchin and Xenopus tropicalis (assembly 1.0) from the JGI genome portal (ftp://ftp.jgi-psf.org/pub/JGI_data/); Ciona savignyi from the Broad Institute (ftp://ftp.broad.mit.edu/pub/annotation/ciona/assembly_4_25_2_003/); Bos taurus, Drosophila pseudoobscura and Rhesus macacus from Baylor HGSC (ftp://ftp.hgsc.bcm.tmc.edu/pub/data); and Tetraodon nigroviridis from the NCBI Trace database (ftp://ftp.ncbi.nih.gov/pub/TraceDB/tetraodon_nigroviridis/).
ERPIN E-values are computed based on a discrete convolution analysis of profile scores, as explained in detail in http://tagc.univ-mrs.fr/erpin/. Computed E-values were compared to simulations performed on random databases for a variety of training sets including tRNAs, snoRNAs, SECIS elements and miRNAs. In each case, a remarkable agreement was observed between computed E-value and simulation. Figure 1 shows simulated and computed numbers of hits at different score levels, for a search performed on 300 Mb random sequences of uniform nucleotide composition. Training sets corresponding to the let-7, miR-133 and miR-17 families are shown here, containing 27, 10 and 6 sequences respectively. In each case, computed E-values faithfully reproduce observed number of hits. Other miRNA training sets and other means of randomization based on order-1 or order-2 Markov models of actual genome sequence behave similarly (data not shown).
ERPIN and BLAST parameters
To improve the search when using underpopulated training sets, ERPIN uses pseudo-counts based on the Henikoff and Henikoff (1996) procedure. Implementation details are provided on the ERPIN Web site http://tagc.univ-mrs.fr/erpin/. A pseudo-count weight of 0.2 was used for all searches. Other ERPIN parameters were left at default values. Search regions included all helices and the conserved single strands shown in upper case in Table 1. Search parameters for all training sets, including those not shown in Table 1, are provided on the ERPIN Web site. All hits with an E-value below 0.01 were retained. BLAST searches were performed using the WU-BLAST package (Gish, 1996–2004, http://blast.wustl.edu), with a word-length parameter decreased from 11 to 7 to improve sensitivity.
Hits obtained from both programs were filtered using Dust (Tatusov, unpublished) and Repeat Masker (Smit, unpublished) to mask repeats and low complexity regions. All programs were run on a 10-CPU computer cluster, with an average runtime of 4.7 h to screen both strands of the complete 14.3 Gb database for a single miRNA family.
RESULTS AND DISCUSSION
To identify new miRNA precursors in animal genomes, we built training sets for 18 miRNA families, using a semi-automated procedure based on CLUSTALW (Thompson et al., 1994) for RNA alignment and ALIFOLD (Hofacker et al., 2002) for secondary structure annotation. These training sets were then fed into ERPIN to scan a 14.3 Gb database comprising a total of 20 complete or partial genomes. ERPIN detected a total of 553 hits with an E-value <0.01, of which 270 were not previously reported in the current miRNA databases. Most of these novel candidates have an even lower E-value (e.g. 216 hits at E ≤ 10−4), indicating that strong miRNA candidates remain unannotated in current databases. Only five out of the 270 candidates displayed a non-hairpin structure (e.g. V-shaped) when submitted to secondary structure prediction by the RNAfold program (Hofacker, 2003). The fact that most ERPIN predictions adopt a correct hairpin folding is comforting but expected, since ERPIN base-pair profiles generally entail strong penalties for base mismatches.
We next asked what proportion of the ERPIN hits would have been detected using a conventional similarity search program. We used the WU-BLAST program at a sensitive setting (BLASTN word length = 7), using each sequence in the 18 training sets in turn as a query. Running these queries against the same database produced 261 novel hits at an E-value cutoff of 0.01. Fourteen of these sequences were predicted to fold incorrectly by RNAfold.
Figure 2 shows a Venn diagram of ERPIN and WU-BLAST hits. There are 219 hits common to ERPIN and WU-BLAST, among which five are not folded correctly. Although most of the new precursors are detected by both programs, 51 candidates escaped sequence similarity search alone and were only detected by profile search. All of these were correctly folded into a single hairpin structure. There were also 42 BLAST-specific predictions, although nine of them were not predicted to fold correctly. ERPIN fails to detect some BLAST hits because it does not currently allow for insertions relative to the longest sequence in the training set, while BLAST does allow for insertions. However, ERPIN-specific hits are more abundant and better folded, especially at low E-values. If one only considers correctly folded candidates in Fig. 2, 72% are detected by both programs, 11% are BLAST specific and 17% are ERPIN specific. Furthermore, the proportion of ERPIN specific hits increases with E-value cutoff, as shown in Table 2 for cutoffs ranging from 10−1 to 10−5. When using a high confidence level (E-values of 10−5 or better), ERPIN specific hits reach 30%. Candidate miRNA sequences identified by ERPIN, BLAST or both programs are available as supplemental data.
Table 1 shows the 51 ERPIN-specific hits identified in this study along with their consensus secondary structures. It appears that some single strands in our alignments could be paired together. This is a corollary of the ALIFOLD-based annotation procedure that tends to reject those base pairs not supported by covariation. Our goal here was to use an alignment/search procedure that was as automatic and straightforward as possible, but it is likely that alignments could be further refined, thus improving search performances.
An important miRNA annotation effort is currently being undertaken as part of the RFAM database (Griffiths-Jones et al., 2003). RFAM proposes both ‘seed’ alignments (validated miRNA precursors) and SCFG-based predictions. An accurate comparison of these and our predictions is not currently possible, as RFAM candidates have no associated E-value and are not published or described in detail. In any case, current RFAM predictions show no overlap with ERPIN candidates in Table 1.
Recently, Tanzer and Stadler (2004) studied the evolution of ‘miR-17’ gene clusters located on chromosomes 7, 13 and X in the human genome. This cluster, composed of miR-17, miR-18, miR19a, miR-19b, miR-20, miR-25, miR-92, miR-93, miR-106a and miR-106b, is proposed to result from ancient miRNA duplication events followed by duplications of the entire cluster. Our initial clustering procedure grouped the majority of these miRNAs into a single training set (Fig. 3A), containing miR-17, miR-18, miR-20, miR-93, miR-106a and miR-106b, whilst miR-92 fell into another training set (Fig. 3B). The phylogenetic trees in Figure 3 show the relationships between training set sequences (light grey) and new hits identified from these training sets by both ERPIN and BLAST (noted ‘EB’) or by ERPIN only (noted ‘E’). No BLAST-specific hit was obtained in these runs. It is noteworthy that ERPIN performed better than BLAST in identifying distant homologues in these examples. For instance, BLAST did not identify any new member of the miR-106a/miR-17 family, while ERPIN identified two homologues in Tetraodon (Fig. 3A, black arrows). Likewise, the miR-93 homologue identified in Tetraodon is ERPIN specific (Fig. 3A, white arrow); and the miR-92 and miR-92a homologues identified in Ciona are all ERPIN-specific (Fig. 3B, arrows). In these examples, BLAST identified orthologues between two mammalian or two Drosophila species, but profile search detected miRNAs that diverged earlier in the metazoan lineage, such as Ciona hits identified from their vertebrate homologues (e.g. miR-92, Fig. 3B).
To determine whether a miR-17-like cluster was indeed present in Tetraodon, we mapped the three ERPIN-specific Tetraodon hits (Fig. 3A and Table 1) onto the current genomic assembly available at Genoscope (http://www.genoscope.cns.fr/externe/tetraodon). Although the Tetraodon hits tni-1 and tni-2 are from different scaffolds, they both map to the same location on chromosome 1 (position 5477786–5777854) with one mutation for tni-2. Either one of these hits comes from an as yet unassembled part of the Tetraodon genome, or they both come from the same miRNA with a sequence error or variation. The tni-3 miRNA maps at position 5477580–5477644 of chromosome 1, in the vicinity of tni-1/tni-2. This suggests that a cluster containing both miR-93 and miR-106/miR-17 homologues is present on chromosome 1 of Tetraodon, which is probably related to the miR-17 cluster in mammals. To our knowledge, these putative members of the Tetraodon miR-17 cluster were not previously reported.
RNA genes do not produce as noticeable phylogenetic footprints as their protein-coding counterparts, and their detection by similarity search program is often restricted to relatively close homologues. In the case of miRNA precursors, a program such as BLAST or WU-BLAST failed in our examples to detect some divergent orthologues such as between vertebrates and tunicates, or even between mammals and fishes. This tendency is not verified for all miRNA clusters: in some cases (data not shown), BLAST hits were more divergent from training set sequences than ERPIN hits. Therefore, both programs may fail to detect ‘interesting’ distant homologues. In any case, using ERPIN increased the number of novel miRNA candidates by 17% at an E-value of 10−2, or by 25% at an E-value of 10−4, when compared to a BLAST search. Therefore, a comprehensive annotation of miRNA in genomes should involve both sequence similarity search and an RNA-specific profile or SCFG search. The detection of miRNA candidates for experimental validation will be significantly improved by combining these computational tools.
Supplementary data for this paper are available on Bioinformatics online.
aSpecies: cin (Ciona intestinalis), gga (Gallus gallus) cel (Caenorhabditis elegans), fru (Fugu rubripes), tni (Tetraodon nigroviridis), xtr (Xenopus tropicalis), dre (Dario rerio), rno (Rattus norvegicus), csa (Ciona savignyi), mmu (Mus musculus), cbr (Caenorhabditis briggsae.)
bSequences are aligned as in training set; boxes indicate secondary structure helices.
|ERPIN and BLAST||214||214||185||152||121|
|ERPIN and BLAST||214||214||185||152||121|
The authors thank Dr. Pascal Hingamp and Dr. Rémi Houlgatte for their careful reading of the manuscript and useful suggestions.