RIsearch: fast RNA–RNA interaction search using a simplified nearest-neighbor energy model

Motivation: Regulatory, non-coding RNAs often function by forming a duplex with other RNAs. It is therefore of interest to predict putative RNA–RNA duplexes in silico on a genome-wide scale. Current computational methods for predicting these interactions range from fast complementary-based searches to those that take intramolecular binding into account. Together these methods constitute a trade-off between speed and accuracy, while leaving room for improvement within the context of genome-wide screens. A fast pre-filtering of putative duplexes would therefore be desirable. Results: We present RIsearch, an implementation of a simplified Turner energy model for fast computation of hybridization, which significantly reduces runtime while maintaining accuracy. Its time complexity for sequences of lengths m and n is with a much smaller pre-factor than other tools. We show that this energy model is an accurate approximation of the full energy model for near-complementary RNA–RNA duplexes. RIsearch uses a Smith–Waterman-like algorithm using a dinucleotide scoring matrix which approximates the Turner nearest-neighbor energies. We show in benchmarks that we achieve a speed improvement of at least 2.4× compared with RNAplex, the currently fastest method for searching near-complementary regions. RIsearch shows a prediction accuracy similar to RNAplex on two datasets of known bacterial short RNA (sRNA)–messenger RNA (mRNA) and eukaryotic microRNA (miRNA)–mRNA interactions. Using RIsearch as a pre-filter in genome-wide screens reduces the number of binding site candidates reported by miRNA target prediction programs, such as TargetScanS and miRanda, by up to 70%. Likewise, substantial filtering was performed on bacterial RNA–RNA interaction data. Availability: The source code for RIsearch is available at: http://rth.dk/resources/risearch. Contact: gorodkin@rth.dk Supplementary information: Supplementary data are available at Bioinformatics online.

, and multiplied by -100 to build the score. This has the advantage that we get integers and can maximize over all possible local 'alignments'. Letters in the first column denote consecutive nucleotides in the query sequence (5' to 3'). Letters in the first row refer to the target sequence (3' to 5').
Reading example: The energy contribution of stacking a CG pair onto an AU pair (-2.2 kcal/mol) can be found in the row 'AC' and column 'UG', resulting in 220. -∞ prevents alignments ending with a mismatch, it is set to -2000 in the implementation. * denotes cases that the algorithm itself forbids, such as having a gap in one sequence followed by a gap in the other (insertion after deletion). It is set to an arbitrary value, as it is never read. ('TurnerNNDBxx') should be considered as reference, methods deviate from this because of rounding and because some of the more complex rules are not implemented in the algorithms. (*) The same predictions as given by RNAduplex, were also reported by RNAcofold [4], BINDIGO [5] (web server only), and PairFold (web server) [6]. Apparently, RNAhybrid does not apply the intermolecular initiation energy that is typically given with +4.1 kcal/mol, thus yielding much lower energies. The interaction in the first column is a self-complementary duplex, but none of the methods seems to implement the suggested symmetry correction. The third duplex shows the stack of GU followed by UG in two different contexts, stabilizing and destabilizing. As we only look at one previous position we cannot incorporate these scores. Similarly, look-up tables for the 2×2 internal loop in duplex number six are not used by RIsearch.  All measurements in this section were done with the GNU time command. Run times are given in seconds, the total amount of CPU time spend in user and kernel mode (%U + %S). For memory consumption we report the maximum resident set size (RSS) in Kilobytes (%M). There is a known bug in time, so this value is four times too large. However, the relative memory reduction is not affected. Values given in the following tables are what is reported by time without correction. For RNAplex we specified parameter -f 2, so it uses the simple energy model in the backtracking as well, instead of the full energy model. For both tools, RIsearch and RNAplex, we set the per-nucleotide penalty to 0.3 kcal/mol. All other parameters were left at their default values.

Using accessibility information
In order to run RNAplex -a, accessibility profiles need to be computed with RNAplfold. As memory requirements for the larger chromosomes exceeded our recources, we used human chromosome 21 here (the smallest one). With the recommended settings, RNAplfold runs more than 26 hours and uses more than 24 GiB to compute the accessibility profiles. The subsequent run of RNAplex in its current implementation also is more resource-demanding when making use of this information in comparison to the simple version. Screening this target with 5 miRNAs as query (as in Supplementary Table 3) Table 5: Prediction accuracy. Sensitivity (also recall or true positive rate (TPR)), positive predictive value (PPV) or precision, F-measure, and MCC (harmonic and geometric mean of the first two) were calculated for the set of 17 experimentally verified sRNA-mRNA interactions. Averages are given in the last four lines. We tested RNAplex using precomputed accessibility profiles (plex-a), as well as the basic version (plex-c) with per-nucleotide penalty -c 30. RIs99 and RIs04 stand for RIsearch using 1999 and 2004 parameter set, respectively, also with per-nucleotide penalty of 30 (= 0.3 kcal/mol). Numbers shown in gray italics refer to interactions that have not been found as the single best-scoring, but only when taking into account suboptimal solutions. For interactions marked with an asterisk (*), we have extracted a longer and a shorter version from the literature. For example, [8] identified residues that were protected in in vitro footprinting experiments and extended the target sites by biocomputational predictions. For this table we used the shorter forms (SF, with boundaries as given in Main Table 1). When instead using the longer forms (LF, maximum numbers of pairs shown in the original papers), we get the average measures as reported in the last two lines. When excluding suboptimal solutions (−subopt), RNAplex with accessibility misses only one interaction, while the other methods miss five each. When the top prediction does not share a base pair with the experimentally verified location, they contribute with 0 to the average. In all these cases it is enough to look at the three best suboptimal solutions in order to find one that overlaps the verified location. When allowing these suboptimal solutions (+subopt), the values as printed in gray italics contribute to the average.  Table 6: The table shows the interacting gene and miRNA, with chromosome information for target site location (name, strand, and the number of bases that have not been masked). For each tool, we report a threshold (thr) found by looking for the highest scoring hit that overlaps a verified interaction site of this mRNA-miRNA pair. For RIsearch and RNAplex this threshold is the ∆G, for GUUGle the match length, for TargetScanS the context+ score, for miRanda again the energy. As count we report the number of hits (of the given miRNA within this chromosome, direction) that fulfill this threshold, i.e., all predictions that score at least as good as the best verified interaction for that pair. GUUGle* uses only the seed of the miRNA as query (nt 1-8) instead of the whole mature miRNA (in GUUGle). For RIsearch and RNAplex we additionally intersected the hits with those from GUUGle* and present the counts in the last two columns (number of predictions that overlap complete GUUGle seed matches and fulfill the energy threshold as applied for RIsearch and RNAplex respectively). NF stands for 'not found'. We use two different methods to evaluate performance and give the results in the last two rows where the best result is highlighted in bold.  Table 7: RIsearch as pre-filter. For each of the interactions, we report the unfiltered number of predictions made by TargetScanS and miRanda, together with the relative reduction achieved by different (combination of) tools. G*: GUUGle* as described in previous table; RIs: RIsearch with a threshold of -11 kcal/mol; G*∩RIs: combination of both. Interactions that are denoted as not found (NF ), have been found by the filter, but not by the respective method. Results are summarized as their averages below. The last row shows the according averages, when using GUUGle instead of GUUGle*.
These results show that GUUGle can hardly reduce TargetScanS candidates. This is because TargetScanS uses an even stricter seed requirement, all candidates identified by TargetScanS are also found by GUUGle. The only exception is the "7mer-1a" criterion, an exact match to positions 2-7 of the mature miRNA (the seed) followed by an 'A'. In these cases, a perfect complementary stretch of six nucleotides is sufficient to be considered as candidate for TargetScanS. There are only three interactions where GUUGle in fact reduces the number of candidates.
For miRanda the trend is not as strong, but also here we see that RIsearch alone achieves a bigger candidate reduction than GUUGle alone. As both tools filter out different candidates, their combined effect is strongest.

10
The reductions that can be achieved by RIsearch as a prefilter differ widely (see Suppl. Table 7, e.g. for TargetScanS relative reduction varies between 1% and 70%). Part of the explanation is the difference in GC-content of the mature miRNA sequences. The higher the GC-content, the more likely are low binding energies. With the conservative threshold of -11 kcal/mol, the list of candidates can not be reduced substantially in those cases. One could address this, by choosing a stricter cut-off for miRNAs with a potentially stronger interaction. This relation can be seen in the figure to the right. The Pearson correlation coefficient r between the GC-content and the reduction in miRanda hits is -0.6595 (p-value: 0.00046) and for TargetScanS -0.6996 (p-value: 9.953e-5).