Annotating RNA motifs in sequences and alignments

RNA performs a diverse array of important functions across all cellular life. These functions include important roles in translation, building translational machinery and maturing messenger RNA. More recent discoveries include the miRNAs and bacterial sRNAs that regulate gene expression, the thermosensors, riboswitches and other cis-regulatory elements that help prokaryotes sense their environment and eukaryotic piRNAs that suppress transposition. However, there can be a long period between the initial discovery of a RNA and determining its function. We present a bioinformatic approach to characterize RNA motifs, which are critical components of many RNA structure–function relationships. These motifs can, in some instances, provide researchers with functional hypotheses for uncharacterized RNAs. Moreover, we introduce a new profile-based database of RNA motifs—RMfam—and illustrate some applications for investigating the evolution and functional characterization of RNA. All the data and scripts associated with this work are available from: https://github.com/ppgardne/RMfam.


INTRODUCTION
Characterizing functional RNAs is an extraordinarily difficult task. Even highly transcribed RNAs from model organisms have remained uncharacterized for decades after their discovery. A specific example is the 6S sRNA, which was discovered in 1971. The 6S sRNA is conserved across Bacteria and is highly expressed in stationary-phase cells (1,2). But the role of 6S as a regulator of RNA polymerase remained an enigma for almost three decades (3). Likewise, Y RNA, which was discovered in 1981, is broadly conserved across metazoans and is highly expressed (4). It took two and a half decades before Y RNAs were shown to be essential for the initiation of DNA replication (5). However, the mechanism for Y RNA function still remains unclear. These and similar examples show that it is remarkably difficult to functionally characterize RNAs, even after decades of work.
A new generation of tools for RNA discovery is now available thanks to powerful new sequencing technologies. Entire transcriptomes from species at different life stages, tissue types and conditions can be studied with RNA-seq (6)(7)(8). The total complement of RNA structures encoded in transcriptomes is also accessible with SHAPE-seq (9) and functional regions of entire genomes of bacteria can be probed with techniques like TraDIS and Tn-seq (10,11). The data obtained by these tools are unearthing novel RNAs at an unprecedented rate, many of which are evolutionarily conserved, highly expressed, activated under specific conditions, essential and fold into conserved secondary structures. Annotation efforts such as those by the Rfam consortium (12)(13)(14) are useful. However, many RNAs are not found in this database and many that have been curated remain uncharacterized (8). To make sense of the volumes of transcriptome data that is now being generated, annotating this data and functionally characterizing the cohort of RNAs of Unknown Function (RUFs) is critical. A complication for such work is that evolutionary turnover, as well as sequence variation can be high for ncRNAs (15,16). Consequently, homology searches and other sequence-alignmentbased analyses can be very challenging.
For the purposes of this work we define a RNA motif as a functional RNA structure that recurs within or across different RNA families. A motif may be characterized by a blend of primary, secondary and tertiary structural features. The motifs that have been characterized to date are involved in a diverse number of functions, including increasing structural stability (e.g. the GNRA tetraloop (17)(18)(19)), facilitating interactions with other biomolecules (e.g. the CsrA-binding motif (20)(21)(22)), specifying sub-cellular localization (e.g. the SRP S-domain (23)) and coordinating gene regulatory signals (e.g. the HuR mRNA binding motif (24)).
A number of publications detail bioinformatic methods for the de novo discovery of RNA secondary structure motifs from RNA primary sequences (25,26). There are also tools that can screen predicted RNA secondary structures (27) and RNA tertiary structures (28) for shared structural features. The knowledge-based approaches for the annotation of RNA motifs include sequence and structure descriptors (Eddy,S., unpublished data,29), primary and secondary structure-based profile methods for specific motifs, e.g. (30,31), and methods that combine primary, secondary and tertiary data (32). We complement these approaches by introducing a resource that identifies a range of previously characterized RNA motifs in RNA sequences and alignments using profile hidden Markov models (HMMs) (33)(34)(35) and covariance models (CMs) (35)(36)(37).
We present 34 alignments, consensus structures and corresponding probabilistic models of published RNA motifs. We call this resource RMfam, or RNA Motif Families (all associated data and computer code are freely available from our repository hosted on GitHub: http://github.com/ ppgardne/RMfam). These have been used to predict ∼1900 conserved motifs in the Rfam (v11.0) alignments of RNA families (these are available in Rfam (v12.0) (14)); many of which are confirmed in the published literature. Finally, we show examples of the applicability of our approach for studying RNA function, evolution and alignment curation.

The distinction between Rfam and RMfam
The Rfam database collects and curates 'seed alignments' of RNA families. These are non-coding RNAs, cis-regulatory elements and self-splicing introns. The alignments are manually constructed and annotated with consensus secondary structures, and used to seed probabilities for CMs for each family. The Rfam CMs are widely used for genome annotation projects to identify RNA loci (e.g. (38)). A requirement before each family can pass Rfam quality-control is that it is specific. In other words, there exists a bit score threshold for each CM that distinguishes between sequence matches that are related to the family and obvious false-positive matches. Consequently, many RNA motifs are not included in Rfam as they lack the required specificity (12)(13)(39)(40)(41). However, the Rfam 12.0 (2014) release of the Rfam database includes RMfam annotations for the first time (14).

What is an RNA motif?
As described in the Introduction, we define RNA motifs as functional RNA structures that recur within or across different RNA families. These are a blend of primary, secondary or tertiary structure. Fortunately, for the purposes of this work, the majority of internal RNA contacts are local (i.e. within 100 nucleotides) (42), therefore, the local probabilistic models (described below) can be used to capture the bulk of the information.
An example RNA motif is the GNRA tetraloop (see Figure 3). This RNA motif is one of the most prevalent hairpins found in a number of RNA families, including rRNA, RNase P, a variety of riboswitches, self-splicing RNAs and many others. It is characterized by a hairpin loop that contains a 4 nucluetide sequence matching the pattern 'GNRA'. The most prevalent of these are GAAA, GUGA, GCAA and GAGA. The terminal 'G' and 'A' are frequently involved in a non-canonical base pair (18), however, the loop may also be involved in some long-range tertiary interactions (17) which can be mimicked by a range of alternative conformations (19). Therefore, the GNRA tetraloop meets our criteria for a RNA motif as it is a functional structure that recurs in multiple families.
Accurate local probabilistic methods for annotating structured RNAs on DNA sequences called HMMs and CMs are now available (33)(34)(35)(36)(37)43). From a given alignment, probabilistic models of conserved sequence (HMMs) and conserved sequence plus secondary structure (CMs) can be built and used to filter large numbers of sequences for candidate homologous and/or analogous regions (44). CMs cater to the characteristics of RNA sequence evolution that are imposed by base pairing (i.e. variation tends to preserve base pairing), the result is that the accuracy of CMs is greater than alternative approaches (45). The computational speed of CMs has tended to be poor, however, a lot of effort has been expended on improving the speed of the approach while maintaining the accuracy. The improvements include using HMMs as pre-filters to accelerate CMs, query-dependent banding and Dirichlet mixture priors (43)(44)(46)(47)(48).
RMfam sequences, structures and alignments were collated from a variety of heterogeneous and sometimes overlapping data repositories (12,24,28,31,(49)(50)(51)(52)(53)(54)(55)(56). Where possible we sourced data from publicly accessible RNA motif resources, these included the FR3D MotifLibrary (54), the models supplied with RMDetect (32), the comparative RNA website (52) and SCOR (51). We also used information from specialized resources, such as the k-turn structural database (49) and SRPDB (57), as well as generating our own alignments for motifs, such as the Shine-Dalgarno and Rho-independent terminators based upon the context of genome annotations (e.g. (31)). RNAFrabase was frequently consulted for RNA secondary structures derived from Protein Data Bank (PDB) structures (58,59). Finally, where necessary, we extracted sequences from publications. This was often a manual effort, involving manually transcribing sequences and structures from figures in published manuscripts. Where possible, these were mapped to nucleotide sequences derived from the PDB (downloaded June 2014) (60-62), the EMBL nucleotide archive (63) and Rfam (v11.0) (12,13). The provenance of each data set is stored in the corresponding Stockholm alignment. Each of these motifs were then passed through quality control steps, where the sensitivity and specificity of the resulting motif is assessed (see Figure 1 and Supplementary Figures  S10-S43). If these failed (e.g. the CM cannot identify member sequences or the false-positive rate is extremely high), then the motif was not included in the database. Each motif is also assigned a curated score threshold. This threshold (in bits) provides a reasonable distinction between true and false matches.

A benchmark of motif annotations
In the following we briefly describe the benchmarks we have used to evaluate our motif annotations. The benchmarks are described in further detail and with more elaborate results in the Supplementary Results.
In order to determine the accuracy of our approach we ran a series of three benchmarks. These were evaluated on individual motifs (see Figure 1B and Supplementary  Figures S10-S43), as well as on the collective RMfam results (see Figure 1A and Supplementary Figure S9

ROC plots for 3 RMfam benchmarks
Specificity Sensitivity  seed alignments. Ten shuffled sequences, with identical dinucleotide distributions, were generated for each RMfam seed sequence (64). Together these serve as positive and negative controls for our test. We constructed two further tests based upon Rfam (v11.0) families. We identified Rfam families where there exists good evidence (primarily based upon reviewing the RNA literature) that a motif is conserved in the family of related sequences (Supplementary Table S1, also available at http://github.com/ppgardne/RMfam/benchmark/ true positives.txt). These serve as positive controls for two further tests. For the 'Rfam sequences' benchmark we randomly selected at least five sequences from each Rfam seed alignment (if fewer than five sequences were available, then all were included). We generated 10 shuffled versions of each sequence; all had an identical di-nucleotide distribution to the native sequence. These sequences were all annotated with RMfam motifs, their CM scores were recorded and used to evaluate the accuracy of the annotations. Finally, for a 'Rfam alignments' benchmark, we evaluated the accuracy of RMfam annotations in an alignment context. Each Rfam alignment was filtered, removing sequences more than 90% identical. The remaining sequences were annotated with RMfam CMs, retaining only those that cover more than 10% of the seed sequences and more than two Rfam seed sequences. The summary statistic we use for this final benchmark is a 'sum-bits' score, this is the sum of the bit scores for each match in all the sequences in a seed.
The accuracy metrics that we report here are the Matthew's correlation coefficient (MCC) (65), sensitivity and specificity. All of our secondary structure diagrams are illustrated with R2R (66).
The CMs built from RNA motifs tend to be short and contain little sequence information. In RMfam the mean sequence length is just 34.3 nucleotides and the mean number of base pairs is 10.9. Therefore, a scan of a large sequence database with these models will result in a number of falsepositives. We propose that annotating sequence alignments of ncRNAs has the potential to improve the specificity of our predictions. This assumes that evolutionarily conserved motifs are more likely to be correct. In theory, this approach could be extended to genome alignments of, e.g. transcribed regions.

RESULTS
In this study we present 34 RMfam alignments and probabilistic models of published RNA motifs (all freely available from our repository hosted on GitHub: http://github. com/ppgardne/RMfam). These have been used to predict ∼2500 conserved motifs in the Rfam (v11.0) seed alignments; many of which are confirmed in the published literature. Furthermore, our permutation tests have shown that both the sensitivity and specificity of this approach is remarkably high given the short motifs we use (see Figure 1 and Supplementary Figures S9-S44).

Inference of RNA function with motifs
One of the most labour-intensive stages of RNA research is identifying the function of newly discovered RNAs. In order to illustrate the utility of RMfam for this task we show the matches between a model of the CsrA-binding site and two RNA families of unknown function, TwoAYG-GAY and Bacillaceae-1 (Rfam IDs RF01731 and RF01690, see Figure 2). CsrA is a bacterial RNA-binding protein that regulates the translation and stability of mRNAs (20). It binds mRNAs carrying CsrA binding motifs, these physically occlude ribosome-binding sites. This binding can itself be regulated by competition between the mRNAs and highly expressed sRNAs that host numerous CsrA binding sites. However, this class of sRNA (CsrB, CsrC, RsmX, RsmY and RsmZ) has only been identified in Gammaproteobacteria (21,22). The TwoAYGGAY and Bacillaceae-1 families were initially discovered by a large-scale bioinformatic screen (67). Some further analysis identified two tandem-GAs in one of the stems that characterize the structure of TwoAYGGAY (32). Our motif-based analyses have identified strong matches between the TwoAYGGAY family, the reverse-complement of the Bacillaceae-1 family and the CsrA-binding motif. These provide a testable hypothesis for further validation, that there are also CsrA-binding sRNAs in Clostridia (TwoAYGGAY), and Bacillales and Lactobacillales (Bacillaceae-1). The validation of these predictions is a work in progress with our collaborators.

Evolution of RNA motifs
Non-coding RNAs are remarkably tolerant of genetic variation, as evident by the wide degree of sequence variation that can be found between evolutionarily related ncR-NAs (16,(68)(69)(70). However, structure frequently constrains the evolution of RNA sequences. That said, structures can also be dynamic. For example, motifs that confer structural stability can be exchanged over time, resulting in a rich and complex evolutionary history. This illustrates that studying the gain and loss of RNA motifs over evolutionary timescales can help characterize the dynamic evolution of RNA sequences and structures.
A good example of this is the Lysine riboswitch. This is a convenient example, that for illustrative purposes we will describe in further detail. As illustrated in Figure 3 many motifs may be exchanged, e.g. the U-turn motif with a k-turn in the P2 stem or the T-loop and the GNRA tetraloop in stem P4. Interestingly, the motif distributions are relatively clade-like, with closely related riboswitches more likely to share motifs, e.g. the GNRA tetraloop is found in Lysine riboswitches from the Pasteurellales and Vibrionales taxonomic groups. This type of annotation information is valuable for researchers investigating the structure and evolution of RNA families.

RNA motifs for curating RNA alignments
Another use of the results presented in this work is of importance for the curators of RNA alignments and sequences (12,(71)(72). Until now it has been difficult to analyse the evolutionary conservation of motifs in the context of an alignment, although some progress has been made when crystallographic data is available, e.g. the RNASTAR collection of structural RNA alignments (72). With the help of RMfam, malformed alignments can be detected and corrected where conserved RNA motifs are incorrectly aligned. We illustrate an example of this for the Rfam (v11.0) 5S rRNA alignment that contains a misaligned, yet highly conserved sarcin-ricin motif (see Supplementary Figure S45), and for the Rfam RsmY alignment, which is a CsrAbinding sRNA. The RsmY alignment has a misannotated consensus structure that does not include a further CsrAbinding motif, which are hairpin motifs that host a 'GGA' sequence in the loop (see Supplementary Figure S46). These motifs generally occur in pairs, as CsrA is a homodimeric protein, with each half of the protein binding a RNA motif (73,74).

DISCUSSION AND CONCLUSION
The chief motivation for this work is to functionally characterize novel ncRNAs. Our vision for the RMfam resource is to annotate RUFs (e.g. (8)). These motif annotations will help develop further functional hypotheses and accelerate experimental characterization.
In this work, we have shown that RMfam is surprisingly accurate. Despite the fact that the average RMfam motif consists of just 34.3 nucleotides and 10.9 base pairs, we show that the CMs are specific enough to distinguish between motif-hosting sequences and negative control sequences (see Figure 1 and Supplementary Figures S10-S43). Our approach shows improved performance when evolutionary information encoded in Rfam sequence alignments is incorporated into the predictions. We hypothesize that annotated genome alignments may be a useful source of motifs and we will investigate this idea further in future.
One apparent weakness of employing CMs and HMMs for motif annotation is that the 5 and 3 halves of internalloop motifs (e.g. the k-turns, sarcin-ricin and tandem-GA loops) can in theory be a large distance apart, and therefore outside the QDB window covered by CMs (47). In practice, the distance between nucleotides involved in intramolecular contacts rarely exceeds 100 nucleotides (42), therefore, the majority of these motifs are still captured by CMs and HMMs.
As a discovery tool the RMfam resource has already made some useful predictions. We have predicted the existence of two new CsrA-binding ncRNAs, potentially the first of this class of regulatory molecules to be found outside of the Gammaproteobacteria. However, further work needs to be carried out to validate this claim.

Future work and potential applications
We have identified some future developments and applications for the RMfam resource. We plan to continue developing the accuracy of the motif annotation tools as well as increase the access to RMfam annotations via other databases, such as Rfam (v12.0) (14), and expand the number of motifs included in RMfam. Furthermore, it may be possible to boost the accuracy of RNA secondary structure prediction tools by constraining these with predicted motifs. We elaborate further on these ideas below.  . The Lysine riboswitch has substituted different motifs through its evolution. On the left is a representation of the consensus Lysine riboswitch secondary structure (66). This has been annotated with the most frequent motifs the RMfam annotates in the Lysine Rfam (v11.0) seed alignment, the percentage of seed sequences hosting each motif is also indicated. On the right is an annotated species taxonomy that illustrates the phylogenetic nature of the motif distributions. We have also annotated each tip with the motifs hosted in the P2 and P4 stems. The red, blue, green, black and yellow boxes illustrate kink-turn, U-turn, sarcin-ricin loop, GNRA tetraloop and the T-loop, respectively.
The Lysine riboswitch example raises the possibility that certain types of motif are preferentially exchanged during the evolution of ncRNAs. Do stable hairpin motifs, such as the GNRA and T-loops, replace each other more frequently than we expect by chance? This would blur the lines between our understanding of homologous and analogous structures (75). Another possibility is that certain motifs cooccur more frequently than we expect. For example, are kturns more frequently closed by U-turns than we expect? If correct, these enrichments of favoured exchanges and cooccurances could be used to increase our confidence in motif annotations and can assist with the design of functional RNAs.
Typical RNA structure prediction methods to not incorporate information about RNA motifs. We propose that RMfam predictions can be used as constraints for existing RNA structure prediction software, thus improving the accuracy of structure prediction tools which can often be in-accurate (76). This approach is analogous to the fragmentlibrary approach that is frequently used for tertiary structure prediction (77).
Another application for RMfam CMs is as a pre-filter to accelerate the more complex methods, for example, the Bayesian network approach implemented in RMdetect (32).
Increasing the access of motif annotations is another goal of the authors. We are active in the Rfam consortium which curates non-coding RNAs alignments (12)(13)(14). Our results show that curators can benefit greatly from motif annotations (see Supplementary Figures S44 and S45) and it is likely that RMfam annotations will be incorporated into further databases in future releases.
New technologies, such as the sequencing of cross-linked RNA and protein, are a potential source of new RNAprotein motifs. In the future we will mine these data sets (78)(79)(80) for new additions to the RMfam database. Further-more, we will continue to add new motifs to RMfam as they are published.
Finally, as previously mentioned, the specificity of the RMfam annotations is generally low. However, incorporating the genomic and taxonomic context of annotations into the predictions may result in performance gains. For example, Shine-Dalgarno and Rho-independent terminators are generally located in bacterial sequences and at the extremities of annotated genes. A probabilistic incorporation of contextual information will likely result in further performance gains.
In summary, we have developed a resource for annotating diverse sets of RNA motifs in nucleotide sequences and alignments. We have proven the accuracy using benchmarks, and the utility of this resource for alignment curation, evolutionary analyses and shown that it has some promise for the prediction of RNA function.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.