Rfam is a comprehensive collection of non-coding RNA (ncRNA) families, represented by multiple sequence alignments and profile stochastic context-free grammars. Rfam aims to facilitate the identification and classification of new members of known sequence families, and distributes annotation of ncRNAs in over 200 complete genome sequences. The data provide the first glimpses of conservation of multiple ncRNA families across a wide taxonomic range. A small number of large families are essential in all three domains of life, with large numbers of smaller families specific to certain taxa. Recent improvements in the database are discussed, together with challenges for the future. Rfam is available on the Web at http://www.sanger.ac.uk/Software/Rfam/ and http://rfam.wustl.edu/.
Received September 15, 2004; Revised and Accepted October 8, 2004
Non-coding RNA (ncRNA) genes produce a functional RNA product instead of a translated protein. These products are components of some of the most important cellular machines, such as the ribosome (ribosomal RNAs), the spliceosome (U1, U2, U4, U5 and U6 RNAs) and the telomerase (telomerase RNA). The known repertoire of ncRNA cellular functions is expanding rapidly. Small nucleolar RNAs (snoRNAs) guide essential modifications of ribosomal and spliceosomal RNAs [reviewed in (1)]. Ribozymes catalyse a range of reactions, such as self-cleavage of hepatitis delta virus transcripts, and 5′ maturation of transfer RNAs (tRNAs) by the ubiquitous RNase P. A class of small RNAs almost unknown before 2000, the microRNAs (miRNAs), are found to be involved in regulation of ever more processes in higher eukaryotes (including development, cell death and fat metabolism) by repressing the translation of mRNA targets [reviewed in (2)]. Similar mRNA-binding regulatory roles in bacteria are fulfilled by distinct families of small RNAs [reviewed in (3)].
Like protein-coding genes, ncRNA sequences can be grouped into families and much can be learnt about structure and function from multiple sequence alignments of such families. Unlike proteins, ncRNAs often conserve a base-paired secondary structure with low primary sequence similarity. The combined secondary structure and primary sequence profile of a multiple sequence alignment of ncRNAs can be captured by statistical models, called profile stochastic context-free grammars (SCFGs), analogous to profile hidden Markov models (HMMs) of protein alignments.
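The point that base pairing can be conserved while primary sequence diverges can be illustrated with a small sketch. The three sequences and the dot-bracket consensus structure below are invented toy data, not taken from any Rfam family, and the helper names (`paired_columns`, `pair_conservation`, `identity`) are hypothetical. The code only counts canonical pairs and identical columns; it is not a profile SCFG, which would additionally assign probabilities to paired and unpaired columns.

```python
from itertools import combinations

# Watson-Crick pairs plus the G-U wobble pair
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def paired_columns(structure):
    """Return (i, j) alignment columns paired in a dot-bracket consensus structure."""
    stack, pairs = [], []
    for idx, ch in enumerate(structure):
        if ch == "(":
            stack.append(idx)
        elif ch == ")":
            pairs.append((stack.pop(), idx))
    return pairs

def pair_conservation(seq, pairs):
    """Fraction of consensus pairs that are canonical base pairs in this sequence."""
    return sum((seq[i], seq[j]) in CANONICAL for i, j in pairs) / len(pairs)

def identity(a, b):
    """Fraction of aligned columns holding the same residue in both sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Toy alignment (hypothetical): a 4 bp stem closing a 4 nt loop
structure = "((((....))))"
alignment = {
    "seq1": "GGCAUUCGUGCC",
    "seq2": "AUGCGAAAGCAU",
    "seq3": "CCAGUUAACUGG",
}

pairs = paired_columns(structure)
for name, seq in alignment.items():
    print(name, "canonical pairs:", pair_conservation(seq, pairs))
for (n1, s1), (n2, s2) in combinations(alignment.items(), 2):
    print(n1, n2, "identity:", round(identity(s1, s2), 2))
```

In this toy case every sequence forms the full stem although no two sequences share more than a few identical columns, which is the signal a sequence-only profile would miss.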
Rfam is a database of ncRNA families represented by multiple sequence alignments and profile SCFGs, available via the Web at http://www.sanger.ac.uk/Software/Rfam/ and http://rfam.wustl.edu/. All the data are also available for download, local installation and sequence searching using the INFERNAL software package (http://infernal.wustl.edu/) (4). The Rfam/INFERNAL model is much like the Pfam/HMMER system (5), extended to deal with RNA secondary structure consensus, and has been discussed previously (6). Here, we concentrate on recent improvements and discuss challenges that we expect to address through future development.
The database has grown dramatically over the past two years: from 25 families annotating around 55 000 regions in the nucleotide sequence databases in release 1.0, to 379 families annotating over 280 000 regions in release 6.1. This growth is partly due to a significant increase in scope. The evolution of some large gene families, such as miRNAs and snoRNAs, is constrained partially by inter-molecular base-pairing, and thus they do not conserve significant sequence or secondary structure. While we cannot therefore represent all C/D box snoRNAs, or all miRNAs, with a single alignment and model, subfamilies are conserved and are now well represented in the database. Rfam also now includes not only bona fide ncRNA genes, but also structured regions of mRNA transcripts. These fall into two broad classes: self-splicing introns and cis-regulatory elements in the untranslated regions (UTRs). The latter can act as detectors for a wide range of environmental conditions [e.g. bacterial riboswitches bind a range of metabolites, as reviewed previously (7, 8), and the 5′-UTR of the PrfA mRNA acts as a temperature-dependent switch (9)] to regulate message stability or translational efficiency.
This increased scope has led to the introduction of a limited type ontology, with the top-level types representing the three classes of structured RNA discussed above: ‘Gene’, ‘Intron’ and ‘Cis-reg’. The database currently contains 308 gene families, 69 cis-regulatory elements and two self-splicing introns. The type field provides one of the primary entry points for family browsing and searching, enabling the user, for instance, to quickly identify all snoRNA gene families or to find all riboswitches in the database.
One of the primary uses of the Rfam database is to search for homologues of known RNAs in a query sequence, including a complete genome. Indeed, the profile SCFG library has been used to annotate a number of newly sequenced genomes [e.g. Caenorhabditis briggsae (10), chicken (11) and Erwinia carotovora (12)]. In addition, we calculate hits in over 200 complete genomes and chromosomes. These data are available through the web interface and are discussed briefly in the following section.
NON-CODING RNAS IN COMPLETE GENOMES
Rfam makes available annotation of over 13 400 candidate ncRNA genes (plus 172 self-splicing introns and 1285 cis-regulatory RNA elements) belonging to 172 families in 224 completed chromosomes and genomes. The average bacterial genome contains over 80 hits, dominated by the number of tRNAs. A total of 170 regions are annotated in Escherichia coli, in which most experimental validation of computationally predicted ncRNAs has been carried out. Rfam-annotated regions in Bacillus genomes (B.anthracis is shown in Figure 1) include a number of recently described riboswitches (7, 8).
These data provide the first comprehensive view of the distribution of ncRNAs in the three domains of life. There are a small number of very large families representing some of the best-understood RNAs. Figure 2 shows that these few large families are the only RNAs that are ubiquitous across all three domains: only the essential translation components, tRNA and ribosomal RNA, together with RNase P (tRNA maturation) and SRP RNA (protein export), are found in eukaryotes, bacteria and archaea. It is tempting to believe that very few families will be added to the catalogue of universally conserved RNAs. However, it is clear that members of some families are so highly divergent as to be computationally almost unrecognizable. For example, although most eukaryotes would be expected to have a telomerase RNA, current computational techniques are unable to identify homologues in even well-studied model organisms such as Caenorhabditis elegans.
Only snoRNAs are found in eukaryotes and archaea but not in bacteria; no RNA families have yet been identified that are common to bacteria and archaea but not eukaryotes, or to eukaryotes and bacteria but not archaea. The vast majority of Rfam families are small, and are often specific to one taxonomic group, and in some cases to one organism, suggesting relatively recent evolution of function or divergence beyond our ability to recognize homologues. Many novel bacterial ncRNAs have been identified by a number of recent computational screens in E.coli [reviewed in (13)], but comparatively few have been experimentally verified. Rfam contains more than 30 ncRNA families based on the verified genes. Few large-scale studies have been conducted in archaea or eukaryotes, and it is clear that such efforts will identify many more small families.
Profile SCFG searches are computationally expensive. Rfam at present uses a BLAST-based heuristic (14) as described previously (6), reducing the search space with an inevitable sensitivity cost. This allows us to search a 5 Mb bacterial genome against the entire Rfam library in ∼24 h. Annotation of large eukaryotic genomes is just feasible using this approach. Recent advances allow the speed of profile SCFG searches to be increased by a factor of ∼100 for most families, and provably do not reduce the sensitivity of the full SCFG search (15). Work is ongoing to incorporate such algorithms into the Rfam/INFERNAL approach. We also recognize that the current approach is restricted to RNAs with defined secondary structures, precluding inclusion of important families of essentially unstructured RNAs like XIST (X-Inactive Specific Transcript), RoX (RNA on X) and IPW (Imprinted in Prader–Willi). Furthermore, the consensus structure annotation may conceal additional elements in divergent structures. We plan to evaluate how the use of profile HMMs may allow the detection of homologues of unstructured RNAs, and investigate the propagation of structure annotation at the sequence level.
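The idea of a fast pre-filter that restricts where the expensive search is run can be sketched as follows. This is a minimal illustration under stated assumptions, not the Rfam pipeline: exact k-mer matching stands in for BLAST, the `expensive_score` argument is a placeholder for a profile SCFG scan, and the function names, the toy genome and the parameters `k` and `flank` are all hypothetical.

```python
def seed_hits(genome, seeds, k=8):
    """Cheap stage: positions where any k-mer from the family's seed sequences occurs exactly."""
    kmers = {s[i:i + k] for s in seeds for i in range(len(s) - k + 1)}
    return [i for i in range(len(genome) - k + 1) if genome[i:i + k] in kmers]

def candidate_windows(hits, k, flank, genome_len):
    """Pad each hit by `flank` bases and merge overlapping windows (hits are ascending)."""
    windows = []
    for h in hits:
        start, end = max(0, h - flank), min(genome_len, h + k + flank)
        if windows and start <= windows[-1][1]:
            windows[-1] = (windows[-1][0], max(windows[-1][1], end))
        else:
            windows.append((start, end))
    return windows

def filtered_search(genome, seeds, expensive_score, k=8, flank=5):
    """Run the expensive scorer only on windows that passed the cheap filter."""
    windows = candidate_windows(seed_hits(genome, seeds, k), k, flank, len(genome))
    return [(start, end, expensive_score(genome[start:end])) for start, end in windows]

# Toy genome (hypothetical): one family member embedded in poly-A background
genome = "A" * 50 + "GGCAUUCGUGCC" + "A" * 50
seeds = ["GGCAUUCGUGCC"]

# `len` is a stand-in scorer; a real pipeline would score each window with the family's model
results = filtered_search(genome, seeds, expensive_score=len)
searched = sum(end - start for start, end, _ in results)
print(results, f"{searched}/{len(genome)} bases passed to the expensive stage")
```

The sensitivity cost mentioned above is visible in the design: a divergent homologue that shares no exact k-mer with any seed sequence never reaches the second stage, however well it would have scored there.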
Perhaps the biggest challenge for annotation of higher eukaryotic genomes is the problem of ncRNA-derived pseudogenes and repeats. For example, the B2 repeat in mouse is evolutionarily related to a tRNA, and Alu repeats in human derive from SRP RNA (16). Over 10% of the draft human genome sequence is made up of 1.1 million Alu sequences (17), and there are over 350 000 B2 repeat sequences in mouse (18). The human genome also contains over 1000 sequences that are closely related to U6 spliceosomal RNA, yet sensible estimates of the U6 gene count suggest that <50 are functional. Other problem families include the pol III-transcribed Y and 7SK RNAs. Distinguishing the functional copies from the large numbers of pseudogenes is an unsolved problem and presents a significant challenge to RNA computational biologists.
It seems likely that computational and experimental screens will continue to identify numerous novel ncRNAs. Most of these genes are predicted to fall into small families with narrow taxonomic ranges. In contrast, we believe that very few universally conserved RNAs will be found, and the large, well-studied and ubiquitous families will continue to make up the large majority of ncRNAs in a single genome. Rfam will continue to translate novel discoveries of ncRNA genes into alignments and models that are immediately useful for genome annotation and phylogenetic analysis.
We thank all those who have contributed data and annotation and developed tools and algorithms for ncRNA detection, alignment and structure prediction. Work at the Sanger Institute is funded by the Wellcome Trust. A.K. and S.R.E. are supported by the Howard Hughes Medical Institute, the NIH National Human Genome Research Institute and Alvin Goldfarb.
The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK and Howard Hughes Medical Institute and Department of Genetics, Washington University School of Medicine, St Louis, MO 63108, USA