In recent years, there have been increasing numbers of transcripts identified that do not encode proteins, many of which are developmentally regulated and appear to have regulatory functions. Here, we describe the construction of a comprehensive mammalian noncoding RNA database (RNAdb) which contains over 800 unique experimentally studied non-coding RNAs (ncRNAs), including many associated with diseases and/or developmental processes. The database is available at http://research.imb.uq.edu.au/RNAdb and is searchable by many criteria. It includes microRNAs and snoRNAs, but not infrastructural RNAs, such as rRNAs and tRNAs, which are catalogued elsewhere. The database also includes over 1100 putative antisense ncRNAs and almost 20 000 putative ncRNAs identified in high-quality murine and human cDNA libraries, with more to be added in the near future. Many of these RNAs are large, and many are spliced, some alternatively. The database will be useful as a foundation for the emerging field of RNomics and the characterization of the roles of ncRNAs in mammalian gene expression and regulation.
Received August 14, 2004; Revised and Accepted October 12, 2004
The traditional view of the genomic programming of cells and organisms is predicated on the belief that genetic information normally flows from DNA to RNA to protein. As a consequence, genes are generally considered to be synonymous with proteins, which carry out the majority of the structural, catalytic and regulatory transactions in living cells, with RNA functioning primarily as an intermediate coding template for protein synthesis, aided by infrastructural RNAs that are central to the process (tRNAs and rRNAs). This view of the structure of molecular genetic systems is essentially correct in prokaryotes, whose genomes consist almost entirely of closely spaced protein-coding sequences flanked by cis -acting regulatory elements that act to control transcription and translation, although it has recently become clear that prokaryotes also express limited numbers of small regulatory RNAs. However, these occupy <1% of the genome sequence in prokaryotes ( 1 ). In contrast, non-protein-coding RNA (ncRNAs) sequences are abundant in the genomes of the eukaryotes, especially the developmentally complex multicellular eukaryotes, where they dominate transcriptional output ( 2 ). These RNAs are composed of transcribed introns from protein-coding genes, and an increasing but, as yet, undetermined number of transcripts that do not encode proteins and may or may not themselves contain introns, collectively termed ncRNAs ( 2 – 5 ).
The evidence for large numbers of ncRNA transcripts in eukaryotes is increasing. There are large numbers of sequences that do not contain a significant open reading frame (ORF) in well-constructed mammalian cDNA collections ( 6 , 7 ), many of which show variation in their expression in different cells and developmental stages. The number of detectable RNAs derived from human chromosomes 21 and 22 has been shown to be at least an order of magnitude higher than that expected from known protein-coding sequences ( 8 , 9 ). All well-studied loci (including β-globin, bithorax and various imprinted loci) show a predominance of non-coding transcripts, many of which are developmentally regulated and appear themselves to have regulatory functions ( 10 – 12 ). Careful analysis of human chromosome 7 has identified 213 ncRNA genes, indicating by extrapolation that the number of ncRNA genes in the human genome is more than 3500 ( 13 ) while other estimates based on the integration of data from a wide variety of sources suggests that the majority of all human genes encode ncRNAs ( 3 , 5 ). Indeed, while protein-coding sequences occupy <1.5% of the human genome, it is clear that at least 50% of the genome is transcribed ( 2 ).
ncRNAs have been identified in various ways. Early examples such as lin-4, Xist, IPW, NTT and BC200 were discovered and characterized experimentally on an ad hoc basis, and their lack of protein-coding capacity was unexpected (references for individual ncRNAs are listed in Supplementary Material 1). Many of these RNAs, expressed in particular tissues and/or developmental stages, are associated with particular diseases including various cancers ( 14 – 20 ), schizophrenia ( 21 ), ataxia ( 22 ), cartilage-hair hypoplasia ( 19 ), DiGeorge syndrome ( 23 ) and autism ( 13 ), and/or are involved in complex genetic phenomena such as imprinting and other forms of epigenetic control of gene expression ( 12 , 24 ). More recently, there have been targeted experimental approaches to the large-scale discovery of ncRNA genes. The presence of hundreds of microRNAs (miRNAs) and small nucleolar RNAs (snoRNAs) has been established experimentally by specifically screening for tiny RNA species ( 25 – 27 ), at least some of which have been implicated in the control of development ( 28 – 30 ) and in the etiology of cancer ( 31 – 33 ). As noted above, thousands of larger ncRNA transcripts have also been putatively identified via the systematic sequencing and annotation of tens of thousands of full-length cDNAs ( 6 , 7 ). Recent advances in computational genomics are also helping to identify ncRNAs of particular types. New algorithms have been used to search the genomic sequence databases for members that share secondary structure motifs with existing ncRNA families ( 34 , 35 ), although there remain large numbers of ncRNAs which do not yet appear to share recognizable primary or secondary structural motifs and hence cannot be identified by these methods.
The term ‘non-coding RNA’ in its broadest sense includes all RNAs that do not code for protein (i.e. non-messenger RNAs) and encompasses transfer RNAs (tRNAs), ribosomal RNAs (rRNAs) and spliceosomal RNAs, which largely have basic housekeeping functions in cells. Many other ncRNAs, however, have been shown to perform regulatory functions within the cell including phenomena such as the temporal suppression of mRNA translation, RNA interference, imprinting, DNA methylation and X chromosome dosage compensation ( 3 , 5 ). So, while precise biological roles for the vast majority of these non-messenger, non-infrastructural RNAs are still to be elucidated, such ncRNAs have been proposed to serve as a diverse and hitherto hidden regulatory network in eukaryotic cells ( 2 , 3 ).
At present, there is no comprehensive database of ncRNAs, although there are several existing databases that cover aspects of the field. tRNAs and rRNAs are listed within multiple databases ( 35 – 37 ). The miRNA registry provides a searchable database of published miRNA sequences ( 38 ). The Rfam database contains thousands of mammalian RNAs, the majority of which are infrastructural RNAs (tRNAs, etc.) and predicted using co-variance models from multiple-sequence alignments of genomic datasets with little direct experimental support for their transcription ( 35 ). Notably, many well-documented regulatory ncRNAs such as Xist and NTT are not listed in Rfam, reflecting a bias to include only those entries that are members of particular structure-based RNA families. A ‘regulatory’ ncRNA database has been published previously ( 39 ) but the total number of unique mammalian ncRNAs in this database is <40 excluding homologs and miRNAs.
AIMS OF THE DATABASE
Here, we describe a new ncRNA database, RNAdb. The broad aims of RNAdb are 2-fold. First, to provide a searchable sequence repository for mammalian ncRNAs whose function may be regulatory and whose existence is supported by experimental evidence. Being sequence-oriented, the database permits not only bioinformatic analyses but also the rational design of experimental tools such as microarray chips to characterize the roles of ncRNAs in mammalian gene expression and regulation. The second aim is to highlight to the scientific community the presence of thousands of potentially regulatory ncRNAs.
RNAdb contains a comprehensive listing of mammalian ncRNAs, but excludes tRNAs, rRNAs and spliceosomal RNAs. More than 800 unique known mammalian ncRNAs are included in the database, around two-thirds of which are miRNAs and snoRNAs. The rest are largely of unknown function, but some are known to be developmentally regulated, disease-associated, imprinted, expressed pseudogenes or antisense transcripts. As well as sequence data, additional information—including GenBank accession numbers, species, references, chromosomal location, transcript length, splicing status, conservation notes, function, disease associations, antisense relationships, imprinting status and tissue expression patterns—is provided wherever possible in separate searchable fields. The database also includes almost 20 000 putative ncRNAs identified in high-quality cDNA libraries ( 6 , 7 ), with more to be added in the near future. At this stage, we have restricted the database to mammalian ncRNAs, as this group has the best information and is most relevant to humans and disease, but we expect that many of these ncRNAs will have paralogs in other vertebrates, and that the database will eventually be extended to include ncRNAs from other vertebrates and from invertebrates.
DATABASE DEVELOPMENT AND DESIGN
ncRNAs were included in RNAdb provided that (i) there was experimental (i.e. EST, cDNA, RT–PCR and/or northern blot) evidence to support their existence as RNAs; (ii) they did not contain a significant ORF (i.e. <100 amino acids); (iii) they were not annotated as rRNAs, tRNAs and spliceosomal RNAs; and (iv) they were mammalian.
ncRNAs that satisfied these requirements were identified from the following sources: a comprehensive literature review; the Functional Annotation of Mouse (FANTOM) and the H-Invitational databases, which contain over 60 000 and 20 000 annotated gene candidates identified from mouse and human full-length cDNA libraries, respectively; the human chromosome 7 annotation project; the miRNA registry; and public sequence databases at NCBI, Ensembl and UCSC. A novel pipeline was designed to identify putative antisense ncRNAs from the public databases (Supplementary Material 2).
For individual ncRNAs with direct support in the literature, the details (sequence, accession number, organism, chromosomal location, etc.) were manually curated where possible. For ncRNAs identified from large-scale discovery projects such as FANTOM and H-Invitational, information was restricted to that already publicly available. Further information will be obtained and incorporated in the near future.
All the information is currently stored in a Microsoft SQL2000 relational database which runs on a Microsoft.NET platform.
DATABASE ACCESS AND INTERFACE
RNAdb is hosted at the Institute for Molecular Bioscience and is freely available at http://research.imb.uq.edu.au/RNAdb . The entire database is available for download in XML format via the website.
The web interface allows users to browse the entire collection or to search for specific ncRNAs of interest. Searches are executed either by performing keyword searches or by applying filters to nominated fields including organism, chromosome number, ncRNA category and ncRNA feature (e.g. antisense, disease-associated, etc.) either singly or in combination ( Figure 1A ). Once an individual ncRNA is selected, the sequence and relevant annotations are displayed in a tabular format ( Figure 1B ).
A BLAST tool allows users to query sequences of interest against those present in the database.
Questions, feedback and submissions of new mammalian ncRNAs are welcomed, and should be directed via email to RNAdb@imb.uq.edu.au .
CONTENT OF THE DATABASE
RNAdb contains more than 800 unique, known mammalian ncRNAs. Including homologs, splice variants and precursor forms, there are more than 2000 individual sequences listed. Of the uniquely known ncRNAs in the database, around two-thirds comprise miRNAs and snoRNAs. The rest are generally much longer and while some of these have documented biological roles, most are transcripts of unknown function. Altogether 36 mammalian organisms are represented within the dataset but the majority of unique ncRNAs are either murine and/or human.
Over 200 unique mammalian miRNAs are found within RNAdb and were chiefly obtained from the miRNA registry ( 38 ). While it is not our aim to duplicate this valuable resource, we have included miRNAs here because they represent an important population of potentially regulatory ncRNAs and consequently fall within the scope of this database. In addition, we have manually annotated the miRNAs to reflect recent published results ( 30 , 40 – 43 ).
More than 300 snoRNAs have been described. They fall into two general classes, C/D box and H/ACA snoRNAs, which guide ribose methylation and pseudo-uridylation of rRNAs, respectively. Some snoRNAs are expressed in a tissue-specific manner and show complementarity to mRNAs rather than rRNAs ( 44 ), pointing to possible roles in post-transcriptional modification and regulation.
Over 40 unique ncRNAs have been linked to diseases ranging from malignancies to psychiatric illnesses and neuro-developmental disorders. These include: miR-15 and 16, BCMS (B cell chronic lymphocytic leukemia), TTY1 and 2 (gonadoblastoma), NCRMS (rhabdomyosarcoma), BIC (ALV-induced B cell lymphoma), H19 (breast/colon/bladder/Wilm's tumor), MALAT-1 (non-small cell lung cancer), DISC2 (schizophrenia), SCA8 (spinocerebellar ataxia), ST7OT1-4 (autism), IPW (Prader–Willi syndrome), LIT-1 (Beckwith Wiedemann syndrome), RMRP (cartilage hair hypoplasia) and UBE3A antisense (Angelman syndrome).
Developmentally regulated ncRNAs
More than 40 ncRNAs show tissue-specific expression and/or regulation during development. Recent examples include multiple miRNAs differentially expressed during brain development and hematopoiesis ( 30 , 40 ), and others include Xist, 7H4, BC1, BC200, BORG, Bsr, CIOR, MBII-13/52 and 85, Ntab, NTT, tncb and Zim3.
Natural antisense ncRNAs
Natural antisense transcription is now recognized as being much more common than previously thought, with recent estimates indicating that there may be thousands of sense–antisense pairs in the human genome ( 45 ). RNAdb includes over 30 previously described antisense ncRNAs, a few examples of which are aHIF, Air, DISC2, EMX2OS, FGF antisense, HFE antisense, HOXA11S, LIT1, Msx1 antisense and Nespas.
The number of pseudogenes in the human genome has been estimated at ∼20 000 ( 46 ), only a minority of which are thought to be transcribed into ncRNAs ( 47 ). Here, we catalog over 50 expressed pseudogenes. These include Makorin1-p1, the disruption of which was recently shown to play a role in the pathogenesis of a mouse mutant exhibiting polycystic kidneys and bone deformity.
We have also found over 40 ncRNAs that are imprinted. Several of these cluster together including IPW, PWCR1, UBE3A antisense, PAR1 and PEG13 on human chromosome 15q11-12, and PEG-11 antisense and MEG8 (on ovine chromosome 18). Many are natural antisense transcripts, such as (apart from those above) LIT-1, Air, GNAS1 antisense, PEG8, Copg2IT1, PEG1 antisense and Zim3.
Alternatively spliced ncRNAs
Over 20 ncRNAs are known to be alternatively spliced. Some of these include BCMS, CIOR, DD3, DGCR5, Enox, G.B7, Gas5, GTL2, NCRMS, Nespas, SCA8 and Tmevpg1.
In addition to the 800 unique known ncRNAs, RNAdb also contains almost 20 000 putative ncRNAs. Approximately 15 000 of these were originally identified from over 60 000 high-quality largely full-length mouse cDNA clones annotated by the FANTOM consortium ( 6 ). A further 2000 putative ncRNAs were obtained from the H-Invitational database, which contains over 20 000 validated gene candidates derived from analysis of high-quality largely full-length human cDNA clones ( 7 ). The human chromosome 7 annotation project has described over 350 putative ncRNAs derived from computer-based annotation in conjunction with extensive laboratory experimentation, and these are also included among our putative ncRNA dataset. Finally, using a computational pipeline, we identified over 1100 putative antisense ncRNAs (Supplementary Material 2), and these are also contained in the database.
To date, we have cataloged hundreds of known non-infrastructural mammalian ncRNAs. This number quickly stretches to thousands once the putative ncRNAs are taken into account. While it has been proposed that many of these ncRNAs function as components in a regulatory network ( 3 , 5 ), in the majority of cases, virtually nothing is known about their precise biological roles, tissue expression patterns and significance. Elucidating the functional significance of these transcripts must, therefore, become a priority. Until this is done, we acknowledge the important caveats that some of the RNAs classified here as ncRNAs may not have intrinsic function, as in the case of SRG1 in yeast ( 48 ), and that some may encode small proteins as recently described for the steroid receptor RNA activator ( 49 ). Nonetheless, having a searchable database such as RNAdb dedicated to non-infrastructural ncRNAs will be a vital foundation for the emerging field of RNomics as the future knowledge base grows. We are, therefore, setting up a regular database update schedule, and invite researchers to submit new ncRNAs to RNAdb as they are published. In the near future, we will be adding thousands of novel putative ncRNAs that have arisen from the latest round of analysis of the FANTOM cDNA libraries, as they become publicly available.
At the time of submission, it came to our attention that a new noncoding RNA database (NONCODE) had been released online in the previous month. This database appears to share a similar interest with ours but RNAdb has the following advantages: (i) over 1100 putative antisense ncRNAs and almost 20 000 putative ncRNAs identified in high-quality murine and human cDNA libraries are included and annotated; (ii) BLAST searches against the non-coding dataset can be performed; and (iii) the database is available for download to permit more thorough local bioinformatic analyses.
Supplementary Material is available at NAR Online.
We thank Jeff McDonald and Steve Scherer for providing us with information on ncRNAs identified by the chromosome 7 annotation project, as well as Carole Charlier for providing us with information on callipyge locus ncRNAs. K.C.P. is supported by the NHMRC with a Postgraduate Medical Research Scholarship (234711). W.C. is supported by a Wellcome Trust International Senior Research Fellow Fellowship (066646Z01Z). J.S.M. is supported by the Australian Research Council and the Queensland State Government.
1ARC Special Research Centre for Functional and Applied Genomics, Institute for Molecular Bioscience, University of Queensland, Brisbane, Queensland 4072, Australia, 2T cell Laboratory, Ludwig Institute for Cancer Research, Austin and Repatriation Medical Centre, Heidelberg, Victoria 3084, Australia, 3Center for Genomics and Bioinformatics, Karolinska Institutet, 171 77 Stockholm, Sweden and 4Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Centre, Yokohama, Kanagawa 230-0045, Japan