The term non-coding RNA (ncRNA) is commonly employed for RNA that does not encode a protein, but this does not mean that such RNAs do not contain information nor have function. Although it has been generally assumed that most genetic information is transacted by proteins, recent evidence suggests that the majority of the genomes of mammals and other complex organisms is in fact transcribed into ncRNAs, many of which are alternatively spliced and/or processed into smaller products. These ncRNAs include microRNAs and snoRNAs (many if not most of which remain to be identified), as well as likely other classes of yet-to-be-discovered small regulatory RNAs, and tens of thousands of longer transcripts (including complex patterns of interlacing and overlapping sense and antisense transcripts), most of whose functions are unknown. These RNAs (including those derived from introns) appear to comprise a hidden layer of internal signals that control various levels of gene expression in physiology and development, including chromatin architecture/epigenetic memory, transcription, RNA splicing, editing, translation and turnover. RNA regulatory networks may determine most of our complex characteristics, play a significant role in disease and constitute an unexplored world of genetic variation both within and between species.
Until recently most of the known non-coding RNAs (ncRNAs) fulfilled relatively generic functions in cells, such as the rRNAs and tRNAs involved in mRNA translation, small nuclear RNAs (snRNAs) involved in splicing and small nucleolar RNAs (snoRNAs) involved in the modification of rRNAs. The central tenet of molecular biology, developed from the study of simple organisms like Escherichia coli, has been that RNA functions mainly as an informational intermediate between a DNA sequence (‘gene’) and its encoded protein. The presumption has been that most genetic information that specifies biological form and phenotype is expressed as proteins, which not only fulfill diverse catalytic and structural functions, but also regulate the activity of the system in various ways. This is largely true in prokaryotes and presumed also to be true in eukaryotes. Reciprocally, the extensive sequences in the higher eukaryotes that do not encode proteins or cis-acting regulatory elements (i.e. the majority of the vast tracts of intronic and intergenic sequences) have been regarded as simply accumulated evolutionary debris arising from the early assembly of genes and/or the insertion of mobile genetic elements.
However, most of these supposedly inert sequences are transcribed. It is also increasingly evident that RNA itself can and does have a very wide repertoire of biological functions (1) and, in particular—as first predicted by Jacob and Monod 45 years ago (2)—that it is widely employed as a means of gene regulation, both in cis and in trans, especially in the higher eukaryotes. These RNAs are the subject of this review.
EXPANSION OF ncRNAs AND RNA METABOLISM IN EUKARYOTES
A limited number of trans-acting small ncRNAs have been described in prokaryotes that appear mainly to regulate mRNA translation or stability. Over 60 such RNAs have been identified during the past few years in E. coli, with another 200 or so predicted bioinformatically (3–7). Some of these RNAs are co-expressed with mRNAs and released by cleavage after transcription (4,7), examples of a parallel output of regulatory RNAs that appears to be widespread in the higher eukaryotes (8). Small ncRNAs have also been identified in other bacteria (see e.g. 9,10) and archaea (11), which interestingly have homologs of Argonaute, a family of RNA-binding endonucleases central to the action of microRNAs (miRNAs) and small interfering RNAs (siRNAs) in eukaryotes (12). However, ncRNAs do not dominate genomic output in prokaryotes, representing, as far as one can tell, only a small fraction of their genomes, which are generally dominated (80–95%) by protein-coding sequences (13), whose repertoire can vary widely even between closely related strains (14).
In contrast, the higher organisms have a relatively stable proteome, and a relatively static number of protein-coding genes, which is not only much lower than expected but also varies by less than 30% between the simple nematode worm Caenorhabditis elegans (which has only 10³ cells) and humans (∼10¹⁴ cells), which have far greater developmental and physiological complexity (15). Moreover, only a minority of the genomes of multicellular organisms is occupied by protein-coding sequences, the proportion of which declines with increasing complexity, with a concomitant increase in the amount of non-coding intergenic and intronic sequences, most of which are in fact transcribed [(15,16); discussed in more detail subsequently]. Thus, there seems to be a progressive shift in transcriptional output between microorganisms and multicellular organisms from mainly protein-coding mRNAs to mainly non-coding RNAs, including intronic RNAs.
The eukaryotes, particularly the higher eukaryotes, also have a far more developed RNA processing and signaling system than prokaryotes, which appears to be linked to the more sophisticated pathways of gene regulation and complex genetic phenomena in eukaryotes: transcriptional and post-transcriptional gene silencing [including RNA interference (RNAi)], DNA methylation and chromatin modification, imprinting, and other phenomena such as transvection, transinduction, dosage compensation and position effect variegation (8,17,18). The higher eukaryotes also have a large repertoire of RNA-binding proteins as well as many nucleic acid- and chromatin-binding proteins whose exact specificity is unknown or uncertain, but which may recognize different types of RNA:RNA and RNA:DNA complexes (8,18).
Both theoretic considerations and empirical evidence indicate that the amount of regulatory overhead scales non-linearly with complexity in all integrated systems, and that regulatory architecture will progressively dominate the information content of more complex systems, leading to complexity limits, until and unless there is a change in the physical basis of the regulatory architecture itself (19). The generic solution to this accelerating regulatory problem is the superimposition of digital communication and control systems, which have only been broadly established in the human intellectual lexicon during the past 20–30 years, well after the central tenets of molecular biology were developed and after introns were discovered. Interestingly, although it is widely appreciated that DNA itself is a digital storage medium, it has not been considered that some of its outputs may themselves be digital signals, communicated via ncRNA, in addition to the mRNAs encoding analog components (i.e. the proteins), albeit with many design variations elaborated by alternative splicing (which itself requires regulation).
Regulatory proteins scale almost quadratically with genome size in prokaryotes (20,21), and extrapolation of this relationship suggests that prokaryotes have been limited in their complexity by their reliance on a protein-based regulatory architecture, probably for most of their evolutionary history (13,19,20,22). Conversely, it appears that the eukaryotes breached this limit by the co-option of RNA as a digital regulatory solution, in concert with the evolution of the necessary protein infrastructure to recognize and act on these signals (13). Indeed, both logic and evidence suggest that developmental programming and the phenotypic differences between species and individuals are heavily influenced, if not fundamentally controlled, by the repertoire of regulatory ncRNAs (13,16–18,23), which are only now being recognized and beginning to be studied in any systematic way.
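The scaling argument can be illustrated with a minimal numerical sketch. It assumes, purely for illustration, a power law R = k·N^e relating the number of regulatory genes R to the total gene number N, with exponent e = 2 standing in for "almost quadratic"; the constant k and all function names are hypothetical and are not taken from the cited studies.

```python
def regulatory_genes(total_genes: float, k: float = 1e-5, exponent: float = 2.0) -> float:
    """Regulatory gene count under the assumed power law R = k * N**exponent."""
    return k * total_genes ** exponent


def regulatory_fraction(total_genes: float, k: float = 1e-5, exponent: float = 2.0) -> float:
    """Share of all genes that would have to be regulators at a given gene count."""
    return regulatory_genes(total_genes, k, exponent) / total_genes


def saturation_size(k: float = 1e-5, exponent: float = 2.0) -> float:
    """Gene count at which R(N) = N, i.e. every gene would be a regulator."""
    return (1.0 / k) ** (1.0 / (exponent - 1.0))


if __name__ == "__main__":
    # Hypothetical gene counts; the fraction grows linearly when exponent = 2.
    for n in (1_000, 4_000, 10_000):
        print(n, f"{100 * regulatory_fraction(n):.1f}% regulatory")
```

With these assumed values the regulatory share rises from about 1% at 1000 genes to about 10% at 10 000 genes, which is the sense in which a purely protein-based regulatory architecture would eventually impose a ceiling on gene number.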
Some infrastructural ncRNAs have been known for a long time and have well-established functions. These include tRNAs, rRNAs, spliceosomal uRNAs or ‘snRNAs’ and the common ‘snoRNAs’. Both translation and splicing require core infrastructural RNAs not only for sequence-specific recognition of RNA substrates, but also for the catalytic process itself (1,24–27). Recent findings indicate that some of these RNAs, not surprisingly, may also be involved in regulatory processes. For example, besides its role in splicing, the U1 snRNA is involved in the regulation of transcription initiation by RNA polymerase II through interaction with the transcription initiation factor TFIIH (28). U1 RNA also interacts with cyclin H (29), raising the possibility that ncRNA might be involved in cell cycle regulation. In addition, the small conserved nuclear RNA 7SK inhibits the kinase activity of the CDK9/cyclin T complex, leading to reduced phosphorylation of RNA polymerase II and a reduction in transcription (30). The 7SK RNA acts in concert with the HEXIM1 and HEXIM2 proteins, both of which show distinct expression patterns in various human tissues (31–33), and depletion of 7SK RNA by siRNA causes apoptosis in HeLa cells (34).
ncRNAs also play a role in chromosome maintenance and segregation (35). A small RNA with similarity to box H/ACA snoRNAs is a component of telomerase (for review see 36) and is mutated in autosomal dominant dyskeratosis congenita (37). In human–chicken hybrid cells, mutation of Dicer, a key component of the siRNA/miRNA processing machinery, leads to the accumulation of transcripts derived from centromeric-satellite repetitive sequences, premature separation of sister chromatids and cell death (38). ncRNA has also been implicated in the control of chromatin architecture and epigenetic memory (35,39; discussed further below).
There are also other types of infrastructural ncRNAs that are involved in central cell biological processes. The ncRNA 7SL RNA is a core component of the signal recognition particle (SRP), a ribonucleoprotein complex that interacts with the ribosome and is essential for targeting/transportation of nascent proteins containing signal peptides to the endoplasmic reticulum membrane for secretion or membrane insertion (40–43).
The 13 MDa vault complex (discovered in 1986) is the largest ribonucleoprotein complex described to date, three times bigger (albeit far less complex) than the ribosome. It is present in 10⁴ to 10⁵ copies per cell, forms a barrel-like structure predominantly localized in the cytoplasm and is presumably involved in transport (for review see 44). Different species have between one and three vault RNAs, ranging in length from 86 to 141 nucleotides. In multi-drug resistant cells, the vault complex is upregulated and has a different ratio of vault RNAs in comparison with normal cells (44). Moreover, two human vault RNAs, hvg-1 and hvg-2, specifically bind to mitoxantrone (45), a chemotherapeutic agent commonly used for treatment of breast cancer, myeloid leukemia and non-Hodgkin's lymphoma.
cis-ACTING REGULATORY SEQUENCES IN NON-CODING REGIONS OF mRNAs AND PRE-mRNAs
Regulatory RNAs function in most cases by base-pairing with complementary sequences in other RNAs and DNA, to form RNA:RNA (and probably RNA:DNA) complexes that are recognized, and acted upon, by a relatively generic infrastructure [such as RNA-induced silencing complex (RISC) complexes or RNA editing enzymes]. There are many well-characterized examples of regulatory RNA sequences in the untranslated regions (UTRs) of mRNAs that act in cis as receivers of other trans-acting signals, by forming secondary structures that bind regulatory proteins or small molecular weight ligands. Examples of the former include sequences in UTRs that can bind regulatory proteins or be the targets of RNA editing to control the stability, translatability or localization of mRNAs (46–49). Examples of the latter are the so-called ‘riboswitches’ that regulate metabolic pathways by binding metabolites such as vitamins, amino acids and purines, to effect allosteric changes in the mRNA to control its translation or stability. These have been well documented in bacteria (50–52), but also occur in eukaryotes (53,54).
UTRs in mRNAs (as well as the coding sequences themselves) can also be the sensors of trans-acting regulatory RNAs, specifically miRNAs (at least some of which are encoded in introns of other genes), by base sequence recognition (8,55,56), which appear to have significant influence on their evolution (57). That is, ncRNAs can either be receivers or transmitters, or both, of regulatory signals. Interestingly, the average length of the UTRs in mRNAs increases with developmental complexity in animals, and is almost equivalent to the length of the protein-coding sequences in humans (total 34 Mb of coding sequences and 32 Mb of UTR at last count) (15), indicative of the much greater sophistication of mRNA regulation in the higher organisms.
There are also cis-acting regulatory sequences in and around splice junctions, some of which (the so-called ‘exon-splicing enhancers’ or ESEs) occur within protein-coding sequences (58). Nucleotide sequence conservation is higher around alternative splice sites than constitutive splice sites, albeit in complex patterns (59–61). These sequences are thought to bind regulatory proteins that influence splice selection, but two recent papers have suggested that such selection may, at least in some cases, involve complex RNA:RNA interactions, which are themselves presumably regulated by other trans-acting signals, including other RNAs (62–64). Consistent with this, small artificial antisense RNAs and introduced riboswitches have been shown to easily regulate splicing in vitro and in vivo (65–68), with obvious implications for the natural mechanisms of splicing control (8). A snoRNA has also been shown to control splicing of serotonin receptor 5-HT(2C)R mRNA (64). In addition, a significant number of ultra-conserved sequences in mammals and insects are located at splice sites (63,69). It should be borne in mind that some protein-coding sequences may have dual function, and be themselves the targets of regulatory molecules, such as miRNAs and siRNAs, as has been well documented in plants (70) and has been recently shown to occur in mammals (71–73). It should also be borne in mind that many RNAs may combine both digital (i.e. sequence-specific) and analog (structure-based ligand/protein binding or catalytic) functions, and that we have barely yet scratched the surface of these functions and networks.
LARGE NUMBERS OF ncRNAs EXPRESSED FROM THE MAMMALIAN GENOME
The Ensembl 34b version of Human Genome annotation lists 22 287 known or predicted protein-coding gene loci. The coding regions occupy ∼34 Mb (∼1.2%) of the euchromatic genome, and the total fraction of bases occupied by known protein-coding transcripts is only about 2% (15,74). However, summation of the sequences covered by known genes, ‘mRNAs’ and spliced ESTs indicates that (at least) 60–70% of the mammalian genome is transcribed on one or both strands (15,75), noting that introns are also actually transcribed (as distinct from generating stable transcripts) (Fig. 1). These estimates are conservative, as it is clear from both cDNA and genome tiling array studies that we have not yet come close to plumbing the full depth or breadth of the expressed transcripts in different types of cells under different developmental and physiological conditions (75–79).
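The proportions quoted above can be checked with simple arithmetic. The sketch below assumes a euchromatic genome size of about 2850 Mb, a figure that is not stated in the text and is used here only as a working value; the function names are illustrative.

```python
# Assumption: euchromatic human genome of roughly 2.85 Gb (not given in the text).
EUCHROMATIC_GENOME_MB = 2850.0
CODING_MB = 34.0  # coding sequence total quoted in the text


def percent_of_genome(length_mb: float, genome_mb: float = EUCHROMATIC_GENOME_MB) -> float:
    """Express a sequence length in Mb as a percentage of the genome."""
    return 100.0 * length_mb / genome_mb


def transcribed_mb(fraction: float, genome_mb: float = EUCHROMATIC_GENOME_MB) -> float:
    """Length in Mb corresponding to a transcribed fraction of the genome."""
    return fraction * genome_mb


if __name__ == "__main__":
    print(f"coding: {percent_of_genome(CODING_MB):.1f}% of the euchromatic genome")
    for fraction in (0.60, 0.70):
        mb = transcribed_mb(fraction)
        print(f"{fraction:.0%} transcribed = {mb:.0f} Mb, about {mb / CODING_MB:.0f}x the coding sequence")
```

Under this assumption, the transcribed portion of the genome exceeds the protein-coding portion by a factor of roughly 50 or more.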
Large-scale cDNA cloning studies have recently shown that there are many tens of thousands of transcripts expressed from the mouse genome, a large fraction of which (over 34 000) do not appear to encode proteins (75). These studies involved aggressive normalization to enrich for rare transcripts, which introduces the possibility of contamination from pre-mRNA sequences (i.e. introns), but the findings were generally supported by the results of large-scale promoter/transcription start site mapping, suggesting that the observed transcriptional complexity of the genome is real and extends far beyond what had been previously imagined (75). It should be noted that most putative ncRNAs are expressed at lower levels than mRNAs, and many are rare, consistent with the suggestion that these RNAs mainly fulfil regulatory functions. It should also be noted that these studies, as is traditional, were orientated towards cytoplasmic polyA+RNA (75,76), for technical reasons (to exclude infrastructural RNAs and primary transcripts), on the assumption that nearly all transcripts are processed to polyadenylated RNAs that are exported to the cytoplasm for translation, which may not be correct.
It is also apparent that much of the mammalian genome is transcribed from both strands. It is estimated that 5880 human transcription clusters (22% of those analyzed) form sense–antisense pairs with most antisense transcripts being ncRNA (80), an arrangement that exhibits considerable evolutionary conservation between the human and pufferfish genomes (81). A detailed analysis of the mouse transcriptome indicated that 43 553 (72%) transcriptional units overlap with transcripts coming from the opposite strand (82). In fact, there is evidence from spliced ESTs, annotated ‘mRNAs’ and protein-coding genes listed on the UCSC Genome Database (83) that at least 2.4 Gb of the human genome is transcribed, at least 25% from both strands (Fig. 1; M. Pheasant and J.S. Mattick, unpublished analysis). It would not be surprising if the true extent of transcription was greater than the size of the genome itself, noting that the upper limit is twice the genome size.
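The distinction between bases transcribed on either strand, on both strands, and the total strand-specific extent of transcription (whose upper limit is twice the genome size) can be made explicit with a small interval calculation. This is a sketch on toy coordinates, not a reconstruction of the unpublished analysis; the data layout (half-open start/end positions plus a strand symbol) and all names are assumptions.

```python
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[int, int]          # half-open [start, end)
Transcript = Tuple[int, int, str]   # (start, end, strand), strand is "+" or "-"


def merge(intervals: Iterable[Interval]) -> List[Interval]:
    """Union of overlapping or touching intervals, sorted by start."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered(intervals: Iterable[Interval]) -> int:
    """Number of distinct bases covered by a set of intervals."""
    return sum(end - start for start, end in merge(intervals))


def shared(a: Iterable[Interval], b: Iterable[Interval]) -> int:
    """Number of bases covered by both interval sets."""
    a, b = merge(a), merge(b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


def strand_summary(transcripts: Iterable[Transcript]) -> Dict[str, int]:
    """Bases transcribed per strand, on both strands, on either strand, and in total."""
    transcripts = list(transcripts)
    plus = [(s, e) for s, e, strand in transcripts if strand == "+"]
    minus = [(s, e) for s, e, strand in transcripts if strand == "-"]
    both = shared(plus, minus)
    return {
        "plus": covered(plus),
        "minus": covered(minus),
        "both": both,
        "either": covered(plus) + covered(minus) - both,
        "strand_bases": covered(plus) + covered(minus),  # at most 2 x genome length
    }


if __name__ == "__main__":
    toy = [(0, 40, "+"), (30, 60, "+"), (50, 90, "-")]  # toy 100-base "genome"
    print(strand_summary(toy))
```

In the toy example, 90 of 100 bases are transcribed on at least one strand, 10 on both, and the strand-specific total is 100 bases against an upper limit of 200.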
Genome tiling array (76,77) and massively parallel signature sequencing (MPSS) (78) studies of various tissues and cell lines have independently revealed many thousands of non-coding transcripts from intergenic and intronic sequences in the human genome. Over 37% of the MPSS signatures matched known loci, but outside of annotated exons, with another 20% matching the complementary strand of known transcripts, indicating the presence of as many as 50 000 additional non-annotated RNAs in analyzed human tissues (78). These findings are reinforced by the analysis of conserved RNA secondary structures which predict thousands of functional ncRNAs in the human genome (84,85).
High-density genome tiling array studies of 10 human chromosomes (approximately one-third of the human genome) showed that 9% of the non-repetitive sequences were expressed as detectable transcripts (‘transfrags’) in individual cell lines, and that 16.5% of non-repetitive bases were transcribed in at least one out of eight cell lines analyzed, indicating that many of the observed RNAs are cell-type specific (77), consistent with MPSS studies (78). It should be noted that this figure is much higher than the total length of all mRNAs expected from these chromosomes. Over 56% of the detected transfrags do not overlap with any well-characterized exon, mRNA or EST annotation; 30% map to ‘intergenic’ regions and 26% to introns of known genes. The latter do not appear to represent pre-mRNA contamination, as the signals were not generally spread across the introns, but rather showed discrete foci, indicative of previously unknown exons or of other RNAs (perhaps regulatory ncRNAs or their precursors) derived from these regions (77). Moreover, for technical reasons these analyses are likely to overlook many important small regulatory RNAs such as miRNAs, which may be present in only trace amounts and are difficult to label by reverse transcription.
Rapid amplification of cDNA ends (RACE) analysis of selected genomic regions (79) confirmed the existence of these RNAs, and revealed an amazingly complex landscape of interlacing and overlapping transcripts, not only on opposite strands, but also on the same strand, so that there is often no clear distinction between splice variants and overlapping and neighboring genes, which had also been indicated by cDNA cloning studies (75,82). This study also showed that there are many hitherto unrecognized exons and splice variants even in very well-studied genes, such as that encoding Sonic Hedgehog, and that it is not unusual for a single base pair to be part of an intricate network of multiple isoforms of overlapping sense and antisense transcripts (Fig. 2). These observations all have important and challenging implications for genotype–phenotype correlations, the complexity of the transcriptional regulation, and the definition of a gene (79), which may now be best viewed as fuzzy transcription clusters with multiple products (18).
Just as disturbingly, it appears that a large proportion of the transcripts in human and mouse are unique to the largely unstudied polyA− and the nuclear polyA+ fractions of the transcriptome (77,86), which have escaped detection in most transcriptomic studies. It seems that we have barely begun to uncover the extraordinary complexity of the mammalian transcriptome.
TRANSCRIPTIONAL NOISE OR MEANINGFUL OUTPUT?
The observation that there are literally tens of thousands of ncRNAs expressed in mammals, and that most of the genome is transcribed, confronts and very largely contradicts the traditional protein-centric view of genetic information and genome organization. There are two opposing alternatives—either the bulk of the transcription which does not yield mRNAs is ‘transcriptional noise’ and/or (in the case of introns) the residue of evolutionary baggage retained or accumulated within genes, or this transcription comprises another level of expression and transaction of RNA information that is important to the evolution and developmental ontogeny of the higher organisms (13,16,18,23,87–90).
Most of the ncRNAs identified in genomic transcriptome studies have not been studied and have yet to be ascribed any function. However, there are many lines of evidence that suggest that these RNAs are biologically meaningful.
First, most intensively studied gene loci, including both those that are imprinted and conventional loci such as beta-globin, have been shown to express non-coding transcripts (91–96). This includes some enhancers and conserved intergenic sequences (92,97).
Second, it is clear that many of these transcripts are cell-type specific, with specific subcellular locations, and are developmentally regulated (77,98,99). A large number of ncRNAs are specifically expressed from either the paternal or maternal allele at imprinted loci, and some are associated with human diseases, such as the Prader–Willi and Angelman syndromes (39). Hence, the genetic cause for some, and perhaps many, diseases may be associated with mutations within ncRNAs. An imprinted ncRNA, LANCAT, spanning more than a megabase in the murine region orthologous to the human Prader–Willi/Angelman syndrome locus, exhibits a distinct expression pattern in brain, as well as a cytoplasmic location (100). It has also been shown that some snoRNAs and miRNAs may be encoded within the introns of imprinted ncRNA genes (95,101). The snoRNA HBII-52, which regulates the splicing of the serotonin receptor 5-HT(2C)R gene, is not expressed in Prader–Willi syndrome patients, who have different 5-HT(2C)R mRNA isoforms from normal, suggesting that this defect contributes to the Prader–Willi syndrome (64,102). Antisense transcripts associated with eight transcription factor genes involved in eye development also display specific expression patterns in brain, and in the retina in particular (103). Another non-coding antisense transcript, which has several alternatively spliced isoforms, shows an expression pattern similar to the sense-strand Foxl2 gene, which encodes a forkhead transcription factor involved in development of eyelid and ovary (104).
Third, the upstream regions of ncRNA transcripts show many of the features normally associated with promoters (75,105,106) and, somewhat surprisingly, may be more highly conserved than the promoters of protein-coding genes (75). A recent large-scale study of the binding sites for the transcription factors, Sp1, cMyc and p53, found that a large proportion (36%) correlate with ncRNA transcripts, a significant number of which are regulated in response to retinoic acid, leading to the general conclusion that the human genome contains comparable numbers of protein-coding and non-coding genes that are bound by common transcription factors and regulated by common environmental signals (106).
Finally, an increasing number of ncRNAs have been shown to be functional, including the well-characterized ncRNAs Xist and Tsix that control X-chromosome inactivation in mammals (107,108). They also include a number of well-characterized antisense transcripts which appear to play regulatory roles in relation to their sense gene, including those opposite FGF-2 (fibroblast growth factor-2), HIF-1 (hypoxia inducible factor-1) and myosin heavy chain [for review see (109)]. Increasing numbers of functional studies of ncRNAs are being conducted using ectopic expression and RNAi-mediated knockdowns. For example, ectopic expression of the murine brain-specific ncRNA SCA8, which has been implicated in Spinocerebellar Ataxia Type 8 (110), under the control of a promoter specific to photoreceptors, results in late-onset, progressive neurodegeneration in the Drosophila eye (111). Moreover, using this neurodegenerative phenotype as a sensitized background for a genetic modifier screen, mutations were identified in four genes, all of which encode neuronally expressed RNA binding proteins conserved in Drosophila and humans (111). The knockdown by RNAi of a 6.7 kb spliced and polyadenylated murine ncRNA (TUG1) that is expressed in the retina and brain and upregulated by taurine in developing retinal cells resulted in malformed or non-existent outer segments of transfected photoreceptors in mice (112).
This approach has recently been extended into large-scale screening strategies of ncRNAs. Pairs of siRNAs directed against 512 ncRNA sequences from the RIKEN Fantom2 mouse cDNA collection (113) were used to interrogate a battery of 12 cell-based reporter assays representing key cellular processes and signaling pathways (114). Eight functional ncRNAs were identified (114; J.B. Hogenesch and P.G. Schultz, personal communication), a good rate of return given the limited functional scope of the assays: six essential for cell viability, one repressor of Hedgehog signaling, and one (termed NRON) which acts as a repressor of the transcription factor NFAT, which itself is required for T-cell receptor-mediated immune response, and the development of the heart, vasculature, musculature and nervous tissue. NRON occurs as a variety of alternatively spliced transcripts ranging from 0.8 to 3.7 kb, and interacts with 11 different proteins, possibly as scaffolding for a complex including a translation initiation factor, RNA helicase and proteins involved in nucleocytoplasmic transport, proteolysis and signal transduction (114).
The number of known functional ncRNA genes has risen dramatically in recent years and over 800 ncRNAs (excluding tRNAs, rRNAs and snRNAs) have been catalogued in mammals, at least some of which are alternatively spliced (115,116). ncRNAs have also been implicated in many diseases, including various cancers and neurological diseases (18,115).
There is a rapidly looming nomenclature problem for the large number of ncRNAs (117), especially as the function and mode of action of the vast majority are unknown, and their complex structures and interlacing/overlapping nature make discrete classification difficult. As a considerable fraction of eukaryotic transcripts are spliced, most approaches used, including cDNA cloning, detect only portions of transcripts, which often correspond to exons. Depending upon the method used, these detected sites of transcription have been given an assortment of names, such as ditags, CAGE tags, transfrags and ESTs, to mention a few. In some cases, experiments are used to connect these fragments into full-length or near full-length transcript structures [see e.g. (79)]. When transcripts are found to contain reduced protein-coding potential, these have also been given various names including npcRNA (non-protein-coding RNA), utRNA (untranslated RNA) (117) or TUF (transcript of unknown function) (77). A structured system that may be used to catalog and refer to ncRNAs until they can be grouped and re-classified into recognized structural and/or functional classes is currently being considered by the HUGO Gene Nomenclature Committee (see http://www.gene.ucl.ac.uk/nomenclature/).
SMALL REGULATORY ncRNAs
Small nucleolar RNAs
snoRNAs generally range from 60 to 300 nucleotides in length and guide the site-specific modification of nucleotides in target RNAs via short regions of base-pairing. There are two major classes, the box C/D snoRNAs which guide 2′-O-ribose-methylation, and the box H/ACA snoRNAs which guide pseudouridylation of target RNAs (36,121–123). Initially, it was thought that the role of snoRNAs was restricted to rRNA modification in ribosome biogenesis, but it is now evident that they can target other RNAs, including snRNAs and mRNAs (36,64,121–123). Most mammalian snoRNAs come from the introns of either protein-coding or non-coding genes (124) but apparently some human C/D snoRNAs are independently transcribed as indicated by the presence of methylated guanosine caps at their 5′ ends (125). Although the snoRNAs involved in ribosome biogenesis are located in the nucleolus where this type of ncRNA was first characterized (hence their name), a subset of H/ACA snoRNAs is located in Cajal bodies (a class of small nuclear organelle) and are sometimes called scaRNAs (small Cajal body RNAs) (36). Telomerase RNA is also found in Cajal bodies in a cell-cycle dependent manner (126,127).
At least some snoRNAs exhibit tissue-specific and developmental regulation, and/or imprinting (101,102,128,129), indicative of a regulatory function. There are also a number of so-called orphan snoRNAs without known targets (101,102,123,128,130,131). As noted earlier, one of these snoRNAs is linked to the aberrant splicing of the serotonin receptor 5-HT(2C)R gene in Prader–Willi syndrome patients (64,102). It is also evident that there are many other snoRNAs, as well as, most likely, other as yet functionally uncharacterized classes of small regulatory RNAs, that have yet to be discovered (36,132).
MicroRNAs and small interfering RNAs
miRNAs and siRNAs are short RNA molecules, approximately 22 nucleotides long, derived from either hairpin or double-stranded RNA precursors. Details of miRNA and siRNA biology and biochemistry can be found in a number of recent reviews (8,133–135). miRNAs suppress translation via imperfect pairing with target mRNAs, usually involving a seed pairing of just six to eight nucleotides in length (56), or (as with siRNAs) cause degradation of target RNAs by the RISC complex in the case of perfect complementarity with the target site, the phenomenon known as RNAi. It is estimated that approximately one-third of human protein-coding genes are controlled by miRNAs [reviewed in (119)]. In addition, siRNAs derived from repeats participate in the establishment of silenced (heterochromatic) chromatin, as well as in other aspects of chromosome dynamics, phenomena best studied in yeast [for reviews see (8,136)].
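Target recognition by seed pairing can be sketched in a few lines. The example assumes the commonly used convention that the seed spans nucleotides 2 to 8 from the 5′ end of the miRNA (the text specifies only a length of six to eight nucleotides), considers Watson–Crick pairing only, and uses the let-7a sequence with a made-up target fragment; the function names are illustrative and do not correspond to any published prediction tool.

```python
from typing import List

COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Sequence that would pair perfectly with `rna`, written 5' to 3'."""
    return rna.translate(COMPLEMENT)[::-1]


def seed(mirna: str, start: int = 2, end: int = 8) -> str:
    """Seed region: nucleotides start..end (1-based, inclusive) from the 5' end.

    The 2-8 window is an assumed convention, not a value taken from the text.
    """
    return mirna[start - 1:end]


def seed_sites(mirna: str, target: str) -> List[int]:
    """0-based positions in `target` that are perfectly complementary to the seed."""
    site = reverse_complement(seed(mirna))
    positions = []
    pos = target.find(site)
    while pos != -1:
        positions.append(pos)
        pos = target.find(site, pos + 1)
    return positions


def predicted_mode(mirna: str, target: str) -> str:
    """Crude classification following the text: perfect match implies cleavage,
    seed-only match implies translational suppression."""
    if reverse_complement(mirna) in target:
        return "cleavage"
    if seed_sites(mirna, target):
        return "translational suppression"
    return "no predicted interaction"


if __name__ == "__main__":
    let7a = "UGAGGUAGUAGGUUGUAUAGUU"
    toy_utr = "AAGCUACCUCAGGAUUCUACCUCUU"  # invented sequence with two seed matches
    print(seed(let7a), seed_sites(let7a, toy_utr), predicted_mode(let7a, toy_utr))
```

Because a seed of this length is expected to occur by chance many times in a transcriptome, real target prediction adds further criteria (for example conservation or site context); the sketch shows only the base-pairing logic.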
miRNAs are derived from the introns and exons of both protein-coding and non-coding transcripts that are synthesized by RNA polymerase II (8,137,138). It has also recently been shown that a number of mammalian miRNAs are derived from repeats, mainly various transposons (139), which may lead to a re-examination of the functional role of transposons, especially since it also appears that transposon sequences can play a significant role in the developmental processes and epigenetic variation (140,141). Some miRNAs also appear to be derived from processed pseudogenes (142).
The expression of many miRNAs is regulated and miRNAs have been shown to be central to a wide range of developmental processes, including developmental timing, cell proliferation, left–right patterning, neuronal cell fate, apoptosis and fat metabolism [for reviews see (8,133–135,143)], as well as neuronal gene expression (144), brain morphogenesis (145), muscle differentiation (146) and stem cell division (147). Not surprisingly, therefore, alterations in the expression, sequence or target sites for miRNAs may be a significant but hitherto unrecognized source of human genetic disease, including cancer. Sequence variants in the binding site for the miRNA miR-189 in the SLITRK1 mRNA have recently been shown to be associated with Tourette's syndrome (148). miRNA expression is dysregulated in cancer cells (143,149,150) and miRNA profiling can be used as a very accurate diagnostic tool for cancer classification (151,152). The proto-oncogene c-Myc has been shown to activate expression of an miRNA cluster on human chromosome 13, and two miRNAs (miR-17-5p and miR-20a) from this cluster downregulate expression of the transcription factor E2F1 that activates cell cycle progression (153). Enforced expression of the same miR-17-92 miRNA cluster has also been shown to promote tumor development (154), as has misexpression of the Drosophila miRNA mirvana/mir-278 (155), indicating that some miRNAs may also function as proto-oncogenes.
Until recently, it was believed that the post-transcriptional suppression of gene expression by miRNAs in vertebrates occurs through translational suppression directed by an imperfect duplex formed between the miRNA and the mRNA in the 3′-UTR. However, in 2004, two groups described suppression of HOX gene expression by mRNA degradation because of a perfect match between miRNA and mRNA in the 3′-UTR (71,72). Another example of mRNA degradation because of a perfect match with a trans-acting miRNA has been reported for the imprinted Rtl1/Peg11 locus (73). The maternally transcribed anti-Peg11 transcript is processed into several miRNAs, which cause RISC-mediated cleavage of paternally expressed Rtl1/Peg11 mRNA. Interestingly, the miRNAs are complementary to the coding region, not to the 3′-UTR (73), indicating that miRNA target sites may be located anywhere in the transcript, and indeed in any functional transcript, not just mRNAs. In addition, it has recently been shown that certain miRNA precursors are edited by ADAR1 and ADAR2, resulting in both suppression of processing by Drosha, and degradation by Tudor-SN, which is a component of RISC (156).
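The distinction drawn above, a perfect match that may lie anywhere in a transcript rather than only in the 3′-UTR, can be illustrated with a minimal sketch. It assumes RNA sequences written 5′ to 3′ in the alphabet A, C, G, U; the function names and the toy sequences are hypothetical and are not taken from the cited studies.

```python
# Minimal sketch: locate sites of perfect complementarity to a miRNA anywhere
# in a transcript. Helper names and sequences are illustrative assumptions.

COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence (A, C, G, U only)."""
    return rna.upper().translate(COMPLEMENT)[::-1]


def find_perfect_sites(mirna, transcript):
    """Return 0-based start positions of sites perfectly complementary to the miRNA."""
    site = reverse_complement(mirna)
    transcript = transcript.upper()
    positions = []
    start = transcript.find(site)
    while start != -1:
        positions.append(start)
        # Continue one base further on so that overlapping sites are also reported.
        start = transcript.find(site, start + 1)
    return positions


if __name__ == "__main__":
    toy_mirna = "AUGGCUUACG"             # invented 10-nt sequence
    toy_transcript = "GGGCGUAAGCCAUAAA"  # invented transcript containing one site
    print(find_perfect_sites(toy_mirna, toy_transcript))
```

Because the search runs over the whole transcript, it makes no assumption about whether a site falls in the coding region or a UTR. Imperfect duplexes of the kind associated with translational suppression would need a scoring scheme that tolerates mismatches, which this sketch does not attempt.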
The miRBase database (http://microrna.sanger.ac.uk/) lists over 300 experimentally verified miRNAs in human as well as predicted miRNA target genes (157). However, many more miRNAs have been identified computationally, with a proportion validated post hoc (158). Most miRNA prediction methods rely on identification of a stable stem–loop precursor and phylogenetic conservation [see e.g. (158)]. However, these criteria may be far too narrow. Although many of the known miRNAs are highly conserved (and have been mainly identified on this basis), there is no reason why they all should be. As far as one can tell, these short RNAs have no intrinsic catalytic activity and function simply by target recognition, and thus should be able to evolve relatively quickly by co-variation with their targets, and by positive selection for new connections in regulatory networks underpinning adaptive radiation. Consistent with this, the known miRNAs appear to have many targets, thereby making co-variation difficult, and explaining their strong conservation, which in many cases surpasses that of protein-coding sequences (108). A recent study that did not require substantial evolutionary conservation identified many new human miRNAs, a significant number of which appear to be primate-specific (159).
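The two prediction criteria discussed above, a stem–loop precursor and phylogenetic conservation, can be sketched as a toy filter to show what relaxing the conservation requirement means in practice. This is an illustration under stated assumptions, not the method of any cited study: real predictors assess stem–loop stability by thermodynamic folding, whereas this sketch merely counts base pairs (including G:U wobble pairs) when a sequence is folded symmetrically around a central loop. The function names, the conservation score and both thresholds are hypothetical.

```python
# Toy sketch only: a symmetric pair count stands in for stem-loop stability.
# Names, thresholds and the conservation score are illustrative assumptions.

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def hairpin_pairing_fraction(seq, loop_len=4):
    """Fraction of stem positions that pair when seq folds around a central loop."""
    seq = seq.upper()
    stem_len = (len(seq) - loop_len) // 2
    if stem_len <= 0:
        return 0.0
    # Position i from the 5' end is compared with position i from the 3' end.
    paired = sum((seq[i], seq[-1 - i]) in PAIRS for i in range(stem_len))
    return paired / stem_len


def is_candidate(seq, conservation, min_pairing=0.8, min_conservation=0.9):
    """Apply both criteria; conservation is an assumed score between 0 and 1."""
    return (hairpin_pairing_fraction(seq) >= min_pairing
            and conservation >= min_conservation)


if __name__ == "__main__":
    toy_hairpin = "GCGAUCAAAAGAUCGC"  # invented: 6-bp stem, 4-nt loop
    print(is_candidate(toy_hairpin, conservation=0.5))
    print(is_candidate(toy_hairpin, conservation=0.5, min_conservation=0.0))
```

Setting `min_conservation` to zero keeps the structural criterion while dropping the conservation requirement, which is the kind of relaxation that would be needed to retain lineage-specific candidates.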
The number of predicted human miRNAs is rising rapidly (8,135,159). Sensitive genetic screens in C. elegans have also identified rare miRNAs with limited evolutionary conservation such as lsy-6, which is required for left–right neuronal patterning (160), suggesting that many miRNAs may be cell-type specific and that many more remain to be found.
BIOLOGICAL ROLES OF ncRNAs
As outlined earlier, ncRNAs are already known to fulfill a wide range of functions, including the control of chromosome dynamics, splicing, RNA editing, translational inhibition and mRNA destruction. It is obvious that we have only begun to explore the true extent of RNA regulation of these processes. It also appears that RNA may play a role in virtually all levels of gene regulation in eukaryotes.
A range of evidence suggests that RNA signaling underpins chromatin remodeling and epigenetic memory, although the mechanisms are unknown, and the matter is not without controversy [for reviews and discussion see (8,18,35,161–163)]. There is evidence that transcription from upstream regions can affect the expression of the adjacent gene, either by promoter interference (164) or by altering chromatin structure (165–167), leading to the hypothesis that it is the act of transcription which is responsible for the regulatory effects, and that the transcript itself (an ncRNA) is just a by-product (168). However, it is hard to imagine how transcription per se could convey sufficient information to account for the precise and quite complex changes in histone modification and chromatin remodeling that are observed at most loci. Indeed, there are only a limited number of chromatin-modifying enzymes in animals, suggesting that these enzymes must be targeted to their sites of action, which vary at thousands of loci around the genome during differentiation and development, by another level of sequence-specific signals, most logically RNA. In agreement with this prediction, small RNAs have been shown to induce transcriptional silencing and alterations to DNA methylation in human cells (169,170).
There are also good reasons to expect that splicing is regulated, at least in part, by trans-acting RNAs that guide splice site selection (8,18,64,171) or modify sequences around splice sites to render them accessible or otherwise to the splicing machinery (64).
Evidence is also emerging that transcription itself may be regulated by ncRNAs (18,163). As noted earlier, RNA polymerase II itself appears to be regulated in part by ncRNA signaling (30–34). An ncRNA has been reported to be required for the repression of RNA polymerase II-dependent transcription in primordial germ cells in Drosophila (172). At least some transcription factors (and chromatin-modifying proteins) appear to have affinity for structures involving RNA (173–179). A small double-stranded RNA termed NRSE activates transcription of neuron-specific genes (180) and short artificial RNAs have been shown to inhibit transcription of targeted genes in the absence of concomitant DNA methylation, with considerable potential for therapeutic use (181,182). An interesting case is the steroid receptor RNA activator (SRA), which was originally described as a functional non-coding RNA involved in the regulation of gene expression by steroid hormones (183). The gene produces several transcripts, one of which encodes a protein (184), and both the ncRNA and the encoded protein affect the activity of the estrogen receptor in breast cancer cells (185). Recently, it was shown that pseudouridine synthase mPus1p (an enzyme that converts uridine to pseudouridine in RNA) is a coactivator for the retinoic acid receptor, which acts by pseudouridylation of SRA RNA (186). In addition, the thyroid hormone receptor has an RNA-binding domain which binds SRA, and the binding enhances expression of reporter genes (187).
ncRNAs also play a role in stress responses. The small non-coding transcript B2 is produced by RNA polymerase III from murine short interspersed elements (SINEs) under heat shock. The B2 RNA binds to RNA polymerase II and represses transcription after heat shock (188,189). In primates, RNA polymerase III also produces the brain-specific Alu-derived transcript BC200 (190). Non-coding repetitive RNAs are also transcribed in stressed human cells and are localized in ‘nuclear stress bodies’ that are assembled on specific pericentromeric heterochromatic domains that change their epigenetic status from heterochromatin to euchromatin in response to stress (191). The non-coding RNA omega is among the few heat-shock-inducible genes in Drosophila (192), and although its exact role is unknown, it binds to a number of RNA-binding proteins involved in processing of nuclear RNA (hnRNP complexes) (193).
ncRNAs may also act as scaffolding for the assembly of macromolecular complexes. Examples include rRNA in ribosomes, the 7SL RNA in the SRP (40), and possibly RNAs involved in the assembly of chromatin complexes (35), as well as NRON, recently shown to interact with a number of proteins involved in nuclear transcription factor trafficking (114).
INTRONS AS A SOURCE OF FUNCTIONAL ncRNAs
Introns account for at least 30% of the human genome and may be a significant, perhaps major, source of regulatory ncRNAs (17,87), produced in parallel with protein-coding sequences (and others) as efference signals to convey regulatory information to other genes and transcripts (16,18). Almost all snoRNAs and a large proportion of miRNAs in animals are encoded in introns (138,194–196), located in both protein-coding and non-protein-coding genes [for review see (8)]. Although introns are thought to be simply degraded after being excised from primary transcripts, there is good evidence that intronic RNAs may actually be processed to smaller RNAs (which were not anticipated or detected when introns were first studied) with significant half-lives and specific subcellular locations (197,198). Recently, it was shown that ectopic expression of intronic sequences derived from the CFTR gene causes specific changes in transcription of various genes in HeLa cells (199). Interestingly, each of the three intron sequences tested resulted in a distinctive pattern of effects on specific subsets of genes (199). The idea that introns may be a rich source of regulatory information is consistent with the fact that the density of introns scales with developmental complexity (87), and many highly conserved sequences, including ultraconserved sequences, are found in introns (69,200–203). However, at present, it is simply not known what proportion of transcribed introns are subsequently processed into smaller functional RNAs, although many intronic sequences are detected in whole genome tiling array analyses of human transcription (77).
We may have fundamentally misunderstood the nature of genetic programming in the higher organisms. It appears that the human genome and those of other complex organisms express an enormous repertoire of ncRNAs, and that their cells are awash with these RNAs, which constitute a hidden layer of molecular genetic signals. Although the functions of these RNAs are likely to be many and varied, both logic and evidence strongly suggest that their main role is to regulate and direct the complex pathways of developmental ontogeny, which must require enormous amounts of information in an organism as precisely sculptured as a human (13).
The existence of a sophisticated RNA-based regulatory system would also largely explain the paradox of the tremendous diversity of characteristics observed among mammals and other complex organisms, despite the relative commonality of their proteomes. That such RNAs have remained hidden from view for so long appears to have been a consequence of their sheer numbers and population complexity which makes biochemical detection of individual sequences difficult, combined with the subtlety of their genetic signatures. Indeed, with few exceptions, until recently most known ncRNAs were those that are present in relatively large amounts, such as rRNAs, tRNAs and the common snoRNAs and snRNAs, and it has only been the combination of sensitive genetic screens (such as those that first identified miRNAs), large-scale cDNA and whole genome sequencing, new sensitive analytical methods (such as RT–PCR and genome tiling arrays) and bioinformatics, based on clues from known examples, that has begun to reveal the true complexity of what lies under the surface.
It is also evident that many ncRNAs, including those of demonstrated functionality like Xist, are evolving quickly (108). This rapid evolution has been considered as evidence of lack of functionality (204). This may be incorrect, and these sequences may in fact be simply able to drift easily because of different constraints and/or be subject to positive selection related to phenotypic variation. Recent analyses of the Drosophila genome have indicated that, contrary to long-held expectation, a large fraction of the non-coding sequence is functionally important and subject to various levels of purifying selection and adaptive evolution (205).
The extent of non-coding sequence conservation in mammals is also much higher than that of protein-coding sequences (202,206), perhaps as high as 10% by some estimates (207). This conservation includes ultraconserved sequences (69) and long transposon-free regions that have remained refractory to transposon insertions throughout mammalian evolution (208), observations which are difficult to reconcile with orthodox protein-based conceptions of gene regulation. As noted earlier, there is increasing evidence that transposon-derived sequences may also contribute to mammalian genetic activity. Indeed, it may be that much, if not most, of the sequence comprising the human genome is functional, albeit having arrived at different times in our evolutionary history and evolving at different rates.
The problem has been compounded by the fact that most mutations in regulatory sequences may be both subtle and difficult to track, particularly given the expectational and practical bias to date in genome scanning projects on exonic lamp-posts of protein-coding genes, and the fact that the relevant mutations may be quite distal to these lamp-posts, hidden in the dark of the vast tracts of intergenic and intronic sequences. Good examples are the mutations underlying the callipyge (‘beautiful bottom’) phenotype in sheep and the enhanced muscling of domestic pigs, which are single base substitutions within non-coding sequences (a long intergenic sequence of unknown transcriptional status in the DLK1-GTL2 imprinted region, and the third intron of the IGF2 gene, respectively), and whose identification involved tour-de-force analyses in well structured pedigrees (209–211).
It is clear that different types of genetic information will be subject to different structure–function relationships and therefore different constraints on their variation related to their role and the number of interacting partners. We predict that mutations/variations in many if not most ncRNA sequences, especially those that are involved in regulatory networks, will lead to a variety of milder phenotypes than the usually severe consequences of mutations in proteins, and will have a major influence on quantitative trait variation, developmental differences and abnormalities, cancer and other complex diseases such as neurological disorders.
The functional genomics of ncRNAs will be a daunting task, a challenge equal to or greater than the one we already face in working out the biochemical functions and biological roles of all of the known and predicted proteins and their isoforms (212). Bioinformatics will be key, as it should be possible to use sequence homology (albeit in small patches, and obeying a rather broader set of rules than simply Watson–Crick DNA base pairing) to identify transmitters and their receivers in RNA regulatory networks, as is already the case for miRNAs. This also means that it should be possible to develop generic approaches, applicable to any regulatory RNA or its target, to intersect and modulate gene activity at various levels for therapeutic purposes, which may revolutionize the pharmaceutical industry. The advent of large-scale whole genome (re-)sequencing, which is at an advanced stage of development (213,214), while creating enormous informatic challenges, will soon also provide the density of genomic data required to identify sequences directly associated with different characteristics in structured populations, without assumptions about the genomic position of these sequences or their mode of action.
We thank the Australian Research Council, the University of Queensland and the Queensland State Government for their financial support, and our colleagues for many stimulating discussions. We also thank Tom Gingeras for helpful suggestions. JSM is a Federation Fellow of the Australian Research Council.
Conflict of Interest statement. None declared.