Advances in long noncoding RNAs: identification, structure prediction and function annotation

Abstract
Long noncoding RNAs (lncRNAs), generally longer than 200 nucleotides and with poor protein-coding potential, are usually considered collectively as a heterogeneous class of RNAs. Recently, an increasing number of studies have shown that lncRNAs are involved in various critical biological processes and a number of complex human diseases. Not only are the primary sequences of many lncRNAs directly related to specific functional roles, but strong evidence also suggests that their secondary structures are even more closely related to their known functions. As functional molecules, lncRNAs have attracted growing attention from researchers. Here, we review recent, state-of-the-art advances at three levels of lncRNA research (the primary sequence, the secondary structure and the function annotation), as well as computational methods for lncRNA data analysis.


Introduction
While <2% of the human genome has been reported as protein-coding regions (20 000 genes) [1,2], a large part of the genome gives rise to noncoding RNAs (ncRNAs), which have little or no protein-coding capability [3,4]. Even though many classes of short ncRNAs, such as microRNAs (miRNAs) and Piwi-interacting RNAs [5,6], are widely studied, heterogeneous ncRNAs longer than 200 nucleotides (called long noncoding RNAs or lncRNAs) have attracted extensive interest from researchers [7]. With the rapid progress in high-throughput sequencing technology, thousands of lncRNAs have been identified in mammals [8].
One hypothesis is that most currently annotated lncRNAs are not functional [9], and two arguments support this view. The first is that, like all biochemical processes, the transcription machinery is not perfect and can produce spurious RNAs that have no significant biological purpose [10]; although many lncRNAs are capped, spliced and polyadenylated just like mRNAs, none of these features is an informative indicator of function. The second is that, in some cases, the act of transcription matters but the product of transcription does not [9]. Such cases include RNAs generated during transcriptional interference, in which transcription of a noncoding locus that overlaps regulatory regions regulates gene expression, a mechanism known in both prokaryotes and eukaryotes [11]. However, more and more lncRNAs are reported to play critical roles in biological processes. For example, the Xist RNA, which is required for mammalian dosage compensation [12], is clearly functional, and the roster of biological events in which lncRNAs are key factors is rapidly growing. These events include cell-cycle regulation, apoptosis and establishment of cell identity [13][14][15], among others. More importantly, dysregulation of lncRNAs is associated with a variety of human diseases, including cancer and immune and neurological disorders [16][17][18]. As lncRNAs are crucial regulators of gene expression, their dysregulation is expected to lead to abnormal cellular function, growth defects and many human diseases. Comparison of lncRNA expression profiles in a variety of cancer cells with those in corresponding normal cells has demonstrated that many lncRNAs are dysregulated in a wide range of cancers [18]. Furthermore, multiple lines of evidence increasingly link mutations and dysregulation of lncRNAs to diverse human diseases [19,20].
Alterations in the primary structure, secondary structure and expression levels of lncRNAs, as well as of their cognate RNA-binding proteins, underlie diseases ranging from neurodegeneration to cancers [21]. In cancer metastasis, which consists of a series of sequential and complex steps, lncRNAs exhibit distinct expression patterns in primary tumors and metastases; these patterns can be used for cancer diagnosis and prognosis and can serve as potential therapeutic targets [22].
Even though lncRNAs have attracted increasing research interest, the specific features of their sequences and secondary structures, and the functional mechanisms of most lncRNAs, remain unknown. The aim of this article is to bring together the scattered findings in lncRNA studies, focusing on three levels: sequence, structure and function. The lncRNA-related resources are also provided. We believe that this review will enable researchers to understand the key issues, and will facilitate further advances in understanding lncRNAs.

Basic features in the lncRNA sequences and their identification
Unfortunately, with our currently limited knowledge, there is no clear positive definition of lncRNAs. Generally, lncRNAs are still loosely defined as RNA transcripts more than 200 nucleotides (nt) long that cannot be translated into a protein [23]. Nonetheless, the basic features of lncRNAs can be compared with those of mRNAs, which are translated into proteins. First, consider transcript size and exon structure. In a set of annotated human lincRNAs (long intergenic ncRNAs, a subset of lncRNAs) [24], the average size of the lincRNAs is smaller than that of mRNAs. They also have fewer exons on average, which may partly be attributed to both their lower abundance and incomplete assembly. It has been reported that lncRNAs have an unusual exonic structure, but exhibit standard canonical splice site signals and alternative splicing [25]. In the data set from Cabili et al. [24], most lncRNAs are spliced (98%) and show a striking tendency to have only two exons (42% of lncRNA transcripts versus 6% of mRNAs). Second, similar to mRNAs, many lncRNAs are characterized by 'K4-K36' domains, which consist of histone 3 Lys 4 trimethylation at the promoter followed by histone 3 Lys 36 trimethylation along the transcribed region [8,24,25,26]. Third, there is substantial evidence that lncRNAs, just like mRNAs, are transcribed by RNA polymerase II and usually contain canonical polyadenylation signals, even though some lncRNAs are likely to be transcribed by polymerase III [27]. Fourth, unlike protein-coding genes, which are usually conserved across species, most lncRNAs are poorly conserved, and have therefore been taken for transcriptional noise [28]. Even though lncRNAs are less conserved than mRNAs in most cases, this by itself does not necessarily mean a lack of functionality.
Generally lncRNA promoters are more conserved than their exons, and even as conserved as the mRNA promoters [24][25][26]. Previous evidence has reported that purifying selection exists in different sets of lncRNAs [26,29,30]. The expressed orthologs of a few highly conserved and brain-expressed mouse lncRNAs have also been identified in species as distant as opossums and chickens [24]. Although lncRNAs have low sequence conservation [31,32], increasing evidence indicates critical roles played by lncRNAs, which will be illustrated later in this review.
Transcription of lncRNAs, such as H19 [34], was first observed with traditional cloning methods, without any detection of translation products [33]. Major progress in the experimental identification of lncRNAs came with microarrays and tiling arrays, and more recently with next-generation sequencing technologies [4,35,36]. In the FANTOM project [4,35], cDNA cloning followed by Sanger sequencing identified >34 000 lncRNAs in different mouse tissues. A significant portion of these lncRNAs had confident support [37,38]. For example, lncRNAs identified in GENCODE v7 [25] and the current RefSeq release [39] were based on refined EST and cDNA data. Screening for chromatin signatures such as the 'K4-K36' domain has identified several thousand lincRNAs in mouse and human [8,26].
In recent years, thousands of lncRNAs have been identified owing to the broad application of next-generation sequencing technologies [24,40,41]. It is worth mentioning that methods based on next-generation sequencing data have discovered dozens of lncRNAs expressed in various cancer samples and cell types. Furthermore, a canonical classification method has been applied [13,17,25] to categorize lncRNAs into five biotypes according to their proximity to protein-coding genes: sense, antisense, bidirectional, intronic and intergenic. Given that some transcripts can have both coding and noncoding functions [42], Ulitsky et al. [9] have discussed, with examples, the complexity of classifying noncoding transcripts.
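To make the proximity-based grouping concrete, the following minimal Python sketch assigns a transcript to one of the five biotypes. The `Gene` record, the `classify_biotype` function, the 1 kb bidirectional window and the order in which the rules are tested are illustrative assumptions, not the definitions used in [13,17,25]. Coordinates are assumed to be 0-based and half-open, and each gene is assumed to begin and end with an exon.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Gene:
    """Hypothetical protein-coding gene record (0-based, half-open coordinates)."""
    start: int
    end: int
    strand: str                   # "+" or "-"
    exons: List[Tuple[int, int]]  # assumed to cover both ends of the gene


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def tss(start: int, end: int, strand: str) -> int:
    """Transcription start site: first base on '+', last base on '-'."""
    return start if strand == "+" else end - 1


def classify_biotype(start, end, strand, genes, bidir_window=1000):
    """Assign one of the five proximity-based biotypes (simplified rules)."""
    for g in genes:
        if overlaps(start, end, g.start, g.end):
            if not any(overlaps(start, end, s, e) for s, e in g.exons):
                return "intronic"  # inside the gene span, touching no exon
            return "sense" if strand == g.strand else "antisense"
    for g in genes:
        # Assumed rule: opposite strand and start sites within the window.
        distance = abs(tss(start, end, strand) - tss(g.start, g.end, g.strand))
        if strand != g.strand and distance <= bidir_window:
            return "bidirectional"
    return "intergenic"
```

Real annotation pipelines handle multiple isoforms, partial overlaps and nested genes, all of which this sketch ignores.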
Indeed, determining the protein-coding ability of a transcript is critical in the identification of lncRNAs. It is also challenging, because an lncRNA is likely to contain a putative open reading frame (ORF) purely by chance [42]. Accordingly, criteria such as a lack of evolutionary conservation of the identified ORFs, a lack of homology to known protein domains and a lack of the ability to template significant protein production [34,43] have been used to assess coding potential across thousands of transcripts. Several recent methods and the measures they use are described in Table 1. The 'codon substitution frequency' has been used to develop algorithms that score conserved ORFs across dozens of species, providing a general strategy for determining coding potential [44,45]. However, conservation-based methods may fail to detect young proteins, because these do not contain a conserved ORF [44,45]. The Coding Potential Calculator (CPC) [46] searches for a putative ORF and for homology in the large protein-domain database Pfam [50]. Another method, the Coding-Potential Assessment Tool [47], is similar to CPC and uses the ORF information embedded in transcripts to build its classifier. In contrast, Sun et al. [48] proposed a method that classifies protein-coding and lncRNA transcripts by exploiting the intrinsic composition of the sequences instead of predicting the ORF. Additionally, CONC [49], an SVM classifier that uses nine features based on sequence composition, secondary structure and alignment with proteins, was developed and applied in the FANTOM project, and the gene identification program GeneID [51] was used to measure the protein-coding potential of lncRNAs in GENCODE v7.
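As a simple illustration of the ORF-based criterion, the sketch below finds the longest ATG-to-stop ORF on the forward strand and reports the fraction of the transcript it covers. This is only one ingredient of tools such as CPC, not a reimplementation of any of them; the function names `longest_orf` and `orf_coverage` are hypothetical, and the reverse strand is deliberately ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str):
    """Return (start, end, length_nt) of the longest ATG...stop ORF.

    Only the three forward-strand frames are scanned; the stop codon is
    included in the reported length. Returns (0, 0, 0) if no ORF is found.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    best = (0, 0, 0)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                length = i + 3 - start
                if length > best[2]:
                    best = (start, i + 3, length)
                start = None                   # close it and keep scanning
    return best


def orf_coverage(seq: str) -> float:
    """Fraction of the transcript covered by its longest ORF."""
    if not seq:
        return 0.0
    return longest_orf(seq)[2] / len(seq)
```

Because long transcripts frequently contain such ORFs by chance, a feature of this kind is normally combined with conservation or homology evidence rather than used alone.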
On the experimental side, ribosome profiling [52], in which RNAs associated with polyribosomes are sequenced, provides a strategy for measuring ribosome occupancy on RNAs and thereby distinguishing coding from noncoding transcripts.

Probing lncRNA secondary structures
It is acknowledged that the secondary structure plays an important role for most ncRNA classes, including some lncRNAs [53][54][55][56]. Despite the prevalence of secondary structure-mediated roles, the secondary structures of many lncRNAs and their relation to function remain largely unknown. Here we describe some recent progress related to lncRNA secondary structures.
In general, RNA secondary structure plays many key roles in molecular biology, more so than the primary sequence. The characteristics of lncRNA secondary structures have recently drawn the attention of researchers and clinicians. For example, a functional investigation of the lncRNA MALAT1 reported that it has a tRNA-like structure at its 3′ end [57,58]. Another example is the steroid receptor RNA activator (SRA), which is 0.87 kb in length and has a complex structural organization consisting of four domains with a variety of secondary structure elements, ranging from small, autonomous helical stems to larger structures formed via long-range base pairing [53,55]. SPRY4-IT1 (AK024556), a cancer-associated lncRNA, is derived from an intron of the SPRY4 gene and is predicted to contain several long hairpins in its secondary structure [59]. The lncRNA HOTAIR, which is also implicated in cancer [60], serves as a structural scaffold for protein complexes and possesses complex RNA structural motifs [61]. These structural motifs may act as distinct binding domains for protein complexes such as PRC2 and LSD1, and may serve as signals, guides or scaffolds in different cellular contexts [62]. The lncRNA Gas5 acts as both a molecular decoy and a signal to negatively regulate an effector. Moreover, lncRNA structures may play critical roles in the interactions between lncRNAs and other molecules such as chromatin-modifying complexes [8], chromatin [63] and miRNAs [64]. All these findings suggest an important interplay between lncRNA secondary structures and their biological functions.
The RNA secondary structure, as well as the tertiary structure, can be determined by experimental and computational methods. Because some large RNAs such as ribosomal RNAs and RNase P have already been successfully crystallized, structural studies on lncRNAs will likely be possible in the near future. Because RNAs are extracted from cells and renatured in a buffer, the structures obtained in vitro may differ markedly from their in vivo forms. However, determining RNA structures in vitro also has the important advantage of enabling studies on homogeneous populations of the targets, using systems that are simpler than their in vivo counterparts. Compared with computational methods, experimental methods give more reliable results, but at a higher cost. Computational methods, on the other hand, allow large-scale investigation of lncRNA secondary structures at low cost, despite a high false-positive rate. For instance, in Volders et al. [65], the secondary structures of 21 488 human lncRNAs are predicted by the software RNAfold and displayed in graphics interchange format (.gif) in web browsers. Rfam [66] provides structure information for regions of higher conservation within lncRNA transcripts. The predicted results may provide clues in lncRNA studies and give guidance for future experimental design. Still, a comprehensive whole-genome investigation of lncRNA secondary structures is lacking for any metazoan.
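To give a feel for what computational secondary structure prediction involves, the sketch below implements the classic Nussinov base-pair maximization algorithm. This is a deliberately simplified stand-in: RNAfold predicts minimum free energy structures with a thermodynamic model, which this example does not attempt. The function name `nussinov`, the allowed pair set and the minimum hairpin loop length of 3 are assumptions made for illustration.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq: str, min_loop: int = 3):
    """Return (max_pairs, dot_bracket) for a nested, pseudoknot-free fold."""
    seq = seq.upper().replace("T", "U")
    n = len(seq)
    if n == 0:
        return 0, ""
    # dp[i][j] = maximum number of base pairs in seq[i..j] (inclusive)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                      # case: j is unpaired
            for k in range(i, j - min_loop):         # case: j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    # Traceback to recover one optimal structure
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue                                 # too short to hold a pair
        if dp[i][j] == dp[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = dp[i][k - 1] if k > i else 0
                if dp[i][j] == left + 1 + dp[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return dp[0][n - 1], "".join(structure)
```

The cubic running time of this kind of dynamic programming is one reason why folding full-length lncRNAs at genome scale remains computationally demanding, and maximizing pair counts alone is known to be a crude proxy for stability.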
Recently, experimental techniques based on high-throughput sequencing have been developed to probe RNA structures, such as SHAPE [67], parallel analysis of RNA structure (PARS) [68,69] and FragSeq [70]. These techniques have enabled genome-wide measurements of paired and unpaired regions in RNA secondary structures, and may shed new light on lncRNA secondary structure analysis. Specifically, Li et al. [71] used a high-throughput, sequencing-based, structure-mapping approach to identify the paired (double-stranded RNA) and unpaired (single-stranded RNA) components of the Drosophila melanogaster and Caenorhabditis elegans transcriptomes, providing a global assessment of RNA folding in animals. Kertesz et al. [68] described a novel strategy, termed PARS, based on deep sequencing of RNA fragments, applied it to profile the secondary structures of the mRNAs of the budding yeast Saccharomyces cerevisiae, and obtained structural profiles for over 3000 distinct transcripts. These initial studies indicate that high-throughput sequencing-based methods are an effective and efficient approach for investigating RNA (lncRNAs included) secondary structures on a global scale. Related works have been reviewed in Mortimer et al. [72]. Another recent work [73] has also provided a comprehensive structure map of human coding RNAs and ncRNAs. However, like most existing experimental methods, high-throughput sequencing suffers from the disadvantage that it can only be used to assess RNA structure in vitro, and structures obtained in vitro may differ markedly from their in vivo forms. Indeed, a fraction of the probed RNA secondary structures do not resemble the biologically functional state in many regions [9]. Thus, methods based on high-throughput sequencing may not be as accurate as expected, especially for larger structured RNAs with long-range tertiary interactions.
Nonetheless, it should be acknowledged that the advent of increasingly cheap high-throughput sequencing technologies makes it possible to perform genome-wide investigation of lncRNA secondary structures with higher precision than direct computational prediction methods. Furthermore, genome-wide high-throughput sequencing structural data can be used to constrain folding algorithms and improve their accuracy, as previously shown for specific RNAs [74,75]. Therefore, this huge catalog of structural sequencing data gives us an opportunity to exploit these data collectively, especially when lncRNA secondary structures are also considered.
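One way to picture how probing data can constrain a folding algorithm is to turn per-nucleotide read counts into a constraint string. The sketch below is a hypothetical illustration, not the procedure of [74,75]: the log2 ratio of paired-region to unpaired-region reads, the pseudocount, the threshold and the three-symbol notation (`|` forced paired, `x` forced unpaired, `.` unconstrained) are all assumptions.

```python
import math
from typing import List, Sequence


def probing_scores(paired_reads: Sequence[float],
                   unpaired_reads: Sequence[float],
                   pseudocount: float = 1.0) -> List[float]:
    """Per-nucleotide log2 ratio of paired-signal to unpaired-signal reads."""
    if len(paired_reads) != len(unpaired_reads):
        raise ValueError("read vectors must have the same length")
    return [math.log2((p + pseudocount) / (u + pseudocount))
            for p, u in zip(paired_reads, unpaired_reads)]


def constraints_from_scores(scores: Sequence[float],
                            threshold: float = 1.0) -> str:
    """Map scores to a constraint string; weak evidence is left unconstrained."""
    symbols = []
    for s in scores:
        if s >= threshold:
            symbols.append("|")   # strong evidence for a paired base
        elif s <= -threshold:
            symbols.append("x")   # strong evidence for an unpaired base
        else:
            symbols.append(".")   # ambiguous: let the folding algorithm decide
    return "".join(symbols)
```

A folding routine could then forbid pairs at `x` positions and favour pairs at `|` positions, while positions with little read support remain free.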

The function annotation of lncRNAs
As discussed above, increasing evidence has accumulated for the critical roles played by lncRNAs. Compared with mRNAs, however, lncRNAs are generally expressed in a more tissue-specific manner [24,25]. They also show lower expression levels [24,25,38,76] and higher expression variability across cell lines and tissues [25]. That is, the expression of lncRNAs may be regulated by subtle molecular mechanisms, while the lncRNAs themselves may function as regulators of other molecules. In this section, we discuss the molecular mechanisms of several lncRNAs and the current approaches to lncRNA function annotation. In fact, the molecular mechanisms of most lncRNAs remain largely unknown. However, some clues have been provided recently by well-known examples. First, lncRNAs are implicated in gene regulation through a variety of mechanisms such as epigenetic modification of DNA, alternative splicing, posttranscriptional gene regulation, and mRNA stability and translation [77][78][79]. Moreover, lncRNAs can regulate the expression of protein-coding genes, positively or negatively, and in cis or in trans [80]. For example, the lncRNA Kcnq1ot1 regulates epigenetic gene silencing in an imprinted gene cluster in cis [81]; it specifically interacts with nearby genes in embryonic tissues, causing transcriptional gene silencing. Another example is the lncRNA AK143260, termed Braveheart (Bvht), which acts in trans and specifically promotes activation of a core gene regulatory network to direct cardiovascular lineage commitment [82]. In two recent studies [24,25], both cis-acting and trans-acting co-expression between lncRNAs and mRNAs has been observed. Second, lncRNAs are involved in cellular processes including proliferation, migration, apoptosis and development [83,84], as well as in maintaining pluripotency [84,85].
Based on these molecular features, lncRNAs can be categorized into different groups [33], such as signal, guide, scaffold and decoy. For example, Kcnq1ot1, Air and Xist are described in [86] as signals of active silencing at their respective genomic locations, and other lncRNAs as guides, scaffolds and decoys.
Moreover, a complex interaction network exists between lncRNAs and other molecules such as miRNAs, protein complexes and other regulatory elements. Modular mechanisms have been proposed and ascribed to lncRNAs [87], providing an emerging model whereby lncRNAs may achieve regulatory specificity by assembling diverse combinations of proteins, possibly together with RNA and DNA interactions. For instance, a muscle-specific lncRNA, linc-MD1, can interact with two specific miRNAs, miR-133 and miR-135, and promote muscle differentiation by acting as a competing endogenous RNA in mouse and human myoblasts [88]. The interactions between lncRNAs and other molecules have been exploited in other computational and experimental studies. For example, Khalil et al. [8] studied the associations between lincRNAs and the polycomb repressive complex (PRC) 2 and found that about 20% of 3300 lincRNAs expressed in various cell types are bound by PRC2. A growing number of associations between lncRNAs and other molecules have also been predicted by computational methods or verified by experimental means [63,89].
With the growing number of lncRNAs, there is a critical need to annotate them functionally. However, this is still a challenging task. First, undocumented structural features and weak conservation of the primary sequences of lncRNAs make it difficult to draw inferences by comparison. Second, there is no reliable network model of the relationships between lncRNAs and other molecules. Third, and importantly, experimental validation of lncRNA functions is still expensive, labor-intensive and time-consuming. Fourth, the subtle sequence properties and the spatio-temporal and tissue-specific expression of lncRNAs make them dynamic and elusive, further increasing the difficulty. Nonetheless, pioneering work has been conducted. Work on lncRNA function annotation can be classified into two approaches, experimental and computational [90]. A framework of the computational methods is described in Figure 1. As for the input data, most of these methods are based mainly on lncRNA expression data. One source of expression data is RNA-seq, which can provide a comprehensive quantitative measure of the transcribed molecules in various samples, including the expression of both lncRNAs and other RNA molecules. Another source is microarray data, which can be re-annotated because some of the probes map to lncRNAs. A third source is lncRNA array data, with probes specifically designed for lncRNAs. After the input data are obtained, in the second step, mixed expression profiles for lncRNAs and mRNAs (or other molecules) are constructed. In the third step, differential expression analysis and co-expression analysis can be performed. The former is usually framed as a case-control comparison, such as between the normal and the disease states [91].
The genes with differential expression profiles are then clustered into different gene sets, whereas genes with similar expression profiles are clustered into one gene set. A co-expression network between lncRNAs and other molecules can also be constructed based on the co-expression analysis. In the co-expression network, different network modules are detected and the genes in one module are considered as a gene set. Models and algorithms can be designed and exploited based on the co-expression network. In the fourth step, strategies are employed to functionally annotate the lncRNAs. One strategy is based on the gene sets. For each gene set, function enrichment analysis is performed and the enriched function terms are assigned to the unannotated lncRNAs in the set. An example of this strategy can be found in Guttman et al. [26,84]. Another strategy is based on a network model, with algorithms developed to infer the candidate functions of the lncRNAs in the network. For example, lncRNA functions are predicted with the network strategy in [90,92,93]. A global function predictor, lnc-GFP, was also developed by our group [90], and can effectively perform large-scale function prediction for lncRNAs. In this method, coding-noncoding co-expression data were integrated with protein interaction data to construct a bicolored network, on which a global method based on information flow was designed to infer probable functions for as many lncRNAs as possible. Furthermore, lnc-GFP was integrated into the web server ncFANs [94], which was developed to functionally annotate lncRNAs online.
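The gene-set strategy can be sketched in a few lines of Python: find the mRNAs co-expressed with an lncRNA, then test which function terms are enriched among them. This is a minimal illustration under stated assumptions, not the procedure of any cited study and not lnc-GFP. The Pearson cutoff of 0.9, the use of absolute correlation, the hypergeometric test without multiple-testing correction, and all function, gene and term names are hypothetical.

```python
import math
from typing import Dict, Sequence, Set


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def coexpressed_genes(lnc_profile, mrna_profiles: Dict[str, Sequence[float]],
                      cutoff: float = 0.9) -> Set[str]:
    """mRNAs whose absolute correlation with the lncRNA reaches the cutoff."""
    return {g for g, p in mrna_profiles.items()
            if abs(pearson(lnc_profile, p)) >= cutoff}


def hypergeom_pvalue(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    total = math.comb(N, n)
    tail = sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return tail / total


def annotate_lncrna(lnc_profile, mrna_profiles, term_to_genes,
                    cutoff: float = 0.9, alpha: float = 0.05) -> Dict[str, float]:
    """Return {term: p-value} for terms enriched among co-expressed mRNAs."""
    partners = coexpressed_genes(lnc_profile, mrna_profiles, cutoff)
    background = set(mrna_profiles)
    enriched = {}
    for term, genes in term_to_genes.items():
        genes = set(genes) & background      # restrict to measured genes
        k = len(partners & genes)
        if k == 0:
            continue
        p = hypergeom_pvalue(k, len(partners), len(genes), len(background))
        if p <= alpha:
            enriched[term] = p
    return enriched
```

In practice, the correlation cutoff and significance level would need to be tuned to the data set, and p-values would be corrected for the number of terms tested.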

The databases for lncRNAs
Advances in transcriptome arrays and deep sequencing have given rise to fast accumulation of large data sets of lncRNAs. LncRNA transcripts and related information have recently been gathered in databases dedicated to lncRNA research. In this section, we summarize the content of general and specialized databases on lncRNAs. A recent work [33] has given a comprehensive report on the description and comparative evaluation of the resources and the computational tools, particularly of lncRNA databases. Here, we categorize the lncRNA databases into two main groups, the annotation databases and the interaction databases, in addition to other specific databases. The details are shown in Table 2.
Annotation databases provide information such as the sequences, expression, available secondary structures, related functions and other intrinsic information of lncRNAs. Comprehensive databases such as GenBank [111], FANTOM [112], HinvDB [113], GeneCards [114] and the ENCODE project [1] include annotated lncRNAs and publish updated releases regularly. General knowledge-based databases such as NONCODE [3], lncRNome [96] and LNCipedia [65] offer a good compromise between coverage and depth of annotation. All these annotations provide useful information for understanding lncRNAs. NONCODE is an integrated knowledge database dedicated to ncRNAs (excluding tRNAs and rRNAs). In its fourth version in particular, the number of lncRNAs has increased sharply from 73 327 to 210 831 (accessed on 27 November 2013). Another example is lncRNAdb [95], which provides comprehensive annotations of eukaryotic lncRNAs and enables the systematic compilation and updating of data describing the expression profiles, molecular features and related functions of individual lncRNAs. It focuses on lncRNAs that have been shown to have, or to be associated with, biological functions in eukaryotes, as well as messenger RNAs that have regulatory roles. Some annotation databases are developed for a specific organism, such as PLncDB [100] for Arabidopsis. Other annotation databases, such as fRNAdb [97], lncRNAtor [98] and lncRNome [96], have also documented some interactions between lncRNAs and other molecules.
The interaction-based databases include ChIPBase [101], NPInter [102], miRcode [103], lncRNA2Target [105] and others. These databases store relationships between lncRNAs and other molecules that have been obtained by experimental methods or computational prediction. Several databases give insights into the potential regulatory roles of human lncRNAs and their interactions with miRNAs (Starbase v2.0 [104]), sRNAs (LncRNAMap [99]) and proteins (LncRNAtor). LncRNAtor also provides information on co-expression between mRNAs and lncRNAs in various tissues. In addition, DIANA-LncBase [64] focuses on regulatory associations between miRNAs and lncRNAs, and includes both experimentally supported and computationally predicted interactions. Databases such as lncRNADisease [106] and lnCeDB [107], which focus on functional or logical relationships between lncRNAs and other entities, are also included in this group. Detailed comparisons of the lncRNA databases are available in the review of Fritah et al. [33].
Apart from the resources categorized above, some databases are designed for specific purposes and are also listed in Table 2, such as NRED [108] for the expression of ncRNAs, Linc2go [109] for associating lncRNAs with gene ontology (GO) terms and lncRNASNP [110] for SNPs in lncRNA regions. All these resources can be helpful for lncRNA research, especially for in-depth computational analysis of lncRNA data.
It should be noted that these databases are important in delineating the functional relationships of transcripts. However, substantial divergence exists in the content and specific annotations among these resources [33], which researchers should consider carefully.

Conclusion
In summary, enormous progress has been made toward the comprehensive annotation of thousands of lncRNAs with respect to their primary sequences, structural features and related functions. The mechanistic underpinnings of a few well-studied examples suggest that many of these transcripts might participate in important and diverse biological processes and human diseases. Current research is exploring how lncRNAs may participate in these cellular activities. To this end, expanding experimental techniques together with computational algorithms can provide valuable insights.
At the sequence level, most studies focus on comparisons with mRNAs and on negative descriptions of lncRNAs, such as splicing pattern, 5′ cap, poly(A) tail and properties related to 'limited protein-coding ability'. To date, there is no general positive definition of lncRNAs, despite advances in defining some of their subtypes and the motifs embedded in lncRNA sequences. At the structure level, the components discovered in lncRNA secondary structures are of great value for further analysis, especially analysis based on high-throughput sequencing technologies. At the function level, increasing evidence has indicated important roles of lncRNAs in biological processes and diseases, even though the molecular mechanisms of most lncRNAs remain unknown. Nonetheless, lncRNA expression data and the interactions between lncRNAs and other molecules may provide valuable clues to lncRNA functional mechanisms. In short, the coming advances in the study of lncRNAs, especially at a genome-wide scale, offer an exciting opportunity to investigate lncRNA function in the future.

Key points
• Many basic features of lncRNA sequences are similar to those of mRNAs, even though lncRNA sequences have limited protein-coding ability, as indicated by coding potential tools and other methods.
• Many secondary structural components in lncRNAs can be identified, as well as their related functional roles. High-throughput sequencing-based methods, combined with computational prediction methods, may shed light on lncRNA secondary structures.
• While a few lncRNAs have been demonstrated to play key roles in various biological processes, the functional mechanisms of many lncRNAs remain poorly understood. Computational methods can be employed to predict probable functions of lncRNAs, mainly based on gene expression data for lncRNAs and others.
• The databases for lncRNAs can be categorized into two groups, the first group based on annotations, and the second group based on the relationships between lncRNAs and other molecules. Data from these resources can be used for further data-mining of important functional patterns of lncRNAs.