deepBase v2.0: identification, expression, evolution and function of small RNAs, LncRNAs and circular RNAs from deep-sequencing data

Small non-coding RNAs (e.g. miRNAs) and long non-coding RNAs (e.g. lincRNAs and circRNAs) are emerging as key regulators of various cellular processes. However, only a very small fraction of these enigmatic RNAs have been well functionally characterized. In this study, we describe deepBase v2.0 (http://biocenter.sysu.edu.cn/deepBase/), an updated platform, to decode evolution, expression patterns and functions of diverse ncRNAs across 19 species. deepBase v2.0 has been updated to provide the most comprehensive collection of ncRNA-derived small RNAs generated from 588 sRNA-Seq datasets. Moreover, we developed a pipeline named lncSeeker to identify 176 680 high-confidence lncRNAs from 14 species. Temporal and spatial expression patterns of various ncRNAs were profiled. We identified approximately 24 280 primate-specific, 5193 rodent-specific lncRNAs, and 55 highly conserved lncRNA orthologs between human and zebrafish. We annotated 14 867 human circRNAs, 1260 of which are orthologous to mouse circRNAs. By combining expression profiles and functional genomic annotations, we developed lncFunction web-server to predict the function of lncRNAs based on protein-lncRNA co-expression networks. This study is expected to provide considerable resources to facilitate future experimental studies and to uncover ncRNA functions.

Recent advances in high-throughput next-sequencing technology have produced large numbers of short and long RNA sequences, and enable the detection and profiling of known and novel ncRNAs at unprecedented sensitivity and depth (1). Many studies have identified thousands of sRNAs, lncRNAs and circRNAs from deepsequencing datasets in various species, including human (13)(14)(15)(16)(17), mouse (16,18), fruitfly (19), nematode (16,20) and zebrafish (21,22). With the increasing amount of deepsequencing data available, there is a great need to integrate these large-scale data sets to explore the expression, evolution and function of diverse ncRNAs.
To meet above-mentioned needs, we have updated deep-Base (23) to version 2.0 (deepBase v2.0) (Figure 1). The deepBase v2.0 facilitates the integrative, interactive and versatile display of, as well as the comprehensive annotation and discovery of sRNAs, lncRNAs and circRNAs ( Figure  1). In deepBase v2.0, we constructed the most comprehensive expression profiles and evolutional patterns of diverse ncRNAs.  A system-level overview of the deepBase v2.0 core framework. A total of 558 small RNA datasets and 478 RNA-seq datasets were retrieved from NCBI GEO or SRA database. All the small and large noncoding RNAs were identified. The expression, evolution and functions of these ncRNAs were further analyzed. All the results generated by deepBase v2.0 were deposited in MySQL relational databases and displayed in the visual browser and web pages.

Annotation and Identification of sRNAs, lncRNAs and cir-cRNAs
The raw data downloaded from NCBI SRA or GEO databases were classified into different clades, species, tissues and cell-lines according to the description on the website or related literature. For small RNA annotation, after remove the 3 adapter sequences, the clean reads were then mapped back to their corresponding reference genome using bowtie program. Bedtools intersect tool was used to inspect how many reads could matched to the region of an- notated small RNA genes. The reads which were perfectly located in the annotated gene regions were used for expression evaluation.
To accurately annotate lncRNA, we proposed a pipeline to process transcriptome annotation profiles generated by RNA-seq (usually assembled by Tophat (31) and Cufflinks (32)) using several filters described as follow: (1) Transcript length filter: Multi-exonic transcripts whose length exceeded 200nt were kept. (2) Known non-lincRNA annotation filter: Transcripts were excluded from further consideration if they overlap, or reside within 50nt upstream or downstream of, any exon of any transcript from the following annotation sets: coding genes annotated in RefSeq, Ensembl or UCSC knownGene; pseudogenes annotated in Ensembl or Yale Pseudogenes. Those which have any overlap spanning at least 80% of miRNA precursors annotated in miR-Base or miRNA/tRNA/rRNA/snRNA/snoRNA annotated in Ensembl were also eliminated. The above rules apply in a strand-specific manner, thus allowing for calling antisense lncRNA. (3) Coding potential filter: Three open-reading-framesrelated metrics were used to estimate whether the remaining transcripts exhibit significant protein-coding potential. The getorf program from the EMBOSS suite (33) was used to find ORFs in the candidates and therefore the length and coverage of ORFs and were calculated. ORFScan scores were calculated by the txCd-sPredict program from UCSC. Transcripts with ORF <100 aa, ORF coverage less than 30% and txCdsPredict score <800 were classified as lncRNAs. Furthermore, we classified an additional subset of transcripts (ORF coverage between 30% and 90% and txCdsPredict score <800) that may include short ORFs and may serve as either lncRNAs or small peptides as TUCP (transcripts of uncertain coding potential), which was defined and used in several papers (15,34). The cutoff for these parameters was determined based on our survey of the known functional lncRNAs collected in the lncRNAdb database (5,35). (4) Known protein domains filter: Each remaining transcript was mapped to unitRef90 database with BlastX and transcripts with E-value < 1E-30 were removed. The resulting transcripts were translated in all three forward frames and fed to HMMER v3.0 (36) to examine the occurrence of any PfamA and PfamB known protein family domains collected in the Pfam database (37). Transcripts with a Pfam hit showing both the full sequence E-value < 1E-5 and the single best domain Evalue < 1E-5 were removed.
All transcripts that passed all the filters above were classified based on their loci and treated as a set of lncRNAs in the following study. All the lncRNAs in deepBase v2.0 are named according to the following rules. The names of all the lncRNAs in deepBase v2.0 consist of three parts, 'A-B-C'. Amongst, 'A' is composed by three letters, which represents the shorted name of species, 'B' is the gene ID and 'C' is the transcript ID. For example, hsa-lncRNA10932-8 means the 8th transcripts of gene lncRNA10932 in human.
Human, mouse and C. elegans circRNA genes were downloaded from circBase v1.0 (12) or obtained from the Supplementary Data of the original articles. In addition, we used the method 'find circ' (16) to identify novel circRNAs from raw RNA-seq data of ENCODE and modENCODE. In brief, raw reads were first aligned to the genome. Then, unmapped sequencing reads were used for back-splice sites identification.

Expression analysis of sRNAs and lncRNAs
DESeq (38)  includes a program, Cuffnorm, which can be used to generate tables of expression values that are properly normalized for library size.

Evolution analysis of long non-coding RNAs
The sequences of all identified long non-coding RNAs were extracted from genomes of corresponding species using Bedtools v2.17.0 (39). As described by Ulitsky et al. (21), we performed similarity searches with each lncRNA sequence in one species against that in the rest species using NCBI BLASTN with parameters '-task blastn -word size 6 -evalue 1e-5 -strand plus' and the resulting hits with the lowest evalue was kept as the most conserved lncRNA.

Predicting the function of lncRNAs from co-expression networks
Expression correlation (Pearson correlation coefficient) between lncRNAs and protein-coding genes was estimated using coexpression program (the program is available from the authors upon request), which was written in C language and ALGLIB library, and p-value was adjusted with the False Discovery Rate (FDR) correction (40). GO ontology data for the Ensembl Genes was downloaded from the Ensembl website (41). There are thirteen functional categories for human genome. Enrichment analysis of these pathways in the dataset was determined using a hypergeometric test with Bonferroni and FDR correction (40).

The web-based exploration of sRNAs, lncRNAs and circR-NAs
deepBase provides genome-wide identification of small RNAs in multiple types, including microRNAs, snoRNAs, rRNA-derived small RNAs. In total, there are 18 508 small RNAs are identified in 51 studies of 10 species (Table 1). Users who want to see the results of these small RNAs can use the Browse page. For example, deepBase v2.0 integrates a study which sequenced small RNAs in 60 samples of tumour or normal human cervical tissues (42). After our systematic re-annotation of all types of small RNAs in these samples, the 'Browse' page shows 3,176 small RNAs in the result table. By ordering the 'Sample' column of the table, users can find 14 small RNAs that could be detected in all of the 60 samples, including 12 miRNAs, one snoRNA (SNORD85) and one snRNA (RNU2-59P). By sorting the 'Expression' column of the table, the highest expressed miRNA hsa-mir-21 ranks on the top of the table. The 'View' button on the last column can link to the detail information of this miRNA, which displays the distribution of all the sequencing reads on the miRNA precursor in each sample. Most of the reads correspond to the mature arm of the miRNA precursor, and the highest one has 10 005 sequencing copies.
In the Browse page, we also provide integration of lncR-NAs identified by RNA-seq, organized by different species and corresponding studies ( Table 1). About half of the lncRNAs were not annotated in GENCODE (Supplementary Figure S1). For instance, we applied our lncRNA identification pipeline (detailed in Methods) to accurately annotate lncRNAs in mouse from RNA-seq data generated by a previous study (43). 14 462 mouse lncRNAs were identified, among which 11 220 were not annotated in GEN-CODE. We compared the result of lncRNAs in deepBase v2.0 with other three studies (13,15,34). Supplementary Table S2 shows the number of overlapped lncRNAs in deep-Base v2.0 and novel lncRNAs that is not reported in the other studies. According to our results, although varied in amounts, the lncRNAs in different species shared common features (Supplementary Figures S2-S5). For example, the transcript length of lncRNAs is generally shorter than that of protein-coding genes but longer than other ncRNAs (Supplementary Figure S3). Users can view basic information of each lncRNA in the query result, and pressing the 'Predict' button of each entry will call the web-based lnc-Funtion program (introduced in the later session) to predict possible functions of the lncRNA.

Expression profiles of sRNAs and lncRNAs
One of the most important features of miRNAs is their special expressions in different stages, tissues or cells. deepBase v2.0 provides the expression profiles of small RNAs in each sample. Users can find their interested miRNA name in the small RNA 'Expression' page. The heatmap will display the normalized expression number of miRNA in each sample. Click the name of the miRNA will link to the detail expression result page. Users can find that hsa-mir-122 is especially highly expressed in normal liver tissues compared to other samples.
Spatial and temporal specific expression is one of the well-characterized properties of lncRNAs (15). In the Expression page, expression profiles of lncRNAs are visualized through heatmaps, providing an easy way for users to spot tissue-specific lncRNAs. For instance, H19, a maternally imprinted lncRNA, was reported to be highly expressed in placenta and down-regulated in all tissues except skeletal muscle immediately after birth (44)(45)(46). This phenomenon can be observed through our expression heatmap of H19 generated from data of the Human Body 2.0 project (47).

Evolutional conservation of lncRNAs in 14 species
We conducted evolution conservation analysis of lncRNAs in 14 species. Many studies show that most lncRNAs are lineage-specific and in fast phylogenetic evolution (13). Out of 18,964 lncRNAs in human we identified, 16 926 have at least one ortholog in primates. However, only 754 lncRNAs present to be conserved in Platypus. 13 800 human lncR-NAs have not previously been annotated, among which hsa-lncRNA12238 from antisense is highly but only conserved in primates, indicative of weak evolution constraints. hsa-lncRNA12238 presents intensively high expression in human testis. To verify that, we then predict the function of hsa-lncRNA12238 in Function entry, and GO items show strong correlation with such reproductive process as spermatogenesis, sperm-egg recognition and so forth.

lncSeeker web-server for the annotation and identification of lncRNAs
We provide a web-based tool named IncSeeker to implement our filtering pipeline of finding lncRNAs transcriptome assembled from RNA-seq data. Users can upload their transcriptome data in BED12 or GTF format, and our web-server will process the file by a series of tunable criterion and filter steps, including transcript length, exon number, coding ability and potential coding domain. A strict set of lncRNAs will then be displayed on the result page. To evaluate the performance of lncSeeker, we used lncSeeker to predict lncRNAs in GENCODE gene sets. There are 11 324 high-confidence lincRNA transcripts and 95 309 protein-coding transcripts in GENCODE Release 19 (GRCh37.p13). The result showed that the accuracy of lncSeeker was 96.82%, the sensitivity and specificity were 70.08% and 100% (Supplementary Table S3), respectively. We further compared lncSeeker with other four tools related with lncRNAs identification, getorf (33), txCdsPredict (48), CPAT (49) and CNCI (50). The result showed lncSeeker had the highest accuracy and specificity.

lncFucntion web-server for the functional annotation of lncR-NAs
A web-based tool, lncFunction, was also developed to predict lncRNA functions from coding and non-coding coexpression networks. There are six options on the query page, and the option 'Input the Target LncRNA' is required. The running parameters, expression values from selected study and every outcome in the three categories of function prediction are available for users to download.

CONCLUSIONS
Although a few dozen of lncRNAs and circRNAs have been characterized to some extent and reported to function in important cellular processes (1)(2)(3)(4), such as differentiation, development, proliferation, self-renewal, pluripotency, carcinogenesis and progression, the functions of most annotated ncRNAs are unknown (1)(2)(3)(4). In this study, by analyzing a large set of sRNAs, lncRNAs and circRNAs identified from deep-sequencing datasets, we have annotated ncRNAs with a broad range of structural, expression, and evolutionary features.
Currently, there is an increasing amount of databases developed for ncRNAs, including NONCODE (51), lncR-NAdisease (52), ncFANs (53), lncRNA2Function (54), FAME (55), miRo (56), miRGator (57) and lncRNAwiki (58). NONCODE (51) is a comprehensive database which contains all kinds of ncRNAs, but no evolution information for these ncRNAs. lncRNAdisease (52) is a specialized database which collects disease-associated lncRNAs and their interaction entries. On comparison with ncFANs (53), which is used to predict lncRNAs from 'Affymetrix microarray datasets', our tool is designed to identify lncR-NAs from 'RNA-Seq' based datasets. Compared with other pipeline/tools (59), our pipeline integrates more tools as well as built-in scripts and annotates and identifies lncR-NAs more reliably. Compared with our previous deepBase v1.0 which was designed to focus on small RNAs (60), deepBase v2.0 was designed to identify and annotate lncR-NAs and circRNAs, and to predict the functions of lncR-NAs. deepBase v2.0 also provides more comprehensive expression and evolution profiles of lncRNAs, circRNAs and small RNAs. The distinctive features of deepBase v2.0 include the following: (i) deepBase v2.0 has provided the most comprehensive expression analysis of sRNAs and lncR-NAs from 1036 RNA-Seq datasets from 19 species. The constructed gene expression profiles of both ncRNAs and protein-coding genes from our analyses are a valuable resource for understanding the similarities and differences of transcriptional regulation of protein-coding genes and ncR-NAs across different tissue/cell-line types. (ii) To the best of our knowledge, this is the first attempt to construct evolutional patterns of lncRNAs and circRNAs across several evolutional clades. Conservation patterns provided by our database may help biologists to select important ncR-NAs for further functional validation. (iii) Two web-based tools, lncSeeker and lncFunction, can be used to identify high-confidence lncRNAs, and to predict lncRNA functions. We expect that access to these tools will enable more researchers to search for functions of novel lncRNAs in the ever-increasing amounts of deep-sequencing data.

FUTURE DIRECTIONS
Next-generation sequencing technologies play vital roles in improving our understanding of functional genomics. As sRNA-Seq and RNA-Seq technology is applied to a broader set of species, cell lines, tissues and conditions, we will continually maintain and update the database. The integration of transcriptome datasets from the deepBase database, and the cancer genomics data and clinical information from the Gene Expression Omnibus (GEO), The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC), will improve our understanding of expression and function of ncRNAs in diseases.

AVAILABILITY
deepBase v2.0 is freely available at http://deepbase.sysu.edu. cn/ or http://biocenter.sysu.edu.cn/deepBase/. The deep-Base data files can be downloaded and used in accordance with the GNU Public License and the license of primary data sources.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.