lncSLdb: a resource for long non-coding RNA subcellular localization

Abstract While long non-coding RNAs (lncRNAs) may play important roles in cellular function and biological processes, we still know little about them. Growing evidence indicates that the subcellular localization of lncRNAs may provide clues to their functionality. To help researchers functionally characterize thousands of lncRNAs, we developed a database-driven application, lncSLdb, which stores and manages user-collected qualitative and quantitative subcellular localization information of lncRNAs from literature mining. The current release contains >11 000 transcripts from three species. Based on the region in which lncRNAs accumulate, we classify transcripts into three basic localization types (nucleus, cytoplasm and nucleus/cytoplasm). Where the evidence allows, the nucleus and cytoplasm types are further divided into three more specific subtypes (chromosome, nucleoplasm and ribosome). Besides browsing and downloading data in lncSLdb, users can search by gene symbol, genome coordinates or sequence similarity with a set of comprehensive tools. We hope that lncSLdb will provide a convenient platform for researchers to investigate the functions and molecular mechanisms of lncRNAs from the perspective of subcellular localization.


Introduction
Long non-coding RNAs (lncRNAs) are non-coding transcripts longer than 200 nucleotides (1,2). In recent years, with the development of biological techniques, especially the broad application of high-throughput RNA sequencing (RNA-Seq) (3,4), more and more novel lncRNAs have been identified and annotated in genomes (5)(6)(7). Growing evidence suggests that lncRNAs have important functions in various aspects of cellular function and biological processes (8)(9)(10). However, the function of most lncRNAs is still unclear (10).
Unlike mRNAs, which are transported to the cytoplasm and translated into proteins on ribosomes, lncRNAs have little coding potential. As with proteins, the function of lncRNAs depends heavily on their subcellular localization (10,11). lncRNAs that accumulate in the nucleus may take part in nuclear organization or regulate gene expression before transcription (11,12), whereas lncRNAs that accumulate in the cytoplasm have important roles in post-transcriptional regulation and post-translational modification (11,12). For example, the lncRNA Airn, which accumulates in the nucleus, is involved in silencing Igf2r by overlapping with its promoter (13); Neat1 is an essential component of paraspeckles and is related to the nuclear retention of structured or edited mRNAs (14). The cytoplasmic lncRNA NKILA can influence NF-κB activation by inhibiting IKK-induced IκBα phosphorylation (15); TUG1 and CTB-89H12.4 can regulate PTEN expression by acting as sponges that compete with PTEN transcripts for microRNAs (16).
Therefore, subcellular localization is a very important property for understanding the function of lncRNAs. Researchers have already investigated the subcellular localization of a set of lncRNAs, and there is a great need for integrated platforms to manage, search and analyse these data. Amaral et al. (17) published lncRNAdb, which contains subcellular localization information for ∼80 lncRNA genes. Zhang et al. (18) developed RNALocate, a database that collects the subcellular localization of all kinds of RNA and contains >1700 lncRNA genes from 10 different species. Mas Ponte et al. (19) published LncATLAS, which collects the subcellular localization of 7267 human lncRNA genes in 15 cell lines and defines the relative concentration index (RCI) for measuring localization. However, these systems usually focus on lncRNA genes instead of lncRNA transcripts and cover only a small fraction of the available lncRNAs in different species. We also note that these systems provide only limited support for qualitative and/or quantitative experimental results, such as photos or expression levels in different cell compartments. More details are shown in Table 1.
We developed an lncRNA subcellular localization system (lncSLdb), which collects qualitative and quantitative subcellular localization information of lncRNAs by manually curating the literature. The current release contains subcellular localization information for >11 000 lncRNA transcripts from 9494 genes and three main species (human, mouse and fruit fly), classified into three basic subcellular localization types (nucleus, cytoplasm and nucleus/cytoplasm) and three subtypes (ribosome, chromosome and nucleoplasm), all of which are supported by biological experiments. Our aim is to provide a comprehensive platform to help researchers investigate the subcellular localization of lncRNAs and, from there, their functions and potential molecular mechanisms. lncSLdb collects a set of information about lncRNAs, including gene IDs/symbols, transcript IDs, genome coordinates, gene/transcript biotype, subcellular localization and relative expression ratio or experimental pictures. The data set used by our system can be downloaded freely. Furthermore, researchers can submit new subcellular localization information for lncRNAs to lncSLdb.

Data collection and implementation
We searched published papers in the PubMed Central (PMC) database using 'long non coding RNA subcellular localization' and 'lncRNA subcellular localization' as keywords, which returned >3000 papers. All papers were filtered manually to determine whether they are related to lncRNA subcellular localization. Papers that are not included in the result set but are cited by a paper in the result set were also considered. The current release includes ∼100 papers, filtered from the first 1000 search results and their references (Figure 1). We also collected gene/transcript genome information from other databases such as FlyBase (20), Ensembl (21), UCSC (22), MGD (23), GenBank (24) and Gencode (25).
lncSLdb is developed with HTML/JSP and Java, using MySQL (http://www.mysql.com/) as the database management system. The web interface is based on the Bootstrap (http://getbootstrap.com/2.3.2/) and AdminLTE (https://www.almsaeedstudio.com/) frameworks, with JavaScript scripts developed to support user interaction.

Database structure and content
For every localization item in lncSLdb, we consider three aspects: transcript information, gene information and subcellular localization information. All information contained in lncSLdb is listed in Table 2.
Transcript information records the basic information of transcripts, including transcript ID, genomic coordinates and biotype. Since novel lncRNAs are being identified daily, many of these transcripts may still have no official names. We therefore add the genomic coordinates (chromosome, strand, transcript start position and transcript end position) as an identifier for every transcript. We fetch the genomic coordinates from Ensembl (21), UCSC (22), MGD (23), GenBank (24) and FlyBase (20) according to the transcript IDs. For transcripts without official IDs, we use the genomic coordinates described in the corresponding articles. GRCh37 and GRCh38 are used as the reference genomes for human, GRCm38 for mouse and BDGP6 for fruit fly. We also obtain the transcript biotype from the Ensembl database for transcripts with Ensembl IDs. For transcripts with a GenBank accession number, we use FEELnc (26), a tool for lncRNA annotation, to classify transcripts into different biotypes by comparing their genome location with that of Gencode (25) transcripts. The biotype of other transcripts is taken from the description in the corresponding papers, or marked as 'lncRNA' if no description is given.
Gene information consists of gene symbol, Ensembl ID, alias, genomic coordinates and gene biotype. Since an lncRNA gene may have many isoforms, which may have different subcellular localization types, we gather all transcripts belonging to the same gene to show its localization type. For intronic lncRNAs, the information of the host gene is used as gene information. To avoid mismatches due to alias names, we convert all names to Ensembl IDs and obtain the gene symbol from the Ensembl database. All other names are treated as aliases. For genes that cannot be found in the Ensembl database, the Ensembl ID field is set to unknown, while the known gene name is used as the gene symbol. For transcripts that do not belong to any gene, the gene is marked as unknown.
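The gene-level grouping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `localization_by_gene`, the record layout (gene ID, transcript ID, localization type) and the identifiers in the usage example are hypothetical and do not reflect the actual lncSLdb schema. Collecting all gene-less transcripts under a single 'unknown' key is also an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple

# Hypothetical record layout: (gene_id, transcript_id, localization_type).
Record = Tuple[Optional[str], str, str]


def localization_by_gene(records: Iterable[Record]) -> Dict[str, Dict[str, Set[str]]]:
    """Group localization types by gene, then by transcript.

    Transcripts without a gene are collected under the key 'unknown'
    (an assumption made for this sketch).
    """
    grouped: Dict[str, Dict[str, Set[str]]] = defaultdict(dict)
    for gene_id, transcript_id, localization in records:
        key = gene_id if gene_id else "unknown"
        grouped[key].setdefault(transcript_id, set()).add(localization)
    return dict(grouped)


# Made-up identifiers, for illustration only.
example = [
    ("GENE_A", "tx1", "nucleus"),
    ("GENE_A", "tx2", "cytoplasm"),
    (None, "tx3", "nucleus"),
]
summary = localization_by_gene(example)
```

Keeping the per-transcript sets separate, rather than merging them into one gene-level label, preserves the case where isoforms of the same gene localize differently.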
Subcellular localization information records the experimental conditions and results, mainly the cell line or tissue used, the experimental method, the experimental conclusion and the specific experimental results. lncRNA subcellular localization is typically obtained from two types of experiments. The first is based on in situ hybridization, for example ISH (27) and RNA-FISH (fluorescence in situ hybridization) (28,29). The second combines nuclear/cytoplasmic fractionation with an expression assay using either microarrays (30) or RNA-Seq (31). The first type of method produces images showing the subcellular localization of a certain lncRNA, while the second provides specific expression levels in different cellular compartments. In lncSLdb, we show the in situ hybridization photos collected from papers or public databases such as Fly-FISH (32). For sequencing results, we show bar plots of the expression level in different cell compartments and compute the relative ratio for every compartment with the following formula:

\[ \mathrm{relative\ ratio}(comp) = \frac{\mathrm{exp}(comp)}{\min_{x \in CS} \mathrm{exp}(x)} \]

Here exp(x) is the expression level in compartment x and CS is the set of cell compartments measured.

Table 2. Information contained in lncSLdb

Transcript information
  Chromosome        The chromosome of the transcript
  Start             The transcript start position of the transcript
  End               The transcript end position of the transcript
  Strand            The strand of the transcript
  Biotype           The biotype of the transcript
  Sequence source   The source of transcript sequences

Gene information
  Gene symbol       The official symbol of the gene
  Ensembl ID        The Ensembl ID of the gene
  Alias             The alias of the gene
  Chromosome        The chromosome of the gene
  Start             The start position of the gene
  End               The end position of the gene
  Strand            The strand of the gene
  Biotype           The biotype of the gene
  Species           The species of the gene
  Version           The ...

We distinguish three basic types of subcellular localization: accumulated in the nucleus, accumulated in the cytoplasm and accumulated in both (nucleus/cytoplasm). Where the reported location is more precise, our system records the most specific subregion within the nucleus or cytoplasm. In the data we collected, some lncRNAs accumulate in the chromosome or nucleoplasm within the nucleus, and some accumulate in ribosomes within the cytoplasm. The localization type is taken directly from the papers. If the authors did not state the type explicitly, we provide a reference type: a transcript is considered nuclear accumulated if its nuclear expression level is >2-fold its cytoplasmic expression level, cytoplasm accumulated if its cytoplasmic expression level is >2-fold its nuclear expression level, and accumulated in both otherwise, similar to the definition in (30).
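The relative ratio and the 2-fold reference rule described above can be sketched as follows. This is a minimal illustration, not the lncSLdb implementation: the function names, the dictionary input and the numbers in the usage example are hypothetical. Rejecting non-positive expression values is an assumption of the sketch, since the text does not say how a zero minimum is handled.

```python
from typing import Dict


def relative_ratios(expression: Dict[str, float]) -> Dict[str, float]:
    """Divide each compartment's expression level by the smallest level measured."""
    smallest = min(expression.values())
    if smallest <= 0:
        # Assumption: the text does not define the ratio for a zero minimum.
        raise ValueError("expression levels must be positive")
    return {comp: level / smallest for comp, level in expression.items()}


def reference_type(nuclear: float, cytoplasmic: float, fold: float = 2.0) -> str:
    """Assign a reference localization type using the 2-fold rule."""
    if nuclear > fold * cytoplasmic:
        return "nucleus"
    if cytoplasmic > fold * nuclear:
        return "cytoplasm"
    return "nucleus/cytoplasm"


# Made-up expression levels, for illustration only.
ratios = relative_ratios({"nucleus": 6.0, "cytoplasm": 2.0})
label = reference_type(nuclear=6.0, cytoplasmic=2.0)
```

With the strict comparison used here, a transcript whose nuclear level is exactly twice its cytoplasmic level falls into the nucleus/cytoplasm type, following the 'more than 2-fold' wording.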
The current release contains >11 000 transcripts from ∼100 papers, mainly involving three species. Specifically, there are 9003 transcripts for human, 2630 for mouse, 59 for fruit fly and 6 for other species. In total, we collected >14 000 subcellular localization records. The distribution of localization types is shown in Figure 2.

Querying the database
lncSLdb is available online at: http://bioinformatics.xidian.edu.cn/lncSLdb. Users can browse, query and download data through the web interface. On the browse page, all items are listed and can be filtered by species, localization and transcript biotype. Every item has a detail page about the transcripts and localization, including transcript ID, transcript genome coordinates, subtype, method and cell used for the experiment, reference article and its PMID, localization conclusion and the specific result. Transcripts belonging to the same gene are listed on the same detail page, with the gene information shown at the top.
On the search page, we provide a comprehensive query tool. Users can query lncRNA localization using the gene name or transcript name as keywords and by selecting a specific species, biotype and subcellular localization type. We also offer a tool to search for transcripts in a genome region, in order to find novel transcripts without official names. In addition, there is a tool for finding the localization type of homologous transcripts from sequences supplied in FASTA format.
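The genome-region search can be pictured as an interval-overlap test on the stored coordinates. The sketch below is an assumption-laden illustration, not the query lncSLdb actually runs: the `Transcript` class, the `search_region` function, the 1-based inclusive coordinate convention and the example coordinates are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Transcript:
    transcript_id: str
    chromosome: str
    start: int  # assumed 1-based, inclusive
    end: int    # assumed 1-based, inclusive
    strand: str


def search_region(transcripts: Iterable[Transcript],
                  chromosome: str, start: int, end: int) -> List[Transcript]:
    """Return transcripts on `chromosome` that overlap the interval [start, end]."""
    return [
        t for t in transcripts
        if t.chromosome == chromosome and t.start <= end and t.end >= start
    ]


# Made-up coordinates, for illustration only.
catalog = [
    Transcript("tx1", "chr1", 100, 500, "+"),
    Transcript("tx2", "chr1", 900, 1200, "-"),
    Transcript("tx3", "chr2", 100, 500, "+"),
]
hits = search_region(catalog, "chr1", 400, 950)
```

Because the test only needs the chromosome and the two endpoints, it works for transcripts that have no official name, which is the use case the region search is meant to serve.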
All data can be downloaded from the download page in txt or Microsoft Excel format. We also expose an SQL interface so that users can develop their own programs to access our database.
Researchers can submit new subcellular localization information to lncSLdb online. More details can be found on the submission and help pages.

Discussion and future prospects
Increasing evidence shows that lncRNAs play important roles in cell activities, but we still know little about their basic properties, such as subcellular localization. The study of protein subcellular localization has helped researchers understand the function of proteins. We hope that work on lncRNA subcellular localization can provide another view for explaining their function and biogenesis (11). Although several databases containing lncRNA subcellular localization have been developed (17)(18)(19), they cover only a small fraction of the available lncRNAs in different species. Here, we developed lncSLdb, an lncRNA subcellular localization database that collects qualitative and quantitative localization information for >10 000 lncRNAs from published articles covering three species, classified into three basic subcellular localization types and three subtypes. To our knowledge, this is the most complete database for lncRNA subcellular localization to date. We hope that lncSLdb can provide researchers with an integrated platform for studying the basic properties and subcellular localization of lncRNAs, and further for determining whether lncRNAs share the same or a similar export mechanism with mRNAs, as well as other potential molecular roles. We are interested in mining the features of transcripts in different cellular compartments and predicting the distribution of lncRNAs across cell compartments. We will continue to update and improve lncSLdb in the future.