LncSEA: a platform for long non-coding RNA related sets and enrichment analysis

Abstract Long non-coding RNAs (lncRNAs) have been proven to play important roles in transcriptional processes and various biological functions. Establishing a comprehensive collection of human lncRNA sets is urgent work at present. Using reference lncRNA sets, enrichment analyses will be useful for analyzing lncRNA lists of interest submitted by users. Therefore, we developed a human lncRNA sets database, called LncSEA, which aimed to document a large number of available resources for human lncRNA sets and provide annotation and enrichment analyses for lncRNAs. LncSEA supports >40 000 lncRNA reference sets across 18 categories and 66 sub-categories, and covers over 50 000 lncRNAs. We not only collected lncRNA sets based on downstream regulatory data sources, but also identified a large number of lncRNA sets regulated by upstream transcription factors (TFs) and DNA regulatory elements by integrating TF ChIP-seq, DNase-seq, ATAC-seq and H3K27ac ChIP-seq data. Importantly, LncSEA provides annotation and enrichment analyses of lncRNA sets associated with upstream regulators and downstream targets. In summary, LncSEA is a powerful platform that provides a variety of types of lncRNA sets for users, and supports lncRNA annotations and enrichment analyses. The LncSEA database is freely accessible at http://bio.liclab.net/LncSEA/index.php.


INTRODUCTION
Long noncoding RNAs (lncRNAs) play key roles in biological processes and can even be used as novel biomarkers (1-4). Mutations in, and methylation of, lncRNAs may also affect lncRNA expression levels, leading to diseases such as cancer (5,6). In recent years, some lncRNAs defining cellular identity were discovered by biological experiments and single cell sequencing techniques (7,8). Many studies showed that the functions of lncRNAs are closely related to their location inside or outside the cell. For example, exosomal lncRNA H19 could promote hepatic stellate cell activation and cholestatic liver fibrosis (9). A large number of studies showed that lncRNAs perform a variety of regulatory functions for downstream genes. Ulitsky et al. demonstrated that lncRNA H19 functions as a competing endogenous RNA by binding miR-17-5p family members in HeLa cells and myoblasts (10). LncRNAs also bind to proteins and localize protein complexes to specific DNA sequences, which affects gene expression and the development of disease. For example, the FOXN3-NEAT1-SIN3A repressor complex promotes the progression of hormonally-responsive breast cancer (11). A large number of recent studies focused on transcripts that are annotated as lncRNAs but encode small proteins (12,13). Furthermore, the genomes of many species are transcribed pervasively, producing many lncRNAs with unknown functions. Increasing evidence suggests that lncRNAs can be regulated by upstream transcriptional regulators, including transcription factors (TFs) and DNA regulatory elements such as promoters, enhancers, super enhancers (SEs), and accessible chromatin regions (14-17). For example, TP63 binding to the SE regions of lncRNA LINC01503 led to LINC01503 overexpression in squamous cell cancer (18).
D970 Nucleic Acids Research, 2021, Vol. 49, Database issue
Many lncRNA databases and tools have been built. For example, NONCODE (19), LNCipedia (20), and RNAdb focus on providing basic annotation information for lncRNAs. LncRNADisease (21) and Lnc2Cancer (22) collect the details of relationships between lncRNAs and diseases. LncRNASNP (23), lnc2Meth (24) and LncVar (25) support lncRNAs interacting with other functional elements. StarBase (26) and LncBase (27) provide information on lncRNA targets. Such databases serve as valuable resources for the study of lncRNAs. However, they provide incomplete lists of lncRNAs, rather than a comprehensive, taxonomic collection of lncRNA sets for users. Moreover, those tools also lack information about the upstream transcriptional regulation of lncRNAs. With studies of human disease and biological processes, a large number of functional lncRNA sets have been generated from high-throughput or low-throughput experiments. The development of a comprehensive collection of human lncRNA sets is urgent work at present. Importantly, based on such reference lncRNA sets, enrichment analyses will be useful for analyzing lncRNA lists of interest submitted by users.
To infer lncRNA functions, some web servers and tools were developed, such as Co-LncRNA (28), Lnc-GFP (29) and FARNA (30); however, such tools analyze lncRNA functions using RNA-seq data and the co-expression relationships between mRNAs and lncRNAs. Most tools provide only functional annotations for a single lncRNA and fail to support enrichment analyses for a lncRNA set. A web server, LnCompare (31), can be used to analyze lncRNA set features with six categories of >100 attributes. However, its limited number of categories may restrict inferences of lncRNA function. Therefore, it is highly desirable to construct a comprehensive resource for lncRNA sets and provide lncRNA set annotation and enrichment analyses.
Here, we developed a human lncRNA sets database (LncSEA, http://bio.liclab.net/LncSEA/index.php), which focuses on accommodating various available resources of human lncRNAs and performs annotation and enrichment analyses of lncRNA lists submitted by users. LncSEA supports >40 000 reference lncRNA sets across 18 categories (miRNA, drug, disease, methylation pattern, cancer specific phenotype, lncRNA binding protein, cancer hallmark, subcellular localization, survival, lncRNA-eQTL, cell marker, enhancer, super-enhancer, transcription factor, accessible chromatin, smORF, exosome and conservation) and 66 sub-categories, which include over 50 000 lncRNAs. We collected lncRNA sets from >20 lncRNA-associated databases that generated lncRNA sets based on downstream regulatory data sources. Furthermore, by integrating TF ChIP-seq, DNase-seq, ATAC-seq and H3K27ac ChIP-seq data from hundreds of human cell types, we identified a large number of lncRNA sets regulated by upstream TFs and DNA regulatory elements. More importantly, LncSEA provides annotation and enrichment analyses of lncRNA sets. Moreover, lncRNA set enrichment analyses associated with upstream regulators and downstream targets of lncRNAs can be performed simultaneously by choosing the categories for upstream and downstream reference sets. Finally, the differences and advantages of LncSEA compared to other existing databases or web tools, in terms of data and functionality, are shown in Supplementary Table S1 and Supplementary Material 1. In summary, LncSEA is a powerful platform that provides a variety of types of lncRNA sets for users, and performs annotation and enrichment analyses of lncRNA sets submitted by users.

Collection of reference sets of lncRNAs
LncSEA contains comprehensive collections of lncRNA sets (Figure 1). The current version of LncSEA contains >40 000 reference sets, covering 18 categories and 66 sub-categories (Table 1). The reference sets come from two types of sources: sets identified from high-throughput experimental data and sets collected from >20 known databases. Seven of the 18 categories contain lncRNA sets from literature searches. For example, most of the lncRNAs in our reference sets of the 'Disease' and 'Drug' categories were confirmed by a large number of studies involving biological experiments. All of the lncRNA sets for the 'Cell marker', 'Subcellular localization', 'Cancer hallmark' and 'Exosome' categories are composed of a list of lncRNAs selected purely manually in one or more studies.
LncRNAs are regulated by different regulatory elements and TFs, which bind to their regulatory regions. Due to data resource and technology constraints, few databases provide upstream regulatory information for lncRNAs. We constructed four categories of lncRNAs with upstream regulatory information involving 'Enhancer', 'Super Enhancer', 'Accessible Chromatin' and 'Transcription Factor' by collecting and processing large volumes of ChIP-seq/DNase-seq/ATAC-seq data (  (47) and GGR (Genomics of Gene Regulation Project) (46) (sample information for these datasets is given in Supplementary Table S2). To perform normalization and ensure consistency across different data sources, we used the streamlined pipeline of Bowtie-MACS-ROSE, which was developed by Loven et al. (48). Raw sequencing reads were aligned to the hg19 reference genome using Bowtie (49,50), peaks were called using MACS (51), and SE regions were annotated using ROSE (48) software. More than 330 000 SE regions from 542 cells/tissues were obtained. Detailed super enhancer annotation information and analyses can be viewed in the SEdb (52) database and SEanalysis (53) web server developed by our team. Based on these enhancers and SEs, we identified the lncRNAs regulated by cell-type-specific enhancers and SEs using the ROSE software GeneMapper program (48). Three different positional relationships between enhancers and lncRNAs were supported: 'overlap', 'proximal' and 'closest'. The enhancer-associated lncRNAs were classified into 'overlap' when the enhancer region overlapped by at least one base with the corresponding lncRNA. LncRNAs were classified into the 'proximal' sub-category when the distance between the enhancers and lncRNAs was within 50 kb, and they were classified into the 'closest' sub-category when the lncRNA was the closest gene and the distance was within 1000 kb.
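The three positional relationships described above can be illustrated with a minimal Python sketch. This is not the ROSE GeneMapper code; it only restates the stated thresholds. The `Interval` type, the `classify` function, the half-open coordinate convention, the use of the gap between interval edges as the distance, and the decision to treat 'overlap' and 'proximal' as mutually exclusive are all assumptions made for illustration. The sketch also picks the closest locus only among the loci passed in, whereas the text refers to the closest gene overall.

```python
from dataclasses import dataclass

PROXIMAL_MAX_BP = 50_000     # "within 50 kb" in the text
CLOSEST_MAX_BP = 1_000_000   # "within 1000 kb" in the text


@dataclass(frozen=True)
class Interval:
    """Hypothetical genomic interval: 0-based start, exclusive end."""
    chrom: str
    start: int
    end: int


def overlaps(a, b):
    """True when the two intervals share at least one base."""
    return a.chrom == b.chrom and max(a.start, b.start) < min(a.end, b.end)


def gap(a, b):
    """Distance in bp between interval edges (0 if touching or overlapping).

    Returns None when the intervals lie on different chromosomes.
    """
    if a.chrom != b.chrom:
        return None
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def classify(enhancer, loci):
    """Map each lncRNA name to its set of positional labels for one enhancer."""
    labels = {}
    distances = {}
    for name, locus in loci.items():
        d = gap(enhancer, locus)
        if d is None:
            continue  # different chromosome: no relationship
        distances[name] = d
        if overlaps(enhancer, locus):
            labels.setdefault(name, set()).add("overlap")
        elif d <= PROXIMAL_MAX_BP:
            labels.setdefault(name, set()).add("proximal")
    if distances:
        # Assumption: the nearest of the supplied loci stands in for
        # "the closest gene"; ties go to the first locus encountered.
        nearest = min(distances, key=distances.get)
        if distances[nearest] <= CLOSEST_MAX_BP:
            labels.setdefault(nearest, set()).add("closest")
    return labels


# Toy coordinates, not real annotations.
enhancer = Interval("chr1", 1_000, 2_000)
loci = {
    "lncA": Interval("chr1", 1_500, 3_000),
    "lncB": Interval("chr1", 30_000, 31_000),
    "lncC": Interval("chr1", 500_000, 501_000),
    "lncD": Interval("chr2", 1_000, 2_000),
}
result = classify(enhancer, loci)
```

With these toy coordinates, `lncA` overlaps the enhancer and is also the nearest locus, `lncB` lies 28 kb away and is labeled proximal, and `lncC` and `lncD` receive no label.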
We constructed multiple sets for the closest active lncRNAs, with SEs identified by the CRC Mapper program (54) in specific cell types.
Accessible chromatin category. DNase-seq and ATAC-seq (46,55) technologies can be used to identify chromatin accessibility regions (sample information for these datasets is given in Supplementary Table S2). We collected the chromatin accessibility regions from DNase-seq data including 292 sample types from ENCODE, Roadmap and Cistrome (55). For ATAC-seq data, we collected the genomic regions of 105 sample types from Cistrome and NCBI, and 386 samples from 23 cancer types from TCGA (56) (https://tcga-data.nci.nih.gov/tcga). We used the liftOver tool in UCSC (57) to convert the genomic locations of those regions to the hg19 assembly. The GeneMapper program in ROSE software (48) was also used to predict the chromatin accessibility regions associated with lncRNAs using the proximity rules, closest, overlapping, and proximal (  (59) and GTRD (60). The peaks overlapping with transcriptional regulatory regions, including enhancers, promoters, and the chromatin accessibility regions of lncRNAs, were further identified using BEDTools (default parameter: at least one base overlapping) (61). Then, the relationships between TFs and lncRNAs were built via multiple kinds of lncRNA-related regulatory regions, such as promoter and enhancer regions bound by TFs. Finally, for each TF, we established lncRNA sets with cell/tissue-specific regulatory information.
Survival category. Survival-associated lncRNAs were predicted by downloading and analyzing lncRNA expression data and clinical data. Univariate Cox regression analysis (62) was used to screen for lncRNAs related to prognosis. For each cancer type in the TCGA project, we defined the survival-related lncRNAs as a set. Cox regression coefficients, P-values, and log rank test P-values are displayed on detailed set pages in our database for user screening and reference purposes. Our survival sets inform and guide the study of prognosis and lncRNA expression in cancer patients.
Other categories. We collected data for multiple categories of lncRNA sets surrounding human diseases and cancers, including cancer hallmarks, diseases, and drug target information from public databases (

Classification of all reference sets in LncSEA
We developed some rules for lncRNA classification, including: (i) directly classifying them into sub-categories based on the data source. For example, we included their relationships to disease from the Lnc2Cancer, LncRNADisease, MNDR, and EVLncRNAs databases. Based on the classification rules, we sorted all of the collected lncRNA sets (Table 1).

Introduction of additional data sources
LncSEA provides additional information that helps users study lncRNA functions in depth. We obtained references for lncRNAs such as clinical information, biological function, and experimentally-supported mechanisms from the Lnc2Cancer2.0 database. The relationships between lncRNA, mRNA, and miRNA were obtained from an excellent ceRNA database, LncACTdb2.0 (63). Multiple lncRNA names, including gene symbol, Ensembl ID, NCBI refseq ID, alias, and Entrez ID were obtained from org.Hs.eg.db (Release 3.11). Chromatin location information for lncRNAs was downloaded from GENCODE (hg19). Gene expression matrices with FPKM values for breast invasive carcinoma and prostate adenocarcinoma were obtained as test data from the TCGA project. The differentially expressed lncRNAs (adjusted P < 0.05) of both cancers were obtained from the circlncRNAnet database. We downloaded lncRNA expression profiles with FPKM values from the TCGA project and expression profiles with TPM values from the GTEx, CCLE and ENCODE databases, and normalized them as log2(value + 1).
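The log2(value + 1) normalization mentioned above can be sketched in a few lines of Python. The function name `log2_plus_one`, the dictionary layout and the example values are hypothetical and are not taken from the LncSEA code.

```python
import math


def log2_plus_one(matrix):
    """Apply log2(value + 1) to every expression value.

    `matrix` maps an lncRNA name to a list of FPKM or TPM values,
    one per sample. Adding 1 keeps zero expression at zero.
    """
    return {
        name: [math.log2(value + 1) for value in values]
        for name, values in matrix.items()
    }


# Toy values chosen so that value + 1 is a power of two.
normalized = log2_plus_one({"HOTAIR": [0, 1, 3], "H19": [7, 15, 0]})
```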

LncRNA set enrichment and similarity analysis
LncRNA set enrichment analyses can be performed on >40 000 reference sets, which are divided into 18 categories and 66 sub-categories. The annotation and enrichment analyses based on these categories and reference sets in LncSEA cover >50 000 lncRNAs. Users can submit a list of lncRNAs and select multiple categories and sub-categories of lncRNA sets according to their preferences. LncSEA annotates the lncRNAs submitted by users to the reference sets, and calculates the statistical significance of the enrichment using the hypergeometric test (64). The enrichment significance P-value for a reference set is calculated as:
$$P = 1 - \sum_{j=0}^{i-1} \frac{\binom{k}{j}\binom{n-k}{s-j}}{\binom{n}{s}} \qquad (1)$$
Here, the background (LncSEA or GENCODE) contains a total of n lncRNAs, of which k are members of the reference set under investigation, and the query list of lncRNAs of interest contains a total of s lncRNAs, of which i belong to the same reference set. Users can adjust the number of lncRNAs required to be enriched, and set thresholds for P-values, false discovery rates (FDR), and Bonferroni-corrected P-values to control the stringency of the analysis.
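As a rough illustration of the test described above, the following Python sketch computes the upper-tail hypergeometric probability of observing at least i query lncRNAs in a reference set, using the symbols n, k, s and i from the text. It is not the server-side R script used by LncSEA. The function names are hypothetical, and the text does not say which FDR procedure is used, so the Benjamini-Hochberg procedure here is an assumption.

```python
from math import comb


def hypergeom_pvalue(n, k, s, i):
    """P(X >= i) for X ~ Hypergeometric(population n, k marked, s drawn)."""
    if not (0 <= k <= n and 0 <= s <= n):
        raise ValueError("require 0 <= k <= n and 0 <= s <= n")
    lowest = max(i, 0)
    highest = min(k, s)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible overlaps contribute nothing to the sum.
    tail = sum(comb(k, j) * comb(n - k, s - j) for j in range(lowest, highest + 1))
    return tail / comb(n, s)


def bonferroni(pvalues):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values (assumed FDR procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda idx: pvalues[idx])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy example: background of 10 lncRNAs, reference set of 4,
# query list of 3, of which 2 fall in the reference set.
p = hypergeom_pvalue(n=10, k=4, s=3, i=2)
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

The exact integer arithmetic of `math.comb` avoids rounding problems, but it becomes slow for backgrounds of tens of thousands of lncRNAs; a production implementation would normally rely on a statistics library instead.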
To evaluate the similarity between a query lncRNA set, A, and a reference lncRNA set, B, we applied two classical measures for computing set similarity. The first one was the Jaccard score, which represents a proportion of the intersection elements of the two sets, A and B, in the union set of A and B. The second measure was the Simpson score, which represents a proportion of intersection elements of the two sets, A and B, in the minimum set of A and B.
The Jaccard score was calculated as:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \qquad (2)$$
The Simpson score was calculated as:
$$S(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)} \qquad (3)$$
These two scores can provide additional information for enrichment analyses, and also allow users the choice of more parameters to deepen their understanding of the analytical results.
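The two similarity measures defined above translate directly into set operations. The Python sketch below follows the verbal definitions (intersection over union, and intersection over the smaller set). The function names are hypothetical, and returning 0.0 when a set is empty is a convention chosen here, not something stated in the text.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def simpson(a, b):
    """Size of the intersection divided by the size of the smaller set."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


# Toy sets: two shared lncRNAs, five distinct lncRNAs overall.
query = {"HOTAIR", "H19", "NEAT1"}
reference = {"H19", "NEAT1", "MALAT1", "XIST"}
j_score = jaccard(query, reference)
s_score = simpson(query, reference)
```

In this toy case the Jaccard score is 2/5 and the Simpson score is 2/3. The Simpson score is never smaller than the Jaccard score, so it is the more forgiving measure when a short query list is compared against a large reference set.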

Similarity calculation between reference lncRNA sets
To provide users with a deep understanding of the reference lncRNA sets, we provided an analysis module to calculate the similarity between reference sets. The similarity score between any two sets in the whole reference collection can be quickly computed by the two measures (Formulae (2) and (3)) in our database. Users can discover potential associations between two reference sets within the same category or across different categories by browsing the details of each set. Users can not only find directly-related sets by querying a lncRNA list of interest, but can also calculate similarities to other sets to identify indirectly-related sets for lncRNAs. In addition, users can identify relationships between two categories by calculating the similarity scores between sets, which will contribute to the exploration and study of unknown lncRNA functions.

SYSTEM DESIGN AND IMPLEMENTATION
The current version of LncSEA was organized using MySQL 5.7.17 (http://www.mysql.com) and operates on a Linux-based Aliyun Web server. The website was developed based on PHP5.4.45.0 (http://www.php.net), CSS3, and HTML5 frameworks. The LncSEA web interface was designed and built using Bootstrap v3.3.7 (https://v3.bootcss.com) and JQuery v2.1.1 (http://jquery.com). Additionally, we used server-side R scripts for lncRNA set enrichment analysis. Our platform is convenient to access and use, as it does not require users to register or log in. For the best display, we recommend using a modern web browser that supports HTML5, such as Firefox or Google Chrome. The LncSEA database is freely available to the research community at http://bio.liclab.net/LncSEA/index.php. The R script for the enrichment analysis and the PHP program are provided on GitHub (https://github.com/lxy-boy/LncSEA-Code).

Overview of LncSEA database
The main elements of LncSEA, including the collection of lncRNA sets and the user interface, are shown in Figure 1. The current version of LncSEA contains >40 000 reference sets across 18 categories and 66 sub-categories. The 'Transcription factor'-, 'lncRNA binding protein'-, and 'super-enhancer'-related collections were the three largest by number of sets (Figure 1). LncSEA provides a user-friendly interface to query, browse, and download detailed information about all of the reference lncRNA sets (Figure 1). In particular, LncSEA provides enrichment analyses of lncRNA sets.

Effective online tool for lncRNA set enrichment analysis
LncSEA provides lncRNA set enrichment analyses for users. The lncRNA set enrichment analyses that are associated with upstream regulators and downstream targets of lncRNAs can be performed simultaneously. To perform the enrichment analysis, users must input an lncRNA list of interest or a text file containing lncRNAs of interest and select the categories and sub-categories of the reference sets, as well as the parameters and background sets (LncSEA or GENCODE) (Figure 2A). Then, LncSEA will annotate lncRNAs to the reference lncRNA sets, calculate the statistical significance of the enrichment using the hypergeometric test, and compute the similarity scores. Once the analysis is running, the site displays a progress bar as a percentage to indicate the remaining analysis time. All of the relevant collection categories are shown on the left panel of the return page, and users can view each category according to their own interests. The right panel displays the enrichment analysis results. Users can select the top blue buttons to download the results tables, plot a bubble chart of the enrichment analysis, or generate a bar chart. The columns of the table give the name of the lncRNA set, category, sub-category, number of annotated lncRNAs, proportion, Jaccard score, Simpson score, enrichment analysis P-value and adjusted P-value. Users can click on the 'set' hyperlink to view the set details, as well as similar sets. Users can also obtain lncRNA names annotated to the set by clicking the 'count' hyperlink. All significant reference lncRNA sets and visualization results for the enrichment analysis are provided for review and download (Figure 2B).

Search interface for conveniently retrieving lncRNA sets
Users can search for lncRNAs and their related categories using three approaches: (i) multiple lncRNA features, including gene symbol, Ensembl ID, NCBI refseq ID, alias, and Entrez ID, (ii) genomic region and (iii) genomic sequence (Figure 2C). If users search via a genome sequence, the sequence alignment from the basic local alignment search tool (BLAST) (65) is also available to download. The results page of the query returns the lncRNA basic information, including gene symbol, Ensembl ID, genomic region, and lncRNA type. The results page also returns all sets related to the query lncRNAs and users can view them by clicking each category. If users only wish to browse one lncRNA in detail, they can select the lncRNA name using the hyperlink. The detailed information associated with the lncRNA will be displayed, such as references to the lncRNA, the sets associated with the lncRNA, set statistics, and lncRNA expression in different samples of the GTEx project (66), the TCGA project (https://tcga-data.nci.nih.gov/tcga; normal and cancer samples), the ENCODE project, and the CCLE (67) project (Figure 2E and F). Users can view references for each lncRNA, which include clinical information, biological function, and experimentally-supported mechanisms, to rapidly understand the related functions of lncRNAs. For a lncRNA-associated set, users can obtain the set names associated with the lncRNA, the category and sub-category to which the set belongs, and the number of lncRNAs in the set by selecting the category. All relevant collections and evidence for the current lncRNA are presented in different modules according to the category on the results page. For example, the 'Transcription Factor' module shows the specific genomic regions through which three transcription factors, CTCF, EZH2, and ZNF639, regulate the promoter regions of lncRNA HOTAIR in K562 cells.
Users can select the set hyperlink to review the set details and select samples from the drop-down menu above the table to view TF regulatory information (Figure 2G). To facilitate further study of the function and mechanism of lncRNAs, lncRNA-associated ceRNA networks are also displayed in the module at the bottom of the results page. LncSEA provides two types of networks: those based on experimental validations and those based on predictions from TCGA cancer datasets. The three types of nodes in the network (lncRNA, miRNA, and protein coding mRNA) are represented by three colors. Users can also drag the edges and nodes to adjust the layout of the ceRNA network. Additionally, users can download images and tables for the ceRNA network (Figure 2H).

User-friendly interface for browsing lncRNA sets
The 'Browse' page is organized as an interactive table that allows users to quickly search for lncRNA sets and customize filters according to 'Class' and 'Sub class'. Users can click the 'Show entries' drop-down menu to change the number of displayed records per page. To view the details of a given lncRNA set, users only need to click on the 'Set' option. The details of the selected lncRNA set include the categories to which the set belongs, the list of lncRNA names in the set, and the evidence supporting the relationships between the set and each lncRNA. For example, when users select the 'Transcription Factor' class set and the 'enhancer' sub-class, the right side of the interface will show the corresponding set. Each column in the table shown on the right side represents the set name, the class attached to the set, the sub-class attached to the set, and the number of lncRNAs contained in the set (Figure 2H and I).

ID conversion
LncSEA also supports a user-friendly 'ID conversion' function (Figure 2D). Users can paste an lncRNA list or upload a space-separated file containing multiple lncRNA names, including gene symbol, Ensembl ID, NCBI refseq ID, alias, and Entrez ID. When selecting the 'Convert' option, users can obtain the converted results table. Users can not only download the results table, but can also check the 'Analysis' option to connect to the enrichment analysis page for those lncRNAs.

Data download
The 'Download' page was organized as an interactive table. All reference sets of lncRNAs have been arranged and sorted into separate files for download in our database. We also provide two types of file formats for download, including .gmt and .txt. Users can download the reference collection as valuable supplementary data for in-depth experimental research.
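For readers who want to reuse downloaded reference sets in their own scripts, the following Python sketch reads and writes the .gmt layout. It assumes that the LncSEA .gmt files follow the common gene-set convention of one set per line with tab-separated fields (set name, description, then the members); the source does not describe the file layout, so this should be checked against an actual download. The function names and the example set names are hypothetical.

```python
def parse_gmt(text):
    """Parse GMT-formatted text into {set name: (description, members)}."""
    sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, description, *members = line.split("\t")
        # Drop empty fields left by trailing tabs.
        sets[name] = (description, [m for m in members if m])
    return sets


def format_gmt(sets):
    """Serialize {set name: (description, members)} back to GMT text."""
    lines = [
        "\t".join([name, description, *members])
        for name, (description, members) in sets.items()
    ]
    return "\n".join(lines) + "\n" if lines else ""


# Toy content, not taken from the LncSEA download page.
example = (
    "Breast cancer\tDisease\tHOTAIR\tH19\n"
    "K562_CTCF\tTranscription Factor\tNEAT1\n"
)
reference_sets = parse_gmt(example)
```

Each parsed member list can be converted to a Python set and compared against a query list, for example to recompute overlap counts offline.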

A case study using differential cancer lncRNAs
To find lncRNAs as therapeutic and drug targets for breast cancer, numerous studies have focused on identifying differentially expressed lncRNAs. To further explore the function of differential lncRNAs, enrichment analyses of such lncRNAs are necessary. Thus, we used LncSEA to perform functional analyses on the genes differentially expressed in breast cancer. Firstly, we obtained those lncRNAs (log2 FC > 1, adjusted P < 0.05) of breast invasive cancer from the TCGA project and circlncRNAnet (68) database as inputs for LncSEA. Next, we set the hypergeometric test P-value threshold to 0.01 and the adjusted P-value threshold to 0.05, and selected the 'RUN' button to perform the enrichment analysis. A total of 18 categories for the sets, including 'transcription factors', 'Disease', 'Drug', 'Enhancer', 'eQTL' and 'Cancer Phenotype' were returned on the left panel of the interface (Figure 3A). The detailed gene annotation and enrichment analysis results are shown in Supplementary Table S4. We found from the analytical results that these differential lncRNAs were closely related to cancer and therapeutic drugs. For example, when clicking on the 'Disease' set class, we found that the lncRNAs were significantly enriched to two 'Breast cancer' reference sets that belonged to the MNDR2.0, Lnc2Cancer2.0, EVLncRNAs and LncRNADisease2.0 sub-classes. There were 39 lncRNAs annotated to the breast cancer sets, such as the star molecule HOTAIR, which was reported as a cancer biomarker and therapeutic target (69), and a tumor-suppressor DNA boundary element (70) (Supplementary Table S4). The bubble, bar graphs and Venn diagram of the enrichment analysis results were also provided (Figure 3B). By selecting 'Breast cancer', all of the evidence for correlations between lncRNAs and 'Breast cancer' was listed in the tables on each page. For example, lncRNA HOTAIR was proven by qPCR and knockdown experiments (71) to be associated with breast cancer (Figure 3C).
To further study the subtypes of cancer, LncSEA provides lncRNA enrichment analysis for cancer phenotypes. We found that most of the lncRNAs were significantly enriched to the 'Breast cancer ER + VS Breast cancer Normal TNBC' set (Simpson score = 0.589, P < 0.01) and the 'Breast cancer TNBC VS Breast cancer Normal TNBC' set (Simpson score = 0.541, P < 0.01) (Supplementary Table S4). Researchers can study the classification of breast cancers by comparing the biological functions of these two groups of phenotype-specific lncRNAs.
The identification of novel drug targets and the development of new candidate drugs are of great significance for the targeted treatment of cancer. In the results for the 'Drug' category, we found that these lncRNAs were significantly enriched to the anti-breast cancer drugs Topotecan and Panobinostat (72,73). Interestingly, TKI258, the drug ranked third by significance in the enrichment analysis, was recently reported to suppress downstream signaling by RAS-RAF-MAPK and PI3K-AKT molecules, which are involved in cell proliferation, cell survival, and tumor invasion (74). This result suggested that some of the up-regulated lncRNAs in cancer samples might be used as TKI258 targets. Some studies showed that mutations in lncRNAs may lead to changes in lncRNA expression levels. In the 'eQTL' category, the results showed that the differentially expressed lncRNAs were significantly enriched in breast cancer samples (P < 0.01, Simpson score = 0.337).
Although the analyses above showed that differentially expressed genes were highly related to breast cancer, further explanation of the biological mechanisms leading to cancer is even more important. We speculated that most differentially expressed lncRNAs were regulated by upstream transcriptional regulators, which affected expression levels and led to breast cancer. It is worth noting that some of the results are consistent with our hypothesis (Figure 3D). The enrichment results showed that most of the lncRNAs were regulated by accessible chromatin, enhancers, and SE regions in breast cancer tissues and samples. Further details of the regulatory regions are available at the LncSEA website (Supplementary Table S4). In addition, these genes are significantly enriched in some core transcription factors in the MCF-7 cell line, such as FOXA1, ESR1, and GATA3, which are important players in transcriptional regulatory networks in breast cancer (75). Several novel TFs with high enrichment scores, such as KDM5B, have not been widely studied in breast cancer, suggesting that they might be potential novel genes associated with breast cancer. The results of enrichment analyses for more categories can be observed in Supplementary Table S4.
To verify the function and accuracy of LncSEA, we compared the differentially expressed group with three other groups in the enrichment analyses. To maintain consistency across all independent variables, we selected test sets of the same size as the differentially expressed lncRNA set (log2 FC > 1; P < 0.05; 2306 lncRNAs) from the breast cancer expression profile. First, we sorted the lncRNAs according to the average expression in all samples. Then, we used the high expression group, the low expression group and groups drawn at random 100 times as inputs for the enrichment analyses. By counting the number of enrichment categories and sets, we found that the four groups had significantly different enrichment results (Supplementary Figure S1). The enrichment results for the high expression group and differentially expressed group were similar. However, the differentially expressed group was enriched with more upstream regulatory elements and factor sets than the high expression group. This result indicated that some lncRNAs were specifically and highly expressed in cancer as they were regulated by upstream regulatory elements and factors during the transcription process, and those lncRNAs were more likely to be drug targets. Consistent with our expectations, the high expression group was more related to disease, cancer phenotype and RNA binding protein. In contrast, the low expression and random groups were both enriched to a few categories and sets not related to breast cancer (Supplementary Figure S1). We also performed the same tests for prostate cancer and obtained similar results (Supplementary Figure S1). Collectively, we provided a specific case study with enrichment analyses and random tests in two different cancers. The results demonstrate

DISCUSSION
The emerging importance of lncRNAs in human diseases and biological processes, coupled with their upstream regulators and downstream target genes, increases the need for comprehensive human lncRNA reference sets. Therefore, we constructed a human lncRNA database, called LncSEA. Compared with all existing lncRNA databases, LncSEA focuses on building comprehensive human lncRNA sets, and has collected the largest number of human lncRNA sets to date (Supplementary Table S1). LncSEA supports >40 000 reference lncRNA sets, and over 50 000 lncRNAs are annotated to at least one lncRNA set in LncSEA. These lncRNA sets include not only sets associated with downstream regulatory data, but also a large number of sets regulated by upstream TFs and DNA regulatory elements, identified by integrating TF ChIP-seq, DNase-seq, ATAC-seq, and H3K27ac ChIP-seq data. Importantly, based on those reference sets, LncSEA provides lncRNA set annotation and enrichment analysis. Although many databases and tools, such as DAVID (76), GSEA (77), MIEAA (78), TAM2.0 (79) and ESEA (80) provide enrichment analysis for gene sets, they mainly focus on the analysis of coding genes, miRNAs and pathways, rather than lncRNA sets. LncSEA provides annotation and enrichment analysis of lncRNA sets, as well as their associated upstream regulators and downstream targets. LncSEA supports a user-friendly interface to analyze, query, browse and download detailed information on lncRNA sets. The main advantages of the database are listed below: (I) LncSEA provides comprehensive lncRNA reference sets with classifications of lncRNA sets. There are >40 000 reference lncRNA sets classified into 18 categories and 66 sub-categories. (II) LncSEA supports enrichment analyses for lncRNA sets of interest. In particular, users can perform enrichment analyses of lncRNAs of interest associated with upstream regulators and downstream targets to infer their functions.
(III) LncSEA supports the visualization and download of enrichment analysis results. (IV) Users can quickly search related sets by using different lncRNA names. (V) Users can quickly search related sets based on genomic region or sequence. (VI) Users can browse each reference lncRNA set; LncSEA provides a catalogue, including categories and sub-categories, to browse lncRNA sets. (VII) LncSEA provides similarity score analyses between any two reference lncRNA sets. (VIII) LncSEA provides an ID conversion function. (IX) LncSEA supports user-friendly displays and allows the download of reference lncRNA sets with interactive tables.
Our effort to establish this platform was prompted by the need of researchers to perform functional analyses of lncRNA sets. Such researchers include geneticists, cell/molecular biologists, and bioinformaticians. Moreover, the field of lncRNA is progressing faster than ever, and the enrichment analysis of a lncRNA set is an indispensable research strategy. LncSEA is a comprehensive resource for human lncRNA sets and is an analysis platform to enhance our understanding of lncRNA functions. The current version of LncSEA stores the most abundant human lncRNA sets and we will manually curate additional lncRNA sets in the future. There are some excellent network-based algorithms and software (81,82) for predicting the relationships between lncRNAs and pathways; because of the complexity of such relationships, we will consider adding such data in the next version of LncSEA. Continuous efforts will be made to update the platform with the available data and improve the functionality of the LncSEA database.

CONCLUSIONS AND EXPECTATIONS
The current version of LncSEA covers 18 categories, including >40 000 human lncRNA reference sets. LncSEA is the first database providing a comprehensive collection of lncRNA sets, and is capable of performing enrichment analyses for regulators upstream of lncRNAs and targets downstream of them. With the development of new technologies and the accumulation of experimental data, an increasing amount of lncRNA-related information will be generated. In the future, LncSEA will add more categories of lncRNAs and additional functional information by tracking developments in biology. We will also include additional experimental sets to extend our data sources, and support more powerful enrichment analysis tools. In addition, we will strive to expand the number of species and collections, and provide users with more efficient enrichment analysis methods in the next version of LncSEA.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.