Cancer LncRNA Census 2 (CLC2): an enhanced resource reveals clinical features of cancer lncRNAs

Abstract Long non-coding RNAs (lncRNAs) play key roles in cancer and are at the vanguard of precision therapeutic development. These efforts depend on large and high-confidence collections of cancer lncRNAs. Here, we present the Cancer LncRNA Census 2 (CLC2). With 492 cancer lncRNAs, CLC2 is 4-fold greater in size than its predecessor, without compromising on strict criteria of confident functional/genetic roles and inclusion in the GENCODE annotation scheme. This increase was enabled by leveraging high-throughput transposon insertional mutagenesis screening data, yielding 92 novel cancer lncRNAs. CLC2 makes a valuable addition to existing collections: it is amongst the largest, contains numerous unique genes (not found in other databases) and carries functional labels (oncogene/tumour suppressor). Analysis of this dataset reveals that cancer lncRNAs are impacted by germline variants, somatic mutations and changes in expression consistent with inferred disease functions. Furthermore, we show how clinical/genomic features can be used to vet prospective gene sets from high-throughput sources. The combination of size and quality makes CLC2 a foundation for precision medicine, demonstrating cancer lncRNAs’ evolutionary and clinical significance.


INTRODUCTION
Tumours arise and grow via genetic and non-genetic changes that give rise to widespread alterations in gene expression programmes (1)(2)(3). The numerous dysregulated genes may encode classical protein-coding mRNAs or nonprotein coding RNAs, but it is likely that just a subset of these actually functionally contribute to pathogenic cellular hallmarks. The identification of such functional cancer genes is critical for the development of targeted cancer therapies, as well as emerging methods to identify additional cancer genes. For protein-coding genes (pc-genes), datasets such as the Cancer Gene Census (CGC) collect and organize comprehensive gene collections according to defined criteria, and has proven invaluable for scientific research and drug discovery (4).
The past decade has witnessed the discovery of numerous non-protein-coding RNA genes in mammalian cells (5,6). The most numerous but poorly understood produce long non-coding RNAs (lncRNAs), defined as transcripts >200 nt in length with no detectable protein-coding potential (7). Although their molecular mechanisms are highly diverse, many lncRNAs have been shown to interact with other RNA molecules, proteins and DNA by structural and sequence-specific interactions (8,9). Most lncRNAs are clade-and species-specific, but a subset display deeper evolutionary conservation in their gene structure (10) and a handful have been demonstrated to have functions that were conserved across millions of years of evolution (10,11). The numbers of known lncRNA genes in human have grown rapidly, and present catalogues range from 18 000 to ∼100 000 (12); however just a tiny fraction have been functionally characterized (13)(14)(15)(16). As lncRNAs likely represent a huge yet poorly understood component of cellular networks, understanding the clinical and therapeutic significance of these numerous novel genes is a key contemporary challenge.
LncRNAs have been implicated in molecular processes governing tumorigenesis (17). LncRNAs may promote or oppose cancer hallmarks (18). This fact, coupled to the emergence of potent in vivo inhibitors in the form of antisense oligonucleotides (ASOs) (19), has given rise to serious interest in lncRNAs as drug targets in cancer by both academia and pharma (17,(20)(21)(22).
Initially, cancer lncRNAs were discovered by classical functional genomics workflows employing microarray or RNA-seq expression profiling (23,24). More recently, CRISPR-based functional screening (25) and bioinformatic predictions (26)(27)(28) have also emerged as powerful tools for novel cancer gene discovery. To assess their accuracy, these approaches require accurate benchmarks in the form of curated databases of known cancer lncRNAs.
Any discussion of lncRNAs and cancer requires careful terminology. Experimental evidence suggest that for some lncRNAs, it is a DNA element within the gene, in addition to or instead of the RNA transcript, which mediates downstream gene regulation (29)(30)(31). This introduces the need for meticulous assessment of the basis of each lncRNA gene's functionality. Furthermore, it has been shown that lncRNAs can exert strong phenotypic effects in one cell background, but none in another (32). In the context of tumours, this means that amongst the large numbers of differentially expressed lncRNAs (24), just a fraction is likely to functionally contribute to a relevant cellular phenotype or cancer hallmark (20,(33)(34)(35)(36). Such genes, termed here 'functional cancer lncRNAs', are the focus of this study. Remaining changing genes are non-functional 'bystanders', that are largely irrelevant in understanding or inhibiting the molecular processes causing cancer and highlight the importance of not assessing functionality evidence simply by expressional changes.
There are a number of excellent databases of cancerassociated lncRNAs: lncRNADisease (37), CRlncRNA (38), EVLncRNAs (39) and Lnc2Cancer 3.0 (40). These principally employ labour-intensive manual curation, and rely extensively on differential expression to identify candidates. On the other hand, these databases have not yet taken advantage of recent high-confidence sources of func-tional cancer lncRNAs, such as high-throughput functional screens (25,41). For these reasons, existing annotations likely contain unknown numbers of bystander lncRNAs, whilst omitting large numbers of bona fide functional cancer lncRNAs. Thus, studies requiring high-confidence gene sets, including benchmarking or drug discovery, call for a database focussed exclusively on functional cancer lncR-NAs.
Here, we address this need through the creation of the Cancer LncRNA Census 2 (CLC2). It not only extends our previous CLC dataset by several fold (42), but more importantly, CLC2 takes a major step forward methodologically, by implementing an automated curation component that utilizes functional evolutionary conservation for the first time. Using these data, we present a comprehensive analysis of the genomic and clinical features of cancer lncRNAs.

Literature search
PubMed was searched for publications linking lncRNA and cancer using keywords: long noncoding RNA cancer, lncRNA cancer. Additional inclusion criteria consisted of GENCODE annotation, reported cancer subtype and cancer functionality (oncogene/tumour suppressor). The manual curation and assigning evidence levels to each lncRNA was performed exactly as previously (42) and included reports until December 2018.

CLIO-TIM
From the CCGD website (http://ccgd-starrlab.oit.umn.edu/ about.php, May 2018 (41)) a table with all CIS elements was downloaded. These mouse genomic regions (mm10) were converted to homologous regions in the human genome assembly hg38 using the LiftOver tool (https://genome. ucsc.edu/cgi-bin/hgLiftOver). Settings: original Genome was Mouse GRCm38/mm10 to New Genome Human GRCh38/hg38, minMatch was 0.1 and minBlocks 0.1. For insertion sites intersecting several lncRNA genes, all the genes were reported. IntersectBed from bedtools was used to align human insertion sites to GENCODE IDs by intersecting at least 1 nt and assigned to protein-coding or lncRNA gene families. Insertion sites aligning to proteincoding and lncRNA genes were always assigned to pc-genes. If insertion sites overlap multiple ENSGs, all genes are reported. Insertion sites not aligning to protein-coding or lncRNAs genes were added to the intergenic region.
CCGD human Entrez gene results were converted to GENCODE IDs using the 'Entrez gene ids' Metadata file from https://www.gencodegenes.org/human/ to compare CLIO-TIM results with CCGD results for each gene set.

MiTranscriptome data for evaluating intergenic insertion sites
The cancer associated MiTranscriptome IDs (24) previously used in Bergada et al. (43) were intersected with intergenic insertion sites using IntersectBed. With ShuffleBed the intergenic insertions were randomly shuffled 1000× and assigned to MiTranscriptome IDs.

CRISPRi
We used the Supplementary  (44), we extracted genes with 'hit' (validated as a hit in the screen), 'LH' (unique identifiers correlating to a gene in the screen) and 'lncRNA' (referring to a lncRNA gene and to exclude lncRNA hits close to a pc-gene ('Neighbor hit')) resulting in 499 hits. Of these, 322 hits contain a GENCODE IDs and were used for enrichment analysis, tested by onesided Fisher's test.
We included n = 21 CRISPRi genes to the CLC2 from the Supplementary Figure S8A from the 2017 Liu et al. paper (44), the tested cancer cell line and the effect of the CRISPRi on the growth phenotype (either promoting (tumor suppressor) or inhibiting (oncogene)) of each lncRNA was reported.

Cancer gene sets
For downstream analysis protein-coding (pc) genes (GEN-CODE IDs) are grouped in cancer-associated pc-genes (CGC genes) and non-cancer-associated pc-genes (non-CGC n = 19 174). The TSV file containing the CGC data was downloaded from https://cancer.sanger.ac.uk/census with 700 ENSGs with 698 ENSG IDs detected in GEN-CODE v28 of which 696 are unique (CGC n = 696). The same is done for lncRNAs, into CLC2 (n = 492) and non-CLC genes (n = 15 314).

Matched expression analysis
Based on an in-house script used for Survival analysis (section below), TCGA survival expression data for each GENCODE ID are reported and the average FPKM across all tumor samples is calculated. The count distribution of non-CGC and non-CLC gene expression to CGC and CLC2 expression, respectively, is matched using the matchDistribution.pl script (https://github.com/julienlag/ matchDistribution).

Coding potential analysis
The default CPAT settings (http://lilab.research.bcm.edu/ cpat/) were used to assess lncRNA transcripts; the coding probability for human transcripts ≥0.364 indicates coding sequences (http://rna-cpat.sourceforge.net) and the comparisons are tested using one-sided Fisher's test.

Cancer lncRNA databases
The tested databases were first filtered for lncRNAs in the GENCODE v28 long non-coding annotation (n = 15 767).

Features of CLC2 genes
Genomic classification. The genomic classification was performed as previously (42) using an in house script (https://github.com/goldlab/shared scripts/tree/master/lncRNA.annotator). This analysis uses lncRNA on transcript level and protein coding genes on gene level (default settings).

Genomic classification of CLC2 to CGC/non-CGC genes.
Genomic locations were compared using IntersectBed from bedtools (default settings). This analysis was performed on gene level.
Small RNA analysis. For this analysis 'snoRNA', 'snRNA', 'miRNA' and 'miscRNA' coordinates were extracted from GENCODE v28 annotation file and intersected with the genomic region of the genes (intronic and exonic regions).

Cancer characteristic analysis
Differential gene expression analysis (DEA). Differential gene expression analysis (DEA) was performed using TCGA data and TCGAbiolinks. Analysis was performed as reported in manual for matching tumour and normal tissue samples using the HTseq analysis pipeline as described previously (https://www.bioconductor.org/packages/devel/ bioc/vignettes/TCGAbiolinks/inst/doc/analysis.html) (46). For this analysis, only matched samples were used and the TCGA data were presorted for tumour tissue samples (TP 4 NAR Cancer, 2021, Vol. 3, No. 2 with 01 in sample name) and solid tissue normal (NT with 11 in sample name). Settings used for DEA analysis: fdr.cut = 0.05, logFC.cut = 1 for DGE output between matched TP and NT samples for 20 cancer types. CLC2 cancer types had to be converted to TCGA cancer types (Supplementary Figure S6A). Cancer types and number of samples used in the analysis can be found in Supplementary  Survival analysis. An in-house script for extracting TCGA survival data was used to generate P-values correlating to survival for each gene. Expression and clinical data from 33 cohorts from TCGA with the 'TCGAbiolinks' R package (https://bioconductor.org/packages/release/ bioc/html/TCGAbiolinks.html) were downloaded (46). P-value and Hazard ratio were calculated with the Cox proportional hazards regression model from 'Survival' R package (https://cran.r-project.org/web/packages/ survival/survival.pdf). All scripts were adapted from here (https://www.biostars.org/p/153013/) and are available upon request. For downstream analysis, only groups with at least 20 patient samples in high or low expression group were used. The plot comprises only the most significant cancer survival P-value per gene and was assessed by the Komnogorow-Smirnow test (ks test).
Cancer-associated SNP analysis. SNP data linked to tumour/cancer were extracted from the genome-wide association studies (GWAS) page (https://www.ebi.ac.uk/gwas/ docs/file-downloads) (n = 5331) and intersected with the whole exon body of the genes. SNPs were intersected to the transcript bed file and plotted per nt in each subset (SNP/nt y-axis) and tested using one-sided Fisher's test.
Conservation analysis. Whole exon body of the genes used in the SNP analysis were evaluated using Phast-Cons Scores (phastCons100way.UCSC.hg38) and the R package 'GenomicScores' (https://www.bioconductor.org/ packages/release/bioc/html/GenomicScores.html). Conservation scores for CLC2 SNP exons were plotted and compared to the mean of all CLC2 exons and non-CLC2matched exons, tested using one-sided Fisher's test.
Code availability. Custom code are available from the corresponding author upon request.
Generation of Cas9 stable cell lines. HeLa cells were transduced at a high multiplicity of infection with infection media composed by: lentivirus carrying the Cas9-BFP vector (Addgene 52962) and Hexadimethrine bromide (8 g/ml, Sigma-Aldrich 107689) resuspended in DMEM (10% FBS, 1% L-glutamine). Cells were incubated in infection media during 48 h. After that, the infection media was replaced by selective media composed by complete DMEM (10% FBS, 1% L-Glutamine and 1% Penicillin-Streptomycin) and Blasticidin (4 g/ml, Sigma-Aldrich 15205). Cells were selected until control cells were completely dead. Finally, cells were sorted twice selecting BFP positive cells by fluorescence activated cell sorting and expanded. TOPO Cloning and Sanger sequencing of the qPCR amplicon. The qPCR product of LINC00570 amplified using Qiagen QuantiNova RT (Qiagen, 205410) and Quanti-Nova SYBR Green PCR Kit (Qiagen, 208052) was run on a 2% agarose gel. The main band (corresponding to the expected amplicon size of 95 bp) was purified using the Gene-JET Gel Extraction and DNA Cleanup Micro Kit (Thermo Fisher Scientific, K0831). Using the TOPO TA Cloning Kit (Thermo Fisher Scientific, 45-0030), 4 l of the purified amplicon were ligated into the TOPO backbone vector. A total of 2 l of ligation product was used to transform Stbl3 competent cells, bacterial colonies were expanded and Sanger sequencing was performed (MicroSynth GmBH) using the M13 forward primer targeting the backbone provided with the TOPO TA Kit.

Integrative, semi-automated cataloguing of cancer lncRNAs
We sought to develop an improved map of lncRNAs with functional roles in either promoting or opposing cancer hallmarks or tumourigenesis. Such a map should prioritize lncRNAs with genuine causative roles, and exclude falsepositive 'bystanders'--genes whose expression changes but play no functional role.
We began with conventional manual curation of lncR-NAs from the scientific literature, covering the period from January 2017 (directly after the end of the first CLC (42)) to the end of December 2018. We continued to use stringent criteria for defining cancer lncRNAs--genes must be annotated in GENCODE (here version 28), and cancer function must be demonstrated either by functional in vitro or in vivo experiments, or germline or somatic mutational evidence (see 'Materials and Methods' section) ( Figure 1A). Altogether we collected 253 novel lncRNAs in this way, which added to the original CLC amounts to 375 lncRNAs, hereafter denoted as 'literature lncRNAs' (Figure 1A).
We recently showed that some literature-curated lncR-NAs were also targeted by previously overlooked mutations in published transposon insertional mutagenesis (TIM) screens (42). We hypothesized that this insight could be extended to identify novel functional cancer lncRNAs. Thus we developed a pipeline to automatically identify human lncRNAs by orthology to a collection of TIM hits in mouse (41). In this way 123 lncRNAs were detected, of which 102 were not already in the literature set. These were added to the CLC2, henceforth denoted as 'mutagenesis lncRNAs' ( Figure 1B). This analysis is discussed in more detail in the next section.
Pooled functional screens based on CRISPR-Cas9 lossof-function have recently emerged as a powerful means of identifying function cancer lncRNAs (25). However, there has been relatively little validation of the hits from such screens, and it is possible that they contain substantial false positives (50,51). Amongst the few datasets presently available, the most comprehensive comes from a CRISPRinhibition (CRISPRi) screen of ∼16 000 lncRNAs in seven human cell lines, with proliferation as a readout (44). Of the 499 hits identified, 322 are annotated by GENCODE and hence could potentially be included in CLC2. These are moderately enriched for known cancer lncRNAs from the literature search ( Figure 1C). That study independently validated 21 GENCODE-annotated hits, of which four (19%) were already mentioned in the literature, and two (10%) were detected by TIM above. Given the uncertainty over the true-positive rates of unvalidated screen hits, we opted for a conservative approach and included the remaining 15 novel and independently validated lncRNAs from this study ('CRISPRi lncRNAs') ( Figure 1C).
Altogether, the resulting CLC2 set comprises 492 unique lncRNA genes, representing a 4.0-fold increase over its predecessor. The entire CLC2 dataset is available in Supplementary Table S1 and S2. Importantly, the dataset is fully annotated with evidence information, affording users complete control over the particular subsets of lncRNAs (literature, mutagenesis and CRISPRi) that they wish to include in their analyses.

Automated annotation of human cancer lncRNAs via functional conservation
We recently showed that transposon insertional mutagenesis (TIM) screens identify cancer lncRNAs in mouse (42,52), and that some of these overlapped previously known human cancer lncRNAs (Figure 2A). TIM screens identify 'common insertion sites' (CIS), where multiple transposon insertions at a particular genomic location have given rise to a tumour, thereby implicating the underlying gene as an oncogene or tumour suppressor.
Here, we extend this strategy to identify new functional cancer lncRNAs, by developing a new pipeline called CLIO-TIM (cancer lncRNA identification by orthology to TIM). Briefly, CLIO-TIM uses chain alignments (53) to map mouse CIS to orthologous regions of the human genome, and then identifies the most likely gene target (see 'Materials and Methods' section) ( Figure 2B) (Supplementary Figure S1B). Available CIS maps are based on a variety of identification methods, resulting in CIS with a range of sizes, from 1 bp upwards. We opted to remove our previously conservative size criterion (CIS = 1 bp), to now consider elements of any size resulting in 26 345 CIS (compared to 2806 previously (42)) (Supplementary Figure S1A). This yields a 3-fold increase in sensitivity for true-positive CGC genes (72% compared to 26.4% previously (42)) (Supplementary Figure S1D).
Based on this expanded dataset, CLIO-TIM identified 16 430 orthologous regions in human (hCIS) ( Figure 2B) (Supplementary Figure S1A). Altogether, 123 lncRNAs and 9295 pc-genes were identified as potential cancer genes. It should be noted that the locations of originating mutations within CIS regions remains imprecisely known, meaning that we cannot say with certainty which mutations fall in gene exons or introns. An example is the human-mouse orthologous lncRNA locus shown in Figure 2B, comprising Gm36495 in mouse and LINC00570 in human. A CIS lies upstream of the mouse gene's TSS, mapping to the first intron of the human orthologue. LINC00570 is an alternative identifier for ncRNA-a5 cis-acting lncRNA identified by Orom et al. (54), that has not previously been associated with cancer or cell growth.
We expected that hCIS regions are enriched in known cancer genes. Consistent with this, the 698 pc-genes from the COSMIC CGC (4) (red in Supplementary Figure S1D) are 155-fold enriched with hCIS over intergenic regions (light grey). Turning to lncRNAs, the 375 literature lncRNAs are 19.5-fold enriched, supporting their disease relevance (Figure 2C). Thus, CLIO-TIM predictions are enriched in genuine protein-coding and lncRNA functional cancer genes. Supporting its accuracy, the overall numbers of genes implicated by CLIO-TIM agree with independent analysis in the CCGD database (Supplementary Figure S1C).
An additional 209 hCIS fall in intergenic regions that are neither part of pc-genes or lncRNAs, leading us to ask whether some may affect lncRNAs that are not annotated by GENCODE ( Figure 2C). To test this, we utilized the large set of cancer-associated lncRNAs from miTranscriptome (24). A total of 186 hCIS intersect 2167 miTranscriptome transcripts, making these potentially novel nonannotated transcripts involved in cancer. Nevertheless, simulations indicated that this rate of overlap was no greater than expected by random chance (see 'Materials and Methods' section), making it unlikely that substantial numbers of undiscovered cancer lncRNAs remain to be discovered in intergenic regions, at least with the datasets used here (Supplementary Figure S1E).
In addition to known cancer lncRNAs, CLIO-TIM identifies 102 lncRNAs not previously linked to cancer (Figure 2C, dark grey) with a 3.8-fold enrichment of insertions over intergenic genome. As will be shown below, these lncR-NAs bear clinical and genomic features of functional cancer genes, and hence we decided to include them in CLC2. It should be noted, however, that these 'mutagenesis' lncR-NAs are labelled and hence may be removed by end users, as desired. To experimentally test the principal that human orthologues of mouse cancer genes have a conserved function, we selected LINC00570, identified by CLIO-TIM but never previously been linked to cancer or cell proliferation. We asked whether LINC00570 promotes cell growth in transformed cells. We used RNA-sequencing data to search for cell models where LINC00570 is expressed, and identified robust expression in cervical carcinoma HeLa cells (Supplementary Figure S2A). We designed three distinct ASOs targeting the LINC00570 intron 2 and 3 and exon 3 of the short isoform (intronic targeting ASOs are known to have degradation efficiency comparable to exonic ones (55,56)). Transfection of these ASOs led to strong and reproducible decreases in steady state RNA levels in HeLa cells ( Figure  2D). This resulted in significant decreases in cell proliferation rates ( Figure 2D and Supplementary Figure S2B)). We observed a similar effect through CRISPRi-mediated inhibition of gene transcription by two independent guide RNAs in HeLa ( Figure 2D). To verify this qRT-PCR assay was measuring the correct cDNA, we isolated and sequenced the band, finding that it indeed originated from the expected sequence (Supplementary Figure S2C). Orom et al. reported that knockdown of LINC00570 (ncRNA-a5) led to a reduction in nearby ROCK2 gene's expression (54). Surprisingly, we found that the expression of ROCK2 was not detectably affected by LINC00570 knockdown (Supplementary Figure S2D). In summary, LINC00570 predicted by CLIO-TIM pipeline promotes growth of human cancer cells, and is likely to have a deeply evolutionarily conserved tumorigenic activity.

Enhanced cancer lncRNA catalogue integrating manual annotation, CRISPR screens and functional conservation
We next tallied the distinct lncRNAs in CLC3 and compared them with existing cancer lncRNA databases. Figure 3A shows a breakdown of the composition of CLC2 in terms of source, gene function and evidence strength. Where possible, the genes are given a functional annotation, oncogene (og) or tumour suppressor (ts), according to evidence for promoting or opposing cancer hallmarks. Oncogenes (n = 275) quite considerably outnumber tumour suppressors (n = 95), although it is not clear whether this reflects genuine biology or an ascertainment bias relating to scientific interest or technical issues. Smaller sets of lncRNAs are associated with both functions, or have no functional information (those from TIM screens where the functions of hits are ambiguous).
In terms of the quality of evidence sources, CLC2 represents a considerable improvement over the original CLC. The fraction of lncRNAs with high quality in vivo evidence (defined as functional validation in mouse models or mutagenesis analysis) now represent 66% compared to 24% previously ( Figure 3A and Supplementary Figure S3B). In total, the updated CLC2 comprises 33 cancer types (versus 29) and more lncRNAs are reported for every cancer subtype (Supplementary Figure S3A).
We were curious how much novelty the CLC2 gene set brought to the known universe of cancer lncR-NAs, as estimated from respected and longstanding cancer lncRNA collections ( Figure 3B). Considering only GENCODE-annotated lncRNA genes, CLC2 with 492 is second only to Lnc2Cancer 3.0 (n = 688) in terms of size (40). Lnc2Cancer and CLC2 share the greatest number of lncRNAs in common. However, Lnc2Cancer uses looser inclusion criteria, including lncRNAs that are differentially expressed in tumours without additional functional evidence. The three remaining databases are smaller (<200 genes).
The novel aspect of CLC2 to include lncRNAs from TIM screens leads to the identification of 92 completely novel genes, not detected in any other database ( Figure 3B, inset). Just 41 lncRNAs are common to all five databases (37)(38)(39)(40). In summary, CLC2 achieves large size without compromising on confidence, whilst also including numerous new cancer lncRNAs for the first time.

Unique genomic properties of CLC2 lncRNAs
Cancer genes, both protein-coding and not, display elevated characteristics of essentiality and clinical importance compared to other genes (4,18,57,58). In order to confirm their quality as a resource, we next asked whether CLC2 lncR-NAs, and the mutagenesis subset, display features expected for cancer genes.
In the following analyses, we compared gene features of selected lncRNAs to all other lncRNAs. Comparison of gene sets can often be confounded by covariates, such as gene length or gene expression, therefore where appropriate we used control gene sets that were matched to CLC2 by expression (denoted 'nonCLCmatched') ( Supplementary Figure S4A) and reported findings correcting for gene length (Supplementary Figure S4B). We next tested for potentially protein-coding transcripts amongst the CLC2 set. Overall, only a small but non-negligible fraction (5.9%) of CLC2 genes indicated coding potential (Supplementary Figure  S4C). This highlights the general need for researchers to exercise caution in interpreting the biotypes of annotated lncRNAs and investigate their protein-coding status more thoroughly where appropriate.
Evolutionary conservation and steady-state expression are widely-used proxies for gene function (59)(60)(61). Using the LnCompare tool (45), we find that the promoters and exons of CLC2 genes display elevated evolutionary conservation in mammalian and vertebrate phylogeny ( Figure 4A) and elevated expression in cancer cell lines ( Figure 4B). Strikingly we observe a similar effect when considering the mutagenesis lncRNAs alone: their promoters are significantly more conserved than expected by chance, and their expression is an order of magnitude higher than other lncRNAs ( Figure  4C and D).
Further, we found that CLC2 lncRNAs are enriched in repetitive elements (Supplementary Figure S5A) and are more likely to house a small RNA gene, possibly indicating that some act as precursor transcripts (Supplementary Figure S5B). CLC2 lncRNAs also have non-random distributions of gene biotypes, being depleted for intergenic class and enriched in divergent orientation to other genes (Supplementary Figure S5C). This effect was not driven by CRISPRi targets alone, since when the analysis was repeated without them, the same enrichment for divergent lncRNAs was observed (P = 0.0038). We could observe an enrichment of CLC2 genes overlapping or within 10 kb distance of the TSS of the CGC genes compared to non-CGC genes (Supplementary Figure S5D), suggesting cancer cofunctionalities for CLC and CGC genes.
In summary, CLC2 lncRNAs are significantly more conserved and more expressed than expected by chance, pointing to biological function. Mutagenesis lncRNAs discovered by the CLIO-TIM also carry these features, supports their designation as functional cancer lncRNAs.

CLC2 lncRNAs display consistent tumour expression changes and prognostic properties
Although gene expression was not a criterion for inclusion, we would expect that CLC2 lncRNAs' expression will be altered in tumours. Furthermore, one might expect that the nature of this alteration should vary with disease function: oncogenes overexpressed, and tumour suppressors downregulated.
To test this, we analysed TCGA RNA-sequencing (RNAseq) data from 686 individual tumours with matched healthy tissue (total n = 1372 analysed samples) in 20 different cancer types (Supplementary Figure S6A and B), and classified every gene as either differentially expressed (in at least one cancer subtype, with log2 fold change > 1 and FDR < 0.05) or not. We found that CLC2 lncRNAs are 3.4fold more likely to be differentially expressed compared to expression-matched lncRNAs ( Figure 5A). LncRNAs from each individual evidence source (literature, mutagenesis and CRISPRi) behaved similarly, again supporting their inclusion. Similar effects were found for pc-genes (Supplementary Figure S7A).
Next, we asked whether the direction of expression change corresponds to gene function. Indeed, oncogenes are enriched for overexpressed genes, whereas tumour suppressors are enriched for downregulated genes, supporting the functional labelling scheme ( Figure 5B).
Cancer genes' expression is often prognostic for patient survival. By correlating expression to patient survival, we  found that the expression of 392 CLC2 lncRNAs correlated to patient survival in at least one cancer type (Supplementary Figure S7C). When analysing the most significant correlation of each CLC2 lncRNA compared to expressionmatched non-CLC lncRNAs, we find a weak but significant enrichment (Supplementary Figure S7C), suggesting that CLC2 lncRNAs can be prognostic for patient survival.
In summary, gene expression characteristics of CLC2 genes, and subsets from different evidence sources, support their functional labels as oncogenes and tumour suppressors and is more broadly consistent with their important roles in tumorigenesis.

CLC2 lncRNAs are enriched with cancer genetic mutations
Cancer genes are characterized by a range of germline and somatic mutations that lead to gain-or loss-offunction. It follows that cancer lncRNAs should be enriched with germline single nucleotide polymorphisms (SNPs) that have been linked to cancer predisposition (62). We obtained 5331 germline cancer-associated SNPs from GWAS (63) and mapped them to lncRNA and pc-gene exons, calculating a density score that normalizes for exon length (Supplementary Figure S4B). As expected, exons of known cancer pc-genes are >2-fold enriched in cancer SNPs (Supplementary Figure S7B). When performing the same analysis with CLC2 lncRNAs, one observes an even more pronounced enrichment of 4.0-fold when comparing to expression-matched non-CLC lncRNAs (Figure 5C). Once again, the lncRNAs from each evidence source individually show enrichment for cancer SNPs >2fold ( Figure 5C). Three mutagenesis lncRNAs, namely miR143HG/CARMN, LINC00511 and LINC01488, carry an exonic cancer SNP ( Figure 5D).
CLC2 exons containing a cancer SNP are less conserved than CLC2 exons overall, and display a conservation level comparable to non-CLC exons (Supplementary Figure  S7D). This is consistent with previous reports demonstrating that SNPs tend to occur in regions of lower than average evolutionary conservation (64).
Cancer genes are also frequently the subject of large-scale somatic mutations, or copy number variants (CNVs). Using a collection of CNV data from LncVar (47), we calculated the gene-span length-normalized coverage of lncRNAs by CNVs. CLC2 lncRNAs are enriched for CNVs compared to all lncRNAs ( Figure 5E).
All information of the lncRNAs in the CLC2 with the corresponding cancer function, evidence level, analysis method and cancer types can be found in the Supplementary Table S1. The Supplementary Table S2 can be used to filter lncRNAs based on their reported cancer associated functionalities.
In summary, CLC2 lncRNAs and their subsets display germline and somatic mutational patterns consistent with known oncogenes and tumour suppressors

DISCUSSION
We have presented the CLC2, an expanded collection of lncRNAs with functional roles in cancer. CLC2 is distinguished from other resources by several key features. All its constituent lncRNAs have strong evidence for functional cancer roles (and not merely differential expression), providing for lowest possible false positive rates. All CLC2 lncRNAs are included in the gold-standard GENCODE annotation, permitting smooth interoperability with almost all public genomics projects and resources (12). The majority of CLC2 entries are accompanied by functional labels (oncogene/tumour suppressor), enabling one to link function to other observable features. Finally, we utilize transposon insertional mutagenesis (TIM) datasets for the first time to discover 102 'mutagenesis' lncRNAs, of which 92 are completely novel. In spite of strict inclusion criteria, CLC2 is amongst the largest available cancer lncRNA collections. Overall, CLC2 makes a valuable addition to the present landscape of cancer lncRNA resources.
A key novelty of CLC2 is its use of automated gene curation based on functional evolutionary conservation, as inferred from TIM. This responds to the challenge from the rapid growth of scientific literature, which makes manual curation increasingly impractical. Other high-throughput/automated methods like CRISPR pooled screening, text mining and machine learning will also be important, although it will be necessary to vet the quality of such predictions prior to inclusion. Here we showed one way approach for this, by assessing the TIM gene set across a range of genomic and clinical features. The fact that the 'mutagenesis' lncRNA set display rates of (i) nucleotide conservation, (ii) expression, (iii) tumour differential expression, (iv) germline cancer polymorphisms and (v) tumour mutations similar to that of gold-standard literaturecurated lncRNAs, coupled to thorough experimental validation of one novel prediction (LINC00570), is powerful support for TIM and functional evolutionary conservation as means for new cancer lncRNA discovery.
It might be argued that hits from TIM sites could be false positives that act via DNA elements (for example, enhancers) that, by coincidence, overlap a non-functional lncRNA. Whilst certainly likely to occur in some cases, it would nevertheless appear unlikely to explain the majority, in light of the features listed above, plus the observation that TIM sites are highly enriched in independently validated literature-curated lncRNAs (which act via RNA) including NEAT1, LINC-PINT and PVT1 (42). In spite of this, we recognize that some colleagues may ascribe lower confidence to novel 'mutagenesis' lncRNAs in CLC2. For this reason, the CLC2 data table is organized to facilitate filtering by source, enabling users to extract only the 375 literature-supported cases, or indeed any other subset based on source, evidence or function as desired.
Apart from its usefulness as a resource, this study has enabled some important conceptual insights. Firstly, we have replicated our previous finding that cancer lncRNAs are distinguished by signatures of functionality, as inferred from evolutionary nucleotide conservation and expression. These features were originally linked to protein-coding cancer genes (57,58), but are also utilized as markers for lncRNA functionality (42,65). Moreover, we extended this approach to clinical features, by showing that curated cancer lncRNAs are dramatically more likely to be differentially expressed in tumours, suffer copy number alteration, or carry a germline predisposition SNP. In the latter case, this rate even exceeds cancer driver pc-genes. We also could demonstrate that changes in gene expression in tumours are linked to function: oncogenes tend to be overexpressed, whilst tumour-suppressors tend to be repressed. Finally, the demonstration that cancer lncRNAs can be predicted on the basis of orthology to a TIM hit in mouse, lends powerful support to the notion that there is widespread functional evolutionary conservation of lncRNAs in networks related to cell growth and transformation.
LINC00570 is a new functional cancer lncRNA predicted by CLIO-TIM. The gene was previously discovered by Orom and colleagues, as a cis-activating enhancer-like RNA named ncRNA-a5 (54). That and a subsequent study showed that perturbation by siRNA transfection affects the expression of the nearby pc-gene ROCK2 in HeLa. However, these studies did not investigate the effect on cell proliferation. We here show by means of two independent perturbations, that LINC00570 promotes proliferation of HeLa cells. These findings make LINC00570 a potential therapeutic target for follow up.
Intriguingly, amongst the novel mutagenesis lncRNAs identified by CLIO-TIM are genes previously linked to other diseases. miR143HG/CARMEN1 (CARMN) was shown to regulate cardiac specification and differentiation in mouse and human hearts (66). In addition to being a TIM target, CARMEN1 also contains a germline cancer SNP correlating with the risk of developing lung cancer (67), adding further weight to the notion that it also plays a role in oncogenesis. Similarly, DGCR5, is located in the DiGeorge critical locus and has been linked to neurodevelopment and neurodegeneration (68), and was recently implicated as a tumour suppressor in prostate cancer (69). These results raise the possibility that developmental lncRNAs can also play roles in cancer.
In summary, CLC2 establishes a new benchmark for cancer lncRNA resources. We hope this dataset will enable a wide range of studies, from bioinformatic identification of new disease genes, to developing a new generation of cancer therapeutics with anti-lncRNA ASOs (70).

DATA AVAILABILITY
Information on CIS elements for mouse and human lncR-NAs reported in this publication are available in the Supplementary Table S1 and the code is available from GitHub (https://github.com/Vancuraa/CLC2).