Semi-automated annotation of known and novel cancer long noncoding RNAs with the Cancer LncRNA Census 2 (CLC2)

Long noncoding RNAs (lncRNAs) can promote or repress the cellular hallmarks of cancer. Understanding their molecular roles and realising their therapeutic potential depend on high-quality catalogues of cancer lncRNA genes. Presently, such catalogues depend on labour-intensive curation of heterogeneous data with permissive criteria, resulting in unknown numbers of genes without direct functional evidence. Here, we present an approach for semi-automated curation focused exclusively on pathogenic functionality. The result is Cancer LncRNA Census 2 (CLC2), comprising 492 gene loci in 33 cancer types. To complement manual literature curation, we develop an automated pipeline, CLIO-TIM, to identify novel cancer lncRNAs based on functional evolutionary conservation with mouse. This yields 95 novel lncRNAs, which display characteristics of known cancer genes and include LINC00570 (ncRNA-a5), which we demonstrate experimentally to promote cell proliferation. The clinical importance and curation accuracy of CLC2 lncRNAs are highlighted by a range of features, including evolutionary selection, expression in tumours, and both somatic and germline polymorphisms. The entire dataset is available in a highly curated format facilitating the widest range of downstream applications. In summary, we show how manual and automated methods can be integrated to catalogue known and novel functional cancer lncRNAs with unique genomic and clinical properties.


Introduction
Cancer arises through a series of somatic genetic mutations, leading via defined cellular phenotypic hallmarks to the formation of a tumour (Hanahan & Weinberg, 2011) (Yates & Campbell, 2012). Such mutations are principally thought to alter the function of polypeptides encoded by protein-coding genes (pc-genes) (Sondka et al., 2018) and the increased probability of suffering dysfunctional mutations defines known cancer genes (Furney et al., 2006).
Datasets such as the Cancer Gene Census (CGC) collect and organise comprehensive sets of cancer pc-genes according to defined criteria, and represent invaluable and widely-used resources for scientific research and drug discovery (Sondka et al., 2018).
The past decade has witnessed the discovery of numerous non-protein-coding RNA genes in mammalian cells (Guttman et al., 2009;Uszczynska-Ratajczak et al., 2018). The most numerous but most poorly understood of these produce long noncoding RNAs (lncRNAs), defined as transcripts >200 nt in length with no detectable protein-coding potential (Derrien et al., 2012).
Although their molecular mechanisms are highly diverse, many lncRNAs have been shown to interact with other RNA molecules, proteins and DNA through structural and sequence-specific interactions (Guttman & Rinn, 2012) (Johnson & Guigó, 2014). Most lncRNAs are clade- and species-specific, but a subset display deeper evolutionary conservation in their gene structure (Ulitsky et al., 2011) and a handful have been demonstrated to have functions that were conserved across millions of years of evolution (Marín-Béjar et al., 2017;Ulitsky et al., 2011). The numbers of known lncRNA genes in human have grown rapidly, and present catalogues range from 18,000 to ~100,000 (Frankish et al., n.d.); however, just a tiny fraction have been functionally characterized (Kopp & Mendell, 2018) (Ulitsky & Bartel, 2013) (Ma et al., 2019) (Quek et al., 2015). Understanding the clinical and therapeutic significance of these numerous novel genes is a key contemporary challenge.
More recently, CRISPR-based functional screening (Esposito et al., 2019) and bioinformatic predictions (Lanzós et al., 2017;Mularoni et al., 2016;Rheinbay et al., 2017) have also emerged as powerful tools for novel cancer gene discovery. To assess their accuracy, these approaches require accurate benchmarks in the form of curated databases of known cancer lncRNAs.
Any discussion of lncRNAs and cancer requires careful terminology. Tumours display large numbers of differentially expressed genes (Iyer et al., 2015). However, just a fraction of these are likely to functionally contribute to a relevant cellular phenotype or cancer hallmark (T Gutschner et al., n.d.;Hosono et al., 2017;Lee et al., 2016;Leucci et al., 2016;Munschauer et al., 2018). Such genes, termed here "functional cancer lncRNAs", are the focus of this study.
Remaining changing genes are non-functional "bystanders", which are largely irrelevant in understanding or inhibiting the molecular processes causing cancer.
There are a number of excellent databases of cancer-associated lncRNAs: lncRNADisease (Bao et al., 2019), CRlncRNA, EVLncRNAs (Zhou et al., 2018) and Lnc2Cancer (Gao et al., 2019). These principally employ labour-intensive manual curation, and rely extensively on differential expression to identify candidates. Moreover, these databases have not yet incorporated more recent high-confidence sources of functional cancer lncRNAs, such as high-throughput functional screens (Esposito et al., 2019) (Abbott et al., 2015). For these reasons, existing annotations likely contain unknown numbers of bystander lncRNAs, while omitting large numbers of bona fide functional cancer lncRNAs. Thus, studies requiring high-confidence gene sets, including benchmarking or drug discovery, call for a database focussed exclusively on functional cancer lncRNAs.
Here we address this need through the creation of the Cancer LncRNA Census 2 (CLC2). This extends our previous CLC dataset by several fold (Carlevaro-Fita et al., 2020).
More importantly, CLC2 takes a major step forward methodologically, by implementing an automated curation component that utilises functional evolutionary conservation for the first time. Using these data, we present a comprehensive analysis of the genomic and clinical features of cancer lncRNAs. Most importantly, we present a practical and versatile dataset intended for use by basic researchers and drug discovery projects.

Integrative, semi-automated cataloguing of cancer lncRNAs
We sought to develop an improved map of lncRNAs with functional roles in either promoting or opposing cancer hallmarks or tumorigenesis. Such a map should prioritise lncRNAs with genuine causative roles, and exclude false-positive "bystander" lncRNAs whose expression changes but which play no functional role.
We began with conventional manual curation of lncRNAs from the scientific literature, covering the period from January 2017 (directly after the end of the first CLC (Carlevaro-Fita et al., 2020)) to the end of December 2018. The criteria for defining cancer lncRNAs were identical: genes must be annotated in GENCODE (here version 28), and cancer function must be demonstrated by in vitro or in vivo experiments or germline or somatic mutational evidence (Figure 1A). Altogether we collected 253 novel lncRNAs in this way, which, added to the original CLC (n=122), amount to 375 lncRNAs, hereafter denoted as "literature lncRNAs" (Figure 1A).
We previously showed that transposon insertion mutagenesis (TIM) hits are enriched in cancer-associated lncRNA genes, implying that cancer lncRNAs' functions can be deeply evolutionarily conserved (Carlevaro-Fita et al., 2020). We developed a pipeline to automatically identify likely human functional cancer lncRNAs by orthology to a collection of TIM hits (Abbott et al., 2015). In this way 123 lncRNAs were detected, of which 102 were not already in the literature set. These were added to the CLC2, henceforth denoted as "mutagenesis lncRNAs" (Figure 1B). This analysis is discussed in more detail in the next section.
Pooled functional screens based on CRISPR-Cas9 loss-of-function have recently emerged as a powerful means of identifying functional cancer lncRNAs (Esposito et al., 2019).
The most comprehensive dataset presently available comes from a CRISPR-inhibition (CRISPRi) screen of ~16,000 lncRNAs in seven human cell lines, with proliferation as a readout (Liu et al., 2017). Of the 499 hits identified, 322 are annotated by GENCODE. These hits are significantly enriched for known cancer lncRNAs from the literature search (Figure 1C). That study independently validated 21 GENCODE-annotated hits. Four (19%) of these were already mentioned in the scientific literature, and two (10%) were detected in the TIM screen above. Given their high confidence, we added the remaining 15 novel lncRNAs to CLC2 ("CRISPRi lncRNAs") (Figure 1C).
The PVT1, SNHG1 and SNHG12 genes are detected in all three sources. The entire CLC2 dataset is available in Supplementary Tables 1 and 2. Importantly, the dataset is fully annotated with evidence information, enabling users to filter particular sets of lncRNAs in which they have greater levels of confidence.

Automated annotation of human cancer lncRNAs via functional conservation
We recently showed that transposon insertional mutagenesis (TIM) screens can be used to identify cancer lncRNAs in mouse (Copeland & Jenkins, 2010) (Carlevaro-Fita et al., 2020), and that these often have human orthologues which are also cancer genes (Figure 2A). TIM screens identify "common insertion sites" (CIS), where multiple transposon insertions at a particular genomic location have given rise to a tumour, thereby implicating the underlying gene as an oncogene or tumour suppressor.
We reasoned that this could be extended to identify new functional cancer lncRNAs in human, and developed a pipeline for this: CLIO-TIM (cancer lncRNA identification by orthology to TIM). Briefly, CLIO-TIM uses chain alignments to map mouse CIS to orthologous regions of the human genome, and then identifies their likely target gene by proximity (see Methods) (Figure 2B) (SUPP FIG 1B).
Using 26,345 mouse CIS from public databases (Abbott et al., 2015), CLIO-TIM identifies 16,430 orthologous regions in human (hCIS) (Figure 2B) (SUPP FIG 1A). Altogether, 123 lncRNAs and 9,295 pc-genes are identified as potential cancer genes. An example is the human-mouse orthologous lncRNA locus shown in Figure 2B, comprising Gm36495 in mouse and LINC00570 in human. A CIS lies upstream of the mouse gene's TSS, mapping to the first intron of the human orthologue. LINC00570 is an alternative identifier for ncRNA-a5, a cis-acting lncRNA identified by Ørom et al. (Ørom et al., 2010) that has not previously been associated with cancer or cell growth.
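The target-assignment step can be illustrated with a minimal sketch. It assumes that the CIS have already been mapped to human coordinates (the chain-alignment step is not reproduced here) and that "proximity" means the gene at the smallest genomic distance from the hCIS, with overlap counted as zero. The distance cutoff, the coordinates and the names `Interval` and `assign_target` are hypothetical; the actual rules are those described in Methods.

```python
from typing import List, NamedTuple, Optional


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: str


def distance(a: Interval, b: Interval) -> Optional[int]:
    """Gap in bp between two intervals; 0 if they overlap, None if on different chromosomes."""
    if a.chrom != b.chrom:
        return None
    if a.end <= b.start:
        return b.start - a.end
    if b.end <= a.start:
        return a.start - b.end
    return 0


def assign_target(hcis: Interval, genes: List[Interval],
                  max_dist: int = 10_000) -> Optional[str]:
    """Return the name of the closest gene within max_dist bp (assumed cutoff), else None."""
    best_name, best_dist = None, None
    for gene in genes:
        d = distance(hcis, gene)
        if d is None or d > max_dist:
            continue
        if best_dist is None or d < best_dist:
            best_name, best_dist = gene.name, d
    return best_name


# Toy annotation with invented coordinates.
toy_genes = [
    Interval("chr2", 1_000, 5_000, "lncRNA_A"),
    Interval("chr2", 20_000, 30_000, "pc_gene_B"),
]
print(assign_target(Interval("chr2", 5_200, 5_201, "hCIS_1"), toy_genes))
```

In this toy case the hCIS lies 200 bp downstream of `lncRNA_A` and more than 10 kb from `pc_gene_B`, so it is assigned to `lncRNA_A`; an hCIS with no gene inside the window is left unassigned, which corresponds to the intergenic class.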
We expect that hCIS regions are enriched in known cancer genes. Consistent with this, the 698 pc-genes from the COSMIC Cancer Gene Census (CGC) (Sondka et al., 2018) (red in SUPP FIG 1D) are 155-fold enriched with hCIS over intergenic regions (light grey).
Turning to lncRNAs, the 375 literature lncRNAs are 19.5-fold enriched, supporting their disease importance (Figure 2C). Thus, CLIO-TIM predictions are enriched in genuine protein-coding and lncRNA functional cancer genes. As expected, the overall numbers of genes implicated by CLIO-TIM agree with independent analysis in the CCGD database (SUPP FIG 1C).
An additional 209 hCIS fall in intergenic regions that are part of neither pc-genes nor lncRNAs, leading us to ask whether some may affect lncRNAs that are not annotated by GENCODE (Figure 2C). To test this, we utilised the large set of cancer-associated lncRNAs from miTranscriptome (Iyer et al., 2015). 186 hCIS intersect 2,167 miTranscriptome genes, making these potentially novel non-annotated transcripts involved in cancer. Nevertheless, simulations indicated that this rate of overlap was no greater than expected by random chance (see Methods), making it unlikely that substantial numbers of cancer lncRNAs remain to be discovered in intergenic regions, at least with the datasets used here (SUPP FIG 1E).
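The fold enrichments quoted here compare how densely hCIS fall within a gene class relative to intergenic space. A minimal sketch of such a calculation follows, assuming that density is expressed as insertions per megabase and that enrichment is the ratio of two densities; the exact normalisation used in the study is not restated here, and all counts below are invented for illustration.

```python
def insertion_density(n_insertions: int, total_bp: int) -> float:
    """Insertions per megabase of sequence."""
    if total_bp <= 0:
        raise ValueError("total_bp must be positive")
    return n_insertions / (total_bp / 1e6)


def fold_enrichment(n_fg: int, bp_fg: int, n_bg: int, bp_bg: int) -> float:
    """Ratio of foreground (gene class) to background (intergenic) insertion density."""
    background = insertion_density(n_bg, bp_bg)
    if background == 0:
        raise ValueError("background density is zero; enrichment is undefined")
    return insertion_density(n_fg, bp_fg) / background


# Invented numbers: 40 hCIS in 2 Mb of gene span versus 100 hCIS in 100 Mb of intergenic space.
print(fold_enrichment(40, 2_000_000, 100, 100_000_000))
```

Normalising by sequence length matters because gene classes differ greatly in total span, so raw insertion counts alone would not be comparable.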
In addition to known cancer lncRNAs, CLIO-TIM identifies 102 lncRNAs not previously linked to cancer (FIG 2C, dark grey) with a 3.8-fold enrichment of insertions over intergenic space. These lncRNAs represent novel functional cancer genes, and were added to the CLC2.
This makes the assumption that human orthologues of mouse cancer genes will have a conserved function. We tested this using LINC00570, predicted to be a cancer gene by CLIO-TIM but never previously linked to cancer or cell proliferation.
We here describe some general features of the CLC2 dataset. Figure 3A shows a breakdown of the composition of CLC2 in terms of source, gene function and evidence strength. Where possible, the genes are given a functional annotation, oncogene (og) or tumour suppressor (ts), according to evidence of promoting or opposing cancer hallmarks.
Several longstanding cancer lncRNA collections have provided an invaluable resource for the community (Figure 3B). Comparing only GENCODE v28 genes, CLC2 with 492 genes is second only to Lnc2Cancer (n=512) (Gao et al., 2019). However, Lnc2Cancer uses looser inclusion criteria, including lncRNAs without GENCODE IDs and those that are differentially expressed in tumours with no functional evidence. The remaining databases are smaller.
Furthermore, CLC2 has the greatest number of unique GENCODE gene loci compared to other databases (n=225), including numerous literature-annotated cases and also 95 novel mutagenesis lncRNAs. Just 40 lncRNAs are common to all five databases (Bao et al., 2019;Gao et al., 2019;Wang et al., 2018;Zhou et al., 2018). In summary, CLC2 is a large, practical and high-quality cancer lncRNA annotation that complements existing resources.
cell lines (Figure 4B). Strikingly, we observe a similar effect when considering the novel mutagenesis lncRNAs alone: their promoters are significantly more conserved than expected by chance, and their expression is an order of magnitude higher than other lncRNAs (Figure 4C and D).
Although gene expression was not a criterion for inclusion, we do expect that CLC2 lncRNA levels will be altered in tumours. Furthermore, we expect that oncogenes should be overexpressed, and tumour suppressors downregulated.
To test this, we analysed TCGA RNA-sequencing (RNA-seq) data from 686 individual tumours with matched healthy tissue (total n=1,372 analyzed samples) in 20 different cancer types (SUPP FIG 6A and B), and classed every gene as either differentially expressed (in at least one cancer subtype with a log2 Fold Change >1 and an FDR <0.05) or not. CLC2 lncRNAs are 3.4-fold more likely to be differentially expressed compared to expression-matched lncRNAs (Figure 5A). LncRNAs from each individual evidence source (literature, mutagenesis, CRISPRi) display the same trend. Similar effects were found for pc-genes (SUPP FIG 7A).
Next we asked whether the direction of expression change corresponds to gene function. Indeed, oncogenes are enriched for overexpressed genes, whereas tumour suppressors are enriched for downregulated genes, supporting the functional labelling scheme (Figure 5B).
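The classification rule can be written down compactly. The sketch below assumes that the log2 fold change threshold applies to the absolute value, so that both up- and downregulated genes count as differentially expressed; this is an interpretation rather than a statement from the text. Gene names and values are invented.

```python
from typing import Dict, List, Tuple

LOG2FC_CUTOFF = 1.0  # threshold quoted in the text
FDR_CUTOFF = 0.05    # threshold quoted in the text

# One (log2 fold change, FDR) pair per cancer type for a given gene.
Result = Tuple[float, float]


def is_differentially_expressed(results: List[Result]) -> bool:
    """True if the gene passes both cutoffs in at least one cancer type."""
    return any(abs(lfc) > LOG2FC_CUTOFF and fdr < FDR_CUTOFF for lfc, fdr in results)


def fraction_de(genes: Dict[str, List[Result]]) -> float:
    """Fraction of genes in a set classed as differentially expressed."""
    if not genes:
        raise ValueError("empty gene set")
    return sum(is_differentially_expressed(r) for r in genes.values()) / len(genes)


# Invented toy data: a candidate set and an expression-matched control set.
toy_candidates = {
    "geneA": [(2.3, 0.001), (0.2, 0.8)],
    "geneB": [(-1.6, 0.01)],
    "geneC": [(0.4, 0.3)],
    "geneD": [(1.5, 0.2)],
}
toy_controls = {
    "ctrl1": [(0.1, 0.9)],
    "ctrl2": [(1.2, 0.04)],
    "ctrl3": [(0.9, 0.01)],
    "ctrl4": [(-0.3, 0.5)],
}
print(fraction_de(toy_candidates) / fraction_de(toy_controls))
```

The final ratio is only a descriptive summary of the toy data; it is not meant to reproduce how the 3.4-fold figure was derived.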
Cancer genes' expression is often prognostic for patient survival. By correlating expression to patient survival, we found that the expression of 392 CLC2 lncRNAs correlated to patient survival in at least one cancer type (SUPP FIG 7C). When analyzing the most significant correlation of each CLC2 lncRNA compared to expression-matched nonCLC lncRNAs, we find a weak but significant enrichment (SUPP FIG 7C), showing that CLC2 lncRNAs tend to be prognostic for patient survival.
In summary, gene expression characteristics of CLC2 genes, and subsets from different evidence sources, support their functional labels as oncogenes and tumour suppressors and are more broadly consistent with their important roles in tumorigenesis.

CLC2 lncRNAs are enriched with cancer genetic mutations
Cancer genes are characterized by a range of germline and somatic mutations that lead to gain or loss of function. We hypothesised that cancer lncRNAs should be enriched with germline single nucleotide polymorphisms that have been linked to cancer predisposition (Deng et al., 2017). We obtained 5,331 cancer-associated single nucleotide polymorphisms (SNPs) from genome-wide association studies (GWAS) (Buniello et al., 2019) and mapped them to lncRNA and pc-gene exons, calculating a density score that normalises for exon length (SUPP FIG 4B). As expected, exons of known cancer pc-genes are >2-fold enriched in germline SNPs (SUPP FIG 7B). When performing the same analysis with CLC2 lncRNAs, we observe an even more pronounced enrichment of 4.0-fold when comparing to expression-matched nonCLC lncRNAs (Figure 5C). Once again, the lncRNAs from each evidence source individually show >2-fold enrichment for cancer SNPs (Figure 5C). Three mutagenesis lncRNAs, namely miR143HG/CARMN, LINC00511 and LINC01488, exhibit an exonic cancer SNP (Figure 5D).
Cancer genes are also frequently the subject of large-scale somatic mutations, or copy number variants (CNVs). Using a collection of CNV data from LncVar, we calculated the gene-span length-normalized coverage of lncRNAs by CNVs. CLC2 lncRNAs are enriched for CNVs compared to all non-cancer lncRNAs (Figure 5E).
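A length-normalised SNP density of the kind described here can be sketched as follows. The sketch assumes that the score is the number of exonic SNPs per kilobase of non-redundant exonic sequence, and that all exons and SNPs lie on a single chromosome; both are simplifications for illustration, and the coordinates are invented.

```python
from typing import List, Tuple


def merge_intervals(exons: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping or adjacent half-open exon intervals so no base is counted twice."""
    merged: List[List[int]] = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def snp_density(exons: List[Tuple[int, int]], snp_positions: List[int]) -> float:
    """Exonic SNPs per kb of merged exonic sequence."""
    merged = merge_intervals(exons)
    exonic_bp = sum(end - start for start, end in merged)
    if exonic_bp == 0:
        raise ValueError("no exonic sequence")
    hits = sum(any(start <= pos < end for start, end in merged) for pos in snp_positions)
    return hits / (exonic_bp / 1000)


# Two overlapping toy exons (1,000 bp once merged) and three toy SNPs, two of them exonic.
print(snp_density([(100, 600), (500, 1100)], [150, 550, 2000]))
```

Merging exons first prevents bases shared between isoforms from inflating the denominator, which would otherwise bias the density downwards for genes with many annotated transcripts.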
In summary, CLC2 lncRNAs display germline and somatic mutational patterns consistent with known oncogenes and tumour suppressors.

Discussion
We have presented the Cancer LncRNA Census 2, a resource of lncRNAs with functional roles in cancer. We hope this dataset will be of utility to a wide range of studies, from bioinformatic identification of new disease genes, to developing a new generation of cancer therapeutics with anti-lncRNA ASOs (Amodio et al., 2018).
A key novelty of CLC2 is its use of automated gene curation using functional evolutionary conservation. This responds to the challenge arising from the rapid growth of scientific literature, which makes manual curation increasingly impractical. Other automated methods like text mining and machine learning will also be important, although it will be necessary to ensure their predictions are sufficiently accurate in identifying bona fide functional lncRNAs and removing bystanders. Available evidence suggests that the CLIO-TIM pipeline accurately identifies functional cancer genes: in addition to experimental validation of LINC00570 in two cell lines using two perturbations, the entire set of 102 predicted "mutagenesis lncRNAs" carry a range of features consistent with literature-curated cancer lncRNAs: promoter conservation, high expression, differential tumour expression and germline SNP enrichment. Removing cases that overlap other databases, this set amounts to 95 completely novel functional cancer lncRNAs, an invaluable resource for discovery of new molecular pathways and therapeutic targets.
We also integrated hits from the latest CRISPRi screens into CLC2 (Liu et al., 2017). Similar to the "mutagenesis" lncRNAs, the CRISPRi hits have features consistent with cancer functionality.
We recognise that some colleagues may ascribe lower confidence to the novel genes in CLC2 originating from mutagenesis and CRISPR sources. For this reason, the CLC2 data table is organised to facilitate filtering by confidence, to extract only the 375 literature-supported cases, or indeed any other subset based on source, evidence or function as desired by the researcher.
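As an illustration of this kind of filtering, the sketch below reads a small tab-separated table and keeps the genes supported by a chosen source. The column names (`gene`, `source`, `function`), the semicolon-separated encoding of multiple sources and the rows themselves are hypothetical and do not describe the actual layout of the supplementary tables.

```python
import csv
import io
from typing import List

# Invented toy table; real column names and values may differ.
TOY_TABLE = (
    "gene\tsource\tfunction\n"
    "geneA\tliterature\tog\n"
    "geneB\tmutagenesis\tts\n"
    "geneC\tCRISPRi\t\n"
    "geneD\tliterature;mutagenesis\tog\n"
)


def filter_by_source(table_text: str, wanted: str) -> List[str]:
    """Return the genes whose source field lists the wanted evidence source."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return [row["gene"] for row in reader if wanted in row["source"].split(";")]


print(filter_by_source(TOY_TABLE, "literature"))
```

The same pattern extends to the evidence or function fields by testing a different column.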
LINC00570 is a new functional cancer lncRNA predicted by CLIO-TIM. The gene was previously studied by Ørom and colleagues, as a cis-activating enhancer-like RNA named ncRNA-a5 (Ørom et al., 2010). That and a subsequent study showed that perturbation by siRNA transfection affects the expression of the nearby pc-gene ROCK2 in HeLa cells. However, these studies did not investigate the effect on cell proliferation. We here show, by means of two independent perturbations, that LINC00570 promotes proliferation of HeLa and HCT116 cells. These findings make LINC00570 a potential therapeutic target for follow-up.
Intriguingly, amongst the novel mutagenesis lncRNAs identified by CLIO-TIM are genes previously linked to other diseases. miR143HG/CARMEN1 (CARMN) was shown to regulate cardiac specification and differentiation in mouse and human hearts (Ounzain et al., 2015). In addition to being a TIM target, CARMEN1 also contains a germline cancer SNP correlating to the risk of developing lung cancer (Park et al., 2015), adding further weight to the notion that it also plays a role in oncogenesis. Similarly, DGCR5 is located in the DiGeorge critical locus and has been linked to neurodevelopment and neurodegeneration (Johnson et al., 2009), and was recently implicated as a tumour suppressor in prostate cancer. These results raise the possibility that developmental lncRNAs can also play roles in cancer.
In summary, we anticipate that CLC2 will lay the foundation for understanding how lncRNAs are integrated into the molecular events underlying tumorigenesis, and provide targets for a new generation of anti-cancer therapies.

Literature search.
PubMed was searched for publications linking lncRNA and cancer using keywords: long noncoding RNA cancer, lncRNA cancer. Manual curation and assignment of evidence levels to each lncRNA were performed in the same way as previously (Carlevaro-Fita et al., 2020) and included reports until December 2018.
(Liu et al., 2017) we extracted genes with "hit" (validated as a hit in the screen), "LH" (unique identifiers correlating to a gene in the screen) and "lncRNA" (referring to a lncRNA gene and to exclude lncRNA hits close to a protein-coding gene ("Neighbor hit")) resulting in 499 hits.

CLIO-TIM
Of these, 322 hits contain GENCODE IDs and were used for enrichment analysis, tested by one-sided Fisher's test.
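For readers unfamiliar with the test, a one-sided Fisher's exact test on a 2x2 table can be computed directly from the hypergeometric distribution. The sketch below assumes a table whose rows are screen hits versus non-hits and whose columns are known cancer lncRNAs versus all other lncRNAs; this layout and the counts are illustrative only and are not the values used in the study.

```python
from math import comb


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """P-value for enrichment (top-left cell >= a) in the 2x2 table [[a, b], [c, d]].

    a: hits that are known cancer lncRNAs    b: hits that are not
    c: non-hits that are known cancer lncRNAs d: non-hits that are not
    """
    row1 = a + b              # total hits
    col1 = a + c              # total known cancer lncRNAs
    n = a + b + c + d         # all genes tested
    upper = min(row1, col1)   # largest possible value of the top-left cell
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, upper + 1))
    return tail / comb(n, row1)


# Invented example: all 3 hits fall among the 3 known cancer genes out of 6 genes tested.
print(fisher_one_sided(3, 0, 0, 3))
```

With the margins fixed, the probability of the most extreme table in this toy example is 1 / C(6, 3) = 0.05. In practice a library routine such as `scipy.stats.fisher_exact` with `alternative="greater"` would normally be used; the explicit sum is shown only to make the calculation transparent.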

Generation of
Gene-specific RT-PCR and cDNA amplification. From the extracted total RNA, we performed a gene-specific reverse transcription using the reverse primers for LINC00570 and HPRT1 to enrich for their cDNA. Presence or absence of transcript was detected by a regular PCR using GoTaq® G2 DNA Polymerase (Promega, M7841) from 100 ng cDNA and visualized on an agarose gel.
Viability assay. HeLa and HCT116 cells were transfected with Antisense LNA GapmeRs at a concentration of 50 nM using Lipofectamine 2000 (Thermofisher) according to the manufacturer's protocol. One day later, transfected cells were plated in a white, flat 96-well plate (3000 cells/well). Viability was measured in technical replicates using CellTiter-Glo 2D Kit (Promega) following the manufacturer's recommendations at 0, 24, 48 and 72 hours after seeding.
Luminescence was detected with Tecan Reader Infinite 200. Statistical significance was calculated by t-test.
For CRISPR inhibition experiments, HeLa-Cas9 and HCT116-Cas9 cells were transfected with a control sgRNA plasmid and two LINC00570-targeting plasmids. Cells were selected with puromycin (2 µg/ml) for 48 h. The viability assay was performed as described above.