FLIBase: a comprehensive repository of full-length isoforms across human cancers and tissues

Abstract Regulatory processes at the RNA transcript level play a crucial role in generating transcriptome diversity and proteome composition in human cells, impacting both physiological and pathological states. This study introduces FLIBase (www.FLIBase.org), a specialized database that focuses on annotating full-length isoforms using long-read sequencing techniques. We collected and integrated long-read (351 samples) and short-read (12 469 samples) RNA sequencing data from diverse normal and cancerous human tissues and cells. The current version of FLIBase comprises a total of 983 789 full-length spliced isoforms, identified through long-read sequences and verified using short-read exon–exon splice junctions. Of these, 188 248 isoforms have been annotated, while 795 541 isoforms remain unannotated. By overcoming the limitations of short-read RNA sequencing methods, FLIBase provides an accurate and comprehensive representation of full-length transcripts. These comprehensive annotations empower researchers to undertake various downstream analyses and investigations. Importantly, FLIBase exhibits a significant advantage in identifying a substantial number of previously unannotated isoforms and tumor-specific RNA transcripts. These tumor-specific RNA transcripts have the potential to serve as a source of immunogenic recurrent neoantigens. This remarkable discovery holds tremendous promise for advancing the development of tailored RNA-based diagnostic and therapeutic strategies for various types of human cancer.


Introduction
The regulation of transcriptome diversity and proteome composition in human cells involves a range of RNA-level regulatory processes ( 1 ,2 ).These processes, including selectable transcription start sites (TSSs) ( 3 ), splicing ( 4 ) and polyadenylation (polyA) sites ( 5 ), play a crucial role in generating a diverse array of messenger RNA (mRNA) and protein isoforms with distinct structures and functions.Advances in high-throughput RNA sequencing (RNA-seq) and computational methods have greatly facilitated the analysis of transcript changes and the discovery of new transcripts in both normal physiological and pathological states ( 6 ,7 ).
One notable study by Demircioglu et al . utilized The Cancer Genome Atlas (TCGA) RNA-seq data to identify a significant number of cancer-associated TSSs and promoters ( 3 ).Another study conducted by Hu et al. involved reference-based transcript assembly using RNA-seq data obtained from over 1000 cancer cell lines.The findings of this study unveiled numerous transcripts that were previously unidentified ( 8 ).Importantly, specific isoform switches have been observed in various diseases, including human cancer.Tumor-preferring transcripts can promote cancer development and progression, making them attractive targets for cancer therapy.For instance, in hepatocellular carcinoma, the short isoform of BIN1 is ex-D 125 pressed in normal tissue, while the upregulated long isoform contributes to carcinogenesis ( 9 ).Moreover, the identification of tumor-specific RNA transcripts (tumor-SRTs), which are expressed exclusively in tumor tissues, holds great potential for cancer diagnosis and treatment.Zheng et al. demonstrated that tumor-specific transcripts could be detected in blood samples from patients with hepatocellular carcinoma, underscoring their potential as diagnostic markers ( 10 ).Oncogenic coding and noncoding tumor-SRTs have been identified in numerous types of cancer (11)(12)(13).Furthermore, tumor-specific transcripts have been recognized as a source of immunogenic recurrent neoantigens ( 14 ).
Given the importance of isoform analysis, several databases have been developed to facilitate the investigation of transcripts and alternative splicing.Notable examples include RJunBase ( www.RJunBase.org) ( 15 ), TCGASpliceSeq ( https:// bioinformatics.mdanderson.org/TCGASpliceSeq) ( 16 ) and In-troVerse ( https:// rytenlab.com/browser/ app/ introverse ) ( 17 ), which provide information on the alternative splicing sites of annotated and unannotated transcripts.However, these databases do not focus on the full-length characterization of transcripts.Other resources, such as GEPIA2 ( http:// gepia2.cancer-pku.cn/)( 18 ), SRTdb ( http://www.shenglilabs.com/ SRTdb/ ) ( 19 ) and TAiC ( http://www.shenglilabs.com/TAiC/) ( 8 ), offer valuable resources for transcript analysis.Nevertheless, it is crucial to acknowledge the limitations of de novo transcript reconstruction methods relying on shortread RNA-seq, as they often encounter challenges leading to incomplete or inaccurate assemblies.In contrast, long-read sequencing (LR-seq) technologies, such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), have the advantage of producing longer reads ( > 10 kb) ( 20 ), which can accurately capture full-length isoforms.This approach eliminates the need for reference-based transcript assembly, ensuring a more comprehensive representation of transcripts.However, there remains a lack of databases that specifically annotate transcripts based on LR-seq.
To bridge this gap, we have developed FLIBase, a webbased database that focuses on the annotation of full-length isoforms using LR-seq techniques.Our database incorporates meticulous curation processes, utilizing long-read RNA-seq data from distinct normal and cancerous human tissues and cells obtained from public databases.Recognizing the limited depth of full-length sequencing data for transcript quantification, we integrated short-read RNA-seq data from human cancer samples and normal tissues sourced from the TCGA and Genotype-Tissue Expression (GTEx) databases, respectively.FLIBase provides extensive information for each transcript, including its complete nucleotide sequence, precise splice site annotations, expression levels, coding potential and predicted open reading frames (ORFs).These comprehensive annotations empower researchers to undertake various downstream analyses and investigations.Notably, FLIBase offers a significant advantage in uncovering a substantial number of previously unannotated isoforms and tumor-SRTs.This remarkable discovery holds tremendous potential for advancing the development of RNA-based diagnostic and therapeutic strategies customized for various tumor types.
The raw PacBio sequencing data (BAM files) were processed using the Iso-Seq (version 3.4.0)workflow, which involved the following steps: (i) Subreads of PacBio sequencing data were combined into circular consensus sequence reads with a prediction accuracy threshold of 0.9.(ii) Full-length reads were generated by removing the 5 and 3 cDNA primers using Lima with default parameters.Full-length nonchimeric (FLNC) reads were obtained by filtering out artificial concatemer reads and polyA tails.(iii) High-quality transcripts (FASTQ file) were generated by clustering FLNC reads using the isoseq3 cluster.Processed or downloaded FASTQ reads of PacBio and ONT data were then mapped to the human genome (hg38, GENCODE v35) using minimap2 (version 2.17) ( 35 ).The mapped ONT sequencing data (SAM files) were processed with the TALON pipeline (version 5.0) ( 36 ) to obtain a merged GTF transcriptome.PacBio isoforms were subsequently collapsed using cDNA_Cupcake (version 19.0.0).Full-length transcripts from all samples were merged into a nonredundant transcriptome using gffcompare (version 0.11.2) ( 37 ).

RNA-seq data collection
Junction quantification data from STAR output and 9870 BAM files of RNA-seq data across 33 different human cancer types were retrieved from TCGA under accession number phs000178.v11.p8.Additionally, 2599 BAM files of RNA-seq data across 22 human normal tissue types were downloaded from the GTEx with accession number phs000424.v8.p2.

Isoform annotation and filtering
Transcripts in the merged GTF file were retained if all splice junctions were detected in at least five samples in the RNA-seq data.Subsequently, isoforms were classified into six groups by comparing them with a reference transcriptome (GENCODE v.35) ( 38 ) using SQANTI3 (version 1.6) ( 39 ) and gffcompare.A full-splice match (FSM) was defined as an exact match to known isoforms, while incomplete splice match (ISM) transcripts were those contained in the reference with compatible introns.Novel in catalog (NIC) isoforms harbored known splice sites but novel splice junctions, whereas novel not in catalog (NNC) referred to isoforms with at least one splice site not present in GENCODE v .35.Additionally , we identified unannotated genomic regions that included antisense and intergenic transcripts in the LR-seq data.All ISM transcripts, which could be a result of RNA degradation and incomplete reverse transcription, were removed from the whole tran-D 126 Nucleic Acids Research , 2024, Vol.52, Database issue scriptome.Cage peaks and polyA sites were annotated using SQANTI3.ORFs in LR-seq transcripts were predicted using GeneMark (version 5.1) ( 40 ).Transcripts were predicted to elicit nonsense-mediated decay (NMD) when the stop codon was located > 50 nucleotides upstream of the last exon-exon junction.TSSs that overlapped with transposable elements (TEs) within ±100 bp, as determined by the BEDTools suite (version 2.29.2) ( 41 ), were considered as TE-derived transcripts.We proceeded with comparable isoform identification and transcript consolidation, adhering to the following criteria: allowing a maximum of three exonic variations, with a defined threshold of no more than four base pairs per exon.The upper limit for permissible dissimilarity stood at 100 for both transcription start and termination sites.Regarding coding transcripts, it was imperative to ascertain the absence of differences in the ORFs.Ultimately, a predilection was manifested for retaining transcripts of extended length.The entire process was facilitated by a dedicated Python code, accessible on Figshare ( https:// doi.org/ 10.6084/ m9.figshare.23805993.v2).

RNA-seq data processing
TCGA and GTEx BAM files were converted to FASTQ files using Samtools (version 1.7).Transcript expression, measured in transcripts per kilobase million (TPM), was quantified by mapping RNA-seq reads to long-read isoforms using Salmon (version 1.5.2) ( 42 ).

Specific RNA transcripts
Tumor-SRTs detected in LR-seq are defined as those found exclusively in tumor samples and not in normal samples.Tumor-SRTs identified through RNA-seq are determined based on the following criteria: (i) The median expression level of the transcript in tumors is at least 10-fold higher than the maximum expression level observed in all normal tissues (excluding the testis) within the GTEx dataset, as well as adjacent normal tissues within the TCGA dataset.(ii) The transcript is expressed (TPM > 0.5) in > 5% of tumor samples from at least one TCGA cancer type.
The definition of tissue-specific RNA transcripts (tissue-SRTs) follows a methodology similar to a previous study ( 19 ).To calculate the specificity score for each transcript, the following formula is used: Here, S t represents the specificity score of transcript t , N is the total number of tissue types and p it denotes the expression ratio of transcript t in tissue type i .Each transcript is assigned one specificity score and N expression ratios.The expression ratio of each transcript across all tissue types is calculated as , where x it represents the expression value of transcript t in tissue type i .A transcript is considered tissue specific if the following conditions are met: (i) The largest expression ratio is > 2 times greater than the second largest expression ratio.(ii) The specificity score is > 1.

Clinically associated transcripts
The Wilcoxon rank-sum test was employed to assess the statistical differences in expression levels of transcripts between tumor and paired adjacent normal tissues.Multiple testing corrections were performed using the Benjamini-Hochberg method.Transcripts with a false discovery rate (FDR) < 0.05 were considered significantly differentially expressed.The statistical difference in overall and progression-free survival between the high-risk and low-risk groups was assessed using a log-rank test.Hazard ratios and P -values were calculated using the 'survdiff' function implemented in the R package survival (version 3.2-3).The Kruskal-Wallis test was used to determine the statistical differences in expression levels of transcripts among patients with different tumor stages.An FDR < 0.05 was considered significant.

Visualization of transcript structure and the expression profile
Each gene's transcripts were graphically presented using Bioconductor's trackViewer (version 1.24.2) ( 43 ), including detailed representation of TEs.Expression profiles of long-read, TCGA and GTEx short-read sequencing data were plotted using the R package ggplot2 (version ≥3.4.0).The optimal cutoff point was calculated using the 'surv_cutpoint' function implemented in the R package survminer (version 0.4.9).Kaplan-Meier survival curves were generated using the R package survminer.

Data summary
FLIBase serves as a comprehensive resource by integrating both long-read (351 samples) and short-read (12 469 samples) RNA-seq data from diverse human tissues and cells, encompassing both normal and cancerous samples (Figure 1 ).The current version of FLIBase comprises a total of 983 789 full-length spliced isoforms, which have been detected using long-read sequences and further validated through short-read exon-exon splice junctions.Among these isoforms, 188 248 are annotated transcripts, while 795 541 are unannotated transcripts.The unannotated transcripts within FLIBase can be classified into four categories: NIC (436 855), NNC (344 752), antisense (6108) and intergenic (7826).
The transcriptome landscape varies across different sample types.For the LR-seq data, the number of identified transcripts in each normal tissue, cell line or cancer type ranges from 5201 in a neuroendocrine tumor sample to 347 049 in 19 stem cell samples (Table 1 ).Pertinent details concerning sample count and sequencing platform are likewise documented in Table 1 .Moreover, in the TCGA dataset, a total of 847 154 transcripts have been identified across 33 distinct cancer types.Notably, the number of expressed transcripts varies among these cancers (Table 2 ), with the lowest count of 259 510 transcripts observed in lymphoid neoplasm diffuse large B-cell lymphoma (DLBC) and the highest count of 688 614 transcripts in breast invasive carcinoma (BRCA).Similarly, the GTEx dataset encompasses 687 556 transcripts across 22 tissues, with transcript counts ranging from 240 968 in the cervix to 466 119 in the prostate (Table 3 ).It is worth highlighting that a substantial number of unannotated isoforms are expressed across multiple sample types, consistent with previous research find-  ings ( 21 , 28 , 34 ).A gene rank indicates the number of detected transcripts across the entire dataset, as well as separately for normal and tumor samples.To address this, we have included relevant information in Supplementary Tables S1-S3.Supplementary Table S4 presents the results of transcripts and tumor-SRTs detected in TCGA samples for each cancer type, while Supplementary Table S5 provides the results of transcripts and tissue-SRTs identified in GTEx samples for each tissue type.

Web design and interface
FLIBase offers a user-friendly interface and an extensive range of functionalities for analyzing, searching and downloading desired results.Figure 2 A showcases the main user interfaces of the database.The 'Analyses' page houses four key function modules (Figure 2 B): (i) 'Specific RNA Transcripts' enables users to query tumor-and tissue-SRTs; (ii) 'Differential Expression Analysis' facilitates the investigation of differentially expressed isoforms between tumors and their paired adjacent normal tissues; (iii) 'Survival Analysis' allows for the identification of transcripts significantly associated with patients' overall and progression-free survival within each cancer type; and (iv) 'Stage' enables users to explore transcripts significantly associated with tumor stage.For instance, leveraging the TCGA and GTEx RNA-seq datasets, users can extract all transcripts expressed in any cancer type but not in normal tissues.
The 'Search' page provides users with three distinct search options: gene name, genomic location and transcript information (Figure 2 C).These search methods allow users to retrieve specific transcripts of interest.The genomic location search enables users to specify chromosomes, strands, and genomic start and end positions, facilitating precise transcript retrieval.Additionally, users can utilize specific transcript information to further refine their search results.This includes comparing transcript types to a reference, checking the annotation status of the first exon in a transcript, assessing the coding ability of the transcript and identifying overlaps between TEs and TSSs.By way of illustration, searching for the gene name PTEN enables users to access all detected transcripts associated with PTEN and their relative genomic positions (Figure 2 D and E).Furthermore, users can utilize column-based filters within the results field to pinpoint the specific transcript(s) of interest.Once researchers navigate to the details page, FLIBase provides a comprehensive range of information for each transcript (Figure 2 F).This includes the transcript ID, associated gene symbol, transcript type in comparison to the reference (FSM, NIC, NNC, antisense and intergenic), transcript length, number of exons, complete nucleotide sequence, genomic location, annotation status of the first exon, coding capacity, prediction of NMD activation, overlap between TSS and CAGE peaks, polyA motif, predicted ORF, and overlap between TEs and TSSs within ±100 bp.

Applications of FLIBase
FLIBase, with its comprehensive annotation of full-length transcripts, offers valuable insights into the dysregulation of genes and their potential roles in oncogenesis.By utilizing the 'Specific RNA Transcripts' function within the 'Analyses' section of FLIBase, we investigated the dysregulation of the receptor tyrosine kinase c-MET (MET) and its implications in cancer ( 44 ).
Within the FLIBase database, we identified multiple tumorspecific unnamed MET transcripts when filtering the MET genes in the results column.Considering the reported genomic alterations of MET, such as gene amplification, mutations and fusions, these unnamed transcripts provide new avenues for exploring the diverse landscape of MET dysregulation in cancer.One specific transcript, MET-u1 (TCONS_00923684), was selected as an example due to its unannotated TSS region, which overlaps with L1PA2, a LINE-1 subfamily TE.This finding is of interest since TEs have been associated with the activation of oncogenes, suggesting a potential role for MET-u1 in oncogenic processes.
MET-u1 is a full-length transcript spanning 5581 bp, classified as an NNC-type transcript, and consisting of 21 exons (Figure 3 A).Further details about MET-u1, including its sequence and other relevant information, can be accessed in the FLIBase database.Analysis of the LR-seq dataset revealed the detection of MET-u1 exclusively in esophageal, gastric and liver cancers, while its expression was absent in normal tissues (Figure 3 B).Consistent with the LR-seq data, the TCGA RNA-seq dataset demonstrated the expression of MET-u1 across various cancer types, exhibiting notable heterogeneity among patients (Figure 3 C).In particular, esophageal and gastric cancers exhibited the highest levels of MET-u1 expression, aligning with the LR-seq findings.Conversely, analysis of normal tissues within the GTEx dataset showed no expression of MET-u1, as depicted in Supplementary Figure S1.Subsequently, specific cancer types were selected to explore the differential expression, survival and stage associations of MET-u1.In the differential expression analysis, we showed that MET-u1 expression was restricted to colon adenocarcinoma (COAD) and lung squamous cell carcinoma (LUSC) tumors, while it was absent in corresponding paracancerous tissues (Figure 3 D).This observation indicates potential tumor-specific regulation of MET-u1 expression.Figure 3 E demonstrates that high MET-u1 expression is significantly associated with poor prognosis in lung adenocarcinoma (LUAD) and pancreatic adenocarcinoma (PAAD), highlighting the potential prognostic value of MET-u1 in these cancer types.

Discussion and perspectives
The ENCODE study revealed that a significant portion of the genome (62% of genomic bases) has the potential to be transcribed, indicating the presence of numerous unidentified transcription products ( 45 ).However, the prevalent shortread RNA-seq technology poses challenges in fully assembling full-length mRNA isoforms.To address this limitation, we have developed the first comprehensive database of fulllength transcripts in human cancer and normal tissues, utilizing LR-seq data.Our database offers a range of analysis, search and download functions, enabling comprehensive exploration and filtering of transcriptomic data.Importantly, we have identified a substantial number of unannotated and tumor-specific transcripts, providing opportunities for investigating their roles in cancer development.For instance, the database allows users to search for an intergenic noncoding transcript derived from an long terminal repeat (LTR), which was previously reported to exert oncogenic activities in promoting hepatocellular carcinoma in vitro and in vivo ( 10 ).
Dysregulation of RNA-related processes can significantly impact the transcriptomic and proteomic landscape of can-D 129    cer.Emerging evidence suggests that the dysregulation of transcriptional processes associated with tumorigenesis could serve as a broad yet unexplored target for novel immunotherapy approaches.Alternative splicing events, such as intron retention ( 46 ), exitron generation ( 47 ) and TE-associated splicing ( 48 ), contribute to proteome diversity and can generate tumor-specific neoantigens.Long-read RNA-seq technology proves valuable in accurately inferring ORFs and reliably predicting NMD transcripts due to its ability to accurately capture full-length transcripts.In a recent study, aberrant splice isoforms detected through full-length transcriptome sequenc-ing were identified as potential neoantigens in non-small cell lung cancer ( 32 ).As full-length sequencing technology continues to advance, the availability of LR-seq data from human cancerous and normal tissues will increase.FLIBase will be regularly updated to incorporate additional data from diverse tissue types.Furthermore, recent studies have employed LR-seq at the singlecell level ( 49 ,50 ).Thus, we plan to integrate single-cell fulllength RNA-seq data into our database to explore isoform heterogeneity across different cell types and states.Through more rigorous investigation and mining of the cancer tran-D 132 Nucleic Acids Research , 2024, Vol.52, Database issue scriptome at the transcript level, we anticipate uncovering new insights into cancer diagnostics and therapeutics.

D 127 Figure 1 .
Figure 1.Analytical process for data collection and processing in the FLIBase database.The process includes LR-seq data collection and processing, RNA-seq data collection, isoform annotation and filtering, transcript quantification and downstream analyses.

FFigure 2 .
Figure 2. Ov ervie w of FLIB ase.( A ) Main user interf aces of the database.( B ) Four k e y modules in the 'Analy ses ' section: 'Specific RNA Transcripts ', 'Differential Expression Analysis', 'Survival Analysis' and 'Stage'.( C ) The 'Search' section offers three different search options: gene name, genomic location and transcript information.( D ) An example of querying a gene symbol on the 'Search' page.( E ) An example of the str uct ure of all transcripts for a specific gene.( F ) An example of detailed information about a transcript.

Figure 3 .
Figure 3.An example of tumor-SRTs.( A ) Detailed transcript information of MET-u1.( B ) Detection of MET-u1 in various types of LR-seq data samples.( C ) Dot plot showing the expression level of MET-u1 in TCGA tumor samples.( D ) Box plots illustrating the expression of MET-u1 in paired COAD and LUSC samples.( E ) Kaplan-Meier plots demonstrating o v erall surviv al associated with MET-u1 e xpression in LUAD and PAAD.

Table 1 .
Distribution of transcripts detected in various samples by LR-seq

Table 2 .
Distribution of transcripts expressed in the TCGA RNA-seq datasets

Table 3 .
Distribution of transcripts expressed in the GTEx RNA-seq datasets D 130 Nucleic Acids Research , 2024, Vol.52, Database issue