RJunBase: a database of RNA splice junctions in human normal and cancerous tissues

Abstract Splicing is an essential step of RNA processing for multi-exon genes, in which introns are removed from a precursor RNA, thereby producing mature RNAs containing splice junctions. Here, we develope the RJunBase (www.RJunBase.org), a web-accessible database of three types of RNA splice junctions (linear, back-splice, and fusion junctions) that are derived from RNA-seq data of non-cancerous and cancerous tissues. The RJunBase aims to integrate and characterize all RNA splice junctions of both healthy or pathological human cells and tissues. This new database facilitates the visualization of the gene-level splicing pattern and the junction-level expression profile, as well as the demonstration of unannotated and tumor-specific junctions. The first release of RJunBase contains 682 017 linear junctions, 225 949 back-splice junctions and 34 733 fusion junctions across 18 084 non-cancerous and 11 540 cancerous samples. RJunBase can aid researchers in discovering new splicing-associated targets and provide insights into the identification and assessment of potential neoepitopes for cancer treatment.


INTRODUCTION
The majority of human pre-mRNAs contain more than one exon and the process of pre-mRNA splicing, which removes introns and joins exons, is crucial for the maturation of functional mRNAs (1,2). Alternative splicing (AS) is a key factor of enlarging the transcriptome and proteome, allowing mammalian cells to generate distinct protein isoforms from a single gene. The splicing process is accomplished by a megadalton complex of RNAs and proteins that comprises five small nuclear ribonucleoprotein particles (snRNPs; U1, U2, U4, U5 and U6) and >200 splicing regulators (2). Highthroughput sequencing-based methods and their applications in the study of transcriptomes have revolutionized our understanding of RNA splicing in a large number of diseases, especially in human cancers (3). Recent studies on transcriptomic profiling indicated that dysregulated splicing events are widespread in human cancers (4,5), and tumorspecific AS events can contribute to the initiation and progression of cancer, representing potential targets for cancer treatment (6,7). For example, Li et al. demonstrated that a splicing switch from ketohexokinase-C to ketohexokinase-A could drive hepatocellular carcinoma formation, which is dependent on c-Myc and heterogeneous nuclear ribonucleoproteins (8). Furthermore, based on the analysis of RNA splice junctions of tumor cells and tissues, tumor-specific transcripts (TSTs) have been shown to be overexpressed in several malignant tumors (9,10). These newly identified TSTs, such as LIN28B-TST and TST1, play key roles in carcinogenesis, and their high expression is correlated with worse overall survival of cancer patients (10,11). Moreover, the tumor-specific neoepitopes derived from splice junctions can represent promising targets for immunotherapy (12). Therefore, knowledge advances in these research fields have emphasized the need for a coordinated, communitybased effort to construct public data resources that can systematically annotate and integrate important information regarding splicing patterns and functions. A compendium of these splice junctions will aid researchers in deciphering tumor heterogeneity at the junction level and support the discovery of new splicing-related targets.
D202 Nucleic Acids Research, 2021, Vol. 49, Database issue To date, several RNA splicing databases have been developed with different purposes. For example, an exon-exon junction database (JuncDB, http://juncdb.carmelab.huji.ac. il/) was established for the comparison of architectures between orthologous transcripts across 88 eukaryotic species (13). Nellore et al. aligned 21,504 Illumina-sequenced human RNA samples and found 56 861 unannotated novel junctions, which were made available on the intropolis database (http://intropolis.rail.bio) (14). In the scope of human cancers, Xia et al. created the CSCD database (http: //gb.whu.edu.cn/CSCD) to provide a visual interface for exploring the function and regulation of cancer-specific cir-cRNAs based on cancer cell lines (15). Additionally, the FusionGDB database (https://ccsm.uth.edu/FusionGDB) also offers an integrative resource of gene fusions, including cancer-associated transcript fusions across all cancers (16 (17). Although these databases offer insightful information on RNA splicing, there is still a lack of a convenient database for analyzing various forms of splice junctions (linear, back-splice, and fusion junctions), potentially tumor-specific junctions, and unannotated junctions in human tissues.
In this study, we developed a web-accessible database, named RJunBase, to provide information on three types of RNA splice junctions that are derived from RNA-seq data of various non-cancerous and cancerous samples. This tool aims to help users search for their desired targets and obtain information about their splice patterns and expression profile. The whole dataset was designed to be easily browsed, queried, and downloaded through its webpage. In addition, RJunBase is expected to allow researchers submit new splice junctions and their expression profiles of human samples.

Identification of linear junctions
We collected RNA-seq data of 10 283 samples containing primary, metastatic, and normal tissues across 33 cancer types from TCGA and obtained linear junction data from datasets stored at the Genotype-Tissue Expression (GTEx) portal (18). Metastatic and primary samples were considered to be in the same class as cancer samples. TCGA BAM files were aligned to the GRCh38 reference genome using a two-pass method with STAR algorithm. StringTie (version 1.2.3) was used to assemble the BAM files for all samples (19). Linear junctions were identified, quantified, and normalized based on the StringTie output file using the Assembling Splice Junctions Analysis (ASJA) software package, which identifies and characterizes all splice junctions from high-throughput RNA-seq data (9). Raw count data obtained from GTEx were normalized according to: raw counts/sum counts × 10 000 000 (sum counts represent the sum of the counts of all samples), with the scale factor being the same as that adopted by ASJA. We removed all junctions that were present in fewer than two samples of GTEx and TCGA. Gene annotations were downloaded in the GFF/GTF format from GENCODE database (re-lease 29, https://www.gencodegenes.org). Linear junctions were matched onto gene coordinates using the genomic intersection functionality of BEDTools (https://bedtools. readthedocs.io/) and were further classified as intragenic or intergenic junctions. All junctions were annotated with gene information, and junctions in the intergenic region were annotated with information of the downstream gene. In addition, oncogene and tumor suppressor gene annotations were derived from ONGene (20) and TSGene (21), and were pooled together and named tumor-associated genes. Junctions from tumor-associated genes were labeled as 'yes' in 'Tumor associated genes' column in the RJunBase. Five types of alternative splice junctions (intron retention, exon skipping, mutually exclusive exons, alternative 3 , and alternative 5 splice) were extracted from a paper published by André Kahles et al. (4). The genomic coordinates of alternative splice junctions were converted from hg37 to hg38 using the University of California Santa Cruz (UCSC) LiftOver tool.
The unannotated junctions were defined as the junctions that are not annotated in GENCODE. To obtain highconfidence unannotated junctions, we removed those unannotated junctions that were expressed in <10 samples. The majority of the unannotated junctions (89.45%) contained the canonical splice site (GT-AG).
The junctions were identified as tumor-specific junctions when the following criteria were met: NT max = 0 & nTE max = 0 & Freq tumor > 5%; or Tumor median > 10 * NT max & Tumor median > 10 * nTE max & Freq tumor > 5%. Tumor median is the median expression value of tumor samples, NT max is the maximal expression value of TCGA normal samples, nTE max is the maximal expression value of GTEx normal tissues, and Freq tumor is the frequency of junctions expressed in the cancer.

Collection of back-splice and fusion junctions
The location and expression information of back-splice junctions from over 2000 samples across 27 cancer types were obtained from MiOncoCirc database (https: //mioncocirc.github.io/) (22). The total number of backsplice reads for every sample was calculated. The majority of these samples contained 1200-730 000 back-splice reads. Only 37 samples contained <1000 back-splice reads, which were considered to be low-quality data. Therefore, they were filtered to prevent any impact on the subsequent analyses. MiOncoCirc included 25 normal samples, which were removed and not included in the RJunBase for the extreme imbalance between cancer and normal sample sets. Samples with ambiguous classification were also removed. The novel read-through circRNA junctions involving two different genes were not included in the RJunBase because of the lack of detailed expression data.
Fusion junctions were downloaded from a previous report published by Dehghannasiri et al. (23). They developed the DataEnriched Efficient PrEcise STatistical (DEEPEST) fusion detection algorithm (https://github.com/salzmanlab/ DEEPEST-Fusion) to detect gene fusions emerging at annotated exon boundaries. This algorithm used RNA-seq fastq files from 9946 TCGA samples as input data and identified over 30 000 fusions. Gene information for the two types of junctions in the RJunBase was then annotated based on the GENCODE database (release 29). The expression of fusion junctions were standardized as linear junctions according to: raw counts/sum cov × 10 000 000 (sum cov: sum coverage of samples). The expression frequency (and corresponding sample number) and median value in cancer cohorts were calculated for each back-splice and fusion junction.

Visualization of the splicing pattern and expression profile
We used a custom python-based script to plot all relevant forward-and back-splice junctions for each gene. Hosting genes of fusion junctions were efficiently and flexibly visualized in a circular layout using the R software package 'circlize' (24), and the gene length was scaled to 100 bp to facilitate graphical representation. The expression profile was depicted using the 'ggplot2' R package. We stratified samples in the same cancer cohort into two groups (high and low expression groups) under the median (or mean) of junction expression values. The Kaplan-Meier survival curves were plotted using 'ggplot2'.

Names of the three junctions types
A unique name was provided for all junctions. Linear, fusion, and back-splice junction types were named as LS, FS, BS, respectively. Unannotated junctions were named as UN. UP means junction locate in the upstream region of neighbor genes. Annotated linear junctions in the intragenic region were named as Genename LS+ID, where Genename is the hosting gene of the junction and ID is the rank number of the annotated junctions according to genomic location for that hosting gene. Unannotated linear junctions in intragenic regions were named as UN Genename LS+ID, where ID is the counterpart of the unannotated junctions for that hosting gene. Linear junctions in intergenic regions were named as UN Genename LS UP+ID, where Genename is the downstream gene of the junction, and ID is the rank number of unannotated junctions according to genomic location. Intergenic junctions were annotated with information of downstream genes as a large part of these junctions was generated by alternative promoters, which are located in the upstream region of neighbor genes. A similar phenomenon has been reported by Demircioglu et al. (5). Backsplice junctions were named as Genename BS+ID. Fusion junctions were named as Genename1-Genename2 FS+ID, where Genename1 and Genename2 are the two hosting genes of the junction.

Database implementation
The RJunBase was built using MySQL (database server) and Akka 2.6.3 (web server). All data were handled and organized into a MySQL Database Management System (version 5.7). The interface of the website was designed and achieved using the Twirl template engine. The query system for the database was based on Scala script. The website was tested in several distinct popular web browsers such as Google Chrome, Firefox, and Internet Explorer.

Data summary
RJunBase collects and integrates three types of splice junctions from different non-cancerous and cancerous human tissues ( Figure 1 Table 3. The Supplementary Table S1 provides the abbreviations and full names for all 47 cancer types included in the RJunBase database.

Browse the database
Users can browse the information on junctions by clicking the 'Browse' tab on the left-side navigation menu, and the browse page will present the entry point to tables containing all the information on the three types of junctions ( Figure 2A). Users can browse the RJunBase database with various options. For example, the 'Junction type' allows users to browse junctions of interest, such as fusion junctions. The 'Gene type' could be specified to select a particular type of hosting gene for each junction, such as 'protein coding'. If not selected, junctions of all types will be displayed. For fusion junctions, the hosting genes, such as protein coding genes, are connected by '-'. Moreover, RJunBase provides two additional options to screen potentially functional junctions: (i) 'Annotation' indicates whether selected junctions can be extracted from GENCODE database; (ii) 'Tumor-specific types' denotes whether the junctions currently being browsed are tumor-specific. The RJunBase includes tumor-specific junctions that are related to 33 TCGA cancer types.
To obtain the analysis results only a click on the 'Run' button is necessary and a dynamic tabular with optional boxes will be displayed, in which each row and column show a junction and region-specific information respectively, including Junction ID, Junction location, Junction type, Gene symbol, Gene type, Tumor median, Tumor frequency (sam- ple number), NT frequency (sample number), Tumor specific types, Annotation. Users can also obtain Transcript ID, NT median, Normal median, Normal frequency (sample number), and Alternative splicing types by clicking the corresponding menu options ( Figure 2D). By default, each page displays 10 records and the user can view the remaining records through using the pagination features on the bottom right of the table. The number of records displayed in each page can be switched between 10, 25, 50 and 100 using the 'records per page' dropdown menu. Details about each junction can be viewed by clicking the junction ID, and clicking the Gene symbol can direct users to the search results.

Search the database
RJunBase provides a simple search interface and genes of interest can be easily queried ( Figure 2B). Users can enter one (e.g., ERBB2) or more gene symbols that are separated by commas (e.g. ETV4, FGFR2) in the 'Keyword Search' field. Advanced filters, including Junction type, Annotation, Tumor-specific types, and Genomic location, allow users to further filter the results ( Figure 2C). To obtain the basic in-  formation on all junctions related to the input gene, only a click on the 'Search' button is required. The 'Search Results' include a 'Splicing pattern chart' and responsive tables. The 'Splicing pattern chart' displays visualizations of the junctions that are contained in the input gene. By clicking in the 'Linear&Back' button, users will obtain a comprehensive plot in a publication-quality format of all linear and back-splice junctions, and the corresponding exon information from the reference gtf file will also be attached (25). A 'Fusion' button will be displayed if fusion junctions exist in the host gene, which will provide a Circos plot of fusion junctions between the two hosting genes. As for the responsive tables, which provide the annotation and expression information on the three types of junctions, clicking on the JunctionID hyperlinks will lead users to a page containing detailed information about the corresponding junctions.

Detail page and result
Users can obtain basic junction annotations in the summary page. By clicking the 'Expression profile' button, the expression profile of the input junction across all cancer and normal tissues will be shown in boxplots. 'Boxplot' dynamically plots expression profiles of a given junction according to user-defined cancer selections and methods. The RJun-Base also performs survival analysis based on junction expression profiles, allowing users to select their custom cancer types for overall or disease-free survival analysis. For example, to examine the survival curves of liver cancer patients based on the expression of the input junction in their cancer samples, users can select liver hepatocellular carcinoma (LIHC). RJunBase uses the Kaplan-Meier method for hypothesis evaluation, and the thresholds for high/low expression cohorts are adjustable.

Applications of RJunBase: case 1
The user can investigate splice junctions for desired genes in pan-cancer and normal tissues. For example, ETS variant transcription factor 4 (ETV4) holds a potential role in endometrial tumor-specific estrogen receptor binding (26). As a demonstration, we entered 'ETV4' in the search box on the home page or 'Search' page. The search results were classified into three parts: the splicing pattern chart, the table of three junctions, and the junction detail. The splicing pat-tern chart showed that ETV4 contained linear junctions and back-splice junctions ( Figure 3A). Furthermore, the circular layout showed genes fused to ETV4 ( Figure 3B). The search results also listed a table that contains linear, backsplice, and fusion junctions. Considering the median expression value and frequency of tumor, ETV4 LS002 was used as a test research subject by clicking 'ETV4 LS002' and switching to the junction detail page. The summary page showed that this specific junction existed in multiple transcripts, and its expression in cancer was much higher than in pan-cancer and normal tissues. Figure 4A and B shows the expression of this junction in pan-cancer and the GTEx database. Expression of this junction in colon adenocarcinoma, rectum adenocarcinoma, and thyroid carcinoma were presented individually in the 'Expression DIY' tab ( Figure 4C). Survival curves were plotted for several cancers. High 'ETV4 LS002' expression was found to be correlated with decreased overall survival for adrenocortical carcinoma ( Figure 4D) and liver hepatocellular carcinoma (LIHC) ( Figure 4E).

Applications of RJunBase: case 2
The potential functional implication of the RJunBase is to extract tumor-specific junctions. In this second situation, UN SLC22A10 LS0017 was used as an example (Figure 5A), which is a tumor-specific junction in SLC22A10. As shown in Figure 5B, UN SLC22A10 LS0017 was found to be highly expressed in LIHC and breast cancer. Survival analysis revealed that high expression of UN SLC22A10 LS0017 was significantly associated with poorer prognosis in LIHC ( Figure 5C). Therefore, UN SLC22A10 LS0017 may function as an independent prognostic biomarker in patients with LIHC.

DISCUSSION AND FUTURE DIRECTIONS
To the best of our knowledge, RJunBase is the first database of RNA splice junctions, which include linear, fusion and back-splice junctions from 11 540 can-cerous and 18 084 non-cancerous human tissues. Compared with the TCGASpliceSeq database (17), RJunBase offers information on RNA splice junctions of cancer transcriptome that is not limited to AS patterns of proteincoding genes, and enables users to explore and visualize the gene-level splice pattern and the junction-level expression profile. This new database also supports users to assess desired splice junctions with custom query options, such as unannotated junctions and tumor-specific junctions. Additionally, researchers can download full datasets for integrative analysis, which will be beneficial in the search for new splicing-associated targets for cancer treatment.
In accordance with the existence of TSTs across human cancers, our database highlights the abundance of tumorspecific junctions. Tumor-specific junctions may be generated from multiple factors, such as epigenetic changes (chromatin remodeling or the activation of long terminal repeats) and mutations in cis-regulatory sequences or trans-acting splicing regulatory factors (27,28). For example, we previously depicted a TST of lin-28 homolog B, named LIN28B-TST (10). This TST was transcribed from an upstream long terminal repeat that possesses strong promoter activity with high-level H3K27Ac and H3K4me3 histone modifications, and an abundance of RNA polymerase II. The LIN28B-TST-specific junction could be browsed in the RJunBase database (LIN28B LS001). Importantly, neojunctions possess the ability to form a class of potential neoantigens (4). These tumor-specific junctions may act as diagnostic markers and potentially as therapeutic targets for cancer patients. We also discovered plenty of unannotated novel junctions in pan-cancer tissues and non-cancerous tissues, which could be divided into intergenic and intragenic junctions. The intergenic junctions may be generated by epigenetic alterations such as transcriptional activation, whereas the intragenic junctions may be produced by aberrant splicing in a single gene. It should be noted that some unannotated junctions may be by-products of the bioinformatics RNA splicing analysis. Recent advances in long-read sequencing identified full-length transcripts in different species (29)(30)(31). By using information on full-length transcripts and junctions from the RJunBase, users can further understand the splicing characteristics of specific targets to overcome identification of false-positive splice junctions.
The RJunBase will be regularly updated to continuously include more splice junctions of healthy and pathological human tissues, especially cancer tissues, including rare cancers. We believe that RJunBase will serve as an important resource and a powerful tool for studying splice junctions in human cancer transcriptomes and further understanding specific transcript production. Furthermore, the use of this database could potentially lead to the discovery of diagnostic and therapeutic targets for cancer.