scTEA-db: a comprehensive database of novel terminal exon isoforms identified from human single cell transcriptomes

Abstract The usage of alternative terminal exons results in messenger RNA (mRNA) isoforms that differ in their 3′ untranslated regions (3′ UTRs) and often also in their protein-coding sequences. Alternative 3′ UTRs contain different sets of cis-regulatory elements known to regulate mRNA stability, translation and localization, all of which are vital to cell identity and function. In previous work, we revealed that ∼25 percent of the experimentally observed RNA 3′ ends are located within regions currently annotated as intronic, indicating that many 3′ end isoforms remain to be uncovered. Also, the inclusion of not yet annotated terminal exons is more tissue specific compared to the already annotated ones. Here, we present the single cell-based Terminal Exon Annotation database (scTEA-db, www.scTEA-db.org) that provides the community with 12 063 so far not yet annotated terminal exons and associated transcript isoforms identified by analysing 53 069 publicly available single cell transcriptomes. Our scTEA-db web portal offers an array of features to find and explore novel terminal exons belonging to 5538 human genes, 110 of which are known cancer drivers. In summary, scTEA-db provides the foundation for studying the biological role of large numbers of so far not annotated terminal exon isoforms in cell identity and function.


Introduction
Most of the transcript isoform expression variation across human tissues is caused by the use of alternative promoters and alternative transcript 3′ ends (1). The latter are generated by endonucleolytic cleavage and polyadenylation of the nascent RNA, which is mediated by a large molecular machinery, the so-called 3′ end processing complex. The complex recognizes specific sequence motifs located in the vicinity of the 3′ end processing sites, also called polyadenylation [poly(A)] sites (2). The most prominent RNA 3′ end processing signal is the hexameric consensus motif 'AAUAAA' (3), termed the canonical poly(A) signal, which is positioned ∼21 nucleotides upstream of poly(A) sites.
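As a purely illustrative sketch of the positional relationship described above, the following Python snippet scans a fixed window upstream of a given cleavage position for the canonical 'AAUAAA' hexamer and reports its distance to the site. The function name, the 40-nucleotide window and the convention of measuring from the first nucleotide of the hexamer are assumptions made for this example and are not taken from the scTEA-db workflow.

```python
SIGNAL = "AAUAAA"  # canonical poly(A) signal


def find_poly_a_signals(rna, cleavage_site, window=40):
    """Return distances (nt) from each AAUAAA start to the cleavage site.

    `cleavage_site` is a 0-based index into `rna`; only the `window`
    nucleotides immediately upstream of it are searched (assumed value).
    """
    start = max(0, cleavage_site - window)
    upstream = rna[start:cleavage_site]
    distances = []
    pos = upstream.find(SIGNAL)
    while pos != -1:
        distances.append(cleavage_site - (start + pos))
        pos = upstream.find(SIGNAL, pos + 1)
    return distances


# Toy sequence: the hexamer starts at index 30, the cleavage site is at index 51.
toy_rna = "G" * 30 + SIGNAL + "C" * 15 + "A" * 10
signal_distances = find_poly_a_signals(toy_rna, cleavage_site=51)
print(signal_distances)
```

In this toy sequence the hexamer starts 21 nucleotides upstream of the chosen cleavage position, mirroring the typical spacing mentioned in the text.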
The vast majority of human genes have multiple poly(A) sites (4) and the alternative cleavage and polyadenylation (APA) of these sites gives rise to alternative terminal exons that can encode alternative coding sequences and/or 3′ untranslated regions (3′ UTRs). The latter harbour cis-regulatory elements, such as microRNA and RNA binding protein (RBP) binding sites, known to impact the stability, translation and localization of the transcripts (2) and even the localization of the encoded proteins (5).
In humans, the repertoire of 3′ UTRs varies substantially across tissues. Whereas in ovary, testis and embryonic stem cells, 3′ UTRs are short, the longest 3′ UTRs are found in neurons (6). Importantly, APA is dynamically regulated (7) and was reported to play important roles in cell identity and function. For example, the switching of alternative terminal exons is key to neuronal differentiation, where hundreds of genes switch towards using alternative terminal exons whose 3′ UTRs contain binding sites for the muscleblind-like protein 1 (MBNL1) and MBNL2 RBPs. Strikingly, the MBNL1/2 RBP binding sites mediate the transport of the alternative transcript isoforms from the cell soma to the neural projections (8).
Despite the important role that accurate poly(A) site use plays in cell identity and healthy cell function [reviewed in (2)], large numbers of 3′ end transcript isoforms appear to be unknown to date and thus remain to be identified and annotated. Various research groups have developed RNA sequencing protocols, termed 3′ end sequencing, that are specifically designed to capture the 3′ end of transcripts (2). In previous work, we have performed a comprehensive analysis of 3′ end sequencing datasets and found that 24.8% of the detected transcript 3′ ends are located within regions currently annotated as intronic (9). To enable the identification and annotation of transcript isoforms that end at intronic loci, we have previously developed the Terminal Exon Characterization tool (TECtool). Using TECtool to study full-length single cell RNA sequencing (scRNA-seq) data of 201 T cells, we unveiled an abundance of unknown 3′ end transcript isoforms (10).
Within this study, we have followed up our initial screening of 201 T cells at a much larger scale in order to identify and annotate currently unknown terminal exons and corresponding 3′ end isoforms at single cell resolution, covering 101 human cell types and 23 tissues (Table 1). Our single cell-based Terminal Exon Annotation database (scTEA-db, www.scTEA-db.org) provides an abundance of so far unknown terminal exons (68 615 cases, 12 063 of which are unique). scTEA-db not only offers a vastly extended 3′ end isoform annotation, but also implements an array of functions that allow users to find and investigate terminal exons of interest, thereby providing the community with a tool required for studying the biology of large numbers of so far unexplored alternative terminal exon transcript isoforms.

Materials and methods
The scTEA-db data selection, analysis and integration workflow (Figure 1) encompassed the steps outlined below.

Dataset selection and downloading
Dataset selection focused exclusively on full-length transcript sequencing technology, in particular scRNA-seq libraries created with the Smart-seq2 protocol (11), enabling the identification and annotation of terminal exons and isoforms using TECtool (10).
To obtain datasets suitable for our analysis, publicly available Smart-seq2 datasets were collected (Supplementary Tables S1 and S2) from the NCBI-SRA (12) and the EMBL-EBI (13) databases based on a set of selection criteria (see Supplementary Materials). The download of the data was carried out using the SRAtoolkit 3.0.3 (12) with default parameters.

Table 1. For each tissue (sorted alphabetically), the number of novel terminal exons (TEs), the number of considered cells, the number of cell types and the number of analysed datasets are shown.

Figure 1. Schematic representation of the scTEA-db data curation, processing and integration. After scRNA-seq data selection and quality control (top panel in green), the sequencing reads of samples with sufficient quality are processed and aligned to the human reference genome using the STAR aligner (16). Subsequently, unknown terminal exons (TEs) and associated isoforms are annotated using TECtool (10) (middle panel in blue). scTEA-db enables users to interactively filter and explore the terminal exon data across cell types and tissues (bottom panel in red).

Nucleic Acids Research, 2024, Vol. 52, Database issue

Data processing
The data analysis workflows were implemented using the Snakemake workflow management system (14). The analysis was split into two major workflows: first, the data quality control part (Supplementary Figure S1), and second, the identification of novel terminal exons and isoforms (Supplementary Figure S2).
In the first part, data quality control was performed by aligning the sequencing reads to the human genome version GRCh38.102 (15) and Ensembl gene annotation version GRCh38.102 (15), which was carried out using the STAR aligner version 2.7.1a (16). Then, the following criteria were used to filter out low-quality scRNA-seq libraries: a minimum sequencing read mapping rate of 0.50 and a minimum unique mapping rate of 0.50 as reported by the STAR aligner (16). Further, a maximum intergenic mapping rate of 0.50 and a maximum ribosomal RNA rate of 0.50 were tolerated, both of which were obtained using the RSeQC tool (17).
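The four quality control criteria above can be expressed as a simple per-library filter. The following Python sketch is only an illustration: the threshold values are those stated in the text, whereas the record layout, the field names and the treatment of values exactly at a threshold as passing are assumptions, not the actual Snakemake implementation.

```python
from dataclasses import dataclass

# Thresholds as stated in the text.
MIN_MAPPING_RATE = 0.50
MIN_UNIQUE_MAPPING_RATE = 0.50
MAX_INTERGENIC_RATE = 0.50
MAX_RRNA_RATE = 0.50


@dataclass
class LibraryQC:
    """Per-library QC summary (hypothetical field names)."""
    sample_id: str
    mapping_rate: float         # fraction of mapped reads (STAR)
    unique_mapping_rate: float  # fraction of uniquely mapped reads (STAR)
    intergenic_rate: float      # fraction of intergenic reads (RSeQC)
    rrna_rate: float            # fraction of ribosomal RNA reads (RSeQC)


def passes_qc(lib):
    """Return True if the library meets all four criteria.

    Boundary values are treated as passing (an assumption).
    """
    return (
        lib.mapping_rate >= MIN_MAPPING_RATE
        and lib.unique_mapping_rate >= MIN_UNIQUE_MAPPING_RATE
        and lib.intergenic_rate <= MAX_INTERGENIC_RATE
        and lib.rrna_rate <= MAX_RRNA_RATE
    )


# Invented example libraries, for illustration only.
libraries = [
    LibraryQC("cell_A", 0.92, 0.85, 0.05, 0.02),
    LibraryQC("cell_B", 0.41, 0.30, 0.10, 0.03),  # low mapping rates
    LibraryQC("cell_C", 0.88, 0.80, 0.62, 0.01),  # high intergenic rate
]
kept = [lib.sample_id for lib in libraries if passes_qc(lib)]
print(kept)
```

Only libraries that satisfy all criteria simultaneously would proceed to the terminal exon identification step.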
In the second part, the identification and annotation of novel terminal exons and associated isoforms was performed using TECtool (10). As TECtool runs on single-end read alignments, a Snakemake pipeline was used that first utilizes the STAR aligner to map the sequencing reads in single-end mode; i.e. the alignment was conducted separately for each sequencing read of a pair (fastq1 and fastq2 files). Then, TECtool was run on each BAM file using poly(A) site annotations from the PolyASite 2.0 atlas (18) and the same Ensembl gene annotation and genome versions as utilized for the first part (see above). Finally, Sashimi plots (19) were created for each terminal exon identified and annotated by TECtool (10).

Data integration and curation
The StringTie tool (20) was used to merge and unify the terminal exon isoforms identified from individual single cells (see Supplementary Materials). The nucleotide frequencies around the 3′ ends of novel terminal exons were comparable to those observed around already annotated terminal exons (Supplementary Figure S3). Finally, the cell type and tissue nomenclature was unified by using the manually curated mapping reported in Supplementary Table S3.

scTEA-db web application
The scTEA-db web portal was developed using the R Shiny (21) and tidyverse (22) packages. The human body map that visualizes the number of terminal exons identified in each tissue was implemented using the 'gganatogram' R package (23). The layout of the scTEA-db Shiny application was further customized using hypertext markup language and cascading style sheets.

Results
scTEA-db is available at www.scTEA-db.org and supports secure communication and data transfer between the web portal and the user. scTEA-db can be accessed via desktop computers and mobile devices, although we recommend using a reasonably large display for an optimal experience. Below, we provide a roadmap describing the main features and functionalities offered by the scTEA-db web portal (Figure 2).

Finding terminal exons of interest
To enable users to find terminal exons of interest, the scTEA-db web portal selection menu (Figure 2A) provides the following filtering options:
'Sex / Stage': Enables users to select a biological system of interest. Currently possible options are male, female, foetal or any. The latter enables researchers to investigate all terminal exons available within scTEA-db, i.e. independent of their detection within a specific biological system of interest.
'Gene' (required field): Allows users to search for terminal exons that belong to specific genes. The user is free to search for an individual gene of interest, a set of genes or all genes (option 'Select All'). When the latter is chosen, all genes for which terminal exons have been identified within the selected biological system of interest will be considered. Please note that the downstream selectors 'Tissue', 'Cell type', 'Dataset' and 'Samples' will be automatically updated to show only values available for the selected gene or gene set.
'Tissue', 'Cell type', 'Datasets' and 'Samples' selectors: The user can further limit the search to specific tissues, cell types, and even datasets and samples. As for the 'Gene' selector, all downstream selectors are automatically filtered according to the user selection; e.g. when a specific tissue is selected, only the cell types that are part of this tissue remain available.
Once all selectors are filled, the corresponding filtering is performed upon pressing the 'Apply filtering' button. To clear all selectors and start a new search, the user can press the 'Clear all filters' button.
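The cascading behaviour of the selectors described above can be sketched as follows. The portal itself is implemented in R Shiny; this Python sketch merely illustrates the logic, and the record fields, gene names and function names are hypothetical.

```python
# Hypothetical terminal exon records; the real database has more fields.
records = [
    {"gene": "GENE1", "tissue": "blood", "cell_type": "T cell"},
    {"gene": "GENE1", "tissue": "brain", "cell_type": "neuron"},
    {"gene": "GENE2", "tissue": "blood", "cell_type": "B cell"},
]

# Selectors from most upstream to most downstream (assumed order).
SELECTOR_ORDER = ["gene", "tissue", "cell_type"]


def apply_filters(rows, selection):
    """Keep rows that match every selector present in `selection`.

    `selection` maps a field name to the set of accepted values.
    """
    return [
        row for row in rows
        if all(row[field] in values for field, values in selection.items())
    ]


def available_options(rows, selection, field):
    """Values offered for `field`, restricted by upstream selections only."""
    upstream = SELECTOR_ORDER[: SELECTOR_ORDER.index(field)]
    narrowed = apply_filters(
        rows, {f: v for f, v in selection.items() if f in upstream}
    )
    return sorted({row[field] for row in narrowed})


# Selecting a gene narrows the tissues; selecting a tissue narrows cell types.
tissues = available_options(records, {"gene": {"GENE1"}}, "tissue")
cell_types = available_options(
    records, {"gene": {"GENE1"}, "tissue": {"brain"}}, "cell_type"
)
print(tissues, cell_types)
```

Restricting each selector by its upstream selections only is one possible design; it ensures that a downstream choice never removes options from a selector above it.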

Exploring terminal exons of interest
Once specific filters have been applied by the user (as described above), the resulting set of terminal exons can be investigated in detail using different feature panels:
'Overview' panel: Presents a summary of the number of terminal exons found across tissues and cell types (Figure 2B).
'Metrics' panel: Provides the user with sample- and terminal exon-associated quality metrics as inferred by the STAR aligner (16) and TECtool (10), respectively (Figure 2B).
'Terminal Exons' panel: Offers an interactive table that contains detailed information about each terminal exon at single cell level (Figure 2C). The user can investigate and further filter terminal exons based on user-tailored features/columns, such as cell type, tissue, Ensembl/Entrez ID, etc. By default, the 'Show only unique' box is checked to prevent the user from looking at the same terminal exon identified within a multitude of single cells. The 'UCSC terminal exon' and 'UCSC gene' columns provide the option to investigate terminal exons and associated transcript isoforms in the UCSC Genome Browser (24). Finally, scTEA-db allows users to export their tailored views and filtered results in the form of TSV or Excel files.
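The effect of the 'Show only unique' option and the TSV export can be illustrated with a small Python sketch. The column names, the coordinate-string identifier and the helper functions are assumptions made for this example; the portal's actual column layout may differ.

```python
import csv
import io

# Hypothetical rows: the same terminal exon detected in two single cells.
rows = [
    {"terminal_exon": "chr1:1000-1500:+", "gene": "GENE1", "cell": "cell_A"},
    {"terminal_exon": "chr1:1000-1500:+", "gene": "GENE1", "cell": "cell_B"},
    {"terminal_exon": "chr2:200-900:-", "gene": "GENE2", "cell": "cell_A"},
]


def unique_terminal_exons(rows):
    """Keep the first row seen for each terminal exon identifier."""
    seen = set()
    unique = []
    for row in rows:
        key = row["terminal_exon"]
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def to_tsv(rows, columns):
    """Serialize the chosen columns of `rows` as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        delimiter="\t",
        extrasaction="ignore",  # drop columns not selected for export
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


unique_rows = unique_terminal_exons(rows)
tsv_text = to_tsv(unique_rows, ["terminal_exon", "gene"])
print(tsv_text)
```

Collapsing by terminal exon identifier corresponds to the distinction between all detected cases and unique terminal exons made in the Introduction.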

General information and bulk data download
The 'Info' panel contains general information about scTEA-db and provides an option for submitting user feedback, which is highly welcome. Finally, the full database in TSV file format and the annotations in GTF format are available as bulk data downloads (Figure 2B).
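For users who wish to process the bulk GTF download programmatically, a minimal Python sketch of parsing a single GTF record is given below. It relies only on the general GTF layout of nine tab-separated columns with a semicolon-separated attribute field; the example line and its attribute values are invented, and the attributes present in the scTEA-db file may differ.

```python
def parse_gtf_line(line):
    """Parse one GTF feature line into a dictionary.

    Attributes are split naively on ';', so values that themselves
    contain semicolons are not handled by this sketch.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GTF columns")
    seqname, source, feature, start, end, score, strand, frame, attributes = fields
    parsed_attributes = {}
    for item in attributes.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        parsed_attributes[key] = value.strip('"')
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # GTF coordinates are 1-based and inclusive
        "end": int(end),
        "strand": strand,
        "attributes": parsed_attributes,
    }


# Invented example record, for illustration only.
example_line = (
    "chr1\texample\texon\t1000\t1500\t.\t+\t.\t"
    'gene_id "GENE1"; transcript_id "GENE1.novel1";'
)
record = parse_gtf_line(example_line)
print(record["attributes"]["gene_id"], record["end"] - record["start"] + 1)
```

Header or comment lines starting with '#' would need to be skipped before calling this function.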

Discussion
The great demand for transcript 3′ end annotations is reflected by the large number of resources that provide genomic locations and poly(A) site processing quantifications, many of which are heavily used by the community. Examples are PolyA_DB 3 (25), PolyASite 2.0 (18) and scAPAdb (26). Importantly, large numbers of the experimentally observed transcript 3′ ends are not covered by the current gene annotation and there exists no detailed information about the large number of not yet annotated terminal exon isoforms. However, the annotation of these isoforms is required to enable studying their expression in various systems of interest and to uncover their biological roles. Here, we have made use of a large number of publicly available full-length scRNA-seq datasets to create scTEA-db, a resource that provides the community with not yet annotated terminal exons and associated transcript isoforms at single cell, i.e. cell type, resolution. scTEA-db is an easy-to-use web portal that offers user-specific searches, data views and downloads, thereby providing the foundation to study the biology of thousands of not yet annotated terminal exon isoforms. Our extended terminal exon annotation enables inference of the expression of the corresponding isoforms within any system of interest and thus also has the potential to contribute to more fine-grained cellular subtyping. Importantly, we have designed scTEA-db as an easy-to-expand resource that we aim to update by incorporating single cell sequencing datasets that will be made available by the community in the future. Additionally, providing expression information of the terminal exon transcript isoforms and corresponding protein domain predictions (27) will be valuable extensions. Finally, making use of bioinformatics tools for the prediction of RBP (28) and microRNA binding sites (29) within the 3′ UTRs of the terminal exons will further our understanding of the expression regulation and biological roles of specific isoforms in cell identity and function.

Figure 2. Finding and exploring terminal exons (TEs) using scTEA-db. (A) Users can filter the terminal exons by multiple features, such as gene symbol, cell type or tissue. (B) The 'Overview' and 'Metrics' panels provide details about the selected terminal exons. The 'Info' panel provides general information about scTEA-db and bulk data download options in tab-separated values (TSV) format and gene transfer format (GTF). (C) Details about the selected terminal exons at single cell resolution, including links to the UCSC Genome Browser (24).