TTSBBC: triplex target site biomarkers and barcodes in cancer

Abstract The technology of triplex-forming oligonucleotides (TFOs) provides an approach to manipulate genes at the DNA level. TFOs bind to specific sites on genomic DNA, creating a unique intermolecular triple-helix DNA structure through Hoogsteen hydrogen bonding. This targeting by TFOs is site-specific and the locations TFOs bind are referred to as TFO target sites (TTS). Triplexes have been observed to selectively influence gene expression, homologous recombination, mutations, protein binding, and DNA damage. These sites typically feature a poly-purine sequence in duplex DNA, and the characteristics of these TTS sequences greatly influence the formation of the triplex. We introduce TTSBBC, a novel analysis and visualization platform designed to explore features of TTS sequences to enable users to design and validate TTSs. The web server can be freely accessed at https://kowalski-labapps.dellmed.utexas.edu/TTSBBC/.


Introduction
Triplex-forming oligonucleotides (TFOs) bind to the major groove of genomic DNA, termed TFO target sites (TTS), forming intermolecular triplexes via Hoogsteen hydrogen bonding ( 1 ,2 ).The Hoogsteen binding enables the base recognition of a TFO to a TTS, compared to Watson-Crick base pairing rules for standard double-stranded DNA.TTSs are the poly-purine genomic duplex DNA with binding to TFOs influenced by longer lengths, higher G-content, and fewer pyrimidine interruptions (3)(4)(5)(6)(7).TTSs and TFO binding for sequences containing a G-quadruplex motif have also been found to be impacted by self-associating into G-quadruplexes ( 8 ).Altogether, these motif features enable categorization of TTS sequences for stability of the formed structure.
Triplex formation is a promising cancer treatment as an anti-gene technique, capable of specific DNA sequence targeting ( 2 , 9 , 10 ).TFOs can direct sequence-specific DNA damage whether they are bound to another compound or by them-self ( 2 ,11 ), and they can facilitate site-specific DNA repair synthesis and mutagenesis, likely via the nucleotide excision repair (NER) mechanism ( 2 ,12 ).TFOs also modulate transcription and inhibit replication ( 10 ).Common uses of TFO technology includes gene modulation by blocking transcription and gene modification through mutagenesis and homologous recombination ( 2 , 10 , 13 ).TFO binding is site-specific ( 14 ), leading to differential responses in cells with varying TTS abundance which can be exploited to target cancer cells with DNA amplification, a related form of genomic instability ( 15 ,16 ).The site-specific nature of TFO binding to TTSs plays an important role in predicting the triplex formations in the genome.
Though there have been several databases of TTS ( 17 ,18 ), their exploration for potential as markers in cancer treatment strategies has been limited due to the lack of a corresponding analysis platform.We present TTSBBC as a webserver analysis tool for researchers investigating the role of TTSs in cancer

The web application architecture
The web application is developed using Shiny R. The front end of the application is extended with HTML, Cascading Style Sheets (CSS), and JavaScript.TTSBBC is deployed on Amazon Web Service (AWS).The deployment uses AWS Fargate with containerization.AWS Elastic Load Balancer (ELB) serves as a traffic controller to ensure visiting experience and scalability.TTSBBC accesses a secured MySQL database containing all possible TTS mappings and features that have been precalculated for improved user experiences.This website is free and open to all users and there is no login requirement.

Data sources
The TTS sequences were downloaded from two sources, the Triplex-forming Oligonucleotide Target Sequence Search engine (TFO Search) ( 17 ) and from TTSMI ( 18 ).These datasets denote human genome TTSs that comprise 1,297,671 TTSs from TFO Search ( 19 ) and 36,276,455 TTSs from TTSMI ( 18 ), of which 384,089 overlapped between them.A database of 39 published, experimentally-validated TTSs was curated and is available to explore within the application which provides associated information and citations.Cancer cell line copy number segment data was sourced from DepMap ( 20 ,21 ) and filtered for amplifications, with names converted to cancer cell line encyclopedia nomenclature.The human genome reference sequence and annotation used was Gencode v44 ( 22 ).

Data pre-processing
Each TTS sequence was aligned to Gencode v44 using Bowtie ( 18 ,23 ).When a TTS mapped to multiple genomic locations, all mappings that were perfect matches were retained.For each TTS, sequence level features, length and associated base content, were also calculated.While each TTS is characterized by a feature set, their mappings to the genome can be oneto-many.The number and location of TTS mapping(s) were determined along with features as unique to a given TTS sequence.This precomputing enhances the analytical efficiency and user experience of TTSBBC.
A consensus gene element dataset was generated from the Gencode v44 GFF file.Promoters, introns, and exons were determined for all transcripts, then filtered to one representative transcript per gene.Introns were defined as unannotated regions (i.e.not exons, cds, or start / stop codons) while promoters were identified as 2Kb upstream of the start codon ( 17 ).In parallel, for a given gene, representative transcripts were selected based on this criteria in order: basic annotation, longest transcript ( 17 ), most exons, most annotated features, greatest support, and Ensembl canonical.In the case of ties, which occurred in less than one-percent of all genes, one transcript was randomly selected after manual review.To combine representative transcripts and gene elements, the gene elements were filtered for selected transcripts, then for exons, introns, and promoters to use as a dataset to annotate TTS mappings.

Biomarker: TTSs as multi-gene targets
The accuracy and precision of a TTS relate to the user's specified query region(s).For each TTS within these regions, accuracy, indicating the degree of on-target mapping, and precision, reflecting the degree of overlap with query regions, are assessed.Collectively, these measures evaluate the TTS's on and off-target mappings in the human genome and are defined as follows: Barcoding: sequence-informed signatures The link between TTS sequences, their genomic mappings, and sequence features are used to derive a barcode that may be further linked with a gene signature.For this task, each TTS's genomic mappings were intersected with genomic regions to define a unique gene signature.Barcoding TTSs organizes them into gene signatures based on binding-relevant features such as length, G-content, and pyrimidine interruptions (3)(4)(5)(6)(7).Barcoding is key to performing a detailed analysis of TTS sets.Barcoding of TTSs involves categorizing them based on key features, with each Barcode representing a combination of these traits.G-content and length are divided into two groups: above (G1 / L1) or below (G2 / L2) the median of that data source.Specificity is high (S+) for TTS sequences with multiple genomic mappings, and low (S1) for single mappings.Pyrimidine interruptions are classified as high (P+) if present, and low (P0) if absent.The presence of a G-quadruplex motif ( 24) is indicated as Q+, and its absence as Q0.For instance, a Barcode G1_L2_S1_P+_Q0 signifies a TTS with high G-content, shorter length, one mapping, at least one pyrimidine interruption, and no G-quadruplex motif.Using this barcoding system, all TTSs were organized according to the aforementioned traits.
Barcode signatures are representative of genes, TTSs and TTS features; these components can be stratified ( Supplementary Figure S1 ).This stratification illustrates the composition of Barcode signatures from various TTS gene signatures by data source.For instance, a desirable Barcode G1_L1_S+_P0_G0 contains TTSs with varying TTS gene signature sizes demonstrating sequence homology across the genome.Within each barcode, TTS gene signatures may be combined to derive barcode-level gene signatures ( Supplementary Data S1 ).Barcoding adds a new dimension to genomic annotation, enhancing the interactive understanding of gene targeting using TFOs.This serves as a valuable reference for researching genes with specific TTS characteristics, aiding studies in TFO formation and advancing our knowledge of anti-gene therapy in cancer.

Overall design of TTSBBC
The TTSBBC web server features three main modules, as depicted in Figure 1 .The first, 'findTTS,' includes 'Biomarker' and 'Barcode' submodules and accepts various genomic region inputs.'Biomarker' calculates the accuracy and precision of a TTS, assessing on-and off-target mappings.
W 549 Figure 1.Ov ervie w of the TTSBBC analy sis platf orm.TTSBBC is a platf orm f or analy sis of TTS sequences that is designed with three modules-findTTS, myTTS and mySEQ.findTTS enables searching for TTSs based on a genomic region query where users are provided with analysis options that guide them through an informed, smart screening of TTSs off-targeting and sequence features to increase successful triplex binding.myTTS enables a smart screening of TTS targets and visualizations to inform on multi-targeting of a user input TTS.Both findTTS and myTTS modules utilize platform populated TTS data sources.mySEQ offers a smart screening of sequence targets for an arbitrary input sequence by providing genomic locations for the sequence and its mismatches.The locational and gene outputs of mySEQ may be input into findTTS to find TTSs for the input sequence.
'Barcode' evaluates the sequence-level features of TTSs for binding.The second module, 'myTTS,' enables users to explore their TTS sequences, focusing on both features and genomic mappings.Both 'findTTS' and 'myTTS' use published sets of TTSs, while the third module, 'mySEQ,' offers users the capability to explore any input sequence, independent of data source.All modules include a form of 'smart screening' that informs on sequence characteristics including, but not limited to multi-gene targeting, for consideration in experimental design.

Input format options
The web server offers various input methods.For 'findTTS,' users can input genomic regions in several ways: using built-in cancer-related gene sets (25)(26)(27)(28)(29)(30)(31)(32), selecting amplified segments from CCLE cell lines ( 33 ), manually entering genes, or uploading custom genomic regions via a csv file.The 'myTTS' module accepts a TTS nucleotide sequence as its query.The 'mySEQ' module accepts a nucleotide sequence and number of mismatches as its query parameters.

findTTS module enables TTS discovery for a genomic region query
The 'findTTS' module offers a comprehensive analysis tool for screening TTSs based on their accuracy and precision to target user-defined genomic regions (Figure 2 ).The 'Biomarker' submodule enables researchers to compare TTSs' accuracy and precision across different regions, identifying genes with varied on-and off-target mappings.It includes a network visualization to show which genes are targeted by TTSs within the input regions.The 'Barcode' submodule categorizes TTSs into specific Barcodes based on sequence-level features.This module provides visual representations of TTS Barcode distributions, making analysis more intuitive and effective (Figure 3 A).It also shows spatial relationships of TTSs, particularly those with the S + Barcode, allowing detailed exploration of their genomic regional density (Figure 3 B).Additionally, 'findTTS' offers custom TTS Barcoding through k-means clustering based on numerical features, helping users identify optimal TTSs for their specific needs (Figure 3 C).This feature supports deeper investigation into how sequence features influence TTS formation and function,  enhancing user understanding of TTS characteristics and their impact.

myTTS module enables TTS validation and in vestig ation
The 'myTTS' module focuses on TTS sequence analysis, enabling users to perform smart screening of input sequences.It identifies mappings for a given TTS and displays a network of genes connected to it.The network of genes connected to the TTS helps inform users of the comprehensive genomic locations of this sequence and the genes at those locations.Additionally, the module offers detailed sequence-level features and characteristics for assessing TTS binding properties.The 'myTTS 'module enables users to explore the data from the provided data sources on a sequence-by-sequence basis.This functionality provides the ability to explore sequence features that may have potentially affected binding outcome.

mySEQ module enables TTS validation and discovery
The 'mySEQ' module is designed to provide utilities from the 'findTTS' and 'myTTS' modules, independent of a data source.With a user input nucleotide sequence and specification of the number of mismatches, TTSBBC aligns the sequence to the human reference genome and defines mismatches to it using a Hamming Distance, and similar preprocessing workflows are implemented for the other two modules.With these calculations, 'mySEQ' provides networks of which genes are targeted by the input and mismatch sequences.A stacked bar plot of the base differences to the original sequence is displayed, along with a circos plot of the den-sities of mapping locations of the input sequence and its mismatches.Among the downloadable tables available, the mismatch features table includes the Hamming Distance to the input sequence.The 'mySEQ' module provides a database agnostic analysis option for the case in which a TTS is of interest to explore, but it is not included in either populated data source by simply inputting a TTS sequence.Alternatively, by inputting an arbitrary sequence using mySEQ, a user can obtain genomic locations for use as input into 'findTTS', thereby providing the ability to find arbitrary sequence-based TTSs, albeit data source restrictions.

Case Study #1: validation and discovery of TTSs targeting HER2 in breast cancer
The recent study by Tiwari et al. ( 34 ) showed the enhanced ability of TTS's to target HER2 as compared to Trastuzumab in breast cancer cell lines.Using their TTSs (with 'myTTS' module), and cell lines (with 'findTTS' module) we demonstrate the utility of our application.The 'myTTS' module validated four TTS sequences (HER2-1, HER2-205, HER2-40118, HER2-5922) targeting the ERBB2 gene, with HER2-1, also noted in other literature ( 18 ), serving as a positive control (Figure 4 ).TTSBBC offers dual methods for TTS discovery: (i) targeting amplified cell line segments or (ii) targeting genes.The cell line approach focuses on increased TTS copies for differential triplex formation and DNA damage in cells with amplified regions ( 34 ,35 ).Targeting genes directly is similarly effective, especially in cases of gene amplification.This dual approach showcases TTSBBC's versatility in TTS analysis.
In the cell line approach, using the BT474 cell line known for ERBB2 amplification ( 34 ), the findTTS module identified 18,629 TTSs with 100% precision and accuracy for the cell line's amplified segments (Figure 5 A, B).After further selection of Barcodes indicating the best binding, 2,254 TTSs remained.Clustering on Guanine content and Length reduced this to 53 TTSs with higher Guanine content and length and TTSBBC found a cluster of 207 TTS with better indicators for binding than the HER2-1 cluster (Figure 5 D).Another analysis but rather using TTS literature not containing our positive control ( 17 ), produced three 100% precise and specific TTSs with optimal binding features with longer lengths than HER2-1, confirming TTSBBC's effectiveness in diverse data sources ( Supplementary Figure S2 ).
In the gene-level approach, targeting the ERBB2 gene identified 389 TTS sequences with 100% precision and accuracy ( Supplementary Figure S3 A).Analysis of their Barcodes ( n = 60 with ideal Barcodes for binding) and clustering based on Guanine content and Length validated HER2-1 as an effective TTS for targeting ERBB2 with no clear putatively better TTS ( Supplementary Figure S3 B, C).
This case study found both cell line and gene-based approaches can identify effective TTSs, but targeting amplified segments yielded a larger set of TTSs with better binding features than solely targeting amplified genes.This demonstrates TTSBBC's utility in guiding researchers to design more effective and efficient experiments targeting any human genomic region, especially those with differential amplification.
Case study #2: discovery of multi-targeting TTSs to the non-homologous end-joining (NHEJ) pathway TTSBBC's design accommodates the fact that a single TTS can have multiple genomic mappings, allowing for the in- (5`-A GGA GAA GGA GGA GGTGGA GGA GGA GGG-3`in red) connected to all its human genome mapping locations (green).( B ) HER2-205 (5`-GGGGGGCAA GA GGGCGA GGA GGA GCCCC-3`in red) connected to all its human genome mapping locations (green).( C ) HER2-40118 (5`-GGGGTGGG AGGG ACAAAGGGGG-3`in red) connected to all its human genome mapping locations (green).( D ) HER2-5922 (5 -GGGGTG AGG AG AGTGGGGG AGG AG AAAGGG-3 in red) connected to all its human genome mapping locations (green).vestigation of multiple triplex formations throughout the genome.Triplex structures are known to induce doublestranded breaks ( 16 ), necessitating repair via mechanisms like NHEJ.Using the NHEJ pathway as a query in the 'find-TTS" module, 3,579 TTSs were found to map within NHEJrelated genes.Of these, six TTSs showed multi-targeting capabilities with minimal off-target effects (Figure 6 A).Analysis of these TTSs' genic mappings indicated that all six mapped to NHEJ1 , two to XRCC4 , four to PRKDC and six to ENSG00000280537 (Figure 6 B).Barcode screening revealed that none of these TTSs had a G-quadruplex motif, with two having high guanine content and two lacking pyrimidine interruptions (Figure 6 C).Furthermore, the TTS mappings analysis revealed a high density near NHEJ genes PRKDC , XRCC6 , XRCC5 and NHEJ1 (Figure 6 D).
Additional analysis revealed off-target effects with complete sequence similarity.However, sequence homology in surrounding regions suggested undiscovered biological mechanisms involving triplex formation possibly involving ENSG00000280537 .An enrichment analysis ( 36 ) of the TTS gene signature revealed 3 TTSs enriched in FAK signaling, a different set of 3 in Hallmark Apical Junction, 2 in DNA polymerases and all 6 in NHEJ (Figure 6 E).Notably, a single TTS showed enrichment in 12 of the cancer Hallmark gene signatures ( 28 ,30 ).This multi-targeting characteristic of TTSs across the human genome is similar to the DNA-binding domain of transcription factors, which bind to genomic DNA influencing expression and cell differentiation ( 37 ,38 ), and like transcription factors, triplex formation affects transcription ( 10 ).
TTSBBC provides a novel approach to study multimapping and multi-targeting TTSs, offering unique insights into their biological relationships and mechanisms.This research demonstrates how TTSBBC can help users explore biologic processes and the potential applications of TTS in broader genomic research.

Discussion
TTSBBC is a comprehensive web platform for analyzing TTSs.It features a user-friendly interface for examining TTS accuracy, precision, Barcodes, genomic mappings, and sequence-level characteristics.The platform highlights the multi-targeting potential of TTSs, in addition to their traditional single targeting applications, offering novel insights with biological and therapeutic implications.
TTSBBC specializes in the screening analysis of human genome TTSs.As analysis-focused platform, TTSBBC is limited in that it does not support the creation or design of TFOs for TTSs.For this purpose, other methods may be used supplemental to TTSBBC that focus on the chemical interactions required for TFO design ( 39 ,40 ).While other platforms, such as Triplex-Inspector assesses TFO-TTS conjugates and their off-target effects ( 41 ,42 ), while ignoring the multi-targeting capabilities of TTSs.By comparison, TTSBBC focuses upon relative assessments of both specific and non-specific TTSs with additional analyses capabilities.Notably, at the time of writing, Triplex-Inspector is not accessible via their published link.
Other triplex based tools include TTS Mapping which finds TTSs for specific regions but restricts focus to highlyspecific TTSs and co-occurring motifs ( 43 ).TFO Search and TTSMI are TTS sequence repositories without analysis features ( 17 ,18 ), and are included in TTSBBC as data sources.TTSBBC advances beyond these tools by offering analytical capabilities with informed, 'smart screening' for experimental design.
Future planned enhancements for TTSBBC include the ability to analyze sets of TTSs, their genomic mappings, and features to aid in precision medicine and personalized anti-gene therapies.Another potential improvement is incorporating all possible permutations of TTS sequences, not just those from published sources, for cartography of the landscape of TTSs and their multi-targeting potential.The initial curation of published and experimentally validated TTSs remains ongoing and will be updated, potentially forming a third database in the future.A forthcoming update includes TTS mapping to the newest T elomere-to-T elomere (T2T) ( 44 ) coordinates and annotations.
The web server offers various input options to enhance user experience and is accessible to a wide audience.Its intuitive visualizations aid non-computational biologists in exploring TTS sequence features and genomic mappings for targeted genomic instability in cancer.TTSBBC introduces novel adaptations of metrics and concepts for TTSs-precision, accuracy, barcoding and multi-targeting assessment-integrating these into a comprehensive and accessible analysis platform.
Number o f T T S map p ing sites in query regions To tal number o f T T S map p ing sites Accuracy = Number o f query regions with T T S map p ing sitesTo tal number o f query regio ns

Figure 2 .
Figure2.TTS Biomarker assessment using accuracy and precision.( A ) Network of TTS Mappings.The node colors in red are TTS sequences, in blue are off target genes, and in green are query regions.In the app users can zoom in and see the names of each node.The off-target mappings that are not within any gene region are connected to a blue node labeled intergenic.( B ) TTS accuracy and precision scatterplot.On the x-axis is precision, and the y-axis is accuracy, both on a scale from 0-1.This scatterplot allows users to select TTSs as shown by the blue rectangle selection.( C ) A visual representation of differing accuracy and precision values in relation to a figurative target.Greater accuracy are points closer to the bullseye, how 'on-target' a TTS is, and greater precision is a smaller spread of points, how consistent and able to target all regions a TTS is.On the top right corner of each target are cartesian coordinates of the v alues f or that target's accuracy (x-axis) and precision (y-axis) in (B).Where the target would be located in (B) informs interpretation of the different TTS positions in the scatterplot.( D ) A table of accuracy and precision values to aid users in the interpretation of their values in (B) and (C).High accuracy and precision values are close or equal to 1, and low values are close or equal to 0. Precision is defined as the number of TTS mapping sites in query regions divided by the total number of TTS mapping sites and informs how much of the TTS's mapping sites are within the query.Accuracy is defined as the number of query regions with TTS mapping sites divided by the total number of query regions and informs how much of the query regions are being targeted.

Figure 3 .
Figure 3. TTS barcode assessment of sequence le v el features.( A ) Heatmap of TTSs and their features in relation to their mapping to query regions.TTS barcodes are visualized as annotations above the heatmap to demonstrate the diversity of sequence features for TTSs that are within certain query regions.( B ) Circos plot denoting the locations the selected TTSs map to that are not within the query regions (labeled outside the chromosome ring).T he TTS is link ed to other regions it maps to, and the link is colored by the destination chromosome.( C ) Custom Barcoding.Clustering can be done on a pair of numeric features for users to create their own barcodes and find a set of TTSs that are most applicable to their work and use cases.

Figure 5 .
Figure 5. Analyses of TTS in BT474 Amplified Segments reveal putatively better TTS targeting and binding.( A ) Table of amplified segments selected because of high segment means and o v erlap with ERBB2 to target with TTSs (left) and a visualization of segment location (blue) in relation to the chromosome (right).( B ) Scatterplot (with x and y jitter to show density) of TTS accuracy and precision revealing 18,629 TTSs, of which HER2-1 (Figure 4 A) is one, with 100% accuracy and precision.( C ) Heatmap of TTS barcodes with query segment targeting.A plus (+) indicates > 100 TTSs are in that category.HER2-1 is present in the barcodes that indicate the highest binding in the column highlighted in red, along with 2,254 other unique TTSs.( D ) Clustering of TTSs with barcodes that indicate the highest binding ( n = 2,254) by guanine content and length.In the red box is where HER2-1 clustered.In the orange box are TTSs with putatively better TTS targeting and binding ( n = 53).Grey cluster ( n = 207) has better indicators for binding than the cluster that HER2-1 is in (pink).

W 553 Figure 6 .
Figure 6.Analysis of non-homologous end-joining (NHEJ) pathway reveals differential multimapping TTS. ( A ) Scatterplot (with x and y jitter to show density) of TTS accuracy and precision re v ealing six TTSs, with effective multimapping to query gene set.( B ) Network of TTS mapping locations within genic regions.The node colors in red are TTS sequences, in blue are off target genes, and in green are query gene regions.The off-target mappings that are not within any gene region are connected to a blue node labeled intergenic.NHEJ1 contains a mapping from all six TTSs, XRCC4 contains two of six, PRKDC contains four of six, and ENSG00000280537 contains six of six.( C ) Heatmap of TTS barcodes with query gene targeting.Query regions, NHEJ1 , PRKDC and XRCC4 , contain multiple TTS mapping locations with varying barcodes.( D ) Circos plot of TTS genomic location density.Genomic location density shows concentrations of TTS mappings near PRKDC , XRCC6 , XRCC5 , and NHEJ1 .( E ) Enrichment analysis based on the Hypergeometric distribution ( 36 ), using a significance threshold of P ≤ 0.05, of TTS gene signatures as queries and built-in cancer-related gene sets ( 25-32 ) as references re v eals significant enrichment of TTS gene signatures in red.