The transcriptional landscape of endogenous retroelements delineates esophageal adenocarcinoma subtypes

Abstract Most cancer types exhibit aberrant transcriptional activity, including derepression of retrotransposable elements (RTEs). However, the degree, specificity and potential consequences of RTE transcriptional activation may differ substantially among cancer types and subtypes. Representing one extreme of the spectrum, we characterize the transcriptional activity of RTEs in cohorts of esophageal adenocarcinoma (EAC) and its precursor Barrett's esophagus (BE) from the OCCAMS (Oesophageal Cancer Clinical and Molecular Stratification) consortium, and from TCGA (The Cancer Genome Atlas). We found exceptionally high RTE inclusion in the EAC transcriptome, driven primarily by transcription of genes incorporating intronic or adjacent RTEs, rather than by autonomous RTE transcription. Nevertheless, numerous chimeric transcripts straddling RTEs and genes, and transcripts from stand-alone RTEs, particularly KLF5- and SOX9-controlled HERVH proviruses, were overexpressed specifically in EAC. Notably, incomplete mRNA splicing and EAC-characteristic intronic RTE inclusion was mirrored by relative loss of the respective fully-spliced, functional mRNA isoforms, consistent with compromised cellular fitness. Defective RNA splicing was linked with strong transcriptional activation of a HERVH provirus on Chr Xp22.32 and defined EAC subtypes with distinct molecular features and prognosis. Our study defines distinguishable RTE transcriptional profiles of EAC, reflecting distinct underlying processes and prognosis, thus providing a framework for targeted studies.


INTRODUCTION
Gene expression depends on transcription, subsequent splicing and polyadenylation of nascent RNA, through recognition of specific motifs in genomic DNA. Between and within genes, howe v er, r eside numerous r etrotransposable elements (RTEs) that can contribute transcription initiation signals, splice donor and acceptor sites, and poly adenylation signals, thereb y contributing to or disrupting RNA production processes ( 1 ). The human genome harbors over 4 million RTE integrations of distinct phylogeny, genomic structure and replication life-cycle, with a vast majority of them being replication-defecti v e due to accum ulation of m utations and deletions ( 2 , 3 ). A major distinction is the presence of long-terminal repeats (LTRs) in human endogenous retroviruses (HERVs), originating from germline infection with exogenous retroviruses, and in mammalian apparent LTR retrotransposons (MaLRs). In contrast, long interspersed nuclear elements (LINEs) and short interspersed nuclear elements (SINEs), which include the expanded Alu elements, lack LTRs ( 2 , 3 ). Another pertinent distinction is the use of poly(T) by non-LTR LINE-1, SINE and the composite SINE-VNTR-Alu (SVA) elements, all of which rely on the LINE-1 replication machinery, for priming of re v erse transcription (target-primed), which inserts poly(A) tails in DNA integration sites ( 2 , 3 ). RTE expression has been found d ysregula ted, a t least in part owing to epigenetic der epr ession, in most cancer types that have been examined, where it may have more pronounced effects on gene function and RNA production, as well as additional effects, such as induction of an interferon (IFN) response or creation of cancer-specific antigens ( 4 , 5 ).
Howe v er, the degree or direction of RTE d ysregula tion is highly variable among different cancer types. Whilst average RTE transcription is r eported upr egulated in a majority of cancer types, it may also be downregulated in other types characterized by increased epigenetic r epr ession, and individual RTE copies or families may also display opposing patters of dysregulation e v en in the same cancer type ( 4 , 6 , 7 ), highlighting cancer-specific and RTE-specific causal processes.
Using de novo transcriptome assemb ly, we hav e previousl y observed substantiall y increased aberr ant tr anscription of LTR retroelements in esophageal carcinoma (ESCA), compared with healthy tissues or other cancer types, with ESCA being second only to testicular germ cell tumors (TGCT), where RTEs are d ysregula ted also as part of epigenetic reprogramming during spermatogenesis ( 8 ). ESCA is classified histolo gicall y and molecularl y as squamous cell carcinoma (ESCC) or adenocarcinoma (EAC) ( 9 ), each associated with different risk factors. In particular, EAC is connected with Barrett's esophagus (BE), a condition of esophageal epithelium and an EAC precursor state ( 10 ). It was ther efor e unclear whether the pronounced LTR r etroelement dysr egulation in ESCA transcriptomes ( 8 ) tracks with EAC and its precursor BE, or whether it extends to all types of RTE, particularly the more numerous non-L TR type. Increased non-L TR retroelement activity in EAC is also suggested by observa tions tha t LINE-1 insertions are the most frequent type of somatic structural variation in EAC (11)(12)(13)(14) and are also detected in pre-malignant BE ( 12 , 15 ).
In this work, we have utilised established TCGA cohorts, as well as a new extended cohort of EAC and BE from the OCCAMS consortium to define RTE transcriptional activity in EAC, and explore its origins and potential consequences. We identified a comprehensi v e list of RTEover lapping tr anscripts ov ere xpressed specifically in EAC, many of which were not previously annotated and from which we deri v ed both diagnostic and prognostic RTE transcriptional signatures. We further pinpointed incomplete intr on pr ocessing and intr onic RTE removal, rather than autonomous ERE expression, as the origin of the exceptional RTE transcriptional di v ersity in EAC. In turn, we found that defecti v e intr on pr ocessing is associated with reduced cellular fitness, owing to reduced expression of the full y-spliced, functional mRN A isoforms. We further associated defecti v e intr onic RTE removal with str ong transcriptional activation of distinct HERVH proviruses and of the HERVH Xp22.32 provirus in particular, controlled by transcription factors KLF5 and SOX9. Most notably, we found that defecti v e intr onic RTE pr ocessing and associa ted HERVH Xp22.32 activa tion defines a distinguishable subtype of EAC that exhibits more pronounced adenocarcinoma characteristics, is more di v ergent from its BE precursor and from ESCC, and is associated with reduced cancer cell fitness and, consequently better prognosis.

OCCAMS (oesophageal cancer clinical and molecular stratification) cohorts
This work used previously published and newly collected samples from the OCCAMS stud y (REC . no. 10-H0305-1). OCCAMS is an observational study to determine the molecular dri v ers of EAC. Ethical approval was obtained from the Cambridgeshire 4 Research Ethics Committee, UK. Tissue was obtained with written, informed patient consent. All relevant ethical regulations were correctly followed and samples were fully anonymized. The OCCAMS whole genome sequencing (WGS) samples analyzed here included 99 previously described EAC samples ( 16 ) and an additional 128 EAC samples. Single nucleotide variants (SNVs), small indels and copy number alterations (CNAs) were called using Strelka v.1.0.13 ( 17 ) and ASCAT-NGS v.2.1 ( 18 ), as described previously ( 19 ). A total of 40 EAC cancer dri v ers fr equently alter ed through SNVs or indels were deri v ed from the Networ k of Cancer Genes database (NCG, http://www.network-cancer-genes.org ) ( 20 ). Additionally, 34 dri v ers fr equently alter ed through cop y number alterations (CNA) were deri v ed from the literature ( 9 , 16 , 19 , 21 , 22 ). Damaging alterations in these 74 EAC dri v ers were identified using ANNOVAR (April 2018) ( 23 ) and dbNSFP ( 24 ). Stopgain, stoploss and frameshift mutations were considered damaging. Missense and splicing mutations were further filtered to identify loss-of-function and gain-of-function alterations, as described previously ( 20 ). Additionally, dri v ers with copy number gains (CNA > 2 times sample ploidy), homozygous deletions (CNA = 0) and heterozygous deletion (CNA = 1) with a loss of function mutation in the other allele wer e consider ed to be damaged. The OCCAMS RN A sequencing (RN A-seq) samples (ribosomal RN A (rRN A)-depleted total RN A) anal yzed here included 116 previously described EAC samples ( 16 ) and an additional 131 EAC and 110 BE samples. For RNAseq raw data anal yses, ada pter and quality control trimming were carried out using Trimmomatic v0.36 (Bolger et al. , 2014). Quality control of raw reads, carried out using FastQC, indicated the presence of bacterial RNA and r esidual rRNA r eads in a majority of the samples. These r eads wer e filter ed out using BBsplit (BBMap v36.20) from BBTools suit ( http://jgi.doe.gov/data-and-tools/bb-tools/ ) by aligning reads against the GRCh38 / hg38 genome and the human ribosomal DNA complete repeating unit (Gen-Bank: U13369.1). Samples that ended up having less than 10 6 pair ed r eads after r emoval of bacterial RN A and rRN A r eads wer e excluded from downstr eam analyses.

T r anscript identification, read mapping and quantitation
OCCAMS RNA-seq samples that passed quality control were to the GRCh38 / hg38 human genome using HISAT2 v2.1.0 ( 25 ). Additional RNA-seq samples were downloaded from TCGA (poly(A) selected RNA) as .bam files and converted to .fasta files using SAMtools v1.8 ( 26 ). Downstream analysis of TCGA samples and other publicly available RNA-seq datasets used in this study was carried out as for OCCAMS cohorts excluding bacterial RNA and rRNA read filtering step. Although pol y(A) RN A selection may underestimate transcripts from satellite repeats, transcripts from RTEs were previously found similarly r epr esented in total and poly(A)-selected RNA ( 27 ). Annotated gene and r epeat expr ession wer e calcula ted by fea tureCounts (part of the Subread package v1.5.0) ( 28 ) using GENCODE.v29 basic ( 29 ) and a custom repeat annotation ( 30 ). Genic repeats were defined as those with at least one nucleotide overlap with annotated gene bodies, with the rest of the repeats defined as intergenic. To pre v ent ambiguity, only reads that could be uniquely assigned to a single feature were counted. Long-read RNA-seq samples were aligned to the GRCh38 / hg38 human genome using minimap2 v2.17 ( 31 ). For assemblies of long-read RNA-seq reads, the obtained .bam files were first converted to bed12 using bam2bed12.py script from FLAIR suit ( 32 ). High-confidence isoforms were selected using 'collapse' function from flair.py script ( 32 ). ChIP-seq datasets were trimmed using Trimmomatic v0.36 ( 33 ) and aligned to the GRCh38 / hg38 human genome using Bowtie 2 v2.2.9 ( 34 ). Additional transcripts were previously de novo assembled on a subset of the RNA-seq data fr om TCGA ( 8 ). Samples fr om T CGA wer e downloaded through the gdc-client application and the .bam files were parsed with a custom Bash pipeline using GNU parallel ( 35 ). RNA-seq data from TCGA, GTEx, CCLE OCCAMS and listed previous studies were mapped to our de novo cancer transcriptome assembly and counted as previously described ( 8 ). Briefly, transcripts per million (TPM) values were calculated for all transcripts in the transcript assembly ( 8 ) with a custom Bash pipeline and Salmon v0.8.2 ( 36 ), which uses a probabilistic model for assigning reads aligning to multiple transcript isoforms, based on the abundance of reads unique to each isoform ( 36 ). The 4844 ESCAov ere xpressed transcripts were selected based on median expression in ESCA > 0.5 TPM, with the 90th percentile of expression in the respecti v e healthy tissues or the maximum median expression in any healthy tissue at least 3 × lower than the 75th percentile of expression in ESCA, and < 0.5 TPM. We separately quantified expression of annotated genes by using a transcript index with all GENCODE tran-script support le v el:1 entries and collapsing counts for the same gene. For quantitation of e xon v ersus intron r epr esentation, the same pipeline was followed, except counts were collapsed for all annotated exons and all introns for the same gene, separately. Read count tables were additionally imported into Qlucore Omics Explorer v3.8 (Qlucore, Lund, Sweden) for further downstream expression analyses and visualization. Splice junctions were visualized using the Integrati v e Genome Viewer (IGV) v2.4.19 ( 37 ).

Repeat annotation
Repeat regions were annotated as previously described ( 30 ). Briefly, hidden Markov models (HMMs) r epr esenting known Human repeat families (Dfam 2.0 library v150923) were used to annotate GRCh38 using Repeat-Masker, configured with nhmmer. RepeatMasker annotates LTR and internal regions separately, thus tabular outputs were parsed to merge adjacent annotations for the same element.

Cellular deconvolution of bulk RNA-seq data
Frequencies of immune cell populations in patient samples were estimated by cellular deconvolution of bulk RNA-seq data using the CIBERSORTx method ( https://cibersortx. stanford.edu ) ( 38 ).

Consensus motif identification
Consensus motifs were identified at the 5 and 3 ends of all fully intronic transcripts by sequence alignments of the terminal 40 bp at either end using the WebLogo tool ( https: //w e b logo.ber keley.edu/logo.cgi ) ( 39 ). The results were plotted as sequence logos.

Functional gene annotation by gene ontology
Pathway analyses were performed using g:Profiler ( https: //biit.cs.ut.ee/gprofiler ) with genes ordered by the degree of differ ential expr ession. P values wer e estimated by hypergeometric distribution tests and adjusted by multiple testing correction using the g:SCS (set counts and sizes) algorithm, integral to the g:Profiler server ( 40 ).

Survival analysis and hazard ratio calculations
For survival analysis, all OCCAMS EAC samples with survi val data recor ded were used. To test if expression of a transcript of interest correlated with patients' survival, we identified the patients in the bottom and top percentile expression ('low' versus 'high' expression). Survival analysis was done using the survfit function of the survival R package (v2.42), using ov erall survi val time. To compare curves between low and high expression tertiles, log-rank testing was used and a Cox r egr ession model was built to test the assumption of proportional hazards holds. Hazard odd ratios are gi v en based on the Cox regression model. Similarly, a Cox r egr ession model was used to compare survival between multiple expression clusters.

Reverse transcriptase-based quantitative PCR (RT-qPCR)
RNA was extracted using the RNeasy kit (Qiagen). cDNA was synthesized using the Maxima First Strand cDNA Synthesis Kit (Thermo Fisher), and qPCR performed using Applied Biosystems Fast SYBR Green (Thermo Fisher) using the following primers: Values were normalised to HPRT expression using the C T method.

RT-PCR and amplicon sanger sequencing
cDNA from HARA and LK-2 cells was used as template for PCR amplification, performed using KOD Hot Start Master Mix (Sigma) with the following primers: Target Forward Reverse L1PA2-L1PB1 TTTGACTCA GAAA GGGAACT GT ACGCCAA TTTTAA TTGTT Separatel y, cDN A from LK-2 cells was amplified using nested PCR with the following primers: The PCR products were Sanger sequenced by Genewiz, Essex, UK, using the same primers.

Statistical analyses
Statistical comparisons were made using GraphPad Prism 7 (GraphP ad Softwar e), SigmaPlot 14.0., or R (versions 3.6.1-4.0.0). Parametric comparisons of normally distributed values that satisfied the variance criteria were made by unpaired or paired Student's t -tests or One Way Analysis of variance (ANOVA) tests with Bonferroni correction for multiple comparisons. Data that did not pass the variance test were compared with non-parametric two-tailed Mann-Whitney Rank Sum tests (for unpaired comparisons), Wilco x on Signed Rank test (for paired comparisons) or ANOVA on Ranks tests with Tukey or Dunn correction for multiple comparisons. Multi-region data were compared using a linear mixed effects model with each patient as a random effect.

Increased RTE inclusion in esophageal and stomach cancer transcriptomes
To examine if the increased inclusion of LTR elements previously seen in ESCA transcriptomes ( 8 ) extended beyond these elements, we compared measures of overall transcriptome complexity across different cancer types and respecti v e healthy tissues. We considered the total number of assembled transcripts expressed at ≥0.5 TPM in each sample as an indir ect measur e of transcriptome complexity. The number of expressed transcripts overlapping an LTR element was proportional to the total, with a pproximatel y 14% of all transcripts including an LTR element in both malignant and healthy tissues (Supplementary Figure S1A). A far greater proportion of transcripts included a non-LTR RTE (77% and 82% in healthy and malignant tissues, respecti v ely) than an LTR RTE (Supplementary Figure S1B). According to this measure, healthy tissues varied substantially in transcriptome complexity, as expected, given distinct cellular composition and dif ferentia tion (Figure 1 A), independently of sequencing depth ( Supplementary Figure S1C). Cancer samples exhibited overall lower complexity and, in most cases, lower than the respecti v e healthy samples, with the exception of ESCA and stomach adenocarcinoma (STAD), where transcriptome complexity was significantly increased ( P = 0.005 and P < 0.001, respecti v ely, Mann-Whitney Rank Sum test) (Figure 1 A). Thus, increased activity of LTR elements in ESCA ( 8 ) appeared to reflect increased overall transcriptome di v ersity.
We next selected 4844 assembled contigs r ecurr ently present in both ESCA and STAD, but minimally in healthy tissues for further analysis. Estimated expression of these was shared also with ovarian serous cystadenocarcinoma (OV) and, to a lesser degree, colon adenocarcinoma (COAD) (Figure 1 B). Expression of the selected transcripts was not observed in non-malignant tissue samples from TCGA or GTEx, with the exception of three particular esophagus samples from T CGA (Figur e 1 B). The latter, howe v er, were all tumor-adjacent samples, rather than completely healthy tissue, which may have altered their transcriptional profile ( 41 ).
The vast majority (98.9%) of the selected transcripts were also found expressed in two extended cohorts of esophageal adenocarcinoma (EAC, n = 78) and esophageal squamous cell carcinoma (ESCC, n = 78) from TCGA, as well as an additional cohort of EAC ( n = 99) from OCCAMS ( 16 ) (Figur e 1 C). Mor eov er, the assemb led transcripts appeared co-e xpressed in indi vidual samples (Figure 1 C), implying they were generated by a common mechanism.
A small number of the selected transcripts (238) comprised only RTEs and an e v en smaller number ( 42 ) Table S1). The remaining transcripts (4564) partially overlapped with 2144 annotated genes (an average of 2.1 transcripts per gene), and a vast majority of these (4053) also overlapped with RTEs (Figure 1 D). A great majority of the transcripts (4281) included one or more elements from the three major groups, with SINEs being by far the most frequent, followed by LINEs, and LTR elements the least frequent (Figure 1 E). Compared with all assembled transcripts, the ov ere xpressed 4844 transcripts were significantly enriched for L1 LINE subfamilies, including L1HS , and certain SINE subfamilies, particularly Alu subfamilies (Figure 1 F). LTR elements were relati v ely absent, with the notab le e xception of the HERVH subfamil y, w hich was the most enriched in the selected transcripts ( Figure 1 F; Supplementary Table S1). In contrast, the evolutionary older MIR SINE and L2 LINE subfamilies appeared overall underr epr esented in the selected transcripts (Figure 1 F).
For orthogonal validation of these findings, RTE expression was separately quantified using featureCounts and a custom repeat annotation ( 30 ), excluding m ulti-ma pping r eads, in the T CGA EAC cohort (Supplementary Figur e S2). This analysis identified 1179 RTEs as significantly differ entially expr essed ( > 6-fold-change, P < 0.05, q < 0.05) between EAC and normal esophagus TCGA samples, a great majority of which (984) were genic (Supplementary Figure S2A). They included comparable proportions of LINEs and SINEs and a smaller proportion of LTR elements (Supplementary Figure S2B). In agreement with our transcriptome-based quantitation, enrichment analysis of read counts identified L1 LINE subfamilies, including L1HS , and the HERVH subfamily of LTR elements as significantly enriched (Supplementary Figure S2C). Howe v er, the ov ere xpressed Alu subfamilies appeared underr epr esented, likely due to an underestimation of their actual expression by the discarding of m ulti-ma pping reads, which would affect multi-copy subfamilies disproportionally (Supplementary Figure S2C). Indeed, with feature-Counts, we observed a strong effect of the number of copies of a gi v en RTE in the total transcriptome on its relati v e enrichment in the EAC-ov ere xpressed RTEs (Supplementary Figure S2D).
Together, these findings suggested that the increased transcriptional r epr esentation of RTEs in esophageal and stomach cancer transcriptomes resulted primarily from gene tr anscription incorpor ating intronic or adjacent RTEs, rather than of autonomous transcription of stand-alone, intergenic RTEs.

Aberrant RNA splicing of RTEs in the esophageal cancer transcriptome
To investigate a potential mechanism underlying the preferential inclusion of genic, rather than intergenic RTEs, we separated ESCA-ov ere xpressed transcripts into those that wer e entir ely within annota ted introns, those tha t did not include any intronic sequences and all other combinations (Figur e 2 A). Compar ed with the entir e transcriptome, the ESCA-ov ere xpressed transcripts were enriched for fully intronic contigs, at the expense of those not overlapping with introns (Figure 2 B). Ne v ertheless, a thir d of the ESCAov ere xpressed transcripts comprised combinations of exonic and intronic or intergenic RTEs (Figure 2 B).
The abundant seemingly fully intronic reads were not a result of DNA contamination as they covered intronic but not intergenic RTEs, as exemplified in the CASP2 , TIMM17A or CPSF6 loci, w here RN A-seq reads mapped to numer ous intr onic but not intergenic RTEs (Figure  2 C). Moreover, they were independent from RNA selection methods as they were detected both in TCGA and in OC-CAMS RN A-seq data, w hich were generated using pol y(A) selected RNA and rRNA-depleted total RNA, respecti v ely (Figure 2 C). Lastl y, full y intronic, but not intergenic contigs spanning RTEs were also detected in long-read ISOseq data from esophageal cancer cell lines ( 42 ), coinciding with TCGA and in OCCAMS RNA-seq peaks (Figure 2 C). These da ta indica ted incomplete RNA splicing in ESCA affecting multiple introns of a gi v en gene, rather than retention of specific introns (Figure 2 C), and was therefore likely distinct from intron retention.
Fully intronic contigs detected in ISO-seq data exhibited more defined boundaries than mapping of shorter RNAseq reads and this offered the opportunity to examine the pr esence of r epea ts a t either end of the contig. We noted that fully intronic contigs appeared to be initiated or terminated typically at an RTE (Supplementary Figure S3A, B) and we reasoned that this could arise from priming of re v erse transcription during the cDNA synthesis step of library prepara tion a t poly(A) tails in such RTEs, as has been suggested for intronic reads identified in single-cell (sc) RNA-seq data ( 43 ). Indeed, fully intronic contigs in ISO-seq libraries had pol y(A) or pol y(U) tracts at either end, depending on orienta tion rela ti v e to the gene, and were enriched for Alu SINEs, particularly of the most recent members AluJb and AluSx (Supplementary Figure S3C, D).
In addition to the generation of seemingly fully intronic contigs, large introns also e xhibited e vidence for splicing between flanking exons and intronic RTEs, likely resulting from incomplete recursi v e splicing (Supplementary Figure  S4). Splicing between gene exons and intronic RTEs involved canonical and non-canonical donor and acceptor splice sites often in close proximity in the same intronic RTE (Supplementary Figure S4). Novel splicing was also detected between exonic RTEs, usually inverted Alu repeats in 3 UTRs of annotated genes. In many cases, such apparent splicing was an artifact created during library preparation, wher e r e v erse transcription omits hairpin structures created by inverted Alu repeats, as previously described ( 44 ).
Howe v er, actual splicing e v ents were supported for inverted Alu repeats in 3 UTRs of certain genes, such as METTL16  Figure S5). In these cases, splicing involved canonical splice donor and acceptor sites and spliced reads spanning ADAR-edited inverted Alu repeats were detected in direct long-read RNA-seq data from HEK293T cells (Supplementary Figure S5), which is consider ed fr ee from such artifacts ( 44 ).
Other types of chimeric transcripts included fully or partially annotated and unannotated isoforms transcribed from known genes and r epr esenting fully spliced, mature mRNAs that included RTEs as alternati v e e xons or alternati v e promoters, as exemplified by the CASP8 , EPB41L5 or DNAJC5 loci (Supplementary Figure S6-S8).
Ther efor e, autonomous transcription of stand-alone RTEs was a relati v ely minor contributor to the increased RTE r epr esentation in ESCA transcriptomes, which was instead caused primarily by transcription of RTEs within annotated genes and inclusion in alternati v ely spliced isoforms of annotated genes or transcripts overlapping with annotated gene bodies.

Diagnostic and prognostic properties of RTE transcriptional inclusion in EAC
We examined if the predictable inclusion of RTEs in EACspecific RNA transcripts could improve EAC detection and diagnosis. To this end, we defined a signature of 29 transcripts (Supplementary Table S2) from the previously selected 4844 EAC-specific transcripts r ecurr ently expr essed in TCGA EAC (n = 78) and OCCAMS EAC ( n = 99) pub licly availab le samples, as well as an additional OC-CAMS EAC cohort ( n = 128) (Figure 3 A). Many of these 29 transcripts were also expressed in BE samples, as well as in ESCC and other cancer indications, but were absent from any normal tissue we analyzed (Figure 3 A). Moreover, 8 of the selected 29 transcripts were highly specific to EAC when compared with BE samples, in which they were not expressed (Figure 3 B). These included two transcripts from the GNGT1 locus exonising an L1 element ( GNGT1-L1PB1 ), w hich was highl y expressed also in ESCC, and two from a stand-alone HERVH provirus on Chr Xp22.32, which were relati v ely absent from ESCC (Figure 3 B). Notab ly, e xpression of these 8 transcripts was largely nonoverlapping in EAC, with GNGT1-L1PB1 and HERVH Xp22.32 expressed in a m utuall y exclusive manner (Figure 3 B, C), suggesting that they r epr esented distinct EAC subtypes.
Whereas HERVH Xp22. 32 transcripts corresponded to an annotated HERVH provirus that has been previously r eported highly upr egulated in COAD ( 45 ), the GNGT1-L1PB1 transcripts were not previously annotated. Inspection of the locus re v ealed that they were partially assembled transcripts belonging to a larger transcript, which origina ted a t an L1PA2 element > 320 kb upstream of the GNGT1 gene (Supplementary Figure S9A). This transcription start site and first exon matched the annotated GNGT1-205 isoform (ENST00000455502.5). A transcript matching GNGT1-205 was independently reported in a recent pan-cancer analysis (r eferr ed to ther e as L1PA2 GNGT1), where it was also found to produce antigenic peptides from an alternati v e open reading frame, largely embedded in the L1PA2 element ( 46 ). Howe v er, splicing was found here considerably mor e fr equently between the common initiating L1PA2 element and a second L1PB1 element, where transcription terminated without extending to the remaining GNGT1 gene (Supplementary Figure S9A). The latter transcript (r eferr ed to her e as L1PA2-L1PB1 ), was also detected in ISO-seq data from the ESCC cell line KYSE140 (Supplementary Figure S9A) and canonical donor and acceptor splice sites were confirmed by sequencing of RT-PCR amplicons from HARA and LK-2 cells (Supplementary Figure  S9B). L1PA2-L1PB1 expression was significantly upregulated in multiple types of cancer and was found at higher le v els than transcription of GNGT1 , which was also cancerspecific (Supplementary Figure S9C).
To evaluate the diagnostic properties of the EAC-specific transcripts, we calculated the cumulati v e e xpression of the eight selected transcripts (by taking the sum of the z -scores of all transcripts in all available samples). When applied to the 29 selected transcripts, this metric distinguished EAC and normal samples with 99% sensitivity and specificity (Figure 3 D). Restricting the analysis to the 8 selected transcripts largely retained the ability to separate EAC and normal samples (97%) and additionally separated EAC and BE samples with reasonable sensitivity and specificity (89%) (Figure 3 D). These results highlighted characteristic transcriptional changes in EAC that can be re v ealed by analysis of only a few selected loci.
We ne xt inv estiga ted if distinct EAC subtypes indica ted by non-ov erlapping e xpression of some of the diagnostic EAC-specific transcripts may also follow different disease trajectories. To explore this possibility, we estimated the potential effect of aberrant RTE transcriptional inclusion on EAC survival, calculated as the hazard ratio for each OC-CAMS EAC cohort separately, for additional validation. Of the 4593 EAC-specific transcripts expressed in both cohorts, 282 were significantly prognostic ( P < 0.05 in both cohorts separ ately; hazard r atio ≥2 or ≤0.5) (Supplementary Table S3), with a majority (215) being protecti v e (hazar d ratio ≤ 0.5) (Figure 4 A). A majority of prognostic transcripts were fully intronic (Figure 4 B), as would be expected f or an y fraction of the EAC-specific transcripts.
As fully intronic transcripts often resulted from incomplete intron splicing in EAC, we examined if their association with survival reflected the expression of the genes in which they resided. As an independent measure of incomplete intron splicing indicated by the contigs in our assembl y, we additionall y calculated the expression of each gene considering exons and introns separately (Materials and Methods). Gi v en that introns only from transcribed genes can be present in the transcriptome, we observed a significant positi v e correlation between e xon and intron r epr esentation for each gene, but also considerable variation between genes (Figure 4 C). Hazard ratio calculations identified 410 and 364 prognostic genes, when exon and intron expr ession wer e consider ed separately, r especti v ely, with 204 of these at the intersection. For the majority of the intersecting 204 genes, exon and intron expression correlated with survival in the same direction, with a majority again being protecti v e (hazar d ratio ≤ 0.5), with the e xception of six genes, whose intron expression was associated significantly with survival but in the opposite direction than that of exon expr ession (Figur e 4 D). However, all six of these r epr esented shar ed introns between the gene of interest and overlapping antisense transcripts (Figure 4 D). These findings indicated that the survival association of increased intron representation in EAC reflected a contribution of individual genes, in which these introns resided, as well as the underlying process responsible for their incomplete removal. Indeed, the genes associated with longer EAC survi val comprised se v eral splicing factors, including CRNKL1 and HNRNPU , which have been previously associated with EAC survival ( 47 ), and interferon signature genes (ISGs) (Figure 4 E).

HERVH tr anscriptional activ ation corr elates with better EAC prognosis
Whereas a majority of prognostic EAC-specific transcripts were intronic contigs associated with gene transcription and aberrant splicing, few corresponded to stand-alone RTEs (Supplementary Table S3). The latter included the diagnostic HERVH Xp22.32 provirus and another HERVH provirus on Chr 1p31.3, significantly associated with better and worse EAC prognosis, respecti v ely (Supplementary Table S3; Figure 5 A). Although the over expr ession of HERVH Xp22. 32 in COAD has long been reported ( 45 , 48 ), its significance remained uncertain. Global HERVH upregulation has very recently been linked with worse COAD survival, but this related predominantly to HERVH 1p31. 3 and a separate HERVH provirus on Chr 6q24.2, whereas HERVH Xp22. 32 was not reported to affect survival ( 49 ). Consistent with this recent report, we found that HERVH 1p31.3 activation associated with worse EAC survival, in contrast to HERVH Xp22. 32 activation (Supplementary Table S3  subfamily as a whole, we looked for the effect of transcripts from all HERVH proviruses that were included in the selected 4844 EAC-specific transcripts. In addition to HERVH Xp22. 32 and HERVH 1p31.3 , another 6 HERVH proviruses were transcriptionally activated in EAC and this activation was more frequently associated with better survival, mirroring the effect of HERVH Xp22. 32, although this association did not reach statistical significance in both OCCAMS EAC cohorts (Figure 5 A). Moreov er, e xpression of the different HERVH proviruses was not coordinated and for some was m utuall y e xclusi v e (Figure 5 B-F). Indeed, both in OCCAMS EAC and TCGA EAC samples, HERVH Xp22. 32 was the most prominently expressed provirus, following by HERVH 13q33.3 , whereas HERVH 1p31.3 was only sporadically expressed ( Figure  5 B-D). The higher expression of HERVH Xp22. 32, compared with other HERVH proviruses in EAC, was orthogonally confirmed using featureCounts ( Supplementary Figur e S10). Mor eover, comparison of T CGA EAC and ESCC samples re v ealed a shift from HERVH Xp22. 32 to HERVH 13q33. 3 and HERVH 1p31.3 expression (Figure 5 C, D). Similarly to all EAC samples, TCGA COAD samples expressed most prominently HERVH Xp22. 32 and only rarely HERVH 1p31.3 (Figure 5 E, F). This pattern of expression was also observed in gastrointestinal cancer cell lines from CCLE, where EAC cell lines are notably rare, with HERVH Xp22. 32 expr essed pr edominantly in adenocar cinomas, HERVH 13q33. 3 expressed also in squamous cell carcinomas, and HERVH 1p31.3 expressed more rarely (Figure 5 G).
Disparate HERVH provirus expression within and between gastrointestinal cancers indicated independent regulation or responsi v eness to transcription factors. Expression of stand-alone proviruses is dri v en by their LTRs. Through phyloregulatory analysis, HERVH LTR subfamilies have r ecently been r eannotated ( 50 ), with HERVH 13q33. 3 and HERVH 1p31.3 belonging to the new LTR7u2 subfamily, whereas HERVH Xp22. 32 belongs to the LTR7Y subfamil y. Importantl y, L TR7Y and L TR7u2 differ considerably in their responsi v eness to tr anscription factors, particular ly KLF5, which targets pr efer entially the LTR7Y subfamily ( 50 , 51 ). Consistent with earlier analyses ( 50 , 51 ), HERVH Xp22. 32 L TR7Y L TRs contained twice as many consensus KLF5 binding sites than the LTR7u2 LTRs of the other tw o pro viruses (Supplementary Figure S11). Differences between the proviruses, as well as between the 5 and 3 HERVH Xp22. 32 LTRs, were also noted for SOX9 binding sites (Supplementary Figure S11). To validate the predicted effect of KLF5, we analyzed direct KLF5 binding to and expression of the three HERVH proviruses, using ChIP-Seq and RNA-seq data from the EAC and COAD cell lines OE19 and HT-55, respecti v ely ( 52 , 53 ). Although the thr ee proviruses wer e expr essed a t dif ferent le v els in the two cell lines, KLF5 binding to the proviral LTRs was evident in all cases (Figure 6 A, B). Moreover, loss of KLF5 activity reduced the expression of the three proviruses in both cell lines (Figure 6 A, B) Whilst these findings demonstrated that KLF5 was necessary for HERVH expression, they also indicated that it was not always sufficient. For example, despite high KLF5 activity in OE19 cells ( 52 ), HERVH Xp22. 32 was modestly expr essed (Figur e 6 A). Furthermor e, over expr ession of KLF5 in these cells did not raise HERVH Xp22. 32 expression further (Figure 6 C). As a control we examined another LTR7Y HERVH provirus in the CALB1 locus, which we have recently found to be controlled by KLF5 in squamous lung cancer ( 51 ), and which r eadily r esponded to KLF5 ov ere xpression in OE19 cells too (Figure 6 C). Although KLF5 was not sufficient to induce HERVH Xp22. 32 expression in OE19 cells, over expr ession of SOX9 in these cells exhibited a significant effect (Figure 6 C). Therefore, KLF5 or SOX9 exerted the strongest activa ting ef fect on all three proviruses examined. In contrast, loss of ARID1A, which was recently suggested to be responsible for overall HERVH activation in COAD ( 49 ), was rather specific to HERVH 1p31.3 . Indeed, reanalysis of RNA-seq data from the COAD cell line , demonstrated that, in contrast to HERVH 1p31.3 , which was strongly upregulated upon loss of ARID1A, HERVH 13q33. 3 was downregulated and the remaining proviruses were either not expressed or not affected (Supplementary Figure S12). Collecti v ely, these data highlight the independent regulation particularly of HERVH Xp22. 32 and HERVH 1p31.3 , which reconciles their contrasting association with EAC survival.

HERVH xp22.32 activation defines novel EAC molecular subtypes
As the dominant HERVH provirus expressed in EAC, we ne xt e xplored whether the transcriptional acti vation of HERVH Xp22.32 was associated with additional molecular fea tures tha t could account for its association with better ov erall survi val. Firstly, we e xamined if EAC subsets defined by high or low HERVH Xp22.32 (using 1 TPM as the cut-of f), ma tched EAC and ESCC subtypes described previously based on transcriptional profiles ( 54 , 55 ) or epigenetic changes ( 56 ). This analysis re v ealed only minimal overlap between HERVH Xp22.32 and previously defined subsets (Supplementary Figure S13), suggesting that HERVH Xp22.32 marked a distinct molecular process.
In the progressi v e stages leading up to EAC, HERVH Xp22.32 was rarely or weakly activated in OCCAMS BE samples, but was more frequently and strongly activated in EAC (Figure 7 A). Similar results were additionally obtained by analysis of an independent dataset of normal esophagus, BE and EAC samples ( 57 ) (Figure 7 A). Moreover, in paired OCCAMS BE and EA C samples, HER VH Xp22.32 was significantly upregulated in the latter (Figure 7 A), suggesting that the progression of BE to EAC is characterized by HERVH Xp22.32 activation in a substantial proportion of patients. Howe v er, EAC samples with high HERVH Xp22.32 expression were transcriptionally more distant from BE samples than EAC samples with low HERVH Xp22.32 expr ession wer e (Figur e 7 B), indicating that HERVH Xp22.32 activation following BE progression to EAC is linked with a departure from the BE transcriptional profile.
Gi v en that HERVH Xp22.32 defined EAC subsets did not correspond to previously defined subsets, we investigate further characteristics. In the OCCAMS cohorts, HERVH Xp22.32 high and low subsets had a similar total number of alterations in key dri v er genes (with 3.77 and 3.63 average number of altered dri v er genes per sample in each subset, respecti v ely), as well as in a majority of these genes indi vidually. Howe v er, significant differences (linear regression P < 0.05, q < 0.05) were observed for cell-cycle regulators, particularly in the balance of cyclin D and cyclin E altera tions, previously associa ted with ESCC and EAC , respecti v ely ( 9 ). Gain-of-function mutations in cyclin D subunits and loss-of-function mutations or deletions of its inhibitor CDKN2A were significantly enriched (linear regression P < 0.05, q < 0.05) in low HERVH Xp22.32 samples (Figure 8 A). In contrast, high HERVH Xp22.32 samples had significantly more frequent gain-of-function mutations in cyclin E (Figure 8 A). Linear r egr ession analyses identified 839 assembled transcripts, the expression of which was significantly ( P < 0.05, q < 0.05) correlated with HERVH Xp22.32 expression (Figure 8 B). Importantly, a majority (833) of these transcripts, comprising predominantly aberrant or intronic contigs, were positi v e correlated with HERVH Xp22.32 e xpr ession (Figur e 8 B), suggesting that the increased transcriptional di v ersity that characterized EAC is more pronounced in high HERVH Xp22.32 samples. In contrast, transcriptional analysis of annotated genes (at the exon and intron le v el), identified 1756 genes, a vast majority of w hich, particularl y the exons, were significantl y down-regulated in high HERVH Xp22.32 samples (Figure 8 B). Pathway analysis of the genes downregulated in samples with high HERVH Xp22.32 expression indicated substantial alterations in metabolic and transport pathways (Figure 8 C), implying cell-intrinsically reduced fitness associated with HERVH Xp22.32 activation. The downregulated genes also included a smaller number of splicing and nucleosome factors (e.g. CTCF ) ( Supplementary Figure S14A), most of which were the same factors identified by the association with better EAC prognosis of their intronic RTE-overlapping transcripts (e.g. CRNKL1 and HN-RNPU ) (Figure 4 E). With the exception of IRF2, ISGs were nota bly a bsent from differ entially expr essed genes and deconvoluted immune cell signatures were also similar between HERVH Xp22.32 high and low subsets, except for Treg cells and M2 macrophages, which were enriched in high and low HERVH Xp22.32 samples, respecti v ely (Supplementary Figure S14A, B).
Together, these findings suggested that transcriptional activation of HERVH Xp22.32 was closely linked with incomplete RNA splicing, resulting in the EAC-characteristic incr eased r epr esentation of intronic RTEs, and parallel reduction in the expression of the fully-spliced, functional mRNA isoforms. Indeed, intersecting the genes with increased intronic but decreased exonic r epr esentation identified 19 genes where the e xon / intron e xpression ratios were significantly lower in high than in low HERVH Xp22.32 samples (Figure 8 D). These were exemplified by ZDHHC20 , an integral component of Golgi membrane involved in protein translation, and PBRM1 , a chromatin r emodeler, wher e mapping of RNA-seq reads demonstrated a shift in the coverage of introns at the expense of exons, in high HERVH Xp22.32 samples (Figure 8 E). In contrast, intronic reads wer e rar e at the same loci in low HERVH Xp22.32 samples (Figure 8 E). These observations support a model whereby HERVH Xp22.32 activation marks reduced expression of the functional isoforms of essential genes, owing to defecti v e mRNA splicing.

DISCUSSION
We hav e e xamined the transcriptional landscape of RTEs in EAC and described the origins and predicti v e value of its complexity. We found that incomplete RNA splicing affects both EAC and ESCC, and is shared with STAD, OV and COAD. In an independent analysis of the number of cancer-specific e xon-e xon junctions, OV stood out among all cancer types followed by li v er hepatocellular carcinoma (LIHC), ESCA and STAD ( 58 ). Similarly, OV, ESCA and STAD were the top 3 cancer types in an analysis of cryptic introns ( 59 ), although many of them may have been cDNA synthesis artifacts ( 44 ). Collecti v ely, these studies underscore the aberrant splicing patterns observed repeatedly in ESCA and related STAD.
Incomplete removal of introns will elevate the r epr esentation of RTEs, which are found in abundance in intronic regions, and gi v e the impression of an increase in independent RTE tr anscription ( 60 ). Libr aries pr epar ed from rRNAdepleted RNA, such as those from the OCCMAS cohorts, may be enriched for incompletely processed or unprocessed pre-mRNAs, with the potential to distort the r epr esentation of intronic repeats ( 61 ). Howe v er, it is unlikely that incomplete RNA splicing reported here for EAC resulted from the use of rRNA-depleted RNA libraries for the following reasons. Firstly, the original de novo transcriptome assembly employed only poly(A)-selected RNA data from