LINE-1 retrotransposon expression in cancerous, epithelial and neuronal cells revealed by 5′ single-cell RNA-Seq

Abstract LINE-1 retrotransposons are sequences capable of copying themselves to new genomic loci via an RNA intermediate. New studies implicate LINE-1 in a range of diseases, especially in the context of aging, but without an accurate understanding of where and when LINE-1 is expressed, a full accounting of its role in health and disease is not possible. We therefore developed a method—5′ scL1seq—that makes use of a widely available library preparation method (10x Genomics 5′ single cell RNA-seq) to measure LINE-1 expression in tens of thousands of single cells. We recapitulated the known pattern of LINE-1 expression in tumors—present in cancer cells, absent from immune cells—and identified hitherto undescribed LINE-1 expression in human epithelial cells and mouse hippocampal neurons. In both cases, we saw a modest increase with age, supporting recent research connecting LINE-1 to age related diseases.


INTRODUCTION
In the human genome, LINE-1 is the only family of retrotransposons--sequences capable of copying themselves to new genomic loci via an RNA intermediate--that remains active and autonomous (1). LINE-1 element expression is driven by a promoter that lies within its 5 UTR sequence. It is hence an unusual promoter in that it contains 100% 'downsteam elements' (2), which is necessitated by its mode of retrotransposition that requires it to 'take its promoter with it'. Once expressed and translated, LINE-1 mRNAs complex with their two potentially cis-acting (3) protein products, the RNA binding ORF1p (4) and the endonuclease/reverse transcriptase ORF2p (5,6), to form LINE-1 RNPs that are primarily cytoplasmic but can enter the nucleus during mitosis (7). In the nucleus, ORF2p reverse transcribes a new LINE-1 DNA copy at a site of genomic insertion through a process called target primed reverse transcription (TPRT) (8). In addition to its potential to insert into and disrupt protein coding genes (9)(10)(11)(12), LINE-1 expression can contribute to DNA damage (13,14) and induce an interferon response (14)(15)(16)(17). LINE-1 is highly active in cancer, with some tumors bearing more than 100 new LINE-1 insertions (18)(19)(20)(21)(22)(23). It is also active in the germline where it generates insertions that are polymorphic in the human population (24) and occasionally contributes to heritable genetic disorders (9).
The extent of LINE-1 activity outside these two contexts, namely in non-tumor somatic tissues, has been a subject of vigorous debate for decades (25)(26)(27)(28)(29)(30)(31). Recently, several studies have suggested that the epigenetic changes that occur during aging provide an opportunity for LINE-1 and other normally silenced genomic regions to become active and contribute to the aging phenotype and related diseases (16,32,33). We therefore sought a highly sensitive method that would use standard techniques to identify LINE-1 RNA expression and to ascertain what counts as normal somatic LINE-1 expression and how that changes in disease and aging contexts. Quantifying transcription from repetitive sequences poses a number of challenges (34,35). Notably, the very high copy number of LINE-1 sequences means that methods based on simple read counts of LINE-1 RNA are confounded by the fact that the hundreds of thousands of LINE-1 copies include many that lie within cel-lular transcription units and thus those transcripts are not driven by the LINE-1 promoter, but instead are 'passively co-transcribed'. We therefore leveraged 5 targeted single cell RNA-seq data sequenced with reads that are at least 100 base pairs long to develop 5 scL1seq. Because 5 scL1seq is a straightforward method to extract information about LINE-1 expression from reads sequenced from a standard library prep (5 targeted 10× genomics single cell RNAseq), it can easily be deployed by researchers interested in whether LINE-1 expression is present in particular disease models or samples.
We applied this method to publicly available data from 15 healthy somatic tissues, and found that LINE-1 is indeed expressed in normal epithelial cells. We also analyzed skin samples from two younger (under 40) and two older (over 80) non-melanoma skin cancer patients and found that epithelial LINE-1 expression was much higher in malignant cells and somewhat higher in the older patients. Finally, we looked at hippocampal cells from two young (4 month old) and two old (24 month old) mice, and found LINE-1 to be expressed in mouse hippocampal neurons, with modestly increased expression in the older neurons.
Several methods for quantifying TEs from single cell RNA-seq do exist (36)(37)(38). In this paper, we additionally argue that for short read 10x genomics sc-RNA-seq, only the 5 targeted method accurately identifies LINE-1 expression. The challenges of quantifying TE transcripts from 3 biased single cell data has been noted (39), and a previous paper did note that 5 targeted single cell methods can be used to detect transposable elements expressed from their own promoters, but did not explicitly dive into LINE-1 expression (40). In this paper, we develop a strategy that is specifically targeted the overcome the challenges of quantifying LINE-1. Long reads have also been shown to be an effective method for the detection of transposable element transcripts in single cell data (41). Long reads provide an even clearer picture of the relevant transcript than 5 aligning short reads do, but currently, short read sequencing remains the most cost-effective way to do high throughput single cell RNA-seq, and thus methods are needed to quantify LINE-1 expression in this kind of data.

Construction of the cellranger index
To build the custom reference genome, L1Hs and L1PAx family repeats were masked in hg38 using bedtools maskfasta. L1Hs and L1PAx annotations were downloaded in bed format from the UCSC genome table browser (https: //genome.ucsc.edu/cgi-bin/hgTables). Then this masked reference was concatenated with the human L1Hs consensus sequence and all available L1PA consensus sequences from Dfam (https://www.Dfam.org/browse?name accession = L1PA&clade descendants = true). For transcript annotation, we used the RefSeq GRCh38 annotation in gtf format. A cellranger index was then built with cellranger mkref.
The mouse version of L1-sc was generated in the same manner as human. We first used bedtools maskfasta to mask L1Md sequences (downloaded from the UCSC genome table browser in bed format) from the mm39 mouse genome. We then added L1Md consensus sequences in Dfam (https://Dfam. org/browse?name accession = L1Md&classification = root;Interspersed Rep eat;Transposable Element; Class I Retrotransposition;LINE;Group-II;Group-1;L1like;L 1-group;L1&clade = 10088&clade descendants = true) as decoy chromosomes.
L1MdTf II/III, L1MdGf II and L1MdA II/III were excluded as they are highly similar to the L1MdTf I, L1MdGf I and L1MdA I consensuses.

Quantification of LINE-1 UMIs in humans
After running cellranger using the custom reference described above, we looked at the position sorted STAR aligned cellranger output (file name = possorted bam.bam) to identify LINE-1 UMIs in reads aligning to the L1Hs consensus 'decoy chromosome'. Briefly, reads aligning in the appropriate location (5 end or 3 depending on the chemistry) and orientation were filtered for excessive mismatches and clipping. The barcodes were extracted from the remaining reads to assign LINE-1 expression to specific cells. The specifics of the read filtering strategy are as follows: for the 3 targeted data, unclipped reads aligning in the final 1kb (but in not the poly A stretch at the very 3 end) of the L1Hs consensus with two or fewer mismatches were checked for the presence of the cell barcode (CB) and UMI barcode (UB) flags. If both were present the UB was added to a python 'set' object within a python dictionary with the CB being the key. Once the whole bam file has been read, the number of L1Hs UBs associated with each CB is reported. For the 5 targeted data from HeLa cells, quantification was performed in the same manner except reads were instead collected from either the first 150 or first 500 bases of the L1Hs consensus. For the 5 data with 100 or 150 bp paired end reads, UMIs were only counted if they have a proper (sam flag 2 is present) read 1/read 2 alignment that meets the following criteria: For read 1, the first aligned base must in the first 20 bases of the L1Hs consensus, and the read must have less than 20 bases clipped from 5 end, no clipping at the 3 end, and no more than 2 mismatches. (Allowing for clipped bases at the 5 is necessary because while cellranger will trim off the barcode sequences, the TSO sequence and a variable number of G bases added during the template switching process will remain.) For read 2, neither end can be clipped and no more than 2 mismatches are allowed. To recreate the behavior of cellranger's aggr function, when combining samples, reads were downsampled to assure even coverage between samples. Reads must also reflect the strand-specificity of 10× genomics. For the 100/150 base paired reads, read 1 is aligned on the + strand and read 2 is aligned on the--strand.

Pseudobulk L1EM analysis
To turn 10x genomics single cell data into pseudobulk, the first 46 bases were removed from read 1 using trimmomatic (58) (HEADCROP: 46). This removes the barcodes, the TSO sequence and any extra G bases added during the template switching. To make L1EM (https://github.com/ FenyoLab/L1EM) able to handle 5 data, we shortened the LINE-1 annotation in L1EM.400.bed to only cover the first Nucleic Acids Research, 2023, Vol. 51, No. 5 2035 100 bases of the element. Thus, to be counted as active expression, the first base of aligned read must align within the first 100 bases of a 5 UTR intact LINE-1 element. Because 10x genomics data includes many more reads than a standard RNA-seq experiment, the following changes were made to control memory usage by L1EM: threads = 8, re-alignNM = 2, L1EM NM = 2, NMdiff = 1.

Quantification of LINE-1 in mouse
The mouse L1 promoter has a somewhat different structure from the human version, making it more challenging to quantify 5 targeted LINE-1 reads in mice. Rather than using the downstream promoter method that allows human LINE-1 to carry its promoter to a new location, the mouse 5 UTR contains multiple copies of its promoter (i.e. a series of tandem repeats of ∼212 bp), with additional copies potentially being generated during insertion (presumably by slippage during reverse transcription). Given this repetitive 5 UTR structure, it was not obvious where the true 'start' site was located in the monomer. We analyzed the monomer sequence from the Orleans Reeler L1MdTf insertion and looked for a match to the initiator consensus sequence Inr (59,60). We found a single perfect match to Inr, and used this as our landmark to the inferred start site of L1MdTf initiation. When we aligned reads to this modified consensus, we found a read 1 peak at the initiator sequence ( Figure 5A). However, the exact mouse LINE-1 TSS was not as precise as it was in human LINE-1, so we included all read pairs with a read 1 fall within a tandem repeat in our mouse LINE-1 quantification. The same clipping and mismatch filters were applied: no more than 2 mismatches and no clipping except up 20 bases at 5 end of read 1. Reads must also reflect the strand-specificity of 10× genomics: read 1 is aligned on the + strand and read 2 is aligned on the--strand.

Single cell RNA-seq clustering and cell identification
A count matrix was built using cellranger count and the index described above (default parameters). Cellranger aggr was used to merge related runs. Clustering and UMAP embedding was performed in Seurat(61-64) according to the PBMC 3k tutorial (https://satijalab.org/seurat/ v3.2/pbmc3k tutorial.html). Cells with >25% MT-RNA, or <1000 RNA molecules detected were removed. Clusters showing markers for disparate cell types were taken to be doublets and removed. Human cell types were then identified using marker genes: CD3E for T cells, MS4A1 for B cells, MZB1 for plasma cells, GNLY for NK cells, LYZ for macrophages, KIT for mast cells, COL1A2 for fibroblasts, ACTA2 for muscle, PLVAP for endothelial cells, MLANA for melanocytes, and keratin (KRT) genes for epithelial cells (Supplemental Figures S4-S7). For keratinocyte clusters we used: KRT5 for basal cells, KRTDAP for suprabasal cells, KRT28/19 for hair follicle associated cells, PPARG for sebocytes and KRT77 for sweat duct cells (Supplemental Figure S8). Malignant cells were identified as CNV+ cells by inferCNV (Trinity CTAT project: https: //github.com/broadinstitute/inferCNV) (Supplemental Figure S9). The following markers were used in the mouse hippocampus: Gfap for astrocytes, Tmem119 for microglia, Myrf for oligodendrocytes, Rbfox3 for neurons, Cspg4 for OPCs, Cldn5 for ependymal cells, and Slc6a13 for vascular cells. Interneurons were distinguished by Slc6a1 expression, DG neurons by C1ql2, and CA neurons by Neurod6 and Dkk3 (65) (Supplemental Figure S10). UMAP plots were made using the DimPlot and FeaturePlot functions in Seurat.

scRNA-seq from HeLa ± L1RP.
Cell line and plasmid: HeLa-M2 cells (HeLa cells with a reverse tetracycline-controlled transactivator gene coding for rtTA2-M2, a gift from Gerald Schumann, Paul Ehrlich Institute) were cultured in DMEM supplemented with 10% FBS (Gemini #100-106) and 1× Penicillin-Streptomycin-Glutamine (Ther-moFisher #10378016). The cells were routinely tested for Mycoplasma by PCR detection of conditioned medium. The L1rp reporter construct--pLAK002 (tet-inducible promoter-L1rp-GFP-AI) plasmid was introduced into HeLa-M2 cells by plating 0.2 × 10 6 cells per well in a 6well plate the day before transfection. The following day, cells were transfected with pLAK002 using FuGENE-HD Transfection Reagent (Promega #E2311), following the manufacturer's protocol. 48 h post-transfection, 1 g/ml of puromycin was added. The following day, cells were trypsinized and replated in a 10 cm plate with 1 g/ml of puromycin added. Puromycin was replenished every other day for 14 days until no cell death was observed.
Retrotransposition assay in 6-well plates: Once selection was complete, cells were trypsinized and plated at 2.0 × 10 5 cells per well in a 6-well plate. 24 h post cell plating, doxycycline (1 g/ml) was added to induce L1 expression. After 72 h, the percentage of GFP + cells was measured by flow cytometry (FACS buffer: PBS + 2% FBS) to confirm L1 retrotransposition and samples were submitted for scRNAseq.
Single-cell RNAseq: Cells were detached from the plate with TrypLE™ Express Enzyme (ThermoFisher #12604013) following the manufacturer's protocol and then resuspended in DMEM at 0.015 × 10 6 cells per 38.7 l and submitted to the NYULH Genome Technology Core Facility. scRNA-seq was performed using the 10x Genomics system using 5 gene expression library.

Negative binomial generalized linear models
Because LINE-1 expression is low in normal cells, there are usually only a few LINE-1 UMIs detected in a single cell. As a result, LINE-1 detection is biased toward cells with a greater number of UMIs, causing the popular wilcoxon rank sum test on log normalized expression to, in some cases, yield biased results. We therefore chose to estimate LINE-1 expression using a negative binomial generalized linear model with log link function: where U L1 is the number of LINE-1 UMIs in a cell, U total is the total number of UMIs in that cell, X is a predictor (such as cell type or age), a, b are parameters to be learned and is the negative binomial error term. These models were fit using the glm.nb function in the MASS package for R, using the 'offset' function to ensure that the coefficient on log(U total ) has a fixed value of 1. P values were estimated by anova. Under this model, transcripts per million (TPM) can be estimated as:

Single cell suspension derived from patient tumors and matched normal
Patient tumors were processed as described previously (66)

Single-cell library construction and 5 sequencing from patient tumors and matched normal
The cellular suspensions were loaded on a 10× Genomics Chromium instrument to generate single-cell gel beads in emulsion (GEMs). Approximately 10-12 × 10 3 cells were loaded per channel. Single-cell RNA-Seq libraries were prepared using the following single cell 5 reagent kits: Chromium™ single cell 5 Library & Gel Bead Kit, PN-1000006 and chromium single cell VDJ enrichment kit for human T cells (PN-100000). Libraries were run on a No-vaSeq 6000 SP (SCC) or S1 (BCC) flow cell (depending on the number of samples per run).

Mouse hippocampus single nucleus RNA sequencing
For snRNA-seq the hippocampus was dissected from the brains of 4-month old and 24-month old C57BL/6 mice. A total of four animals were used in each age group. The hippocampi from two mice were pooled together into one sample, and the other two mice were pooled in another sample. Nuclei were isolated from minced hippocampi tissue using the Nuclei PURE Prep Nuclei Isolation Kit with a Dounce B homogenizer. Samples were subjected to a sucrose gradient, and nuclei were further purified and counted. We targeted 5000 nuclei per sample to load onto a 10x Chromium chip using VD(J) chemistry. We targeted 50 000 sequence reads per nuclei on an Illumina HiSeq device.

Analysis of GTEx data
LINE-1 quantification in GTEx data were performed on the Cancer Genomics Cloud, using L1EM, which was made available as public tool on the platform. For each sample, ORF1 and ORF2 intact loci expressed at least 2 FPM were combined into a single value of LINE-1 expression.

Targeted scRNA-seq is not appropriate for LINE-1 quantitation
We tested three scRNA-seq methods for their ability to capture LINE-1 expression: the standard 3 targeted 10× Genomics method ( Figure 1A), the standard 10× Genomics 5 method ( Figure 2A) and a 5 library prep followed by 150 base pair (bp) paired end sequencing ( Figure 1B). LINE-1 was quantified through a simple method based on our MapRRCon (42) pipeline ( Figure 1C): Reads are fed into a custom cellranger index that includes a masked hg38 genome with all L1HS and L1PA sequences removed and replaced by consensus sequences available from Dfam. Reads aligned to the L1HS consensus are filtered for alignment quality and then unique molecular identifiers (UMIs) are counted. The default STAR parameters used by cellranger require alignment seeds that match at most 50 loci, meaning that many LINE-1 reads will go unmapped. Replacing the specific loci with a single consensus allows those reads to map uniquely and collects them in a single location for easy identification and quantification. Nevertheless, reliance on the consensus sequence may bias our method toward loci that have 5 ends that are more similar to the consensus sequence. LINE-1 expression (i.e. RNA transcript accumulation driven by the internal. LINE-1 5 promoter), was vastly overestimated when we reanalyzed normal lung, tumor, and metastases (met) samples from lung adenocarcinoma (LUAD) patients sequenced by the 3 method (43). In this data, only the 90 bp read 1 carries any information about the captured RNA molecules ( Figure 1A, top). Read 1 sequences aligned to the L1HS consensus in two peaks ( Figure 1A, middle). The smaller peak likely reflects a cryptic polyA site in the LINE-1 sequence (an AAUAA motif appears at the peak's 3 end.) The second, larger peak occurred as expected at the 3 end of LINE-1. However, there was no difference between normal, tumor and met, despite considerable evidence showing that LINE-1 is upregulated in cancer cells (44).
We then mapped reads from this second peak into a UMAP embedding and found high LINE-1 'expression' in nearly every cell type ( Figure 1A, bottom), further indicating that this method does not reflect the known pattern of tumor LINE-1 expression. LINE-1 is present in the human genome at a very high copy number, but most of these copies are inactivated by mutation and/or 5 truncation. However, LINE-1s, including the half million or so copies that lack promoter activity, are frequently transcribed into non-LINE-1 transcripts, and these non-LINE-1 specific 'passive' transcripts often vastly exceed 'active' LINE-1 expression (45,46). Any of these passive transcripts that are polyadenylated at LINE-1's 3 polyadenylation site will be indistinguishable from active LINE-1 expression in 3 targeted datasets. Furthermore, there may be additional sources of 3 LINE-1 RNA that is not derived from the LINE-1 promoter activity, such as cryptic promoter activity in the 3 UTR (47). Therefore, it is not entirely unexpected that a 3 targeted sequencing method is not effective at measuring LINE-1 expression.

targeted 10x genomics sc-RNA (with an extended read 1 length) captures LINE-1 expression
Because passive LINE-1 transcripts initiate upstream of the element, they can be filtered out by using a 5 targeted method. 10× Genomics methods use a transcript switch oligo (TSO) to identify RNA molecules that have been fully reverse transcribed into cDNA. Polymerase reaching the 5 end of the RNA molecule will change templates at the TSO, yielding a molecule where the TSO sequence is affixed to the 5 end of the fully reverse transcribed cDNA ( Figure 1B, top). Thus, if we perform sequencing that is 100 or 150 bp paired end, we will sequence the TSO/cDNA junction, allowing us to precisely identify LINE-1 RNAs that start within a few bases of the 5 end (i.e. result from LINE-1 promoter activity; see purple segment in  To test this assertion, we aligned reads from a 10x genomics example dataset that includes 150 bp paired end reads from a different--lung squamous cell carcinoma (LSCC)--tumor (https://support.10xgenomics.com/singlecell-vdj/datasets/2.2.0/vdj v1 hs nsclc 5gex). This yielded a large peak of read 1 sequences mapping within 20 bases of the 5 end of LINE-1 (shaded peak in Figure 1B, middle) as well as a few peaks that are likely not representative of LINE-1 promoter activity (gray arrows in Figure 1B, middle). When we only count UMIs from read pairs that have a read 1 in this peak--a method we call 5 scL1seq--LINE-1 expression is almost entirely restricted to the epithelial/cancer cluster, reflecting abundant evidence that LINE-1 is upregulated in cancer cells (14,21,44).

Validation in cell lines
To validate the method of identifying LINE-1 expression from 10x genomics 5 scRNA-seq, we sequenced RNA from a HeLa cell that has been engineered to overexpress LINE-1 (L1RP) alongside the parental (WT) HeLa cell line. We also looked at endogenous expression in two cells: one (MCF-7) that has been shown to express a high level of LINE-1, and one (HCT116) that expresses a low level of LINE-1 (48). The HeLa cell lines were sequenced using the newer v2 chemistry from 10x genomics. By default, this method saves on sequencing cost by sequencing a 26 bp read 1 that only covers the barcode sequences ( Figure 2A). Therefore, only the 90 bp read 2 is aligns to the transcript. Without a TSS defined by read, it is less clear which reads to count as LINE-1 ( Figure 2B). We therefore quantified LINE-1 in two ways: one only counting reads in the first 150 bp of L1Hs and one counting reads in the first 500bp of L1Hs. Nevertheless, both methods show a clear overexpression of L1NE-1 in the L1RP cell compared to the parental (WT) HeLa ( Figure 2C).
To test how the method handles endogenous LINE-1 expression, we sequenced MCF-7, previously shown to express a high level of LINE-1, and HCT116, which expresses a low level of LINE-1 (48). Consistent with that result, we see much higher expression in MCF-7 compared to HCT116 ( Figure 2D). This difference is also reflected in the Nucleic Acids Research, 2023, Vol. 51, No. 5 2039 coverage profile across the L1Hs consensus in these two cell lines ( Figure 2E). To further validate this result, we checked whether we could trace the LINE-1 expression back to the specific loci expressed in these cell lines. To that end, we trimmed the barcode and TSO sequences from read 1 to create pseudobulk RNA-seq reads, and quantified expression at the single locus-specific level using a version of L1EM that was modified for 5 data (see Materials and Methods). For MCF-7, which has high enough LINE-1 expression to clearly identify specific expressed loci, we found that our L1EM analysis agreed with Philippe et al. (2016) (48) about the top locus and three of the fie most highly expressed full length L1Hs loci (non-reference loci were excluded from the top 5 since they are not considered by L1EM). Together, these results suggest that reads aligning at the 5 end of the L1Hs consensus indeed reflect bona fide 5 UTR promoted LINE-1 expression.

LINE-1 is expressed in normal human epithelial cells
Having developed a high throughput method to measure LINE-1 RNA expression in single cells, we returned to the primary question motivating this study: how much LINE-1 expression occurs in normal, adult somatic cells? To that end, we reanalyzed 5 targeted single cell RNA-seq data sequenced using 150 bp paired end reads from 15 healthy human tissues: bile duct, bladder, blood, esophagus, heart, liver, lymph node, bone marrow, muscle, rectum, skin, small intestine, spleen, stomach and trachea (49). Fibroblasts, endothelial, muscle and immune cells clustered by cell type regardless of tissue of origin and showed almost no LINE-1 expression. However, epithelial cells, which clustered by tissue of origin, showed clear evidence for LINE-1 expression ( Figure 3A, Supplemental Figure S2). The presence of LINE-1 expression in normal epithelial cells is reminiscent of the fact that frequent retrotransposition events have been identified in epithelial cell derived tumors (carcinomas), but retrotransposition is infrequent or absent from cancers derived from blood and brain (21,47,50).
To quantify the LINE-1 enrichment in epithelial cells, we fit a negative binomial generalized linear model (nb glm) to the number of LINE-1 UMIs in each cell, using the total number of UMIs per cell and whether a cell is epithelial as predictors. Using this method, we found that LINE-1 expression is ∼17× higher in epithelial cells compared to others in this dataset (P < 10 −16 , Figure 3B, red stars). We then fit similar models to non-epithelial cells, separately considering each cell type as a predictor and found that LINE-1 expression is about twice as high in T cells compared to other non-epithelial cells (P = 7.2 × 10 −5 ) and about 73% higher in endothelial cells (P = 0.01, Figure 3B, gray stars). We also used the nb glm to compare epithelial cells from different individual tissues samples. The esophagus sample was most significantly enriched for LINE-1 expression (P < 10 −16 ), but we also identified increased LINE-1 expression in epithelial cells from the bladder (P = 0.01) and trachea (P = 0.02).
Because the overall signal of LINE-1 expression is low in normal epithelial cells and there is a fair amount of background (Supplemental Figure S1), we wanted to map reads back to specific loci to see if they actually come from actively expressed LINE-1 loci. To that end, we applied the 5 pseudobulk L1EM method (see above and methods) to the esophageal sample, where LINE-1 expression was predicted to be highest. We found the top expressed locus to be an intact L1Hs locus antisense to an intron in the TT28 gene, located in the 22q12.1 cytoband. This locus has previously been shown to be a highly active locus that generates a large number of 3 transductions in cancer (20,21,23). Looking at the pattern of reads aligning upstream of a putatively expressed LINE-1 locus has been previously used to validate its expression (46,51). We therefore plotted the read coverage for 50 kb upstream of the three most highly expressed loci (see Supplemental Figure  S2). Because the data is biased toward the 5 end, it is difficult to be absolutely certain where the transcripts fall. Nevertheless, we do not find reads piling up immediately upstream of these loci and for two of the three top loci, the upstream read peaks are far smaller than the number of reads pairs predicted to be derived from the 5 end of the element by L1EM. (Note that reads within the L1 element multimap. The L1EM expected read counts are much higher because they include reads that miss map to other highly similar loci.) Together these observations build confidence that the 5 read alignments really do reflect proper LINE-1 expression.
We then wanted to know whether the normal epithelial cell LINE-1 expression that we found in single cells was also detectable in bulk samples. A recent study showed that, indeed, LINE-1 expression can be seen in normal tissues from GTEx (52), including high expression in skin, a tissue with a large epithelial cell component. To corroborate these results, we applied our L1EM algorithm (45) to healthy tissue data from GTEx and found that LINE-1 was most often detected in tissues with a significant epithelial cell component. We detected LINE-1 expression (at least one L1Hs locus with >2 fragments per million/FPM) in at least 25% percent of samples from: esophageal mucosa, kidney, lung, ovary, salivary gland, skin, testis and vagina. On the other hand, LINE-1 expression was rarely detected in tissues lacking in epithelial cells such as adipose (fat), brain, muscle and blood ( Figure 3D).
We then searched a dataset of particularly deep mass spectrometry proteomics from 29 healthy tissues for spectra matching peptides in the LINE-1 ORF1 protein (53), following our previously described protocol (54). At least 4 peptide spectral matches (PSMs) were identified in lung (51), ovary (23), placenta (10) and prostate (4), but ORF1p was not detected in tissues lacking epithelial cells, including bone marrow, brain, fat, lymph node and smooth muscle ( Figure 3E). This suggests that LINE-1 ORF1p is present in these tissues, but typically lies near or below the detection limit for mass spectrometry proteomics. While LINE-1 RNA expression is present in both tumor and normal epithelial cells, it is still about 3.3× higher in the LSCC epithelial/cancer cells, analyzed above, than in the normal trachea epithelial cells. This enrichment is consistent with tumor/normal comparisons of LINE-1 RNA and protein expression in bulk LSCC samples from CPTAC(55) (Figure 3F).  To get greater detail on LINE-1 expression in KCs, we reclustered the epithelial cells separately ( Figure 4C). LINE-1 expression was by far the highest in malignant BCC cells (up 15× over other KCs, P < 10 −16 ), but was also elevated in SCC cells (up 2.3x over other non-malignant KCs, P = 4.6 × 10 −6 ). Unexpectedly, we also found LINE-1 expression to be elevated in sebocytes (up 2.2× over non-BCC KCs, P = 8.8 × 10 −7 , Figure 4D).

LINE-1 expression in malignant vs normal cells in younger and older individuals
LINE-1 has been hypothesized to be de-repressed as a result of heterochromatin loss during aging (56). We therefore wanted to know whether the epithelial cell LINE-1 ex-pression identified here was affected by age. We then used an nb glm model to test a binary age variable (>80 years versus < 40 years) as a predictor of LINE-1 expression in our data. We found that advanced age predicted a modest 35% increase in normal KC LINE-1 expression (P = 0.04, Figure 4E). However, with only two patients, this analysis lacked the power to fully account for differences in sample cell type make-up. In particular, the older SCC patient included many hair follicle associated KCs that were absent from other samples (Supplemental Figure S3, bottom right).

LINE-1 is expressed in mouse hippocampal neurons
While we did not see clear evidence for LINE-1 expression in the bulk human brain tissue samples we analyzed above, a number of studies have suggested a role for LINE-1 in normal and diseased brains. We therefore wanted to 2042 Nucleic Acids Research, 2023, Vol. 51, No. 5 A B C D know whether LINE-1 expression in the brain can be detected at the single cell/nucleus level. While we were not able to find appropriate data from human brain cells, we did identify 150 bp paired-end read 5 targeted single nucleus RNA-seq data from the hippocampi of two young (4month-old) and two old (24-month-old) mice. Because mice have three active LINE-1 families (Tf, Gf and A), we aligned reads to the L1MdTf I, L1MdGf I and L1MdA I consensus sequences (57). After filtering low quality alignments, we found L1MdTf I to be most promising for LINE-1 expression ( Figure 5A, Supplemental Figure S11): a read 1 peak occurs in the tandem repeat region (at the position where transcription is expected to initiate) and is followed by a broader read 2 peak. A large number of reads that align near the 3 end of the sequence are likely a form of background and were filtered out from further analyses. Few reads were found to align to L1MdA I and read alignments to L1MdGf I did not form as clear a peak at the 5 end. We then counted L1MdTf UMIs and found that LINE-1 expression was highest in neurons, particularly those of the cornu ammonis (CA/hippocampus proper, Figure 5B). Overall, neuronal LINE-1 expression was about twice glial LINE-1 expression (P = 2.2 × 10 −8 ), and neurons of the cornu ammonis expressed about 50% more LINE-1 than those of the dentate gyrus (P = 4.3 × 10 −4 , Figure 5C). We then used the nb glm model to test whether advanced age was predictive of increased LINE-1 expression in neurons. This yielded a modest effect similar in scale to what we observed in the skin samples: LINE-1 expression was 25% higher in older neurons (P = 0.039, Figure 5D).

DISCUSSION
While very strong evidence from both RNA and protein (44,67,68) shows that LINE-1 is highly expressed in many tumors, the extent of its expression in normal cellular contexts is much less well understood. New evidence has linked retrotransposons broadly and LINE-1 specifically to a variety of non-cancer diseases, including age-related inflammation (16,32), neurodegenerative diseases (69-71) and autoimmune disorders (72,73). This makes it all the more important that we have a clear and accurate understanding of where and when LINE-1 expression is driven from its own promoter in human cells. To that end, we developed 5 scL1seq, a method to quantify LNE-1 expression in single cell RNA-seq data generated on the 10× Genomics platform, and found that the 5 targeted method combined with 100 + bp paired end reads is particularly effective for accurately identifying LINE-1 expression. When we analyzed data generated by this method for LINE-1, we were surprised to find clear evidence for LINE-1 expression in normal epithelial cells in multiple tissue types. This reflects the fact that LINE-1 activity in cancer is largely limited to tumors of epithelial origin (18)(19)(20)(21)(22)(23)47,50). This relationship between normal and tumor cell LINE-1 expression suggests that LINE-1 de-repression in cancer may not so much be a switch from 'repressed' to 'expressed' as much as the widening of a pre-existing window of LINE-1 expression. Somatic LINE-1 expression in primary cells prior to tumor development may also have important implications for how cancer develops--particularly in the acquisition of p53 mutations (14). It is, however, important to note that the quantifications here are of LINE-1 RNA. Because there are many post-transcriptional mechanisms for LINE-1 regulation (74), future research is needed to show whether the expression we observe here leads to the translation of LINE-1 proteins and to new retrotransposition events.
We then looked at direct tumor/normal comparisons from non-melanoma skin cancer patients to see how normal LINE-1 expression relates to tumor LINE-1 expression. Despite seeing similar levels of normal LINE-1 expression in basal and suprabasal cells--the normal equivalent for basal (BCC) and squamous cell carcinomas (SCC) respectively--we saw much higher LINE-1 expression in the cancer cells from BCC than SCC. This suggests that LINE-1 expression in tumors is not simply a reflection of the background expression in relevant cell types of origin, but also involves tumor-specific factors, yet to be elucidated.
Because aging is associated with a loss of heterochromatin that can lead to expression of retrotransposons, including LINE-1 (16,32,33,56,75), particularly in senescent cells (16), we wanted to know whether the epithelial LINE-1 we observe here is an age-related phenomenon. While we did find that tumor adjacent normal skin from two patients over 80 had higher LINE-1 expression than tumor adjacent normal skin from two patients under 40, the effect was modest. Thus, it may be that the modest increase with age is due to the increased prevalence of senescent cells, but that there is also another source of epithelial cell LINE-1 expression that is present in both young and old tissues. A much larger sample size would be required to fully determine the effect of age on LINE-1 expression in epithelial cells. Nevertheless, our results suggest that there is agreement between our method and strategies for LINE-1 expression quantification that show an increase with age.
Finally, we analyzed nuclei from the mouse hippocampus for LINE-1 expression, where we found it to be expressed in neurons, especially those of cornu ammonis (hippocampus proper.) In cancer cell lines, LINE-1 RNPs are primarily cytoplasmic7, so it is possible that the single nucleus RNAseq used to evaluate neuronal expression is measuring a distinct population of LINE-1 mRNAs from those present in malignant and normal epithelial cells. It is also important to note that mouse and human LINE-1 are not identical. Mice have 3 active LINE-1 families whereas humans have 1, and mouse and human LINE-1 may be regulated in different ways due to the differing structures of their 5 UTRs.
The strength of the 5 targeted method is that it pinpoints the transcription start site (TSS) of a LINE-1 transcript, allowing us to distinguish proper LINE-1 transcripts from other transcripts that just happen to include some LINE-1 sequence. This is by no means a problem unique to LINE-1 or unique to human and mouse. What transcript gives rise to transposable element (TE) RNA is an important question for analysis of TE expression, regardless of the element under consideration. Knowing the TSS provides key information about the relevant transcripts. Other TEs in other organisms could be analyzed similarly by replacing individual loci with consensus sequences and then counting only high-quality alignments at the relevant TSS. Alternatively, for TEs that are not as highly repetitive as LINE-1 in human and mouse, cellranger will likely be able to align TEs reads to the genome, making it possible to count reads directly from the standard genome alignment.
In this study, we developed a method to accurately quantify LINE-1 expression in tens of thousands of single cells and were able to make significant progress toward answering the question of what constitutes normal LINE-1 expression and how it compares to tumor LINE-1 expression. However, our results also raise the question as to why LINE-1 is more highly expressed in epithelial cells and neurons compared to other cells. The answer likely involves complex locus specific regulation at specific LINE-1 elements (51,76,77).

DATA AVAILABILITY
3 lung adenocarcinoma data was accessed from the SRA database: PRJNA510251. 5 lung squamous cell data is provided as example data from 10x genomics (https://www.10xgenomics.com/resources/datasets/nsclctumor-1-standard-5-0-0). SRA bioproject accession numbers for the other datasets are as follows: PRJNA510251 for 3 lung adenocarcinoma patient data; PRJNA670909 for 5 normal tissues data; PRJNA781454 for the cell line and non-melanoma skin cancer patient data; PRJNA677926 for 5 mouse hippocampus data. A full list of data and accession is found in Supplementary Table S1. Scripts and readme necessary to run 5 scL1seq on other datasets can be found on github (https://github.com/wmckerrow/5pL1sc) and Zenodo (https://doi.org/10.5281/zenodo.7541408).