SQuIRE reveals locus-specific regulation of interspersed repeat expression

Abstract Transposable elements (TEs) are interspersed repeat sequences that make up much of the human genome. Their expression has been implicated in development and disease. However, TE-derived RNA-seq reads are difficult to quantify. Past approaches have excluded these reads or aggregated RNA expression to subfamilies shared by similar TE copies, sacrificing quantitative accuracy or the genomic context necessary to understand the basis of TE transcription. As a result, the effects of TEs on gene expression and associated phenotypes are not well understood. Here, we present Software for Quantifying Interspersed Repeat Expression (SQuIRE), the first RNA-seq analysis pipeline that provides a quantitative and locus-specific picture of TE expression (https://github.com/wyang17/SQuIRE). SQuIRE is an accurate and user-friendly tool that can be used for a variety of species. We applied SQuIRE to RNA-seq from normal mouse tissues and a Drosophila model of amyotrophic lateral sclerosis. In both model organisms, we recapitulated previously reported TE subfamily expression levels and revealed locus-specific TE expression. We also identified differences in TE transcription patterns relating to transcript type, gene expression and RNA splicing that would be lost with other approaches using subfamily-level analyses. Altogether, our findings illustrate the importance of studying TE transcription with locus-level resolution.

and tpm by transcripts in a "gtf file" and by genes "abund.txt" file. We convert the fpkm values to counts by multiplying the per-exon coverage by exon length normalized by read length.
StringTie quantifies reads that overlap with annotated exons; StringTie does not include completely intronic reads when run with the "-e" option in its fpkm calculation. If a TE-derived read overlaps with an annotated exon but the rest of the gene is not expressed, Stringtie reports gene expression in the abund.txt file, but the Stringtie gtf output identifies the transcript as a novel isoform (the source column is "Stringtie" instead of the reference gtf). If the TE-derived read overlaps with an intron and the rest of the gene is not expressed, there is no report of gene expression by Stringtie.

DESeq2 Implementation in Call
Call incorporates the Bioconductor package DESeq2 [54,55] with its suggested parameters.
Users input the sample names and experimental design (ie which samples are treatment or control), which Call uses to find Count data and create a count matrix for annotated RefSeq genes and TEs. If the RNA-seq data is stranded, transcription from opposite strands of the same TE are treated separately. Call outputs differential expression tables and generates MA-plots, data quality assessment plots, and volcano plots.

STAR implementation in Draw
To visualize the distribution of reads across the TE, Draw runs STAR [52] with the parameters "-runMode input AlignmentsFromBAM -outWigType bedGraph" to provide visualization of read alignments. It will output bedgraphs of all reads ("multi") and only uniquely ("unique") aligning reads. Draw also compresses the bedgraphs into bigwig format for IGV [56] and UCSC Genome Browser [90] viewing. If the RNA-seq data is stranded it will output unique and multi bedgraphs for each strand.

Further details of Count
Count uses a combination of SAMTools (Li et al. 2009), BEDTools (Quinlan and Hall 2010), awk and bash within a Python script to perform the algorithm described in the main text, in particular distinguishing uniquely aligning reads from multi-mapping reads. Because the quantitation in SQuIRE relies on uniquely aligning reads, SQuIRE needed to resolve three issues in identifying uniquely aligning reads and their mapped TE location. 1) Because RepeatMasker annotation includes overlapping TE coordinates, a read can map uniquely at one genomic location corresponding to two TE loci. Count identifies these reads as unique by collapsing reads and their mapped TEs before labeling. The two TEs each would receive a unique count for that TE. 2) Similarly, when SQuIRE incorporates non-reference polymorphic TE insertions, its location can be confused with overlapping reference TE annotation. To address this, Map uses a custom chromosome name for non-reference TEs (eg. "chr3_poly") during alignments. To keep read assignments to non-reference TEs separate from assignments to annotated TEs, Count changes the non-reference chromosome name back to the conventional name (eg "chr3") only after collapsing reads mapped to the same location. III) For paired-end RNA-seq data, a read pair may map concordantly in only one location, particularly if one mate maps outside of the TE.
However, one or both of the TE mapping mates may not be uniquely aligning, and map discordantly to other locations. In this situation, Count does not label these reads as "uniquely aligning", but assigns a full count to the TE and discards the discordant alignments.
Supplemental Figure S1. Outcomes of alternative non-SQuIRE algorithms for locus-level quantification of TE expression. a) Schematic of RNA-seq example from Figure 2 Figure S3. Precision-Recall curve of SQuIRE Count with varying score thresholds. Precision= Σ "True positive"/ Σ "Positive". Recall=Σ "True positive"/ Σ "True". Positive=SQuIRE reported the TE has a count >10. True=TE was simulated to express > 10 reads. A score threshold of >50 maximizes precision and recall for the TEs in the hg38 genome.  Figure S4. L1HS loci detected in HEK293T cells that initiate at the 5' end of the L1HS locus (red boxes).  Figure S5. Comparison of TE RNA-seq tools at the subfamily level for simulated data. Histogram of % TE subfamilies for each percentage of reported over simulated counts. SQuIRE has the tallest and narrowest peak near 100% Observed/Expected, indicating the it is correctly attributing simulated reads to the greatest number of subfamilies. Because TETools outputs in reads rather than fragments, its output is twice that of the other software.  Figure S6. Bar plot comparison of TE RNA-seq tools compared to Nanostring data at the subfamily level. Y-axis represents log2 fold changes of subfamily expression in testis compared to pooled somatic tissues (brain, heart, kidney, and liver).