Locus-specific expression analysis of transposable elements

Abstract
Transposable elements (TEs) have been associated with many, frequently detrimental, biological roles. Consequently, the regulation of TEs, e.g. via DNA methylation and histone modifications, is considered critical for maintaining genomic integrity and other functions. Still, the high-throughput study of TEs is usually limited to the family or consensus-sequence level because of alignment problems caused by high sequence similarity and short read lengths. To fully understand the causes and effects of TE expression, however, it is necessary to assess TE expression at the level of individual instances. Our simulation study demonstrates that sequence similarity and short read lengths do not rule out the accurate assessment of (differential) expression of TEs at the instance level. With only slight modifications to existing methods, TE expression analysis works surprisingly well for conventional paired-end sequencing data. We find that SalmonTE and Telescope can accurately quantify a considerable number of TE instances, allowing for the recovery of differential expression in model and non-model organisms.

In the study "Locus-specific expression analysis of transposable elements", the performance of SalmonTE*, SQuIRE, Telescope, TEtools* and TEtranscripts* was evaluated with respect to the detection and quantification of transposable elements (TEs) and the detection of differentially expressed TEs. This document is structured by analysis step and lists all tools used in each step together with their purpose, the arguments used, and the argument descriptions (partially copied from the reference of the respective tool). All scripts that are not listed in Table 1 are in-house scripts, which can be found on GitHub: https://github.com/Hoffmann-Lab/TEdetectionEvaluation
The following tools were used for this study:
The following references were used for this study:

Preparation
Generation of reference library
A reference library (FASTA format) of the annotated TEs (contained in the .align files) was created. The in-house script align_parser.py was used to extract the coordinates of the annotated TEs in BED format with the following call:
align_parser.py -a <.align-file>
The reference sequences were extracted with bedtools and stored in FASTA format (referencelibrary.fa) with the following command:
bedtools getfasta -fi <reference-genome> -bed <.bed-file> > referencelibrary.fa
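To make the coordinate convention of this step explicit, the following Python sketch mimics what the bedtools getfasta call does for plain, unstranded intervals: BED coordinates are 0-based and half-open, and each extracted record is named chrom:start-end. The function names parse_fasta and extract_intervals are hypothetical and are not part of the study's scripts; strand handling and all other bedtools options are ignored here.

```python
from typing import Dict, Iterator, List, Tuple


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {sequence name: sequence}.

    Only the first whitespace-delimited token of each header is used as name.
    """
    parts: Dict[str, List[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def extract_intervals(genome: Dict[str, str], bed_text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) for every BED interval.

    BED start is 0-based and the end is exclusive, so a Python slice
    genome[chrom][start:end] selects exactly the annotated bases.
    """
    for line in bed_text.splitlines():
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        start, end = int(start), int(end)
        yield f"{chrom}:{start}-{end}", genome[chrom][start:end]
```

For a toy genome with the single sequence chr1 = ACGTACGTTT, the BED line "chr1, 2, 6" selects the four bases GTAC, which shows that the interval length is simply end minus start.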

Select a random set of 100,000 TEs
The script sampleTEs.R was used to randomly select 100,000 TEs for the polyester simulation. The following parameters were set:
• minimal length of TE is set to 100
• species is set to human, mouse or notho as required
• number of TEs is set to 100,000
• the .bed file and .fasta file are set to the respective file names
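The selection step can be sketched as follows. This is a minimal Python illustration and not the code of sampleTEs.R, which is an R script: the function name sample_tes, the tuple layout of a BED record, the inclusive length threshold, and the optional seed are assumptions made for this example.

```python
import random
from typing import List, Optional, Sequence, Tuple

# Hypothetical record layout: (chrom, start, end, name), BED-style coordinates.
BedRecord = Tuple[str, int, int, str]


def sample_tes(
    records: Sequence[BedRecord],
    n: int,
    min_length: int = 100,
    seed: Optional[int] = None,
) -> List[BedRecord]:
    """Draw n TEs without replacement from all TEs of at least min_length bases."""
    # Assumption: the minimal length is inclusive (length >= min_length).
    eligible = [r for r in records if r[2] - r[1] >= min_length]
    if n > len(eligible):
        raise ValueError(
            f"requested {n} TEs but only {len(eligible)} pass the length filter"
        )
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    return rng.sample(eligible, n)
```

Filtering before sampling guarantees that the requested number of TEs is returned, instead of drawing first and losing short instances afterwards.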

Simulation of reads
The tool polyester was used to simulate reads for the main part of the study. The polyester call was implemented in the in-house script simulation_polyester.R. For this study, the following parameters were selected:
• percentage of differentially expressed TEs (-d) is set to 5
• number of replicates (-r) is set to 3, 5 or 10 as required
• read length (-l) is set to 50 or 100 as required
• setup (-s) is set to paired or single as required
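To illustrate what a setting of 5 percent differentially expressed TEs means for a two-group design, the sketch below marks 5% of the TEs as differentially expressed and gives them a fold change in the second group. It is written in Python for illustration only: the fold-change magnitude of 4, the random choice between up- and down-regulation, and the name assign_fold_changes are assumptions, and the in-house script simulation_polyester.R may proceed differently.

```python
import random
from typing import Dict, Optional, Sequence, Tuple


def assign_fold_changes(
    te_ids: Sequence[str],
    de_percent: float = 5.0,
    fold_change: float = 4.0,  # assumed magnitude, not taken from the study
    seed: Optional[int] = None,
) -> Dict[str, Tuple[float, float]]:
    """Return {TE id: (fold change in group 1, fold change in group 2)}.

    Group 1 serves as baseline (fold change 1.0 for every TE). A random
    de_percent of the TEs is changed in group 2, either up or down.
    """
    rng = random.Random(seed)
    n_de = round(len(te_ids) * de_percent / 100)
    de_ids = set(rng.sample(list(te_ids), n_de))

    table: Dict[str, Tuple[float, float]] = {}
    for te in te_ids:
        if te in de_ids:
            # Assumption: up- and down-regulation are equally likely.
            group2 = fold_change if rng.random() < 0.5 else 1.0 / fold_change
        else:
            group2 = 1.0
        table[te] = (1.0, group2)
    return table
```

With 100,000 selected TEs and a setting of 5, such a table would contain 5,000 differentially expressed instances, which form the ground truth against which the tools' differential expression calls can be compared.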

Alternative simulation
Select a random set of 100,000 TE instances
getRandomSeq -f <name of FASTA file> -n <int> -l <int> -o <name of output directory> -s <int>
For this study, the following parameters were selected:
• minimal length (-l) of TE is set to 100
• number of TEs (-n) is set to 100,000
• seed (-s) is set to 5

Simulation of reads
The alternative simulation included in the study was performed with the in-house script readiator. For this study, the following parameters were selected:
• simulation of a second group with differentially expressed TEs (-d)
• read length (-l) is set to 50 or 100 as required
• desired number of reads (-r) is set to 5,000,000
• sample size per group (-sz) is set to 5
• seed (-s) is set to 5

Run Tools
Preparation of tool-specific files
Files that are needed by the respective tools were prepared as listed within this section.

Creation of index for SalmonTE
salmon index -t <name of FASTA file> -i <index name> --type quasi -k 31

Create alignment files with STAR for TEtranscripts* and Telescope
The alignment files generated by STAR serve as input for TEtranscripts* and Telescope.
STAR --genomeDir <index of genome> \
     --readFilesIn <fastq-file> \
     --winAnchorMultimapNmax <int> \
     --outFilterMultimapNmax <int> \
     --alignIntronMax <int> \
     --outFilterMismatchNoverLmax <float>
The simulated sequencing data were aligned with STAR (v2.7.6a) according to the recommendation of the authors of TEtranscripts with the following options:
• winAnchorMultimapNmax is set to 100
• outFilterMultimapNmax is set to 100
• alignIntronMax is set to 100000
• outFilterMismatchNoverLmax is set to 0.04
Argument descriptions:
readFilesIn: list of fastq files
winAnchorMultimapNmax: max number of loci anchors are allowed to map to
outFilterMultimapNmax: max number of multiple alignments allowed for a read
alignIntronMax: maximum intron length
outFilterMismatchNoverLmax: alignment will be output only if its ratio of mismatches to *mapped* length is less than or equal to this value
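The options above can be gathered in one place with a small helper, sketched here in Python. The flags and values are the ones listed in this section; the function names star_command and max_mismatches are hypothetical, and the mismatch calculation assumes that the whole read (or read pair) is mapped, so that the mapped length equals the read length.

```python
import math
from fractions import Fraction
from typing import List


def star_command(genome_dir: str, fastq_files: List[str]) -> List[str]:
    """Assemble the STAR call with the option values stated in this section."""
    return [
        "STAR",
        "--genomeDir", genome_dir,
        "--readFilesIn", *fastq_files,
        "--winAnchorMultimapNmax", "100",
        "--outFilterMultimapNmax", "100",
        "--alignIntronMax", "100000",
        "--outFilterMismatchNoverLmax", "0.04",
    ]


def max_mismatches(mapped_length: int, ratio: str = "0.04") -> int:
    """Largest mismatch count whose ratio to mapped_length is <= ratio.

    Fraction avoids floating-point rounding at the threshold.
    """
    return math.floor(Fraction(ratio) * mapped_length)
```

Under the stated assumption, a ratio of 0.04 permits at most 2 mismatches for a fully mapped 50-nt read and at most 4 for a 100-nt read. The returned list can be passed to subprocess.run on a system where STAR is installed.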

Apply Tools
The tool calls, which were used to estimate counts of TEs, are listed within this section.

SQuIRE
The bash scripts shipped with SQuIRE were used. The following command is copied from the bash script that performs the read counting. The creation of the clean_folder is described in the 'Preparation of tool-specific files' section of this document.

Evaluation
After adapting the general.R and dataInfo.csv files, run the evaluation script as follows:
Rscript TEdetectEval.R
Subsequently, run the scripts figure.R and table.R to generate the figures and tables.