PoPoolationTE2: Comparative Population Genomics of Transposable Elements Using Pool-Seq

The evolutionary dynamics of transposable elements (TEs) are still poorly understood. One reason is that TE abundance needs to be studied at the population level, but sequencing individuals on a population scale is still too expensive to characterize TE abundance in multiple populations. Although sequencing pools of individuals dramatically reduces sequencing costs, a comparison of TE abundance between pooled samples has been difficult, if not impossible, due to various biases. Here, we introduce a novel bioinformatic tool, PoPoolationTE2, which is specifically tailored for the comparison of TE abundance among pooled population samples or different tissues. Using computer simulations, we demonstrate that PoPoolationTE2 not only faithfully recovers TE insertion frequencies and positions but, by homogenizing the power to identify TEs across samples, it provides an unbiased comparison of TE abundance between pooled population samples. We anticipate that PoPoolationTE2 will greatly facilitate the analysis of TE insertion patterns in a broad range of applications.


Introduction
Transposable elements (TEs) are short stretches of DNA that selfishly propagate within genomes and are thought to be involved in diverse phenomena ranging from human diseases (Kazazian, 1998) to genome evolution (Kazazian, 2004).
Many questions about the biology of TEs can be only addressed by comparing the TE abundance among different samples, such as the activity of TEs in mutation accumulation lines (e.g., base population vs. mutated lines), the dynamics of TE invasions during experimental evolution (e.g., evolved populations at different time points), the contribution of TEs to local adaptation (e.g., populations from different areas), the evolution of TE activity (e.g., populations from different species), and the extend of somatic TE activity (e.g., different tissues) (Gonz alez et al. 2008;Perrat et al. 2013;Kofler et al. 2015b). Sequencing individuals (cells) separately is either too costly or, as in the case of tissues, technically too challenging (sequencing of single cells). Sequencing pools of individuals (Pool-Seq) offers a viable alternative approach (Schlötterer et al. 2014). However, a comparison of TE abundance between pooled samples is difficult as the read depth is usually not high enough to identify all TEs within a pool. This leads to an obvious bias, with more TEs being found in the sample with more mapped reads. Although it is possible to standardize the number of reads in the samples, small differences in sequencing library preparation may introduce some additional biases: 1) insert sizes may vary between samples, with longer insert sizes leading to a higher power to identify TEs, 2) coverage heterogeneity may vary among samples (e.g., due to different DNA polymerases), and 3) genome sizes may differ between samples (e.g., due to different TE contents), where larger genomes result in lower coverage and thus fewer detected TEs. We address these problems by introducing a new data format, the physical pileup track. Analogous to the pileup track, which summarizes for every genomic site the base calls, the physical pileup summarizes the structural states (e.g., TE insert presence or absence) for every genomic site. Based on the physical pileup, our new software tool PoPoolationTE2 homogenizes the physical coverage across samples and thus also the power to identify TEs.

PoPoolationTE2
PoPoolationTE2 is a fast and user friendly tool for analyzing TE insertions in one or more samples, where samples could be tissues, pooled individuals, or separately sequenced individuals. PoPoolationTE2 does not rely on a set of annotated TE insertions in the reference genome, thus both novel (insertions not present/annotated in the reference genome) and annotated TE insertions can be identified. Nested insertions and insertions from uncharacterized TE families, however, cannot be identified. In contrast to its predecessor PoPoolationTE (Kofler et al. 2012), PoPoolationTE2 is designed to compare TE abundance among multiple samples in one joint analysis. PoPoolationTE2 is substantially faster than its predecessor, as it is implemented in Java and uses bam files as input. Although PoPoolationTE2 was primarily designed for Pool-Seq data, it can also be used for sequenced individuals where the population frequency may serve to identify heterozygous insertions or to estimate the Brief communication ß The Author 2016. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons. org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Open Access penetrance of somatic insertions. However, for identifying TE insertions in sequenced individuals multiple dedicated tools are available (T-Lex2 [Fiston-Lavier et al. 2015], RetroSeq [Keane et al. 2013], Jitterbug [Hénaff et al. 2015], and TE-Tracker [Gilly et al. 2014]). PoPoolationTE2 requires paired end data for at least one sample, a reference genome and either a set of TE sequences or a TE annotation. Although PoPoolationTE2 accounts for heterogeneity in sequence coverage, the number of chromosomes contributing to the pools should be similar among samples or much larger than the coverage in each sample (minimizing multiple sampling from one individual at a given genomic position).
PoPoolationTE2 requires reads to be mapped to a modified genome, consisting of a reference genome with masked TE sequences and a set of TE sequences. Masking of TEs may be done based on a TE annotation, RepeatMasker (Smit et al. 1996(Smit et al. -2010 or, as RepatMasker sometimes misses TE insertions (Rahman et al. 2015), iterative mapping of reads derived from TE sequences (see Manual). When reads are mapped to such a modified genome, TE insertions will result in groups of discordantly mapped paired ends, where one read maps to the reference chromosome and the other to a TE sequence (signatures of TE insertions; fig. 1A), whereas properly mapped paired ends indicate the absence of a TE insertion ( fig. 1B). Based on the position of mapped paired ends, a physical pileup track is generated ( fig. 1C). In contrast to base coverage, which relies on the position of reads, physical coverage is based on the sequence spanned by paired ends ( fig. 1C; for the difference between base and physical coverage see also Meyerson et al. 2010). Different types of physical coverage can be distinguished. Properly mapped pairs result in coverage supporting the absence of a TE while discordantly mapped reads, with one read mapping to a reference chromosome and the other to a TE, result in coverage supporting the presence of a TE ( fig. 1D). Because the distance between discordantly mapped reads is not known, we use the median of the distance between proper pairs as approximation (inferred for each sample separately). The power to identify TEs scales with the number of mapped reads as well as the distance between the reads, a property that is captured by physical coverage (fig. 1E). The physical coverage of overlapping paired ends is summed up yielding a physical coverage track with the height reflecting the number of paired ends spanning a given position ( fig. 1F). The power to identify TEs can be homogenized across samples by randomly sampling the physical coverage to equal levels between and within samples ( fig. 1G). Signatures of TE insertions are identified with a sliding window approach scanning for peaks in physical coverage supporting a TE insertion ( fig. 1H). The population frequency of TEs is estimated as the ratio of physical coverage supporting a TE insertion to the total physical coverage ( fig. 1I). Finally, pairs of TE signatures (forward and reverse) of the same family within a given distance are joined ( fig. 1J). PoPoolationTE2 reports the position, the family, the strand, and the population frequency for every TE in all samples. A more detailed explanation of the PoPooaltionTE2 algorithm can be found in the manual (https://sourceforge.net/p/popoo lation-te2/wiki/Home/, last accessed August 8, 2016).

Performance
First, we assessed the performance of PoPoolationTE2 under optimal conditions such that in principle all TEs could be detected. We simulated a population of size N ¼ 100 with 1,000 TE insertions, used a minimum distance of 990 bp between insertions and randomly picked the family, the strand, and the population frequency (0:01 f 1:0) of the TEs (total size of one genome %3:3 Mb). For this population, we simulated paired end reads with an uniform genomic distribution and a coverage sufficiently high to detect all TE insertions (average physical coverage in the pool %200). The mapping algorithm may have a substantial influence on the identification of TEs. Evaluating the suitability of different alignment algorithms, we found that local alignments, with only a fraction of the read required to match, perform consistently better than semiglobal algorithm, with the entire read matching (supplementary table S1, Supplementary Material online). All local alignment algorithm tested (bwa bwasw, bwa mem, bowtie2-local Durbin, 2009, 2010;Langmead and Salzberg, 2012]) allowed for a robust identification of TEs even with sequencing error/polymorphism rates up to 10-15% (supplementary table S1, Supplementary Material online). The best results were obtained when we aligned both reads of paired ends separately using bwa bwasw and then restored the paired end information with PoPoolationTE2 (supplementary table S1, Supplementary Material online). We used this approach for the remaining analyses. We reasoned that variation of the inner distance (¼ fragment size-2 * read length) may cause problems with mapping strategies relying on paired ends. Consistent with this hypothesis, small variation in fragment sizes yields the most accurate estimates of the population frequency and of TE positions (supplementary table S2, Supplementary Material online). The accuracy slightly decreases with increasing variation of the inner distance (supplementary table S2, Supplementary Material online). The physical coverage derived from paired ends only depends on the mapping position of the reads. Hence, we evaluated how the sequencing strategy could be optimized to obtain the highest accuracy at the lowest sequencing costs. As long as mapping positions are not altered, the cost of sequencing may be reduced by shorter reads. Optimal results were obtained with reads of 75-100 bp length (supplementary table S3  . In contrast to optimal conditions, we simulated randomly distributed paired ends (resulting in a heterogeneous coverage) and additionally, to reflect properties of Illumina paired end data (Kofler et al. 2015a), introduced 1% error rate of reads and 2% chimeric reads (reads derived from unrelated genomic positions). Allele frequencies are estimated with a precision of 62.5% and the insertion position differs on average by 7.2 bp (table 2). The population frequency of segregating insertions is slightly overestimated whereas the frequency of fixed insertions is slightly underestimated (fig. 1E). About 80% of the estimated TE positions are within 10 bp of the true position ( fig. 1F) Table 1. Performance of PoPoolationTE2 under optimal conditions such that, in principle, all TEs could be identified. We evaluated the influence of sequencing error rate, inner distance between paired ends (ID), standard deviations of the inner distance (r ID ), read length, and the product between read numbers and inner distance (keeping the physical coverage constant). The performance was assessed by the number of identified TEs, missed TEs, false positive TEs, TEs with correct strand (strand), TEs with both signatures identified (both sign.), and TEs with a single signature identified (one sign.). Furthermore, we assessed the accuracy of the estimated insertion positions (mean: l Dpos , standard deviation: r Dpos ) and of the estimated population frequencies (mean: l Dfreq , standard deviation: r Dfreq ). The resulting average coverage (l c ) and average physical coverage in the pool (l pc ) were estimated from the data.  2016). The advantage of these simulated data is that the true insertions and population frequencies are known which permits to estimate the accuracy, sensitivity, and specificity of tools. Alternatively, it has been suggested to compare the performance of tools with real data using a standardized data set (Ewing, 2015). A weakness of this approach is that a good agreement between tools does not necessarily mean a high performance as all evaluated tools may be biased. Additionally in the case of discordant results between tools, it is not possible to assess which one actually performs best. Finally, we compared the performance of PoPoolationTE2 to its predecessor, PoPoolationTE, using real Pool-Seq data from a natural Drosophila melanogaster population sampled 2008 in Northern Portugal (Kofler et al. 2012) and found that the two tools yield very similar results (supplementary fig. S2, Supplementary Material online).
Because PoPoolationTE2 was designed specifically for an unbiased comparison of TE abundance among samples, we tested its performance by simulating three populations with variable numbers of low frequency (f ¼ 0.01) insertions (A ¼ 1000, B ¼ 750, C ¼ 500). For each of these three populations we generated in silico different numbers of paired end reads which varied in insert sizes (table 3). With typical Pool-Seq studies only sampling a subset of the chromosomes in the sample (Schlötterer et al. 2014), it is not possible to identify all TE insertions. Nevertheless for an unbiased comparison of different samples, it is sufficient to determine the relative TE abundance. The example in table 3 shows how the analysis of the complete data set may lead to misleading results: in population A fewer TE insertions (minimum count 2) are detected than in population B, despite the opposite being true. Subsampling reads in all samples to equal numbers (i.e., identical base coverage) reduces the problem, but still causes misleading results, with population B having more insertions (table 3). Subsampling the physical coverage to equal levels in all populations consistently resulted in the least biased comparison of TE abundance between populations (table 3).
When the inner distance of the paired end reads is similar between samples, subsampling reads to equal numbers has the same effect as homogenizing the physical coverage (sup plementary table S7, Supplementary Material online), but the latter strategy identifies fewer TE insertions. PoPoolationTE2 supports both approaches.
Some applications, such as measuring TE activity in mutation accumulation lines, may depend on a reliable identification of sample specific insertions. This could be challenging as a putative absence of a TE insertion in one sample may in fact be an artefact of coverage heterogeneity. We show that coverage heterogeneity among samples may result in a substantial fraction of false sample specific insertions (supplementary table S8 We conclude that PoPoolationTE2 is a fast and user friendly tool for an unbiased comparison of TE abundance between samples, thus enabling to study TE dynamics in a broad range of applications.

Availability
PoPoolationTE2 is implemented in Java and freely available at https://sourceforge. net/projects/popoolation-te2/ (last accessed August 8, 2016); For a detailed manual and a walkthrough using a small sample data set see https://sourceforge. net/p/popoola tion-te2/wiki/Home/ (last accessed August 8, 2016). A data set for benchmarking tools for the identification of TE insertions with Pool-Seq data is available at https://sourceforge. net/p/ popoolation-te2/wiki/TE-Benchmark/ (last accessed August 8, 2016).

Supplementary Material
Supplementary figures S1 and S2 and tables S1-S8 are available at Molecular Biology and Evolution online (http://www. mbe.oxfordjournals.org/). Table 3. Evaluating different strategies to compare TE abundance in Pool-Seq samples. We simulated three populations with different numbers of low-frequency insertions (f ¼ 0.01) and paired ends with varying inner distances (ID). An unbiased comparison should result in a stable ratio between observed and simulated TEs in the three populations (i.e., a low r obs=sim ). The best results were obtained when the physical coverage (p.c.) was sampled to equal levels in all three populations. Results are shown for two different minimum count thresholds (mc). The average coverage (l c ) and the average physical coverage in the pool (l pc ) were directly estimated from the data. a Coverage after sampling.