Patterns of Linkage Disequilibrium and Long Range Hitchhiking in Evolving Experimental Drosophila melanogaster Populations

Whole-genome resequencing of experimental populations evolving under a specific selection regime has become a popular approach to determine genotype–phenotype maps and understand adaptation to new environments. Despite its conceptual appeal and success in identifying some causative genes, it has become apparent that many studies suffer from an excess of candidate loci. Several explanations have been proposed for this phenomenon, but it is clear that information about the linkage structure during such experiments is needed. Until now only Pool-Seq (whole-genome sequencing of pools of individuals) data were available, which do not provide sufficient information about the correlation between linked sites. We address this problem in two complementary analyses of three replicate Drosophila melanogaster populations evolving to a new hot temperature environment for almost 70 generations. In the first analysis, we sequenced 58 haploid genomes from the founder population and evolved flies at generation 67. We show that during the experiment linkage disequilibrium (LD) increased almost uniformly over much greater distances than typically seen in Drosophila. In the second analysis, Pool-Seq time series data of the three replicates were combined with haplotype information from the founder population to follow blocks of initial haplotypes over time. We identified 17 selected haplotype-blocks that started at low frequencies in the base population and increased in frequency during the experiment. The size of these haplotype-blocks ranged from 0.082 to 4.01 Mb. Moreover, between 42% and 46% of the top candidate single nucleotide polymorphisms from the comparison of founder and evolved populations fell into the genomic region covered by the haplotype-blocks. We conclude that LD in such rising haplotype-blocks results in long range hitchhiking over multiple kilobase-sized regions. LD in such haplotype-blocks is therefore a major factor contributing to an excess of candidate loci. 
Although modifications of the experimental design may help to reduce the hitchhiking effect and allow for more precise mapping of causative variants, we also note that such haplotype-blocks might be well suited to study the dynamics of selected genomic regions during experimental evolution studies.

Figure S15: Accuracy in frequency estimation of base-haplotypes through haplotype specific singleton markers depending on the amount of known founder haplotypes
Table S1: Overview of haplotype-block identification in experimental and simulated data
Table S2: Overview of putative target SNPs and candidate genes in haplotype-blocks
Table S3: Overview of inversion breakpoints and low recombining genomic regions

[Figure axes: distance between SNPs [bp]; increase in r-square]

The haplotype-block structure is presented. Each row corresponds to one evolved haplotype. Base-singletons specific to the respective base-haplotype, called from 29 base-haplotypes (a subset of the ~113 haplotypes present in the experimental base population), were used to infer the presence of base-haplotype fragments in the evolved haplotypes and are painted in different colors. Marker alleles were used only if at least two markers of the same base-haplotype were adjacent to each other (single marker alleles surrounded by marker alleles from different base-haplotypes were omitted, as they most likely reflect sequencing or marker calling errors). The interpretation of this visualization is complicated by the inaccuracy of initial marker calling from a subset of the base population. In some regions large fragments of initial base-haplotypes can be seen, which is in accordance with the observed increase in long range LD (Fig. 1, Fig. S5) and supports our claim of kb- to Mb-sized blocks moving in our experiment. However, the evolved haplotypes are also expected to be mosaics with a large fraction derived from the ~84 unknown base-haplotypes. As the unknown base-haplotypes were not included in marker calling for the 29 known base-haplotypes, a considerable fraction of the haplotype markers is likely not unique to the intended base-haplotype but also present on an unknown base-haplotype. Unknown founder haplotypes are therefore partly visualized through the absence of haplotype markers or, as largely seen here, via very fine mosaics of the 29 different base-haplotype markers. This explanation is also strongly supported when zooming into smaller genomic regions (data not shown).
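The adjacency rule for marker alleles described in this caption can be illustrated with a minimal sketch. It is not the authors' code: the function name `drop_isolated_markers` and the input format (one base-haplotype label per observed marker allele, ordered by position along one evolved haplotype) are assumptions made for illustration only.

```python
from typing import List, Optional


def drop_isolated_markers(labels: List[Optional[str]]) -> List[Optional[str]]:
    """Keep a marker only if a directly neighbouring marker has the same label.

    labels: hypothetical input, the base-haplotype label of each observed
    marker allele along one evolved haplotype, ordered by position.
    Markers without a same-label neighbour are replaced by None.
    """
    kept: List[Optional[str]] = []
    for i, label in enumerate(labels):
        left = labels[i - 1] if i > 0 else None
        right = labels[i + 1] if i + 1 < len(labels) else None
        # A marker is supported when at least one adjacent marker agrees with it.
        supported = label is not None and (label == left or label == right)
        kept.append(label if supported else None)
    return kept


example = ["b1", "b1", "b5", "b1", "b7", "b7", "b7", "b12"]
painted = drop_isolated_markers(example)
print(painted)
```

In this toy input the single `b5`, the isolated second `b1` run and the trailing `b12` are dropped, while the `b1` pair and the `b7` triple are kept.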

[Figure axis: distance between SNPs [bp]]

Mean r-square estimates (r²) are based on SNPs that are polymorphic in both the base and the evolved population, with a minimum of 24 haplotypes. Data points for long range LD estimates (dots) represent averages of 1 kb windows for distances of 2 kb, 50 kb, 100 kb, 1 Mb, 2 Mb, 4 Mb, 8 Mb and 15 Mb. r² is higher in the low recombining than in the high recombining regions in the base population for X and 2R, and for most chromosome arms at generation F67. Inversion regions show similar or reduced amounts of LD compared to the high recombining regions, particularly in the evolved population.
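For readers unfamiliar with the statistic, the following sketch computes r² for one pair of biallelic SNPs from haploid data using the standard definition r² = D² / (p_A(1 - p_A) p_B(1 - p_B)). The 0/1 allele coding, the use of `None` for missing calls and the function name `r_squared` are assumptions for illustration; only the minimum of 24 haplotypes is taken from the caption.

```python
from typing import Optional, Sequence


def r_squared(site_a: Sequence[Optional[int]],
              site_b: Sequence[Optional[int]],
              min_haplotypes: int = 24) -> Optional[float]:
    """r^2 between two biallelic sites coded 0/1 per haplotype (None = missing).

    Returns None if fewer than min_haplotypes have calls at both sites or if
    either site is monomorphic among the haplotypes that can be compared.
    """
    pairs = [(a, b) for a, b in zip(site_a, site_b)
             if a is not None and b is not None]
    n = len(pairs)
    if n < min_haplotypes:
        return None
    p_a = sum(a for a, _ in pairs) / n          # frequency of allele 1 at site A
    p_b = sum(b for _, b in pairs) / n          # frequency of allele 1 at site B
    if p_a in (0.0, 1.0) or p_b in (0.0, 1.0):
        return None
    p_ab = sum(1 for a, b in pairs if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                        # linkage disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


site_1 = [1] * 12 + [0] * 12
site_2 = ([1] * 6 + [0] * 6) * 2
print(r_squared(site_1, site_1))   # complete association
print(r_squared(site_1, site_2))   # no association
```

Averaging such pairwise values over SNP pairs binned by physical distance would give curves of the kind described in the caption; the binning itself is not shown here.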

Figure S10
Fig. S10: Identification of haplotype-blocks for all 10 base-haplotypes for which haplotype-blocks were identified. Panels 1-10 correspond to base-haplotypes b1, b5, b6, b7, b12, b17, b19, b21, b24 and b26. Singletons specific to the respective haplotype were identified from the 29 base-haplotypes and grouped into windows of 20 (overlap 15). For each singleton-window the frequency changes (inferred from Pool-Seq) are plotted for three different comparisons of the base to generation F15 (green), F37 (blue) and F59 (red). Dashed lines represent thresholds for haplotype-block identification for the respective generations based on mean frequency changes of the top 2,000 CMH candidates (see Material & Methods). Short solid colored lines above mark the areas of identified haplotype-blocks. The ten panels show haplotype-block identification through frequency changes of base specific singleton-SNPs.
The figure is provided as a separate supplemental pdf file to provide better resolution.
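The windowing scheme in this caption (20 singletons per window, overlap 15, hence a step of 5 markers) can be sketched as follows. This is an illustrative sketch only: the function name `window_frequency_changes` is hypothetical, and summarizing each window by the difference of median marker frequencies is an assumption borrowed from the description of Figure S13, not a confirmed detail of the block identification.

```python
from statistics import median
from typing import List, Sequence, Tuple


def window_frequency_changes(base_freqs: Sequence[float],
                             evolved_freqs: Sequence[float],
                             size: int = 20,
                             overlap: int = 15) -> List[Tuple[int, int, float]]:
    """Frequency change per sliding window of haplotype-specific singletons.

    base_freqs / evolved_freqs: Pool-Seq frequencies of the singleton alleles of
    one base-haplotype, ordered by position, at the base and at a later time point.
    Returns (first marker index, one-past-last marker index, frequency change).
    """
    if len(base_freqs) != len(evolved_freqs):
        raise ValueError("both time points must cover the same markers")
    step = size - overlap
    changes = []
    for start in range(0, len(base_freqs) - size + 1, step):
        end = start + size
        # Assumption: a window is summarized by its median marker frequency.
        delta = median(evolved_freqs[start:end]) - median(base_freqs[start:end])
        changes.append((start, end, delta))
    return changes


base = [0.01] * 30
evolved = [0.01] * 10 + [0.31] * 20
for start, end, delta in window_frequency_changes(base, evolved):
    print(start, end, round(delta, 3))
```

Each window's change would then be compared against the generation-specific threshold (dashed lines in the figure).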

Figure S11
Fig. S11: Evolved haplotypes from generation 67 with each haplotype-block from the base population, respectively. A) haplotype-block 9, B) haplotype-block 11, C-P) haplotype-blocks 1-8, 10, 13-17. Haplotype-blocks from the base population are labeled with b, evolved haplotypes are labeled with e. The left panel represents the haplotype-block structure. Each horizontal line corresponds to one haplotype. Base-singletons specific to the respective haplotype-block are indicated in blue; yellow indicates the remaining singletons in the base population. The major allele in the base population is shown in red. Missing data are shown in grey. Only loci that are singletons for any base-haplotype are shown. The right panel shows a cladogram based on singleton-SNP sharing. The blue rectangle marks the evolved haplotypes that cluster with the respective haplotype-block sequence from the base population. Where present, blue rectangles with dashed lines highlight parts that might also have originated from the original haplotype-block sequence. The core of each haplotype-block is shared to a different extent with the evolved haplotypes that cluster around it. (Parts C-P are shown in a separate file; figure S11, also shown in a separate file, uses an alternative visualization for the same 17 haplotype-blocks.)

Figure S12

[Figure axis: edit-distance to haplo-block [fraction of SNPs]]
Fig. S12: Allele sharing between haplotype-blocks suggests little recombination during 67 generations. Each plot shows the amount of allele sharing between a haplotype-block and the corresponding evolved haplotypes of generation F67 (identified via Fig. 7, Fig. S10). Allele sharing is calculated as average pairwise distance in the complete region. Colored points on top of the boxplots indicate the evolved haplotypes. Haplotype-block plots are sorted in ascending order with respect to the divergence of the evolved haplotypes (note that we used three different scales on the y-axis). Evolved haplotypes carrying the haplotype-block sequences show small but variable amounts of recombination during 67 generations of experimental evolution.
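A distance of the kind plotted here, expressed as the fraction of SNPs at which an evolved haplotype differs from the haplotype-block sequence, could be computed along the following lines. This is a sketch under stated assumptions: haplotypes are represented as strings of alleles at the SNP positions of the block region, missing calls are coded as `N`, and the function name `edit_distance_fraction` is hypothetical.

```python
from typing import List, Optional


def edit_distance_fraction(block: str, evolved: str,
                           missing: str = "N") -> Optional[float]:
    """Fraction of comparable SNPs at which two haplotypes carry different alleles.

    Positions with a missing call in either haplotype are skipped.
    Returns None if no position can be compared.
    """
    compared = [(a, b) for a, b in zip(block, evolved)
                if a != missing and b != missing]
    if not compared:
        return None
    return sum(1 for a, b in compared if a != b) / len(compared)


block_seq = "ACGTACGTAC"
evolved_haplotypes: List[str] = ["ACGTACGTAC", "ACGTACGTAT", "ACGTNCGTAC"]
distances = [edit_distance_fraction(block_seq, e) for e in evolved_haplotypes]
print(distances)
```

Small distances across the region are what would be expected if the evolved haplotypes carry the block largely unrecombined.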
Figure S13: Accuracy in frequency estimation of base-haplotypes through haplotype specific singleton markers depending on the amount of known founder haplotypes. Haplotype frequencies were calculated for sliding windows of 20 haplotype singleton markers through median marker frequencies based on subsets of 29, 40, 79 and 119 out of 158 base-haplotypes, corresponding to 18%, 25%, 50% and 75% of known base-haplotypes. The mean deviation between estimated and true frequencies was calculated based on 4,000-11,000 windows per chromosomal arm (Table S4). Boxplots display results for chromosomal arms 2L, 3L and X. Deviations were calculated for haplotype compositions obtained from neutral simulations with Ne=200 and no recombination from a starting population of equal amounts of 158 DGRP (Mackay et al. 2012) haplotypes with MimicrEE (Kofler & Schlötterer 2013). Mean deviations in frequency estimated for 20 marker haplotype windows decrease with the amount of known base-haplotypes.

Overview of implemented scripts
1) Estimation of short and long range LD from haplotype data
Script 1: calc_r2_min-maxdist.py
Calculation of r² values from whole genome haplotype data: r² estimation for all SNP-pairs on a chromosome with physical distances within the provided range.
2) Identification of haplotype-blocks
Script 2: haplotype_singleton-blocks_window_EE_repl.py
Estimation of frequency changes of base-haplotypes between the base and subsequent time points: frequency changes of base-haplotypes are estimated via haplotype specific singletons in the base population. Besides knowledge of haplotype specific singletons, the singleton allele frequencies should be known for at least two replicates in the base population and subsequent time points during experimental evolution.
Haplotype-blocks have to be called separately afterwards using time point specific thresholds (for required frequency changes) and windows have to be combined if they overlap or are within a defined recombination distance.
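The separate block-calling step described above (thresholding windows, then combining windows that overlap or lie close together) can be sketched as a simple interval merge. The sketch is not part of the provided scripts: the function name `call_blocks`, the window tuples and the `max_gap` parameter are hypothetical, and `max_gap` is expressed here in the same units as the window coordinates as a stand-in for the recombination distance mentioned in the text.

```python
from typing import List, Sequence, Tuple


def call_blocks(windows: Sequence[Tuple[int, int, float]],
                threshold: float,
                max_gap: int) -> List[Tuple[int, int]]:
    """Call haplotype-blocks from windowed frequency changes.

    windows: (start, end, frequency change) per singleton window at one time point.
    threshold: required frequency change for that time point.
    max_gap: windows separated by at most this distance are combined
             (assumption: same units as start/end).
    """
    selected = sorted((s, e) for s, e, delta in windows if delta >= threshold)
    blocks: List[List[int]] = []
    for start, end in selected:
        if blocks and start - blocks[-1][1] <= max_gap:
            # Overlapping or nearby window: extend the current block.
            blocks[-1][1] = max(blocks[-1][1], end)
        else:
            blocks.append([start, end])
    return [(s, e) for s, e in blocks]


toy_windows = [(100, 200, 0.30), (150, 260, 0.25), (300, 400, 0.05),
               (420, 500, 0.40), (900, 1000, 0.35)]
blocks = call_blocks(toy_windows, threshold=0.2, max_gap=200)
print(blocks)
```

Because thresholds are time point specific, such a function would be applied once per time point with the corresponding threshold.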