A Guide for the Design of Evolve and Resequencing Studies

Standing genetic variation provides a rich reservoir of potentially useful mutations facilitating the adaptation to novel environments. Experimental evolution studies have demonstrated that rapid and strong phenotypic responses to selection can also be obtained in the laboratory. When combined with next-generation sequencing technology, these experiments promise to identify the individual loci contributing to adaptation. Nevertheless, very little is known so far about how to design such evolve & resequencing (E&R) studies. Here, we use forward simulations of entire genomes to evaluate different experimental designs that aim to maximize the power to detect selected variants. We show that low linkage disequilibrium in the starting population, population size, duration of the experiment, and the number of replicates are the key factors determining the power and accuracy of E&R studies. Furthermore, replication of E&R experiments is more important for detecting the targets of selection than increasing the population size. Using an optimized design, beneficial loci with a selective advantage as low as s = 0.005 can be identified at the nucleotide level. Even when a large number of loci are selected simultaneously, up to 56% can be reliably detected without incurring large numbers of false positives. Our computer simulations suggest that, with an adequate experimental design, E&R studies are a powerful tool to identify adaptive mutations from standing genetic variation and thereby provide an excellent means to analyze the trajectories of selected alleles in evolving populations.


Base population
As a base population for the simulation of experimental evolution, we simulated 8,000 haploid genomes with fastsimcoal v1.1.8 [1] that reproduce the pattern of natural variation of a D. melanogaster population captured in Vienna in fall 2010 [2]. The average number of SNPs and the nucleotide diversity (Tajima's π) for this simulated base population in windows of 10,000 bp are shown in Figure 1. As we excluded the X chromosome from the analysis, we only show results for chromosomes 2L, 2R, 3L, and 3R. Because only a subset of the 8,000 haploid genomes was used for most of the simulations, Table 1 shows the number of SNPs in the base population as a function of the number of sampled haploid genomes. Figure 2 shows the allele frequency spectra of the ancestral and the derived allele in the simulated base population, excluding regions of low recombination (<1 cM/Mb). For the simulations, beneficial loci were chosen randomly, with either the ancestral or the derived allele randomly picked as the beneficial one. To increase the probability of detecting the selected loci, the starting frequency of selected alleles was not allowed to exceed 80% (excluded alleles are shaded in gray in Figure 2). Figure 2 also demonstrates that the majority of the beneficial alleles will be derived alleles.

Test statistic
To evaluate the performance of different evolve and resequencing (E&R) designs, a suitable test statistic for the identification of selected loci is required. Recently, several test statistics have been suggested, such as the Cochran-Mantel-Haenszel test (CMH test) [3], diffStat [4], an association statistic [5], and FST [6].
We evaluated these test statistics with forward simulations resembling typical E&R studies, using a population of 1,000 individuals and 100 randomly picked beneficial loci.
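The CMH test compares allele counts between base and evolved populations across replicates. A minimal numpy/scipy sketch of the statistic for K replicated 2x2 tables is shown below; the function name and table layout are our own, not from the cited implementation [3].

```python
import numpy as np
from scipy.stats import chi2

def cmh_test(tables):
    """Cochran-Mantel-Haenszel test for K 2x2 contingency tables.

    tables: array of shape (K, 2, 2); for E&R data the rows could be
    the two alleles and the columns the base/evolved time points,
    with one table per replicate. Returns (statistic, p-value).
    """
    t = np.asarray(tables, dtype=float)
    a = t[:, 0, 0]                      # top-left cell of each table
    row1 = t[:, 0, :].sum(axis=1)       # first-row totals
    col1 = t[:, :, 0].sum(axis=1)       # first-column totals
    n = t.sum(axis=(1, 2))              # table totals
    expected = row1 * col1 / n
    var = row1 * (n - row1) * col1 * (n - col1) / (n ** 2 * (n - 1))
    # continuity-corrected statistic, chi-square with 1 df
    stat = max(abs(a.sum() - expected.sum()) - 0.5, 0.0) ** 2 / var.sum()
    return stat, chi2.sf(stat, df=1)
```

For example, an allele rising from 50% to 80% in three replicates of 100 sequenced chromosomes yields a highly significant p-value, whereas unchanged frequencies do not.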

Low recombining regions
High levels of linkage disequilibrium in low recombining regions may prevent the identification of beneficial loci. Excluding low recombining regions from the analysis, therefore, may improve the power to identify selected loci with the tradeoff that beneficial loci lying within low recombining regions cannot be detected.
To test this strategy we used the results of the simulations mentioned above (100 selected SNPs; s = 0.1; h = 0.5; N = 1,000; homozygous individuals) and generated two data sets: one including regions of low recombination (<1 cM/Mb) and one excluding them. As E&R studies aim to identify a set of beneficial loci containing a minimum of false positives, we evaluated the performance at a low false positive rate (FPR < 0.01) with the partial area under the curve (pAUC; see main text). At a low FPR, the data set excluding the low recombining regions performed significantly better than the data set including them (Fig. 4B; Wilcoxon rank sum test with pAUC). When evaluating the performance over the whole ROC curve, the data set that included the low recombining regions had the best performance (Fig. 4A; Wilcoxon rank sum test with AUC; W = 100, p = 1.1e-05). This is not surprising, as beneficial loci lying within low recombining regions cannot be identified with the data set that excluded them. As performance at a low FPR is most relevant for E&R studies, we recommend excluding regions of low recombination from the analyses; we also followed this strategy in our study.
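The pAUC used above restricts the ROC integral to the low-FPR region. A minimal sketch of how such a value can be computed from per-locus scores (our own implementation, not the one used in the study):

```python
import numpy as np

def partial_auc(scores, labels, max_fpr=0.01):
    """Partial area under the ROC curve for FPR <= max_fpr.

    scores: candidate statistic per locus (higher = more selected-like);
    labels: 1 for truly selected loci, 0 for neutral ones.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(labels)[order]
    tpr = np.concatenate([[0.0], np.cumsum(y) / y.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(1 - y) / (1 - y).sum()])
    # truncate the curve at max_fpr and integrate with the trapezoid rule
    keep = fpr <= max_fpr
    x = np.concatenate([fpr[keep], [max_fpr]])
    t = np.concatenate([tpr[keep], [np.interp(max_fpr, fpr, tpr)]])
    return float(np.sum(np.diff(x) * (t[1:] + t[:-1]) / 2))
```

A perfect ranking of the selected loci gives the maximal pAUC of max_fpr, while a random ranking gives a value close to max_fpr²/2.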

Number of selected loci
Given that forward simulations are highly CPU intensive, in particular for large population sizes, we were interested in minimizing the number of simulations. To obtain a sufficient number of observations with a moderate number of simulation runs, we opted for simulating multiple selected loci. One possible complication of this strategy is that too many targets of selection will interfere with each other, precluding generalization of the results. We examined this by varying the number of selected loci (fig. 12). Both loci starting at low frequencies and loci starting at high frequencies are expected to yield only comparatively small allele frequency differences between the base and the evolved population, because the allele frequency change per generation is proportional to the product of the major and the minor allele frequency (pq).
If neutrally evolving loci, due to chance, show allele frequency differences similar to those of the beneficial loci, the performance will be decreased (fig. 7).
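The pq dependence mentioned above can be made explicit with the standard single-locus selection recursion; a minimal sketch for a diploid locus with genotype fitnesses 1 (aa), 1+hs (Aa), and 1+s (AA) (our own parameterization for illustration):

```python
def delta_p(p, s=0.1, h=0.5):
    """One-generation change in the frequency p of a beneficial allele A,
    with diploid genotype fitnesses 1 (aa), 1+hs (Aa), 1+s (AA)."""
    q = 1.0 - p
    w_bar = 1.0 + 2.0 * p * q * h * s + p * p * s   # mean fitness
    return s * p * q * (p + h * (q - p)) / w_bar
```

For h = 0.5 this reduces to delta_p = s*p*q / (2*w_bar): the per-generation change is largest at p = q = 0.5 and becomes small near both frequency boundaries, which is why loci starting at very low or very high frequencies show small allele frequency differences.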
We found that mixing loci of different selection coefficients decreases the power to identify beneficial loci relative to unmixed conditions (supplementary fig. 13). Compared to unmixed sets, the performance increased only for the loci with the highest selection coefficient in each mixed set. This is easily explained by the fact that reduced numbers of strongly selected loci result in an increased performance (supplementary fig. 10). In mixed sets of beneficial alleles, only loci with s = 0.05 could be identified using a high-budget study design, compared to loci with s = 0.005 in unmixed sets (supplementary fig. 13).
By contrast, increasing the average coverage to about 200 improves the ability to identify weakly selected loci (Fig. 14). Based on these results, we suggest that average coverages >200 will be necessary to reliably detect weakly selected loci. The probability of fixation under neutrality (π_n) can be calculated as π_n = c_n/c_a.
MimicrEE also records the distribution of offspring in every generation, which allows estimating the variance effective population size (N_ev) [7]. We show N_ev only for the first generation, because there the reduction in the effective population size is most pronounced. In summary, we simulated 10, 25, 50, 75, 100, 150, 200, 500, 1,000, and 2,000 beneficial loci with strong (s = 0.1) and weak (s = 0.025) selection coefficients. Every simulation was repeated 20 times.
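The exact estimator of [7] is not reproduced here, but a common way to obtain a variance effective size from one generation's offspring counts is the classic formula N_ev = (4N - 2)/(V_k + 2) for a diploid population of constant size (mean offspring number 2). A hedged sketch under these assumptions:

```python
import numpy as np

def variance_ne(offspring_counts):
    """Variance effective size from one generation's offspring counts,
    assuming a diploid population of constant size (mean offspring
    number k_bar = 2): N_ev = (4N - 2) / (V_k + 2)."""
    k = np.asarray(offspring_counts, dtype=float)
    n = k.size
    v_k = k.var(ddof=1)   # variance in offspring number
    return (4 * n - 2) / (v_k + 2)
```

With Poisson-distributed offspring numbers (V_k ≈ 2, as in an ideal Wright-Fisher population) the estimate is close to the census size N; strong selection inflates V_k and thus reduces N_ev, which is the effect reported for the first generation above.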

Simulating the sampling properties of Pool-Seq
To investigate the influence of estimating the allele frequencies with Pool-Seq on the power to identify beneficial loci, we simulated the sampling properties of Pool-Seq for different targeted average coverages. We aimed to capture four properties of Pool-Seq. First, the coverage may differ among samples (a phenomenon frequently encountered with barcoding). We modeled this by choosing the average coverage of each sample from a Poisson distribution whose mean corresponds to the targeted average coverage over all samples. Second, Illumina sequencing has a GC bias [9]: regions with elevated GC content show increased coverage.
This results in correlated coverage across samples. We modeled this by modulating the coverage at a given genomic position in all samples by a factor chosen from a Poisson distribution; this step was repeated independently for each genomic position. Third, to model stochastic coverage heterogeneity, we randomly drew a number from a Poisson distribution whose mean was the product of the previous two sampling steps. This step was repeated independently for each genomic position in each sample and yielded the final coverage for each genomic position in each sample. Finally, to model the uneven sampling of individuals in pools, we randomly picked alleles according to the true allele frequency until the targeted coverage was obtained (binomial sampling).
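The four steps above can be sketched as follows. The function name, the Poisson parameter of 10 for the position-wise GC factor, and the rescaling of that factor to mean 1 are our own illustrative choices, not values from the study.

```python
import numpy as np

def simulate_poolseq(true_freqs, n_samples, target_cov, rng=None):
    """Sketch of the four Pool-Seq sampling steps.

    true_freqs: true derived-allele frequency at each genomic position.
    Returns (coverage, derived_counts), both (n_positions, n_samples).
    """
    rng = np.random.default_rng() if rng is None else rng
    freqs = np.asarray(true_freqs, dtype=float)
    # (1) sample-specific mean coverage (barcoding imbalance)
    sample_cov = rng.poisson(target_cov, size=n_samples)
    # (2) position-specific factor shared across samples (GC bias);
    # Poisson draw rescaled to mean 1 so target_cov stays the overall mean
    pos_factor = rng.poisson(10, size=freqs.size) / 10.0
    # (3) stochastic coverage: Poisson with mean = product of steps 1 and 2
    coverage = rng.poisson(np.outer(pos_factor, sample_cov))
    # (4) uneven sampling of individuals: binomial draws from the true frequency
    derived = rng.binomial(coverage, freqs[:, None])
    return coverage, derived
```

Allele frequency estimates are then derived_counts/coverage per position and sample, so all four noise sources propagate into the frequencies used by the test statistics.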