-
PDF
- Split View
-
Views
-
Cite
Cite
Emily B Josephs, Jeremy J Berg, Jeffrey Ross-Ibarra, Graham Coop, Detecting Adaptive Differentiation in Structured Populations with Genomic Data and Common Gardens, Genetics, Volume 211, Issue 3, 1 March 2019, Pages 989–1004, https://doi.org/10.1534/genetics.118.301786
- Share Icon Share
Abstract
Adaptation in quantitative traits often occurs through subtle shifts in allele frequencies at many loci—a process called polygenic adaptation. While a number of methods have been developed to detect polygenic adaptation in human populations, we lack clear strategies for doing so in many other systems. In particular, there is an opportunity to develop new methods that leverage datasets with genomic data and common garden trait measurements to systematically detect the quantitative traits important for adaptation. Here, we develop methods that do just this, using principal components of the relatedness matrix to detect excess divergence consistent with polygenic adaptation, and using a conditional test to control for confounding effects due to population structure. We apply these methods to inbred maize lines from the United States Department of Agriculture germplasm pool and maize landraces from Europe. Ultimately, these methods can be applied to additional domesticated and wild species to give us a broader picture of the specific traits that contribute to adaptation and the overall importance of polygenic adaptation in shaping quantitative trait variation.
DETERMINING the traits involved in adaptation is crucial for understanding the maintenance of variation (Mitchell-Olds et al. 2007), the potential for organisms to adapt to climate change (Aitken et al. 2008; Bay et al. 2017), and the best strategies for breeding crops or livestock (Howden et al. 2007; Takeda and Matsuoka 2008). There are many examples of local adaptation from reciprocal transplant experiments (Leimu and Fischer 2008; Hereford 2009) that tell us about fitness in a specific environmental context, but these experiments are less informative about how past evolutionary forces have shaped present day variation (Savolainen et al. 2013). Instead, quantifying the role of adaptation in shaping current phenotypic variation will require comparing observed variation with expectations based on neutral models (Leinonen et al. 2008). With the growing number of large genomic and phenotypic common garden datasets, there is an opportunity to use these types of comparisons to systematically identify the traits that have diverged due to adaptation.
A common way of evaluating the role of spatially variable selection in shaping genetic variation is to compare the proportion of the total quantitative trait variation among populations with that seen at neutral polymorphisms (Prout and Barker 1993; Spitze 1993; Whitlock 2008). methods have been successful at identifying local adaptation, but have a few key limitations that are especially important for applications to large genomic and phenotypic datasets (Whitlock 2008; Leinonen et al. 2013). First, standard assumes a model in which all populations are equally related [but see Whitlock and Gilbert (2012), Ovaskainen et al. (2011), Karhunen et al. (2013) for methods that incorporate more complicated models of population structure]. Second, rigorously estimating requires knowledge of the additive genetic variance both within and between populations (Whitlock 2008). Many studies skirt this demand by measuring the proportion of phenotypic variation partitioned between populations (“”), either in natural habitats or in common gardens. However, replacing with can lead to problems due to both environmental differences among natural populations and nonadditive variation in common gardens (Pujol et al. 2008; Whitlock 2008; Brommer 2011). Third, approaches are unable to evaluate selection in individuals or populations that have been genotyped but not phenotyped. It can be cheaper to phenotype in a smaller panel and test for selection in a larger genotyped panel. If individuals are heterozygous or outbred, cannot be easily maintained in controlled conditions, or are dead, they can be genotyped but not easily phenotyped. In these cases, the population genetic signature of adaptation in quantitative traits (“polygenic adaptation”) can be detected by looking for coordinated shifts in the allele frequencies at loci that affect the trait (Latta 1998; Le Corre and Kremer 2012; Kremer and Le Corre 2012).
Current approaches to detect polygenic adaptation in genomic data take advantage of patterns of variation at large numbers of loci identified in genome-wide association studies (GWAS) (Turchin et al. 2012; Berg and Coop 2014; Field et al. 2016). One approach, , developed by Berg and Coop (2014), extends the intuition underlying classic approaches by generating population-level polygenic scores—trait predictions generated from GWAS results and genomic data—and comparing these scores to a neutral expectation. However, methods for detecting polygenic adaptation using GWAS-identified loci are very sensitive to population structure in the GWAS panel (Berg and Coop 2014; Robinson et al. 2015; Berg et al. 2018; Novembre and Barton 2018; Sohail et al. 2018). Because GWAS in many systems are conducted in structured, species-wide panels (Flint-Garcia et al. 2005; Atwell et al. 2010; Wang et al. 2018), current methods for detecting polygenic adaptation are difficult to apply widely.
Here, we adapt methods for detecting polygenic adaptation to be used in structured GWAS panels and related populations. First, using a new strategy for estimating , we develop an extension of , which we call , to test for evidence of adaptation in a heterogeneous, range-wide sample of individuals that have been genotyped and phenotyped in a common garden. We then develop an extension of for use in structured GWAS populations where the panel used to test for selection shares population structure with the GWAS panel. We apply both of these methods to data from domesticated maize (Zea mays ssp. mays). Overall, we show that the method controls for false positive issues due to population structure and can detect selection on a number of traits in domesticated maize.
Results
Extending QST − FST to deal with complicated patterns of relatedness with
Our approach to detecting local adaptation is meant to ameliorate two main aspects of analysis that limit its application to many datasets. First, many species-wide genomic datasets are collected from individuals that do not group naturally into populations, making it difficult to look for signatures of divergence between populations. Second, calculating requires an estimate of , usually done by phenotyping individuals from a crossing design.
We address these issues by using principal component analysis (PCA) to separate the kinship matrix, K, into a set of principal components (PCs) that are used to estimate , and an orthogonal set of PCs used to test for selection. We base our use of PCA on the animal model, which is often used to partition phenotypic variance among close relatives within populations into genetic and environmental components (Henderson 1950, 1953; Thompson 2008). More generally, the animal model is a statement about the distribution of an additive phenotype if the loci contributing to the trait are drifting neutrally (see Ovaskainen et al. 2011; Berg and Coop 2014; and Hadfield and Nakagawa 2010).
Here, we test for selection by looking for excess phenotypic divergence along the major axes of relatedness described by PCs instead of looking for excess divergence between populations. For an example of how PCs relate to within and between population variation in a set of simulated populations, see Figure 1. In these simulated populations, PCs 1 and 2 distinguish between-population variation for three populations, while PC 3 separates out individuals within one population. While this is a simplified example compared to the populations analyzed later in the paper, it shows the intuition underlying our use of PCs to replace the within and between population structure used by .

Relating PCs to within- and between-population variation. A conceptual figure based on three populations simulated following the dendrogram on the left of plot A. Each population has 70 individuals in it with some within-population variation in relatedness (data not shown.). (A–C) The relationship between PCs 1, 2, and 3 and a neutrally evolving trait.
can be linked to a PC-based approach by noting that, for any arbitrary H matrix (not just the type described above), will follow a distribution, and the degrees of freedom of this distribution will be equal to the number of linearly independent columns in H. We can generate a measure of excess divergence along PCs by replacing H with a matrix, U, of eigenvalues of the kinship matrix K. U is calculated from the eigendecomposition of K, such that where is the matrix of eigenvectors and is a diagonal matrix with the eigenvalues of K. Here, we denote the eigenvector as and the eigenvalue as and we call our measure of excess divergence along PCs .
Testing for selection with in a maize mapping panel
We applied to test for selection in a panel of 240 inbred maize lines from the GWAS panel developed by Flint-Garcia et al. (2005). The GWAS panel includes inbred lines meant to represent the diversity of temperate and tropical lines used in public maize breeding programs, and these lines were recently sequenced as part of the maize HapMap 3 project (Bukowski et al. 2017). In Figure 2A, we plot the relatedness of all maize lines on the first two PCs. The first PC explains 2.04% of the variance and separates out the tropical from the nontropical lines, while the second PC explains 1.90% of the variance and differentiates the stiff-stalk samples from the rest of the dataset [stiff-stalk maize is one of the major heterotic groups used to make hybrids (Mikel and Dudley 2006)]. While previous studies have used relatedness to assign lines to subpopulations, not all individuals can be easily assigned to a subpopulation, and there is a fair amount of variation in relatedness within subpopulations (Flint-Garcia et al. 2005) (Figure 2A).

Structure in the maize populations. These plots show the first two PCs of population structure (the eigenvectors of the kinship matrix) for various maize panels included in this paper. (A) 240 maize lines from the “GWAS panel” that were used in the trait analysis. Each point represents an inbred line and points are colored by their assignment to subpopulations from Flint-Garcia et al. (2005). (B) The GWAS panel from (A) along with the 2704 inbred maize lines of the Ames diversity panel (Romay et al. 2013). (C) The GWAS panel from (A) along with 906 European maize landraces from Unterseer et al. (2016).
We first validated that would work on this panel by testing on 200 traits that we simulated under a multivariate normal model of drift based on the empirical kinship matrix, assuming . As expected, from Equation 6, the variance in the standardized projections onto PCs of these simulated traits centered on 1, and, across the 22 PCs tested in 200 simulations, only 200 tests (4.5%) were significant at the level before correcting for multiple testing. Adding simulated environmental variation and ) to trait measurements increased the variance of , with this excess variance falling disproportionately along the later PCs (those that explain less variation in relatedness). These results suggest that unaccounted increases estimated variance at later PCs, reducing power to detect the selective signal of increased variance at earlier PCs relative to later PCs. However, this reduction in power can be minimized by controlling environmental noise—for example by measuring line replicates in a common garden or best unbiased linear predictions (BLUPs) from multiple environments (See Appendix A for a more extensive treatment of ).
We tested for selection along 22 PCs for 22 traits that, themselves, are estimates of the breeding value (BLUPs) of these traits measured across multiple environments (Hung et al. 2012). These 22 traits include a number of traits thought to be important for adaptation to domestication and/or temperate environments in maize, such as flowering time (Swarts et al. 2017), upper leaf angle (Duvick 2005), and plant height (Duvick 2005; Peiffer et al. 2014). After controlling for multiple testing using a false discovery rate (FDR) of 0.05, we found evidence of adaptive divergence for three traits: days to silk, days to anthesis, and node number below ear (Figure 3A). These results are generally robust to the choice of PCs used for the denominator of Equation 8 used to estimate (Supplemental Material, Figure S3). We plot the relationship between PC1 and two example traits to illustrate the data underlying these signals of selection. In Figure 3B, we show a relationship between PC1 and Kernel Number that is consistent with neutral processes, and, in Figure 3C, we show a relationship between PC 1 and days to silk that is stronger than would be expected due to neutral processes and is instead consistent with diversifying selection. We detected evidence of diversifying selection on various traits along PC1, PC2, and PC10. While PC 1 and PC2 differentiate between known maize subpopulations (Figure 2A), PC 10 separates out individuals within the tropical subpopulation, so our results are consistent with adaptive divergence contributing to trait variation within the tropical subpopulation (Figure S2).

Detecting adaptation within the GWAS panel with . (A) A heatmap showing results from on the first 22 PCs (x-axis) for 22 traits (y-axis). Squares are colored by their P value. If the FDR (“q value”) corresponding to that P value is there is a white dot in the square. (B) Total kernel number per cob plotted against the first PC of relatedness (PC 1). Each point represents a line in the GWAS panel, colored by its membership in a subpopulation (same colors as Figure 2A). The solid line shows the linear regression of the trait on PC 1 and the dashed lines show the 95% confidence interval of linear regressions expected under neutrality. Note that the linear regression is not the same as the F test done in , and that we plot these lines for visualization purposes only. (C) Similar to (B), but showing days to silk on the Y axis. The regression line for days to silk against PC 1 (solid line) falls outside the expectation based on neutrality (dotted line), consistent with selection increasing trait divergence across PC 1.
Detecting selection in unphenotyped individuals using polygenic scores
However, when there is shared population structure between the GWAS panel and the genotyping panel, there are two concerns about testing for selection on polygenic scores made for the genotyping panel:
If we have already found a signal of selection on our phenotypes of interest in the GWAS panel, then a significant test could simply reflect this same signal and not independent adaptation in the genotyping panel.
Controls for population structure in the GWAS could bias the loci identified and the effect sizes estimated for these loci, leading to false positive signals of selection in the genotyping panel.
We illustrate the problem by looking at the relationship between PC 1 and polygenic scores for a trait that has been simulated to be evolving neutrally in the GWAS panel and a set of related maize lines (Figure 4A). While there is no strong relationship between the simulated trait and PC 1 (Figure 4A), the polygenic scores calculated from GWAS results do show a strong correlation with PC 1 (Figure 4B). Note that the simulated trait values and polygenic scores have been standardized by dividing by the square root of their respective estimated from the lower PCs, so this erroneous correlation not simply a result of the inflation of effect sizes in the GWAS due to false positives and the winner’s curse, but represents an issue caused by shared population structure between the GWAS panel and the genotyping panel. The magnitude of the slope of the relationship between standardized polygenic score and PC 1 was greater than the magnitude of the slope of the relationship between standardized simulated trait and PC 1 for 162 out of 200 simulations.

Simulations of on polygenic scores. (A) The relationship between a neutrally simulated trait and PC 1 in the European landraces. Each point represents one individual and the trait value has been divided by so that it can be compared to traits with different values of . (B) The relationship between polygenic scores calculated for the same trait as in (A), based on a GWAS in the GWAS panel. This relationship between polygenic scores and PC is what is used in the nonconditional test. As in (A), each point represents one maize individual and trait has been standardized by . (C) The proportion of 200 neutral simulations that were significant at the level for the nonconditional test and the conditional test. A horizontal line is plotted at 0.05, to show the proportion of significant tests expected under the null hypothesis. (D) The same information, this time for the European landraces.
It is worth taking some time to discuss how the conditional test controls for the two issues due to shared structure. First, by incorporating the polygenic scores of individuals in the GWAS panel into the null distribution of conditional , we are able to test directly for adaptive divergence that occurred in the genotyping panel. Berg et al. (2017) also use the conditional test in this manner. Second, the conditional test forces the polygenic scores of individuals in the genotyping panel into the same multivariate normal distribution as the polygenic scores of individuals in the GWAS panel. The polygenic scores of GWAS individuals will include the ascertainment biases expected due to controls for structure in the GWAS, so ascertainment bias will be incorporated into the null distribution of polygenic scores expected under drift. Now, we will detect selection only if trait divergence exceeds neutral expectations based on this combined multivariate normal distribution.
Applying to polygenic scores in North American inbred maize lines and European landraces
First, we conducted a set of neutral simulations to assess the ability of the conditional test to control for false positives due to shared structure. We applied both the conditional and original (“nonconditional”) test to detect selection on polygenic scores constructed from simulated neutral loci in two panels of maize genotypes that have not been extensively phenotyped: a set of 2815 inbred lines from the United States Department of Agriculture (USDA) that we refer to as “the Ames panel” (Romay et al. 2013), and a set of 906 individuals from 38 European landraces (Unterseer et al. 2016). We chose these two panels to evaluate the potential of conditional to control for shared population structure when the problem is severe, as in the Ames panel (Figure 2B), and moderate, as in the European landraces (Figure 2C). In addition, we expect that the evolution of many quantitative traits has been important for European landraces as they adapted to new European environments in the last 500 years (Tenaillon and Charcosset 2011; Unterseer et al. 2016).
False positive signatures of selection were common when using the original, nonconditional to test for selection on polygenic scores based on loci simulated under neutral processes (Figure 4, C and D and Figure S4) even though the GWAS used to identify the loci used to make polygenic scores both had low power to detect associations and had high false positive rates. (Figure S5). The increase in false positive signals of selection due to shared structure persisted to much later PCs in the Ames panel than in the European landraces, likely because the extent of shared structure is more pervasive for the Ames panel. However, the conditional test appeared to control for false positives in both the Ames panel and the European landraces (Figure 4, A and B).
We then conducted GWAS on 22 traits in the GWAS panel, using a P value cutoff of 0.005. This cutoff is less stringent then the cutoffs standardly used in maize GWAS (Romay et al. 2013; Peiffer et al. 2014), but allowed us to detect a large number of loci that we could use to construct polygenic scores. After thinning the loci for linkage disequilibrium (LD), we found associations for all traits with an average of 350 associated SNPs per trait (range 254–493). We used these SNPs to construct polygenic scores for lines in the Ames panel and individuals in the European landraces following Equation 9.
When we applied the original (nonconditional) test from Equation 8 to polygenic scores for 22 traits in the Ames panel on 182 PCs (4004 tests total), we detected widespread polygenic adaptation (Figure S6A and Figure S7A). In contrast, conditional on the same set of traits and PCs found no signatures of polygenic adaptation in the Ames panel that survived control for multiple testing (Figure S6B and Figure S7B). The lack of results in the conditional test is unsurprising because the GWAS panel’s population structure almost completely overlaps the Ames panel (Figure 2B), so once variation in the GWAS panel is accounted for in the conditional test, there is likely little differentiation in polygenic scores left to test for selection. We report these results to highlight the caution that researchers should use when applying methods for detecting polygenic adaptation to genotyping panels that share population structure with GWAS panels.
In the European landraces, we conducted on 22 traits across 17 PCs (374 tests total), and, while we detected selection on a number of traits, as with the Ames panel, none of these signals were robust to controlling for multiple testing using a FDR approach (Figure S7D). However, we report the results that were significant at an uncorrected level in Figure 5A to demonstrate how these types of selective signals could be visualized with these approaches. In Figure 5B, we show the relationship between conditional PC1 () and the difference between polygenic score for the number of brace roots and a conditional expectation (), which was our strongest signal of selection in the panel.

Detecting adaptation within the European landraces with on polygenic scores. (A) A heatmap showing results from the conditional on the first 17 PCs (x-axis) for 22 traits (y-axis). Squares are colored by uncorrected P value if P < 0.1. (B) Marginal evidence of selection on brace root number. The difference between polygenic score for brace root and the conditional mean is plotted against the conditional PC 1 for each line. The solid line shows the observed linear relationship between conditional PC 1 and and the dotted lines show the 95% confidence interval for this slope based on neutral expectations uncorrected for multiple testing and the 99.99% confidence interval corresponding to a Bonferonni correction for multiple testing. (C) The absolute value of the correlation coefficient (R) between PCs and latitude. (D) The proportion of significant tests for traits simulated to be under selection along latitude for three different selection strengths.
We conducted power simulations by shifting allele frequencies of GWAS-identified loci along a latitudinal selective gradient in the European landraces (see Materials and Methods section for details). When selection was strong (selection gradient α = 0.05), we detected signals of selection in all 200 simulations along the first conditional PC, which had the strongest association with latitude. When selection was moderate (α = 0.01) we detected selection in 57 of 200 simulations (Figure 5, C and D). These results suggest that there is power to detect selection on polygenic scores with in the European landraces if selection actually occurs on the loci used to make these polygenic scores.
Discussion
In this paper, we have laid out a set of approaches that can be used to study adaptation and divergent selection using genomic and phenotypic data from structured populations. We first described a method, , that can be used to detect adaptive trait divergence in a species-wide sample of individuals or lines that have been phenotyped in common garden and genotyped. We demonstrated this method using a panel of phenotyped domesticated maize lines, showing evidence of selection on flowering time and node number below ear. Second, we present an extension of that can be applied to individuals related to the GWAS panel that have not themselves been phenotyped using a conditional test to avoid confounding due to shared population structure. We showed that this test is robust to false positives due to population structure shared between the GWAS panel and the genotyping panel, and that it has power to detect selection. We applied this method to two panels of maize lines and showed marginal evidence of selection on a number of traits, but these signals were not robust to multiple testing corrections. Overall, the methods described and demonstrated here will be useful to a wide range of study systems.
While we were able to use to detect diversifying selection on phenotypes in the GWAS panel of 240 inbred lines, we were unable to detect similar patterns using polygenic scores for the Ames panel and European landraces (after controlling for multiple testing). This lack of selective signal was expected in Ames because the high overlap in relatedness between the Ames panel and the GWAS panel reduces power to detect selection in the Ames panel alone. However, our simulations showed that we did have power to detect moderate to strong selection acting on GWAS-associated loci in European landraces, and we expect that adaptation to European environments has contributed to trait diversification (Unterseer et al. 2016). There are a few factors that could explain our inability to detect selection on polygenic scores for European landraces. First, the polygenic scores we constructed used GWAS results from traits measured in North American environments. If there is for these traits, we may not be measuring traits that are actually under selection in Europe. Second, it is likely that in our small GWAS panel (n = 263 or 281) we are underpowered to detect most causal loci and so our predictions are too inaccurate to pull out a signal of selection. All together, our results suggest that while GWAS are undoubtedly useful to identify loci underlying traits, an analysis of phenotypes expressed in a common environment will often be the most powerful approach for detecting adaptation, especially in systems with underpowered GWAS.
Our proposed methods use PCA to separate out independent axes of population structure. There is a clear connection between PCA and average pairwise coalescent times (McVean 2009), and, because of this connection, PCA has been useful in a range of population genetic applications, including the detection and visualization of population structure (Patterson et al. 2006; Novembre et al. 2008), understanding the roles of population history and geography (Menozzi et al. 1978; Novembre and Stephens 2008), controlling for population structure in GWAS (Price et al. 2006). Here we use PCs to define broad axes of variation analogous to the between-population variation described in standard tests while using later PCs to describe variation within populations. In our application to maize data, we use arbitrary cutoffs to define which PCs represent between- and within-population variation. In the future, developing a rigorous method for choosing appropriate PCs will be useful, but, for now, we suggest that researchers use their own knowledge of their study system and datasets to choose these PCs. We also caution that the presence of many close relatives in the sample could affect PCA, such that early PCs could end up representing variation within a large family, and we suggest that researchers remove close relatives when using these methods. In addition, while PCs provide a useful way of separating signals, in some cases the constraints of PCA make the PCs unintuitive in terms of geography and environmental variables (Novembre and Stephens 2008). Therefore, it will also be useful to explore approaches like that outlined in Equation 9 of Berg et al. (2018) that project trait values or polygenic scores onto environmental variables to test if the relationship between trait and environment is larger than would be expected due to drift.
There are a number of connections between the methods presented here and previous approaches. Ovaskainen et al. (2011), Karhunen et al. (2013) calculated a -like measure of diversifying selection using the kinship matrix to model variation in relatedness among subpopulations. Their approach, however, is still reliant on identifying subpopulations and on using trait measurements in families or crosses to obtain estimates of . For single loci, a number of -like approaches have been developed that use PCs to replace subpopulation structure to detecting individual outlier loci that deviate from a neutral model of population structure (Duforet-Frebourg et al. 2015; Chen et al. 2016; Galinsky et al. 2016; Luu et al. 2017). Our methods can be viewed as a a phenotypic equivalent to these locus-level approaches. In addition, Liu et al. (2018) have recently explored a related approach using projections of polygenic scores along PCs. Finally it may be useful to recast our method in terms of the animal model by splitting the kinship matrix into a “between population” matrix described by early PCs and a “within population” matrix described by later PCs. We could then detect selection by comparing estimates of for these two matrices. Such an animal-model approach may also offer a way to incorporate environmental variance in systems where replicates of the same genotype are not possible.
There are a number of caveats for applying to traits on additional systems and datasets that stem from issues in accurately estimating and it is important to carefully consider the assumption underlying that all traits are made up of additive combinations of allelic effects. If environmental variation contributes to trait variation, it will reduce the power of to detect diversifying selection because environmental variation will contribute most to variation at later PCs (Appendix A). Additionally, additive-by-additive epistasis has the potential to contribute to false-positive signals of adaptation because additive-by-additive epistatic variation will contribute most to phenotypic variance along earlier PCs (Appendix B). In general, nonadditive interactions between alleles may cause difficulty for in systems, like maize, where traits are measured on inbred lines but selection occurs on outbred individuals. However, there is evidence that additive-by-additive variance will often be small compared to within populations (Hill et al. 2008); for example, the genetic basis of flowering time variation in maize is largely additive (Buckler et al. 2009), suggesting that our conclusions about adaptive divergence in flowering time are likely robust to concerns about epistasis.
Our results highlight a number of issues with polygenic adaptation tests that depend on polygenic scores using GWAS-associated loci. As has been recently highlighted by Berg et al. (2018) and Sohail et al. (2018), structure in a GWAS panel can contribute to false signals of polygenic adaptation in polygenic scores constructed from the results of that panel. We observed that this problem is especially strong when there is shared population structure between the GWAS panel and the genotyping panel used to construct polygenic scores but that the use of a conditional test that accounts for shared structure between the two datasets can control for these false positives. There is potential for these methods to be used to address problems due to structure in GWAS panels in both nonhuman and human systems, although the conditional test approach would need to be adapted to the very large sample sizes used in human GWAS.
All together, the methods presented here provide an approach to detecting the role of diversifying selection in shaping patterns of trait variation across a number of species and traits. A number of further avenues exist for extending these methods. First, we applied this test to traits independently, but extending to incorporate multiple correlated traits will likely improve power to detect selection by reducing the number of tests done. In addition, this extension could allow the detection of adaptive changes in trait correlations. Second, these methods could be extended to take advantage of more sophisticated methods of genomic prediction than the additive model presented here (as in Beissinger et al. 2018; Liu et al. 2018). Pursuing this goal will require carefully addressing issues related to LD between marker loci. Overall, developing and applying methods for detecting polygenic adaptation in a wide range of species will be crucial for understanding the broad contribution of adaptation to phenotypic divergence.
Materials and Methods
Analyses were done in R and we used the dplyr package (Wickham et al. 2017; R Core Team 2018). All code is available at https://github.com/emjosephs/qpc-maize.
The germplasm used in this study
We analyzed three different maize diversity panels.
The GWAS panel: The Major Goodman GWAS panel, also sometimes referred to as “the 282” or “the Flint Garcia GWAS Panel,” contains 302 inbred lines meant to represent the genetic diversity of public maize-breeding programs (Flint-Garcia et al. 2005). Genotype-by-sequencing (GBS) data are available for 281 of these lines from Romay et al. (2013) and 7X genomic sequence from 271 of these lines is available from Bukowski et al. (2017). In addition, these lines have been phenotyped for 22 traits in multiple common garden experiments (Hung et al. 2012).
The Ames panel: A panel of 2815 inbred lines from the USDA that have been genotyped with GBS (Romay et al. 2013) at 717,588 SNPs.
The European landraces: A panel of 906 individuals from 38 European landraces (31 Flint-type and 7 Dent-type) that were genotyped at 547,412 SNPs using an array (Unterseer et al. 2016).
in the GWAS panel
We tested for selection on 22 traits phenotyped in the GWAS panel. BLUPs for these traits were sourced from Hung et al. (2012) and genomic sequence data from Bukowski et al. (2017). Out of the 302 individuals in the GWAS panel, we retained 240 individuals that had data for all 22 traits of interest and had genotype calls for 70% of the SNPs in the genomic dataset.
To construct a kinship matrix for the GWAS panel, we randomly sampled 50,000 SNPs from across the genome after removing sites that were missing any data or had unrealistic levels of heterozygosity (the proportion of heterozygous individuals exceeded 0.5). The allele frequencies (0, 0.5, or 1) for individuals at these 50,000 SNPs were arranged in an matrix (referred to here as G), where M is the number of individuals (240) and N is the number of loci (50,000) such that is the entry on the row and column of G, and it represents the genotype of the individual at the locus.
For the denominator of Equation 8 we used the last half of the PCs (119–238). We plotted 95% confidence intervals for the PC for display purposes as lines with the same mean as the observed data and a slope of . We conducted 200 simulations by simulating traits that evolve neutrally along the kinship matrix using the mvrnorm R function in the MASS package (Venables and Ripley 2002) and testing for selection using Equation 8.
GWAS in maize inbreds
We used GEMMA (Zhou and Stephens 2012) to conduct GWAS for trait BLUPs in the GWAS panel, controlling for population structure with a standardized kinship matrix generated by GEMMA (note that this matrix is different from the one used in tests of selection with ). We conducted two separate GWAS for use in testing for selection in the Ames panel and in the European panel separately. First, for finding SNP associations that we could use to construct breeding values in the Ames panel, we used GBS data for 281 lines from the GWAS panel that had been genotyped by Romay et al. (2013). Next, for finding SNP associations for constructing breeding values in the European landraces, we took whole genome data from Bukowski et al. (2017) for 263 individuals that had genotype calls for 70% of polymorphic sites and extracted genotypes for sites that overlapped with those present in the European landrace dataset from Unterseer et al. (2016). All genotypic data were aligned to v3 of the maize reference genome, except the genotypes of the European landraces, which we lifted over from v2 to v3 using CrossMap (Zhao et al. 2013). For both sets of GWAS analyses, we tested all SNPs with a minor allele frequency above 0.01, <0.05 missing data, and we picked all hits with a likelihood-ratio test P value below 0.005. We pruned both sets of SNPs by using a linkage map from Ogut et al. (2015) to construct one cM windows with GenomicRanges in R (Lawrence et al. 2013). We picked the SNP with the lowest P value per window and, when multiple SNPs had the same P value, we sampled one SNP randomly.
Simulated populations and traits
We generated data for Figure 1 using ms to do coalescent simulations (Hudson 2002) for three large populations of 60 individuals, each with two smaller subpopulations of 30 individuals. Simulations were conducted with the following command: ms 180 1 -t 500 -r 500 10000 -I 6 30 30 30 30 30 30 -ej 0.06 1 2 -ej 0.06 3 4 -ej 0.06 5 6 -ej 0.1 4 6 -ej 0.15 2 6. This command yielded 6689 loci polymorphic loci for 180 individuals. Neutrally evolving trait values were determined by summing up the number of nonreference allele copies each individual has, multiplied by these effect sizes for 50 randomly chosen causal loci (as in Equation 9). The other 6639 loci were used to generate a kinship matrix following Equation 16 and the eigendecomposition of this matrix provided principal components.
Applying to polygenic scores from the Ames panel and European landraces
We generated combined genetic matrices for each genotyping panel (either Ames or European) and the GWAS panel. In both datasets, we removed sites with a minor allele frequency below 0.01 and a proportion of missing data > 0.05 across the combined dataset, leaving 108,110 SNPs in the Ames-GWAS dataset and 441,986 SNPs in the European-GWAS dataset. Missing data points were imputed by replacing each missing genotype from a random sample of the pool of genotypes present in the individuals without missing data. The random imputation step was done once for each missing data point and the same randomly imputed dataset was used for all subsequent analyses.
We constructed kinship matrices for the Ames panel combined with the GWAS panel and the European panel combined with the GWAS panel following the procedure described in Equation 16, using 50,000 randomly sampled SNPs with a minor allele frequency 0.01 and a proportion of missing data < 0.05. The genotype information from the combined datasets was used to construct polygenic scores following Equation 9.
We used these polygenic scores to test for diversifying selection on 22 traits (described above) for the PCs that cumulatively explained the first 30% of variation in the conditional kinship matrix (182 for the Ames panel and 17 for European maize). As in the test, we chose the later 50% of PCs for the denominator. We used Qvalue (Storey et al. 2015) to generate FDR estimates (“q values”).
Simulations of on polygenic scores
We conducted neutral simulations to detect the rate of false-positive inferences of selection on neutrally evolving traits. We conducted 800 total: 400 with the Ames panel and 400 with the European landrace panel, and within each panel, 200 with 500 causal loci and 200 with 50 causal loci. For each simulation, we simulated a phenotype by randomly picking 500 or 50 sites in the combined genotype datasets and assigning each alternate allele an effect size drawn from a normal distribution with mean 0 and variance 1. For each individual in the GWAS panel, we then calculated a simulated breeding value following Equation 9. These simulated traits were mapped using a GWAS with the same procedure described above. These simulations showed the GWAS in our sample had relatively low power to detect causal alleles, especially ones of small effect, and also identified a large number of false-positives, as could be expected for a GWAS of this size (Figure S5).
The loci identified in these GWAS (P < 0.005) were pruned for LD and then used for analysis. We tested for evidence of diversifying selection on polygenic scores in two ways for each set of simulations. First, we used with the standard kinship matrix generated using lines in the genotyping set (either Ames or European landraces). Second, we used the conditional test described in Equation 11.
Data availability
All code and data are available at https://github.com/emjosephs/qpc-maize. Supplemental material available at Figshare: https://doi.org/10.25386/genetics.7607471.
Appendix
Appendix A—Environmental Variation and Inferences of Selection from
Appendix B—Additive-by-Additive Epistasis and
Appendix C—Mean Centering
Properly mean-centering conditional expectations for polygenic scores and the kinship matrix used to calculate on these scores is crucial. However, the choice of how to properly mean-center these two parameters is not entirely straightforward when working with conditional distributions (as in Equation 10).To illustrate the problem, imagine that we mean center the conditional expectations for polygenic scores in the genotyping panel () around the mean of the GWAS panel, such that , where is the mean polygenic score of individuals in the GWAS panel, is the vector of polygenic scores in the GWAS panel, and and are subsets of the relatedness matrix between individuals in the genotyping panel and GWAS panel as defined for Equation 10. At the same time, we generate , , and from the kinship matrix K following Equation 16, where K is mean centered around the combined mean of the genotyping and GWAS panel. While these two choices, made separately, seem intuitive, together they lead to a situation where, if , we can infer signals of adaptive divergence even if none exist. Therefore, we choose to mean center both K and around the mean of all individuals in the genotyping and GWAS panels.
Acknowledgments
We thank Kate Crosby, Cinta Romay, and Peter Bradbury for assistance with maize data, Nancy Chen, Wenbin Mei, and Michelle Stitzer for comments on this manuscript, and members of the Coop, Ross-Ibarra, and Schmitt laboratories for helpful discussions. E.B.J. is supported by a National Science Foundation (NSF) National Plant Genome Initiative Postdoctoral Research Fellowship (NSF 1523733). J.J.B. is supported by a National Institutes of Health (NIH) F32 NRSA Postdoctoral Research Fellowship (GM126787). J.R.-I. acknowledges support from a NSF PGRP 1238014 and the United States Department of Agriculture (USDA) Hatch project CA-D-PLS-2066-H. G.C. acknowledges support from NIH R01-GM108779, NSF 1262327, and NSF 1353380.
Footnotes
Supplemental material available at Figshare: https://doi.org/10.25386/genetics.7607471.
Communicating editor: M. Beaumont
Literature Cited
Storey, J. D., A. J. Bass, A. Dabney, and D. Robinson, 2015 qvalue: Q-value estimation for false discovery rate control. R package version 2.8.0. https://cran.r-project.org/web/packages/qvalue/index.html
Wickham, H., R. Francois, L. Henry, and K. Müller, 2017 dplyr: a grammar of data manipulation. R package version 0.7.4. Available from: https://cran.r-project.org/web/packages/dplyr/index.html