Asymptotic tests for Hardy–Weinberg equilibrium in hexaploids

Abstract Hexaploids, a group of organisms containing three complete sets of chromosomes in a single nucleus, are of utmost importance to evolutionary studies and breeding programs. Many studies have focused on hexaploid linkage analysis and QTL mapping in controlled crosses, but little methodology has been developed to reveal how hexaploids diversify and evolve in natural populations. We formulate a general framework for studying the pattern of genetic variation in autohexaploid populations through testing deviation from Hardy–Weinberg equilibrium (HWE) at individual molecular markers. We confirm that hexaploids cannot reach exact HWE but can approach asymptotic HWE at 8–9 generations of random mating. We derive a statistical algorithm for testing HWE and the occurrence of double reduction for autopolyploids, a phenomenon that affects population variation during long evolutionary processes. We perform computer simulation to validate the statistical behavior of our test procedure and demonstrate its usefulness by analyzing a real data set for autohexaploid chrysanthemum. When extended to allohexaploids, our test procedure will provide a generic tool for illustrating the genome structure of hexaploids in the quest to infer their evolutionary status and design association studies of complex traits.


Introduction
Polyploidy has been thought to be a driving force for evolution and speciation, especially in higher plants [1][2][3][4], but the genetic mechanisms underlying its evolution have not been fully understood [5]. One approach for studying genetic variation is to test for Hardy-Weinberg equilibrium (HWE) in a natural population, which predicts that gene frequencies and genotype frequencies stay constant after an episode of random mating if no evolutionary forces act in the population [6][7][8]. Since the discovery of HWE by early geneticists, its test has been indispensable for inferring diploid population variation [9][10][11][12]. The HWE test has also served as the basis for genetic association studies [5] and as a tool to monitor genotyping errors [13,14]. However, there is little literature on the use of the HWE test to study genetic variation in polyploid populations [15,16].
More recently, Sun et al. [17] developed a mathematical equation for characterizing how tetraploids approach HWE and implemented a statistical algorithm for testing the significance of deviation from HWE in a tetraploid population. Although the tetraploid model of Sun et al. can be straightforwardly extended to the study of hexaploids, this extension is exponentially more complex with increasing ploidy level, well deserving of articulation in a separate article. Dating back to the late 1940s, through a series of complicated mathematical derivations, Geiringer [18] found that, like tetraploids, hexaploids gradually reach population equilibrium. Yet, the author did not specifically visualize the asymptotic process of hexaploid HWE and, more importantly, because of the low computational proficiency in his time, he was not able to provide an algorithm for a hexaploid HWE test.

Model
How do autohexaploid genotypes segregate and transmit from parental to offspring generations? Consider a SNP marker with two alleles A and a, which form four possible triploid gametes, AAA, AAa, Aaa, and aaa, in a hexaploid population. Through random unification, these gametes form seven possible genotypes, AAAAAA (6A), AAAAAa (5A1a), AAAAaa (4A2a), AAAaaa (3A3a), AAaaaa (2A4a), Aaaaaa (1A5a), and aaaaaa (6a). These genotypes will produce triploid gametes with different frequencies determined by both Mendel's first law and double reduction for autohexaploids. Table 1 lists the gamete frequencies from each parental genotype. For purely homozygous genotypes AAAAAA or aaaaaa, the same gamete type is identified, although its formation results from either double reduction or non-double reduction. We use P 6A (t−1), P 5A1a (t−1), P 4A2a (t−1), P 3A3a (t−1), P 2A4a (t−1), P 1A5a (t−1), and P 6a (t−1) to denote the frequencies of the seven hexaploid genotypes in the parental population (generation t−1). Gametes derived from each parental genotype combine randomly between parents to generate the offspring genotypes, with frequencies that depend on the frequencies of mating types and Table 1's gamete frequencies (Table 2). Let P 6A (t), P 5A1a (t), P 4A2a (t), P 3A3a (t), P 2A4a (t), P 1A5a (t), and P 6a (t) denote the corresponding genotype frequencies in the offspring population (generation t), with forms given in Supplemental Text 1. Table S1 represents a group of recursive equations that describe how genotype frequencies change from one generation to the next in a panmictic hexaploid population. By plotting these frequencies against generation, we can monitor how and when the hexaploid population reaches equilibrium in genotype proportions. Under random chromatid segregation, the rate of double reduction (α) in autohexaploids has a theoretical bound of 0 < α < 3/11 [20]. We randomly sample an array of genotype frequencies P(0) = (0.1, 0.05, 0.2, 0.25, 0.13, 0.1, 0.17) as initial values and plot generation-varying frequencies under α = 0, 1/7, 1/5, 3/11 (Fig. 1A). We find that unlike a case in diploids using one generation to attain HWE, all genotype frequencies in hexaploids will not reach absolute equilibrium but will rather tend to be stable after 8-9 generations of random mating, as opposed to 5-6 generations in tetraploids. Given its asymptotic stability, Sun et al. [17] named such an equilibrium asymptotic HWE (aHWE). We find that double reduction has little impact on the attainment of aHWE in autohexaploids, but it affects the values of equilibrium genotype frequencies (Fig. 1B), suggesting that double reduction is a driver of hexaploid evolution. The above findings are confirmed by repeating our sampling procedure 1000 times.
How is aHWE tested? Sun et al. [17] proposed two approaches for testing aHWE in tetraploids. The first is the recursive test based on comparison between the initial genotype frequencies and the genotype frequencies at asymptotic equilibrium. Let P(8) denote an array of genotype frequencies at generation 8 of random mating, estimated by recursive equations (Supplemental Text 1), as aHWE genotype frequencies. By comparing P(8) to the initial (observed) genotype frequencies P(0), we calculate the chi-square test statistic, j = 6A, 5A1a, 4A2a, 3A3a, 2A4a, 1A5a, 6a and compare it against the critical threshold χ 2 95%,df=6 , from which the significance of deviation from aHWE can be determined.
The second approach is the gamete-based test. Under HWE, genotype frequencies are expressed as the products of gamete frequencies, which are thought to be the expected genotype frequencies. Let P AAA (t), P AAa (t), P Aaa (t), and P aaa (t) denote the frequencies of four gametes that produce zygotic genotypes at generation t, whose equilibrium frequencies are expressed as: Note that the above expressions of zygote genotype frequencies are derived without considering double reduction because, as shown above (Fig. 1B), its impact on       equilibrium frequencies is trivial. Let N j denote the size of the zygotic genotypes,6A, 5A1a, 4A2a, 3A3a, 2A4a, 1A5a, and 6a, observed in the current population. Based on the equilibrium zygotic frequencies in equation (3), we formulate a likelihood of these observations as where terms related to the heterozygotes 4A2a, 3A3a, and 2A4a each contain two mixture components. We implemented the expectation-maximization (EM) algorithm to obtain the maximum likelihood estimates (MLEs) of P AAA (t), P AAa (t), P Aaa (t), and P aaa (t), which are used to estimate the expected frequencies of zygotic genotypes using equation (2) (see also Sun et al. [17]). A chi-square test statistic is calculated to test whether the marker deviates from HWE in hexaploids. Unlike the case of a three-genotype diploid population in which the degree of freedom is equal to 3 − 1 − 1 = 1 [21] for the HWE test, this test statistic follows the chi-square distribution with an unknown degree of freedom. However, we can empirically determine it as a value between 7 − 1 − 1 = 5 to 7 − 1 = 6.
How can double reduction be tested? We develop a procedure for testing the significance of double reduction in autohexaploids. Table 2 shows how zygotic genotypes are formed through random mating in the previous generation through a total of 15 mating types. If there is no double reduction (α = 0), zygotic frequencies in the current population are reduced from full recursive equations (Supplemental Text 1) as where P j 's (j = 6A, 5A1a, 4A2a, 3A3a, 2A4a, 1A5a, 6a) are the zygotic frequencies in the parental population.
We formulate a likelihood of observations of seven zygotic genotypes based on the zygotic frequencies of equation (4) under α = 0, which is expressed as where each term contains complex mixture components. We take advantage of the EM algorithm described in Sun et al. [17] to estimate the genotype frequencies P j 's in the parental population. In Supplemental Text 2, we provide a detailed procedure for the EM algorithm for genotype frequency estimation. By substituting the MLEs of these parental frequencies into equation (4), we obtain the MLEs of zygotic frequencies R j 's in the current generation under the assumption of no double reduction. Thus, by comparing the observations of genotypes, we use a chisquare test to determine whether double reduction exists at the considered SNP by calculating the test statistic which is χ 2 -distributed with five to six degrees of freedom.
Next, we describe a procedure for estimating the MLE of double reduction. Because double reduction has little inf luence on equilibrium genotypic frequencies, we can replace genotype frequencies at generation t − 1 contained in recursive equations (Supplemental Text 1) by equilibrium genotype frequencies expressed as the products of maternal gamete frequencies and paternal gamete frequencies, with a similar form of equation (2). Thus, recursive equations are composed of gamete frequencies P AAA (t − 1), P AAa (t − 1), P Aaa (t − 1) and P aaa (t − 1) and the rate of double reduction α. Similar to equation (3), we formulate a likelihood based on seven genotype frequencies at generation t. Each term in the likelihood contains complex mixture components. Thus, we implement the EM algorithm to estimate P AAA (t − 1), P AAa (t − 1), P Aaa (t − 1), P aaa (t − 1), and α.
The by-product of the above procedure is to test whether the parent population at generation t -1 deviates from aHWE. Under α = 0, we calculate the likelihood (L 1 ) from equation (5), which corresponds to the alternative hypothesis that there is deviation from aHWE. Meanwhile, we calculate the likelihood (L 0 ) of the case in which parental genotype frequencies are expressed as equilibrium frequencies, which corresponds to the null hypothesis. The log-likelihood ratio is a test statistic assumed to follow a chi-square distribution with one degree of freedom. This procedure can test the existence of parental aHWE.

Numerical examples
Test procedure: As an example that demonstrates how to test aHWE and double reduction, we generate a random set of seven hexaploid genotypes containing 29 individuals for 6A, 21 for 5A1a, 17 for 4A2a, 10 for 3A3a, 10 for 2A4a, 10 for 1A5a, and 23 for 6a at a SNP, totaling N = 120, from a natural autohexaploid population. We use recursive equations (Supplemental Text 1) to estimate aHWE genotype frequencies at generation 8 of random mating. By comparing these equilibrium frequencies with observed genotype frequencies, we calculate a chi-square test statistic, which is 6.602 (compared with χ 2 95%,df=6 ), suggesting that the segregation of the marker deviates from HWE. We implement the EM algorithm (Supplemental Text 2) to estimate gamete frequencies under HWE and use these estimates to obtain the MLEs of equilibrium zygotic frequencies. The chi-square test statistic is calculated as 6.649, indicating significant deviation from HWE. Thus, both the recursive test and the gamete-based test produce a consistent result in this example.
To test whether this example contains significant double reduction, we implement the EM algorithm to estimate zygotic frequencies (under the assumption of no double reduction) in the parental generation and use these estimates to obtain the expected zygotic frequencies in the current generation. Then, using equation (6), we calculate the chi-square test statistic as 5.922, which suggests the existence of double reduction in this example.
Monte Carlo simulation: We performed computer simulation to examine the statistical properties of our EM algorithm-based testing procedure. Simulation studies focus on assessing the estimation precision of parental gamete frequencies and parental zygotic genotype frequencies, which are used to test the existence of aHWE and double reduction. We sample a set of parental genotype frequencies P(t−1) = (P 6A (t−1), P 5A1a (t−1), P 4A2a (t−1), P 3A3a (t−1), P 2A4a (t−1), P 1A5a (t−1), P 6a (t−1)) = (0.10, 0.5, 0.20, 0.25, 0.13, 0.10, 0.17) under sample sizes of N = 100, 200, and 400. Under different values of α, we use recursive equations (Supplemental Text 1) to estimate offspring genotype frequencies. By assuming parental aHWE, we use the EM algorithm to estimate parental gamete frequencies (P AAA (t−1), P AAa (t−1), P Aaa (t−1), P aaa (t−1)) and α. As shown in Table 3, all these parameters can be estimated with reasonable precision, even with a modest sample size of N = 100. The power of detecting aHWE is about 0.70 for N = 100 to about 0.90 for N = 400. We simulate another set of offspring genotype frequencies from parental gamete frequencies under aHWE from which to estimate the probability of incorrectly detecting aHWE. Such false positive rates are quite low, below 0.08, under different sample sizes.
Correctly testing for the existence of double reduction depends on the precise estimation of parental genotype frequencies based on the EM algorithm. We examine the precision and power of parameter estimation through computer simulation studies. Given initial values for parental genotype frequencies P(t−1) = (0.10, 0.05, 0.20, 0.25, 0.13, 0.10, 0.17), we simulate the observations of seven genotypes in the current population under α = 0, 1/7, 1/6, 1/5, 1/4, assuming sample size N = 100, 200, and 400. The means and standard deviations of the estimates of each parental genotype frequency and the power for detecting significant double reduction are given in Table 4. It can be seen that parental genotype frequencies can be fairly well estimated even under a modest sample size (N = 100), although the accuracy and precision of parameter estimates increase with sample size. Considering the adequate power of detecting double reduction, a sample size of at least N = 100 is recommended to obtain reasonably good estimates of parental genotype frequencies and, therefore, a good test for the occurrence of double reduction in an autohexaploid natural population. If the signal of double reduction is weak, more samples (say N > 200) are needed to reasonably detect double reduction. If double reduction is detected for the simulated offspring zygote frequency data under no double reduction, then this indicate a false positive discovery. We find that our model has reasonably low false positive rates (<0.10) even under a small sample size.
Real data analysis: To demonstrate how our methods can be used to test for aHWE and double reduction, we analyze marker data collected from an autohexaploid chrysanthemum with great ornamental and medicinal value [22]. As an allogamous plant, chrysanthemum has six sets of chromosomes, each with 9 chromosomes, and its numerous chromosomes (2n = 6x = 54) make it difficult to study the genome structure of this species without sophisticated statistical methods. By crossing two heterozygous parents, Sumitomo et al. [22] generated a segregating full-sib family, which can be used as a proxy for a natural population in terms of the pattern of marker segregation. For this family, a total of 5509 intercross simplex markers and 3710 testcross simplex markers were genotyped. Yet, because a low-resolution sequencing technique was used, these markers are dosage ambiguous, i.e. it is impossible to distinguish the five heterozygotes 5A1a, 4A2a, 3A3a, 2A4a, and 1A5a (collectively denoted A_a_) from one another. We randomly choose four segregating markers for equilibrium and double reduction tests. Table 5 presents the result of marker tests. By comparing observed genotype frequencies with equilibrium genotype frequencies calculated at generation 8 after random mating, the recursive test finds that all chosen markers significantly deviate from aHWE, and the two markers SNP-113 and SNP-312 have p-values of <10 −50 . Because there are only three distinguishable genotypes, a gamete-based approach cannot be used to test for aHWE. To do so, we implement an allele-based approach, assuming that the formation of a triploid gamete involves the random combination of three alleles, i.e. the frequencies of AAA, AAa, Aaa, and aaa are expressed as p 3 , 2p 2 q, 2pq 2 , and q 3 , where p and q are the allele frequencies of A and a, respectively. A chi-square test based on the allele model produces equilibrium test results that are highly consistent with those obtained from the recursive test (Table 5).
By incorporating the allele model into recursive equations (Supplemental Text 1), we can test the significance of double reduction at individual heterozygoteambiguous markers. Table 6 illustrates such test results at four randomly chosen markers. We find that marker SNP-5 does not display significant double reduction, whereas double reduction is highly significant at markers SNP-130, SNP-406, and SNP-558. Our model provides a unique tool for testing double reduction.

Discussion
Polyploids are a group of plants with great importance in plant evolutionary studies and plant breeding. Although there is a rich body of literature on quantitative genetic dissection of complex traits based on artificial crosses [23,24,25], only a few studies have investigated the population genetic diversity of polyploids [4,15,16,26,27]. There are few methodological studies that describe analytical models for population and evolutionary genetics in polyploids by considering the structural and organizational complexities of polyploid genomes [28]. Sun et al. [17] developed a simple mathematical model to confirm the number of generations required to asymptotically approach HWE in tetraploids by early geneticists [18], but beyond this detection, Sun et al. proposed a statistical procedure for testing aHWE and validated its usefulness by analyzing a real data set. It can be expected that the conclusion of Sun et al. can be extended to polyploids at a higher ploidy level, but a convincing proof and the corresponding algorithm for the equilibrium test are not available.  In this article, we propose a mathematical procedure for detecting equilibrium genotype frequencies in a panmictic hexaploid population by deriving a group of recursive equations. We find that in contrast to diploid populations that reach HWE after only one generation of random mating, hexaploids require at least eight generations to approach asymptotic equilibrium. This is also different from tetraploids, which require four generations of random mating [17]. These recursive equations provide a general framework for testing aHWE from different perspectives. A so-called recursive test attempts to compare observed genotype frequencies with equilibrium genotype frequencies calculated at generation 8 after random mating. Using the standard equilibrium assumption, we develop a statistical gametebased algorithm for HWE testing in parental and offspring hexaploid populations. As seen from several numerical examples, both recursive and statistical methods produce consistent test results.
One additional advantage of our procedure is the ability to estimate and test double reduction in autohexaploids. As a common phenomenon with a role in shaping autopolyploid diversity and evolution, double reduction has received considerable attention [19,29]. Yet, its estimation and testing are mostly performed using artificial controlled crosses [30], although a few studies have done so using a panel of samples from natural populations [31]. In this study, we incorporate recursive equations to test the significance of double reduction over the autohexaploid genome. This procedure can scan molecular markers throughout the genome, visualize the landscape of double reduction, and identify key regions where this phenomenon occurs. In many polyploid genetic studies, genome sequencing is not conducted at a level of high resolution that allows heterozygous genotypes to be distinguished from each other in terms of allelic dosages. For these genotypeambiguous markers, we incorporate an allele-based model to test double reduction by assuming that gametes are random combinations of paternal and maternal alleles. This allele-based model expands the application of our test procedure to test double reduction using less informative markers.
We perform computer simulation to examine the statistical properties of our procedure, validating its usefulness. To test aHWE in hexaploid populations, a modest sample size (say 100) is adequate for the recursive approach because it only relies on the estimation of seven genotype frequencies. The gamete-based approach requires a reasonable estimate of gamete frequencies by the EM algorithm, which requires a larger sample size (say 200) for the aHWE test. In addition, results from computer simulation suggest that a modest sample size of 100 can reasonably estimate the genotypic frequencies, with good power to detect the significance of a small double reduction (α = 1/7). A low false positive rate implies that as long as double reduction is tested to be significant, the likelihood of its actual existence is high.
As a proof of concept, we use our procedure to analyze data from a full-sib family of autohexaploid chrysanthemum [21]. Although these data were not collected from a natural population, marker segregation in the family follows a similar pattern to that expected in nature. Thus, it is reasonable to demonstrate the utility of our procedure using a full-family dataset. We show our test results by randomly choosing several markers (Tables 4  and 5) and further explain these results in terms of aHWE and double reduction by our procedure.
In conclusion, our computational procedure is robust for testing aHWE and the occurrence of double reduction. It could have immediate implications for analyzing population genetic data collected from natural populations of any autohexaploid or allohexaploid species, including sweetpotato, wheat, kiwifruit, etc. Results from our procedure can provide insight into the evolutionary forces that act on the genomes of hexaploids and can also be used to detect genotyping errors in marker data. As the first step of molecular breeding in hexaploids, genomewide association studies (GWAS) have been increasingly used as a routine approach for studying the genetic architecture of agriculturally important traits [32][33][34]. Our aHWE testing procedure provides valuable assistance for the quality control of markers and the evolutionary inference of any significant loci detected from GWAS. In this study, we focus our analysis and modeling on single markers, but a joint analysis of two, even more than two, markers is essential, despite its tediousness in model derivations, given that non-random associations between different markers [28] (as modeled in tetraploids) contribute to hexaploid diversity and evolution in a different way.