A number of statistical tests for detecting population growth are described. We compared the statistical power of these tests with that of others available in the literature. The tests evaluated fall into three categories: those tests based on the distribution of the mutation frequencies, on the haplotype distribution, and on the mismatch distribution. We found that, for an extensive variety of cases, the most powerful tests for detecting population growth are Fu's FS test and the newly developed R2 test. The behavior of the R2 test is superior for small sample sizes, whereas FS is better for large sample sizes. We also show that some popular statistics based on the mismatch distribution are very conservative.
Comparison of DNA sequences within and between species is a powerful approach not only for determining the evolutionary forces acting in specific gene regions but also for determining relevant aspects of the evolutionary history of the species (for reviews, see Takahata 1996; Rogers 1997; Harpending et al. 1998; Jorde, Bamshad, and Rogers 1998; Cann 2001). The coalescent theory (Kingman 1982a, 1982b; Hudson 1990; Donnelly and Tavaré 1995; Fu and Li 1999) is the most powerful theoretical approach for interpreting DNA sequence data. The coalescent is a population genetic model focused primarily on the neutral evolution of gene trees; this model provides the framework for the development of statistical tests and also provides very efficient computer simulation methods.
Tajima (1989b), Slatkin and Hudson (1991), and Rogers and Harpending (1992) pioneered the study of the effect of some demographic events on DNA sequence data. They showed that a relatively recent demographic event, such as population growth, causes most of the coalescent events to occur before the expansion and, consequently, samples from these populations have gene genealogies stretched near the external nodes and compressed near the root (i.e., star genealogies). Thus, population size changes can leave a particular footprint that may eventually be detected in DNA sequence data. This theoretical framework prompted the development of statistical tests for detecting population expansion.
The analysis of the distribution of pairwise differences, or mismatch distribution (Slatkin and Hudson 1991; Rogers and Harpending 1992), provides a method for inferring such demographic events. These authors showed that, for nonrecombining DNA regions, constant size populations present mismatch distributions whose shapes bear very little resemblance to those expected in growing populations. This prompted the development of some statistical tests for detecting expansion processes (Harpending et al. 1993; Harpending 1994; Eller and Harpending 1996; Rogers et al. 1996). One of the most frequently used tests is the raggedness statistic rg (Harpending et al. 1993). Although the distribution of the rg statistic is unknown, its confidence intervals can be obtained by computer simulations based on the coalescent algorithm. But because methods based on the mismatch distribution use little of the information accumulated in the data (Felsenstein 1992), tests based on the mismatch distribution should be very conservative.
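As an illustration of the quantities discussed in this paragraph, the following minimal Python sketch computes a mismatch distribution and a raggedness value from aligned sequences. It assumes the form of the raggedness index usually attributed to Harpending (1994), the sum of squared differences between adjacent mismatch classes; the function names are hypothetical and are not taken from the tests evaluated here.

```python
from itertools import combinations

def mismatch_distribution(seqs):
    """Relative frequency of sequence pairs that differ at 0, 1, ..., d sites."""
    diffs = [sum(a != b for a, b in zip(x, y)) for x, y in combinations(seqs, 2)]
    freq = [0.0] * (max(diffs) + 1)
    for value in diffs:
        freq[value] += 1.0 / len(diffs)
    return freq

def raggedness(freq):
    """Assumed form: sum over i = 1..d+1 of (x_i - x_(i-1))^2, with x_(d+1) = 0."""
    x = list(freq) + [0.0]
    return sum((x[i] - x[i - 1]) ** 2 for i in range(1, len(x)))

# Toy alignment of four sequences over three segregating sites.
sample = ["100", "010", "001", "001"]
print(mismatch_distribution(sample))
print(raggedness(mismatch_distribution(sample)))
```

Smooth, unimodal mismatch distributions give small values of this index, whereas multimodal ("ragged") distributions give larger ones.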
In recent years, a number of authors have developed several methods of statistical inference and statistical tests using different approaches (e.g., Griffiths and Tavaré 1994; Bertorelle and Slatkin 1995; Rogers 1995; Aris-Brosou and Excoffier 1996; Fu 1996, 1997; Kuhner, Yamato, and Felsenstein 1998; Weiss and Von Haeseler 1998; Galtier, Depaulis, and Barton 2000; Furlong and Brookfield 2001). More recently, specific methods for detecting population expansions have also been developed for the analysis of microsatellite data (e.g., Kimmel et al. 1998; Reich and Goldstein 1998; Beaumont 1999; Reich, Feldman, and Goldstein 1999; King, Kimmel, and Chakraborty 2000).
Here we report the development of new statistical tests for detecting past population growth. We performed an extensive analysis of their statistical power against different alternative hypotheses, and we compared their performance with that of other tests published in the literature. Although some authors (Braverman et al. 1995; Simonsen, Churchill, and Aquadro 1995; Fu 1996, 1997) have also investigated the power of some statistical tests against population growth and genetic hitchhiking (which leave similar footprints in DNA sequences), at present there is no exhaustive comparative analysis. The main population growth model investigated was sudden (instantaneous) growth, although we also studied the power under the logistic model of population growth. The power of these tests was evaluated using random data sets generated by computer simulations based on the coalescent (Hudson 1990).
Materials and Methods
We analyzed the performance of 17 statistical tests to distinguish specific models of population growth from the null hypothesis of a constant size population under the neutral model. Thus, we determined the power of these tests to reject the null hypothesis when the alternative hypothesis is really true. On the basis of the sequence information used, the test statistics evaluated have been classified into three major classes, namely classes I, II and III (see below). We developed several new statistical tests based on high-order moments (within classes I and III) because the distortion of the gene tree caused by the population growth would suggest that these types of tests could be more powerful than other tests available in the literature.
Class I Statistics
Class I statistics use information on the mutation (segregating site) frequency. These statistics could be appropriate to distinguish population growth from constant size populations because the former generates an excess of mutations in external branches of the genealogy (i.e., recent mutations) and therefore an excess of singletons (substitutions present in only one sampled sequence) (Tajima 1989a, 1989b; Slatkin and Hudson 1991).
We studied the following test statistics: Tajima's D, and Fu and Li's D*, F*, D (named DF) and F statistics (Tajima 1989a; Fu and Li 1993; see also Simonsen, Churchill, and Aquadro 1995). These tests are based on the difference between two alternative estimates of the mutational parameter 𝛉 = 2Nu, where N is the effective number of gene copies in the population (the number of females in the population for mtDNA regions or double the population size for an autosomal region) and u is the mutation rate. Tajima's D and Fu and Li's D* and F* statistics use information from intraspecific data only, whereas Fu and Li's DF and F statistics use information from the number of recent mutations; the latter, therefore, require an outgroup to be computed.
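To make the idea of contrasting two estimates of 𝛉 concrete, the sketch below computes Tajima's D from the sample size n, the number of segregating sites S, and the average number of pairwise differences k, using the standard constants of Tajima (1989a). It is an illustrative implementation under those standard formulas, not the code used in this study; the function name is hypothetical.

```python
import math

def tajimas_d(n, s, k):
    """Tajima's D: (k - S/a1) scaled by its estimated standard deviation.

    n: number of sequences, s: number of segregating sites,
    k: average number of pairwise nucleotide differences.
    """
    if n < 4 or s < 1:
        # For n < 4 the variance constants are zero (or undefined).
        raise ValueError("need at least 4 sequences and 1 segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1.0) / (3.0 * (n - 1.0))
    b2 = 2.0 * (n * n + n + 3.0) / (9.0 * n * (n - 1.0))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2.0) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # k and S/a1 both estimate theta under the neutral constant-size model.
    return (k - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

print(tajimas_d(4, 3, 10.0 / 6.0))
```

An excess of low-frequency mutations, as expected after population growth, inflates S relative to k and pushes D toward negative values.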
We also built two R2-related tests, namely R3 and R4. These statistics differ from the R2 test in the power exponent values; in R3 and R4, the exponent values of 2 and 1/2 (eq. 1) are replaced by 3 and 1/3, and by 4 and 1/4, respectively.
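For illustration, R2 is commonly quoted as the square root of the mean of (Ui − k/2)^2 over the n sequences, divided by S, where Ui is the number of singleton mutations in sequence i, k is the average number of pairwise nucleotide differences, and S is the number of segregating sites. The sketch below assumes that form, which is consistent with the exponents 2 and 1/2 mentioned above, and assumes the input is a 0/1 matrix in which every column is a segregating site. The function name and input layout are hypothetical.

```python
import math
from itertools import combinations

def r2_statistic(seqs):
    """R2 under the assumed form sqrt(mean((U_i - k/2)^2)) / S.

    seqs: list of equal-length lists of 0/1 states, one row per sequence;
    every column is assumed to be a segregating site (no outgroup needed).
    """
    n = len(seqs)
    s = len(seqs[0])
    singletons = [0] * n  # U_i: singleton mutations carried by sequence i
    for site in range(s):
        column = [row[site] for row in seqs]
        ones = sum(column)
        if ones == 1:
            singletons[column.index(1)] += 1
        elif ones == n - 1:
            singletons[column.index(0)] += 1
    pairs = list(combinations(seqs, 2))
    k = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    mean_sq = sum((u - k / 2.0) ** 2 for u in singletons) / n
    return math.sqrt(mean_sq) / s

sample = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
print(r2_statistic(sample))
```

After recent growth, the number of singletons per sequence is expected to be close to k/2, so low values of the statistic point toward expansion.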
We have constructed three additional test statistics (R2E, R3E, and R4E) that use information on the number of mutations in external branches; thus, an outgroup will be required for their estimation. The R2E test is defined as
Class II Statistics
In class II, we include statistical tests that use information from the haplotype distribution. Within this class we studied only Fu's FS test statistic (Fu 1997). This statistic, which is based on Ewens' sampling distribution (Ewens 1972), takes low values when there is an excess of singleton mutations, as caused by an expansion.
Class III Statistics
Class III statistical tests use information from the distribution of the pairwise sequence differences (or mismatch distribution). It has been shown that population expansions leave a particular signature in the distribution of the pairwise sequence differences (Slatkin and Hudson 1991; Rogers and Harpending 1992); therefore, statistics based on the mismatch distribution can be used to test for demographic events. We evaluated the following statistics. (1) The raggedness rg statistic (Harpending et al. 1993; Harpending 1994). The raggedness statistic, which measures the smoothness of the mismatch distribution, differs between constant size and growing populations: lower rg values are expected under the population growth model. (2) The mean absolute error (MAE) between the observed and the theoretical mismatch distribution (Rogers et al. 1996). (3) We also developed a new statistical test, the ku test, based on the fourth central moment (i.e., on the kurtosis) of the mismatch distribution. Given that population expansion generates more smoothly peaked distributions, this statistic can distinguish between constant size and growing populations. Let d, nc, and Wi be the maximum number of differences in the mismatch distribution, the number of pairwise comparisons (= n(n − 1)/2), and the frequency of pairs of DNA sequences that differ by i mutations, respectively. We define:
We obtained the empirical distribution of each statistical test by Monte Carlo simulations based on the coalescent process for a neutral infinite-sites model, assuming a large population size (Kingman 1982a, 1982b; Hudson 1990). We also assumed that there is neither intragenic recombination nor migration and that the mutation rate is homogeneous across the DNA region. We performed the simulations conditional on the number of segregating sites (S); that is, randomly placing S mutations along the tree (the so-called fixed S method). Given that the actual value of 𝛉 is usually unknown, this method seems to be appropriate for testing purposes (Hudson 1993). The routine ran1 (Press et al. 1992) was used as a random number generator. We conducted coalescent simulations for constant population size (null hypothesis) and for population growth (alternative hypothesis); the empirical distribution was estimated from 100,000 computer replicates for both the null and the alternative hypotheses.
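The fixed S procedure for the constant size model can be sketched in a few lines of Python. This is a simplified illustration, not the program used here: it builds one neutral coalescent genealogy for n sequences and then drops exactly S mutations on branches chosen with probability proportional to their length. The function name is hypothetical, and Python's built-in generator stands in for ran1.

```python
import random

def simulate_fixed_s(n, s, rng=None):
    """One constant-size coalescent sample with exactly s segregating sites.

    Returns n rows of 0/1 states (1 = derived), one column per mutation.
    """
    rng = rng or random.Random()
    # Each active lineage: [set of sampled descendants, branch length so far].
    active = [[frozenset([i]), 0.0] for i in range(n)]
    branches = []
    while len(active) > 1:
        k = len(active)
        # Waiting time to the next coalescence among k lineages.
        wait = rng.expovariate(k * (k - 1) / 2.0)
        for lineage in active:
            lineage[1] += wait
        a, b = rng.sample(range(k), 2)
        first, second = active[a], active[b]
        branches.append((first[0], first[1]))
        branches.append((second[0], second[1]))
        merged = [first[0] | second[0], 0.0]
        active = [lin for i, lin in enumerate(active) if i not in (a, b)]
        active.append(merged)
    leaf_sets = [leaves for leaves, _ in branches]
    lengths = [length for _, length in branches]
    # Fixed S: place s mutations on branches in proportion to branch length.
    hit = rng.choices(leaf_sets, weights=lengths, k=s)
    return [[1 if i in leaves else 0 for leaves in hit] for i in range(n)]

for row in simulate_fixed_s(5, 6, random.Random(1)):
    print(row)
```

Repeating this many times and computing a test statistic on each replicate yields its empirical null distribution; a growth model would change only the distribution of the waiting times.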
For the constant size model (null hypothesis), the samples were generated using conventional procedures (Hudson 1990); in this model only two parameters are required: the sample size and the number of segregating sites. The sudden population growth model (Rogers and Harpending 1992) considers a population that was formerly at equilibrium, but te generations before the present one the population grew suddenly to the current size. Coalescent simulations under the sudden expansion model require four parameters: n, S, te, and De, where De, the degree of the expansion event, is:
Critical Values and the Power of the Tests
We determined the critical values of each statistical test from its empirical distribution. The power of each test, or the probability of rejecting the null hypothesis (constant size population) when the alternative hypothesis (population growth) is true, was estimated as the proportion of computer replicates generated under the alternative hypothesis for which the null hypothesis was rejected. For the analysis, we fixed a significance level of α = 0.05. Because the critical region for all alternative hypotheses would consist of only one side of the distribution, we conducted one-tailed tests. Specifically, all analyzed statistics, except Ch, Che, and ku, had lower values under the population growth model.
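The procedure described in this paragraph can be sketched as follows for a lower-tailed test. The code is a minimal illustration with hypothetical function names: it takes the largest observed value whose cumulative null frequency does not exceed α as the critical value, and estimates power as the proportion of alternative replicates at or below it.

```python
from bisect import bisect_right

def lower_tail_critical_value(null_values, alpha=0.05):
    """Largest observed value c with P(null <= c) <= alpha, or None."""
    ordered = sorted(null_values)
    n = len(ordered)
    critical = None
    for value in sorted(set(ordered)):
        if bisect_right(ordered, value) / n <= alpha:
            critical = value
        else:
            break
    return critical

def rejection_rate(values, critical):
    """Proportion of replicates falling in the lower-tail rejection region."""
    if critical is None:
        return 0.0
    return sum(v <= critical for v in values) / len(values)

# Toy example with a coarse (few-valued) null distribution.
null = [0] * 2 + [1] * 10 + [2] * 88
alt = [0] * 40 + [1] * 30 + [2] * 30
c = lower_tail_critical_value(null)
print("critical value:", c)
print("actual size:", rejection_rate(null, c))
print("power:", rejection_rate(alt, c))
```

Applying `rejection_rate` to the null replicates gives the actual size of the test, which can fall well below the nominal α when the statistic takes only a few distinct values.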
Given that under the null hypothesis the empirical distribution of some statistics presented a reduced number of points (e.g., the distribution of D* statistic; see Results), the actual probability of rejecting the null hypothesis when it is true (i.e., the size of the test) could be lower than the nominal significance level of 0.05.
Sudden Population Growth Model
We studied the power of 17 statistical tests under different values of n, S, De, and Te. Although we examined the power for a wide range of the parameter space, we show only the most relevant cases (additional results and figures are available from the authors). The parameters fixed for illustrating the power were n = 10 and n = 50, S = 10 and S = 50, De = 10 and De = 100, and Te = 0.1 (time for the maximum power; see below). These values give a clear view of the statistical power under some realistic cases: for small and large sample sizes, for a low and high number of mutations, and for reasonable population growth parameters. In all cases, the parameter sets were chosen to avoid saturation of the power curves.
The tests R3, R4, R3E, and R4E showed power similar to that of R2 and will not be presented here. Nevertheless, for some specific sets of parameters the R4 and R4E tests presented slightly higher power than R2. In general, the statistical tests that use interspecific data showed power similar to that of their equivalent statistics using intraspecific information (figures not shown).
Figure 1 shows the effect of Te (the time elapsed since the expansion event) on the statistical power of different statistical tests. It can be observed that R2 and Fu's FS are the most powerful tests: the R2 test is the most powerful for small sample sizes, whereas the behavior of Fu's FS is better for large samples. The power of Tajima's D and Fu and Li's F* is lower than that of R2 and FS. The results also indicate that some commonly used tests based on the mismatch distribution, rg and MAE, are among the least powerful. All statistical tests show a peak in the statistical power at intermediate values of Te (Te ∼ 0.1); thus, a population expansion is unlikely to be detected when Te is too small or too large. This result agrees with that obtained by Simonsen, Churchill, and Aquadro (1995) and Fu (1997).
The effect of De on the power to reject the constant size model is depicted in figure 2. All statistical tests, except those of class III, gain power to reject the constant size model with increasing De; therefore, large samples will be needed to detect small population growth events. Again, tests based on the mismatch distribution are very insensitive to population growth. The most powerful statistics are R2 and FS. Tajima's D, Fu and Li's F* and D*, and Ch have comparatively less power.
Figures 3 and 4 show the effect of the sample size and the number of segregating sites on the power to reject the neutral constant size model under specific alternative hypotheses. As expected, both variables have a major effect on the statistical power: the larger the values of n or S, the greater the power of the tests. The effect differs among statistics, however: for small sample sizes (and a small number of segregating sites) the R2 statistical test is the most powerful (figs. 3A and 4A), whereas for larger sample sizes FS is the most powerful one. Moreover, for small sample sizes the power of DF and F is better than that of the counterpart tests without an outgroup, although they are not as powerful as R2 and FS (figure not shown). The results also indicate that the statistical tests based on the mismatch distribution, rg and MAE, are among the least powerful. In fact, in some cases, their power decreases as the sample size increases.
It should be noted that statistics D* (figs. 3A and 4B) and FS (fig. 4A) behave irregularly: they show some atypical power drops with increasing sample size or number of segregating sites. This unexpected pattern has two different explanations. In the case of Fu and Li's D* statistic, the power drop is caused by a marked decrease in the actual significance level (i.e., the size of the test). In fact, the D* empirical distribution has a reduced number of possible points, causing, for some specific values, this level to drop to 0.02. The atypical pattern of the FS test is due to the intrinsic structure of the statistic. In fact, the empirical distribution of FS (under both the H0 and H1 hypotheses) presents pronounced changes at specific ranges of values. That pattern causes marked changes in the power when these values are within the rejection region (results not shown). Nevertheless, this irregular behavior is not present in coalescent simulations conditional on the value of 𝛉 (results not shown).
Logistic Population Growth Model
We also analyzed the power under a more realistic population growth scenario, the logistic population growth model. Using this model, we performed an exploratory analysis of the most relevant cases to validate the conclusions of our work. We found that assuming the logistic population growth model does not change the major conclusions of the work. Even so, in comparison with the sudden growth model, the maximum power of the tests is reached at higher values of the elapsed time; for instance, for the parameter sets used in Fu (1997) (r = 10, c = 1) the maximum power is at Ts ∼ 1.2. In general, as expected, (1) all statistical tests have less power under the logistic than under the sudden growth model, although the decrease in power is relatively uniform across tests, and (2) the larger the value of r, the more power the tests have.
Application to DNA Sequence Data
The present results have been applied to two published DNA data sets: the mtDNA variation analysis of a Turkish human population (Comas et al. 1996), and the survey of a human noncoding autosomal region (Alonso and Armour 2001). Comas et al. (1996) sequenced 360 base pairs of region I of the mtDNA D-loop in 45 individuals. From the mismatch distribution analysis the authors suggested that the Turkish population had expanded recently. We determined the power of the different tests to identify which is most powerful against population growth. For the total data (n = 45; S = 56) and considering that De = 100 and Te = 0.4 (scaled in terms of N generations), most tests were powerful enough, and several of them could reject the null hypothesis of constant size. We also determined whether the tests could reject the null hypothesis for small sample sizes. For that, we reanalyzed a subset of 10 randomly chosen sequences from the data of Comas et al. (1996). Table 1 shows the estimates of the power and of the P values of some statistical tests. The results clearly illustrate that the constant size hypothesis can be rejected by the most powerful tests (Fu's FS and R2).
We also compared the P values and the power of R2 and some of the statistical tests used in Alonso and Armour (2001). These authors performed a nucleotide variation study in 100 chromosomes sampled from different African and Eurasian populations. Although the surveyed region is autosomal, the Alonso and Armour (2001) results suggested that recombination should be reduced. We analyzed the Japanese population (n = 20; S = 5) using the same values of the recombination parameter R (R = 2Nρ, where ρ is the recombination rate per generation) as the published ones; for that analysis we used Hudson's (1983) algorithm to generate DNA samples under the coalescent with recombination (results based on 10,000 replicates). For the power analysis we considered De = 100 and Te = 0.1. For R = 0 (no recombination) only the FS test can reject the null hypothesis of constant size. But for increasing recombination values the power of the R2 and Tajima's D tests increases, whereas it decreases for FS and rg. In fact, for R = 10 only R2 allows the null hypothesis to be rejected.
In this article, we have examined the power of several statistical tests to determine which are most powerful in different population growth scenarios. The analysis was performed using a coalescent-based approach. There are alternative, likelihood-based approaches to study a population expansion process: maximum likelihood (e.g., Griffiths and Tavaré 1994; Kuhner, Yamato, and Felsenstein 1998; Weiss and Von Haeseler 1998) and Bayesian approaches (see Stephens 2001). The likelihood provides a framework for testing hypotheses; specifically, tests based on the likelihood ratio test statistic, δ = −2 ln (L0/L1), where L0 and L1 are the maximum likelihood values under the null and the alternative hypothesis, can be used to discriminate between constant size and population growth. Unfortunately, the standard χ2 approximation for the distribution of δ might be inadequate. The empirical distribution of δ could be generated, however, by computer simulation, and the critical values could be obtained from that distribution; nevertheless, this method is computationally very intensive.
We have shown that tests based on the mismatch distribution have little power against population growth. The MAE test is the least powerful one; although rg is more powerful than MAE, it works less well than nearly all class I and class II tests examined. ku, the newly developed test of class III, although better than MAE and rg, is clearly inferior to the class I and class II tests.
On the other hand, several class I and class II tests can detect population expansion even for small De values. We have shown that two of the surveyed tests (R2 and FS) are the most powerful for a variety of different conditions. These tests should therefore be chosen to test constant population size versus population growth. In particular, we suggest using the R2 statistical test for small sample sizes and FS for large ones. Nevertheless, because R2 and FS statistics use different kinds of information, discrepancies between these tests could provide information about the action of other evolutionary processes, for example on the intragenic recombination (see below).
Fu (1997) studied the power of some statistics under the logistic model of population growth. He conducted coalescent simulations fixing theta (𝛉 = 5, 𝛉 = 10) instead of fixing S. To check the behavior of R2 and of the mismatch-based statistics under these conditions, we performed some additional simulations conditional on 𝛉. We found that R2 and FS are again the most powerful statistics (see an example in fig. 5). Interestingly, rg and MAE perform better when 𝛉 is fixed than when S is fixed.
The results from the present analysis are appropriate for nonrecombining DNA regions (i.e., mitochondrial or Y-chromosomal DNA regions). It is expected, however, that intragenic recombination substantially affects the power of the statistical tests surveyed (Rozas et al. 1999; Wall 1999). Indeed, a loss of power is expected for those tests based on the haplotype distribution (class II tests; e.g., Fu's FS test) or on the mismatch distribution (class III tests; e.g., the rg test). The reason is that recombination, by shuffling nucleotide variation among DNA sequences, (1) increases the number of haplotypes and (2) generates a much smoother (Poisson-like) mismatch distribution. Consequently, class II and class III tests could be inadequate for detecting the signature left by population growth on a recombining DNA region. Class I tests, on the contrary, should be less sensitive to intragenic recombination. To check our prediction, we conducted a few coalescent simulations using different values of the recombination parameter. Our preliminary results comparing the power of the R2 and FS tests show that the former behaves better than FS for increasing levels of recombination (also see table 1).
Coalescent Simulations Conditional on the Number of Segregating Sites
The present power analyses were performed by conducting coalescent simulations conditional on the number of segregating sites. Given that the actual value of 𝛉 is usually unknown, and that estimates of 𝛉 are usually obtained from DNA polymorphism data, the method seems to be appropriate (Hudson 1993; Depaulis, Mousset, and Veuille 2001; Wall and Hudson 2001). But Markovtsova, Marjoram, and Tavaré (2001) correctly pointed out that the power of coalescent-based tests is not independent of 𝛉 and, therefore, the statistical power might vary as a function of 𝛉 for a given n and S. To check that effect on R2 we performed a prospective analysis generating samples conditional on 𝛉 and S using the rejection algorithm of Tavaré et al. (1997). The results yield the same conclusions as those of Depaulis, Mousset, and Veuille (2001) and Wall and Hudson (2001), i.e., the fixed S method seems to be appropriate unless the actual value of 𝛉 is far from Watterson's (1975) estimate of 𝛉.
Competing Alternative Hypotheses
It should be stressed that a significant result (a significant departure from the null hypothesis) should be interpreted cautiously: a single null hypothesis has several putative alternative hypotheses. Indeed, processes other than population expansion, such as genetic hitchhiking (Maynard Smith and Haigh 1974), could also produce similar genealogies (i.e., departures of the statistical tests in the same direction). Therefore, additional analyses could be necessary to discriminate between competing alternative hypotheses. For instance, because genetic hitchhiking in regions undergoing recombination will affect a relatively small fraction of the genome (close to the advantageous mutation), surveys at different gene regions across the genome could provide the opportunity to discriminate between population expansion and genetic hitchhiking (see Galtier, Depaulis, and Barton 2000).
To summarize, FS and R2 are the best statistical tests for detecting population growth. The behavior of R2 is better for small sample sizes, whereas FS is better for bigger sample sizes. Additionally, preliminary results also indicate that the behavior of R2 should be superior when the intragenic recombination is considered. On the other hand, some popular statistics based on the mismatch distribution, rg and MAE, are very conservative.
Wolfgang Stephan, Reviewing Editor
Keywords: population growth, population expansion, coalescent simulations, neutrality tests
Address for correspondence and reprints: Julio Rozas, Departament de Genètica, Facultat de Biologia, Universitat de Barcelona, Diagonal 645, E-08071 Barcelona, Spain. E-mail: email@example.com
We thank M. Aguadé, A. Navarro, H. Quesada, and C. Segarra for critical comments on the manuscript. This work was supported by grants PB97-0918 from the Dirección General de Investigación Científica y Técnica, Spain and 1999SGR-25 from the Comissió Interdepartamental de Recerca i Tecnologia, Catalonia, Spain, conferred on M. Aguadé.