The relative timing and size of regional human population growth following our expansion from Africa remain unknown. Human mitochondrial DNA (mtDNA) diversity carries a legacy of our population history. Given a set of sequences, we can use coalescent theory to estimate past population size through time and draw inferences about human population history. However, recent work has challenged the validity of using mtDNA diversity to infer species population sizes. Here we use Bayesian coalescent inference methods, together with a global data set of 357 human mtDNA coding-region sequences, to infer human population sizes through time across 8 major geographic regions. Our estimates of relative population sizes show remarkable concordance with the contemporary regional distribution of humans across Africa, Eurasia, and the Americas, indicating that mtDNA diversity is a good predictor of population size in humans. Plots of population size through time show slow growth in sub-Saharan Africa beginning 143–193 kya, followed by a rapid expansion into Eurasia after the emergence of the first non-African mtDNA lineages 50–70 kya. Outside Africa, the earliest and fastest growth is inferred in Southern Asia ∼52 kya, followed by a succession of growth phases in Northern and Central Asia (∼49 kya), Australia (∼48 kya), Europe (∼42 kya), the Middle East and North Africa (∼40 kya), New Guinea (∼39 kya), the Americas (∼18 kya), and a second expansion in Europe (∼10–15 kya). Comparisons of relative regional population sizes through time suggest that between approximately 45 and 20 kya most of humanity lived in Southern Asia. These findings not only support the use of mtDNA data for estimating human population size but also provide a unique picture of human prehistory and demonstrate the importance of Southern Asia to our recent evolutionary past.
How humans colonized the globe remains “one of the greatest untold stories in the history of humankind” (Goebel 2007). Mitochondrial DNA (mtDNA) has been a crucial line of evidence in developing the current understanding of our genetic prehistory. Phylogenetic studies of human mtDNA variation support a late Pleistocene expansion of modern humans from Africa (Cann et al. 1987; Vigilant et al. 1991; Watson et al. 1997; Ingman et al. 2000). More recent work suggests a “southern” migration route from sub-Saharan Africa along the Indian Ocean coast into Eurasia (Forster and Matsumura 2005; Macaulay et al. 2005; Thangaraj et al. 2005) and Sahul (Ingman and Gyllensten 2003; Friedlaender et al. 2005; van Holst Pellekaan et al. 2006) and a later migration from the Levant into North Africa and Europe (Olivieri et al. 2006). However, despite these advances, we lack a clear understanding of the timing and size of regional human population expansions.
Past population size can be estimated from genetic data using coalescent theory (Hudson 1990). For any locus, the information that can be extracted about population size depends on the accuracy with which the underlying genealogical tree can be resolved and the extent to which the historical demographic processes can be accommodated by the coalescent model. mtDNA's high copy number, absence of recombination (Elson et al. 2001), and rapid substitution rate (Howell et al. 1996), mean that its genealogical tree is well resolved, thus making mtDNA very well suited for coalescent inference of population size. Recently, however, Bazin et al. (2006) have argued that mtDNA variation is a poor indicator of population size in animals, probably due to the effects of selection acting to reduce diversity (known as genetic draft). Bazin et al. compared intraspecies mtDNA and allozyme diversity across major taxonomic divisions and found no association between mtDNA diversity and population size, whereas allozyme diversity did predict population size. Although the pattern Bazin et al. (2006) identify appears not to hold for human genetic diversity (human mtDNA and autosomal diversity both imply a similar population size [Eyre-Walker 2006]) human mtDNA does show evidence of selection at some loci (Mishmar et al. 2003; Kivisild et al. 2006). This raises the question of whether mtDNA is in fact a reliable predictor of human population size.
Previous work has used coalescent analyses of pairwise mtDNA sequence differences to argue for exponential growth in human population size beginning 30–130 kya (Harpending et al. 1993; Sherry et al. 1994; Rogers 1995; Ingman et al. 2000). Recent improvements in the available inference methods and complete mtDNA sequence data mean that population genetic parameters can now be estimated from sequence data with greater accuracy, without requiring a simple parametric growth model, and with an explicit framework for quantifying uncertainty in parameter estimates. The Bayesian skyline plot (BSP) (Drummond and Rambaut 2003; Drummond et al. 2005) uses Bayesian coalescent inference of phylogeny and a Markov Chain Monte Carlo (MCMC) (Metropolis et al. 1953) sampling algorithm to simultaneously estimate a posterior probability distribution for the ancestral genealogy, branch lengths, substitution model parameters, and population parameters through time, given a set of gene sequences. The resulting BSP represents a credibility interval for effective population size that incorporates uncertainty in nuisance parameters such as the underlying ancestral gene tree, branch lengths, and rate parameters. Unlike previous coalescent inference methods, by generating a sample of piecewise population size estimates, results are not contingent on a prespecified parametric growth model and the posterior distribution can be used to directly assess uncertainty in parameter estimates given the assumptions of the model. In addition, information loss is minimized by inferring population parameters directly from the sequence data using a coalescent framework, rather than converting sequence data to pairwise distances between sequences (Steel et al. 1988).
In this report, we use the BSP together with a global data set of 357 human mtDNA–coding region sequences to estimate effective population sizes through time for 8 major geographic regions. By comparing our coalescent-based estimates of regional effective population size with anthropological estimates of census population sizes, we show that mtDNA diversity is a remarkably good predictor of regional human population size. As a result, we are able to use our regional population size estimates to gain novel insight into the human colonization of the globe during the last 100,000 years. Most strikingly, we find evidence of a major Southern Asian chapter in our recent evolutionary past.
Materials and Methods
A data set of 357 human mtDNA–coding region sequences were assembled from the >3,000 sequences available on GenBank and divided into 8 geographic regions (supplementary table S1, Supplementary Material online). Only sequences that could be unambiguously assigned to regional groupings were considered for analysis. We sought a balanced sample of between approximately 25 and 50 sequences from each of the regions, with a larger sampling of the more diverse African population. Data were machine aligned in ClustalX (Thompson et al. 1997) using reference sequence J01415.1. Regional boundaries (fig. 2a) were defined based on geographical barriers and independent genetic evidence for population subdivisions (Cavalli-Sforza et al. 1994). Results were found to be robust across a range of different regional definitions (supplementary figs. 1–3, Supplementary Material online). Alternative regional sequence groupings are shown in supplementary table S2 (Supplementary Material online).
Bayesian Skyline Plots
BSPs (Drummond et al. 2005) of effective population size were produced from the data using MCMC (Metropolis et al. 1953) sampling in the program BEAST (version 1.3) (Drummond and Rambaut 2003). Effective population size is a compound population genetic parameter generally considered linearly proportional to census population size—here, the population of breeding females. It is influenced by many factors, including local extinction and recolonization and various forms of nonrandom mating (Wakeley 2000), and assumes that regional populations are isolated (i.e., there is no migration between populations). The underlying population size function of the BSP can be specified either as a stepwise or as a piecewise linear function of population size change. The main figures were obtained with a piecewise linear model. Estimates of effective population size are derived from the inferred lineage coalescent rate through time. Ancestral gene trees were inferred based on a general time-reversible (GTR) substitution model with site-specific rates for first, second, and third codon positions. A Bayes factor computed via importance sampling (Newton et al. 1994) indicated that this model fitted the data better than the commonly used GTR + Γ + I model (log Bayes factor = 19.1). Each MCMC sample was based on a run of 40,000,000 generations, sampled every 4,000, with the first 4,000,000 generations discarded as burn-in. Examination of autocorrelation times of the MCMC plots indicated runs had converged to the equilibrium distribution and provided adequate posterior samples by this time. All runs had an effective sample size of the coalescent prior of at least 1,000.
In order to plot population size with respect to time, it was necessary to calibrate rates of molecular evolution. Recent evidence of systematic variation in observed rates of molecular evolution through time (Ho et al. 2005) suggests that rates should be calibrated using divergence events of a similar time depth to the timescale of interest. For this reason, an internal rate calibration was used. Rates were fixed to 1.691 × 10−8 substitutions/site/year, based on a 45-kya age for haplotype Q in New Guinea. Haplogroup Q occurs at high frequencies only in New Guinea (parts of Melanesia have a closely related subtype [Q2], which also occurs at very low frequencies in Micronesia and Polynesia, probably as a result of more recent admixture [Friedlaender et al. 2005]). This is consistent with archaeological evidence for modern human arrival on the super continent Sahul (comprising what is now New Guinea and Australia) by approximately ∼42–45 kya (O'Connell and Allen 2004). In order to isolate uncertainty in the coalescent process and facilitate regional comparison, uncertainty in the calibration time itself was not incorporated into our analyses. Inferred times do, however, scale linearly with respect to the rate calibration, such that if, for example, a 50-kya age for the New Guinea clade were assumed, the timescale would be increased by approximately 11%.
Results and Discussion
Figure 1 shows BSPs of effective population size through time for each of the 8 regions examined here. The growth curves are derived from changes in the inferred lineage coalescent rate through time. These changes are generally taken to reflect changes in census population size but could also be influenced by changes in population structure, selection pressures, or some combination of these. We can test whether our effective population size estimates provide a good indication of relative census population sizes by comparing relative size estimates for the “present” with contemporary regional population sizes. Figure 2b shows that relative regional population size estimates at time t = 0 are remarkably consistent with independent anthropological and historical estimates of relative precolonial population sizes across Africa, Eurasia, and the Americas (Biraben 1979). Only estimates for Australia and New Guinea differ significantly from the anthropological estimates. This is likely to be due to differences in population level processes between Australia and New Guinea and the other 6 regions. Genetic distance between populations in Australia and New Guinea rises quickly with increasing geographic distance (Cavalli-Sforza et al. 1994). Studies of Australian Aboriginals show that tribes are genetically very isolated (Huoponen et al. 2001), with the majority of marriages within the tribe (Birdsell 1993). Highly structured populations with low-migration rates and little extinction/recolonization can be expected to preserve variation that would be lost due to drift in a larger metapopulation and hence produce relatively high effective population size estimates (Wakeley 2000). Excluding Australia and New Guinea, there is a striking relationship between effective and census population size estimates (r = 0.98). This relationship was found to be robust across different definitions of the regional group boundaries (supplementary figs. 1–3, Supplementary Material online). These results indicate that mtDNA variation can be used to infer relative regional population sizes in humans.
The finding of Bazin et al. (2006) that population size does not influence species’ mtDNA diversity across the animal kingdom appears not to apply, at least at the population level, in humans. This is most likely because the historically small human population size and short timescales involved make it unlikely that a selective sweep has been able to “reset” regional mtDNA diversity since our expansion from Africa. We were therefore able to combine results from figure 1 to estimate the global population distribution of modern humans over the last 100,000 years (fig. 2c). An important caveat in interpreting these figures is that the population demographic history inferred from sequences sampled from a particular region does not necessarily represent the population history of the region but rather the history of the genetic lineages now inhabiting the region. For example, regional interference can occur if a major migration event brings the genetic signature from a region with a long history to a newly inhabited region. However, we note that even in cases where we know this has happened, such as the expansion from Africa itself and the colonization of the Americas, our demographic estimates appear to correctly infer a recent expansion event rather than being mislead by the history of the parent population.
Figures 1 and 2 together provide a unique picture of the human colonization of the globe. The plots all show evidence of significant population growth and reveal clear differences in regional population histories. The timescale in these figures was calibrated by assuming an mtDNA evolutionary rate of 1.691 × 10−8 substitutions/site/year. Using a slower rate would lead to older dates than presented. However, the relative times of the population expansions across regions would not be altered by a different calibration. We infer slow, roughly exponential population growth in sub-Saharan Africa from an most recent common ancestor (MRCA) 143–193 kya (95% highest posterior density [HPD]; fig. 1a), followed by a period of rapid growth in Eurasia shortly after the emergence of the first non-African mtDNA lineages (N-haplogroup MRCA 53–69 kya, 95% HPD; M haplogroup MRCA 50–64 kya, 95% HPD). This result is consistent with analyses of human microsatellite and SNPs data suggesting slow growth in Africa and more recent rapid growth following a bottleneck in Asia and Europe (Gonser et al. 2000; Zhivotovsky et al. 2000; Alonso and Armour 2001; Rogers 2001; Marth et al. 2004). The congruence between our findings and those from unlinked genetic loci provides further evidence that the observed trends in effective population size are not merely an artifact of selection, which would require that identical selection pressures had acted simultaneously on unrelated genes.
Southern Asia shows the earliest and most pronounced population expansion outside Africa, with a 5-fold increase in population size estimated by ∼52 kya (fig. 1b). An early Southern Asian growth phase is consistent with high mtDNA diversity in populations from the Indian Ocean Coast (Kivisild et al. 2003; Macaulay et al. 2005; Thangaraj et al. 2005; Sun et al. 2006), as well as recent archaeological evidence (Mellars 2006a), and strongly supports a rapid coastal migration along the “Southern Route” from Africa into Southern Asia. A southern expansion route is also supported by the analyses in supplementary figure S1 (Supplementary Material online), with growth inferred first in Southwestern Asia (India/Middle East) and slightly later in South East Asia. Perhaps most striking, however, is the magnitude of the Southern Asian growth phase, which implies that between approximately 45 and 20 kya over half of the global human population lived on the Indian subcontinent and what is now the Thai and Malay Peninsulas. Our population size estimates indicate that the proportion of people living in the region peaked at over 60% approximately 38 kya. This figure may even be an underestimate given the potential for regional interference along the lines mentioned above because early Southern Asian growth could cause us to infer a premature increase in the population size estimates for the other regions.
We infer less rapid, progressively later population growth in Northern and Central Asia (∼49 kya; fig. 1c), Europe (∼42 kya; fig. 1e), and the Middle East and North Africa (∼40 kya; fig. 1f). Slower growth rates in these regions may reflect less favorable conditions for human occupation at this time. The later timing of growth in Europe and the Middle East and North Africa is consistent with radio-carbon dating of the Upper Palaeolithic transition (Mellars 2006b) and mtDNA lineage analysis (Olivieri et al. 2006) indicating an expansion from the Levant into Europe and North Africa 40–45 kya. We also see later growth in Australia (∼48 kya; fig. 1d) and New Guinea (∼39 kya; fig. 1g). In the Americas, we find evidence of a rapid population expansion beginning ∼18 kya (fig. 1h). This result is consistent with a hypothesized colonization event after the last glacial maximum ∼20 kya, when lower sea levels made it possible to cross from Siberia to North America via the Bering land bridge (Forster 2004).
By ∼35 kya global growth outside Africa begins to slow, most markedly in Southern Asia and Australia, whereas in Europe estimated effective population size actually declines slightly. This leveling of growth may reflect the impact of Malthusian environmental constraints and, particularly in Europe, increasing glaciation during this time. Climate also seems the most likely explanation for a second major growth phase inferred in Europe beginning 10–15 kya (fig. 1e), shortly after the end of the last ice age ∼15 kya. The estimated timing of this growth phase fits neatly with archaeological evidence for an expansion of hunter-gatherers from glacial refugia in southern Europe (Housley et al. 1997). Such an interpretation is also supported by previous work into the relative Neolithic and Palaeolithic contributions to the European mtDNA gene pool (Richards et al. 2000) and the discovery among the Saami (in northern Scandinavia) of southwestern European mtDNA haplotypes dated to the late Pleistocene (Achilli et al. 2005).
On a more recent timescale, the advent of agriculture is thought to have driven human population growth from ∼5–10 kya. Uncertainty in the timing of the second European expansion means, we cannot rule out the possibility that this growth was linked, at least in part, to the spread of agriculture from Anatolia beginning ∼9 kya (Gkiasta et al. 2003; Gray and Atkinson 2003). However, the inferred expansion time is centered somewhat earlier than would be expected under this scenario and an equally strong growth signal is not inferred for Africa, Papua New Guinea, the Americas, and the rest of Eurasia, where we would expect agriculture to have had a similar effect. This does not mean that no significant growth occurred with the development of early agriculture (although evidence of poor health in early agricultural populations makes this a possibility [Larsen 2006]). Rather, a more likely explanation is that due to the limits of the temporal resolution of the method, very recent expansion in human population size associated with agriculture has simply not left a growth signal that we can detect here. Although the resolution of the method is ultimately limited by the rate of molecular evolution, in the future multilocus analyses will allow a finer-grained analysis.
The results reported here extend our understanding of the origin and evolution of modern humans. By verifying our regional population size estimates against independent anthropological and historical evidence, we have shown that mtDNA variation can be used to predict population size in humans. Our estimates of effective population size through time show that Southern Asia was not only a key waypoint in the human expansion from Africa but also a major chapter in human prehistory. In addition, we find complex patterns of regional population growth that can be linked with archaeological evidence and known climatic events. As the amount of available sequence data continues to grow and results from a range of loci are combined, we anticipate that coalescent-based methods will reveal an increasingly detailed picture of human prehistory.
Supplementary materials are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
We are grateful to Mark Beaumont, Athena Ferreira, Andrew Meade, Mark Pagel, Marcel van de Steeg, and Chris Venditti for helpful advice and comments. We also thank the many contributors to GenBank. We also acknowledge support from Leverhulme Trust grant F/00 239/T.