-
PDF
- Split View
-
Views
-
Cite
Cite
Ziheng Yang, Likelihood and Bayes Estimation of Ancestral Population Sizes in Hominoids Using Data From Multiple Loci, Genetics, Volume 162, Issue 4, 1 December 2002, Pages 1811–1823, https://doi.org/10.1093/genetics/162.4.1811
Close -
Share
Abstract
Polymorphisms in an ancestral population can cause conflicts between gene trees and the species tree. Such conflicts can be used to estimate ancestral population sizes when data from multiple loci are available. In this article I extend previous work for estimating ancestral population sizes to analyze sequence data from three species under a finite-site nucleotide substitution model. Both maximum-likelihood (ML) and Bayes methods are implemented for joint estimation of the two speciation dates and the two population size parameters. Both methods account for uncertainties in the gene tree due to few informative sites at each locus and make an efficient use of information in the data. The Bayes algorithm using Markov chain Monte Carlo (MCMC) enjoys a computational advantage over ML and also provides a framework for incorporating prior information about the parameters. The methods are applied to a data set of 53 nuclear noncoding contigs from human, chimpanzee, and gorilla published by Chen and Li. Estimates of the effective population size for the common ancestor of humans and chimpanzees by both ML and Bayes methods are ∼12,000-21,000, comparable to estimates for modern humans, and do not support the notion of a dramatic size reduction in early human populations. Estimates published previously from the same data are several times larger and appear to be biased due to methodological deficiency. The divergence between humans and chimpanzees is dated at ∼5.2 million years ago and the gorilla divergence 1.1-1.7 million years earlier. The analysis suggests that typical data sets contain useful information about the ancestral population sizes and that it is advantageous to analyze data of several species simultaneously.
THE amount of sequence polymorphism in a population is mainly determined by the parameter θ= 4Nμ, where N is the (effective) population size and μ is the mutation rate per nucleotide site per generation. In addition to its effect on the amount of neutral variation maintained in a population, the population size is also important in affecting the efficiency of natural selection and is a critical parameter in population genetics models of mutation and selection. It is important to competing theories of origin and evolution of modern humans. When an estimate of the mutation rate is available, as is assumed here, the population size can be obtained from estimates of θ. The population size of a present-day species can be estimated by using observed DNA sequence variation in the population (e.g., Kreitman 1983; Fu 1994; Takahata et al. 1995; Yang 1997a). The population size of modern humans has been estimated to be ∼10,000 (e.g., Takahata et al. 1995; Zhao et al. 2000; Yu et al. 2001).
The population size of an extinct ancestral species, such as the common ancestor of humans and chimpanzees, is harder to estimate, but a few methods have been developed (see Edwards and Beerli 2000 and Satta et al. 2000 for extensive reviews). They require data from multiple loci and make use of the information in the conflict of gene trees among loci. For example, Takahata (1986) suggested a method for estimating the population size of the common ancestor of two closely related species when a pair of homologous DNA sequences are available from the two species at each locus. The average divergence at many loci provides information on the species divergence time, while the variation of sequence divergence among loci reflects the ancestral population size. The method of Takahata (1986) uses summary statistics from the data and was later extended to a full-likelihood analysis and to the case of three species, where the population sizes of the two extinct ancestors as well as the two speciation dates were estimated (Takahata et al. 1995).
In the case of three species, another simpler method has also been used (Nei 1987; Wu 1991; Hudson 1992). This approach uses the proportion of loci at which the gene tree does not match the species tree to estimate the ancestral population size and exploits the fact that ancestral polymorphism creates conflicts between the gene tree and the species tree. I refer to it as the tree-mismatch method. Its application to human and great ape sequence data has led to estimates of the population size for the ancestor of humans and chimpanzees that are much larger than estimated population sizes for modern humans (Ruvolo 1997; Chen and Li 2001). For example, Chen and Li (2001) sequenced 53 noncoding contigs from human, common chimpanzee, gorilla, and orangutan, and estimated the population size of the common ancestor of humans and chimpanzees to range from 52,000 to 150,000, depending on the generation time used (15 or 20 years) and on whether parsimony or neighbor joining was used to infer the gene trees.
The tree-mismatch approach has room for improvement. First, the conflicts in the estimated gene trees among loci can be due to both ancestral polymorphism and sampling errors in tree reconstruction. As the sequences are highly similar, with few variable or informative sites at each locus, the reconstructed gene tree, no matter what method was used to infer it, is unreliable. Ignoring the sampling error in the estimated gene tree leads to overestimation of the conflict among loci and overestimation of the ancestral population size (see below). Second, the branch lengths in the gene tree provide, in a probabilistic sense, information about the ancestral population parameters, but are ignored by the method. The method clearly makes poor use of the information in the data. Third, the method assumes that one knows the species divergence times, while they might not be known. Indeed, some authors argue for the importance of accounting for ancestral polymorphism when estimating speciation dates (Takahata et al. 1995).
It seems advantageous to use information in both the conflicting gene genealogies as well as the branch lengths while accounting for their uncertainties. This can be achieved by using the likelihood method under a coalescent model developed by Takahata et al. (1995). However, the infinite-sites model assumed in the method is sometimes violated by the data. In this article, I extend the method of Takahata et al. for estimating ancestral population sizes, using data from three species. I use the nucleotide substitution model of Jukes and Cantor (1969) to correct for multiple substitutions at the same site. The model is also implemented in a Bayes framework, with Markov chain Monte Carlo (MCMC) used to achieve efficient calculation. The methods are applied to the data of Chen and Li (2001).
MAXIMUM-LIKELIHOOD ESTIMATION
The species phylogeny is represented ((12)3) and is assumed known. Species 1 and 2 diverged τ1 generations ago while species 3 diverged τ0 generations earlier (Figure 1a). In this article, species 1, 2, and 3 represent human (H), chimpanzee (C), and gorilla (G), respectively. The effective population size of the ancestor of species 1 and 2 is N1, and that of the common ancestor of all three species is N0. The mutation rate μ is measured as the number of nucleotide substitutions per site per generation. As rate and time are confounded in such analysis, the parameters of the model are θ0 = 4N0μ, θ1 = 4N1μ, γ0 =τ0μ, and γ1 =τ1μ (Figure 1a). Collectively they are denoted 0398;= {θ0, θ1, γ0, γ1}. The data contain DNA sequences from multiple loci, with one individual sequenced from each of the three species at each locus. It is assumed that there is no recombination within a locus and free recombination between loci. The population is assumed to be random mating, with no gene flow after species divergence.
—(a) The species tree ((12)3) for three species: 1 (human), 2 (chimpanzee), and 3 (gorilla). Species 1 and 2 diverged τ1 generations ago and species 3 diverged τ0 generations earlier. The population sizes are N0 in the common ancestor of all three species and N1 in the common ancestor of species 1 and 2. The four parameters in the model are θ0 = 4N0μ, θ1 = 4N1μ, γ0 =τ0μ, and γ1 =τ1μ, where μ is the mutation rate per site per generation. There are four possible gene trees, represented by b-e. In case I (b), sequences 1 and 2 coalescence between the two speciation events and the gene tree T0 = ((12)3) is consistent with the species tree. This is referred to as case I in the text. In c-e, both coalescence events occur in the common ancestor of all three species. In this case (case II), the gene tree can be T1 = ((12)3), T2 = ((23)1), or T3 = ((31)2), with equal probability.
—(a) The species tree ((12)3) for three species: 1 (human), 2 (chimpanzee), and 3 (gorilla). Species 1 and 2 diverged τ1 generations ago and species 3 diverged τ0 generations earlier. The population sizes are N0 in the common ancestor of all three species and N1 in the common ancestor of species 1 and 2. The four parameters in the model are θ0 = 4N0μ, θ1 = 4N1μ, γ0 =τ0μ, and γ1 =τ1μ, where μ is the mutation rate per site per generation. There are four possible gene trees, represented by b-e. In case I (b), sequences 1 and 2 coalescence between the two speciation events and the gene tree T0 = ((12)3) is consistent with the species tree. This is referred to as case I in the text. In c-e, both coalescence events occur in the common ancestor of all three species. In this case (case II), the gene tree can be T1 = ((12)3), T2 = ((23)1), or T3 = ((31)2), with equal probability.
Probability of data at a locus given the gene tree and branch lengths: Given the gene tree Ti, which is one of T0, T1, T2, or T3 in Figure 1b, and its branch lengths (bi0 and bi1), the probability of observing the data, P(Di|Ti, bi0, bi1), in Equation 8 can be calculated under any model of nucleotide substitution (Lio and Goldman 1998). The model of Jukes and Cantor (1969) is used in this article, which seems sufficient for such highly similar sequences. Under this model, the sequence data can be summarized as counts of five possible site patterns (configurations): xxx, xxy, yxx, xyx, and xyz, where x, y, and z are any three different nucleotides. The data at locus i can thus be represented by the number of sites and Li (2001) is listed in Table 1.
Number of site patterns from the data of Chen and Li (2001)
| Locus (i) . | ni . | ni0 . | ni1 . | ni2 . | ni3 . | ni4 . | dHCG-O . | Rate . |
|---|---|---|---|---|---|---|---|---|
| 1-2609 | 472 | 462 | 3 | 3 | 4 | 0 | 0.0299 | 1.010 |
| 2-1251 | 531 | 509 | 13 | 4 | 5 | 0 | 0.0662 | 2.236 |
| 3-2659 | 560 | 551 | 4 | 1 | 4 | 0 | 0.0206 | 0.696 |
| 7-2012 | 528 | 511 | 10 | 2 | 5 | 0 | 0.0284 | 0.959 |
| 8-1364 | 475 | 465 | 6 | 3 | 1 | 0 | 0.0295 | 0.996 |
| 9-1386 | 484 | 471 | 3 | 3 | 7 | 0 | 0.0437 | 1.476 |
| 10-1412N | 474 | 462 | 8 | 3 | 1 | 0 | 0.0254 | 0.858 |
| 10-2207 | 480 | 475 | 3 | 2 | 0 | 0 | 0.0337 | 1.138 |
| 10-2215-3 | 515 | 505 | 4 | 3 | 3 | 0 | 0.0278 | 0.939 |
| 10-2891-2 | 545 | 538 | 1 | 3 | 3 | 0 | 0.0227 | 0.767 |
| 11-1419-2 | 474 | 464 | 4 | 1 | 5 | 0 | 0.0317 | 1.071 |
| 11-2224 | 371 | 369 | 0 | 0 | 2 | 0 | 0.0200 | 0.675 |
| 11-73646 | 463 | 451 | 5 | 1 | 6 | 0 | 0.0294 | 0.993 |
| 12-1482-2 | 368 | 359 | 4 | 2 | 3 | 0 | 0.0299 | 1.010 |
| 12-2906 | 396 | 390 | 5 | 0 | 1 | 0 | 0.0259 | 0.875 |
| 12-2924 | 492 | 479 | 6 | 3 | 4 | 0 | 0.0229 | 0.773 |
| 12-2927-2 | 471 | 461 | 3 | 5 | 2 | 0 | 0.0298 | 1.006 |
| 14-2960 | 301 | 297 | 3 | 1 | 0 | 0 | 0.0248 | 0.838 |
| 14-2963 | 366 | 360 | 6 | 0 | 0 | 0 | 0.0253 | 0.854 |
| 15-2265-2 | 459 | 450 | 5 | 2 | 2 | 0 | 0.0259 | 0.875 |
| 15-2266 | 510 | 497 | 8 | 1 | 4 | 0 | 0.0243 | 0.821 |
| 16-598D4 | 518 | 511 | 3 | 0 | 4 | 0 | 0.0164 | 0.554 |
| 17-0787 | 450 | 443 | 2 | 3 | 2 | 0 | 0.0264 | 0.892 |
| 17-0801 | 491 | 485 | 3 | 2 | 1 | 0 | 0.0321 | 1.084 |
| 17-0812-2 | 374 | 361 | 8 | 2 | 3 | 0 | 0.0348 | 1.175 |
| 17-0813 | 444 | 436 | 4 | 2 | 2 | 0 | 0.0291 | 0.983 |
| 17-1574 | 359 | 352 | 4 | 1 | 2 | 0 | 0.0226 | 0.763 |
| 17-2294 | 514 | 500 | 9 | 4 | 1 | 0 | 0.0277 | 0.936 |
| 17-2987 | 497 | 486 | 4 | 1 | 5 | 1 | 0.0285 | 0.963 |
| 17-2988 | 494 | 481 | 7 | 4 | 2 | 0 | 0.0306 | 1.034 |
| 17-0784 | 462 | 454 | 3 | 1 | 4 | 0 | 0.0484 | 1.635 |
| 17-276O15 | 433 | 419 | 7 | 3 | 4 | 0 | 0.0324 | 1.094 |
| 17-2984 | 502 | 483 | 8 | 7 | 4 | 0 | 0.0245 | 0.827 |
| 17-2986 | 419 | 412 | 2 | 2 | 3 | 0 | 0.0185 | 0.625 |
| 18-0864 | 373 | 360 | 7 | 5 | 1 | 0 | 0.0415 | 1.402 |
| 18-0866 | 443 | 434 | 2 | 4 | 3 | 0 | 0.0175 | 0.591 |
| 18-1506 | 461 | 456 | 3 | 0 | 2 | 0 | 0.0257 | 0.868 |
| 18-1584 | 481 | 470 | 1 | 5 | 5 | 0 | 0.0400 | 1.351 |
| 18-2558 | 443 | 434 | 5 | 4 | 0 | 0 | 0.0372 | 1.256 |
| 19-0946 | 320 | 310 | 6 | 4 | 0 | 0 | 0.0169 | 0.571 |
| 19-0953 | 479 | 474 | 2 | 1 | 1 | 1 | 0.0240 | 0.811 |
| 20-1636 | 535 | 517 | 4 | 10 | 4 | 0 | 0.0447 | 1.510 |
| 20-2012 | 511 | 502 | 5 | 2 | 2 | 0 | 0.0391 | 1.321 |
| 20-2018 | 535 | 522 | 5 | 4 | 4 | 0 | 0.0319 | 1.077 |
| 20-2019 | 410 | 401 | 5 | 2 | 2 | 0 | 0.0299 | 1.010 |
| 20-2020 | 450 | 439 | 3 | 3 | 5 | 0 | 0.0334 | 1.128 |
| 20-2064 | 512 | 504 | 6 | 1 | 1 | 0 | 0.0260 | 0.878 |
| 20-2085 | 542 | 534 | 1 | 2 | 5 | 0 | 0.0219 | 0.740 |
| 20-2352 | 511 | 502 | 2 | 4 | 3 | 0 | 0.0279 | 0.942 |
| 20-2472 | 452 | 444 | 4 | 2 | 2 | 0 | 0.0520 | 1.756 |
| 20-2560 | 517 | 510 | 5 | 1 | 0 | 1 | 0.0286 | 0.966 |
| 20-2563 | 545 | 535 | 1 | 4 | 5 | 0 | 0.0217 | 0.733 |
| 20-2568 | 487 | 480 | 2 | 4 | 1 | 0 | 0.0195 | 0.659 |
| Locus (i) . | ni . | ni0 . | ni1 . | ni2 . | ni3 . | ni4 . | dHCG-O . | Rate . |
|---|---|---|---|---|---|---|---|---|
| 1-2609 | 472 | 462 | 3 | 3 | 4 | 0 | 0.0299 | 1.010 |
| 2-1251 | 531 | 509 | 13 | 4 | 5 | 0 | 0.0662 | 2.236 |
| 3-2659 | 560 | 551 | 4 | 1 | 4 | 0 | 0.0206 | 0.696 |
| 7-2012 | 528 | 511 | 10 | 2 | 5 | 0 | 0.0284 | 0.959 |
| 8-1364 | 475 | 465 | 6 | 3 | 1 | 0 | 0.0295 | 0.996 |
| 9-1386 | 484 | 471 | 3 | 3 | 7 | 0 | 0.0437 | 1.476 |
| 10-1412N | 474 | 462 | 8 | 3 | 1 | 0 | 0.0254 | 0.858 |
| 10-2207 | 480 | 475 | 3 | 2 | 0 | 0 | 0.0337 | 1.138 |
| 10-2215-3 | 515 | 505 | 4 | 3 | 3 | 0 | 0.0278 | 0.939 |
| 10-2891-2 | 545 | 538 | 1 | 3 | 3 | 0 | 0.0227 | 0.767 |
| 11-1419-2 | 474 | 464 | 4 | 1 | 5 | 0 | 0.0317 | 1.071 |
| 11-2224 | 371 | 369 | 0 | 0 | 2 | 0 | 0.0200 | 0.675 |
| 11-73646 | 463 | 451 | 5 | 1 | 6 | 0 | 0.0294 | 0.993 |
| 12-1482-2 | 368 | 359 | 4 | 2 | 3 | 0 | 0.0299 | 1.010 |
| 12-2906 | 396 | 390 | 5 | 0 | 1 | 0 | 0.0259 | 0.875 |
| 12-2924 | 492 | 479 | 6 | 3 | 4 | 0 | 0.0229 | 0.773 |
| 12-2927-2 | 471 | 461 | 3 | 5 | 2 | 0 | 0.0298 | 1.006 |
| 14-2960 | 301 | 297 | 3 | 1 | 0 | 0 | 0.0248 | 0.838 |
| 14-2963 | 366 | 360 | 6 | 0 | 0 | 0 | 0.0253 | 0.854 |
| 15-2265-2 | 459 | 450 | 5 | 2 | 2 | 0 | 0.0259 | 0.875 |
| 15-2266 | 510 | 497 | 8 | 1 | 4 | 0 | 0.0243 | 0.821 |
| 16-598D4 | 518 | 511 | 3 | 0 | 4 | 0 | 0.0164 | 0.554 |
| 17-0787 | 450 | 443 | 2 | 3 | 2 | 0 | 0.0264 | 0.892 |
| 17-0801 | 491 | 485 | 3 | 2 | 1 | 0 | 0.0321 | 1.084 |
| 17-0812-2 | 374 | 361 | 8 | 2 | 3 | 0 | 0.0348 | 1.175 |
| 17-0813 | 444 | 436 | 4 | 2 | 2 | 0 | 0.0291 | 0.983 |
| 17-1574 | 359 | 352 | 4 | 1 | 2 | 0 | 0.0226 | 0.763 |
| 17-2294 | 514 | 500 | 9 | 4 | 1 | 0 | 0.0277 | 0.936 |
| 17-2987 | 497 | 486 | 4 | 1 | 5 | 1 | 0.0285 | 0.963 |
| 17-2988 | 494 | 481 | 7 | 4 | 2 | 0 | 0.0306 | 1.034 |
| 17-0784 | 462 | 454 | 3 | 1 | 4 | 0 | 0.0484 | 1.635 |
| 17-276O15 | 433 | 419 | 7 | 3 | 4 | 0 | 0.0324 | 1.094 |
| 17-2984 | 502 | 483 | 8 | 7 | 4 | 0 | 0.0245 | 0.827 |
| 17-2986 | 419 | 412 | 2 | 2 | 3 | 0 | 0.0185 | 0.625 |
| 18-0864 | 373 | 360 | 7 | 5 | 1 | 0 | 0.0415 | 1.402 |
| 18-0866 | 443 | 434 | 2 | 4 | 3 | 0 | 0.0175 | 0.591 |
| 18-1506 | 461 | 456 | 3 | 0 | 2 | 0 | 0.0257 | 0.868 |
| 18-1584 | 481 | 470 | 1 | 5 | 5 | 0 | 0.0400 | 1.351 |
| 18-2558 | 443 | 434 | 5 | 4 | 0 | 0 | 0.0372 | 1.256 |
| 19-0946 | 320 | 310 | 6 | 4 | 0 | 0 | 0.0169 | 0.571 |
| 19-0953 | 479 | 474 | 2 | 1 | 1 | 1 | 0.0240 | 0.811 |
| 20-1636 | 535 | 517 | 4 | 10 | 4 | 0 | 0.0447 | 1.510 |
| 20-2012 | 511 | 502 | 5 | 2 | 2 | 0 | 0.0391 | 1.321 |
| 20-2018 | 535 | 522 | 5 | 4 | 4 | 0 | 0.0319 | 1.077 |
| 20-2019 | 410 | 401 | 5 | 2 | 2 | 0 | 0.0299 | 1.010 |
| 20-2020 | 450 | 439 | 3 | 3 | 5 | 0 | 0.0334 | 1.128 |
| 20-2064 | 512 | 504 | 6 | 1 | 1 | 0 | 0.0260 | 0.878 |
| 20-2085 | 542 | 534 | 1 | 2 | 5 | 0 | 0.0219 | 0.740 |
| 20-2352 | 511 | 502 | 2 | 4 | 3 | 0 | 0.0279 | 0.942 |
| 20-2472 | 452 | 444 | 4 | 2 | 2 | 0 | 0.0520 | 1.756 |
| 20-2560 | 517 | 510 | 5 | 1 | 0 | 1 | 0.0286 | 0.966 |
| 20-2563 | 545 | 535 | 1 | 4 | 5 | 0 | 0.0217 | 0.733 |
| 20-2568 | 487 | 480 | 2 | 4 | 1 | 0 | 0.0195 | 0.659 |
ni0, ni1, ni2, ni3, ni4 are counts of sites with patterns xxx, xxy, yxx, xyx, and xyz in human (H), chimpanzee (C), and gorilla (G), while ni = ni0 + ni1 + ni2 + ni3 + ni4 is the total number of sites at locus i.
Number of site patterns from the data of Chen and Li (2001)
| Locus (i) . | ni . | ni0 . | ni1 . | ni2 . | ni3 . | ni4 . | dHCG-O . | Rate . |
|---|---|---|---|---|---|---|---|---|
| 1-2609 | 472 | 462 | 3 | 3 | 4 | 0 | 0.0299 | 1.010 |
| 2-1251 | 531 | 509 | 13 | 4 | 5 | 0 | 0.0662 | 2.236 |
| 3-2659 | 560 | 551 | 4 | 1 | 4 | 0 | 0.0206 | 0.696 |
| 7-2012 | 528 | 511 | 10 | 2 | 5 | 0 | 0.0284 | 0.959 |
| 8-1364 | 475 | 465 | 6 | 3 | 1 | 0 | 0.0295 | 0.996 |
| 9-1386 | 484 | 471 | 3 | 3 | 7 | 0 | 0.0437 | 1.476 |
| 10-1412N | 474 | 462 | 8 | 3 | 1 | 0 | 0.0254 | 0.858 |
| 10-2207 | 480 | 475 | 3 | 2 | 0 | 0 | 0.0337 | 1.138 |
| 10-2215-3 | 515 | 505 | 4 | 3 | 3 | 0 | 0.0278 | 0.939 |
| 10-2891-2 | 545 | 538 | 1 | 3 | 3 | 0 | 0.0227 | 0.767 |
| 11-1419-2 | 474 | 464 | 4 | 1 | 5 | 0 | 0.0317 | 1.071 |
| 11-2224 | 371 | 369 | 0 | 0 | 2 | 0 | 0.0200 | 0.675 |
| 11-73646 | 463 | 451 | 5 | 1 | 6 | 0 | 0.0294 | 0.993 |
| 12-1482-2 | 368 | 359 | 4 | 2 | 3 | 0 | 0.0299 | 1.010 |
| 12-2906 | 396 | 390 | 5 | 0 | 1 | 0 | 0.0259 | 0.875 |
| 12-2924 | 492 | 479 | 6 | 3 | 4 | 0 | 0.0229 | 0.773 |
| 12-2927-2 | 471 | 461 | 3 | 5 | 2 | 0 | 0.0298 | 1.006 |
| 14-2960 | 301 | 297 | 3 | 1 | 0 | 0 | 0.0248 | 0.838 |
| 14-2963 | 366 | 360 | 6 | 0 | 0 | 0 | 0.0253 | 0.854 |
| 15-2265-2 | 459 | 450 | 5 | 2 | 2 | 0 | 0.0259 | 0.875 |
| 15-2266 | 510 | 497 | 8 | 1 | 4 | 0 | 0.0243 | 0.821 |
| 16-598D4 | 518 | 511 | 3 | 0 | 4 | 0 | 0.0164 | 0.554 |
| 17-0787 | 450 | 443 | 2 | 3 | 2 | 0 | 0.0264 | 0.892 |
| 17-0801 | 491 | 485 | 3 | 2 | 1 | 0 | 0.0321 | 1.084 |
| 17-0812-2 | 374 | 361 | 8 | 2 | 3 | 0 | 0.0348 | 1.175 |
| 17-0813 | 444 | 436 | 4 | 2 | 2 | 0 | 0.0291 | 0.983 |
| 17-1574 | 359 | 352 | 4 | 1 | 2 | 0 | 0.0226 | 0.763 |
| 17-2294 | 514 | 500 | 9 | 4 | 1 | 0 | 0.0277 | 0.936 |
| 17-2987 | 497 | 486 | 4 | 1 | 5 | 1 | 0.0285 | 0.963 |
| 17-2988 | 494 | 481 | 7 | 4 | 2 | 0 | 0.0306 | 1.034 |
| 17-0784 | 462 | 454 | 3 | 1 | 4 | 0 | 0.0484 | 1.635 |
| 17-276O15 | 433 | 419 | 7 | 3 | 4 | 0 | 0.0324 | 1.094 |
| 17-2984 | 502 | 483 | 8 | 7 | 4 | 0 | 0.0245 | 0.827 |
| 17-2986 | 419 | 412 | 2 | 2 | 3 | 0 | 0.0185 | 0.625 |
| 18-0864 | 373 | 360 | 7 | 5 | 1 | 0 | 0.0415 | 1.402 |
| 18-0866 | 443 | 434 | 2 | 4 | 3 | 0 | 0.0175 | 0.591 |
| 18-1506 | 461 | 456 | 3 | 0 | 2 | 0 | 0.0257 | 0.868 |
| 18-1584 | 481 | 470 | 1 | 5 | 5 | 0 | 0.0400 | 1.351 |
| 18-2558 | 443 | 434 | 5 | 4 | 0 | 0 | 0.0372 | 1.256 |
| 19-0946 | 320 | 310 | 6 | 4 | 0 | 0 | 0.0169 | 0.571 |
| 19-0953 | 479 | 474 | 2 | 1 | 1 | 1 | 0.0240 | 0.811 |
| 20-1636 | 535 | 517 | 4 | 10 | 4 | 0 | 0.0447 | 1.510 |
| 20-2012 | 511 | 502 | 5 | 2 | 2 | 0 | 0.0391 | 1.321 |
| 20-2018 | 535 | 522 | 5 | 4 | 4 | 0 | 0.0319 | 1.077 |
| 20-2019 | 410 | 401 | 5 | 2 | 2 | 0 | 0.0299 | 1.010 |
| 20-2020 | 450 | 439 | 3 | 3 | 5 | 0 | 0.0334 | 1.128 |
| 20-2064 | 512 | 504 | 6 | 1 | 1 | 0 | 0.0260 | 0.878 |
| 20-2085 | 542 | 534 | 1 | 2 | 5 | 0 | 0.0219 | 0.740 |
| 20-2352 | 511 | 502 | 2 | 4 | 3 | 0 | 0.0279 | 0.942 |
| 20-2472 | 452 | 444 | 4 | 2 | 2 | 0 | 0.0520 | 1.756 |
| 20-2560 | 517 | 510 | 5 | 1 | 0 | 1 | 0.0286 | 0.966 |
| 20-2563 | 545 | 535 | 1 | 4 | 5 | 0 | 0.0217 | 0.733 |
| 20-2568 | 487 | 480 | 2 | 4 | 1 | 0 | 0.0195 | 0.659 |
| Locus (i) . | ni . | ni0 . | ni1 . | ni2 . | ni3 . | ni4 . | dHCG-O . | Rate . |
|---|---|---|---|---|---|---|---|---|
| 1-2609 | 472 | 462 | 3 | 3 | 4 | 0 | 0.0299 | 1.010 |
| 2-1251 | 531 | 509 | 13 | 4 | 5 | 0 | 0.0662 | 2.236 |
| 3-2659 | 560 | 551 | 4 | 1 | 4 | 0 | 0.0206 | 0.696 |
| 7-2012 | 528 | 511 | 10 | 2 | 5 | 0 | 0.0284 | 0.959 |
| 8-1364 | 475 | 465 | 6 | 3 | 1 | 0 | 0.0295 | 0.996 |
| 9-1386 | 484 | 471 | 3 | 3 | 7 | 0 | 0.0437 | 1.476 |
| 10-1412N | 474 | 462 | 8 | 3 | 1 | 0 | 0.0254 | 0.858 |
| 10-2207 | 480 | 475 | 3 | 2 | 0 | 0 | 0.0337 | 1.138 |
| 10-2215-3 | 515 | 505 | 4 | 3 | 3 | 0 | 0.0278 | 0.939 |
| 10-2891-2 | 545 | 538 | 1 | 3 | 3 | 0 | 0.0227 | 0.767 |
| 11-1419-2 | 474 | 464 | 4 | 1 | 5 | 0 | 0.0317 | 1.071 |
| 11-2224 | 371 | 369 | 0 | 0 | 2 | 0 | 0.0200 | 0.675 |
| 11-73646 | 463 | 451 | 5 | 1 | 6 | 0 | 0.0294 | 0.993 |
| 12-1482-2 | 368 | 359 | 4 | 2 | 3 | 0 | 0.0299 | 1.010 |
| 12-2906 | 396 | 390 | 5 | 0 | 1 | 0 | 0.0259 | 0.875 |
| 12-2924 | 492 | 479 | 6 | 3 | 4 | 0 | 0.0229 | 0.773 |
| 12-2927-2 | 471 | 461 | 3 | 5 | 2 | 0 | 0.0298 | 1.006 |
| 14-2960 | 301 | 297 | 3 | 1 | 0 | 0 | 0.0248 | 0.838 |
| 14-2963 | 366 | 360 | 6 | 0 | 0 | 0 | 0.0253 | 0.854 |
| 15-2265-2 | 459 | 450 | 5 | 2 | 2 | 0 | 0.0259 | 0.875 |
| 15-2266 | 510 | 497 | 8 | 1 | 4 | 0 | 0.0243 | 0.821 |
| 16-598D4 | 518 | 511 | 3 | 0 | 4 | 0 | 0.0164 | 0.554 |
| 17-0787 | 450 | 443 | 2 | 3 | 2 | 0 | 0.0264 | 0.892 |
| 17-0801 | 491 | 485 | 3 | 2 | 1 | 0 | 0.0321 | 1.084 |
| 17-0812-2 | 374 | 361 | 8 | 2 | 3 | 0 | 0.0348 | 1.175 |
| 17-0813 | 444 | 436 | 4 | 2 | 2 | 0 | 0.0291 | 0.983 |
| 17-1574 | 359 | 352 | 4 | 1 | 2 | 0 | 0.0226 | 0.763 |
| 17-2294 | 514 | 500 | 9 | 4 | 1 | 0 | 0.0277 | 0.936 |
| 17-2987 | 497 | 486 | 4 | 1 | 5 | 1 | 0.0285 | 0.963 |
| 17-2988 | 494 | 481 | 7 | 4 | 2 | 0 | 0.0306 | 1.034 |
| 17-0784 | 462 | 454 | 3 | 1 | 4 | 0 | 0.0484 | 1.635 |
| 17-276O15 | 433 | 419 | 7 | 3 | 4 | 0 | 0.0324 | 1.094 |
| 17-2984 | 502 | 483 | 8 | 7 | 4 | 0 | 0.0245 | 0.827 |
| 17-2986 | 419 | 412 | 2 | 2 | 3 | 0 | 0.0185 | 0.625 |
| 18-0864 | 373 | 360 | 7 | 5 | 1 | 0 | 0.0415 | 1.402 |
| 18-0866 | 443 | 434 | 2 | 4 | 3 | 0 | 0.0175 | 0.591 |
| 18-1506 | 461 | 456 | 3 | 0 | 2 | 0 | 0.0257 | 0.868 |
| 18-1584 | 481 | 470 | 1 | 5 | 5 | 0 | 0.0400 | 1.351 |
| 18-2558 | 443 | 434 | 5 | 4 | 0 | 0 | 0.0372 | 1.256 |
| 19-0946 | 320 | 310 | 6 | 4 | 0 | 0 | 0.0169 | 0.571 |
| 19-0953 | 479 | 474 | 2 | 1 | 1 | 1 | 0.0240 | 0.811 |
| 20-1636 | 535 | 517 | 4 | 10 | 4 | 0 | 0.0447 | 1.510 |
| 20-2012 | 511 | 502 | 5 | 2 | 2 | 0 | 0.0391 | 1.321 |
| 20-2018 | 535 | 522 | 5 | 4 | 4 | 0 | 0.0319 | 1.077 |
| 20-2019 | 410 | 401 | 5 | 2 | 2 | 0 | 0.0299 | 1.010 |
| 20-2020 | 450 | 439 | 3 | 3 | 5 | 0 | 0.0334 | 1.128 |
| 20-2064 | 512 | 504 | 6 | 1 | 1 | 0 | 0.0260 | 0.878 |
| 20-2085 | 542 | 534 | 1 | 2 | 5 | 0 | 0.0219 | 0.740 |
| 20-2352 | 511 | 502 | 2 | 4 | 3 | 0 | 0.0279 | 0.942 |
| 20-2472 | 452 | 444 | 4 | 2 | 2 | 0 | 0.0520 | 1.756 |
| 20-2560 | 517 | 510 | 5 | 1 | 0 | 1 | 0.0286 | 0.966 |
| 20-2563 | 545 | 535 | 1 | 4 | 5 | 0 | 0.0217 | 0.733 |
| 20-2568 | 487 | 480 | 2 | 4 | 1 | 0 | 0.0195 | 0.659 |
ni0, ni1, ni2, ni3, ni4 are counts of sites with patterns xxx, xxy, yxx, xyx, and xyz in human (H), chimpanzee (C), and gorilla (G), while ni = ni0 + ni1 + ni2 + ni3 + ni4 is the total number of sites at locus i.
Mutation rate variation among loci: An important factor that may influence the estimation of ancestral population sizes is the variation of mutation rates among loci (Yang 1997a; Chen and Li 2001). For example, estimation of ancestral population size from comparison between two species was found to be sensitive to even slight rate variation (Yang 1997a). If relative rates for the loci are available, it will be straightforward to incorporate them in the likelihood calculation (Yang 1997a). In this article, I use the average distance from the orangutan to the three African apes to calculate the relative rate for the locus (Table 1). This ad hoc approach appears sensible since the orangutan diverged from the African apes very early and ancestral polymorphism in their common ancestor does not seem important. The likelihood calculation proceeds as before except that the branch lengths for the gene tree at each locus are multiplied by the relative rate for that locus.
Application to the data of Chen and Li: Chen and Li (2001) sequenced one individual from each of the four species, human, chimpanzee, gorilla, and orangutan, at 53 independent noncoding loci (contigs). An advantage of the data set is that the sequences are expected to be outside and far away from coding regions and not affected by selection at linked sites or loci. The model of this article assumes the molecular clock and uses only three species. The counts of site patterns are listed in Table 1. At some loci, the total number of sites used in this article is larger than that in Chen and Li (2001; Table 1), because some sites had alignment gaps in the orangutan and were removed by Chen and Li.
Maximum-likelihood estimates of parameters
| Parameter . | One rate for all loci . | Variable rates among loci . |
|---|---|---|
| (N̂0) | 0.003057 (38,000) | 0.002348 (29,000) |
| (N̂1) | 0.000990 (12,000) | 0.001650 (21,000) |
| (time in MY) | 0.001089 (1.1 MY) | 0.001704 (1.7 MY) |
| (time in MY) | 0.005194 (5.2 MY) | 0.004936 (4.9 MY) |
| -3,099.41 | -3,100.01 |
| Parameter . | One rate for all loci . | Variable rates among loci . |
|---|---|---|
| (N̂0) | 0.003057 (38,000) | 0.002348 (29,000) |
| (N̂1) | 0.000990 (12,000) | 0.001650 (21,000) |
| (time in MY) | 0.001089 (1.1 MY) | 0.001704 (1.7 MY) |
| (time in MY) | 0.005194 (5.2 MY) | 0.004936 (4.9 MY) |
| -3,099.41 | -3,100.01 |
In converting θ into N and γ into speciation time, the generation time is assumed to be g = 20 years in all species and the mutation rate to be 10-9 substitutions per site per year.
Maximum-likelihood estimates of parameters
| Parameter . | One rate for all loci . | Variable rates among loci . |
|---|---|---|
| (N̂0) | 0.003057 (38,000) | 0.002348 (29,000) |
| (N̂1) | 0.000990 (12,000) | 0.001650 (21,000) |
| (time in MY) | 0.001089 (1.1 MY) | 0.001704 (1.7 MY) |
| (time in MY) | 0.005194 (5.2 MY) | 0.004936 (4.9 MY) |
| -3,099.41 | -3,100.01 |
| Parameter . | One rate for all loci . | Variable rates among loci . |
|---|---|---|
| (N̂0) | 0.003057 (38,000) | 0.002348 (29,000) |
| (N̂1) | 0.000990 (12,000) | 0.001650 (21,000) |
| (time in MY) | 0.001089 (1.1 MY) | 0.001704 (1.7 MY) |
| (time in MY) | 0.005194 (5.2 MY) | 0.004936 (4.9 MY) |
| -3,099.41 | -3,100.01 |
In converting θ into N and γ into speciation time, the generation time is assumed to be g = 20 years in all species and the mutation rate to be 10-9 substitutions per site per year.
The three-species data are analyzed by the maximum-likelihood (ML) method of this article. The estimates of parameters are given in Table 2. If we assume a generation time of 20 years and a mutation rate of 10-9 substitutions per site per year (e.g., Nachman and Crowell 2000), the estimates suggest a population size for the ancestor of humans and chimpanzees of ∼12,000. This is several times smaller than the estimates of Chen and Li (2001) from the same data, at a minimum of 52,000. The estimate is also similar to estimates of the population size of modern humans, for example, 12,000 by Yu et al. (2001). The population size for the common ancestor of all three species is estimated to be ∼38,000. The same analysis estimated the human-chimpanzee divergence time at 5.1 million years ago (MYA) and the gorilla divergence at ∼1.1 million years (MY) earlier. Those estimates are largely consistent with those of previous studies (e.g., Hasegawa et al. 1985; Ruvolo 1997; Kumar and Hedges 1998; Yoder and Yang 2000). Figure 2a shows the likelihood surface as a function of θ0 and θ1 when γ0 and γ1 are fixed at their maximum-likelihood estimates (MLEs). The ∼95% confidence region is given by the likelihood contour at 3.32 units below the optimum, that is, at -3102.73 (Figure 2a). The sampling errors are quite large. Analysis of the human and chimpanzee sequences at the 53 loci using the ML method of Takahata et al. (1995) under the infinite-sites model leads to and (N. Takahata, personal communication). Those estimates are similar to the MLEs of Table 2.
To examine the effect of mutation rate variation among loci, I calculated the average sequence distance under the model of Jukes and Cantor (1969) from the human, chimpanzee, and gorilla to the orangutan, that is, dHCG-O = (dH-O + dC-O + dG-O)/3. This is divided by the average across all loci to give the relative rate for that locus (Table 1). The average distance from the orangutan to the African apes is found to be 0.0296; this is consistent with a mutation rate of 10-9 substitutions per site per year and an orangutan divergence date of ∼13 MYA (e.g., Hasegawa et al. 1987) since 0.0296/(2 × 13 × 106) = 1.1387 × 10-9. The relative rates for loci calculated this way are used as fixed constants in the likelihood calculation for the data of three species. The MLEs of parameters are shown in Table 2. Using the same generation time and mutation rate as above, we get the estimate of the population size for the ancestor of humans and chimpanzees to be 21,000, larger than the estimate under the assumption of a constant rate for all loci. The population size for the ancestor of all three species is estimated to be 29,000, smaller than under the constant-rate model. The differences between the two analyses are somewhat surprising, as one might expect the population sizes to be smaller when rate variation among loci is accounted for. However, it is noted that the distance from orangutan to the African apes has only a weak correlation (0.44) with the average distance within the H-C-G group, which appears to suggest that the mutation rates are rather homogeneous among loci and that the conflict among loci in sequence divergence is mainly caused by ancestral polymorphism.
If the average distances within the H-C-G group, (dHC + dCG + dGH)/3, are used as relative rates for the loci, parameter estimates become , , , and , with . Those correspond to a population size of 36,000 for the ancestor of humans and chimpanzees, a population size of only 18 for the ancestor of all three species, 4.5 MY for the H-C divergence, and 7.7 MY for the gorilla divergence. This calculation effectively attributes all variation in sequence divergence among loci to mutation rate variation and causes underestimation of θ0 and γ1 and overestimation of θ1 and γ0 (see also Table 2).
Comparison with the tree-mismatch method: When the gene tree is T2 or T3 (Figure 1), there is a mismatch between the species tree and the gene tree. This occurs with probability PSG = f(T2) + f(T3) = 2(1 -Ψ)/3 = ⅔e-2γ0/θ1 (e.g., Nei 1987, pp. 288-289). The tree-mismatch method estimates θ1 by equating this probability to the proportion (pˆ) of loci at which the estimated gene tree differs from the species tree, with γ0 being assumed known; that is, . Chen and Li (2001) used the orangutan to root the H-C-G tree and were able to resolve the gene tree at 33 loci, out of which 9 mismatches were found, at a proportion of 27.3%. Several coding loci were included as well, so that 16 mismatches were found at a total of 52 resolved loci, at the proportion 30.8%. The authors assumed an H-C divergence at γ1 = 1.6 MYA and arrived at , which, if the generation time is 20 years, corresponds to a minimum population size of Nˆ1 = 52,000 for the ancestor of humans and chimpanzees. In this article, the molecular clock has been assumed, which can also be used to root the H-C-G tree. Under the clock, T1, T2, or T3 is the ML tree if ni1, ni2, or ni3 is the greatest among the three, respectively (Saitou 1988; Yang 1994). The clock-rooting approach uses more “informative” sites than out-group rooting and resolves the gene tree at 49 of the 53 loci, out of which 18 are mismatches, at the proportion 36.7%. This proportion is even higher than those of Chen and Li and produces even larger estimates of θ1 and N1.
—Log-likelihood surface (contour) as a function of θ0 and θ1 when γ0 and γ1 are fixed at their MLEs. (a) The same substitution (mutation) rate is assumed for all loci. (b) Fixed relative rates obtained from comparison between the orangutan and the African apes (human, chimpanzee, and gorilla) are used to account for possible evolutionary rate variation among loci. Maximum-likelihood estimates of parameters are listed in Table 2.
—Log-likelihood surface (contour) as a function of θ0 and θ1 when γ0 and γ1 are fixed at their MLEs. (a) The same substitution (mutation) rate is assumed for all loci. (b) Fixed relative rates obtained from comparison between the orangutan and the African apes (human, chimpanzee, and gorilla) are used to account for possible evolutionary rate variation among loci. Maximum-likelihood estimates of parameters are listed in Table 2.
Figure 3 shows the probabilities that the species tree (S), the gene tree (G), and the estimated gene tree (E) differ from each other. The probability of a mismatch between the species tree and the gene tree is PSG = 2(1 -Ψ)/3 = 0.0739, much lower than the observed mismatch proportion pˆ = 0.367. The probability of a mismatch between the species tree and the estimated gene tree is higher. With 466 sites (the average across the 53 loci, Table 1), PSE = 0.2028, which is 2.7 times as high as PSG (Figure 3). Now consider the four gene trees T0, T1, T2, and T3 (Figure 1), which occur with probabilities f(T0) =Ψ = 0.8892 and f(T1) = f(T2) = f(T3) = (1 -Ψ)/3 = 0.0369 (Equation 1). According to the simulation, the probability that the topology of the gene tree is incorrectly reconstructed when the true gene tree is T0, T1, T2, or T3 is P0 = 0.1453 and P1 = P2 = P3 = 0.1981. Note that for the estimated gene tree, we consider only the topology and disregard its divergence times relative to the speciation times. The average error probability of gene tree reconstruction is thus (see Figure 3). When gene trees T0 or T1 are incorrectly reconstructed, the estimated gene tree will always be a mismatch with the species tree; such errors will cause an overcount of f(T0)P0 + f(T1)P1 = 0.1366. Conversely, when gene trees T2 or T3 are incorrectly reconstructed, the estimated tree will not be counted as a mismatch one-half of the time, so the undercount is f(T2)P2/2 + f(T3)P3/2 = f(T2)P2 = 0.0074. The difference between those two error rates gives rise to the net error due to gene tree reconstruction: PSE - PSG = f(T0)P0 =ΨP0 = 0.1290. The above argument suggests that ignoring errors in gene tree reconstruction always causes overestimation of the mismatch between the species tree and the gene tree and leads to overestimation of the ancestral population size N1. It is interesting that the bias in the tree-mismatch method is caused by reconstruction errors for gene tree T0 alone and thus can be reduced if Ψ is reduced, for example, if the two speciation events are very close or if the ancestral population size N1 is large. Obviously factors that reduce the reconstruction error P0, such as longer sequences (Figure 3) and higher mutation rates, will reduce the bias as well.
—Tree-mismatch probabilities calculated using Monte Carlo simulation plotted as functions of the sequence length. MLEs of parameters in Table 2 (one rate for all loci) are used in the simulation. Note that S, G, and E in the subscripts stand for the species tree, the gene tree, and the estimated gene tree (the ML tree), respectively. Thus PSG is the probability that the species tree and the gene tree differ; this is 0.0739 for the parameter values used. PSE is the probability of a mismatch between the species tree and the estimated gene tree, and PGE is the probability of a mismatch between the gene tree and the estimated gene tree. Ptie is the proportion of replicates in which a tie occurs, that is, the two best trees are equally good. Ties are excluded in calculation of PSE and PGE. Ten million replicates were simulated for each sequence length.
—Tree-mismatch probabilities calculated using Monte Carlo simulation plotted as functions of the sequence length. MLEs of parameters in Table 2 (one rate for all loci) are used in the simulation. Note that S, G, and E in the subscripts stand for the species tree, the gene tree, and the estimated gene tree (the ML tree), respectively. Thus PSG is the probability that the species tree and the gene tree differ; this is 0.0739 for the parameter values used. PSE is the probability of a mismatch between the species tree and the estimated gene tree, and PGE is the probability of a mismatch between the gene tree and the estimated gene tree. Ptie is the proportion of replicates in which a tie occurs, that is, the two best trees are equally good. Ties are excluded in calculation of PSE and PGE. Ten million replicates were simulated for each sequence length.
THE BAYES APPROACH USING MCMC
Prior and posterior distributions of parameters in the Bayes analysis
| . | Prior . | Posterior . | ||
|---|---|---|---|---|
| Parameter . | (α, β)a . | Mean (95% interval)b . | Mean . | Mean (95% interval)b . |
| Good priors | ||||
| θ0 (N0) | (2, 2,000) | 12,500 (1,500, 34,800) | 0.00158 | 19,700 (2,900, 41,600) |
| θ1 (N1) | As above | 0.00100 | 12,400 (1,700, 32,100) | |
| γ0 (time in MY) | (4, 2,500) | 1.6 MY (0.44 MY, 3.51 MY) | 0.00164 | 1.6 MY (0.7 MY, 2.7 MY) |
| γ1 (time in MY) | (20, 4,000) | 5.0 MY (3.1 MY, 7.4 MY) | 0.00530 | 5.3 MY (4.4 MY, 6.1 MY) |
| Poor priors | ||||
| θ0 (N0) | (1.5, 300) | 62,500 (4,500, 194,800) | 0.00263 | 32,900 (5,300, 57,800) |
| θ1 (N1) | As above | 0.00265 | 33,100 (5,300, 88,600) | |
| γ0 (time in MY) | (4, 2,000) | 2.0 MY (0.55 MY, 4.4 MY) | 0.00190 | 1.9 MY (0.8 MY, 3.2 MY) |
| γ1 (time in MY) | (4, 800) | 5.0 MY (1.3 MY, 11.0 MY) | 0.00464 | 4.6 MY (3.4 MY, 5.7 MY) |
| . | Prior . | Posterior . | ||
|---|---|---|---|---|
| Parameter . | (α, β)a . | Mean (95% interval)b . | Mean . | Mean (95% interval)b . |
| Good priors | ||||
| θ0 (N0) | (2, 2,000) | 12,500 (1,500, 34,800) | 0.00158 | 19,700 (2,900, 41,600) |
| θ1 (N1) | As above | 0.00100 | 12,400 (1,700, 32,100) | |
| γ0 (time in MY) | (4, 2,500) | 1.6 MY (0.44 MY, 3.51 MY) | 0.00164 | 1.6 MY (0.7 MY, 2.7 MY) |
| γ1 (time in MY) | (20, 4,000) | 5.0 MY (3.1 MY, 7.4 MY) | 0.00530 | 5.3 MY (4.4 MY, 6.1 MY) |
| Poor priors | ||||
| θ0 (N0) | (1.5, 300) | 62,500 (4,500, 194,800) | 0.00263 | 32,900 (5,300, 57,800) |
| θ1 (N1) | As above | 0.00265 | 33,100 (5,300, 88,600) | |
| γ0 (time in MY) | (4, 2,000) | 2.0 MY (0.55 MY, 4.4 MY) | 0.00190 | 1.9 MY (0.8 MY, 3.2 MY) |
| γ1 (time in MY) | (4, 800) | 5.0 MY (1.3 MY, 11.0 MY) | 0.00464 | 4.6 MY (3.4 MY, 5.7 MY) |
Parameters α and β are for the gamma priors; the prior mean is α/β (not shown).
Mean and 2.5 and 97.5% percentiles of the prior or posterior distributions for population sizes or speciation times. In converting θ and γ into N and speciation time, the generation time is assumed to be g = 20 years and the mutation rate 10-9 substitutions per site per year.
Prior and posterior distributions of parameters in the Bayes analysis
| . | Prior . | Posterior . | ||
|---|---|---|---|---|
| Parameter . | (α, β)a . | Mean (95% interval)b . | Mean . | Mean (95% interval)b . |
| Good priors | ||||
| θ0 (N0) | (2, 2,000) | 12,500 (1,500, 34,800) | 0.00158 | 19,700 (2,900, 41,600) |
| θ1 (N1) | As above | 0.00100 | 12,400 (1,700, 32,100) | |
| γ0 (time in MY) | (4, 2,500) | 1.6 MY (0.44 MY, 3.51 MY) | 0.00164 | 1.6 MY (0.7 MY, 2.7 MY) |
| γ1 (time in MY) | (20, 4,000) | 5.0 MY (3.1 MY, 7.4 MY) | 0.00530 | 5.3 MY (4.4 MY, 6.1 MY) |
| Poor priors | ||||
| θ0 (N0) | (1.5, 300) | 62,500 (4,500, 194,800) | 0.00263 | 32,900 (5,300, 57,800) |
| θ1 (N1) | As above | 0.00265 | 33,100 (5,300, 88,600) | |
| γ0 (time in MY) | (4, 2,000) | 2.0 MY (0.55 MY, 4.4 MY) | 0.00190 | 1.9 MY (0.8 MY, 3.2 MY) |
| γ1 (time in MY) | (4, 800) | 5.0 MY (1.3 MY, 11.0 MY) | 0.00464 | 4.6 MY (3.4 MY, 5.7 MY) |
| . | Prior . | Posterior . | ||
|---|---|---|---|---|
| Parameter . | (α, β)a . | Mean (95% interval)b . | Mean . | Mean (95% interval)b . |
| Good priors | ||||
| θ0 (N0) | (2, 2,000) | 12,500 (1,500, 34,800) | 0.00158 | 19,700 (2,900, 41,600) |
| θ1 (N1) | As above | 0.00100 | 12,400 (1,700, 32,100) | |
| γ0 (time in MY) | (4, 2,500) | 1.6 MY (0.44 MY, 3.51 MY) | 0.00164 | 1.6 MY (0.7 MY, 2.7 MY) |
| γ1 (time in MY) | (20, 4,000) | 5.0 MY (3.1 MY, 7.4 MY) | 0.00530 | 5.3 MY (4.4 MY, 6.1 MY) |
| Poor priors | ||||
| θ0 (N0) | (1.5, 300) | 62,500 (4,500, 194,800) | 0.00263 | 32,900 (5,300, 57,800) |
| θ1 (N1) | As above | 0.00265 | 33,100 (5,300, 88,600) | |
| γ0 (time in MY) | (4, 2,000) | 2.0 MY (0.55 MY, 4.4 MY) | 0.00190 | 1.9 MY (0.8 MY, 3.2 MY) |
| γ1 (time in MY) | (4, 800) | 5.0 MY (1.3 MY, 11.0 MY) | 0.00464 | 4.6 MY (3.4 MY, 5.7 MY) |
Parameters α and β are for the gamma priors; the prior mean is α/β (not shown).
Mean and 2.5 and 97.5% percentiles of the prior or posterior distributions for population sizes or speciation times. In converting θ and γ into N and speciation time, the generation time is assumed to be g = 20 years and the mutation rate 10-9 substitutions per site per year.
The proposal density q can be rather flexible as long as it specifies an aperiodic and irreducible Markov chain. The algorithm I implemented cycles through several steps, with each step updating some but not all variables. In step 1, the gene tree and branch lengths at each locus i (Ti, bi0, bi1) are updated, while parameters 0398; are fixed. Each locus is updated once in this step. Step 2 updates parameters 0398; while the branch lengths {bi0, bi1} are fixed. This step can cause changes to the gene trees at some loci. Step 3 is a mixing step, in which parameters θ0, θ1, γ0, γ1 and branch lengths at all loci are multiplied by a constant while the gene trees remain unchanged. The MCMC algorithm is tedious and the details are given in the appendix.
The Markov chain is started from a random initial state. Sampling starts after a certain number of generations, which are discarded as burn-in, and samples are taken every certain number of steps, thus “thinning” the chain. Convergence of the chain is checked by examining the output and also by running multiple chains. The algorithm is also run without sequence data, and the posterior distribution generated is found to be close to the prior.
Application to the data of Chen and Li: The Bayes MCMC algorithm is applied to the data of Chen and Li (2001; see Table 1). I used two sets of priors (Table 3). Parameters α and β in the gamma prior distributions are chosen by considering likely values and ranges of ancestral population sizes and species divergence times and converting them into parameters θ0, θ1, γ0, and γ1 using a generation time of 20 years and a mutation rate of 10-9 substitutions per site per year. The first set is considered more realistic and referred to as the “good priors” in Table 3. Ancestral population sizes N0 and N1 are centered ∼12,500, close to estimates for modern humans, with the 2.5 and 97.5% percentiles at 1500 and 34,800, respectively. The divergence time for humans and chimpanzees is centered ∼5 MY, while the divergence time for the gorilla is centered ∼1.6 MY. Note that parameters θ0, θ1, γ0, and γ1 are all ⪡1 but are definitely >0; thus values of α>1 are used so that the gamma distribution has a mode >0.
—Prior and posterior distributions for parameters θ0, θ1, γ0, and γ1. Parameter estimates are shown in Table 3 (good priors).
—Prior and posterior distributions for parameters θ0, θ1, γ0, and γ1. Parameter estimates are shown in Table 3 (good priors).
The posterior distributions of parameters θ0, θ1, γ0, and γ1 are shown in Figure 4 together with their priors. They are also summarized in Table 3 (good priors). The means of the posterior distributions for θ0, θ1, γ0, and γ1 are listed, and then the means and the 95% credibility sets for the two population sizes (N0 and N1) and for the two speciation times are listed. The posterior means and medians are close. The population size for the ancestor of humans and chimpanzees is estimated to be 12,400, with the 95% credibility interval (CI) to be (1700, 32,100). The H-C divergence is dated at 5.3 MY, with the 95% CI to be (4.4, 6.1). The estimates of θ1 and γ1 are very similar to the MLEs. The posterior mean of θ0 is smaller and that of γ0 is larger than the MLEs (Tables 2 and 3). The correlation coefficients calculated from the posterior distributions of parameters θ0, θ1, γ0, and γ1 are shown in Table 4. There is strong negative correlation between θ0 and γ0 and between θ1 and γ1. Comparison of the prior and posterior distributions (Figure 4) suggests that the data contain much more information about the divergence times, especially the H-C divergence time (γ1), than about the population sizes.
To see the effect of prior assumptions on the posterior distributions, I used a second set of priors, which are more spread out and also assume large population sizes (mean 62,500). The posterior distributions are summarized in Table 3 under the heading “Poor priors.” The point estimates of both N0 and N1 are ∼33,000, smaller than the prior means. The H-C divergence is dated at 4.6 MY, and the gorilla divergence is dated 1.9 MY earlier. Those estimates appear reasonable, although the H-C divergence date is too recent. Similar strong correlations between the parameters are discovered as in the analysis using the good priors. The negative correlation between γ1 and θ1 (calculated to be -0.76), combined with the assumed and estimated large population sizes, appears to have led to a H-C divergence date that is too recent.
Correlation coefficients in the posterior distribution
| . | θ0 . | θ1 . | γ0 . |
|---|---|---|---|
| θ1 | 0.05 | ||
| γ0 | -0.58 | 0.43 | |
| γ1 | -0.16 | -0.60 | -0.41 |
| . | θ0 . | θ1 . | γ0 . |
|---|---|---|---|
| θ1 | 0.05 | ||
| γ0 | -0.58 | 0.43 | |
| γ1 | -0.16 | -0.60 | -0.41 |
Correlation coefficients in the posterior distribution
| . | θ0 . | θ1 . | γ0 . |
|---|---|---|---|
| θ1 | 0.05 | ||
| γ0 | -0.58 | 0.43 | |
| γ1 | -0.16 | -0.60 | -0.41 |
| . | θ0 . | θ1 . | γ0 . |
|---|---|---|---|
| θ1 | 0.05 | ||
| γ0 | -0.58 | 0.43 | |
| γ1 | -0.16 | -0.60 | -0.41 |
DISCUSSION
ML and Bayes methods of this article estimated the population size for the common ancestor of humans and chimpanzees to be ∼12,000, similar to estimates for modern humans. The estimates are several times smaller than those obtained by Chen and Li (2001) from the same data using the tree-mismatch method, which range from 52,000 to 150,000. Even the worst-case estimates—e.g., 36,000 by ML under the assumption that all sequence divergence variation among loci is due to mutation rate variation and 33,000 from the Bayes analysis using the poor priors—are smaller than the minimum estimate of Chen and Li. The tree-mismatch method used by Chen and Li appears to have serious biases due to errors in gene tree reconstruction, and the likelihood and Bayes estimates reported here are probably more reliable. Thus it may be concluded that the sequence data of Chen and Li (2001) do not support much larger ancestral populations than the modern humans or the notion that early human populations experienced dramatic size reductions (Hacia 2001; Kaessmann et al. 2001).
While the ML and Bayes methods are expected to have better statistical properties than the simple tree-mismatch method, it is worthwhile to examine some of the assumptions made in this article. First, the evolutionary rate is assumed to be constant over lineages. This assumption seems reasonable as the species compared are very closely related; Chen and Li’s (2001) relativerate tests supported the molecular clock. The large differences between the tree-mismatch method and the likelihood and Bayes methods are clearly not due to the use of the clock assumption in this article; use of clock rooting in the tree-mismatch method produced even larger estimates of the population size for the ancestor of humans and chimpanzees. Second, the analysis assumes no recombination within a locus. The effect of recombination on estimation of parameters θ0, θ1, γ0, and γ1 is not well understood, although Satta et al. (2000) emphasized its possible significance. As the human, chimpanzee, and gorilla sequences are extremely similar, most of the recombination events will not be visible in the sequence data, and the few sites at which more than two nucleotides are observed in the data (see counts ni4 for site pattern xyz in Table 1) are probably due to multiple substitutions at the same site. Third, the substitution model of Jukes and Cantor (1969) is simplistic. More complex models, such as those that account for variable substitution rates among sites within the locus, can be easily implemented, but are expected to have little effect. The most serious issue seems to be mutation rate variation among loci. In the case of two species, the ancestral population size is overestimated when mutation rate variation is ignored and accounting for the bias leads to dramatic reduction in the estimated population size (Yang 1997a). In this article, the population size of the ancestor of humans and chimpanzees is not very large under the constant-rate model and becomes larger when variable rates for loci are assumed. The effect is much less important and also in the opposite direction compared with the two-species case. Lack of strong correlation among sequence distances with the orangutan seems to suggest that the rates are relatively homogeneous among those loci. It seems that simultaneous analysis of data from three species allows the parameters to constrain each other, leading to a better use of information in the data. It is quite likely that the estimation can be further improved by sampling multiple individuals from the same species.
The ML and Bayes methods produced similar results for the data analyzed in this article. However, the ML calculation is slower than the MCMC algorithm. The Bayes approach also provides a framework for incorporating prior information about the parameters. For example, a wealth of information is available about the divergence time between humans and chimpanzees. By forcing a very narrow prior distribution for γ1, such information can be incorporated in the Bayes analysis. Using an informative prior will reduce the adverse effect of strong correlation among parameters when other parameters are estimated. Furthermore, the Bayes algorithm seems easier than ML to extend to data that contain more than three species and more than one individual from each species.
Program availability: C programs implementing the MCMC algorithm and calculating the mismatch probabilities (PSG, PSE, and PGE, etc.) are available from the author upon request. The C and Mathematica programs for the likelihood method are available as well, but they make use of the Mathlink library and are awkward to use.
APPENDIX: IMPLEMENTATION OF THE MCMC ALGORITHM
Updating the gene tree and branch lengths at each locus: This step of the algorithm goes through the L loci to update the branch lengths and possibly the gene tree as well. The algorithm visits each locus once. The description in this section concerns one locus i. The subscript i in bi0 and bi1 may be suppressed. The proposal algorithm for gene trees T2 and T3 is simpler and is described first, before the algorithm for gene trees T0 and T1 (Figure 1).
Performance of the algorithm: The performance of the MCMC algorithm is noted to depend on the choice of priors (values of α and β in the gamma distributions). For some priors, the algorithm is noted not to mix well, and in particular, parameters θ0 and γ0 appear to change slowly. The high correlation among parameters and the constraints seem to cause difficulties for the algorithm. In such cases, the chain has to be run much longer than usual to achieve stable estimates. In proposing new values for γ0 and γ1, only small steps are taken (with H3 in the range 0.01-0.05) to achieve an acceptance rate of ∼0.1-0.3. For other variables, even large steps (with H1, H2, H4 in the range 0.1-0.5) are accepted at high frequencies (>50%). The mixing step seems rather effective so that ∼1000 generations seem enough for the burn-in. For some priors, <500,000 generations appear sufficient, which takes a few minutes on a Pentium III PC at 1.2 GHz for the data of Chen and Li (2001). For other priors, the algorithm has to be run much longer.
Acknowledgement
I am very grateful to Drs. W.-H. Li and F.-C. Chen for providing the data analyzed in this article. I thank M. Hasegawa and B. Larget for discussions and N. Takahata for comments. This study is supported by grant 31/G13580 from the Biotechnology and Biological Sciences Research Council (United Kingdom).
Footnotes
Communicating editor: Y.-X. Fu
LITERATURE CITED




