Comparison of the performance and accuracy of different inference methods, such as maximum likelihood (ML) and Bayesian inference, is difficult because the inference methods are implemented in different programs, often written by different authors. Both methods were implemented in the program MIGRATE, that estimates population genetic parameters, such as population sizes and migration rates, using coalescence theory. Both inference methods use the same Markov chain Monte Carlo algorithm and differ from each other in only two aspects: parameter proposal distribution and maximization of the likelihood function. Using simulated datasets, the Bayesian method generally fares better than the ML approach in accuracy and coverage, although for some values the two approaches are equal in performance.
Motivation: The Markov chain Monte Carlo-based ML framework can fail on sparse data and can deliver non-conservative support intervals. A Bayesian framework with appropriate prior distribution is able to remedy some of these problems.
Results: The program MIGRATE was extended to allow not only for ML(-) maximum likelihood estimation of population genetics parameters but also for using a Bayesian framework. Comparisons between the Bayesian approach and the ML approach are facilitated because both modes estimate the same parameters under the same population model and assumptions.
Population genetics changed considerably after Kingman (1982a, b, c) introduced the n-coalescent. The n-coalescent (coalescent for short) allows us to calculate probabilities of relationships among a random population sample. This in turn facilitates calculations of probabilities of whole genealogies under a specific population model, for example, two populations exchanging migrants at a constant rate. The first applications that calculated the likelihood of the population size parameter based on DNA samples were described by Griffiths and Tavaré (1994) and Kuhner et al. (1995). Bahlo and Griffiths (2000) and Beerli and Felsenstein (1999, 2001) extended the basic estimation of a single parameter to joint estimations of migration rates and population sizes, whereas Kuhner et al. (2000) allowed for the estimation of recombination rate. These maximum-likelihood (ML) approaches were complemented by several Bayesian approaches (Nielsen, 1998, 2000; Hey and Nielsen, 2004; Beaumont, 1999 and others). All of these approaches try to estimate population genetic parameters. They typically treat the genealogy as a nuisance parameter and summarize over all possible genealogies G; to be precise, they sample over all possible labeled histories T and branch lengths B, taking into account the genetic data and the population genetic model. The likelihood of the data given the model parameters is
Nielsen (personal communication, 2001) suggested that the ML approach is hampered by several problems. Maximizing the likelihood function L(D|π) for complicated scenarios with many parameters is very difficult. The Metropolis–Hastings algorithm (Metropolis et al., 1953; Hastings, 1970) with static driving values π0, as implemented in MIGRATE and other programs, can take a prohibitively long run time required to explore completely all the possible genealogies. These problems have been shown by Abdo et al. (2004), although Abdo et al. apparently failed to recognize that the problems are far less serious when using biologically reasonable datasets and when the guidelines about convergence outlined in the MIGRATE-manual (available from ) are followed.
The program MIGRATE uses a Metropolis–Hastings algorithm to explore all possible genealogies (Beerli and Felsenstein, 1999). The adaptation of the program to a Bayesian framework was not difficult because only a module handling the prior distributions and a minor change in the program flow need to be added, together with changes in the input and output user interfaces.
The program MIGRATE calculates the posterior probability distribution per locus, treating each locus as completely unlinked to the others. This assumption is reasonable: most biologists would prefer to sample such loci rather than partially linked or completely linked loci because unlinked loci can be treated as independent replicates of the genealogical history. MIGRATE approximates the posterior distribution
The updating scheme of the genealogies is the same in the ML and the Bayesian approach and was described by Beerli Felsenstein (1999). The updating scheme of the parameters is based on arbitrary prior distributions r(π). MIGRATE allows the user to choose between a small number of prior distributions.
Uniform prior distribution between a minimum and a maximum value for each parameter.
Exponential prior distribution with a minimum, mean and maximum value for each parameter.
The incorporation of additional prior distributions, such as a gamma distribution, are planned.
A key issue in Metropolis–Hastings algorithms is the acceptance or not of a change of the current state in the Markov chain. The algorithm should accept fairly often so that the chain can explore the solution space more efficiently; poor algorithms will reject often and force very long runs to achieve equilibrium and an appropriate sample of the possible states. Typically, the acceptance or rejection of a move in the Markov chain is based on a ratio that consists of two parts: (1) the ratio of probabilities to move from an old state to a new state using a prior distribution and the effect of the data (Metropolis et al., 1953), the Metropolis ratio rM; and (2) the ratio of probabilities to be in the old or new state and go to the new or old state (Hastings, 1970), the Hastings ratio, rH. In the Bayesian implementation in MIGRATE the ratio of accepting a move suggested by the parameter prior is only dependent on the Kingman coalescent probability density. The acceptance/rejection ratio is
The performance of the improvements were illustrated on a dataset for four populations with a unidirectional migration pattern (Fig. 1). Simulated DNA sequence alignments, generated using the population model described in Figure 1, were analyzed to show the performance of the Bayesian and the ML approach. One dataset with 10 loci, and four groups of 100 single locus datasets were analyzed. Each dataset contained 20 individuals from each of the four populations. Using a coalescence-based simulator (cf. Hudson, 2002) ‘true’ genealogies using population sizes (ΘT) for all populations of 0.1, 0.01, 0.001 and 0.0001, and Mji referenced in Figure 1 were created. DNA sequences of 10 000 bp length were then simulated on these true genealogies using an F84 model with equal base frequencies and transition/transversion ratio of 2.0. These datasets were then analyzed using either the ML inference mode (Beerli and Felsenstein, 1999, 2001) or the Bayesian inference mode in MIGRATE. The ML mode was run for 10 short chains visiting 100 000 genealogies and storing 5000, updating the driving parameter after each chain, and two long chains with 10 000 000 visited genealogies and 50 000 sampled using an adaptive heating scheme. The Bayesian inference was run for 10 000 000 updates, approximately half of which were updates of the 16 parameters and approximately half (∼5 000 000 because of random switching between genealogy and parameter updates) were genealogy updates. New parameters were proposed using an exponential prior distribution with population size mean of 2ΘT and boundaries of ΘT/10 and 10ΘT, and scaled migration rate mean M of 200 and boundaries of 0 and 1000. Results for uniform priors with the same boundaries were very similar, and therefore are not shown.
The results and problematic issues are shown only for population 4, but the pattern is identical for the other three populations. The scenario chosen for an example is difficult for any gene flow estimator because it requires the estimation of 12 migration rates and 4 population sizes. With high migration rates, haplotypes are distributed evenly over all populations, so that establishing the directionality of gene flow from estimated migration rates is difficult. With low migration rates, however, the difference from zero, and thus the directionality, is also difficult to establish. The number of variable sites or the number of alleles in the dataset is crucial for accurate estimation of population size and migration rates of any magnitude. Single locus datasets with low variability do not allow estimating migration rates with great precision.
Despite these difficulties, with sufficient data, estimates are expected to be useful for inferring direction and magnitude of gene flow and magnitude of population size. Using the 16 parameter model analyzed here will produce very variable parameter estimates from single locus data, however, and such analyses are not advisable for real biological data.
2.1 Multilocus analysis
Figure 2 shows that the variability of individual loci resulting from the coalescent can be large and that there are difficulties in reaching convergence; the combined estimate over all loci, however, gives a rather accurate picture. The variability for migration rate estimates is much larger than for the population size estimates. It is difficult to establish the gene flow direction (M41 versus M14) for the single locus estimates. The estimate over all loci clearly allows the distinction between the two directions: M14 is much bigger than M41. The estimation of migration parameter values between populations with no direct connections, for example, migration rate M24 between population 2 and 4, is consistently low (Fig. 2).
2.2 Comparison of Bayesian and ML inference
MIGRATE allows direct comparison of the success of parameter inference using the Bayesian approach and the ML approach. In theory the results should be very similar. Table 1 shows medians and quartiles of 100 single locus runs. The medians and quartiles were chosen because they are a better indicator of the distribution of the results than mean and standard deviation because these are heavily influenced by large outliers. The median of the maximum posterior probabilities is similar to the median of the ML estimates for moderate values of the population size (Θ = 0.01). The results for the low-variability datasets are mixed; the medians of the two methods are still comparable but the range of the quartiles of ML M estimates are very large; standard deviations (data not shown) were even larger because of outliers in the ML analysis. Several of the 100 runs reported values that were very different from the true value. The datasets with the smallest true Θ (0.0001) shows even more problems with the ML approach because the medians for Θ is strongly overestimated and the range of the quartiles for M is huge. In contrast to ML the Bayesian runs recover the population size, but report very low values for the migration rate. Figure 3 shows a comparison of posterior distributions of the scaled migration rate M of the first datasets of each population size category (Table 1). The power to make inferences about the magnitude of the migration rate is directly correlated with the magnitude of the population size. For very small population sizes there is no power to estimate such low migration rates in the chosen 16 parameter problems with a single locus dataset of 10 000 bp for each of the 100 individuals. The posterior distribution is similar to the exponential prior distribution used. In contrast to the problems encountered in the migration rate estimations, the posterior distributions for Θ are strongly peaked near 0.0001 (data not shown).
Medians and quartiles of 100 single-locus datasets for the two inference methods (I): maximum likelihood (M) and Bayesian (B). Simulated datasets that were generated with four different values of ‘true’ population sizes (Θ(t)). Θ is 4× effective population size × mutation rate per site per generation, and M is immigration rate over mutation rate. The range of number of migrants per generation Nem = ΘM/4 covers a wide range from 0.0025 (corresponding to a Θ = 0.0001) to 2.5 (corresponding to a Θ = 0.1) migrants per generation. Run conditions for ML and Bayes inferences are specified in the text.
The ML method has difficulty is recovering the expected values when the dataset is very variable, whereas the Bayesian inference is closer to the ‘true’ values for all the scenarios. The range of the quartiles of the ML approach is often much larger than the range of the Bayesian approach.
The coverage of the Bayesian approach is rather conservative and includes the ‘true’ values in the 95% credibility interval with frequencies of 0.85–1.00 for the migration and population size parameters (Table 2), whereas the ML approach has difficulty with convergence, especially on low-variability datasets, and so has a rather low coverage (frequencies between 0.06 and 0.94).
Coverage analysis of 100 simulated data sets that were generated with four different values of ‘true’ population sizes (Θ(t)). Θ is 4× effective population size × mutation rate per site per generation, and M is immigration rate over mutation rate. Coverage is measured as the percentage of times the true value is within the estimated 95% support interval. Run conditions for ML and Bayes inferences are specified in the text.
The scenario chosen for an example is difficult for any gene flow estimation program that uses only a single sample in time. The problem stems from the fact that the only information about the directionality are the mutations in the dataset. If the migration rate is high, all mutations, even the rare ones, are distributed over all populations and any directionality estimation based on a single locus will fail. With low migration rates among the populations, each population will acquire unique mutations and in principle the magnitude and directionality can be estimated even for single locus dataset, if there is enough variability in the dataset. In reality, however, such an estimation has proven difficult because the difference between the migration rates between two populations is small and often close to zero. Estimation based on single locus datasets thus often cannot recover the directionality, but multilocus estimates will allow the inference of the migration direction (Fig. 2).
The power to estimate migration rate is crucially dependent on the number of variable sites or number of alleles in the dataset. Too little variation leads to haphazard results in the ML method because the MCMC process has no strong guidance whether to insert or remove migration events during the course of the analysis; the process is more dependent on the static driving parameters. Comparison of several runs will deliver very different results and therefore show non-convergence. The only remedy is to run these analyses much longer to get a better estimate of the uncertainty of the estimate. Bayesian analysis is straightforward in such cases because when the posterior distribution is similar to the prior distribution, we can conclude that the dataset does not contain enough information for the inference. The ML method also has difficulties exploring the distribution around the ML estimate with highly variable data because the genealogy is very well defined by the large number of variable sites: the static driving value and the updating scheme (Beerli and Felsenstein, 1999) will not explore many different migration scenarios and therefore the tails of the distribution are not visited. This results in too narrow support intervals with small coverage values. In contrast, Bayesian inference manipulates the parameters using a diffuse prior. This forces more changes of the genealogy, therefore exploring more different migration scenarios and visiting the tails of the posterior distribution more efficiently.
The coverage shown for the Bayesian runs might be conservative but this is preferable to the coverage reported for ML, especially in the low-variability datasets (Θ ≤ 0.001, Table 2). Some ML runs did not really converge and were estimating either very large or zero migration rates.
Many users of MIGRATE have reported in numerous email queries that achieving convergence with the ML approach with low-information data, such as single locus datasets or data with a low mutation rate, is difficult and needs special attention. Bayesian inference seems to allow such users to achieve reliable results with less effort than the ML approach. It seems appropriate that if only the parameters and their support intervals are of interest, then biologists should prefer the Bayesian approach, although it will be interesting to see whether this will hold for all biological datasets.
The author thank Rasmus Nielsen for the suggestion to implement a Bayesian method into MIGRATE. At first, the author was uncertain about its success for complicated scenarios, like the one used (Fig. 1). Thomas Uzzell, Koffi Sampson and two anonymous reviewers helped to improve the manuscript. Funding for this research was supplied through Florida State University and National Science Foundation grant DEB-0108249 to Scott Edwards and P.B.