Reliable ABC model choice via random forests
Pierre Pudlo, Jean-Michel Marin, Arnaud Estoup, Jean-Marie Cornuet, Mathieu Gautier, Christian P. Robert
Bioinformatics, Volume 32, Issue 6, 15 March 2016, Pages 859–866, https://doi.org/10.1093/bioinformatics/btv684
Abstract
Motivation: Approximate Bayesian computation (ABC) methods provide an elaborate approach to Bayesian inference on complex models, including model choice. Both theoretical arguments and simulation experiments indicate, however, that model posterior probabilities may be poorly evaluated by standard ABC techniques.
Results: We propose a novel approach based on a machine learning tool named random forests (RF) to conduct selection among the highly complex models covered by ABC algorithms. We thus modify the way Bayesian model selection is both understood and operated, in that we rephrase the inferential goal as a classification problem, first predicting the model that best fits the data with RF and postponing the approximation of the posterior probability of the selected model to a second stage also relying on RF. Compared with earlier implementations of ABC model choice, the ABC RF approach offers several potential improvements: (i) it often has a larger discriminative power among the competing models, (ii) it is more robust against the number and choice of statistics summarizing the data, (iii) the computing effort is drastically reduced (with a gain in computational efficiency by a factor of at least 50) and (iv) it includes an approximation of the posterior probability of the selected model. The call to RF will undoubtedly extend the range of dataset sizes and model complexities that ABC can handle. We illustrate the power of this novel methodology by analyzing controlled experiments as well as genuine population genetics datasets.
Availability and implementation: The proposed methodology is implemented in the R package abcrf available on CRAN.
Contact: jean-michel.marin@umontpellier.fr
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
Approximate Bayesian computation (ABC) represents an elaborate statistical approach to model-based inference in a Bayesian setting in which model likelihoods are difficult to calculate (due to the complexity of the models considered).
Since its introduction in population genetics (Beaumont et al., 2002; Pritchard et al., 1999; Tavaré et al., 1997), the method has found an ever increasing range of applications covering diverse types of complex models in various scientific fields (see, e.g. Arenas et al., 2015; Beaumont, 2008, 2010; Chan et al., 2014; Csilléry et al., 2010; Theunert et al., 2012; Toni et al., 2009). The principle of ABC is to conduct Bayesian inference on a dataset through comparisons with numerous simulated datasets. However, it suffers from two major difficulties. First, to ensure reliability of the method, the number of simulations must be large; hence, it proves difficult to apply ABC to large datasets (e.g. in population genomics, where tens to hundreds of thousands of markers are commonly genotyped). Second, calibration has always been a critical step in ABC implementation (Blum et al., 2013; Marin et al., 2012). More specifically, the major feature in this calibration process involves selecting a vector of summary statistics that quantifies the difference between the observed data and the simulated data. The construction of this vector is therefore paramount, and examples abound of poor performances of ABC model choice algorithms related to specific choices of those statistics (Didelot et al., 2011; Marin et al., 2014; Robert et al., 2011), even though there also are instances of successful implementations.
We advocate a drastic modification in the way ABC model selection is conducted: we propose both to step away from selecting the most probable model from estimated posterior probabilities and to reconsider the very problem of constructing efficient summary statistics. First, given an arbitrary pool of available statistics, we now completely bypass selecting among those. This new perspective directly proceeds from machine learning methodology. Second, we postpone the approximation of model posterior probabilities to a second stage, as we deem the standard numerical ABC approximations of such probabilities fundamentally untrustworthy. We instead advocate selecting the a posteriori most probable model by constructing a (machine learning) classifier from simulations from the prior predictive distribution (or other distributions in more advanced versions of ABC), known as the ABC reference table. The statistical technique of random forests (RF) (Breiman, 2001) represents a trustworthy machine learning tool well adapted to complex settings, as is typical for ABC treatments. Once the classifier is constructed and applied to the actual data, an approximation of the posterior probability of the resulting model can be produced through a secondary RF that regresses the selection error on the available summary statistics. We show here how RF improves upon existing classification methods in significantly reducing both the classification error and the computational expense. After presenting theoretical arguments, we illustrate the power of the ABC-RF methodology by analyzing controlled experiments as well as genuine population genetics datasets.
2 Materials and methods
Bayesian model choice (Berger, 1985; Robert, 2001) compares the fit of M models to an observed dataset y. It relies on hierarchical modelling, setting first prior probabilities π(m) on model indices m ∈ {1, …, M} and then prior distributions π(θ_m | m) on the parameter θ_m of each model, characterized by a likelihood function f_m(y | θ_m). Inferences and decisions are based on the posterior probabilities of each model, π(m | y).
2.1 ABC algorithms for model choice
While we cannot cover in much detail the principles of ABC, let us recall here that ABC was introduced by Tavaré et al. (1997) and Pritchard et al. (1999) for solving intractable likelihood issues in population genetics. The reader is referred to, e.g. Beaumont (2008, 2010), Toni et al. (2009), Csilléry et al. (2010) and Marin et al. (2012) for thorough reviews of this approximation method. The fundamental principle at work in ABC is that the value of the intractable likelihood function f(y | θ) at the observed data y and for a current parameter θ can be evaluated by the proximity between y and pseudo-data x simulated from f(· | θ). In discrete settings, the indicator 1{x = y} is an unbiased estimator of f(y | θ) (Rubin, 1984). For realistic settings, the equality constraint is replaced with a tolerance region ρ{S(x), S(y)} ≤ ε, where ρ is a measure of divergence between the two vectors of summary statistics and ε is a tolerance value. The implementation of this principle is straightforward: the ABC algorithm produces a large number of pairs (θ, x) from the prior predictive, a collection called the reference table, and extracts from the table the pairs for which ρ{S(x), S(y)} ≤ ε.
To approximate posterior probabilities of competing models, ABC methods (Grelaud et al., 2009) compare observed data with a massive collection of pseudo-data, generated from the prior predictive distribution in the most standard versions of ABC; the comparison proceeds via a normalized Euclidean distance on the vector of statistics S(·) computed for both observed and simulated data. Standard ABC estimates posterior probabilities at stage (B) of Algorithm 1 below as the frequencies of the models within the k simulations nearest to S(y), proximity being defined by the distance between S(y) and the simulated S(x)'s.
Selecting a model means choosing the model with the highest frequency in the sample of size k produced by ABC, such frequencies being approximations of the posterior probabilities of the models. We stress that this solution amounts to a k-nearest neighbor (k-nn) estimate of those probabilities, based on the set of N simulations drawn at stage (A), whose records constitute the so-called reference table; see Biau et al. (2015) or Stoehr et al. (2015).
Algorithm 1. ABC model choice algorithm
(A) Generate a reference table of N simulations (m, S(x)) from the prior predictive distribution π(m) π(θ | m) f_m(x | θ)
(B) Learn from this set to infer about m at the observed summary s = S(y)
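To make stage (B) concrete, here is a minimal, self-contained R sketch of Algorithm 1 with its k-nn estimate. The two toy models (a Gaussian versus a Cauchy location model) and the robust summaries are illustrative assumptions of ours, not models from the paper.

```r
# Stage (A): a toy reference table of N pairs (m, S(x)) from the prior predictive.
set.seed(1)
N <- 1e4; n <- 50; k <- 100
simulate_summaries <- function(m) {
  # assumed toy models: Gaussian (m = 1) or Cauchy (m = 2) with a N(0, 1) prior
  # on the location parameter; S(x) = (median, MAD)
  x <- if (m == 1) rnorm(n, mean = rnorm(1)) else rcauchy(n, location = rnorm(1))
  c(median(x), mad(x))
}
m_ref <- sample(1:2, N, replace = TRUE)                  # uniform prior on models
S_ref <- t(vapply(m_ref, simulate_summaries, numeric(2)))

s_obs <- simulate_summaries(1)                           # pseudo-observed summaries

# Stage (B): model frequencies among the k nearest simulations, using a
# normalized Euclidean distance on the summaries
d <- sqrt(colSums((t(S_ref) - s_obs)^2 / apply(S_ref, 2, var)))
nearest <- order(d)[1:k]
table(m_ref[nearest]) / k   # k-nn approximation of the posterior probabilities
```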
Selecting a set of summary statistics S(x) that are informative for model choice is an important issue. The ABC approximation to the posterior probabilities will eventually produce a right ordering of the fit of competing models to the observed data, and thus select the right model, only for a specific class of statistics on large datasets (Marin et al., 2014). These most recent theoretical results on ABC model choice indeed show that some statistics produce nonsensical decisions and that there exist sufficient conditions for statistics to produce consistent model prediction, albeit at the cost of an information loss due to the summaries that may be substantial. The toy example comparing MA(1) and MA(2) models in the Supplementary Information clearly exhibits this potential loss when using only the first two autocorrelations as summary statistics. Barnes et al. (2012) developed an interesting methodology to select the summary statistics, but with the requirement to aggregate pseudo-sufficient statistics for estimation and model choice across all models. This induces a deeply inefficient dimension inflation and can be very time consuming.
It may seem tempting to collect the largest possible number of summary statistics to capture more information from the data. Enlarging S brings π(m | S(y)) closer to π(m | y), but it also increases the dimension of S. ABC algorithms, like k-nn and other local methods, suffer from the curse of dimensionality [see, e.g. Section 2.5 in Hastie et al. (2009)], so that the estimate of π(m | S(y)) based on the simulations is poor when the dimension of S is too large. Selecting summary statistics correctly and sparsely is therefore paramount, as shown by the literature of recent years [see Blum et al. (2013) for a survey in the context of ABC parameter estimation]. For ABC model choice, two main projection techniques have been considered so far. First, Prangle et al. (2014) show that the Bayes factor itself is an acceptable summary (of dimension one) when comparing two models, but its practical evaluation via a pilot ABC simulation induces a poor approximation of model evidences (Didelot et al., 2011; Robert et al., 2011). Second, the recourse to a regression layer like linear discriminant analysis (LDA; Estoup et al., 2012) is discussed below and in Supplementary Section S1. Other projection techniques have been proposed in the context of parameter estimation: see, e.g. Fearnhead and Prangle (2012) and Aeschbacher et al. (2012).
Given the fundamental difficulty in producing reliable tools for model choice based on summary statistics (Robert et al., 2011), we now propose to switch to a different approach based on an adapted classification method. We recall in the next section the most important features of the RF algorithm.
2.2 RF methodology
The classification and regression trees (CART) algorithm at the core of the RF scheme produces a binary tree that sets allocation rules for entries as labels of the internal nodes and classifications or predictions of Y as values of the tips (terminal nodes). At a given internal node, the binary rule compares a selected covariate X_j with a bound t, the left-hand branch rising from that vertex being defined by X_j ≤ t. Predicting the value of Y given the covariate X implies following a path from the tree root driven by applying these binary rules; the outcome of the prediction is the value found at the leaf reached at the end of the path: majority rule for classification and average for regression. To find the best split and the best variable at each node of the tree, we minimize a criterion: the Gini index for classification and the L2-loss for regression. In the randomized version of the CART algorithm (see Supplementary Algorithm S1), only a random subset of covariates of size n_try is considered at each node of the tree.
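For concreteness, a short R sketch of the split criterion used for classification; the helper names are ours, and the search scans midpoints between consecutive covariate values.

```r
# Gini impurity of a vector of class labels
gini <- function(y) { p <- table(y) / length(y); 1 - sum(p^2) }

# cost of the split {X_j <= t}: impurity of the two children, weighted by size;
# this is the quantity minimized over (j, t) at each internal node
split_cost <- function(x, y, t) {
  left <- x <= t
  mean(left) * gini(y[left]) + mean(!left) * gini(y[!left])
}

# best bound t for one covariate
best_split <- function(x, y) {
  u <- sort(unique(x)); ts <- (u[-1] + u[-length(u)]) / 2
  ts[which.min(vapply(ts, function(t) split_cost(x, y, t), numeric(1)))]
}

best_split(iris$Petal.Length, iris$Species)   # 2.45, the classic CART split on iris
```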
The RF algorithm (Breiman, 2001) consists in bagging (which stands for bootstrap aggregating) randomized CART: it grows B randomized CART trees, each trained on a sample or sub-sample of size N_boot produced by bootstrapping the original training database. Each tree provides a classification or regression rule that returns a class or a prediction; the forest then takes the majority vote across all trees for classification and averages the response values for regression.
Three tuning parameters need to be calibrated: the number B of trees in the forest, the number n_try of covariates sampled at a given node of the randomized CART and the size N_boot of the bootstrap sub-sample. This point is discussed in Section 3.4.
For classification, a very useful indicator is the out-of-bag error (Hastie et al., 2009, Chapter 15). Without any recourse to a test set, it gives an idea of how good the RF classifier is. For each element of the training set, we can define the out-of-bag classifier as the aggregation of the votes over the trees whose bootstrap samples do not contain that element. The out-of-bag error is the error rate of the out-of-bag classifiers on the training set; it is about as accurate as an error estimate computed on a test set of the same size as the training set.
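With the randomForest package (one implementation of Breiman's algorithm), the out-of-bag error is available directly; a minimal sketch on a standard dataset:

```r
library(randomForest)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)

# each observation is predicted by aggregating only the trees whose
# bootstrap sample does not contain it
mean(rf$predicted != iris$Species)   # out-of-bag error rate, no test set needed
rf$err.rate[rf$ntree, "OOB"]         # same quantity, tracked tree by tree
```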
2.3 ABC model choice via RF
The above-mentioned difficulties in ABC model choice drive us to a paradigm shift in the practice of model choice, namely to rely on a classification algorithm for model selection rather than on a poorly estimated vector of posterior probabilities. As shown in the example described in Section 3.1, the standard ABC approximations to posterior probabilities can differ significantly from the true π(m | y). Our version of stage (B) in Algorithm 1 thus relies on a RF classifier whose goal is to predict the best-suited model at each possible value s of the summary statistics S(x). The RF is trained on the simulations produced by stage (A) of Algorithm 1, which constitute the reference table. Once the model m̂ is selected, we approximate π(m = m̂ | S(y)) by another RF, obtained from regressing the probability of classification error on the (same) covariates, as explained below.
A natural measure of performance here is the prior error rate, namely the probability P(m̂(S(x)) ≠ m) that the classifier misallocates a dataset drawn from the prior predictive. It can be evaluated from simulations drawn as in stage (A) of Algorithm 1, independently of the reference table (Stoehr et al., 2015), or with the out-of-bag error of the RF, which, as explained above, requires no further simulation. Both classifiers and sets of summary statistics can be compared on this error scale: the pair that minimizes the prior error rate achieves the best approximation of the ideal Bayesian classifier. In that sense, it stands closest to the decision we would take were we able to compute the true π(m | y).
We seek a classifier in stage (B) of Algorithm 1 that can handle an arbitrary number of statistics and extract the maximal information from the reference table obtained at stage (A). As introduced above, RF classifiers (Breiman, 2001) are perfectly suited to that purpose. To build both a RF classifier for a collection of statistical models and an associated RF regression function predicting the allocation error, we start from a simulated ABC reference table made of records pairing a model index with the summary statistics of the associated simulated data. This table then serves as the training database for a RF that forecasts the model index from the summary statistics. The resulting algorithm, presented in Algorithm 2 and called ABC-RF, is implemented in the R package abcrf associated with this article.
Algorithm 2: ABC-RF
(A) Generate a reference table of N simulations (m, S(x)) from the prior predictive distribution
(B) Construct B randomized CART trees which predict m using S(x):
for b = 1 to B do
draw a bootstrap (sub-)sample of size N_boot from the reference table
grow a randomized CART T_b (Supplementary Algorithm S1)
end for
(C) Determine the model indexes predicted at S(y) by the trees T_1, …, T_B
(D) Select the model according to a majority vote among the predicted indexes
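The sketch below reproduces stages (B)-(D) with the generic randomForest package on a toy reference table; the abcrf package associated with this article wraps these steps, and the toy models, priors and object names are our own illustrative assumptions.

```r
library(randomForest)
set.seed(1)

# toy reference table: model index (factor) plus two summary statistics
N <- 5000
modindex <- factor(sample(1:2, N, replace = TRUE))
S1 <- rnorm(N, mean = as.integer(modindex))
S2 <- rexp(N, rate = as.integer(modindex))
reftable <- data.frame(modindex, S1, S2)

# stage (B): B randomized CART bagged over bootstrap samples of the table
abc_rf <- randomForest(modindex ~ ., data = reftable, ntree = 500)

# stages (C)-(D): per-tree predicted indexes at S(y), then the majority vote
s_obs <- data.frame(S1 = 1.2, S2 = 0.8)                    # observed summaries
predict(abc_rf, s_obs, type = "vote", norm.votes = FALSE)  # classification votes
predict(abc_rf, s_obs)                                     # selected model index
```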
The justification for choosing RF to conduct an ABC model selection is that, both formally (Biau, 2012; Scornet et al., 2015) and experimentally (Hastie et al., 2009, Chapter 5), RF classification has been shown to be mostly insensitive both to strong correlations between predictors (here, the summary statistics) and to the presence of noisy variables, even in relatively large numbers, a characteristic that k-nn classifiers lack.
This type of robustness justifies adopting a RF strategy to learn from an ABC reference table for Bayesian model selection: within an arbitrary (and arbitrarily large) collection of summary statistics, some may exhibit strong correlations and others may be uninformative about the model index, with no dire consequences on RF performance. For model selection, RF thus competes with the two local classifiers commonly implemented within ABC. It provides more nonparametric modelling than local logistic regression (Beaumont, 2008), which is implemented in the DIYABC software (Cornuet et al., 2014) but is extremely costly; see the method of Estoup et al. (2012), which reduces the dimension through a linear discriminant projection before resorting to local logistic regression. This software also includes a standard k-nn selection procedure [i.e. the so-called direct approach of Cornuet et al. (2008)], which suffers from the curse of dimensionality and thus forces a selection among statistics.
2.4 Approximating the posterior probability of the selected model
The outcome of the RF computation applied to a given target dataset is a classification vote for each model, namely the number of times that model is selected among the B trees of the forest. The model with the highest classification vote is the model best suited to the target dataset. It is worth stressing here that there is no direct connection between the frequencies of the model allocations of the data among the tree classifiers (i.e. the classification votes) and the posterior probabilities of the competing models. Machine learning classifiers hence miss a distinct advantage of posterior probabilities, namely that the latter evaluate a degree of confidence in the selected model. An alternative to those probabilities is the prior error rate. Aside from its use in selecting the best classifier and set of summary statistics, this indicator remains, however, of limited relevance, since the only point of importance in the data space is the observed dataset y. A more relevant quantity is the conditional error rate given the observed value of the summary statistics, P[m̂(S(y)) ≠ m | S(y)].
It provides the complement of the posterior probability that the true model is the selected model: π(m = m̂(S(y)) | S(y)) = 1 − P[m̂(S(y)) ≠ m | S(y)].
To produce our estimate of the posterior probability π(m = m̂ | S(y)), we proceed as follows:
We compute the value of the misclassification indicator 1{m̂(S(x)) ≠ m} for the trained RF and for all terms (m, S(x)) in the ABC reference table; to avoid overfitting, we use the out-of-bag classifiers;
We train a RF regression estimating this error variate as a function of the same set of summary statistics, based on the same reference table. This second RF can be represented as a function ê(s) that constitutes a machine learning estimate of P[m̂(S(x)) ≠ m | S(x) = s];
We apply this RF function to the actual observations summarized as S(y) and return 1 − ê(S(y)) as our estimate of π(m = m̂ | S(y)).
This corresponds to the representation of Algorithm 3 which is implemented in the R package abcrf associated with this paper.
Algorithm 3: Estimating the posterior probability of the selected model
(a) Use the RF produced by Algorithm 2 to compute the out-of-bag classifiers of all terms in the reference table and deduce the associated binary model prediction errors 1{m̂(S(x)) ≠ m}
(b) Use the reference table to build a RF regression function ê(s) by regressing the model prediction error on the summary statistics
(c) Return the value 1 − ê(S(y)) as the RF regression estimate of π(m = m̂ | S(y))
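Continuing the sketch above (same toy reftable, abc_rf and s_obs objects), Algorithm 3 amounts to three steps with randomForest:

```r
# (a) out-of-bag misclassification indicator for each reference-table entry
#     (predict() without newdata returns the out-of-bag predictions)
oob_error <- as.numeric(predict(abc_rf) != reftable$modindex)

# (b) RF regression of this error indicator on the same summary statistics
err_rf <- randomForest(x = reftable[, -1], y = oob_error, ntree = 500)

# (c) one minus the predicted error at S(y): estimate of the posterior
#     probability of the selected model
1 - predict(err_rf, s_obs)
```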
3 Results: illustrations of the ABC-RF methodology
To illustrate the power of the ABC-RF methodology, we now report several controlled experiments as well as two genuine population genetic examples.
3.1 Insights from controlled experiments
The Supplementary Information details controlled experiments on a toy problem, comparing MA(1) and MA(2) time-series models, and on two controlled synthetic examples from population genetics, based on single-nucleotide polymorphism (SNP) and microsatellite data. The toy example is particularly revealing with regard to the discrepancy between the posterior probability of a model, π(m | y), and the version conditioning on the summary statistics, π(m | S(y)). Figure 1 shows how far from the diagonal realizations of the pairs (π(m | y), π(m | S(y))) lie, even though the autocorrelation statistics are quite informative (Marin et al., 2012). Note in particular the vertical accumulation of points near the right edge of the plot, where π(m | y) is close to 1. Supplementary Table S1 demonstrates the further gap in predictive power for the full Bayes solution, with a true error rate of 12%, versus the best solution based on the summaries (RF), which barely achieves a 16% error rate.
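For readers wishing to reproduce an experiment of this kind, a minimal R sketch of the MA reference table follows; the uniform priors on the MA coefficients are illustrative guesses of ours, not necessarily those of the Supplementary Information.

```r
# reference table for the MA(1) vs MA(2) toy problem, with the first two
# autocorrelations as summary statistics
set.seed(1)
n <- 100
sim_ma <- function(m) {
  theta <- runif(m, -1, 1)                      # assumed prior on MA coefficients
  x <- arima.sim(list(ma = theta), n = n)
  acf(x, lag.max = 2, plot = FALSE)$acf[2:3]    # lag-1 and lag-2 autocorrelations
}
m_ref <- sample(1:2, 1e4, replace = TRUE)
S_ref <- t(vapply(m_ref, sim_ma, numeric(2)))
```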
Fig. 1. Illustration of the discrepancy between posterior probabilities based on the whole data and based on a summary. The aim is to choose between two nested time series models, namely moving averages of order 1 and 2 [denoted MA(1) and MA(2), respectively; see Supplementary Information for more details]. Each point of the plot gives two posterior probabilities of MA(2) for a dataset simulated either from the MA(1) (blue) or MA(2) (orange) model, based on the whole data (x-axis) and on only the first two autocorrelations (y-axis)
For both controlled genetics experiments in the Supplementary Information, the computation of the true posterior probabilities of the three models is impossible. The predictive performances of the competing classifiers can nonetheless be compared on a test sample. The results, summarized in Supplementary Tables S2 and S3, legitimize the use of RF, as this method achieves the most efficient classification in all genetic experiments. Note that the prior error rate of any classifier is always bounded from below by the error rate associated with the (ideal) Bayesian classifier. Therefore, a mere gain of a few percent may well constitute an important improvement when the prior error rate is low. As an aside, we also stress that, since the prior error rate is an expectation over the entire sampling space, the reported gain may translate into much better performances over some areas of this space.
Supplementary Figure S2 displays differences between the true posterior probability of the model selected by Algorithm 2 and its approximation by Algorithm 3. Moreover, we found that the values of the votes provided by Algorithm 2 are only useful to assess which model best fits the data, and that any conclusion regarding the level of confidence necessitates the computation of the posterior probability of the selected model provided by Algorithm 3.
3.2 Microsatellite dataset: retracing the invasion routes of the Harlequin ladybird
The original challenge was to conduct inference about the introduction pathway of the invasive Harlequin ladybird (Harmonia axyridis) for the first recorded outbreak of this species in eastern North America. The dataset, first analyzed by Lombaert et al. (2011) and Estoup et al. (2012) via ABC, includes samples from three natural and two biocontrol populations genotyped at 18 microsatellite markers. The model selection requires the formalization and comparison of 10 complex competing scenarios corresponding to various possible routes of introduction [see Supplementary Information for details and analysis 1 in Lombaert et al. (2011)]. We now compare our results from the ABC-RF algorithm with those of other classification methods for three sizes of the reference table, and with the original solutions of Lombaert et al. (2011) and Estoup et al. (2012). We included all summary statistics computed by the DIYABC software for microsatellite markers (Cornuet et al., 2014), namely 130 statistics, complemented by the nine LDA axes as additional summary statistics (see Supplementary Section S4).
In this example, discriminating among models based on the observed summary statistics is difficult. The overlapping groups of Supplementary Figure S8 reflect that difficulty, the source of which is the relatively low information carried by the 18 autosomal microsatellite loci considered here. Prior error rates of the learning methods on the whole reference table are given in Table 1. As expected in such a high-dimensional setting (Hastie et al., 2009, Section 2.5), the k-nn classifiers behind the standard ABC methods are all defeated by RF for the three sizes of the reference table, even when k-nn is trained on the much smaller set of covariates composed of the nine LDA axes. The classifier and set of summary statistics showing the lowest prior error rate is RF trained on the 130 summaries plus the nine LDA axes.
Table 1. Harlequin ladybird data: estimated prior error rates for various classification methods and sizes of the reference table

| Classification method trained on the reference table | N_ref = 10 000 | N_ref = 20 000 | N_ref = 50 000 |
|---|---|---|---|
| LDA | 39.91 | 39.30 | 39.04 |
| Standard ABC (k-nn) on DIYABC summaries | 57.46 | 53.76 | 51.03 |
| Standard ABC (k-nn) on LDA axes | 39.18 | 38.46 | 37.91 |
| Local logistic regression on LDA axes | 41.04 | 37.08 | 36.05 |
| RF on DIYABC summaries | 40.18 | 38.94 | 37.63 |
| RF on DIYABC summaries and LDA axes | 36.86 | 35.62 | 34.44 |

Note. Prior error rates (%) measure the performances of the classifiers used in stage (B) of Algorithm 1. A set of 10 000 prior simulations was used to calibrate the number of neighbors k in both standard ABC and local logistic regression. Prior error rates are estimated as average misclassification errors on an independent set of 10 000 prior simulations, constant over methods and sizes of the reference tables. N_ref corresponds to the number of simulations included in the reference table.
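A sketch of how LDA axes can be appended to the pool of summaries, using MASS::lda on the toy reftable from the sketches of Section 2 (with M models, lda() yields M − 1 axes, hence the nine axes for the ten ladybird scenarios):

```r
library(MASS)
library(randomForest)
lda_fit  <- lda(modindex ~ ., data = reftable)
lda_axes <- predict(lda_fit, reftable)$x        # M - 1 discriminant coordinates
augmented <- cbind(reftable, lda_axes)

rf_aug <- randomForest(modindex ~ ., data = augmented, ntree = 500)
rf_aug$err.rate[500, "OOB"]                     # OOB proxy for the prior error rate
```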
Supplementary Figure S9 shows that RF is able to automatically determine the (most) relevant statistics for the model comparison, including in particular some crude estimates of admixture rates defined in Choisy et al. (2004), some of which were not selected by the experts in Lombaert et al. (2011). We stress here that the level of information of the summary statistics displayed in Supplementary Figure S9 is relevant for model choice but not for parameter estimation issues. In other words, the set of best summaries found with ABC-RF should not be considered an optimal set for further parameter estimation under a given model with standard ABC techniques (Beaumont et al., 2002).
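Variable importances of the kind displayed in Supplementary Figure S9 can be extracted directly from the trained forest; continuing the toy sketch:

```r
# mean decrease in Gini index for each summary statistic (and LDA axis)
imp <- importance(rf_aug)
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]
varImpPlot(rf_aug)   # graphical version
```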
The evolutionary scenario selected by our RF strategy agrees with the earlier conclusion of Lombaert et al. (2011), based on approximations of posterior probabilities by local logistic regression solely on the LDA axes: the same scenario displays the highest ABC posterior probability and the largest number of selections among the decisions taken by the aggregated trees of the RF. Using Algorithm 3, we obtained an estimate of the posterior probability of the selected scenario equal to 0.4624. This estimate is significantly lower than the value of about 0.6 reported in Lombaert et al. (2011) on the basis of a local logistic regression method. The new value is more credible because it is based on all the summary statistics and on a method adapted to such a high-dimensional context and less sensitive to calibration issues. Moreover, this small posterior probability corresponds better to the intuition of the experimenters and indicates that new experiments are necessary to give a more reliable answer (e.g. the genotyping of a larger number of loci).
3.3 SNP dataset: inference about human population history
Because the ABC-RF algorithm performs well with a substantially lower number of simulations compared with standard ABC methods, it is expected to be of particular interest for the statistical processing of massive SNP datasets, whose production is on the increase in the field of population genetics. We analyze here a dataset including 50 000 SNP markers genotyped in four human populations (The 1000 Genomes Project Consortium, 2012): Yoruba (Africa), Han (East Asia), British (Europe) and American individuals of African ancestry. Our intention is not to bring new insights into human population history, which has been and is still being studied in greater detail using genetic data, but to illustrate the potential of ABC-RF in this context. We compared six scenarios (i.e. models) of evolution of the four human populations, which differ from each other by one ancient and one recent historical event: (i) a single out-of-Africa colonization event giving an ancestral out-of-Africa population that secondarily split into one European and one East Asian population lineage, versus two independent out-of-Africa colonization events, one giving the European lineage and the other the East Asian lineage; (ii) the possibility of a recent genetic admixture of Americans of African origin with their African ancestors and with individuals of European or East Asian origin. The SNP dataset and the compared scenarios are further detailed in the Supplementary Information. We used all the summary statistics provided by DIYABC for SNP markers (Cornuet et al., 2014), namely 112 statistics in this setting, complemented by the five LDA axes as additional statistics.
To discriminate among the six scenarios of Supplementary Figure S12, RF and the other classifiers were trained on reference tables of different sizes; the estimated prior error rates are reported in Table 2. Unlike in the previous example, the information carried by the 50 000 SNP markers is much higher: it induces better-separated simulations on the LDA axes (Fig. 2) and much lower prior error rates (Table 2). RF using both the initial summaries and the LDA axes provides the best results.
Fig. 2. Human SNP data: projection of the reference table on the first four LDA axes. Colors correspond to model indices. The location of the observed dataset is indicated by a large black star
Table 2. Human SNP data: estimated prior error rates for classification methods and three sizes of the reference table

| Classification method trained on the reference table | N_ref = 10 000 | N_ref = 20 000 | N_ref = 50 000 |
|---|---|---|---|
| LDA | 9.91 | 9.97 | 10.03 |
| Standard ABC (k-nn) on DIYABC summaries | 23.18 | 20.55 | 17.76 |
| Standard ABC (k-nn) on LDA axes only | 6.29 | 5.76 | 5.70 |
| Local logistic regression on LDA axes | 6.85 | 6.42 | 6.07 |
| RF on DIYABC initial summaries | 8.84 | 7.32 | 6.34 |
| RF on DIYABC summaries and LDA axes | 5.01 | 4.66 | 4.18 |

Note. Same comments as in Table 1.
The ABC-RF algorithm selects Scenario 2 as the predicted scenario for the human dataset, an answer which is not visually obvious from the LDA projections of Figure 2, in which Scenario 2 corresponds to the blue color. Considering previous population genetics studies in the field, however, it is not surprising that this scenario was selected: it includes a single out-of-Africa colonization event giving an ancestral out-of-Africa population, a secondary split into one European and one East Asian population lineage and a recent genetic admixture of Americans of African origin with their African ancestors and European individuals. Using Algorithm 3, we obtained an estimate of the posterior probability of Scenario 2 equal to 0.998, corresponding to a high level of confidence in choosing Scenario 2.
Computation time is a particularly important issue in the present example. Simulating the 10 000 SNP datasets used to train the classification methods requires 7 h on a computer with 32 processors (Intel Xeon(R) CPU, 2 GHz). In that context, it is worth stressing that RF trained on the DIYABC summaries and the LDA axes of a 10 000-simulation reference table has a smaller prior error rate than all the other classifiers, even when the latter are trained on a 50 000-simulation reference table. In practice, standard ABC treatments for model choice are based on reference tables of substantially larger sizes [i.e. 10^5 to 10^6 simulations per scenario (Bertorelle et al., 2010; Estoup et al., 2012)]. For the above setting in which six scenarios are compared, standard ABC treatments would hence require a minimum computation time of 17 days (using the same computation resources). According to the comparative tests that we carried out on various example datasets, RF globally allows a minimum computation speed gain of around a factor of 50 in comparison with standard ABC treatments; see also Supplementary Section S4 for other considerations regarding computation speed gains.
3.4 Practical recommendations regarding the implementation of the algorithms
We develop here several points, formalized as questions, which should help users seeking to apply our methodology to their dataset for statistical model choice.
Are my models and/or associated priors compatible with the observed dataset?
This question is of prime interest and applies to any type of ABC treatment, standard ABC and ABC-RF alike. Basically, if none of the proposed model-prior combinations produces simulated datasets in a reasonable vicinity of the observed dataset, this signals an incompatibility, and we consider it useless to attempt model choice inference in such situations. We then strongly advise reformulating the compared models and/or the associated prior distributions to achieve some compatibility in the above sense. We propose a visual way to address this issue, namely the simultaneous projection of the simulated datasets of the reference table and of the observed dataset on the first LDA axes; such a graphical assessment can be achieved using the R package abcrf associated with this paper. In the LDA projection, the observed dataset needs to be located reasonably within the clouds of simulated datasets (see Fig. 2 as an illustration). Note that visual representations of a similar type (although based on PCA), as well as the computation, for each summary statistic and each model, of the probabilities of the observed values under the prior distributions, have been proposed by Cornuet et al. (2010) and are already automatically provided by the DIYABC software.
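A bare-bones version of this visual check, reusing the toy lda_fit, reftable and s_obs objects from the sketches of Section 2 (one discriminant axis only, since two toy models are compared; the abcrf package provides a polished equivalent):

```r
proj_ref <- predict(lda_fit, reftable)$x[, 1]   # simulated datasets on LD1
proj_obs <- predict(lda_fit, s_obs)$x[, 1]      # observed dataset on LD1

plot(density(proj_ref), main = "Reference table projected on LD1")
abline(v = proj_obs, lwd = 2)   # should fall well inside the simulated cloud
```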
Did I simulate enough datasets for my reference table?
A rule of thumb is to simulate between 5000 and 10 000 datasets per model among those compared. For instance, in the example dealing with human population history (Section 3.3), we simulated a total of 50 000 datasets for six models (i.e. about 8300 datasets per model). To evaluate whether or not this number is sufficient for RF analysis, we recommend computing global prior error rates from both the entire reference table and a subset of it (for instance, a subset of 40 000 simulated datasets if the reference table includes a total of 50 000). If the prior error rate obtained from the subset is similar to, or only slightly higher than, the value obtained from the entire reference table, one can consider that the reference table contains enough simulated datasets. If a substantial difference is observed between the two values, then we recommend increasing the number of datasets in the reference table. For instance, in the human population history example, we obtained prior error rates of 4.22% and 4.18% when computed from a subset of 40 000 simulated datasets and from the entire 50 000 datasets of the reference table, respectively; in this case, the benefit of producing more simulated datasets seems negligible.
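This check is easily scripted; a sketch with the toy reftable from Section 2, comparing the out-of-bag error on an 80% subset with that on the full table:

```r
sub <- reftable[sample(nrow(reftable), 0.8 * nrow(reftable)), ]
rf_sub  <- randomForest(modindex ~ ., data = sub,      ntree = 500)
rf_full <- randomForest(modindex ~ ., data = reftable, ntree = 500)
c(subset = rf_sub$err.rate[500, "OOB"],
  full   = rf_full$err.rate[500, "OOB"])   # close values = table large enough
```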
Did my forest grow enough trees?
In our experience, a forest made of 500 trees usually constitutes a good trade-off between computational efficiency and statistical precision (Breiman, 2001). To evaluate whether or not this number is sufficient, we recommend plotting the estimated prior error rate and/or the posterior probability of the best model as a function of the number of trees in the forest. The shapes of the curves provide a visual diagnostic of whether these key quantities stabilize as the number of trees approaches 500. We provide illustrations of such visual representations for the example dealing with human population history in Figure 3. Such a graphical assessment can be achieved using the R package abcrf associated with this paper.
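With randomForest, a curve of the kind shown in Figure 3 can be approximated by the out-of-bag error trajectory stored in the fitted object:

```r
# OOB error after 1, 2, ..., 500 trees; the curve should flatten well
# before the final forest size
plot(rf_full$err.rate[, "OOB"], type = "l",
     xlab = "number of trees", ylab = "OOB error rate")
```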
Fig. 3. Human SNP data: evolution of the ABC-RF prior error rate with respect to the number of trees in the forest
How do I set N_boot and n_try for classification and regression?
For a reference table with up to 100 000 datasets and 250 summary statistics, we recommend keeping the entire reference table when building each tree, that is, setting N_boot = N_ref. For larger reference tables, the value of N_boot can be calibrated against the prior error rate, starting from a moderate value and doubling it until the estimated prior error rate stabilizes. For the number n_try of summary statistics sampled at each node, we see no reason to modify the default number of covariates, chosen as n_try = √d for classification and n_try = d/3 for regression, where d is the total number of predictors (Breiman, 2001). Finally, when the number of summary statistics is lower than 15, one might reduce n_try below these defaults.
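In randomForest terms, N_boot corresponds to the sampsize argument and n_try to mtry; a sketch of the doubling strategy on the toy reftable (the grid values are arbitrary):

```r
for (nboot in c(1000, 2000, 4000)) {
  rf <- randomForest(modindex ~ ., data = reftable, ntree = 500,
                     sampsize = nboot,                        # N_boot
                     mtry = floor(sqrt(ncol(reftable) - 1)))  # n_try = sqrt(d)
  cat("N_boot =", nboot, " OOB error =", rf$err.rate[500, "OOB"], "\n")
}
```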
4 Discussion
This article is purposely focused on selecting a statistical model, a task which can be rephrased as a classification problem trained on ABC simulations. We defend here the paradigm shift of assessing the best-fitting model via a RF classification and of evaluating our confidence in the selected model by a secondary RF procedure, resulting in a different approach to precisely estimating the posterior probability of the selected model. We further provide a calibrating principle for this approach, in that the prior error rate offers a rational way to select the classifier and the set of summary statistics leading to results closest to a true Bayesian analysis.
Compared with past ABC implementations, ABC-RF offers improvements at at least four levels: (i) on all the experiments we studied, it has a lower prior error rate; (ii) it is robust to the number and choice of summary statistics, as RF can handle many superfluous statistics with no impact on the performance rates, which mostly depend on the intrinsic dimension of the classification problem (Biau, 2012; Scornet et al., 2015), a characteristic confirmed by our results; (iii) the computing effort is considerably reduced, as RF requires a much smaller reference table than the alternatives (i.e. a few thousand versus hundreds of thousands to billions of simulations); and (iv) the method comes with an embedded, cost-free error evaluation which assesses the reliability of the ABC-RF analysis. As a consequence, ABC-RF allows for a more robust handling of the degree of uncertainty in the choice between models, possibly in contrast with earlier and over-optimistic assessments.
Because of the massive gain in computing and simulation effort, ABC-RF will extend the range and complexity of datasets (e.g. the number of markers in population genetics) and models handled by ABC. In particular, we believe that ABC-RF will be of considerable interest for the statistical processing of massive SNP datasets, whose production is rapidly increasing within the field of population genetics for both model and non-model organisms. Once a given model has been chosen and confidence in it evaluated by ABC-RF, it becomes possible to estimate parameter distributions under this (single) model using standard ABC techniques (Beaumont et al., 2002) or alternative methods such as those proposed by Excoffier et al. (2013).
Acknowledgements
The authors are grateful to the referees for their supportive and constructive comments throughout the editorial process. The use of random forests was suggested to J.-M.M. and C.P.R. by Bin Yu during a visit at CREST, Paris. We are grateful to Gérard Biau for his help about the asymptotics of random forests. Some parts of the research were conducted at BIRS, Banff, Canada, and the authors (P.P. and C.P.R.) took advantage of this congenial research environment.
Funding
This research was partly supported by the ERA-Net BiodivERsA2013-48 (EXOTIC), with the national funders FRB, ANR, MEDDE, BELSPO, PT-DLR and DFG, part of the 2012–2013 BiodivERsA call for research proposals. This work was also supported by the Labex NUMEV.
Conflict of Interest: none declared.
References
Author notes
†The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.
Associate Editor: Inanc Birol
