Bayesian model comparison for simulation-based inference

Comparison of appropriate models to describe observational data is a fundamental task of science. The Bayesian model evidence, or marginal likelihood, is a computationally challenging, yet crucial, quantity to estimate to perform Bayesian model comparison. We introduce a methodology to compute the Bayesian model evidence in simulation-based inference (SBI) scenarios (also often called likelihood-free inference). In particular, we leverage the recently proposed learnt harmonic mean estimator and exploit the fact that it is decoupled from the method used to generate posterior samples, i.e. it requires posterior samples only, which may be generated by any approach. This flexibility, which is lacking in many alternative methods for computing the model evidence, allows us to develop SBI model comparison techniques for the three main neural density estimation approaches, including neural posterior estimation (NPE), neural likelihood estimation (NLE), and neural ratio estimation (NRE). We demonstrate and validate our SBI evidence calculation techniques on a range of inference problems, including a gravitational wave example. Moreover, we further validate the accuracy of the learnt harmonic mean estimator, implemented in the HARMONIC software, in likelihood-based settings. These results highlight the potential of HARMONIC as a sampler-agnostic method to estimate the model evidence in both likelihood-based and simulation-based scenarios.


INTRODUCTION
Bayesian model comparison provides a robust and principled statistical framework for the selection of appropriate scientific models to describe observational data. The key quantity to perform model comparison in a Bayesian inference framework is the model evidence, or marginal likelihood, whose estimate allows one to assign relative weights to different models (see, e.g., Trotta 2008 for a review of Bayesian model comparison, particularly in the context of cosmology). However, obtaining a precise and accurate estimate of the Bayesian model evidence is a computationally challenging task, involving a multi-dimensional integral that may quickly exceed the available computational resources for parameter spaces of even moderate dimension. A variety of techniques for computing the Bayesian model evidence have been proposed (see, e.g., Friel & Wyse 2012; Llorente et al. 2020 for reviews).
One of the most widely used classes of algorithms for estimating the model evidence, particularly in astrophysics and cosmology, is nested sampling (Skilling 2006), a method for which posterior inferences can also be computed as a byproduct (see, e.g., Buchner 2021; Ashton et al. 2022 for reviews of nested sampling). Common nested sampling algorithms, such as MultiNest (Feroz & Hobson 2008; Feroz et al. 2009) and PolyChord (Handley et al. 2015a,b), have enjoyed remarkable success in multiple research areas. However, they can struggle in high-dimensional parameter spaces. The recently proposed proximal nested sampling framework scales to very high-dimensional settings (Cai et al. 2021). However, proximal nested sampling is restricted to log-convex likelihoods. Nevertheless, such likelihoods are common, and so proximal nested sampling is likely to be particularly useful for inverse imaging problems. As the name suggests, nested sampling couples the computation of the model evidence to the sampling approach, restricting its flexibility. That is, sampling must be performed in a prescribed (i.e. nested) manner.

★ E-mail: a.spuriomancini@ucl.ac.uk
† Authors contributed similarly.
The recently proposed learnt harmonic mean estimator (McEwen et al. 2021) for computation of the model evidence removes this restriction. While the original harmonic mean estimator (Newton & Raftery 1994) can fail catastrophically, since its variance may become very large and may not be finite (Neal 1994), the learnt harmonic mean solves this problem by learning an approximation of the optimal internal importance sampling target distribution (McEwen et al. 2021). Critically, the learnt harmonic mean estimator requires only samples from the posterior and so is agnostic to sampling strategy, affording it great flexibility, which is crucial to the current work.
The need for efficient and reliable methods for computing the model evidence applies not only to likelihood-based settings, but also to simulation-based inference (SBI) frameworks. In the SBI setting (sometimes referred to as likelihood-free inference; LFI), the likelihood is either not available or too costly to evaluate, and the inference process relies solely on the ability to simulate observables. Approximate Bayesian computation (ABC) is the traditional, prototypical SBI technique, which relies on rejection sampling of parameter sets on the basis of a similarity metric between the simulated observables and the observations (see, e.g., Beaumont 2019). However, ABC methods can easily require an inaccessible number of simulations to reach convergence, limiting their applicability. More recently, neural density estimation techniques have been introduced to learn surrogate densities directly. Such novel frameworks have seen numerous successful applications in various scientific areas, carrying great promise for the future due to their ability to avoid the evaluation of an explicit (and possibly incorrect) likelihood function, while limiting the number of simulations through clever use of cutting-edge machine learning algorithms. For a recent review of SBI techniques we refer the reader to Cranmer et al. (2020). Neural density estimation methods have recently found remarkable success in cosmology, with general-purpose open-source software readily available (e.g. pydelfi by Alsing et al. 2019; swyft by Cole et al. 2021).
The development of new SBI techniques has so far mostly concentrated on optimising the task of parameter estimation, while model selection has received less attention. Nevertheless, model selection is a critical component of a complete statistical analysis, particularly in scientific fields where selection of the appropriate model is often the fundamental question. While there has been some consideration of model selection for SBI, the field remains nascent. Brewer et al. (2011) propose a technique based on diffusive nested sampling to compute the model evidence for ABC. However, this approach is restricted to ABC, which as discussed above can be inefficient and of limited applicability, and is not straightforwardly generalisable to modern neural density estimation approaches.
In this article we introduce a methodology to compute the model evidence, in order to facilitate Bayesian model comparison, for modern neural density estimation approaches to SBI. Our methodology leverages the learnt harmonic mean estimator (McEwen et al. 2021). We exploit the fact that the learnt harmonic mean estimator is agnostic to sampling strategy and only requires samples from the posterior distribution. In some neural density estimation approaches samples can be generated directly (e.g. by pushing samples from a simple base distribution, such as a Gaussian, forward through a normalising flow; Papamakarios et al. 2021). To support such neural density estimation approaches it is therefore essential that model evidence computation is decoupled from sampling strategy, as is the case with the learnt harmonic mean estimator.
The remainder of this article is structured as follows. In Sec. 2 and Sec. 3 we concisely review, respectively, Bayesian model comparison and neural density estimation approaches to SBI. In Sec. 4 we introduce our methodology to perform Bayesian model comparison in the context of SBI, leveraging the learnt harmonic mean estimator (McEwen et al. 2021). We present algorithms to compute the Bayesian model evidence for each of the three main neural density estimation approaches to SBI that are reviewed by Cranmer et al. (2020). In Sec. 5 we report the results from numerical experiments that demonstrate and validate our methodology. Finally, we conclude in Sec. 6 with a review of our main findings.

BAYESIAN MODEL COMPARISON
We review here the fundamentals of Bayesian model comparison, focusing on the challenges associated with estimation of the model evidence. For a more extensive review we refer the reader to, e.g., Trotta (2008). We also summarise the key concepts underlying the learnt harmonic mean estimator, since it is an integral component of the current work (we refer the reader to McEwen et al. 2021 for further details).

Bayesian model evidence
The definition of the model evidence in a Bayesian statistical framework follows directly from Bayes' theorem. For a given model M, Bayes' theorem describes the connection between the conditional probabilities of model parameters θ and data d:

p(θ | d, M) = L(θ) π(θ) / z ,   (1)

where L(θ) = p(d | θ, M) denotes the likelihood, π(θ) = p(θ | M) the prior, and the normalising constant

z = p(d | M) = ∫ dθ L(θ) π(θ)   (2)

is the model evidence. Since the model evidence is a normalisation factor for the posterior distribution and is independent of the model parameters, the evidence is usually disregarded in parameter estimation tasks. However, for model selection the evidence becomes the crucial quantity to compute. Being the integral of the likelihood over the prior (cf. Eq. 2), the evidence allows one to assign relative weights to different models. The evidence ratio between two competing models M₁ and M₂ enters the expression for the comparison of their posterior probabilities, which again follows from Bayes' theorem:

p(M₁ | d) / p(M₂ | d) = [ p(d | M₁) p(M₁) ] / [ p(d | M₂) p(M₂) ] .   (3)

In many cases the a priori probabilities p(M₁) and p(M₂) of the two models are considered to be equal, hence the ratio of posterior model probabilities becomes equivalent to the evidence ratio, or Bayes factor,

B₁₂ = z₁ / z₂ = p(d | M₁) / p(d | M₂) .   (4)

For notational brevity, henceforth we drop the explicit conditioning on models unless there are multiple models under consideration.

Algorithms for evidence estimation
Computing the evidence for a given model is numerically challenging due to the multi-dimensional integral in Eq. 2. Many techniques have been proposed to tackle this challenge, such as thermodynamic integration (e.g. Beltran et al. 2005; Gregory 2005; Bridges et al. 2006) and the Savage-Dickey density ratio. Nested sampling reduces the computation of the evidence to the evaluation of a one-dimensional integral and, as a byproduct, provides samples that can be used to compute posterior inferences, thus supporting both parameter estimation and model selection. Multimodal nested sampling, implemented in the MultiNest software (Feroz & Hobson 2008; Feroz et al. 2009; see also Buchner et al. 2014 for the Python wrapper PyMultiNest), has seen enormous success, with widespread application across multiple research fields, as has the slice sampling nested sampling algorithm implemented in the PolyChord software (Handley et al. 2015a,b). Proximal nested sampling (Cai et al. 2021) is implemented in the ProxNest software, which has only been released very recently but is likely to be particularly useful for inverse imaging problems.
Measuring the model evidence is a numerical process that, if repeated multiple times, produces a distribution of values. Ideally, these distributions would be narrow and, even more importantly, they should provide unbiased estimates of the model evidence. However, this might not always be the case (Lemos et al. 2022). It is, therefore, crucial to develop alternative ways to estimate the model evidence, so as to perform cross-checks on the final model selection statements. Furthermore, as the name nested sampling suggests, sampling must be performed in a prescribed nested manner, where samples are drawn from the prior distribution within likelihood level-sets (isocontours). This close coupling between sampling strategy and computation of the model evidence reduces flexibility. In some settings sampling may be performed by alternative means and nested sampling approaches to compute the evidence may not be possible.
The recently proposed learnt harmonic mean estimator (McEwen et al. 2021) for computation of the model evidence removes this restriction, since it only requires samples from the posterior and is agnostic to how those samples are generated. This is the key property that we leverage in the current work; hence we concisely review the learnt harmonic mean estimator below.

Learnt harmonic mean estimator
The original harmonic mean estimator for computation of the Bayesian model evidence was first introduced by Newton & Raftery (1994). From Bayes' theorem it follows that the reciprocal evidence is related to the harmonic mean of the likelihood by

ρ = E_{p(θ|d)}[ 1 / L(θ) ] = ∫ dθ p(θ | d) / L(θ) = 1 / z .   (5)

This relation can be used to construct an estimator of the reciprocal evidence from N samples {θ_i}_{i=1}^N of the posterior distribution p(θ | d):

ρ̂ = (1/N) ∑_{i=1}^N 1 / L(θ_i) ,  θ_i ∼ p(θ | d) .   (6)

However, this estimator may exhibit very large or even diverging variance (Neal 1994). Gelfand & Dey (1994) proposed a modification of the original harmonic mean estimator, which we call the re-targeted harmonic mean estimator, introducing a normalised target distribution φ(θ) to define the modified estimator

ρ̂ = (1/N) ∑_{i=1}^N φ(θ_i) / [ L(θ_i) π(θ_i) ] ,  θ_i ∼ p(θ | d) ,   (7)

from which the original harmonic mean estimator is recovered for φ(θ) = π(θ).
The original harmonic mean estimator can be interpreted as importance sampling, where the importance sampling distribution is the posterior and the target distribution is the prior. It is therefore not surprising that the original estimator suffers from poor variance properties, since the prior is typically broader than the posterior, whereas importance sampling requires the sampling density to be broader than the target. By introducing a new target φ(θ) this issue can be circumvented, provided φ(θ) is narrower than the posterior. Critically, however, the introduced target must be a normalised probability distribution.
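To make the estimator concrete, the following toy example (our own illustration, not one of the paper's experiments) applies the re-targeted estimator of Eq. 7 to a one-dimensional Gaussian problem whose evidence is known in closed form; the choice of a narrowed Gaussian as the target φ(θ) is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: prior N(0, 1), likelihood L(theta) = N(d0; theta, 1) with d0 = 0.
# The posterior is then N(0, 1/2) and the evidence is z = N(d0; 0, 2).
d0 = 0.0
z_true = np.exp(-d0**2 / 4) / np.sqrt(2 * np.pi * 2)

def norm_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Posterior samples (generated here analytically; any sampler could be used).
theta = rng.normal(0.0, np.sqrt(0.5), size=100_000)

# Re-targeted harmonic mean estimator: the target phi must be normalised
# and narrower than the posterior, here N(0, 1/4).
likelihood = norm_pdf(d0, theta, 1.0)
prior = norm_pdf(theta, 0.0, 1.0)
target = norm_pdf(theta, 0.0, 0.25)

rho_hat = np.mean(target / (likelihood * prior))  # estimate of 1/z
z_hat = 1.0 / rho_hat

print(z_true, z_hat)
```

With a target narrower than the posterior the summands remain bounded, and the estimate converges quickly; had we instead used the prior as the target (the original estimator), the variance would be far worse.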
The question remains: how does one set the target distribution? A variety of approaches have been considered previously; however, none have proved completely satisfactory. One approach is to consider a multivariate Gaussian (Chib 1995); however, such a target typically has tails that are too broad. An alternative is to consider indicator functions (Robert & Wraith 2009; van Haasteren 2014); however, for complicated posterior distributions these typically capture a small region of parameter space and so are inefficient.
It was recognised by McEwen et al. (2021) that the optimal target distribution is the normalised posterior itself:

φ^optimal(θ) = L(θ) π(θ) / z = p(θ | d) .   (8)

While this exact quantity is a priori inaccessible, since it involves knowledge of the evidence z (precisely the quantity we are attempting to estimate), an approximation of φ^optimal(θ) can be derived from posterior samples by machine learning techniques. This is the rationale of the learnt harmonic mean estimator of McEwen et al. (2021). Moreover, the learnt approximation of the posterior need not be highly accurate; critically, however, it must have narrower tails than the posterior. Strategies to learn appropriate targets with these properties are presented in McEwen et al. (2021), setting up an optimisation problem to minimise the variance of the resulting estimator, while ensuring it is unbiased, with possible regularisation. The learnt harmonic mean estimator is implemented in the harmonic software.
We conclude this section by highlighting that the learnt harmonic mean estimator produces estimates of the evidence purely from samples of the posterior distribution; there is no requirement on the specific method used for sampling, i.e. the estimator is agnostic to the method used to generate posterior samples. As we shall see later in this work, this is the key property of the learnt harmonic mean estimator that allows it to be used in a variety of simulation-based inference scenarios.

SIMULATION-BASED INFERENCE (SBI)
We provide a brief overview of the main algorithms used for SBI (simulation-based inference), referring the reader to Cranmer et al. (2020) for a more extensive review. The focus of recent SBI developments and existing literature is on parameter estimation, hence we discuss SBI in this context. In Sec. 4 we introduce methodologies to perform Bayesian model comparison in an SBI setting.
The original SBI methodology, based on ABC (approximate Bayesian computation; see, e.g., Beaumont 2019), involves simulating realisations of the observables at each of the explored points in parameter space, and accepting or rejecting these points based on their similarity with the observed data, within a tolerance ε. This rejection sampling method recovers an accurate representation of the underlying density distribution in the limit ε → 0, at which point the low simulation efficiency makes computational costs infeasibly high for inference in parameter spaces of even moderate dimensionality. More recently, neural density estimation techniques have been introduced to overcome this computational limitation by increasing simulation efficiency. We focus the remainder of this article on neural density estimation approaches for SBI.
In contrast to ABC, neural density estimation leverages deep neural networks to approximate conditional probability densities and is able to speed up inference by orders of magnitude (Papamakarios & Murray 2016). Neural density estimation involves training a conditional density estimator q_w, parameterised by weights w, to approximate a target distribution (either the posterior distribution, the likelihood function or a density ratio proportional to the likelihood) from a training set of N pairs of (typically) prior samples and simulations {θ_i, d_i}_{i=1}^N. Provided the density estimator is sufficiently expressive, q_w will recover an accurate estimate of the target distribution in the limit N → ∞.
Neural density estimation workflows have three main phases: simulation, training and inference. The simulation phase generates the training pairs {θ_i, d_i} that are used in the training phase to tune the weights w of the neural network such that q_w approximates the target density. In the inference phase, q_w is then conditioned on a specific observation d_0 and parameter inference is performed.
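The three phases can be sketched end-to-end with a deliberately simple stand-in for the neural network: a linear-Gaussian conditional density estimator fitted by least squares. The simulator, prior and observation below are our own illustrative assumptions, but the simulate/train/infer structure mirrors the workflow described above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulation phase: draw prior samples and simulate data pairs {theta_i, d_i}.
N = 20_000
theta = rng.normal(0.0, 1.0, N)          # prior: theta ~ N(0, 1)
d = theta + rng.normal(0.0, 1.0, N)      # simulator: d = theta + noise

# Training phase: fit a conditional density q(theta | d) = N(a + b*d, s^2).
# A linear-Gaussian model stands in for a neural density estimator; its
# maximum-likelihood fit reduces to ordinary least squares.
A = np.column_stack([np.ones(N), d])
coeff, *_ = np.linalg.lstsq(A, theta, rcond=None)
a, b = coeff
s2 = np.mean((theta - (a + b * d)) ** 2)

# Inference phase: condition the trained estimator on an observation d0.
d0 = 1.0
post_mean, post_var = a + b * d0, s2
print(post_mean, post_var)  # analytic posterior for this toy: N(d0/2, 1/2)
```

For this linear-Gaussian toy the fit recovers the known posterior; a neural estimator plays the same role when the simulator is not analytically tractable.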
Running the simulation and training phases once amortises the training of the density estimator, allowing offline inference to be run on multiple different observations; this is aptly named amortised neural density estimation. However, we are often interested in inference for a specific observation d_0. For this, amortised neural density estimation tends to be inefficient, as generating training pairs for the density estimator across the entire prior parameter support includes many points in parameter space with very low posterior density p(θ | d_0).
To rectify this simulation inefficiency one can run sequential neural density estimation, where multiple rounds of simulation and training are run sequentially to ensure a greater focus on regions of high posterior density. This is done by generating simulations from an alternative proposal prior distribution p̃(θ). This proposal distribution is iteratively updated between rounds such that the proposal posterior of the r-th round, p̃_r(θ | d_0), becomes the proposal prior of the subsequent round, p̃_{r+1}(θ). This sequential approach can further increase simulation efficiency by orders of magnitude compared to the amortised counterpart (Papamakarios & Murray 2016), at the expense of forgoing observation-agnostic flexibility.
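The sequential idea can be sketched as follows, with a weighted Gaussian refit standing in for retraining a neural density estimator each round (the simulator, weighting kernel and round count are illustrative assumptions). Note that naively refitting on proposal samples tends to over-concentrate the proposal, which is one reason correction schemes are needed in sequential NPE.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy simulator and observation.
simulate = lambda theta: theta + rng.normal(0.0, 1.0, theta.shape)
d0 = 0.0

# Sequential refinement: each round, simulate from the current proposal,
# weight parameter draws by how closely their simulations match d0, and
# refit the proposal. The weighted Gaussian refit is a crude stand-in for
# retraining a neural density estimator on the new simulations.
mean, std = 0.0, 4.0  # broad initial proposal (stands in for the prior)
history = [std]
for _ in range(3):
    theta = rng.normal(mean, std, 5_000)
    d = simulate(theta)
    w = np.exp(-0.5 * (d - d0) ** 2)   # closeness weight
    w /= w.sum()
    mean = np.sum(w * theta)
    std = np.sqrt(np.sum(w * (theta - mean) ** 2))
    history.append(std)

print(history)  # the proposal narrows towards high posterior density regions
```

Each round concentrates simulations where the posterior mass lies, which is the source of the efficiency gain over the amortised approach.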
We briefly review the three main approaches to neural density estimation (Cranmer et al. 2020). When referring to variants of these implementations we follow the nomenclature of Durkan et al. (2020).

Neural posterior estimation (NPE)
Neural posterior estimation (NPE) was first introduced by Papamakarios & Murray (2016) and involves training a conditional density estimator to approximate the posterior density, such that q_w(θ | d) → p(θ | d), by minimising the loss function

ℒ(w) = − ∑_{i=1}^N log q_w(θ_i | d_i) .   (9)

For sequential neural posterior estimation, iteratively updating the proposal distribution between inference rounds results in the density estimator learning a proposal posterior density p̃(θ | d) that is related to the true posterior density by

p̃(θ | d) = p(θ | d) [ p̃(θ) p(d) ] / [ p(θ) p̃(d) ] .   (10)

Three variants of NPE have been introduced to recover the true posterior from the proposal posterior.
The original neural posterior estimation method (NPE-A; Papamakarios & Murray 2016) trains a mixture density network to target the posterior distribution. A post-hoc analytical correction is then applied to the resulting proposal posterior to recover an approximation of the true posterior (cf. Eq. 10). NPE-A considers Gaussian or Gaussian mixture proposal distributions so that the correction factor can be computed analytically.
To circumvent the requirement for analytical computation, Lueckmann et al. (2018) propose a method (NPE-B) where the proposal correction is embedded as an importance weight in the loss function. Whilst more flexible than NPE-A, this method has been shown to suffer poor performance, as the importance weights in the loss function are susceptible to high variance, resulting in early termination of training.
Finally, Greenberg et al. (2019) propose a neural posterior estimation method (NPE-C) that reparameterises the problem to recover a learnt approximation q_w(θ | d) of the true posterior from a density estimator q̃_w(θ | d) of the proposal posterior, using a tractable sum of discrete atomic proposals over the support of the true posterior. This latter approach allows more flexibility in the choice of density estimator, including cutting-edge normalising flow models (Papamakarios et al. 2017; Durkan et al. 2019). In our subsequent experiments we only consider this neural posterior estimation method, which we simply refer to as NPE for the remainder of this article.
NPE learns the posterior density directly, typically with a probabilistic model from which samples can be drawn directly. For example, samples can be drawn directly from a Gaussian mixture density network or from a normalising flow, where for the latter samples are first drawn from a simple base distribution, such as a Gaussian, and pushed forward through the flow to yield samples of the target distribution. Consequently, generating samples avoids the need for Markov chain Monte Carlo (MCMC) sampling and can be performed rapidly and in parallel, significantly reducing computation time for inference.
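A minimal sketch of direct sampling, using a single affine (shift-and-scale) transform as a stand-in for a full normalising flow; the "trained" flow parameters mu and sigma are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# A normalising flow transports a simple base density into the target.
# Here one affine layer stands in for a deep flow; the learned parameters
# mu and sigma are assumed to be given by training.
mu, sigma = 0.5, 0.7

# Sampling: push base samples forward through the flow. No MCMC is needed
# and all samples are generated in one vectorised (parallel) operation.
z = rng.normal(0.0, 1.0, 100_000)   # base distribution samples
theta = mu + sigma * z              # pushforward: theta ~ N(mu, sigma^2)

# The flow's density is also available via the change-of-variables formula:
# log q(theta) = log N(z; 0, 1) - log |d theta / d z|
log_q = -0.5 * z**2 - 0.5 * np.log(2 * np.pi) - np.log(sigma)

print(theta.mean(), theta.std())
```

The same two-line sample-and-pushforward pattern applies to expressive flows; only the transform changes.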

Neural likelihood estimation (NLE)
Neural likelihood estimation (NLE) was first introduced by Papamakarios et al. (2019) and involves training a conditional density estimator to approximate the likelihood function (considering it as a probability distribution over the data), such that q_w(d | θ) → p(d | θ), by minimising the loss function

ℒ(w) = − ∑_{i=1}^N log q_w(d_i | θ_i) .   (11)

In contrast to NPE, sequential NLE can be implemented without a correction between the proposal-trained estimator q̃_w(d | θ) and the target q_w(d | θ), since the likelihood does not depend on the proposal distribution over the parameters. In principle, simulations can be generated from any proposal distribution and, consequently, simulations from all sequential rounds, not just the latest, can be used for training (Papamakarios et al. 2019).
This ability to seamlessly optimise simulation efficiency, however, comes at the expense of requiring an external MCMC sampling stage to generate samples from the surrogate posterior q_w(d_0 | θ) π(θ) for inference, which increases inference time and computational cost significantly relative to NPE, where samples can be generated directly.
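A minimal sketch of this external sampling stage, assuming a one-dimensional Gaussian surrogate likelihood in place of a trained NLE network, with a simple Metropolis-Hastings sampler (the toy densities, step size and chain length are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-ins for the trained NLE surrogate and the prior (log-densities).
# In practice log_surrogate_like would be a trained network q(d0 | theta).
d0 = 1.0
log_surrogate_like = lambda theta: -0.5 * (d0 - theta) ** 2
log_prior = lambda theta: -0.5 * theta**2

log_post = lambda theta: log_surrogate_like(theta) + log_prior(theta)

# Metropolis-Hastings over the unnormalised surrogate posterior; MCMC only
# needs density ratios, so the unknown normalisation (the evidence) cancels.
theta, samples = 0.0, []
lp = log_post(theta)
for _ in range(50_000):
    prop = theta + rng.normal(0.0, 1.0)
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    samples.append(theta)
samples = np.array(samples[5_000:])  # discard burn-in

print(samples.mean(), samples.var())  # analytic posterior here: N(0.5, 0.5)
```

This per-observation chain is the overhead NLE (and NRE) incur relative to NPE's direct sampling.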

Neural ratio estimation (NRE)
Neural ratio estimation (NRE) was first introduced by Hermans et al. (2020) and involves learning a surrogate r̂_σ(d, θ) of a density ratio proportional to the likelihood, where σ denotes the model weights. This is done by training a binary classifier to discriminate samples drawn from the joint and marginal distributions of training pairs. The classifier then learns the ratio

r(d, θ) = p(d, θ) / [ p(d) p(θ) ] = p(d | θ) / p(d) = p(θ | d) / p(θ) .   (12)

A further NRE variant was devised by Durkan et al. (2020), which we adopt for our numerical experiments and simply refer to as NRE for the remainder of this paper.
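The classifier-to-ratio trick can be sketched with plain logistic regression in place of a neural network (the toy simulator and quadratic feature set are our own assumptions): discriminating joint pairs from marginal (shuffled) pairs yields a logit that converges to log r(d, θ).

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulate training pairs: theta ~ N(0, 1), d = theta + noise.
N = 20_000
theta = rng.normal(0.0, 1.0, N)
d = theta + 0.5 * rng.normal(0.0, 1.0, N)

# Class 1: joint pairs (theta, d); class 0: marginal pairs (theta shuffled).
theta_m = rng.permutation(theta)
X1 = np.column_stack([theta, d])
X0 = np.column_stack([theta_m, d])

def features(X):
    t, x = X[:, 0], X[:, 1]
    # Quadratic features suffice here since the true log-ratio is quadratic.
    return np.column_stack([np.ones_like(t), t, x, t * x, t**2, x**2])

F = np.vstack([features(X1), features(X0)])
y = np.concatenate([np.ones(N), np.zeros(N)])

# Logistic regression by full-batch gradient descent; the learnt logit
# f(d, theta) converges to log r(d, theta) = log[p(d,theta)/(p(d)p(theta))].
w = np.zeros(F.shape[1])
for _ in range(2_000):
    p = 1.0 / (1.0 + np.exp(-F @ w))
    w -= 0.5 * F.T @ (p - y) / len(y)

log_r = lambda t, x: (features(np.array([[t, x]])) @ w)[0]

# Matched pairs get a high ratio, mismatched pairs a low one.
print(log_r(1.0, 1.0), log_r(1.0, -1.0))
```

Because r(d, θ) = p(θ | d)/p(θ), multiplying the learnt ratio by the prior gives an unnormalised surrogate posterior, which is what the MCMC stage below samples from.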
Similarly to NLE, an additional MCMC sampling step is required for NRE in order to generate samples from the surrogate posterior r̂_σ(d_0, θ) π(θ). As with NLE, this increases inference time and computational cost significantly relative to NPE.

BAYESIAN MODEL COMPARISON FOR SBI
We discussed the importance and challenge of computing the model evidence for Bayesian model selection, a fundamental component of many scientific analyses, in Sec. 2. Separately, in Sec. 3 we discussed three recent neural density estimation techniques for parameter estimation in an SBI (simulation-based inference) setting, which offer great promise for scientific analyses where the likelihood is often intractable or too costly to evaluate. In this section we unify these two topics by introducing a methodology to compute the Bayesian model evidence in SBI scenarios.
We do so by presenting three novel techniques to compute the evidence for neural posterior, likelihood and ratio estimation methods (NPE, NLE and NRE, respectively). In particular, we leverage the recently proposed learnt harmonic mean estimator and exploit the fact that it is decoupled from the method used to generate posterior samples, i.e. it requires posterior samples only, which may be generated by any approach. This flexibility, which is lacking in many alternative methods for computing the model evidence, allows us to develop SBI model comparison techniques for all three neural density estimation approaches. The evidence computation technique corresponding to each neural density estimation approach is represented schematically in Fig. 1. Our approaches support density estimation training phases run in either an amortised or sequential setting.

Neural posterior estimation (NPE)
Our approach to compute the evidence for NPE is shown schematically in Fig. 1(a). We use NPE to learn an approximation q_w^NPE(θ | d) of the posterior, parameterised by network weights w. This approach provides the ability to rapidly generate samples directly from the surrogate posterior, i.e. θ^direct ∼ q_w^NPE(θ | d). While NPE also provides the ability to evaluate the surrogate normalised posterior, the normalisation constant itself, i.e. the model evidence, is not accessible. To compute the model evidence we therefore adopt the learnt harmonic mean estimator, using the samples drawn directly from the surrogate posterior. For the learnt harmonic mean estimator it is also necessary to evaluate the likelihood at sample positions (see Eq. 7), hence we adopt NLE to provide a surrogate likelihood. Using NLE we learn an approximation q_v^NLE(d | θ) of the likelihood, parameterised by a separate set of network weights v. With a set of posterior samples and the surrogate likelihood learned by NLE to hand, we use the learnt harmonic mean estimator to obtain an estimate of the reciprocal of the model evidence by

ρ̂ = (1/N) ∑_{i=1}^N φ(θ_i) / [ q_v^NLE(d_0 | θ_i) π(θ_i) ] ,  θ_i ∼ q_w^NPE(θ | d_0) .   (13)

The proposed approach to compute the model evidence in the NPE setting does involve training two neural density estimators, both NPE and NLE. However, it does not require any MCMC sampling. Samples can be generated directly from the surrogate posterior learnt by NPE (e.g. by pushing samples from a simple base distribution such as a Gaussian through a normalising flow), which is highly efficient and can also be computed in parallel. This is only possible due to the flexibility of the learnt harmonic mean estimator, which, as commented already, is agnostic to sampling strategy, requiring only posterior samples that can be generated by any technique.
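A toy sketch of this pipeline, in which exact Gaussian densities stand in for the trained NPE and NLE surrogates (an idealisation: in practice both would be networks with approximation error). Samples are drawn directly from the surrogate posterior, the surrogate likelihood is evaluated at those samples, and the learnt harmonic mean estimator is applied, with no MCMC anywhere.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy 2-parameter problem: prior N(0, I), simulator d = theta + noise(0, I),
# observation d0 = (0, 0). The true evidence is N(d0; 0, 2I) evaluated at d0.
d0 = np.zeros(2)
z_true = 1.0 / (2 * np.pi * 2.0)

def gauss(x, mu, var):
    # Product of independent 1D Gaussian densities (isotropic covariance).
    return np.prod(np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var), axis=-1)

# "NPE": draw samples directly from the surrogate posterior N(0, I/2);
# no MCMC is involved and the draws are trivially parallel.
theta = rng.normal(0.0, np.sqrt(0.5), size=(100_000, 2))

# "NLE": surrogate likelihood q(d0 | theta), here N(d0; theta, I).
surrogate_like = gauss(d0, theta, 1.0)
prior = gauss(theta, 0.0, 1.0)
target = gauss(theta, 0.0, 0.25)  # learnt target: normalised, narrower tails

rho_hat = np.mean(target / (surrogate_like * prior))  # reciprocal evidence
print(1.0 / rho_hat, z_true)
```

Note the surrogate posterior density is never evaluated; only its samples are used, exactly as in the approach described above.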
Given trained NPE and NLE surrogate densities, an alternative naïve technique can also be considered to estimate the model evidence. For any model parameter θ the ratio of the unnormalised surrogate posterior, formed from the surrogate likelihood and prior, to the normalised surrogate posterior, i.e.

ẑ = q_v^NLE(d_0 | θ) π(θ) / q_w^NPE(θ | d_0) ,   (14)

provides an estimate of the evidence. An estimate of the evidence is thus recovered for a single parameter θ, which need not be drawn from any particular distribution. However, such an estimate of the evidence will be incredibly noisy, i.e. it will have an extremely large variance. Many parameters can be used to generate many estimates of the evidence that can be averaged. Nevertheless, the resulting estimate of the evidence remains highly noisy, with a very large variance. This naïve estimator relies on the ratio of two approximate quantities, hence approximation errors compound. Contrast this with the learnt harmonic mean estimator. While our learnt harmonic mean approach does use NPE to learn a surrogate posterior q_w^NPE(θ | d), the density is never evaluated; we only require samples from the corresponding distribution. The learnt harmonic mean does require learning the importance target φ(θ), and this is indeed learnt to approximate the posterior, but the target need only be normalised and have tighter tails than the true posterior p(θ | d); it does not need to be an accurate approximation of the posterior. Consequently, our proposed approach to compute the evidence in the NPE setting, based on the learnt harmonic mean estimator, does not suffer compounding sources of error and thus provides increased stability over the naïve approach.
While the focus of the current article is SBI, we also comment that the ideas presented here can be applied to accelerate evidence computation for likelihood-based inference. Crucially, throughout our approach to compute the evidence in the NPE setting, MCMC sampling is not required. Posterior samples can be generated directly, rapidly and in parallel. If a likelihood is available, it can simply be substituted for the surrogate likelihood learnt by NLE. Therefore, in the likelihood-based setting the approach can be altered to leverage the speed of posterior sample generation of NPE, while adopting the analytical likelihood function, to obtain a rapid estimate of the reciprocal evidence without any further computation, as described by

ρ̂ = (1/N) ∑_{i=1}^N φ(θ_i) / [ L(θ_i) π(θ_i) ] ,  θ_i ∼ q_w^NPE(θ | d_0) .   (15)

Clearly, in this setting NLE need not be performed.

Neural likelihood estimation (NLE)
Our approach to compute the evidence for NLE is shown schematically in Fig. 1(b). We use NLE to learn an approximation q_w^NLE(d | θ) of the likelihood and generate posterior samples by MCMC from the unnormalised surrogate posterior q_w^NLE(d_0 | θ) π(θ), which MCMC can sample without knowledge of the normalisation. The learnt harmonic mean estimator then provides an estimate of the reciprocal of the model evidence by

ρ̂ = (1/N) ∑_{i=1}^N φ(θ_i) / [ q_w^NLE(d_0 | θ_i) π(θ_i) ] ,  θ_i ∼ q_w^NLE(d_0 | θ) π(θ) .   (16)

This proposed approach to compute the model evidence in the NLE setting involves training only one neural density estimator, which is decidedly more efficient than training two such estimators as required in the NPE and NRE settings (cf. Sec. 4.1 and Sec. 4.3). However, it does require MCMC sampling to generate samples from the unnormalised surrogate posterior, which is required to compute the evidence using the learnt harmonic mean estimator.
With a trained NLE surrogate density q_w^NLE(d | θ) and the prior π(θ) to hand, alternative techniques that only require the likelihood function and prior could also be considered to compute an estimate of the evidence. We adopt the learnt harmonic mean estimator since it decouples evidence estimation from the MCMC sampling strategy, providing greater flexibility, and has also been shown to be accurate and robust.

Neural ratio estimation (NRE)
Our approach to compute the evidence for NRE is shown schematically in Fig. 1(c). We use NRE to learn a surrogate ratio r̂_σ(d, θ) and generate posterior samples by MCMC from the surrogate posterior r̂_σ(d_0, θ) π(θ). Since the learnt harmonic mean estimator requires the likelihood to be evaluated at the sample positions, we again adopt NLE to learn a surrogate likelihood q_v^NLE(d | θ), giving the estimate of the reciprocal of the model evidence

ρ̂ = (1/N) ∑_{i=1}^N φ(θ_i) / [ q_v^NLE(d_0 | θ_i) π(θ_i) ] ,  θ_i ∼ r̂_σ(d_0, θ) π(θ) .   (17)

The proposed approach to compute the model evidence in the NRE setting does involve training two neural density estimators, both NRE and NLE. Furthermore, external MCMC sampling is required to generate samples from the trained NRE surrogate posterior.
An alternative way to compute the Bayes factor B₁₂ between two competing models M₁ and M₂, i.e. the ratio of their model evidences (cf. Eq. 4), in the NRE setting is to train an additional NRE model as a binary classifier to discriminate samples from the joint and marginal distributions of the two models, respectively. The classifier then learns the ratio

r̂_{σ₁₂}(d, θ) = p(d, θ | M₁) / [ p(d | M₂) p(θ | M₁) ] = p(d | θ, M₁) / p(d | M₂) ,   (18)

where σ₁₂ denotes the network weights for a model trained in such a manner. Following similar notation, the standard neural ratio for a single model, say model M₁, can be denoted r̂_{σ₁₁}. While it is not possible to estimate the evidence of a single model directly, the Bayes factor comparing the two models, which is the critical quantity for model comparison, can then be recovered by

B₁₂ = r̂_{σ₁₂}(d₀, θ) / r̂_{σ₁₁}(d₀, θ) .   (19)

We understand this method is already known to the SBI community but were not able to locate any references discussing or applying it. Since this approach does not lie within the family of methodologies introduced in the current article, which leverage the learnt harmonic mean estimator to compute the model evidence from samples of the surrogate posterior distribution, we leave the analysis of this approach to further work.

(Figure caption fragment: The light brown section of the plot shows results for the simulation-based evidence pipelines summarised in Fig. 1. These are all based on the use of harmonic to derive evidence estimates from posterior samples obtained with neural posterior estimation (NPE), neural likelihood estimation (NLE) and neural ratio estimation (NRE), in their amortised and sequential variants.)

NUMERICAL EXPERIMENTS
Here we present the results of our numerical experiments that demonstrate and validate our SBI evidence calculation techniques. For validation purposes, for each problem we compare the value of the evidence computed by our proposed approach (which, we stress, does not use any knowledge of the likelihood) to values computed by likelihood-based approaches, either derived analytically (when possible) and/or computed numerically by likelihood-based algorithms (e.g. nested sampling and/or the learnt harmonic mean estimator). Of course, in practical SBI settings likelihood-based alternatives will not typically be available. Nevertheless, it is useful to consider problems here where the likelihood is available so that we can validate our SBI evidence computation techniques against likelihood-based alternatives. All of the SBI examples were implemented using the sbi software (Tejero-Cantero et al. 2020).

Linear Gaussian
The first problem we consider is that of a simple simulator which linearly adds Gaussian noise nᵢ to the value of the parameters θᵢ, for i = 1, 2, 3:

x = θ + n,  n ∼ N(0, I₃).

This is a standard test problem in the sbi software. The Gaussian noise has zero mean and unit variance, and for the model parameters we assume a uniform prior θᵢ ∼ U[−2, 2] for each component i = 1, 2, 3. The likelihood for this model is Gaussian in the parameters θ = {θ₁, θ₂, θ₃}:

L(x|θ) = (2π)^(−3/2) exp( −‖x − θ‖²/2 ).

We assume an observation x₀ = (0, 0, 0). For this model the Bayesian evidence can be computed analytically:

z = (1/4³) ∫_{[−2,2]³} L(x₀|θ) dθ = [erf(√2)/4]³.   (22)

Fig. 2 summarises the results obtained from our model evidence estimates for the linear Gaussian problem. The pink background section shows results for likelihood-based methods for validation purposes, while results for SBI (likelihood-free) methods are shown in the light brown region. For all methods we repeat the evidence estimation exercise 25 times to empirically characterise the statistical distributions of the model evidence estimates, shown by the blue areas in each 'violin' of Fig. 2. All of the evidence values reported in Fig. 2 can be compared with the analytical value of Eq. 22, overplotted in red. Likelihood-based approaches include: (a) MultiNest, which produces samples and evidence estimates, run with importance nested sampling (Feroz et al. 2019), 1000 live points, a sampling efficiency of 0.3 and an evidence tolerance of 0.01; (b) a second nested sampler, which also produces samples as well as evidence estimates, run using 1000 live points; (c) the learnt harmonic mean estimator, which produces evidence estimates from posterior samples; for this we adopt the affine sampler of Goodman & Weare (2010), implemented in the emcee software (Foreman-Mackey et al. 2013), to generate 10^5 post burn-in posterior samples from 100 random walkers, and adopt a hypersphere model for the learnt importance target, with radius equal to the square root of the number of parameters.
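Under the stated setup (uniform prior U[−2, 2]³, unit-variance Gaussian likelihood, observation x₀ = 0), the analytic log-evidence reduces to 3 ln(erf(√2)/4). The sketch below evaluates this and cross-checks it with a brute-force Monte Carlo average of the likelihood over the prior; this is a sanity check, not the harmonic mean pipeline used in the experiments.

```python
import numpy as np
from math import erf, sqrt, log, pi

rng = np.random.default_rng(2)

# Analytic evidence for x0 = (0, 0, 0): each of the 3 dimensions contributes
# (1/4) * Integral_{-2}^{2} N(0; theta, 1) dtheta = erf(sqrt(2)) / 4.
ln_z_analytic = 3.0 * log(erf(sqrt(2.0)) / 4.0)

# Brute-force Monte Carlo check: z = E_prior[ L(x0 | theta) ], theta ~ U[-2,2]^3.
n = 200_000
theta = rng.uniform(-2.0, 2.0, size=(n, 3))
ln_like = -1.5 * log(2.0 * pi) - 0.5 * np.sum(theta**2, axis=1)
ln_z_mc = np.logaddexp.reduce(ln_like) - log(n)

print(ln_z_analytic, ln_z_mc)
```

Averaging the likelihood over prior draws is viable only because the prior and posterior overlap strongly in this toy; it degrades rapidly in higher dimensions, which is precisely why dedicated estimators are needed.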
All three likelihood-based methods provide unbiased average estimates of the evidence. Notice that for this example the evidence estimates computed by the learnt harmonic mean estimator have a standard deviation approximately four times smaller than those of the nested samplers. The results for the SBI methods are shown in the light brown section of the plot. We report results for NPE, NLE and NRE, and for each of them we provide an estimate using the amortised as well as the sequential approach. For NPE and NRE we use 10^5 simulations in the amortised approach, while in the sequential one we use 10 rounds with 10^4 simulations each, thus totalling the same number of simulations for the two approaches. NLE incurs increased computational cost at training time, therefore for this method we reduce the number of simulations by an order of magnitude in both amortised and sequential approaches, following Hermans et al. (2019). For each SBI method, after training a density estimator (in an amortised or sequential fashion) we collect a total of 10^5 posterior samples, either directly for NPE or by MCMC sampling with emcee (https://github.com/dfm/emcee) using 100 random walkers for NLE and NRE. We train the importance target model using 20% of the samples and use the remaining 80% to compute the evidence. As explained in Sec. 4.1 and Sec. 4.3, NPE and NRE require the additional training of an NLE estimator to provide a surrogate likelihood.
All of the SBI evidence computation techniques provide accurate estimates of the evidence, with the distribution ranges capturing the true analytic evidence. The variances of the SBI estimates are generally larger than those of the likelihood-based approaches, which is to be expected since, in contrast to the likelihood-based setting, we do not include any knowledge of the likelihood. We observe that performing inference using the sequential approach can result in an improvement in the estimate, but the difference is not always significant. The NPE and NRE approaches exhibit some bias, which is nevertheless within the spread of the distribution of values. The NLE estimates show excellent agreement with the reference values, despite using fewer simulations in training. This may be due to the fact that the NLE approach requires only a single neural density estimator (whereas the NPE and NRE approaches require two), resulting in fewer sources of approximation error.

Radiata pine
The second problem we consider is one of the classic benchmark examples used to evaluate techniques for estimating the model evidence (Friel & Wyse 2012; Enderlein 1961). We refer to McEwen et al. (2021), who demonstrated the effectiveness of the learnt harmonic mean estimator for this example, for a more detailed presentation of the problem; here we simply report the main points relevant to our evidence estimation task.
Our dataset comprises measurements of the maximum compression strength parallel to the grain yᵢ for Radiata pine trees, for i = 1, ..., 42. The original scientific problem can be stated in terms of two models: in Model 1 the density xᵢ is assumed as a predictor for yᵢ, while in Model 2 the predictor is the resin-adjusted density zᵢ. Both predictors are modelled with Gaussian noise; for Model 1, for instance,

yᵢ = α + β(xᵢ − x̄) + εᵢ,  εᵢ ∼ N(0, τ⁻¹),

where x̄ denotes the mean density. The model parameters are {α, β, τ}, whose prior distributions are

α ∼ N(μ_α, (r₀τ)⁻¹),  β ∼ N(μ_β, (s₀τ)⁻¹),  τ ∼ Ga(a₀, b₀),   (24)

with (μ_α, μ_β, r₀, s₀, a₀, b₀) = (3000, 185, 0.06, 6, 3, 2 × 300²). The evidence for this model can be computed analytically (cf. McEwen et al. 2021, Eq. 104); the numerical value of its logarithm is log z = −310.12829. Fig. 3 summarises our findings for the Radiata pine example; the colour coding is the same as in Fig. 2. We repeat the same experiments as in the linear Gaussian example, except this time for simplicity we do not attempt to calculate the evidence with the nested samplers (as this would require some effort to adapt the prior for the Radiata pine model to be compatible with the uniform distribution on the unit cube required by these nested samplers). Therefore, for the likelihood-based case we report only numerical results obtained by applying the learnt harmonic mean estimator to emcee samples, using a kernel density estimate for the importance target, with radius 0.02. As in the linear Gaussian example, we use 20% of 10^5 posterior samples from 100 random walkers to train the importance target, and compute an estimate of the evidence with the remaining 80%. As we can see in the pink section of the plot, this provides unbiased and tight estimates of the evidence.
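As a concrete illustration of the hierarchical prior in Eq. 24, the sketch below draws joint samples of (α, β, τ) with the hyperparameters quoted above. Note that numpy's gamma sampler is parametrised by shape and scale, so the rate b₀ enters as scale = 1/b₀; the Ga(a₀, b₀) shape/rate convention is assumed here.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hyperparameters (mu_alpha, mu_beta, r0, s0, a0, b0) of Eq. 24.
mu_a, mu_b, r0, s0, a0, b0 = 3000.0, 185.0, 0.06, 6.0, 3.0, 2.0 * 300.0**2

n = 50_000
tau = rng.gamma(shape=a0, scale=1.0 / b0, size=n)   # tau ~ Ga(a0, b0), rate b0
alpha = rng.normal(mu_a, 1.0 / np.sqrt(r0 * tau))   # alpha | tau ~ N(mu_a, (r0 tau)^-1)
beta = rng.normal(mu_b, 1.0 / np.sqrt(s0 * tau))    # beta  | tau ~ N(mu_b, (s0 tau)^-1)

print(alpha.mean(), beta.mean(), tau.mean())
```

Because α and β are conditioned on τ, their marginal spreads inherit the gamma prior's variability, which is the structure any sampler or surrogate must reproduce for this benchmark.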
In the light brown background section of Fig. 3 we can compare results for the SBI methods. The number of simulations used to train the density estimators in the various methods is the same as in the linear Gaussian example, as is the number of posterior samples used to train the importance target and derive evidence estimates. All of the SBI evidence pipelines provide reasonably accurate estimates of the evidence, with distribution ranges capturing the true analytic evidence. As in the linear Gaussian example, we observe that performing inference using the sequential approach can result in an improvement in the estimate, but the difference is not always significant. The NPE and NRE approaches again exhibit some bias, which is nevertheless within the spread of the distribution of values. The NLE estimates again show excellent agreement with the reference values.

Gravitational waves
The final problem we consider is a simulated measurement of a gravitational wave (GW) signal from a single interferometer. We consider a merger between two black holes of mass m₁ = m₂ = 20 M⊙, following a similar numerical setup to the one considered by Jeffrey & Wandelt (2020). We compute the noiseless time series of the strain signal using the PyCBC software (Biwer et al. 2019). We consider only the '+' polarization of the detector strain, h₊. The duration of the signal is ∼0.12 s, sampled at steps of duration ∼488 μs each. We rescale the signal by a multiplicative factor of 10²¹ in order to work with values of O(1). For each point in parameter space, our simulated observable is a noisy gravitational waveform obtained by adding Gaussian noise with zero mean and standard deviation σ = 0.3 to the noiseless template from PyCBC. For this inference problem we vary the two black hole masses m₁, m₂, each over a uniform prior U[10, 30] M⊙. We do not consider geometrical properties of the black holes such as spin or inclination angle, similarly to Jeffrey & Wandelt (2020). As noted by Hermans et al. (2021), who studied a similar GW SBI problem, such a simulated experimental setup is computationally demanding. Hence, for this example we run only 10 repetitions of each inference method to empirically characterise the statistical distribution of evidence estimates. These estimates mimic a realistic scenario within GW data analysis pipelines, comparing model evidences for two different numerical approximant models assumed in the generation of the noiseless template waveforms. The first of these two waveform models corresponds to the actual one used to generate the simulated observation, a reduced-order effective-one-body model (SEOBNR, Taracchini et al. 2014), which we refer to as the source model. The second model we consider is an inspiral-merger-ringdown phenomenological model (IMRPhenom, Hannam et al.
2014), which we refer to as the alternative model. For this configuration, we expect the Bayes factor comparing the models to favour the source SEOBNR model. We verify this numerically, taking advantage of the problem's deliberately low-dimensional parameter space, which allows us to compute an estimate of the evidence for each model by direct numerical integration. We find the logarithm of the Bayes factor computed by direct numerical integration to be ∼3.25, favouring the source model as anticipated.
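Direct numerical integration of the evidence is feasible here precisely because only two parameters are varied. The sketch below mimics the procedure on a deliberately simple stand-in problem: a scalar observable with two hypothetical forward models, y = m₁ + m₂ versus y = m₁m₂/10, over the same U[10, 30]² prior. It is not the waveform computation itself; the point is the grid-based evidence integral and the resulting log-Bayes factor.

```python
import numpy as np

# Grid over the 2D parameter space (m1, m2) with the U[10, 30]^2 prior.
grid = np.linspace(10.0, 30.0, 401)
dm = grid[1] - grid[0]
m1, m2 = np.meshgrid(grid, grid, indexing="ij")
prior = 1.0 / (20.0 * 20.0)  # flat prior density on [10, 30]^2

y0, sigma = 40.0, 1.0  # hypothetical scalar "observation" and noise level

def ln_evidence(pred):
    # z = Integral L(y0 | pred(m1, m2)) * prior dm1 dm2, evaluated on the grid.
    like = np.exp(-0.5 * ((y0 - pred) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma**2)
    return np.log(np.sum(like * prior) * dm * dm)

ln_z_source = ln_evidence(m1 + m2)        # hypothetical "source" model
ln_z_alt = ln_evidence(m1 * m2 / 10.0)    # hypothetical "alternative" model
ln_bayes_factor = ln_z_source - ln_z_alt
print(ln_bayes_factor)
```

Grid integration scales exponentially with dimension, which is why it serves only as a validation tool in this two-parameter example.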
We run the inference pipeline in multiple likelihood-based contexts, namely: (a) obtaining samples and evidence estimates with MultiNest, using the same configuration for this method as described in Sec. 5.1; (b) sampling the parameter space with emcee, collecting the same number of posterior samples as in both previous numerical examples, and using the learnt harmonic mean estimator to obtain an estimate of the evidence, with a kernel density estimate for the importance target distribution of radius 0.002 and 0.02 for the source and alternative models, respectively. Numerical results from these likelihood-based methods are reported in the pink background section of Fig. 4, where we show violin plots of the log-Bayes factors computed with each method. The distribution of log-Bayes factors always favours the true underlying source model and, as in previous examples, we find strong agreement between the likelihood-based approaches.
Numerical results for the SBI evidence estimates are shown in the light brown background section of Fig. 4. The evidence pipelines we consider are the same as described in Sec. 4, and for each pipeline the number of simulations used to train the density estimators is the same as for the linear Gaussian and Radiata pine examples in Sec. 5.1 and Sec. 5.2, for both amortised and sequential approaches. We also use the same number of posterior samples to train the importance target and derive estimates of the evidence. We notice that all SBI methods produce log-Bayes factor estimates in good agreement with the likelihood-based ones, albeit with a comparatively larger variance than presented in Fig. 2 and Fig. 3, due to compounding errors from the multiple evidence estimates required to obtain Bayes factors (cf. Eq. 4). NLE produces less biased estimates of the log-Bayes factors than NPE and NRE, a similar trend to that observed in the linear Gaussian and Radiata pine examples (cf. Fig. 2 and Fig. 3).
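The inflation of the Bayes-factor spread follows directly from error propagation: for independent evidence estimates, the variance of ln B₁₂ = ln z₁ − ln z₂ is the sum of the individual variances. A quick numerical check, with hypothetical spreads σ₁ and σ₂ chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical spreads of repeated ln-evidence estimates for two models.
sigma1, sigma2 = 0.10, 0.15
n = 200_000
ln_z1 = -4.3 + sigma1 * rng.standard_normal(n)
ln_z2 = -7.5 + sigma2 * rng.standard_normal(n)
ln_b = ln_z1 - ln_z2

# Independent errors add in quadrature: std(ln B) = sqrt(sigma1^2 + sigma2^2).
print(ln_b.std(), np.hypot(sigma1, sigma2))
```

Hence even when each evidence estimate is well behaved, the log-Bayes factor is necessarily noisier than either of its ingredients.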

CONCLUSIONS
In this article we introduce methodology to compute the model evidence for modern neural density estimation approaches to SBI (simulation-based inference). This approach leverages the learnt harmonic mean estimator introduced in McEwen et al. (2021); in particular, its property that it is decoupled from the sampling strategy and requires only samples of the posterior. Exploiting this property, we develop SBI model comparison techniques for all three main neural density estimation approaches: NPE (neural posterior estimation), NLE (neural likelihood estimation), and NRE (neural ratio estimation). Generating samples from the surrogate posterior can be performed directly for the NPE approach, rapidly and in parallel, whereas MCMC (Markov chain Monte Carlo) sampling is required for the NLE and NRE approaches.
We demonstrate and validate our SBI evidence calculation techniques on a range of inference problems, using the learnt harmonic mean estimator as implemented in the HARMONIC software. The examples considered include two simple problems for which the analytical value of the evidence is known, and a more realistic application to a simulated gravitational wave analysis. For all of these examples, in addition to using HARMONIC to provide estimates of the evidence in an SBI setting, we further test the effectiveness of the learnt harmonic mean estimator in the corresponding likelihood-based setting. In this way, we show that HARMONIC can be reliably used in both likelihood-based and simulation-based (likelihood-free) contexts to provide precise and accurate estimates of the Bayesian evidence.
For likelihood-based approaches we compare the analytical value of the evidence against estimates obtained with nested sampling algorithms, as well as against estimates from HARMONIC (with posterior samples obtained with the affine sampler of emcee). We find that HARMONIC always produces values of the evidence in excellent agreement with the analytical values and with those computed by nested sampling. Nested sampling algorithms have become the standard way to compute estimates of the evidence in many scientific fields, including astrophysics and cosmology. Our results clearly corroborate the findings of McEwen et al. (2021), where the learnt harmonic mean estimator was introduced and its effectiveness demonstrated, and further highlight the potential of this method as an alternative to nested sampling for evidence estimation. This is particularly important for model comparison tasks in cosmological analyses of future surveys, where the ability to cross-check quoted evidence estimates is of the utmost importance to ensure robust scientific statements.
Given the agnostic nature of the learnt harmonic mean estimator with respect to the method used to obtain samples of the posterior distribution, it can also be leveraged to compute the model evidence in a variety of simulation-based (likelihood-free) contexts. In our experiments we compare the different SBI evidence computation approaches that we propose, which rely on learning approximations of either the posterior (NPE), the likelihood (NLE) or the likelihood ratio (NRE).
We validate all SBI evidence estimation approaches by comparing estimates computed using HARMONIC to those computed by likelihood-based alternatives (we stress that in a practical SBI setting the likelihood is not available, but for validation purposes it is nevertheless useful to compare to the likelihood-based setting). As is to be expected, the variances of the SBI evidence estimates are generally larger than those of the likelihood-based approaches (since the latter exploit knowledge of the likelihood, whereas the SBI approaches clearly do not). Furthermore, we find that the NLE evidence calculation techniques typically provide excellent agreement with the reference values, giving more accurate evidence estimates than NPE and NRE. We suggest that this may be due to the fact that our NLE evidence estimation approach requires only a single neural density estimator (whereas the NPE and NRE approaches require two), resulting in fewer sources of approximation error. This result is particularly encouraging in view of applications to cosmological scenarios, where it is very common to perform SBI using the PyDELFI software (Alsing et al. 2019), which indeed implements (sequential) NLE to sample the posterior distribution.
Comparing amortised and sequential approaches to SBI, again coupled with HARMONIC for evidence estimation, we notice that the sequential approaches lead to either similar or more accurate estimates of the evidence. This is not surprising, as it confirms that the quality of the approximation performed in applying the SBI technique has an important effect on the quality of the derived evidence estimate. This increase in the quality of the approximation comes at the expense of a reduction in the generality of the inference framework, which cannot be re-used for different observations.
To conclude, the methodology and proof-of-concept analysis presented in this article highlight the potential of HARMONIC as an additional tool for Bayesian model selection that can also be successfully applied within novel, promising SBI settings. At present, in order to characterise the errors of the SBI evidence estimators proposed here, we recommend bootstrapping, as performed in the simple numerical experiments presented in this article. Further research will focus on alternative approaches to estimate the variance of the estimators directly, folding in both sampling variance and neural density approximation error. Future research will also focus on extending the applicability of HARMONIC to larger data and parameter spaces, as well as on leveraging more effective learnt importance target models.
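The bootstrap recipe recommended above can be sketched as follows: resample the posterior samples with replacement, recompute the evidence on each resample, and report the spread of the resulting estimates. The snippet uses a self-contained 1D toy (Gaussian prior and likelihood, fixed narrow Gaussian importance target standing in for the learnt one), not the HARMONIC implementation.

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy 1D posterior N(0, 1/2), for prior N(0, 1), likelihood N(x0; theta, 1), x0 = 0.
x0 = 0.0
theta = rng.normal(0.0, np.sqrt(0.5), size=20_000)

def ln_evidence(samples):
    # Harmonic mean with a fixed narrow Gaussian target phi = N(0, 0.4^2).
    ln_phi = -0.5 * np.log(2 * np.pi * 0.16) - 0.5 * samples**2 / 0.16
    ln_like = -0.5 * np.log(2 * np.pi) - 0.5 * (x0 - samples) ** 2
    ln_prior = -0.5 * np.log(2 * np.pi) - 0.5 * samples**2
    ln_ratio = ln_phi - ln_like - ln_prior
    return -(np.logaddexp.reduce(ln_ratio) - np.log(len(samples)))

# Bootstrap: resample with replacement, recompute, report mean and spread.
boot = np.array([
    ln_evidence(rng.choice(theta, size=theta.size, replace=True))
    for _ in range(200)
])
print(boot.mean(), boot.std())
```

The bootstrap spread captures the sampling variance of the estimator given the available posterior samples; it does not account for neural density approximation error, which is exactly the component flagged above for future work.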

Figure 1 .
Figure 1. Schematic overview of the three novel techniques that we introduce to compute the evidence in SBI settings, for neural posterior, likelihood and ratio estimation methods (NPE, NLE and NRE, respectively). The yellow training blocks represent all phases of neural density estimation; each block can be run in an amortised or sequential setting.

Figure 2 .
Figure 2. Model evidence values estimated with different likelihood-based and simulation-based (likelihood-free) methods for the linear Gaussian example described in Sec. 5.1, whose analytical truth value is shown in red. For all methods we repeat the evidence estimation exercise 25 times to empirically characterise the statistical distributions of the model evidence estimates, shown by the blue areas in each 'violin'. The mean and one standard deviation error bars are illustrated in purple. The pink section of the plot shows likelihood-based results obtained with nested sampling and with the learnt harmonic mean estimator (the latter using emcee samples). The light brown section of the plot shows results for the simulation-based evidence pipelines summarised in Fig. 1. These are all based on the use of HARMONIC to derive evidence estimates from posterior samples obtained with neural posterior estimation (NPE), neural likelihood estimation (NLE) and neural ratio estimation (NRE), in their amortised and sequential variants.

Figure 3 .
Figure 3. Model evidence values estimated with different likelihood-based and simulation-based (likelihood-free) methods for the Radiata pine example described in Sec. 5.2, whose analytical truth value is shown in red. Colour codes and labels are consistent with Fig. 2.

Figure 4 .
Figure 4. Logarithm of the Bayes factors between source and alternative waveform models estimated with different likelihood-based and simulation-based (likelihood-free) methods for the gravitational wave example described in Sec. 5.3. Shown in red is an estimate of the true value of the logarithm of the Bayes factor, obtained using numerical integration. Colour codes and labels are consistent with Fig. 2 and Fig. 3.
Here p(θ|x, M) is the posterior distribution of the parameters θ given the observed data x and the assumed model M, L(x|θ, M) is the likelihood function of the data x given parameters θ and model M, π(θ|M) is the prior distribution of the model parameters θ for a given model M, and p(x|M) is the model evidence, i.e. the probability of the data x for a given model M. The Bayesian model evidence is given by the normalisation factor of the posterior distribution p(θ|x, M):

z = p(x|M) = ∫ dθ L(x|θ, M) π(θ|M).