Parameter inference and model comparison using theoretical predictions from noisy simulations

When inferring unknown parameters or comparing different models, data must be compared to underlying theory. Even if a model has no closed-form solution to derive summary statistics, it is often still possible to simulate mock data in order to generate theoretical predictions. For realistic simulations of noisy data, this is identical to drawing realizations of the data from a likelihood distribution. Though the estimated summary statistic from simulated data vectors may be unbiased, the estimator has variance which should be accounted for. We show how to correct the likelihood in the presence of an estimated summary statistic by marginalizing over the true summary statistic in the framework of a Bayesian hierarchical model. For Gaussian likelihoods where the covariance must also be estimated from simulations, we present an alteration to the Sellentin-Heavens corrected likelihood. We show that excluding the proposed correction leads to an incorrect estimate of the Bayesian evidence with JLA data. The correction is highly relevant for cosmological inference that relies on simulated data for theory (e.g. weak lensing peak statistics and simulated power spectra) and can reduce the number of simulations required.


INTRODUCTION
It is increasingly common, especially in cosmological surveys, to attempt to make inferences from data d using theory summary statistics µ that can be obtained only from simulations.
One example, currently popular in cosmology, is weak lensing peak statistics (Dietrich & Hartlap 2010; Kacprzak et al. 2016; Peel et al. 2017; Shan et al. 2018; Martinet et al. 2018). Peak statistics broadly aim to use the number of density peaks in the cosmological matter distribution to constrain cosmological parameters and models. The number of peaks in the density field (or weak lensing signal) is the result of highly non-linear structure formation, and large-volume dark matter simulations are often needed to generate the theoretical number. The number of peaks in a given simulation is stochastic due to cosmic variance, and further sources of noise are then added to simulate realistic observed data.
The data d are not the raw maps or catalogues, but the observed/derived summary statistics. For example, d would be the observed number of peaks in a weak lensing mass map reconstruction (Kacprzak et al. 2016; Jeffrey et al. 2018).
E-mail: niall.jeffrey.15@ucl.ac.uk
It was recently noted by Sellentin & Heavens (2016) that the common practice of debiasing a covariance matrix estimated from simulations of mock data (Hartlap et al. 2007) is insufficient. The sampling distribution of this estimated covariance should be incorporated into the likelihood distribution and, therefore, into the posterior distributions of the inferred parameters. Failure to do so leads to biased and overly-optimistic inferences.
In this work we note that, as with the estimated covariance described by Sellentin & Heavens (2016), an unbiased estimated summary statistic μ̂ is nevertheless itself a random variable, drawn from a sampling distribution with associated variance. If unaccounted for, this will lead to inaccurate parameter inference and misleading model comparison results.
In a Bayesian hierarchical framework, we show how to derive the posterior distribution of the parameters using a corrected likelihood distribution that takes into account that the theoretical predictions are based on noisy simulations.
An alternative to our presented hierarchical approach is to use likelihood-free inference (LFI) methods (e.g. Alsing et al. 2018, Leclercq 2018). In general, LFI methods assume that the likelihood is unknown, and simulations are used to estimate the resulting posterior distribution conditional on data. However, if simulations are expensive and we believe we know the naive likelihood [1], then LFI methods would be unnecessary and would require more simulations, due to the increased number of degrees of freedom in the model. Instead, we show how to directly construct a Bayesian hierarchical model containing the free parameters of the known naive likelihood.
In Section 2 we show how to marginalize over possible summary statistics µ to derive the likelihood P(d | μ̂), the probability of the data d conditional on the estimated summary statistic μ̂. In Section 3, we consider the case in which the naive likelihood is Gaussian, and derive the corrected likelihood distribution in the presence of both known and unknown (estimated) covariance matrices. In Section 4, we use a one-dimensional toy model to demonstrate the effect of estimated summary statistics; we show that the corrected likelihood distribution matches samples generated from the toy model. In Section 5, we use the public JLA supernovae data to show the effect of estimating summary statistics from simulations, using draws from the known likelihood as mock simulations.

Posterior and likelihood
Using the example of weak lensing peak statistics, we would wish to evaluate the posterior distribution of parameters of interest θ (e.g. Ω_m, σ_8, etc.) conditional on our observed data d (the number of peaks in a weak lensing map), which is given by Bayes' theorem:

P(θ | d) = P(d | θ) P(θ) / P(d) .

If we were able to directly model the summary statistic with a deterministic function µ(θ) (e.g. if we could calculate the number of weak lensing peaks for given parameters θ), then we could substitute µ(θ) into the likelihood, since for a deterministic µ = f(θ) we have P(d | θ) = P(d | µ(θ)):

P(θ | d) = P(d | µ(θ)) P(θ) / P(d) .

Of course, this is not possible in many cases where the deterministic model is not known. We must instead rely on estimates μ̂_i of the true, but unattainable, summary statistic, with simulations run at position i in parameter space with parameters θ_i. The correct representation of Bayes' theorem for the posterior distribution is then

P(θ | d, {μ̂_i}) = P(d, {μ̂_i} | θ) P(θ) / P(d, {μ̂_i}) ,

or equivalently

P(θ | d, {μ̂_i}) = P(d | {μ̂_i}, θ) P({μ̂_i} | θ) P(θ) / [ P(d) P({μ̂_i}) ] ,

keeping in mind that, as observed quantities, the observed summary statistic d and the statistics from the simulations {μ̂_i} are independent and, therefore, separable. The braces {} represent the set of simulations run over the positions i in parameter space.

[1] Assuming we could condition on the true summary statistic, which can be calculated: P(d | µ(θ)).
The factors P({μ̂_i}) and P(d) are both Bayesian evidence terms, for the observed simulations and the observed data respectively. As constants with respect to the parameters θ, they can be ignored during parameter inference.

Likelihood correction
At position i in parameter space, a set of M simulated data realizations is generated. As it is not possible to calculate the summary statistic µ, we estimate it from the M simulated data realizations d_sim. The estimate is often the mean,

μ̂ = (1/M) Σ_{j=1}^{M} d_sim,j ,

where each j-th data realization is independent. In some cases, the summary statistic may not be the expectation of the likelihood P(d | µ), so the estimator μ̂ would be another function of the simulated data (not the mean).
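As a concrete sketch of this estimator, with hypothetical dimensions and a Gaussian toy generator standing in for a real simulator (all names and numbers here are illustrative, not from the analysis above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: at one position i in parameter space, a simulator
# produces M independent data realizations, each a vector of length p.
M, p = 50, 4
true_mu = np.array([1.0, 2.0, 3.0, 4.0])  # the unattainable true summary statistic
d_sim = rng.normal(loc=true_mu, scale=1.0, size=(M, p))  # mock simulator output

# The estimated summary statistic of equation 5: the sample mean over realizations.
mu_hat = d_sim.mean(axis=0)
```

The estimator is unbiased, but mu_hat remains a random variable with variance of order Sigma/M, which is the point of the correction derived below.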
Even if μ̂ is an unbiased estimate (that is, ⟨μ̂⟩ = µ), it is often mistakenly assumed that

P(d | μ̂) = P(d | µ)|_{µ = μ̂} ,

which is not generally correct.
The correct likelihood to be used for parameter inference is P(d | μ̂_i), with the estimated {μ̂_i} from simulations as used in equation 3 or equation 4. This can be rewritten as a marginalization over the unknown true summary statistic,

P(d | μ̂_i) = (1 / P(μ̂_i)) ∫ P(d | µ_i) P(μ̂_i | µ_i) P(µ_i | θ_i) dµ_i ,

or alternatively

P(d, μ̂_i) = ∫ P(d | µ_i) P(μ̂_i | µ_i) P(µ_i | θ_i) dµ_i ,

which is the same up to a constant evidence factor P({μ̂_i}), as would be expected from equations 3 and 4. Which of the previous two distributions one wishes to think of as the traditional likelihood is somewhat academic: once they are included in Bayes' theorem (equation 3 or 4), the posterior is the same.
Adding the corrected likelihood into equation 3 gives the posterior distribution for the parameters of interest:

P(θ | d, {μ̂_i}) ∝ P(θ) ∫ P(d | µ) P({μ̂_i} | µ) P(µ | θ) dµ .

Here and in what follows, we drop explicit dependence on θ in terms like P({μ̂_i} | µ, θ), and distributions will only condition on the previous layer in the hierarchical model (figure 1), as is common practice (e.g. Leistedt et al. 2018).
One can nevertheless keep in mind that probability distributions involving µ are also conditioned on θ. Equation 9 is the most general form, but each factor in the final integral can be evaluated for certain forms of the naive likelihood P(d | µ) and chosen priors:
• The first factor is the naive likelihood that would be used if the summary statistic µ could be calculated.
• The second factor is the sampling distribution of μ̂. If our simulated data sets d_sim,j for a given set of parameters θ_i are independent and realistic, then each is an independent and identically distributed (i.i.d.) draw from the naive likelihood distribution. Assuming we know the naive likelihood, it is usually possible to evaluate the sampling distribution P(μ̂ | µ).
• The final factor of equation 7 is the prior on the summary statistic µ conditional on position in parameter space. This does not assign prior probabilities to the values of the parameters themselves, but to possible forms of the summary statistic.
If we believe, for example, that µ should vary smoothly in parameter space, this could be enforced through a smoothness prior P(µ | θ). This can be compared to emulation methods (e.g. Heitmann et al. 2009, Bird et al. 2019, Jennings et al. 2019), where µ is estimated from simulations using a smoothness prior, either explicitly (Gaussian processes) or implicitly (machine learning). However, in such methods the uncertainties are not generally propagated into the final posterior distribution in a principled hierarchical way as described here.
Conversely, if we claim to know nothing (or very little) about µ a priori, then we might consider a uniform prior (Section 3.1).

Bayesian Hierarchical Model
The model described in the previous section is hierarchical and has a network of parameters related by conditional probabilities. Specific probability distributions of interest, such as the posterior probability distribution of the parameters, are evaluated by appropriate use of Bayes' theorem and marginalization.
The left panel of figure 1 shows the probabilistic graphical representation of the hierarchical model pertaining to equation 9. This graphical representation may make the logical steps of the previous section clearer, especially allowing us to see where conditions can be dropped (e.g. where a distribution conditions only on the previous layer of the hierarchy).

GAUSSIAN NAIVE LIKELIHOOD
For many cosmological analyses, the data d are assumed to have Gaussian noise and are, therefore, drawn from a Gaussian likelihood. In this section we derive the corrected likelihood for the cases with known covariance (Section 3.1) and unknown (estimated) covariance (Section 3.2).
A Gaussian likelihood is usually an approximation, as there are likely to be some sources of non-Gaussian noise. It may, however, be a very good approximation: by the central limit theorem it may be the correct distribution in some limit of large numbers. For example, in a survey to measure the matter power spectrum P(k), if the galaxies are a Poisson process, then for modes that average many galaxies (high k) the likelihood is approximately Gaussian. Similarly, if weak lensing peaks are Poissonian, the binned counts of peaks will be approximately Gaussian for large numbers of observed peaks.
For cases where the naive likelihood is non-Gaussian and one wishes to calculate the corrected likelihood (conditional on an estimated summary statistic μ), one should evaluate equation 7 analytically or numerically.

Known Covariance
Consider the case where the naive likelihood (the first factor in the integral of equation 9) is a Gaussian/normal distribution, such that

P(d | µ, Σ) = |2πΣ|^{-1/2} exp[ −(1/2) (d − µ)ᵀ Σ⁻¹ (d − µ) ] ,

and the covariance Σ is assumed known. In this case, the sampling distribution of the sample mean (the second factor in the integral of equation 9) is

P(μ̂_i | µ_i, Σ) = |2πΣ/M|^{-1/2} exp[ −(1/2) (μ̂_i − µ_i)ᵀ (Σ/M)⁻¹ (μ̂_i − µ_i) ]

at position i in parameter space.
If we do not wish to enforce any prior knowledge about µ, it seems reasonable to use the Jeffreys prior (Jeffreys 1946, Jeffreys 1998) as an objective prior distribution for µ, which, for our Gaussian likelihood (equation 10), is uniform:

P(µ) ∝ constant .

This flat prior on µ, if unbounded, is formally improper. However, the resulting (posterior-like) distribution, P(µ_i | μ̂_i, Σ), is Gaussian and therefore a proper probability distribution.
With these distributions (equations 10-12) for a Gaussian naive likelihood, we perform the marginalization integration (equation 7):

∫ P(d | µ) P(μ̂_i | µ) P(µ) dµ ∝ N(d; μ̂_i, (M+1)/M Σ) ∫ N(µ; x, X) dµ ,

where N(·; mean, covariance) denotes a Gaussian density, x is a certain function of {d, μ̂, Σ, M} (but not µ), and X is a certain function of {Σ, M}. The first factor can be brought outside the integral. The integration over µ of the second factor, which is a normal distribution, evaluates to one, which removes the dependence on x. Normalizing the resulting distribution with respect to d gives the corrected likelihood:

P(d | μ̂_i, Σ) = |2π (M+1)/M Σ|^{-1/2} exp[ −(1/2) (d − μ̂_i)ᵀ ( (M+1)/M Σ )⁻¹ (d − μ̂_i) ] .

For summary statistics μ̂_i = μ̂(θ_i) estimated from simulations (run with parameters θ_i), where the likelihood distribution for data d conditional on the true (but unknown) µ_i is Gaussian with known covariance Σ, and with an objective Jeffreys prior on µ, equation 14 is the corrected form of the likelihood. It is this corrected likelihood that should be used for parameter inference.
In this case, the corrected likelihood has the same Gaussian form as the naive likelihood, but with a scaled covariance.At first glance, this scaling could be mistaken for Bessel's correction for an unbiased estimate of the sample variance; however, here we actually know the covariance Σ and the added scaling comes from uncertainty in our estimate μ.
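A minimal numerical sketch of this scaled-covariance likelihood (the function name and inputs are illustrative):

```python
import numpy as np

def log_corrected_likelihood(d, mu_hat, Sigma, M):
    """Gaussian log-likelihood of data d given the summary statistic mu_hat
    estimated from M simulations, with the known covariance Sigma scaled by
    (M + 1) / M, following the corrected likelihood of equation 14."""
    Sigma_scaled = (M + 1) / M * np.asarray(Sigma, dtype=float)
    diff = np.asarray(d, dtype=float) - np.asarray(mu_hat, dtype=float)
    _, logdet = np.linalg.slogdet(2.0 * np.pi * Sigma_scaled)
    return -0.5 * (logdet + diff @ np.linalg.solve(Sigma_scaled, diff))
```

For large M the scaling tends to one and the naive Gaussian likelihood is recovered; for small M the broadened covariance prevents the overly optimistic inferences described above.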

Unknown Covariance
In the previous section, we assumed that the summary statistic µ was estimated from simulations but that the covariance Σ was known. In practice this situation is unlikely; it is foreseeable that both the summary statistic and the covariance would be estimated from simulated data.
The estimate of the covariance from N independent data simulations is given by

Ŝ = 1/(N − 1) Σ_{j=1}^{N} (d_sim,j − d̄_sim)(d_sim,j − d̄_sim)† ,

where † is the conjugate transpose, and d̄_sim = (1/N) Σ_{j=1}^{N} d_sim,j is just the estimated summary statistic μ̂ as given by equation 5 (but with N simulations, not M).
For the case where the summary statistic µ is not estimated from simulations, Sellentin & Heavens (2016) calculate the corrected likelihood

P(d | µ, Ŝ) = ∫ P(d | µ, Σ) P(Σ | Ŝ) dΣ .

For a Gaussian naive likelihood P(d | µ, Σ), the distribution of the estimated covariance P(Ŝ | Σ) is Wishart. With these distributions and a Jeffreys prior for Σ, the resulting Sellentin-Heavens corrected likelihood is given by

P(d | µ, Ŝ) = c̄_p |Ŝ|^{-1/2} [ 1 + (d − µ)ᵀ Ŝ⁻¹ (d − µ) / (N − 1) ]^{−N/2} ,

where c̄_p is a normalization constant and p is the number of elements in the data vector d (i.e. the dimensionality). This has the form of a multivariate t-distribution.
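For reference, the Sellentin-Heavens likelihood can be sketched as follows; the explicit form of the normalization constant written here is our reading of the standard multivariate-t normalization and should be treated as an assumption:

```python
import numpy as np
from scipy.special import gammaln

def log_sh_likelihood(d, mu, S_hat, N):
    """Log of the Sellentin-Heavens (2016) corrected likelihood: a
    multivariate t-distribution in (d - mu), with the covariance
    estimate S_hat obtained from N simulations."""
    d, mu = np.atleast_1d(d).astype(float), np.atleast_1d(mu).astype(float)
    p = d.size
    diff = d - mu
    chi2 = diff @ np.linalg.solve(S_hat, diff)  # quadratic form with S_hat
    _, logdet = np.linalg.slogdet(S_hat)
    # Assumed normalization c_bar_p = Gamma(N/2) / [(pi (N-1))^{p/2} Gamma((N-p)/2)]
    log_norm = (gammaln(N / 2.0) - gammaln((N - p) / 2.0)
                - 0.5 * p * np.log(np.pi * (N - 1)))
    return log_norm - 0.5 * logdet - 0.5 * N * np.log1p(chi2 / (N - 1))
```

As N grows, this tends to the Gaussian likelihood with covariance S_hat, as expected.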
In the case considered in this work, represented by the right panel of figure 1, we assume that the summary statistic µ cannot be calculated and that we must estimate μ̂ from simulations. The integral in equation 16 must then be replaced by

P(d | μ̂_i, Ŝ_i) ∝ ∬ P(d | µ, Σ) P(μ̂_i | µ, Σ) P(Ŝ_i | Σ) P(µ) P(Σ) dµ dΣ .

This integral is the (corrected) likelihood in the posterior distribution of θ for the model shown in the right-hand panel of figure 1. We note that a given sample mean μ̂_i and sample covariance Ŝ_i are independent, even if the simulated data used to estimate them are the same (Anderson 2003), so that P(μ̂, Ŝ) = P(μ̂) P(Ŝ). If the simulated data used to evaluate μ̂ and Ŝ are the same, one can set N = M in what follows. If, as is often the case in cosmological analyses, the covariance is assumed not to vary with respect to the parameters of interest and is instead estimated once with more simulated data realizations, then N is fixed and the i indices on the estimated covariance are dropped.
Using the same distributions as described so far in Section 3, marginalizing over the unknown true summary statistic µ and covariance Σ (equation 18) and renormalizing gives the new corrected likelihood (details in Appendix A), which takes the Sellentin-Heavens form with the covariance scaled by (M+1)/M:

P(d | μ̂_i, Ŝ_i) ∝ |Ŝ_i|^{-1/2} [ 1 + (M/(M+1)) (d − μ̂_i)ᵀ Ŝ_i⁻¹ (d − μ̂_i) / (N − 1) ]^{−N/2} .

This corrected likelihood gives the probability of observing the data d conditional on an estimated mean summary statistic μ̂_i = μ̂(θ_i) from M simulations and an estimated covariance matrix Ŝ_i = Ŝ(θ_i) from N simulations, where we have assumed that the naive likelihood is Gaussian and have used Jeffreys priors on µ and Σ.

TOY MODEL DEMONSTRATION
As a verification and demonstration of the result given in equation 14, where μ is estimated from simulated data and Σ is known, we construct a toy model.This toy model also relies on the assumed flat prior on µ and the fact that the sampling distribution is symmetric with respect to µ and μ.
Let us assume that in different realizations of an experiment, different experimenters randomly and independently generate M simulations, from which μ̂ is estimated according to equation 5. The underlying likelihood distribution for the data with known µ is Gaussian, and therefore the simulated data are themselves i.i.d. draws from the Gaussian distribution (equation 10). Each experimenter then draws a realization of the data d from the naive Gaussian likelihood distribution with the known variance and the mean given by their estimated μ̂.
Though each experimenter draws their data realization d from a Gaussian likelihood with the known variance Σ, the data realizations from all the experimenters will be distributed according to the corrected likelihood (equation 14) with variance (M+1)/M Σ. In figure 2 we take a one-dimensional case where µ = 42, Σ = π ≈ 3.14, and M = 2. We see that the data samples from 10^5 different experiments match the corrected likelihood distribution (equation 14), whereas the naive likelihood distribution underestimates the variance, as expected.
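This toy model is straightforward to reproduce numerically; the sketch below checks that the pooled data from many experimenters have variance (M+1)/M Σ rather than Σ (the random seed and number of experimenters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

# Section 4 toy model: mu = 42, Sigma = pi, M = 2 simulations per experimenter.
mu_true, Sigma, M = 42.0, np.pi, 2
n_experiments = 100_000

# Each experimenter's estimate mu_hat is the mean of M i.i.d. draws, so
# mu_hat ~ N(mu_true, Sigma / M); each then draws one datum d ~ N(mu_hat, Sigma).
mu_hat = rng.normal(mu_true, np.sqrt(Sigma / M), size=n_experiments)
d = rng.normal(mu_hat, np.sqrt(Sigma))

# Pooled over experimenters: Var[d] = Sigma + Sigma / M = (M + 1) / M * Sigma.
pooled_variance = d.var()
```

The pooled variance comes out close to 1.5 π, matching the corrected likelihood rather than the naive one.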

JLA SUPERNOVAE DEMONSTRATION
In this section, we use public type Ia supernova (SN Ia) data from the SDSS-II/SNLS3 Joint Light-Curve Analysis (JLA) (Betoule et al. 2014) as a demonstration of the corrected likelihoods described in the previous sections. This is, of course, only a demonstration, as the summary statistic µ(θ) (SN Ia apparent magnitudes) can actually be calculated for the model considered. We generate simulated data by drawing realizations from the known likelihood, and estimate μ̂. We can then constrain cosmological parameters using a likelihood distribution conditional on our estimated μ̂.

Data and Model
The data are observed B-band peak apparent magnitudes d = (m B,obs,1 , m B,obs,2 , ...) for 740 SN Ia over a range of redshifts up to z = 1.3.The supernovae also have associated light-curve stretch X 1 , colour at maximum-brightness C and host stellar mass M stellar , which are included in the model and covariance.The data and associated covariance are shown in figure 3.
We use the model from Betoule et al. (2014), where the SN Ia are standardizable candles with expected apparent magnitude

m_B = 5 log₁₀(d_L / 10 pc) + M_B − α X₁ + β C + ΔM Θ(M_stellar − 10¹⁰ M_⊙) ,

where α and β are nuisance parameters for the stretch and colour respectively, M_B is the absolute magnitude, and ΔM is a correction term depending on M_stellar (where Θ is the Heaviside function). We take a flat wCDM Universe, with luminosity distance

d_L(z) = (1 + z) (c / H₀) ∫₀^z dz′ [ Ω_m (1 + z′)³ + (1 − Ω_m)(1 + z′)^{3(1+w)} ]^{−1/2} ,

where c is the speed of light in vacuum, H₀ is the Hubble parameter, Ω_m is the matter density parameter, and w is the equation of state of dark energy.
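These two equations can be sketched numerically as follows; the fiducial H0 = 70 km/s/Mpc is an assumed illustrative value, and the host-mass step ΔM is omitted for brevity:

```python
import numpy as np
from scipy.integrate import quad

C_KMS = 299792.458  # speed of light in km/s

def luminosity_distance(z, Om, w, H0=70.0):
    """Luminosity distance in Mpc for a flat wCDM universe (equation 21)."""
    integrand = lambda zp: 1.0 / np.sqrt(
        Om * (1 + zp)**3 + (1 - Om) * (1 + zp)**(3 * (1 + w)))
    integral, _ = quad(integrand, 0.0, z)
    return (1 + z) * (C_KMS / H0) * integral

def model_m_B(z, X1, C, Om, w, M_B, alpha, beta):
    """Expected B-band peak magnitude of a standardized SN Ia (equation 20),
    without the host stellar mass step."""
    dL = luminosity_distance(z, Om, w)   # in Mpc
    mu_dist = 5.0 * np.log10(dL) + 25.0  # distance modulus for dL in Mpc
    return mu_dist + M_B - alpha * X1 + beta * C
```

At low redshift the distance reduces to the Hubble-law limit d_L ≈ cz/H0, a useful sanity check on the integration.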

Likelihood and Priors
We assume a Gaussian likelihood, where the log-likelihood is given by

ln P(d | µ, Σ) = −(1/2) (d − µ)ᵀ Σ⁻¹ (d − µ) + constant ,

where the data and covariance are those described in Section 5.1 (and shown in Fig. 3), and our summary statistic µ is given in equation 20. For simplicity we take uniform priors on Ω_m, w, and M_B. Simulations for this demonstration are run on a regular grid of shape [12, 13, 11] (for Ω_m, w, M_B) spanning the prior range. The regular grid is a particularly poor choice to sample the posterior distribution when simulations are expensive. However, this is a demonstration, and for real-world analysis many better sampling schemes are available (including Latin hypercubes and grid transformations to better sample the expected posterior distribution).
Once the posterior is evaluated at these grid positions, the parameter space is upsampled to a [48, 52, 44] grid. The new grid positions are evaluated by interpolating the posterior distribution from the original grid using a radial basis function 'thin plate' spline (Duchon 1976; Bookstein 1989; Jones et al. 2001). This spline interpolation worked particularly well, avoiding the edge effects and artefacts around points that we found with more poorly-performing simple polynomial splines.
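A sketch of this grid upsampling using SciPy's thin-plate-spline radial basis interpolator; a toy two-parameter quadratic log-posterior surface stands in for the real one, with grid sizes following the text (the surface and parameter ranges are illustrative assumptions):

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Toy two-parameter log-posterior evaluated on a coarse 12 x 13 grid.
x_coarse, y_coarse = np.linspace(0, 1, 12), np.linspace(-2, 0, 13)
coarse = np.stack(np.meshgrid(x_coarse, y_coarse), axis=-1).reshape(-1, 2)
log_post_coarse = -((coarse[:, 0] - 0.3)**2 + (coarse[:, 1] + 1.0)**2)

# Upsample to a finer 48 x 52 grid with a thin-plate-spline radial basis function.
x_fine, y_fine = np.linspace(0, 1, 48), np.linspace(-2, 0, 52)
fine = np.stack(np.meshgrid(x_fine, y_fine), axis=-1).reshape(-1, 2)
interpolator = RBFInterpolator(coarse, log_post_coarse, kernel='thin_plate_spline')
log_post_fine = interpolator(fine)
```

Interpolating the log-posterior (rather than the posterior itself) is a common choice for smooth surfaces like this one, since the log is better behaved far from the peak.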

Results
First, let us imagine three different experimenters who, despite having access to the same data (described above), run their own independent simulations to estimate the summary statistics μ̂(θ) on the grid in parameter space. This results in different μ̂_i for i = 1, 2, 3.
Evaluating the posterior distribution using the naive Gaussian likelihood set-up described in Section 5.2, and using M = 2 simulations per parameter grid position, results in the three posterior distributions in figure 4. The three different experimenters have three different posterior distributions due to their different μi .
Having different individual posterior distributions is not in itself a problem. If different experiments have different data but the same underlying parameters, their resulting posterior distributions will look different, and will quantify their own individual uncertainty in the parameters. In that case, however, the variance of the data has been taken into account, and will be reflected in each posterior distribution. In the case of the different μ̂_i in figure 4, the fact that μ̂ was a random draw from a distribution (just like the data) has not been taken into account. As they have ignored the resulting correction to their likelihood, the experimenters will each be overly optimistic about their own inferences.
In figure 5, the posterior distribution has been calculated using the likelihood correction (equation 14), which takes into account the variance in μ̂_i (we set μ̂ = µ in this example for clarity). Using ChainConsumer (Hinton 2016), we measure a 25 per cent increase in the determinant of the parameter covariance with the corrected likelihood. The resulting posterior distribution is slightly broader, reflecting the added uncertainty in the inference.

Model Comparison
The comparison of different theoretical models using the data in a Bayesian framework is usually done by calculating the Bayesian evidence:

Z = P(d | Model) = ∫ P(d | θ, Model) P(θ | Model) dθ .

Two models can be compared by evaluating the ratio of their posterior probabilities,

P(Model_1 | d) / P(Model_2 | d) = [ Z_1 / Z_2 ] × [ P(Model_1) / P(Model_2) ] .

If one has no reason to believe a given model more than another a priori, then the second factor (the ratio of the prior distributions) equals one. In this case, the Bayes factor K = Z_1 / Z_2 becomes a ratio of the model probabilities (conditional on the data). We evaluate the Bayesian evidence (equation 23) for the three-parameter JLA analysis described in Section 5.2 with both the uncorrected naive Gaussian likelihood and the corrected likelihood (equation 14) with M = 2, and calculate the Bayes factor between them for each of the three estimates μ̂_i. As a check, after increasing the number of evaluated grid points by a factor of 4 we still calculate the same K values. Additionally, we calculate ln K using a different cosmological parametrisation, sampling scheme and data (Appendix B) and get similar results.
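Under a uniform prior, the grid-based evidence calculation reduces to a prior-weighted average of the likelihood, since the prior volume cancels against the grid cell volume; a minimal sketch under that assumption:

```python
import numpy as np
from scipy.special import logsumexp

def log_evidence_uniform_grid(log_like):
    """ln Z for a uniform prior evaluated on a regular grid: with
    P(theta) = 1/V and cell volume V/n, the evidence is the mean of the
    likelihood over the n grid points (a discrete version of equation 23)."""
    log_like = np.ravel(np.asarray(log_like, dtype=float))
    return logsumexp(log_like) - np.log(log_like.size)
```

A log Bayes factor is then simply the difference of two such ln Z values, computed from the corrected and uncorrected log-likelihoods on the same grid.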
For all three, the corrected likelihood is more than a factor of exp[30] more probable than the uncorrected one. This is further validation of the corrected likelihood; the model (i.e. the corrected likelihood) shows a better goodness-of-fit. Furthermore, if one were using an estimated summary statistic but not using the corrected likelihood, one's belief in a model would be incorrect by this factor. This effect would be particularly harmful when comparing two models where it is possible to calculate µ for the first, but μ̂ must be estimated from simulations for the second. Using the same Gaussian likelihood for both, without the correction for the second, would lead one to incorrectly favour the first model.
In figure 6, the log Bayes factors ln K (equation 25) are shown for the three estimated summary statistics μ̂_i as a function of the number of simulations M. As the number of simulations increases, the error incurred by using the uncorrected likelihood decreases. The value asymptotes to, but never reaches, ln K = 0. Even for a large number of simulations (large M), the error is not negligible. For many analyses, rather than running ~10² expensive simulations, it would be better to use the corrected likelihood and avoid this error.

DISCUSSION & CONCLUSIONS
In this work, we have shown how to take the sampling distribution of estimated summary statistics, μ, into account for parameter inference in a cosmological context.For situations where the naive likelihood is Gaussian, we have evaluated this correction (by marginalizing over the unknown µ) for the case with known covariance (equation 14) and estimated covariance (equation 19).
We have validated the corrected likelihood with a toy model (Section 4). Using JLA SN Ia data, we have demonstrated the effect of the corrected likelihood on cosmological parameter inference. For model selection, in our simple three-parameter inference demonstration, we show that the log Bayesian evidence ln Z will be incorrect by a difference of over 30 if the corrected likelihood is not used.
In the era of DES (DES Collaboration et al. 2018), LSST (LSST Science Collaboration et al. 2009) and Euclid (Amendola et al. 2016), cosmological analyses will have access to large cosmological data sets. Sole reliance on two-point statistics in the linear regime would be tantamount to wasting data that are rich in cosmological information. However, many summary statistics µ that access information beyond these two-point statistics cannot be calculated analytically, and realistic simulations need to be run to estimate μ̂.
A typical approach has been to run an excessive number of simulations at each position in parameter space, such that the variance of μ̂ is negligible. This approach has diminishing returns, as the variance tends to zero only as 1/M. It effectively aims to increase the number of simulations M until the sampling distribution P(μ̂ | µ) can be viewed as a Dirac delta function (a limit it will, of course, never reach). Accepting a small broadening of the resulting parameter constraints and correcting the likelihood for this sampling distribution means that fewer simulations have to be run.
If one does not wish to take the sampling distribution into account, one might use "cheap" simulations, where it is possible to run enough that one effectively reaches the Dirac delta function limit. This has two potential pitfalls: firstly, the limit is never truly reached, which may affect the inferred parameters or model comparison results; secondly, cheap simulations are likely to be less realistic. It is far better to have slightly broader posterior distributions from reliable simulations than to have tighter constraints on parameters that are biased due to unreliable simulations.
The approach taken in this paper requires the acceptance that simulations are not "free".Simulations are increasingly an essential part of analyses.Like data, reliable simulations are often expensive in terms of time and resources and are, therefore, an acceptable contribution to the uncertainty of inferred parameters.

Figure 2 .
Figure 2. One-dimensional toy model for the naive Gaussian likelihood with known variance, where µ = 42, Σ = π ≈ 3.14, and M = 2. The samples (described in Section 4) are distributed according to the corrected likelihood distribution (equation 14), whereas the naive likelihood distribution has reduced variance.

Figure 3 .
Figure 3. Left panel: the observed magnitude m_B data for 740 SN Ia. The error bars are taken as the square root of the diagonal elements of the covariance. Right panel: the covariance matrix as described in Section 5.1.

Figure 4 .
Figure 4. JLA posterior distribution for Ω_m, w, and M_B (described in Section 5.2) using three independent estimates μ̂ with M = 2 simulations per position in parameter space. This uses the naive Gaussian distribution (equation 10) without the correction (equation 14), and the contours are therefore optimistically reduced.

Figure 5 .
Figure 5. JLA posterior distribution for Ω_m, w, and M_B (described in Section 5.3) using the corrected likelihood (equation 14) with M = 2 (we set μ̂ = µ in this example for clarity). The contours are broader than in figure 4 (25 per cent increase in the parameter covariance determinant) as this likelihood takes into account that the estimated summary statistic is a draw from a sampling distribution.
The curves correspond to μ̂_1, μ̂_2, and μ̂_3 respectively, where ln K = ln Z_corrected − ln Z_uncorrected, using the corrected likelihood to evaluate the evidence Z_corrected and the naive, uncorrected likelihood for Z_uncorrected.

Figure 6 .
Figure 6. The log Bayes factor ln K as a function of the number of simulations M for the three estimated summary statistics μ̂_i (using cubic spline interpolation between evaluated points).
Figure 1. Probabilistic graphical representation in plate notation of the Bayesian hierarchical models from Sections 2.2 & 3.1 (left panel) and Section 3.2 (right panel). Shaded nodes are "observed", either from experimental data (d) or from simulations (μ̂ and Ŝ). Each plate (rectangular box) includes the amount of data associated with the variable; for example, each μ̂_i (run at position i in parameter space) comes from M_i simulations.