Marginal Post Processing of Bayesian Inference Products with Normalizing Flows and Kernel Density Estimators

Bayesian analysis has become an indispensable tool across many different cosmological fields including the study of gravitational waves, the Cosmic Microwave Background and the 21-cm signal from the Cosmic Dawn among other phenomena. The method provides a way to fit complex models to data describing key cosmological and astrophysical signals and a whole host of contaminating signals and instrumental effects modelled with `nuisance parameters'. In this paper, we summarise a method that uses Masked Autoregressive Flows and Kernel Density Estimators to learn marginal posterior densities corresponding to core science parameters. We find that the marginal or 'nuisance-free' posteriors and the associated likelihoods have an abundance of applications including; the calculation of previously intractable marginal Kullback-Leibler divergences and marginal Bayesian Model Dimensionalities, likelihood emulation and prior emulation. We demonstrate each application using toy examples, examples from the field of 21-cm cosmology and samples from the Dark Energy Survey. We discuss how marginal summary statistics like the Kullback-Leibler divergences and Bayesian Model Dimensionalities can be used to examine the constraining power of different experiments and how we can perform efficient joint analysis by taking advantage of marginal prior and likelihood emulators. We package our multipurpose code up in the pip-installable code margarine for use in the wider scientific community.


INTRODUCTION
Bayesian analysis is a cornerstone of modern statistical inference.It uses Bayes theorem to iteratively update the probability of a given hypothesis or model (posterior).The inference is informed with existing knowledge about the model parameters (prior) and we define a representative probability distribution for the data given the choice of model and parameters (likelihood) to perform the inference.Various computational approaches to Bayesian analysis exist, including Markov Chain Monte Carlo (MCMC) methods (e.g.Foreman-Mackey et al. 2013) and Nested Sampling algorithms (e.g.Handley et al. 2015a,b).
These experiments are often plagued by 'nuisance' parameters that characterise foregrounds and instrumental effects in the data.This leads to high dimensional parameter spaces, of which only a few of the parameters model the core science.Consequently, drawing conclusions about the parameters of interest and making comparisons across different experiments with different nuisance parameters can be challenging.For example, in the Year 1 analysis of data from DES the likelihood contains a total of 26 parameters of which 20 could be considered nuisance parameters with only 6 corresponding to cosmological parameters of interest (Dark Energy Survey Collaboration 2018).Similarly, for REACH the likelihood can contain upwards of 15 parameters, with only 3-7 of these corresponding to the cosmological 21-cm signal.
We present a post-processing tool called margarine that can be used to calculate marginal, nuisance-free or equivalently nuisancemarginalised posterior distributions pertaining to the core science goals of the above experiments through density estimation with Masked Autoregressive Flows (MAFs, Papamakarios et al. 2017) and Kernel Density Estimators (KDEs, Rosenblatt 1956;Parzen 1962).
Density estimators model probability distributions given a set of representative samples, and there are a number of different methods through which this can be done.They can be used to evaluate the logarithm of the probability of a set of samples on the learned distribution, meaning that they can be used as a computationally inexpensive likelihood generators and, when trained on marginal samples from sub-spaces of a larger distribution, to calculate marginal statistics such as the marginal Kullback-Leibler divergence and marginal Bayesian Model Dimensionality.The density estimators used in this work are outlined in more detail in section 3.
As likelihood generators, the density estimators can be used to perform efficient joint analysis of constraints from multiple different experiments probing the same core science with different nuisance parameters.
The Kullback-Leibler (KL) divergence can be used to determine how much information has been gained about the parameters of a model through our Bayesian analysis when moving from the prior and the posterior (Kullback & Leibler 1951).The marginal Kullback-Leibler (KL) divergence, accessible through the density estimators discussed above, is a measure of how much information is gained in the inference when contracting specifically the core science prior onto the core science posterior without including contributions from correlations with or constraints on the nuisance parameters.This allows us to confidently compare the information gained from observations by different experiments of the same signal, regardless of their specific instrumental nuances.
The Bayesian Model Dimensionality gives a measure of the number of parameters that have been constrained during our inference (Handley & Lemos 2019a).The marginal Bayesian Model Dimensionality (BMD) can be used to compare the effective number of core science parameters constrained by different experiments and models independent of the experiments specific nuisance parameters.The KL divergence and BMD are further explored in section 2.
Both the marginal KL divergence and the BMD, which prior to margarine were not easy to assess, can therefore provide a way to quantitatively determine which experimental approaches are the most informative, leading to improvements to instrumentation in the future.
Finally, we note that if the density estimator is set up correctly such that it is a bĳective transformation between the unit hyper-cube and the target posterior, then it can be used to generate the prior on a subsequent Nested Sampling or MCMC run.This allows far more complex priors based on current experimental results to be used, incorporating our current knowledge of the parameter space (this idea was first introduced in Alsing & Handley 2021).
In addition to the previously discussed examples, Bayesian analysis has historically been applied to many different sub-fields in astrophysics, such as the study of velocity fields (e.g Kaiser & Lahav 1989), the CMB (e.g.Trotta 2008), the study of galaxies from JWST (e.g.Curtis-Lake et al. 2023) and gravitational wave studies (e.g.Romero-Shaw et al. 2021;Agazie et al. 2023) among others.Due to the ubiquity of Bayesian analysis in cosmology and astrophysics, post-processing of samples from MCMC and Nested Sampling algorithms is an important area of research.margarine, presented in this paper, is an invaluable addition to the Bayesian workflow and allows for direct and specific comparison of the constraining ability of different experimental approaches, rapid emulation of nuisance-free likelihoods and experimentally informed priors.
We implement our approach in Python using tensorflow and the keras backend and release the code as the pip installable package margarine.The code is fully documented with examples, subject to continuous integration testing and generally does not require high performance computing to be used.
In section 2 we introduce Bayesian analysis and the marginal statistical quantities that can be evaluated with margarine.We discuss the two different types of density estimators used by margarine in section 3.In section 4, we demonstrate several applications of the code base, including applications to samples from the Dark Energy Survey and mock observations with the REACH pipeline.We summarise the paper in section 5.
margarine was previously introduced in the conference proceedings for the 2022 International Conference on Bayesian and Maximum Entropy Methods in Science and Engineering (MaxEnt22, Bevins et al. 2022b) in which the nuisance-free likelihood was derived and demonstrated.We review this work briefly in this paper.

Bayesian analysis
Bayesian analysis is concerned with the estimation of posterior probabilities using Bayes theorem where Θ are the parameters of our model  describing the data  and L (Θ) corresponds to the likelihood.The likelihood, the probability of the data given the model, is often assumed to be Gaussian in nature (although more complicated likelihoods can be implemented, Scheutwinkel et al. 2022b).The prior, (Θ) = (Θ|), encodes our existing knowledge about the parameters before any sampling is performed.(Θ) is often chosen to be uniform or log-uniform for each parameter within some physically motivated range, however it can be useful to inform this with existing experimental results.The evidence, Z, is a marginal likelihood integrated over all the parameters weighted by the prior The estimation of Z via equation ( 2) is the primary concern of the Nested Sampling algorithm (Skilling 2004) and through the evaluation of equation ( 2) we can derive samples on the posterior P (Θ|, ).The algorithm works by generating samples in Θ from the prior, calculating the likelihood and numerically approximating the integral in equation ( 2).At each iteration, new samples at higher likelihoods are generated to sample the posterior bulk.The method by which the samples are generated is different in different implementations (e.g.Feroz et al. 2009;Handley et al. 2015a), and the efficiency of the sampling algorithm is often dependent on the dimensionality of the parameter space.The evidence can be used for model comparison, with a higher evidence indicating a preference for one model over another.
If our model  (Θ) contains nuisance, , and cosmological signal parameters, , then we can write Θ = {, }.Mathematically, the marginal posterior for the signal parameters, P (), can be calculated by integrating out the dependence on  P (|, ) = ∫ P (, |, ). (3) Whilst we have access to samples in the marginal space, however, the probability P (|, ), referred to as P () from now on, is typically a difficult quantity to calculate as it involves performing a high dimensional integral over the nuisance parameters .We can effectively perform this marginalisation by training density estimators on samples in  from MCMC or Nested Sampling chains excluding the samples in .From these density estimators, we can generate samples from the distribution P () and importantly calculate log P () for a given set of  which gives us access to a whole host of previously intractable marginal summary statistics.The density estimators used in this work are outlined in more detail in section 3.

Nuisance-free likelihood
In Bevins et al. (2022b) we defined the 'nuisance-free' likelihood function as where the marginal posterior, P (), and prior, (), can be emulated from samples with a density estimator in margarine.Here Z is the Bayesian evidence from the full fit with  and  and is crucial to the definition of the 'nuisance-free' likelihood.Without the inclusion of the evidence, any resultant sampling performed with L () will not produce sensible evidences that can be used for model comparison.
We demonstrate applications of the nuisance-free likelihood to perform efficient joint analysis in section 4.1 with a simple toy example and reference Bevins et al. (2022b) for a more complete description.

Marginal Kullback-Leibler divergence
As mentioned in the introduction, the KL divergence quantifies the information gain when moving from the prior to posterior and is an effective measure of the contraction between the two distributions.The marginal KL divergence corresponding to the astrophysical or cosmological parameters  is defined as the average Shannon Information over the posterior, P (), and is a measure of how much information in bits the data provides us about the astrophysical or cosmological part of the parameter space.It is approximately equal to the log of the ratio of the prior volume,  , and the posterior volume,  P .D is a strong function of the prior, and inherits the property of being additive for independent parameters from the Shannon Information.Calculation of the KL divergence requires the posterior distribution to be appropriately normalized, which requires knowledge of the Bayesian evidence, and consequently the quantity is not attainable with common MCMC sampling techniques which generate samples on the non-normalised posterior through the relationship The KL-divergence can be related to the Bayesian evidence via and acts as an Occam penalty that penalizes complex models with large numbers of dimensions (Hergt et al. 2021).In practice, the above equation is typically used to calculate the KL divergence from Nested Samples, where we have ready access to the Bayesian evidence Z.However, using density estimators to calculate the KL divergence, marginal or not, also has the added advantage that it does not require knowledge of the Bayesian evidence because the density estimators naturally return normalized distributions.

Marginal Bayesian Model Dimensionality
An alternative measure of constraint is the Bayesian Model Dimensionality (BMD, Handley & Lemos 2019a) which gives a measure of the effective number of parameters that are being constrained and can be used to quantify tensions between different experimental results (Handley & Lemos 2019b;Glanville et al. 2022).The BMD is defined such that a Gaussian constraint around a single parameter corresponds to a value of one, as does a correlated constraint between two parameters.This is demonstrated visually in Handley & Lemos (2019a) and allows us to predict the expected BMD for a given posterior by asking how Gaussian the distribution is.
The marginal BMD  is given by and a full derivation of the BMD can be found in Handley & Lemos (2019a).The quantity is the variance of the Shannon information and therefore a higher order statistic than the KL divergence.It is only weakly prior dependent and like the KL divergence is additive for independent parameters and invariant under a change of variables.

DENSITY ESTIMATORS IN PRACTICE
In principle, one can use any type of density estimator to model subspaces in a larger parameter space and calculate the marginal Bayesian statistics discussed.The only requirements are that it can be used to resample the space and approximate the normalised logprobability associated with the posterior subspace.

Masked Autoregressive Flows
Complex densities are best estimated using expressive neural networks (Papamakarios et al. 2019) that have been trained to transform between some base distribution and the target probability distribution.margarine uses Masked Autoregressive Flows (MAF, Papamakarios et al. 2017) to perform this transformation.An example Masked Autoencoder for Distribution Estimation (MADE) neural network, which forms the basis of the MAF, is shown in Fig. 1.The network works by dividing the target probability distribution into a series of one dimensional conditional distributions and approximating each conditional probability as a Gaussian where   and   are the standard deviation and mean of the conditionals.Throughout the paper, we use subscripts for dimensions ( The vectors of  and  are therefore functions of the networks weights and biases that need to be trained.We train the weights and biases using the change of variables formula where the goal is to minimize the KL-divergence between the true probability distribution of the samples  and the prediction from the network   () given by where  are the weights of the network and E  (  ) is the expectation value over ().The second term in the KL divergence is independent of the network, and so it is constant when we change the hyperparameters of the network.We can therefore focus on minimizing the first term of the KL divergence meaning that our minimization problem becomes a maximization problem over the log-probability predicted by the network for the samples on the target distribution Equivalently by a change of variables argmax where  ′ is a function of the network weights  (Alsing et al. 2019).
Here the second term is just the derivative over the network and the first term is trivially calculated from the output  and .Importantly, the MADE networks are bĳective, meaning that there is a one-to-one relationship between samples in the base distribution and target distribution.We can therefore, once trained, use them to transform samples back and forth between the two.
By chaining a series of these networks together, we create a MAF.This allows the emulator to learn more complex distributions and can be expressed using the following formalism where the index here refers to a given network in the chain.We train the chain of neural networks simultaneously with one loss function and feed the outputs of proceeding networks into subsequent networks in the chain.
To improve the accuracy of the density estimation, we can transform the target posterior samples, that we input to our MAF, into a Gaussianized space via the standard normal cumulative distribution function (CDF).The target distribution in this normalised space is a skewed and scaled version of the base distribution  0 which makes the transformation easier to learn.This is demonstrated in Fig. 2.
Due to the invariance of the KL divergence and, similarly, the BMD under a change of variables, we can calculate their values in any transformed version of the original parameter space provided the prior and posterior undergo the same change of variables.
There are a number of tunable parameters involved when designing MAFs including the number of networks, the number of layers in each network, the number of epochs and the learning rate.All of these parameters can be explored using margarine, and we use recommendations from a previous work (Alsing & Handley 2021) and set these as the defaults in margarine.

Kernel Density Estimation
A one dimensional Kernel Density Estimator (KDE) is built by summing a series of Gaussian kernels centred around each sample   in the set of samples .The KDE provides an estimate of the probability at each sample  and where there are more samples in  the sum will be larger representing a higher probability.Mathematically a 1D KDE is defined as where ℎ is known as the bandwidth and is a smoothing parameter on the Gaussian kernel  with a standard deviation given by the standard deviation of the  samples, and in the set .This scales to higher dimensions where  and   become  dimensional vectors and ℎ becomes a 2D matrix of bandwidths.We use a multivariate Gaussian kernel where ℎ is a matrix of smoothing parameters on the covariance matrix and   is equivalent to an  dimensional vector of means.Since the KDE is a sum of known kernels with a known covariance matrix and set of means,   , then the probability distribution of the samples is trivially calculable, as is its logarithm.
When generating KDEs, we perform the same type of normalisation of our target distribution as is done with the MAFs, transforming it into a Gaussianized space via a standard normal CDF (see Fig. 2).The transformation allows the density estimator to better capture the edges of very flat and uniform posteriors.
We can generate samples from trained KDEs and calculate marginal Bayesian statistics, however KDEs are not strictly bĳective.Adapting a KDE to be bĳective is a non-linear process and requires the use of a root-finding algorithm to transform from the hypercube to the KDE space.This is a process that is not currently optimized, but is fully implemented in margarine and the equivalent transformation using MAFs is much more computationally efficient.
There are fewer tunable parameters for the KDE, however we can change the bandwidth of the kernel.We use the Silverman (Silverman 1986) approach1 to calculate ℎ but note that this can be modified with margarine.In one dimension ℎ is given by where  is the standard deviation of the data and  is the number of samples Silverman (1986).In the multivariate case, this just becomes where  is the number of dimensions and ℎ   = 0 for  ≠ .

Marginal Joint Analysis
To illustrate the application of the nuisance free likelihood function, we use an example from 21-cm cosmology.The sky-averaged (global) 21-cm signal is the differential brightness temperature between the radio background,   , and the spin temperature,   , of neutral hydrogen during the Cosmic Dawn and Epoch of Reionization.  is a statistical temperature that quantifies the ratio of the number of neutral hydrogen atoms with proton and electron spins aligned versus anti-aligned.The signal manifests itself as an absorption trough against the radio background when the first stars begin to form in the early universe around redshift  ≈ 20 − 40, and Lyman- coupling drives the spin temperature down to the gas kinetic temperature.Various astrophysical processes then begin to heat the gas around  ≈ 15−25, particularly as the first X-ray sources form, and the coupled spin temperature is raised back up to and sometimes in excess of the radio background.UV emission then begins to ionize the neutral hydrogen around  ≈ 5−10, and the signal vanishes during the Epoch of Reionization.The signal has a rest frequency of 1420.4MHz and is redshifted by the expansion of the universe.Since we are interested in the evolution of the signal over the range  ≈ 5 − 50, we use radio telescopes to observe the signal in the range  ≈ 30 − 250 MHz.The global 21-cm signal is predicted to be of order 100 mK in magnitude (although there is some uncertainty in this value e.g.Mesinger et al. 2010;Barkana et al. 2018;Reis et al. 2020Reis et al. , 2021;;Muñ oz et al. 2022) and is masked by foregrounds from our own Galaxy (Bernardi et al. 2009), extragalactic radio sources (Niţu et al. 2021) and the ionosphere (Shen et al. 2021), which collectively are around five orders of magnitude brighter than the signal.To a crude approximation, the signal can be modelled with a Gaussian absorption profile.For a review of the physics of the 21-cm signal, see Furlanetto et al. (2006); Mesinger (2019).
We generate two different mock sky-averaged 21-cm data sets covering different bandwidths of 50 − 100 and 85 − 190 MHz and observing different foregrounds from different locations.We call these two experiments  and  and model the foreground as where  is a common scale factor for both experiments and  is set as −2.6 and −2.5.The model is based on the expectation that the foreground is dominated by synchrotron emission and the power law is based on observational constraints from experiments like EDGES (Mozdzen et al. 2017) We show the process being applied to two different distributions; a uniform distribution in orange and Gaussian-like distribution in purple.The samples are first min-max normalized so that the range of the distributions is 0-1.They are then pushed through the standard normal CDF to Gaussianize the samples.As can be seen for a uniform distribution, the resultant distribution is the standard normal.However, for the non-trivial distribution in purple, the result is a shifted and scaled version of the standard normal.This process is done in each dimension before training and samples drawn from the flows or KDEs are in the Gaussianized space but can easily be transformed back to the non-normalized space.smooth properties.However, the structure of the antenna beam can introduce non-smooth chromatic features into the data and more complex forward modelling of the foreground is need (Anstey et al. 2020).For simplicity, we assume an achromatic beam and thus a data set that includes a smooth foreground and Gaussian 21-cm signal.The signal in our mock data is given by where  min is the signal amplitude,  is the frequency range of the data,   is the central frequency of the absorption feature and Δ is the signal's width.We set  min = 0.25 K,   = 80 MHz and Δ = 10 MHz.We model each data set separately using the Nested Sampling algorithm polychord, equation ( 23) and a log-log polynomial foreground model given by where   are coefficients to be fitted (e.g Bowman et al. 2018;Singh et al. 2021;Bevins et al. 2022c).In practice, the foreground modelling is not always consistent across different data sets and to further emphasise that the two mock experiments are observing different parts of the sky, we assume they have different complexities and fit a 3-term polynomial to experiment  and a 4-term polynomial to experiment  for the foreground.Generally, when polynomials are being used, the order of the polynomial would be optimised for through an assessment of the Bayesian evidence.We inject Gaussian random noise into our data sets with standard deviations of 35 mK and 15 mK for experiments  and  respectively.
In our Nested Sampling runs, we use a Gaussian log-likelihood function where  D corresponds to the mock data and  corresponds to the instrument noise and is a fitted parameter.In the top row of Fig. 3 we show the sum of the noise and signal models that went into the mock datasets and the functional averages of our fitted 21-cm signals where   is the weight corresponding to sample  output by polychord.
We also show in the center right panel of Fig. 3 the resultant  21 found fitting both data sets simultaneously with where   and   are the foreground parameters used to model each data set.Through the combination of the two data sets we can see by visual inspection that we get a much better fit to the data than is achieved for each experiment individually, as is to be expected.
However, in order to do this we have had to sample over the nuisance parameters modelling the foregrounds,   and   .With margarine we can train density estimators on the posteriors from the individual fits for the core science parameters  = { min ,   , Δ} for both experiments and given we know the Bayesian evidence from the previous runs can generate nuisance-free likelihood functions for each experiment as in section 2.2.We can then sample over just the parameters of our 21-cm model without having to consider modelling the foregrounds.Since each experiment is independent of the others nuisance parameters, this approach will produce the same results as sampling equation ( 27).In practice, we could emulate P A () and the equivalent for experiment  and perform the joint analysis sampling over and we would arrive at the same posterior distribution found when sampling over L A,B (), however we would not recover the same Bayesian evidence as found when sampling over L A,B (, ).
We show the resultant  21 from the joint analysis with margarine in the center left panel of Fig. 3 and we can see that the fit is consistent with the more complex analysis.In particular, we note the approximate consistency between the Bayesian evidences for each fit.The error on the evidence for the margarine fit is likely underestimated due to uncertainty in the density estimation, however this is currently hard to quantify and left for future work.
In this illustrative example there are only a few nuisance parameters and the likelihood is analytic and quick to evaluate.However, generally speaking there are many more nuisance parameters that need to be modelled and the complexity of the models can lead to significant increases in the evaluation time per call to the likelihood (e.g.Bevins et al. 2022a).In these circumstances, performing marginal joint analysis with margarine can be significantly more efficient than sampling all over the core science and nuisance parameters.This has previously been demonstrated with data from Planck and the Dark Energy Survey (Bevins et al. 2022b) and more recently with data from HERA and SARAS3 (Bevins et al. 2023).

Prior Emulation
Since the 21-cm signal is dependent on many different astrophysical process, semi-numerical simulations can take several hours to produce a single model.This is impractical if we want to use Bayesian inference techniques like MCMC and Nested Sampling algorithms to fit real data.Signal emulators are often used as a fast alternative that can accurately model the relationship between the astrophysics and the signal structure and produce signal realizations in 10s of milliseconds (Cohen et al. 2020;Bye et al. 2022;Bevins et al. 2021b).An even cheaper alternative to emulators is to approximate the signal with a Gaussian absorption feature, as alluded to in the previous section and demonstrated in Bernardi et al. (2016) and Monsalve et al. (2017).However, it is not always clear how to set the priors on the parameters  min ,   and Δ in equation ( 23).Whilst there is some uncertainty in the theoretical modelling of the 21-cm signal, that can be overcome with this phenomenological model, we can motivate our priors on  min ,   and Δ with semi-numerical simulations using margarine.
The idea here is to take a large set of physically motivated signals generated using an emulator from a uniformly sampled prior on the astrophysical parameters describing the underlying semi-numerical simulation.For each physical signal we then calculate the depth corresponding to  min , the central frequency   and approximate the width Δ.Through this process we arrive at a physically motivated set of samples in the phenomenological parameters which we can train a MAF on using margarine, creating a physically motivated prior on our phenomenological parameters.The resultant model and prior combination is less constrained by the assumptions that go into our semi-numerical models, but still satisfies our broad expectations about the 21-cm signal.
We use the global 21-cm signal emulator globalemu (Bevins et al. 2021b) to model a set of 25,000 signals based on a parameterisation of the signal that includes   the minimum virial circular velocity that is proportional to the cube root of the minimum halo mass for star formation,  * the star formation efficiency,   the X-ray production efficiency which is proportional to the X-ray luminosity,  the CMB optical depth and   the radio production efficiency proportional to the radio luminosity.The parameterisation is discussed in more detail in Reis et al. (2020) and references therein.Whilst 25,000 signals may seem like a large number of models, the equivalent number of calls made to the emulator in a Nested Sampling or MCMC run would be several orders of magnitude larger.
In scenarios with late and inefficient star formation, corresponding to high   and low  * , and strong X-ray heating corresponding to high   in the early universe, we find that the signal cannot be well approximated by a Gaussian profile because it features a weak absorption feature and strong emission.We therefore filter out these scenarios before training our MAF and of the 25,000 models we initially started with, 88% are used to produce our physically motivated prior.
In Fig. 4 we show the distribution on log 10 (| min |),   and Δ generated from the physical models and samples from the corresponding MAF which can be used as a prior on the Gaussian model.
In 21-cm cosmology, instrumental effects can introduce nonsmooth structure to the data.These effects are often sinusoidal (e.g. Hills et al. 2018;Sims & Pober 2020;Singh & Subrahmanyan 2019;Bevins et al. 2021aBevins et al. , 2022a) ) and if not corrected for they can affect the recovery of the 21-cm signal.For example, part of a damped sinusoidal structure in the data with a periodicity of 5 MHz can be modelled with the approximate Gaussian signal profile discussed in this paper if the prior on the width is wide.However, from Fig. 4 we see that a signal profile of width 5 MHz is not likely to represent a physical scenario.The goal of the physically motivated prior is to therefore provide some information about the anticipated signal that can help to separate it from and prevent fitting systematic structures in the data.

Marginal Bayesian Statistics
Next, we turn our attention to the calculation of marginal Bayesian statistics with margarine.

Analytic Examples
First, we investigate a simple two-dimensional multivariate normal distribution with means of zero and standard deviations of one with an analytic KL divergence and BMD.This should be trivial for both the MAF and the KDE to learn, as it resembles the standard normal distribution used for the base of the MAF and the kernel used for the KDE.The analytic KL divergence for this example is given in

Difference
Figure 3.The top panels in the graph shows two simulated data sets comprising noise and a 21-cm signal as orange lines.Note that the data sets also contain foregrounds but we do not show them here since they are several orders of magnitude larger than the 21-cm signal.In the same panels in purple, we show the averages of the functional posteriors derived from Nested Sampling fits to both of these data sets.Neither fit recovers the true signal exactly, however when we analytically combine the full likelihoods for each experiment, including foregrounds, and jointly fit the data we get a much better agreement across the whole frequency range as can be seen in the center right panel.We show the results of this joint analysis as an orange line.For comparison, we show in the center left panel, as an orange line, the average of the functional posterior from a joint fit performed with margarine using the nuisance-free likelihood function.We can see that the joint fits are approximately equivalent, more accurate than the fits to each individual experiment, particularly experiment , and the Bayesian evidences are similar in magnitude when using the full likelihood and the nuisance-free likelihood.We show the difference, which is at most an order of magnitude lower than the expected noise in a global 21-cm experiment, between the two fits in the bottom panel as a function of frequency.Handley & Lemos (2019a) as where  is the prior volume and Σ is the covariance matrix of the multivariate normal distribution.For a multivariate Gaussian the Bayesian Model Dimensionality is equal to the number of dimensions.
We draw samples from this multivariate normal and train both a MAF and a KDE on the distribution.The calculated statistics are shown in Tab. 1 assuming a uniform prior that encapsulates the range of the samples.Both the KDE and the MAF produce estimates of the KL divergence in close agreement with the analytic solution.The MAF performs much better when estimating the BMD and the true analytic BMD is within the one sigma error.
The errors in D and  are estimated by propagating both samples generated with the density estimator and the original samples through the density estimator to evaluate the log-probabilities and comparing the resultant statistics.If the density estimator is a perfect representation of the original distribution, then we would expect samples drawn from it to be from the same distribution as the origi- Original Samples MAF Figure 4.A physically motivated prior on the phenomenological parameters of a Gaussian absorption feature used to model the sky-averaged 21-cm signal.  is the central frequency of the absorption feature,  min is the corresponding temperature, and Δ is the approximate width of the 21-cm signal.The prior is derived by approximating the phenomenological parameters for a set of physically motivated signals generated with a neural network emulator and learning the corresponding distribution, in orange, with a MAF.Samples from the MAF are shown in purple.The resultant prior and model combination is semi-independent of the assumptions that go into the semi-numerical simulations that the emulator is trained on and quicker to evaluate than the emulator.
nal samples.This would lead to an approximate equivalence between the two log-probability distributions as functions of the associated Nested Sampling weights (Harrison et al. 2015) and so any deviation we see gives us an indication of the error in our marginal statistics.
In addition to the multivariate Gaussian example, we look at another two-dimensional distribution with an analytic solution where the 'posterior' for one parameter follows a triangle distribution and the other is unconstrained following a uniform distribution.In this case, the KL divergence is given as D = log 2 − 0.5 and the BMD as  = 0.5.We see a good level of agreement between the analytic KL and BMD estimates from margarine, with both the MAF and KDE assuming a uniform prior as before.Again, however, it appears that the MAF performs better when estimating the BMD in comparison to the KDE.

Accuracy vs Samples and Dimensions
Using the multivariate Gaussian example, we next assess how the accuracy of the KL divergence and BMD estimates from margarine scale with the number of samples in the distribution and the number of dimensions.In the previous example, we used a mean of zero and a standard deviation of one, however in this section we randomly assign a mean between -2 and 2 for each dimension and a standard deviation of between 0.1 and 1.We use a diagonal covariance matrix.Fig. 5 shows how the accuracy of the KL estimates, left panel, and BMD, right panel, increase with increasing number of samples for a five dimensional problem.We find that both the KDE and the MAF are able to accurately recover the analytic solution, black dashed lines, for the KL divergence for sample sizes ≳ 100.We also see that the MAF is more confident and more accurate in its estimate.For sample sizes ≳ 300, the MAF recovers the BMD well however the KDE consistently underestimates it regardless of the number of samples in the training data.The KDE errors are however larger than for the MAF acknowledging its inability to recover the analytic solution.
In Fig. 6, we explore how the number of samples needs to scale with the number of dimensions for the MAF and KDE to accurately recover the Bayesian statistics.We use the same multivariate Gaussian example with random means and standard deviations but vary the number of dimensions from two to ten.We quantify the level of accuracy using the fractional difference Again it is clear that both the KDE and the MAF are much better at estimating the KL divergence than the BMD likely because the BMD has a stronger dependence on the predicted log-probability and hence on the accuracy of the density estimation.However, the MAF is consistently better at estimating the BMD than the KDE and requires fewer samples than the KDE to accurately recover the BMD.
As expected, when we increase the number of dimensions in the problem, we require more samples to maintain the same level of accuracy.However, this relationship is stronger for the BMD and stronger still for the KDE than the MAF indicating that the MAF is much more expressive

Toy Example Posteriors
Following testing on analytic problems, we generate the examples using polychord and a Gaussian likelihood, both with five parameters, and we show the corresponding distributions in orange in Fig. 7.The first (left-hand panel) has a series of correlations between the parameters and involves sampling the likelihood where    corresponds to parameter  sample  assuming a uniform prior between -1 and 1 for each.The second (right-panel of Fig. 7) has a combination of uncorrelated parameters with both log-uniform and uniform priors on the parameters (right-hand panel) and involves sampling a Gaussian likelihood We are able to use the samples from polychord and the analysis tool anesthetic (Handley 2019) to calculate the KL divergence and BMD for both toy examples and these values are reported, with errors, in Fig. 7.We show samples drawn from a KDE trained on the correlated samples and from a MAF trained on the uncorrelated samples, with values of D and  listed for both types of density estimator.
For the correlated samples, we see that the estimated KL divergence and BMD from the MAF and from the KDE are in close agreement with the 'true' value output from anesthetic.This data set is made of ≈ 7000 samples and is therefore well within the regime where we expect the MAFs and KDEs to provide an accurate KL divergence estimate and the MAF to produce an accurate BMD estimate (see Fig. 6.For the uncorrelated samples, we also see that the MAF KL divergence and BMD estimates are in agreement with the anesthetic values.For the KDE, the KL divergence and BMD are less consistent with the values from the MAF and from anesthetic, however we find we have larger errors when using the KDEs likely because they are less expressive than the MAFs.For this example, we have fewer samples, ≈ 4000 but we still expect the estimates of the statistics to be accurate, with the exception of the KDE BMD estimate.
Typically, the upper and lower bounded range on the BMD estimates are larger than for the KL divergence estimates because it has a more complex dependence on log L and therefore on the accuracy of the density estimation as previously discussed.Future work is needed to investigate whether improvements can be made to the BMD estimates by modifying the MAF loss function, network configurations or KDE bandwidths among other tunable parameters.

REACH
REACH is an experiment aiming to detect the sky-averaged 21-cm signal in the frequency range 50 − 170 MHz (de Lera Acedo et al. 2022).
The REACH data analysis pipeline uses Nested Sampling, implemented with polychord, to model the different components of the data.The pipeline has been extensively tested on mock observations (Anstey et al. 2020;Anstey et al. 2022;Cumner et al. 2021;Scheutwinkel et al. 2022b,a).We generate mock observational data using a realistic foreground model derived from an allsky map and inject a Gaussian signal profile with an amplitude of | min | = 155 mK, a central frequency of   = 85 MHz and a standard deviation of Δ = 15 MHz.
The mock data correspond to a single snapshot observation taken from the Karoo radio observatory with a dipole antenna and modelled with a foreground model that takes advantage of the spectral structure of the sky, a correction for the non-uniform response of the antenna to the sky, Gaussian noise and a Gaussian signal profile corresponding to a 16 dimensional parameter space.In the left-hand panel of Fig. 8 we show the posterior distribution for the signal parameters from our fit in orange, having marginalised over the other 13 parameters, and the corresponding KDE reconstruction from margarine is shown in purple.We see a reasonable consistency between the marginal KL divergence calculated for these samples when using both the KDE and MAF.
However, we note that there are some differences in the BMD estimate, with the MAF giving a larger value for .From a visual inspection of the samples, we would expect the BMD to be around 1-2 due to the tight Gaussian constraint on   and weaker but still nontrivial constraint on | min |.Both estimates are, therefore, somewhat consistent with expectations.For this example, we have ≈ 2500 samples and the levels of accuracy for the MAF and KDE that we see are again consistent with predictions from Fig. 6.

Dark Energy Survey
The Dark Energy Survey (DES) is designed to help us understand why the Universe's rate of expansion is accelerating.The goal of the Here the KDE and MAF are being trained on a five dimensional multivariate Gaussian with means randomly chosen from the range -2 to 2 and standard deviations chosen randomly from the range 0.1 to 1.The true KL divergence and BMD shown by the black dashed lines are analytically calculated using equation ( 30) and equal to the number of dimensions in the distribution respectively.We see that both the KDE and MAF are able to accurately recover the KL divergence for sample sizes ≳ 100 and the MAF performs well for the BMD with sample sizes ≳ 300.The KDE significantly underestimates the value of the BMD, however, this is acknowledged by the larger error bars.collaboration is to map millions of galaxies, thousands of supernova and large scale cosmic structures in order to help understand the dark energy which makes up 70% of the universe.The survey has targeted the measurements of the dark matter and dark energy densities as well as the dark energy equation of state The Year 1 DES analysis2 (Handley & Lemos 2020) aimed to constrain the baryon density parameter, Ω  , the dark matter density parameter, Ω  , the approximate ratio of the sound horizon to the angular diameter distance,   , the optical depth of the CMB to reionization, , and the amplitude and tilt of the power spectrum,   , and   .The samples from the cosmological subspace are shown in orange in the right-hand panel of Fig. 8, where we have marginalised over a set of 20 nuisance parameters, along with a MAF reconstruction of the subspace in gray.
Unlike the toy examples and REACH samples, the priors on the DES analysis are non-uniform, astrophysically informed and cannot easily be transformed into a space in which the priors are uniform.As a result, we have to use the density estimators built into margarine to evaluate both the log-probability of the posterior and the prior if we want to calculate marginal statistics.
While this is possible, it is expected to lead to an increased uncertainty in the marginal statistics, which can be seen in Fig. 8.In addition, the problem is further complicated by the size of the cosmological parameter space, since we expect larger parameter spaces to be harder to replicate well with margarine, and consequently we expect larger errors on the marginal statistics.
Again, the density estimators give us different estimates for the BMD, however the errors are large.Here we have ≈ 7000 samples and six dimensions, and from Fig. 6 we would expect the BMD estimate to be better from the MAF.From a visual examination of the distribution, we would expect the BMD to be between ≈ 3 and ≈ 4 given the Gaussian nature of the 1D posteriors on Ω  ℎ 2 , Ω  ℎ 2 ,   and log(10 10   ).There is a reasonable agreement between the KL divergence calculated for the cosmological constraints from DES with both the MAF and the KDE.

CONCLUSIONS
In this paper we have demonstrated a number of applications of two different types of density estimator, Masked Autoregressive Flows and Kernel Density Estimators, to the calculation of marginal Bayesian statistics, the efficient modelling of multiple data sets and the derivation and emulation of physical priors.margarine is a multipurpose tool that can be used to enhance our Bayesian analysis workflows.
The evaluation of marginal KL divergences and Bayesian Model Dimensionalities allows for improved comparison of the constraining power of different experiments targeting the same astrophysical or cosmological signals with different systematics or nuisance parameters.In turn, this can lead to improvements in experimental design with techniques that provide more information about the signals of interest being pursued in the future.The calculation of marginal Bayesian statistics has been illustrated in Scheutwinkel et al. (2022a), Anstey et al. (2023) and Bevins et al. (2022c).
We find that the MAFs are much more expressive than the KDEs and perform better when estimating the KL divergences and BMDs.KDEs, however, perform well when estimating the KL divergence in low dimensions with large sample sizes as illustrated in Fig. 6 and are quicker to train than the MAFs.In practice, we would advise the  31) to estimate the accuracy and compare the predicted statistics from margarine with the analytic solutions given by equation ( 30) and the dimensionality of the distributions.The top row shows the KL divergence, the bottom row the BMD, the first column the accuracy for the MAF and the second column the accuracy for the KDE.We see a stronger dependence of the accuracy of the predicted statistics on the number of dimensions and the sample size for the KDE than for the MAF demonstrating that the MAF is much more expressive.We also see a higher level of error for the BMD than for the KL divergence, as previously seen in Fig. 5, but note that the fractional error in the MAF BMD estimates is typically much lower than for the KDE.The increased noise across dimension and sample sizes for the MAF is likely due to the random initialisation of the neural network weights.
use of MAFs over KDEs because of their higher degree of accuracy and better scaling with dimensions and sample sizes.The nuisance-free likelihood function allows for more efficient combination of constraints from different data sets, preventing the need to sample instrument specific foreground and systematic effects in joint analysis.This was demonstrated with data from the Planck and Dark Energy Survey experiments in Bevins et al. (2022b) and more recently with data from the 21-cm power spectrum experiment HERA and sky-averaged 21-cm experiment SARAS3 (Bevins et al. 2023).
Finally, margarine can be used to generate non-trivial priors (e.g.Alsing & Handley 2021) either from experimental results or simulations, as illustrated in this paper.In principle, this allows us to inform the analysis of data from upcoming probes like REACH (de Lera Acedo et al. 2022) or the Simons Observatory (Ade et al. 2019;Lee et al. 2019) with results from existing experiments like SARAS3 (Bevins et al. 2022c) and Planck (Planck Collaboration 2020).
We anticipate further development of margarine and additional unexpected applications that evolve as the code is used (e.g.Bevins & Handley 2023).

ACKNOWLEDGEMENTS
HTJB acknowledges the support of the Science and Technology Facilities Council (STFC) through grant number ST/T505997/1 and support from the Kavli Institute for Cosmology, Cambridge and the Kavli Foundation.WJH and AF were supported by Royal Society University Research Fellowships.PHS acknowledges support from a Trottier Space Institute Fellowship and the Canada 150 Research Chairs Program.EdLA was supported by the STFC through the Ernest Rutherford Fellowship.JA was supported by the research project grant "Fundamental Physics from Cosmological Surveys" funded by the Swedish Research Council (VR) under Dnr 2017-04212.

DATA AVAILABILITY
The DES samples are available at https://zenodo.org/record/4116393#.Y9zfPC8RpKM.All other data is available upon reasonable request to HTJB.The signal subspace, in orange, for a fit to simulated observations of the 21-cm signal with the REACH experiment along with samples, in purple, from KDE trained on the three signal parameters effectively marginalising over the other thirteen parameters in the fit.We report the marginal KL divergence and BMD found for this set of three parameters using both a MAF and KDE.Right Panel: The cosmological subspace for the Year 1 DES samples shown in orange, along with a set of samples taken from a trained MAF in gray.Again, we report the corresponding marginal Bayesian statistics calculated with margarine.

Figure 1 .
Figure 1.The Masked Autoencoder for Distribution Estimation (MADE) architecture, when trained appropriately, transforms samples from a complex probability distribution  (  ) on to samples from a standard normal distribution under the assumption that the probability distribution  (  ) can be broken into conditional one-dimensional Gaussian probability distributions.Many different MADE networks can be stacked together and trained to produce a normalising flow increasing the expressivity of the density estimation.By definition, normalising flows are bĳective, and a trained implementation can be used to calculate log  (  ) and draw samples from  (  ).At each hidden layer in the MADE we use a 'tanh' activation function.

Figure 2 .
Figure 2. The figure demonstrates how a 1D distribution is Gaussianized to aid with training of the Masked Autoregressive Flows and Kernel Density Estimators.We show the process being applied to two different distributions; a uniform distribution in orange and Gaussian-like distribution in purple.The samples are first min-max normalized so that the range of the distributions is 0-1.They are then pushed through the standard normal CDF to Gaussianize the samples.As can be seen for a uniform distribution, the resultant distribution is the standard normal.However, for the non-trivial distribution in purple, the result is a shifted and scaled version of the standard normal.This process is done in each dimension before training and samples drawn from the flows or KDEs are in the Gaussianized space but can easily be transformed back to the non-normalized space.

Figure 5 .
Figure5.The figure illustrates how the KL divergence and the BMD estimates from margarine converge on the analytic values for a toy example with increasing number of samples in the training distribution.Here the KDE and MAF are being trained on a five dimensional multivariate Gaussian with means randomly chosen from the range -2 to 2 and standard deviations chosen randomly from the range 0.1 to 1.The true KL divergence and BMD shown by the black dashed lines are analytically calculated using equation (30) and equal to the number of dimensions in the distribution respectively.We see that both the KDE and MAF are able to accurately recover the KL divergence for sample sizes ≳ 100 and the MAF performs well for the BMD with sample sizes ≳ 300.The KDE significantly underestimates the value of the BMD, however, this is acknowledged by the larger error bars.

Figure 6 .
Figure 6.The accuracy of recovered statistics for a multivariate Gaussian distribution with random means and standard deviations per dimension as a function of sample size and dimension.We use equation (31) to estimate the accuracy and compare the predicted statistics from margarine with the analytic solutions given by equation (30) and the dimensionality of the distributions.The top row shows the KL divergence, the bottom row the BMD, the first column the accuracy for the MAF and the second column the accuracy for the KDE.We see a stronger dependence of the accuracy of the predicted statistics on the number of dimensions and the sample size for the KDE than for the MAF demonstrating that the MAF is much more expressive.We also see a higher level of error for the BMD than for the KL divergence, as previously seen in Fig.5, but note that the fractional error in the MAF BMD estimates is typically much lower than for the KDE.The increased noise across dimension and sample sizes for the MAF is likely due to the random initialisation of the neural network weights.
Figure 8. Left Panel:The signal subspace, in orange, for a fit to simulated observations of the 21-cm signal with the REACH experiment along with samples, in purple, from KDE trained on the three signal parameters effectively marginalising over the other thirteen parameters in the fit.We report the marginal KL divergence and BMD found for this set of three parameters using both a MAF and KDE.Right Panel: The cosmological subspace for the Year 1 DES samples shown in orange, along with a set of samples taken from a trained MAF in gray.Again, we report the corresponding marginal Bayesian statistics calculated with margarine.

Table 1 .
The table shows the KL divergence and BMD estimated by margarine for two trivial two-dimensional distributions, with analytic KL and BMD values for comparison.We see a good level of agreement between the analytic values and margarine although the MAF typically performs better when estimating the BMD.