Single frequency CMB B-mode inference with realistic foregrounds from a single training image

With a single training image and using wavelet phase harmonic augmentation, we present polarized Cosmic Microwave Background (CMB) foreground marginalization in a high-dimensional likelihood-free (Bayesian) framework. We demonstrate robust foreground removal using only a single frequency of simulated data for a BICEP-like sky patch. Using Moment Networks we estimate the pixel-level posterior probability for the underlying {E,B} signal and validate the statistical model with a quantile-type test using the estimated marginal posterior moments. The Moment Networks use a hierarchy of U-Net convolutional neural networks. This work validates such an approach in the most difficult limiting case: pixel-level, noise-free, highly non-Gaussian dust foregrounds with a single training image at a single frequency. For a real CMB experiment, a small number of representative sky patches would provide the training data required for full cosmological inference. These results enable robust likelihood-free, simulation-based parameter and model inference for primordial B-mode detection using observed CMB polarization data.


INTRODUCTION
Polarized galactic foregrounds are the most significant challenge for the detection of the polarized cosmic microwave background (CMB) B-mode signal of primordial gravitational waves.Such a detection would provide a direct probe of cosmic inflation in the early Universe.
Typically, the cosmological signal is recovered using multifrequency observations, for which there have been numerous recent developments in the context of Planck and ground-based experiments (Planck Collaboration et al. 2020a;Darwish et al. 2021).The sensitivity expected from the next generation of experiments poses further challenges (Ade et al. 2019;CMB-S4 Collaboration et al. 2020).
A significant hindrance is the difficulty either numerically simulating the polarized galactic dust emission or constructing analytic models.In particular, if it were possible to generate synthetic realizations of the dust polarization signal, the task could be remarkably simplified.With the ability to forward model by generating mock data, the use of simulation-based methods to perform likelihood-free inference of unknown parameters is relatively straightforward.
Generating a significant number of polarized foreground signal realizations has been so far impossible.Numerical simulations are computationally costly and fraught with difficulties.Furthermore, low-noise and high-precision observational data is also limited.
In this Letter, we show that with only a single training image of polarized dust emission we can use wavelet phase harmonic synthesis to generate new representative samples.These samples form a training set with which we can perform likelihood-free inference.We test the high-dimensional posterior probability space of pixel values directly, which proves to be a powerful general validation of this combined synthesis and likelihood-free approach.
★ E-mail: niall.jeffrey@phys.ens.frUsing only a single frequency (143 GHz) we use a hierarchy of convolutional neural networks to form a Moment Network (Jeffrey & Wandelt 2020) to estimate posterior mean and marginal variance per pixel of the extra-galactic CMB signal given the observed polarization data.Summary of our approach: • Simulate a single polarized foreground data patch.
• Synthesize fast realizations using wavelet phase harmonics to form a large augmented set of foreground maps.• Draw cosmological parameters from a prior probability distribution and generate mock CMB signals.
• Draw foreground amplitude parameters from a prior probability distribution to match a BICEP-like sky at a single 143 GHz frequency.• Train Moment Network to estimate the marginal posterior mean and variance for the CMB signal for each pixel.• Validate the per-pixel posterior estimates with a quantile-type test using new unseen simulated (not synthesized) foreground data.By validating the pixel-by-pixel statistical model, we have demonstrated that this approach is robust for parameter inference and, by extension, tensor-to-scalar  detection.
For application to real data, the inference task will be in many ways easier.Here we have chosen the set-up to be particularly difficult to demonstrate the strengths of the method: noise-free, a single training image, and a single frequency.Measurement noise will Gaussianize the data, so would limit our demonstration of highly non-Gaussian dust foreground removal.Aylor et al. (2020) previously demonstrated an adversarial networks, using > 10 3 cleaned (Planck) training images, for total intensity foreground cleaning.Using the results of this Letter, we can now use very few representative foreground data patches for polarization (B-mode) analysis as part of our new Bayesian validated likelihoodfree approach.

Overview:
In Bayesian parameter inference of unknown parameters , we aim to evaluate the posterior probability distribution for some statistical model M given the observed data (or summary statistics of the observed data) d  (see Jaynes 2003 for details).
In a typical analysis, the likelihood L () = (d  |, M) is assumed or modelled analytically.In likelihood-free inference (also known as simulation-based inference) the likelihood is not assumed.Instead, the sampling distribution of the data (d|, M) as a function of the unknown parameters is estimated from forward modelled data.
With density estimation likelihood-free inference (Papamakarios & Murray 2016;Alsing et al. 2018Alsing et al. , 2019)), the inference task is posed as a density estimation problem.The simulated mock data d realizations and their respective parameters  form a cloud of points in {d,  } space.In this space we could estimate the following distributions: the joint (d, , M), the conditional ( |d, M), or the conditional (d|, M) known as the sampling distribution (which becomes likelihood if evaluated for observed data d  ).
Provided mock data  can be generated, with each realization having an associated  label, the problem of inference given the actual observed data is relatively straightforward.Using this density estimation approach, a number of cosmological analysis have now been carried out (e.g.Brehmer et al. 2019;Jeffrey et al. 2021;Kodi Ramanah et al. 2020;Lemos et al. 2021) that infer model parameters  without needing to assume or approximate a closed-form likelihood.
In this work, thanks to wavelet phase harmonic synthesis (section 2.3.3),we now have the means to generate fast realizations of dust foregrounds for polarization CMB analysis (given our single training image).However, rather than using likelihood-free inference to infer only a set of cosmological parameters, here we choose to demonstrate a pixel-by-pixel inference at the level of the map as a powerful test of this approach.2.1.1Moment Networks: These are a simple hierarchy of fast neural regression models that compute increasing moments of any lowerdimensional marginal posterior density (Jeffrey & Wandelt 2020).
For a pixel-level inference, each unknown pixel value of the underlying CMB signal s is a parameter to be inferred.The probability space of all pixels in a given  signal map s  has a dimension of D = 256 × 256 > 6.5 × 10 4 .Evaluation of the full joint posterior probability at such a dimensionality D is intractable.
Moment Networks side-step the problem of estimating the posterior density.Particularly useful for high-dimensional parameter spaces, they directly estimate moments of the marginal posterior distribution of single parameters or subsets of parameters.For example, we could estimate the mean and variance of the marginal posterior probability for each pixel of some unknown signal  = s, where   is any given pixel (element of the signal vector) and  are all other pixels.We could also estimate moments (e.g.covariance) of two-dimensional marginal posteriors for pairs of parameters: We require that the unknown parameters are drawn from prior distribution s  ∼ (s|M) and, through the forward model, the associated data are drawn from the sampling distribution d  ∼ (d|s  , M).
The first layer of the hierarchy in the Moment Network finds some function F ( ) of our data that minimizes a squared loss over the distribution of possible training examples {d  , s  }, then F , which is a neural network, evaluated for the observed data is the mean of the posterior distribution F (   ) =   ( |  ) .Moment Networks allow the use of far simpler neural network architectures, which reduces training failure risks and improves inference speed, and have recently been applied to Cosmic Void inference (Kreisch et al. 2021).
In this work, at the next level of the hierarchy, the function G minimizes for fixed, already trained F .If  1 is minimized over the training data (drawn from the correct prior and forward model) then G(   ) gives the posterior variance for the marginalized posterior of the signal for each pixel.This result is exact and independent of the true underlying posterior or prior distributions (Jaynes 2003;Adler & Öktem 2018).
In this work, we train the first network F of the two-layer hierarchy to return   =    ( |  ) , which is the mean of the marginal posterior of the signal s  for each pixel given the data {d  ,d  }: The second network G gives the variance

𝐵
of the marginal posterior distribution for each pixel of s  given the data {d  ,d  }: In this work we show the result for s  here as it is the most significantly contaminated by polarized dust foregrounds.Both stages of the Moment Network use a UNet convolutional neural network from the DeepMass1 package (Jeffrey et al. 2020) -see section 3.

CMB data model
We select a flat Λ-Cold Dark Matter (ΛCDM) cosmological model for this analysis.This choice of model is not a requirement for this approach; one could choose any cosmological model M provided it is possible to generate mock observables where the unknown cosmological parameters   are drawn from some prior probability distribution (  |M).
The parameters used for the CMB signal generation must be drawn from the assumed prior  , ∼ (  |M), so that the Moment Network output matches the expected properties of the posterior probability distribution.In this work we choose to use the Planck Collaboration et al. (2020b) base_plikHM_TTTEEE_lowl_lowE2 parameter posterior distribution as our prior.
Our sampled cosmological parameters are: the Hubble parameter  0 , the baryon density Ω  , the cold dark matter density Ω  , the Reionization optical depth , primordial amplitude   , scalar spectral index   .We fix the tensor-to-scalar ratio parameter  = 0, but this could be easily allowed to vary over a wide prior range for such an analysis.We thin the Markov chain Monte Carlo Planck chains uniformly to sample approximately 3000 sets of parameters  , .For each set of parameters drawn from the prior, we calculate the theoretical CMB polarization   and  power spectrum  ℓ ( , ) using the software (Lewis et al. 2000;Howlett et al. 2012).For each  ℓ ( , ) sample, we generate a full-sky Gaussian realization of the  and  signals with HEALPix (Górski et al. 2005).These Gaussian signal realizations do not include higher-order N-point corrections caused by any primordial non-Gaussianity or gravitational lensing, and therefore would be insufficient for a full delensing analysis, though such signals would not be difficult to simulate if required (Millea et al. 2020).From each full-sky HEALPix map (resolution nside=1024), we extract 18 non-overlapping patches using gnomonic projection for both   and   .Each patch is 20 • × 20 • to match a BICEP-like survey area (BICEP2 Collaboration et al. 2018) with 256×256 pixels.Each   and   map is convolved with a Gaussian beam smoothing with the size of a single pixel width ( ≈ 4.69 arcmin).In total, we generate 52,224 sets of {  ,   } maps.
In reality, the CMB polarization signal is measured in terms of the {, } Stokes fields, rather than the pseudo-scalar{, } fields (Zaldarriaga 2001), which generally introduces artefacts when transforming between these representations for masked or non-periodic data.However, this is already a solved problem in a single frequency, unlike polarization dust foreground marginalization.Again, though this --leakage effect should be included for analyses with actual observational data, adding this effect here would diminish our clear demonstration of non-Gaussian foreground marginalization.
The left panel of Figure 1 shows an example  signal simulation.The centre-left panel shows the same map with simulated foregrounds significantly obscuring the signal.

Wavelet phase harmonic synthesis:
Using a single simulation of the polarized dust foregrounds f *  , f *  , we synthesize many further realizations using wavelet phase harmonic synthesis {f  , f  }.The original simulated foreground image f *  , f *  is of a polarization signal as produced by the thermal emission of dust in the diffuse ISM.It is built from the same magnetohydrodynamic (MHD) simulation as described in Regaldo-Saint Blancard et al. (2020).
Wavelet phase harmonic statistics are descriptions of signals that have been designed taking advantage of comparisons between convolutional neural networks and non-linear harmonic analysis operations.These statistics form models that can be used to generate new realizations of a given signal.
We can define the wavelet phase harmonic coefficients  of a foreground map in relation to a wavelet phase harmonic operator, where  = (f * ).These statistics are designed to characterize the coherent structures that appear in non-Gaussian random fields, by quantifying the phase alignment between different scales (Zhang & Mallat 2021;Mallat et al. 2020).For cosmic-web density fields, Allys et al. ( 2020) demonstrated that these statistics include most of the information captured by various high-order statistics.
For a detailed mathematical description, we refer the reader to Appendix A of Regaldo-Saint Blancard et al. (2021), in which these statistics were measured for noisy dust polarization Planck data.However, for an overall understanding of wavelet phase harmonic statistics, the reader may consider that these statistics emulate similar properties to information capture in a convolutional neural network.Unlike neural networks, these statistics require no training, so it is possible to estimate the phase harmonic coefficients from our single training image  * = (f * ).New synthesized f maps are generated by minimizing a loss, with f initialized as Gaussian white noise.Using differentiable  operators, the pyWPH3 code uses gradient-based optimization.
As we are working in ,  space to synthesize, we can apply the operator directly on the complex field (as in Regaldo-Saint Blancard et al. 2021).Here we extend this to include the individual components, giving a new loss: A further development of phase harmonic synthesis in this work is the use of differentiable histograms to match the Kullback-Leibler divergence of the 1-point distributions of the synthesized f  and f  fields with the target (see Appendix B -Supporting Materials).
These techniques combine to give a realistic new set of synthesized polarized foregrounds that match the single initial training simulation.The robustness of the synthesized maps for this problem is tested in the following section.

Foreground amplitude prior:
We match the synthesized foreground maps, which have 512×512 pixels, to a 40 • × 40 • area.For each periodic synthesized map, we transform from ,  to , .We then select sub-patches of 256×256 pixels to break the periodic boundary conditions and to match the survey size of the simulated CMB signal maps.From 102 full 512×512 synthesized foreground maps, we use data augmentation methods of image flips, translations Figure 2.Each pair of panels on the left and right correspond to the validation data examples "A" and "B" respectively.The right panel of each pair shows the power spectra of data components and residuals.The left panel shows the ratio of the true power to the power of the residuals, between the posterior mean and the truth (orange solid line) or between the data and the truth (blue dashed line);  (ℓ)  ℎ / (ℓ)   can be interpreted as signal-to-noise (see section 3.2 for discussion).If interpreted as a foreground-cleaned map, the posterior mean per pixel shows a significant improvement over the data, as would be expected.
and rotations (applied equally to f  and f  ) to generate a set of 52,224 foreground maps of size 256×256.
These synthesized foreground maps are matched to realistic foreground properties for a BICEP-like experiment.We model the polarized dust foreground amplitude of the BICEP field using the Planck analysis values and uncertainties.We first assume that the large scale foregrounds are known well enough that we can subtract the mean foreground signal so that f  = f  = 0 for each map.The foreground amplitude parameters   and   are defined with respect to a power-law spectrum: The Planck analysis values and uncertainties for each   contribute to our prior (   ,   ).As a conservative leastinformative approach, we assume independent Gaussians: (   ) = N (0.358, 0.0917 2 ) & N (0.172, 0.0440 2 ), where the model and values are derived from Planck Collaboration et al. (2020c) section 3.2.
To match these amplitudes defined on the celestial sphere to the pixel variance of our foregrounds patches, we generate 1000 fullsky Gaussian realizations of foreground maps with power spectrum parameters drawn from the above prior.From these full-sky realizations, we create patches with the correct angular size and resolution (these realizations are then discarded).This Monte Carlo procedure transforms the prior distribution with respect to the power spectrum amplitude parameters (   ,   ) to a prior on the pixel standard deviation of the patches (  ,   ).The synthesized foreground maps { f  , , f , } are rescaled to match the sampled amplitudes.

Marginal posterior moments
The previous section described the statistical model that generated mock data samples.For each  sample, we have: foreground contaminated data maps (d  , , d , ), clean signal maps (s  , , s , ), associated cosmological and foreground parameters (  , , , ,  , ).
In this work, we target the posterior probability distribution of the pixel values of the -mode signal given the data (s  |d  , d  ) with all other parameters marginalized away.We choose the -mode signal for this demonstration as the signal power is much weaker than -mode (by a factor of 300) so is most obscured by polarized dust foregrounds.By validating the posterior probability at pixel-level, this is a powerful demonstration of the robustness of the synthesized forward model and the likelihood-free inference.
For both stages of the Moment Network, the posterior mean and variance networks, F (d  , d  ) and G(d  , d  ), we train the UNet convolutional neural network from DeepMass for 100 epochs with varied batch size and learning rates (Convergence is achieved at epoch 50).From the synthesized training samples we retain 5120 as a test set to confirm that over-fitting does not occur.Moment Networks are fast and inexpensive, taking 94 seconds per epoch to train with an Nvidia V100 32GB and evaluation of moments taking ≈ 1ms per new data map.
To validate the inference we do not use further synthesized images, but we take a new unseen simulation.We use four non-overlapping foreground patches from an approximately-independent snapshot of the same steady-state MHD simulation (see section 3.1 of Regaldo-Saint Blancard et al. 2020 for details).The same steps of the forward model are applied to these new patches, including drawing the foreground amplitude from a prior for each map.Four new simulated CMB signals are added to generate four new data sets, which are labelled A, B, C, and D.
Figure 1 shows the results of the trained Moment Network applied to validation data A. The left panels show the clean target s  and the foreground-contaminated data d  , though the posterior is conditioned on the -and -mode data.This figure is a demonstration of 5 of the 6 targets of the "Summary of our approach" described in the Introduction (section 1).
Though naively the behaviour of the mean map appears to match what would be expected from a linear filter (e.g.Wiener filtering of noise), here, at a single frequency, the ability to distinguish between target signal and the statistically-homogeneous foregrounds is entirely due to non-Gaussianity.Appendix B (Supporting Materials) shows the same example with a Gaussian foreground model.

Posterior mean as foreground cleaning
The aim of this high-dimensional likelihood-free inference approach is to estimate properties of the pixel posterior distribution.By validating this procedure, we can be confident in the forward model and the inference scheme for cosmological parameters or models.
However, the mean of the pixel posterior distribution is expected to be close to the true pixel values.As such, we can interpret the centre right panel of Figure 1 as a foreground-cleaned map.It is important to note that the mean of the posterior is not a sample from the posterior, so it does not have the properties of in-painting; pixels with high uncertainty are expected to have values close to the global mean 0.This occurs when the foreground signal is particularly obscuring, which are accompanied by an appropriate increase in the marginal variance per pixel.We can test the accuracy of the recovered pixel values, provided we remember they are the mean values of a high-dimensional posterior probability (s  |d  , d  ).
The two left panels of Figure 2 show results from validation data A and the two right panels show result from validation data B. The right panel of each pair, shows the power spectra of the components and residuals.We can see that the signal power dominates the foreground at ℓ ≈ 1000 for validation data A but the foregrounds always dominate for validation data B. Though not shown, example C is neither foreground-nor signal-dominated and D is signal dominated.The residuals are defined as the difference between the truth and a given map, either the original data   or mean posterior map   .The left panels of each pair show the power spectrum of the truth divided by the power spectrum of the residual; this quantity (truth divided by an effective error) can be interpreted as a signal-to-noise.We clearly see that if we take the mean per pixel as a cleaned map, we significantly reduce the relative error compared with the foreground contaminated data.

Posterior probability validation
For each pixel value, we have a marginal posterior mean and variance.As this is simulated data and we have a large number of pixels, we can validate that the true values are distributed around the mean with the correct variance as predicted by our inference pipeline.In particular the rescaled pixel residuals (   −   )/  are expected to have zero mean and unit variance.
Figure 3 shows the rescaled residual distributions for each of the validation data maps A, B, C, and D. The clear Gaussianity of the residuals is not generally expected, but indicates that the marginal posterior distributions are Gaussian for this data model.The high level of agreement between a unit Normal distribution and the rescaled residuals for each of the validation maps is a strong demonstration that the forward model and the high-dimensional likelihood-free inference is robust.This is the primarily validation of our approach.

CONCLUSIONS
In this Letter we have described two significant contributions to simulation-based inference of CMB polarization data.The first is the development of WPH synthesis to take a single training image of polarized dust foregrounds to create a large set of realistic training images for use in a forward model.The second is the application of high-dimensional likelihood-free methods, in particular Moment Networks, to target the posterior probability of the CMB signal pixel values given our observed data.Finally we validate both aspects by performing a quantile-type test using mock data derived from an unseen (not synthesized) simulation of dust foregrounds.
The quantile-type test validates our approach in the most difficult limiting case: pixel-level, noise-free, highly non-Gaussian dust foregrounds with a single training image at a single frequency.It would be straightforward to combine this single-frequency (morphological) approach with existing multi-frequency models.Future interesting developments would include astrophysical modelling of dust frequency decorrelation (Tassis & Pavlidou 2015), improvements in the accuracy and efficiency of WPH synthesis, and extensions of the likelihood-free inference methods.But now, with a few representative patches of the galactic microwave sky, either simulated or data-driven, we can overcome a significant challenge of dust emission for CMB polarization analysis in a likelihood-free framework.

Figure 1 .
Figure 1.The two left panels show the simulated clean signal s  and the foreground-contaminated data d  (validation data A -section 3.3).The centre right panel shows the mean of the marginal posterior probability per pixel F (d  , d  ) and the far right shows the variance of the marginal posterior per pixel G (d  , d  ).This CMB signal has been inferred (the posterior probability estimated) using only a single frequency and a single training image.Patches of reduced power in the pixel posterior mean are not artefacts; the mean is expected to move closer to 0  when the posterior variance is higher.

Figure 3 .
Figure 3.For each of the validation data maps (A, B, C, D), the distribution of the rescaled pixel residuals (  −   )/  .This distribution is expected to have mean zero and unit standard deviation; the apparent Gaussianity is not necessarily expected or required and, though surprising, shows that the marginal posterior distributions are Gaussian for this data model.This result validates both the synthesis procedure with a single training image and the high-dimensional likelihood-free inference with Moment Networks.