Inference of the optical depth to reionization from low multipole temperature and polarisation Planck data

This paper explores methods for constructing low multipole temperature and polarisation likelihoods from maps of the cosmic microwave background anisotropies that have complex noise properties and partial sky coverage. We use Planck 2018 High Frequency Instrument (HFI) and updated SRoll2 temperature and polarisation maps to test our methods. We present three likelihood approximations based on quadratic cross spectrum estimators: (i) a variant of the simulation-based likelihood (SimBaL) techniques used in the Planck legacy papers to produce a low multipole EE likelihood; (ii) a semi-analytical likelihood approximation (momento) based on the principle of maximum entropy; (iii) a density-estimation `likelihood-free' scheme (DELFI). Approaches (ii) and (iii) can be generalised to produce low multipole joint temperature-polarisation (TTTEEE) likelihoods. We present extensive tests of these methods on simulations with realistic correlated noise. We then analyse the Planck data and confirm the robustness of our method and likelihoods on multiple inter- and intra-frequency detector set combinations of SRoll2 maps. The three likelihood techniques give consistent results and support a low value of the optical depth to reionization, tau, from the HFI. Our best estimate of tau comes from combining the low multipole SRoll2 momento (TTTEEE) likelihood with the CamSpec high multipole likelihood and is tau = 0.0627 +0.0050 −0.0058. This is consistent with the SRoll2 team's determination of tau, though slightly higher by 0.5 sigma, mainly because of our joint treatment of temperature and polarisation.


INTRODUCTION
Over the last decade, observations of the cosmic microwave background (CMB) (Hinshaw et al. 2013; Planck Collaboration 2020d; Henning et al. 2018; Aiola et al. 2020), together with measurements of the baryon acoustic oscillation scale from large galaxy surveys (Gil-Marín et al. 2020; Bautista et al. 2020) and many other cosmological observations, have transformed cosmology into a high precision science. In cosmological data analysis, an accurate representation of the likelihood, as well as the ability to model systematics, is crucial in order to make reliable inferences from data. Exact likelihoods are often either unknown or computationally expensive to compute. In addition, systematics in the data may bias the results if they cannot be modelled with fidelity and included in the likelihood.
These issues are of particular importance for the measurement of the optical depth to reionization from Planck temperature and polarisation CMB maps. Heuristic likelihood models for CMB data on a cut sky with idealised noise properties have been proposed by e.g. Hamimeche & Lewis (2008); Mangilli et al. (2015). However, the accuracy of these models is difficult to quantify, especially for cross-correlations of Planck polarisation maps which have complex noise correlations and systematics. For these reasons, the Planck collaboration adopted a simulation-based approach to construct a low multipole polarisation likelihood from the HFI Planck maps (Planck Collaboration 2016c, hereafter PSRoll1).

★ E-mail: rmvd2@cam.ac.uk
In this paper, we apply three likelihood approximations to measure the optical depth to reionization from Planck HFI maps. All three methods use Bayesian statistics to make inferences about models from data. Bayes' theorem can be used to infer the posterior density P_d(θ|d_0, M) of a set of parameters θ describing a model M from a realisation of data d_0:

P_d(θ|d_0, M) = L_d(d_0|θ, M) π(θ|M) / Z_d(d_0|M) ,   (1)

where P_d is the posterior, L_d the likelihood, π the prior and Z_d the evidence. The subscript d denotes the dependence on the data set. We compare the simulation-based likelihood (SimBaL) method of PSRoll1, which was used in the Planck 2018 analysis of cosmological parameters (Planck Collaboration 2020d, hereafter PCP18), with a flexible semi-analytic likelihood approximation (momento; Gratton 2017) and a density-estimation 'likelihood-free' method 1 (DELFI; Alsing et al. 2019).
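As a minimal illustration of the Bayesian inference described above, the sketch below evaluates an unnormalised posterior for a single parameter on a grid and normalises by the evidence. The Gaussian likelihood, data value and prior range are purely illustrative stand-ins, not Planck quantities.

```python
import numpy as np

# Grid-based posterior for a single parameter theta (e.g. an optical-depth-like
# quantity), from a toy Gaussian likelihood and a flat prior.
theta = np.linspace(0.01, 0.12, 500)   # parameter scan
d0, sigma = 0.06, 0.01                 # hypothetical data summary and its error

log_like = -0.5 * ((d0 - theta) / sigma) ** 2   # L_d(d0 | theta)
prior = np.ones_like(theta)                     # flat prior pi(theta)

unnorm = np.exp(log_like) * prior
Z = np.trapz(unnorm, theta)            # evidence Z_d (normalisation)
posterior = unnorm / Z                 # P_d(theta | d0)
```

The same grid-scan structure is used later in the paper, where the low multipole likelihoods are explored via single-parameter scans in the optical depth.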
momento and DELFI can easily be adapted to produce a joint temperature-polarisation likelihood at low multipoles. However, this is nontrivial for SimBaL when trying to achieve near optimal results.
The optical depth to reionization, τ, provides a measure of the time at which the intergalactic medium (IGM) was reionized by photons produced by early generations of stars and galaxies. Following recombination at z ≈ 1000, the IGM remained almost neutral until reionization. Assuming abrupt reionization at z_re, the Thomson optical depth is

τ = σ_T ∫_0^{z_re} n_e(z) (dl/dz) dz ,   (2)

where σ_T is the Thomson cross-section, n_e the free electron density, and we have assumed the base ΛCDM model 2 with a helium abundance by mass of Y_P (assuming that helium remains neutral). The Gunn-Peterson test (Gunn & Peterson 1965; Fan et al. 2006) provides strong astrophysical evidence that the intergalactic medium was highly ionised by a redshift of z = 6.5. Using the Planck 2018 base ΛCDM parameters, Eq. (2) yields a lower limit of τ ≈ 0.04 for z_re = 6.5.
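The lower limit quoted above can be checked numerically. The sketch below integrates the Thomson optical depth for abrupt reionization at z_re = 6.5, with hydrogen fully ionised and helium neutral; the cosmological parameter values are approximate Planck 2018 numbers inserted here by assumption.

```python
import numpy as np

sigma_T = 6.652e-29            # Thomson cross-section [m^2]
c = 2.998e8                    # speed of light [m/s]
m_H = 1.673e-27                # hydrogen atom mass [kg]
H0 = 67.4 * 1.0e3 / 3.086e22   # Hubble constant [1/s]
Om, OL = 0.315, 0.685          # matter and Lambda density parameters
Obh2 = 0.0224                  # baryon density Omega_b h^2
Y_P = 0.24                     # helium mass fraction

rho_b = Obh2 * 1.8788e-26             # baryon mass density today [kg/m^3]
n_H0 = (1.0 - Y_P) * rho_b / m_H      # hydrogen number density today [1/m^3]

z = np.linspace(0.0, 6.5, 4000)
H = H0 * np.sqrt(Om * (1.0 + z) ** 3 + OL)     # Hubble rate H(z)
n_e = n_H0 * (1.0 + z) ** 3                    # fully ionised H, neutral He
tau = np.trapz(sigma_T * n_e * c / ((1.0 + z) * H), z)
# tau comes out near 0.04 for these parameters
```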
In this paper, we have measured τ from two sets of Planck maps: the Planck 2018 legacy release (Planck Collaboration 2020a,b), which for HFI is based on the SRoll1 map-making algorithm described in PSRoll1, and the improved map-making algorithm SRoll2 (Delouis et al. 2019, hereafter PSRoll2). The values of τ computed from the low multipole spectra alone 3 from these maps are:

τ_SRoll1 = 0.0506 ± 0.0086 (PCP18),   (3a)
τ_SRoll2 = 0.0566 +0.0053 −0.0062 (Pagano et al. 2020).   (3b)
These estimates improve significantly on the result from WMAP (Hinshaw et al. 2013) of τ = 0.089 ± 0.014. Measurements of τ using a pixel-based likelihood on Low Frequency Instrument (LFI) Planck and WMAP data have been presented in Lattanzi et al. (2017) and Natale et al. (2020), albeit with larger uncertainties than in Eqs. (3a) and (3b). Note that the estimates from Planck are just above the Gunn-Peterson limit of τ ∼ 0.04 inferred from Eq. (2). This implies that reionization occurred late, i.e. z_re cannot be much greater than about 6.5 (see for example Kulkarni et al. 2019, and references therein).

1 'Likelihood-free' (LF) methods are clearly not likelihood-free. What is meant is that the likelihood L is inferred by fitting to numerical simulations rather than being expressed as a simple functional form.
2 As in the Planck papers we refer to the six parameter ΛCDM model (spatially flat, power law scalar adiabatic fluctuations, cosmological constant) as the base ΛCDM model.
3 It is important to distinguish constraints on τ derived using a low multipole likelihood alone, together with a constraint on the parameter combination A_s exp(−2τ), from full Markov chain Monte Carlo (MCMC) explorations that combine a low multipole likelihood with a high multipole likelihood. For most of this paper, we will analyse low multipole likelihoods via single parameter scans through values of τ, and so we use Eqs. (3a) and (3b) as our references to previous work. Full MCMC parameter searches combining our low multipole likelihoods with a high multipole likelihood are deferred until Sec. 6.3. For reference, assuming the base six parameter ΛCDM model, and adding the Plik high multipole likelihood, the best fit values of τ are higher than those of Eqs. (3a) and (3b) by about 0.5σ (see Eqs. (56a) and (56b)).
Measuring τ using the CMB is challenging. At high multipoles, the CMB power spectra are damped by a factor e^{−2τ}, leading to a degeneracy between the parameter τ and the amplitude A_s of the initial cosmological scalar perturbations (which is partially broken by CMB lensing). Reionization also induces a polarisation signal at super-horizon scales, leading to a 'reionization bump' in the EE power spectrum at low multipoles (ℓ ≲ 20) with an amplitude that scales approximately as τ². The EE power spectrum at low multipoles can therefore be used to constrain τ, provided systematics can be kept under control.
An enormous effort has been made to improve the fidelity of the Planck HFI polarisation maps. These improvements are presented in detail in PSRoll1 and PSRoll2. In the analysis described in PSRoll2, residual systematics at 100 and 143 GHz are reduced to levels below the notional detector noise levels at multipoles ℓ ≲ 10, as demonstrated by a number of null tests. PSRoll1 and PSRoll2 describe sets of end-to-end simulations of the two SRoll pipelines. In this paper, we use these simulations to characterise large-scale systematic modes and correlated noise at low multipoles. We then generate a large number of realisations with realistic noise and systematics over a grid of τ values, which are used to train, or calibrate, two of the likelihood models. All of the likelihoods are based on a quadratic cross spectrum (QCS) estimator, which we use to measure the foreground cleaned cross-spectra at low multipoles (2 ≤ ℓ ≤ 29) from the SRoll1 and SRoll2 temperature and polarisation maps 4. This paper is organised as follows: in Sec. 2 we review the QCS power spectrum estimator. In Sec. 3, we discuss map compression and foreground cleaning procedures as well as residual systematics in the maps and their contribution to the power spectrum. In Sec. 3.2 we derive the pixel-pixel noise covariance matrices required for the QCS estimator and the likelihood computations. In Sec. 4 we present the different likelihood methods used to measure τ: the simulation-based likelihood (C-SimLow) in Sec. 4.1, the likelihood approximation scheme (momento) in Sec. 4.2 and the likelihood-free approach (pydelfi) in Sec. 4.3. In Sec. 5 we test our three likelihoods on simulations with realistic correlated noise. In Sec. 6 we analyse the Planck data and perform cross-checks to validate our results. Sec. 7 presents our conclusions.

QUADRATIC CROSS SPECTRUM ESTIMATOR
To make inferences from large data sets such as CMB maps, data compression is often required to reduce the size of the data vector to a manageable level. Here we compress maps into summary statistics, namely the angular power spectra 5 C_ℓ^X, for each mode (X ≡ TT, TE, EE, . . . ). Quadratic estimators can be constructed to measure the temperature and polarisation power spectra on an incomplete sky which have lower variance than traditional pseudo-C_ℓ (PCL) estimators (Tegmark & de Oliveira-Costa 2001; Efstathiou 2006) and are easily computable at low multipoles from low resolution maps 6.
Ordering the T, Q and U pixel values as a data vector x, one can write down a quadratic power spectrum estimate y_ℓ^X (Tegmark 1997a):

y_ℓ^X = (1/2) x^T C^{-1} (∂C/∂C_ℓ^X) C^{-1} x ,   (4)

evaluated in practice as

y_ℓ^X = (1/2) x^T C_f^{-1} (∂C/∂C_ℓ^X) C_f^{-1} x ,   (5)

where C is the covariance matrix of the data vector x, which we will assume is composed of a signal and a noise term,

C = S + N ,   S = Σ_X Σ_ℓ C_ℓ^X (∂S/∂C_ℓ^X) ,   (6)

and C_f is this covariance evaluated for a fiducial model. In Eq. (5), C_ℓ^X is the theoretical CMB power spectrum for mode X. If the fiducial C_ℓ^X is chosen to be close to the truth, C can be accurately calculated and then Eq. (5) gives a minimum variance estimate of the power spectra. One can see from Eq. (6) that the estimator in Eq. (5) for the E and B spectra mixes T components of the data vector x with the much lower amplitude E and B components. This is undesirable, since systematic errors in the maps and covariance matrix C could bias the E and B spectra. We therefore 'reshape' C by writing

C̃ = S̃ + N ,   (7)

where S̃ retains only the TT, EE and BB blocks of the signal matrix, removing the temperature-polarisation coupling. The revised quadratic estimator is then

ỹ_ℓ = (1/2) x^T C̃^{-1} (∂C/∂C_ℓ) C̃^{-1} x ,   (8)

with expectation value

⟨ỹ_ℓ⟩ = Σ_{ℓ'} F̃_{ℓℓ'} C_{ℓ'} ,   (10)

Fisher matrix

F̃_{ℓℓ'} = (1/2) tr[C̃^{-1} (∂C/∂C_ℓ) C̃^{-1} (∂C/∂C_{ℓ'})] ,   (11)

and variance

⟨Δỹ_ℓ Δỹ_{ℓ'}⟩ = (1/2) tr[C̃^{-1} (∂C/∂C_ℓ) C̃^{-1} C C̃^{-1} (∂C/∂C_{ℓ'}) C̃^{-1} C] .

If the matrix F̃ is invertible, unbiased estimates of the power spectra can be computed from (dropping the mode index X to avoid unwieldy notation in the remainder of this section)

Ĉ_ℓ = Σ_{ℓ'} (F̃^{-1})_{ℓℓ'} ỹ_{ℓ'} ,   (12)

with covariance matrix

⟨ΔĈ_ℓ ΔĈ_{ℓ'}⟩ = (F̃^{-1} ⟨Δỹ Δỹ^T⟩ F̃^{-1})_{ℓℓ'} .   (13)

Note that if we had used the exact covariance matrix of Eq. (6), the Fisher matrix in Eq. (11) would take the more familiar form

F_{ℓℓ'} = (1/2) tr[C^{-1} (∂C/∂C_ℓ) C^{-1} (∂C/∂C_{ℓ'})] ,   (14)

and the covariance matrix of Eq. (13) becomes ⟨ΔĈ_ℓ ΔĈ_{ℓ'}⟩ = (F^{-1})_{ℓℓ'}. For realistic sky cuts, using the reshaped covariance matrix in place of the exact covariance matrix leads to a negligible increase in variance compared to the optimal estimator, given in Eq. (4) (see Efstathiou 2006).

4 Most of our results are based on the 100 × 143 full-mission cross-spectra, though we also investigate 100 × 100 and 143 × 143 'detector set' cross-spectra.
5 Throughout the paper, we use the following notation: C̃_ℓ are the undeconvolved and Ĉ_ℓ the deconvolved power spectra of the data. Theory spectra are denoted by C_ℓ.
6 It is also possible to write down a pixel-based likelihood for low resolution maps, provided the signal and noise are Gaussian and the noise covariance is known.
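The quadratic-estimator machinery described above (a quadratic estimate weighted by a fiducial covariance, a Fisher matrix, noise-bias subtraction, and inversion to unbiased band powers) can be demonstrated on a toy pixel ring. The band templates, noise level and fiducial model below are invented for illustration and are not the Planck T, Q, U setup.

```python
import numpy as np

rng = np.random.default_rng(0)
npix, nsims = 16, 4000
theta = 2 * np.pi * np.arange(npix) / npix
dtheta = theta[:, None] - theta[None, :]

# Band templates T_b = dC/dC_b: three harmonic modes on a pixel ring.
ks = [1, 2, 3]
T = np.array([np.cos(k * dtheta) for k in ks])

C_true = np.array([2.0, 1.0, 0.5])        # true band powers
N = 0.1 * np.eye(npix)                    # white noise covariance
S_true = np.tensordot(C_true, T, axes=1)

# Fiducial covariance used to weight the estimator (deliberately offset
# from the truth; the estimator remains unbiased).
C_fid = np.array([1.5, 1.5, 1.5])
Cinv = np.linalg.inv(np.tensordot(C_fid, T, axes=1) + N)

E = np.array([Cinv @ Tb @ Cinv for Tb in T])          # estimator kernels
F = 0.5 * np.array([[np.trace(E[a] @ T[b]) for b in range(3)]
                    for a in range(3)])               # Fisher matrix
noise_bias = 0.5 * np.array([np.trace(E[a] @ N) for a in range(3)])

x = rng.multivariate_normal(np.zeros(npix), S_true + N, size=nsims)
y = 0.5 * np.einsum('si,aij,sj->sa', x, E, x)          # quadratic estimates
C_hat = (y - noise_bias) @ np.linalg.inv(F).T          # unbiased band powers
# C_hat.mean(axis=0) approaches C_true over many realisations
```

The same structure carries over to the cross-spectrum version: weighting two independent maps separately removes the need for the noise-bias subtraction step.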
In our application, the Planck noise properties in polarisation are complex and so it is dangerous to estimate power spectra using Eq. (12), since this requires the subtraction of a noise term. We therefore modified the quadratic estimator by applying it to cross-spectra of maps x^(i) and x^(j) on the assumption that the noise between these maps is uncorrelated (Efstathiou & Gratton 2014). The QCS estimator is

ỹ_ℓ^{(ij)} = (1/2) x^{(i)T} (C^{(i)})^{-1} (∂C/∂C_ℓ) (C^{(j)})^{-1} x^{(j)} ,   (15)

which can be computed rapidly via spherical harmonic transforms (Efstathiou 2006). For each map, x^{(i)}, we form the weighted map

z^{(i)} = (C^{(i)})^{-1} x^{(i)} ,   (16)

and compute the PCL cross spectrum C̃_ℓ^{(i,j)} of the maps z^{(i)} and z^{(j)}. The estimator in Eq. (15) is then

ỹ_ℓ^{(ij)} = ((2ℓ + 1)/(2Ω²)) C̃_ℓ^{(i,j)} ,   (18)

where Ω is the solid angle of a single map pixel, each assumed to be of the same size. The QCS estimates can therefore be computed very rapidly for large numbers of simulations, since the matrices (C^{(i)})^{-1} and (C^{(j)})^{-1} need only be computed once. The expectation value of Eq. (15) is

⟨ỹ_ℓ^{(ij)}⟩ = Σ_{ℓ'} F̃_{ℓℓ'}^{(ij)} C_{ℓ'} ,   (19)

and estimates of the power spectra Ĉ_ℓ can be recovered by inversion of Eq. (19) as in Eq. (12). The QCS estimator was used, together with the simulation-based likelihood (SimBaL), to analyse the Planck HFI maps in PSRoll1, PCP18 and Pagano et al. (2020). Although the QCS estimator is not 'optimal' in any formal sense, it has a significantly lower variance than a PCL estimator applied to the Planck polarisation maps. In addition to lower variance, the QCS estimator produces estimates of the power spectrum with a covariance matrix that is effectively diagonal. It is because the QCS estimates of Ĉ_ℓ for each multipole are effectively independent that the SimBaL likelihood approach is feasible.
The variance of the QCS estimates is somewhat more complicated than Eq. (11):

⟨Δỹ_ℓ^{(ij)} Δỹ_{ℓ'}^{(ij)}⟩ = (1/4) ( tr[E_ℓ C^{(jj)} E_{ℓ'}^T C^{(ii)}] + tr[E_ℓ C^{(ji)} E_{ℓ'} C^{(ij)}] ) ,   (20)

where E_ℓ = (C^{(i)})^{-1} (∂C/∂C_ℓ) (C^{(j)})^{-1} and C^{(ab)} = ⟨x^{(a)} x^{(b)T}⟩. The error bars shown in plots of the power spectra below (Figs. 4 and 5) are computed from Eq. (20), though this expression is not used in the likelihoods. In this paper, we assume a base ΛCDM model with τ = 0.06 to compute the signal matrix S. The construction of the noise covariance matrices N (including realistic correlated noise) is described in Sec. 3.2 and is based on the end-to-end simulations described in Planck Collaboration (2020b) and PSRoll2, respectively, for the analysis of SRoll1 and SRoll2 maps. In contrast, the analysis of Pagano et al. (2020) used a simplified noise model based on the Planck FFP8 simulations (Planck Collaboration 2016b) for the QCS computations.

Figure 1 (caption): Masks used in this paper, with f_sky = Σ_p w_p / N_pix^tot. The apodised mask (0 ≤ w ≤ 1) with f_sky = 0.85 is used as a 'processing' mask to degrade the high resolution maps to low resolution (as discussed in the text). The mask with f_sky = 0.70 is binary and used to compute foreground cleaning coefficients and to mask the temperature maps when computing power spectra. The more conservative mask with f_sky = 0.54 is binary as well and used to compute polarisation power spectra.

DATA
The SRoll1 and SRoll2 map-making algorithms are described in detail in PSRoll1, Planck Collaboration (2020b) and PSRoll2. Briefly, the algorithms find global solutions minimising the variance in the response of each polarised bolometer within a given frequency band with respect to a number of instrumental parameters. Analogue-to-digital converter nonlinearity (ADCNL) introduced large polarisation systematics at low multipoles in the 2015 Planck HFI maps (Planck Collaboration 2016a). These systematics were substantially reduced with the SRoll1 processing used to produce the Planck 2018 HFI legacy maps. Although SRoll1 reduced systematics arising from first order ADCNL, second order ADCNL caused temperature-to-polarisation dipole leakage in the maps (Planck Collaboration 2020d). The SRoll2 map-making algorithm reduced these large scale polarisation systematics for 100 and 143 GHz still further via the following refinements: (i) the revised ADCNL corrections in SRoll2 obviate the need for fitting an effective gain variation of the bolometers; (ii) the polarisation angle and efficiency for each bolometer were treated as marginalised parameters; (iii) the thermal dust and CO templates were updated.
These improvements significantly reduced large-scale systematics in the polarisation data. The SRoll1 and SRoll2 100 and 143 GHz Q and U maps are compared in Figs. 4 and 5 of Pagano et al. (2020). In this paper, we have analysed both the SRoll1 and SRoll2 maps, together with their respective sets of end-to-end simulations, so that the reader can assess the impact of changes in the HFI data processing.

Map compression and foreground cleaning
To apply the QCS estimator, we degrade the high resolution Planck T, Q and U maps following a procedure similar to that described in PSRoll1 and PSRoll2. We first apply an apodised mask with f_sky = 0.85 (as plotted in Fig. 1) to suppress the Galactic plane region 7. The maps were then smoothed using the harmonic-space smoothing operator:

f_ℓ = 1 for ℓ ≤ ℓ_1 ;  f_ℓ = (1/2)[1 + cos(π(ℓ − ℓ_1)/(ℓ_2 − ℓ_1))] for ℓ_1 < ℓ < ℓ_2 ;  f_ℓ = 0 for ℓ ≥ ℓ_2 ,   (21)

with ℓ_1 = N_side^lr and ℓ_2 = 3 N_side^lr, and degraded from N_side = 2048 (5.03 × 10^7 HEALPix pixels, Górski et al. 2005) to N_side^lr = 16 (3072 pixels) in the low resolution maps. We apply the smoothing operator given in Eq. (21) and a HEALPix pixel window function at the map level to match the N_side^lr = 16 low resolution covariance matrices, N_0, discussed in Sec. 3.2.

7 This procedure is unnecessary for foreground subtracted simulations, but required for the real data to avoid smearing high amplitude foreground emission in the Galactic plane to high Galactic latitudes.
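A harmonic-space taper of this kind, equal to unity below ℓ_1 = N_side^lr and falling smoothly to zero by ℓ_2 = 3 N_side^lr, can be sketched as below; the cosine form is a common choice and is assumed here for illustration.

```python
import numpy as np

def smoothing_operator(ell, nside_lr=16):
    """Cosine taper in harmonic space: unity below ell_1 = nside_lr,
    zero above ell_2 = 3 * nside_lr (an assumed functional form)."""
    ell = np.asarray(ell, dtype=float)
    l1, l2 = nside_lr, 3 * nside_lr
    ramp = 0.5 * (1.0 + np.cos(np.pi * (ell - l1) / (l2 - l1)))
    return np.where(ell <= l1, 1.0, np.where(ell >= l2, 0.0, ramp))

# Taper coefficients for all multipoles of an N_side = 2048 map.
ells = np.arange(3 * 2048)
f_ell = smoothing_operator(ells)
```

In practice the taper would be applied to the spherical harmonic coefficients of the high resolution map (together with the pixel window function) before resampling to the low resolution grid.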
The low resolution maps are foreground cleaned by fitting high and low frequency templates. Specifically, we use the 353 GHz maps as a dust template and (in polarisation only) either the Planck LFI 30 GHz or WMAP K-band (22 GHz) maps as synchrotron templates. Following Efstathiou & Gratton (2019, hereafter EG19, see Secs. 7.1 and 8) we minimise the cleaned map residuals

Σ_p (m_p − α_1 T_1(p) − α_2 T_2(p))²   (22)

with respect to the coefficients α_1 and α_2 for the two map templates T_1 and T_2. In polarisation, the sum in Eq. (22) extends over the unmasked pixels in the Q and U maps defined by the f_sky = 0.70 mask plotted in Fig. 1. We therefore determine two sets of coefficients for polarisation, which we denote α_1 and α_2. For temperature, we have applied dust template subtraction with a coefficient α_1 and ignored synchrotron, for reasons discussed below.
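The template cleaning fit can be sketched as an ordinary least-squares problem over unmasked pixels. The synthetic maps, mask and coefficient values below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
npix = 3072                              # an N_side = 16 map
cmb = rng.normal(0.0, 0.3, npix)         # CMB-like signal (arbitrary units)
T1 = rng.normal(0.0, 5.0, npix)          # high-frequency (dust-like) template
T2 = rng.normal(0.0, 2.0, npix)          # low-frequency (synchrotron-like) template
mask = rng.random(npix) < 0.70           # keep ~70 per cent of the sky

alpha_true = np.array([0.019, 0.010])    # illustrative cleaning coefficients
m = cmb + alpha_true[0] * T1 + alpha_true[1] * T2

# Minimise sum_p (m_p - a1*T1_p - a2*T2_p)^2 over unmasked pixels only.
A = np.column_stack([T1[mask], T2[mask]])
alpha_fit, *_ = np.linalg.lstsq(A, m[mask], rcond=None)
cleaned = m - alpha_fit[0] * T1 - alpha_fit[1] * T2
```

Here the CMB itself plays the role of the 'noise' in the fit, which is why the recovered coefficients scatter slightly around the input values.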
The template coefficients used in this paper are listed in Table 1. The polarisation coefficients are listed in pairs, one pair for each of the SRoll1 and SRoll2 maps, to give an impression of the sensitivity of the dust coefficient to the choice of low frequency template. Polarised dust emission dominates the 100 and 143 GHz Q and U maps at low resolution, with synchrotron making a small (but non-negligible) contribution at 100 GHz. As Table 1 shows, the polarisation dust coefficients are stable. However, for 100 GHz the amplitude of the 30 GHz coefficient differs between SRoll1 and SRoll2. The SRoll2 polarisation cleaning coefficients are in excellent agreement with the coefficients determined by Pagano et al. (2020) (α_353 = 0.0186, α_22 = 0.0095 for 100 GHz). As we will show below, ignoring the synchrotron correction in polarisation causes shifts in τ of a fraction of a standard deviation. Maps cleaned with WMAP K-band are nearly indistinguishable from those cleaned with Planck 30 GHz maps. This similarity gives us confidence in our synchrotron cleaning coefficients.
Dust cleaning in temperature using 353 GHz or higher frequencies removes almost all of the foreground emission at low multipoles at 143 GHz, leaving noise-free CMB signal over most of the sky (and indistinguishable from the Planck component separated maps), as discussed in detail in EG19. At 100 GHz, the main contaminant, following 353 cleaning, is CO line emission which makes a small but easily detectable contribution to the signal. Since the residual foregrounds are small at low multipoles, we subtract only dust emission in temperature using 353 GHz maps and the cleaning coefficients determined by EG19 (as listed in Table 1). We use the same temperature cleaning coefficients for SRoll1 and SRoll2 since these coefficients are insensitive to the map-making algorithm.

Figure 2 (caption): The polarisation mask of Fig. 1 has been applied. The upper set of maps at each frequency shows the Q and U maps before foreground subtraction. The lower set of maps at each frequency shows the Q and U maps after subtraction of 353 and 30 GHz Planck maps with the template cleaning coefficients listed in Table 1. The colour scale is in units of μK.
To avoid introducing correlated noise into the QCS spectra, we foreground clean the frequency maps using pairs of half-mission (HM) template maps 8. For example, for the 100 × 143 cross-spectrum, we clean the 100 GHz channel with 353 GHz HM1 and 30 GHz HM1 maps and the 143 GHz channel with 353 GHz HM2 and 30 GHz HM2 maps.
The SRoll2 polarisation maps before and after foreground cleaning are plotted in Fig. 2, which shows clearly that the polarisation maps at 100 and 143 GHz are dust dominated at low multipoles before foreground subtraction. Fig. 3 shows the foreground cleaned N_side^lr = 16 Q and U maps used for the main cosmological results in this paper. The SRoll1 maps are shown in the upper panels and the SRoll2 maps are shown in the lower panels. One can clearly see systematic features in the SRoll1 maps, particularly at 100 GHz. These systematic features can be reduced by subtracting a smoothed mean map n̄ determined from end-to-end simulations as discussed in the next section. For reference, the rms fluctuations in the SRoll2 100 GHz polarisation maps over the unmasked area shown in Fig. 3 are σ = 0.66 μK before foreground cleaning and σ = 0.36 μK after foreground cleaning. The rms of the scaled 353 GHz dust template is σ = 0.55 μK and for the scaled 30 GHz synchrotron template is σ = 0.11 μK.

8 The half-mission split divides the stable rings from the full-mission frequency maps into two halves, i.e. each frequency channel has two half-mission maps: HM1 and HM2.

Noise covariance matrices
In this section, we describe how we fit a parametric model to estimates of the pixel-pixel noise covariance matrices (NCMs) computed from end-to-end simulations. Once we have a model for a NCM, we can generate large numbers of simulations on the assumption that the noise is Gaussian, allowing us to construct likelihoods as discussed in Sec. 4. We compute two sets of NCMs, one each for SRoll1 and SRoll2, from the end-to-end simulations of the map-making pipelines 9. In each case, the fiducial CMB and the input foreground model was subtracted, leaving maps containing noise and map-making systematics. We then construct empirical NCMs,

N̂ = (1/N_sims) Σ_{i=1}^{N_sims} (n_i − n̄)(n_i − n̄)^T ,   (23)

where n_i are the simulated sky maps for the Stokes parameters Q and U, n̄ a smoothed template of the mean of the maps (see Eq. (30)) and N_sims the number of simulations used in the sum in Eq. (23). To test whether our methodology is biased as a consequence of overfitting, we do not use all of the end-to-end simulations to compute Eq. (23). We excluded 100 simulations from each set to allow validation tests of the likelihoods as discussed in Sec. 5. We approach the problem of fitting a model to N̂ as a maximum likelihood inference problem. We assume a Gaussian probability distribution for each noise realisation,

P(n|M) ∝ |M|^{-1/2} exp(−(1/2) n^T M^{-1} n) ,   (24)

where M is the model for the noise covariance matrix. We assume that M consists of three terms,

M = α N_0 + β N_1 + Y A Y^T ,   (25)

where α and β are scaling parameters for two matrices N_0 and N_1, and the term Y A Y^T models large-scale modes with parameters A as described more fully below. Since we are dealing with Q and U maps, the noise covariance matrices M are of size (2N_pix, 2N_pix). We neglect noise in temperature and only fit polarisation noise, since the low resolution maps are signal dominated to high accuracy. Moreover, we assume that the noise for 100 and 143 GHz maps is uncorrelated.
The matrices N_0 are the N_side^lr = 16 low resolution map-making covariance matrices (see e.g. Tristram et al. 2011) computed for the parameters of the FFP8 simulations (Planck Collaboration 2016a,b). These covariance matrices contain structure representing the scanning strategy, detector white noise and '1/f'-type noise, but do not include complexities associated with corrections for ADCNL. Since these matrices were designed to match the 2015 Planck half-mission maps, they do not necessarily match the noise levels of the SRoll1 and SRoll2 maps at high multipoles. The matrices N_1 were constructed from the diagonal components (σ²_II, σ²_QQ, σ²_UU) of the 3 × 3 I, Q, U high resolution pixel noise estimates produced by the map-making algorithms. These noise estimates were degraded in resolution to N_side^lr = 16 (appropriate for a low resolution map) by computing:

N_1(p, q) = Ω² Σ_i σ_i² K(p, i) K(q, i) ,   K(p, i) = Σ_ℓ ((2ℓ + 1)/4π) f_ℓ P_ℓ(p̂ · x̂_i) ,

for each of Q and U (see App. A of Efstathiou et al. (2009)). We assume that the high resolution noise is diagonal, N_ij = σ_i² δ_ij; Ω is the solid angle of a high-resolution map pixel, P_ℓ denotes the Legendre polynomials and f_ℓ is the smoothing operator applied to the high-resolution maps given in Eq. (21). The sum therefore allows us to model destriping noise correlations with adjustable 'white noise' levels, using the matrices N_0 and N_1 as templates. The final term in Eq. (25) models additional large-scale noise correlations in the SRoll1 and SRoll2 end-to-end simulations. Terms up to ℓ_max = 4 are added to the noise model via Y A Y^T, where each column of the matrix Y is a map of the spherical harmonic functions Y_ℓm(θ, φ). The square matrix A, with dimensions equal to the number of large-scale modes that we wish to fit (42 modes to model the Q and U correlations), controls the covariance of the modes. In the noise fitting procedure we apply the binary polarisation mask with f_sky = 0.54, shown in Fig. 1, to the Q and U maps. This avoids fitting modes behind the Galactic plane.
To solve our inference problem, we minimise the 'action' S ≡ −ln(P(N̂|M)) with respect to the free parameters in our model for M. We solve numerically for the scalar parameters α and β, solving analytically for the A at each step. The matrix A is given by an analytic expression derived in App. A. When fitting the NCM M using Eq. (25), the dominant components are the '1/f'-type noise matrix N_0 (α ≈ 1.8) as well as the large scale modes up to ℓ_max = 4. The smoothed low resolution covariance matrix N_1 only subtracted a small amount of power from the diagonal, with β ≈ −0.1. As summarised at the start of this section, uncorrected ADCNL leads to 'stripy' residuals in the Planck Q and U maps at 100, 143, and 217 GHz, which are substantially reduced in SRoll2 compared to SRoll1 (see, for example, Fig. 4 in Pagano et al. 2020). The end-to-end simulations provide templates for these residuals. Instead of subtracting the mean map averaged over the simulations, x̄, we subtract a smoothed template

n̄ = Y (Y^T N_0^{-1} Y)^{-1} Y^T N_0^{-1} x̄ ,   (30)

with ℓ_max = 4 (this is the maximum likelihood solution of the map-making equation, as discussed in e.g. Tegmark (1997b)). We have approximated the NCM in Eq. (30) by the term N_0 to avoid having to iterate to obtain a solution for M.
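The fit of the scaling parameters by minimising the action can be sketched as follows. For brevity this toy drops the large-scale-mode term, uses synthetic stand-ins for N_0, N_1 and the simulated noise maps, and replaces the numerical minimiser with a coarse grid search; the parameter values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
npix, nsims = 40, 300

# Synthetic stand-ins for the two covariance templates.
A = rng.normal(size=(npix, npix))
N0 = A @ A.T / npix + np.eye(npix)           # correlated ('1/f-like') template
N1 = np.diag(rng.uniform(0.5, 1.5, npix))    # diagonal white-noise template

alpha_true, beta_true = 1.8, 0.4             # toy scaling parameters
M_true = alpha_true * N0 + beta_true * N1
n = rng.multivariate_normal(np.zeros(npix), M_true, size=nsims)
C_emp = n.T @ n                              # sum over sims of n n^T

def action(a, b):
    """Gaussian negative log-likelihood ('action' S) up to a constant."""
    M = a * N0 + b * N1
    _, logdet = np.linalg.slogdet(M)
    return 0.5 * (np.trace(np.linalg.solve(M, C_emp)) + nsims * logdet)

# Coarse grid search over the two scaling parameters.
alphas = np.linspace(1.0, 2.6, 33)
betas = np.linspace(0.0, 1.0, 41)
S = np.array([[action(a, b) for b in betas] for a in alphas])
ia, ib = np.unravel_index(np.argmin(S), S.shape)
alpha_fit, beta_fit = alphas[ia], betas[ib]
```

Because the action is Gaussian in the data, the fit recovers the input scalings to within the Monte Carlo uncertainty set by the number of simulated noise maps.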

Quadratic temperature and polarisation power spectra
As an illustration of our methods, Fig. 4 shows the TT, TE, EE and BB spectra, computed using the QCS estimator technique presented in Sec. 2, for the cross correlation of the 100 and 143 GHz full-mission maps for both SRoll1 and SRoll2 for 2 ≤ ℓ ≤ 31. The T, Q and U maps have been foreground cleaned as discussed in Sec. 3.1. We apply a mask with f_sky = 0.70, shown in Fig. 1, to the T maps and we apply a mask with f_sky = 0.54, also shown in Fig. 1, to the Q and U maps. For the TT cross-spectra we compare our results with the COMMANDER 2018 spectrum (PCP18) and observe good agreement between them, even though the COMMANDER spectrum is computed using a mask with a larger sky fraction (f_sky = 0.86). In particular, we observe the same behaviour in the TT spectra: first, at ℓ = 2 there is a very low value, and, second, there is a 'dip' at around ℓ ≈ 20 − 25. The error bars on the COMMANDER spectrum are asymmetric 1σ posterior widths.
To guide the eye, the solid and dashed black lines in Fig. 4 show the TT, TE and EE spectra for the base ΛCDM model with τ = 0.055, close to the best fit value for the SRoll1 and SRoll2 maps cited in Eqs. (3a) and (3b), and for τ = 0.070, which is disfavoured at about the 2 − 3σ level. The TT and TE spectra have large cosmic variance and so do not provide strong constraints on τ given the low values of τ inferred from the EE spectra. The SRoll1 and SRoll2 TT, TE and EE spectra are very similar and mainly show differences at low multipoles. The BB spectra are approximately consistent with zero, providing an important null-test for SRoll1 and SRoll2. The χ² values divided by the degrees of freedom (29 multipoles) are 1.28 for SRoll1 and 1.13 for SRoll2. Fig. 5 shows TT, TE, EE and BB spectra computed using the QCS scheme for intra- and inter-frequency cross-spectra for combinations of the publicly available SRoll2 detector set 10 maps. The T, Q and U maps are foreground cleaned, following the procedure discussed in Sec. 3, i.e. the ds1 maps are cleaned using 353 GHz HM1 and 30 GHz HM1 maps and the ds2 maps are cleaned with 353 GHz HM2 and 30 GHz HM2 maps. The temperature and polarisation masks are the same as those used for the previously discussed full-mission cross-spectra and shown in Fig. 1.

Figure 4 (caption): The COMMANDER TT cross-spectrum (PCP18) (dotted line) with the associated error bar as a grey shaded region is shown for comparison. The maps are foreground cleaned with the 30 GHz and 353 GHz channels for synchrotron and dust emission, respectively. Theoretical spectra are shown for τ = 0.055 (solid line) and τ = 0.070 (dashed line).

10 A detector set denotes a combination of bolometers chosen to be able to fully determine the polarisation state of (non-circularly-polarised) incoming light: at 100 GHz, ds1 consists of the 100-1a/b and 100-4a/b bolometer pairs and ds2 of the 100-2a/b and 100-3a/b pairs. At 143 GHz, ds1 consists of the 143-1a/b and 143-3a/b pairs, and ds2 of the 143-2a/b and 143-4a/b pairs (PSRoll1).
Figure 5 (caption): For comparison, the empty black circles show the 100 × 143 full-mission QCS estimates with the corresponding error bars (as in Fig. 4).
For the signal-dominated TT and TE cases seen in Fig. 5, we measure almost identical spectra for the different detector set combinations, as expected. In the low multipole polarisation regime that is most constraining for τ (2 ≤ ℓ ≲ 12), the EE estimates scatter between the theoretical curves with τ = 0.055 and τ = 0.070. This indicates that higher values of τ may be favoured for some detector set combinations. At intermediate scales, 10 ≤ ℓ ≤ 25, the spectra scatter around the theoretical curves and no clear trend is visible. However, the 100 ds1 × 100 ds2 intra-frequency cross spectrum shows a few outliers in polarisation (purple line). This potentially suggests that the 100 GHz maps are more affected by noise and unresolved systematics than the 143 GHz ones. The illustrated BB spectra are all approximately consistent with the null hypothesis of zero signal power.

LIKELIHOODS
In this section we present and compare a simulation-based likelihood, a likelihood-approximation scheme and a likelihood-free approach. The methods are called SimBaL, momento and DELFI, and their respective likelihoods are named C-SimLow 11, momento and pydelfi.

Simulation-based likelihood
The simulation-based likelihood (SimBaL), originally presented in PSRoll1, uses low-ℓ QCS estimates of only the EE polarisation spectrum to measure the optical depth.
The joint sampling distribution for all the power spectrum elements is in general a function of all the power spectra components defining the model. However, the QCS procedure with reshaping, see Eq. (7), does a good job of approximately factorising this distribution by multipole. The distribution of the power spectrum elements at a given multipole then depends mainly on the theory elements at that multipole. By only considering a single spectrum, EE in this case, the requirement to handle intra-multipole correlations is avoided, motivating an approximate likelihood form:

L ≈ Π_{ℓ=ℓ_min}^{ℓ_max} L_ℓ(Ĉ_ℓ | C_ℓ) ,

with ℓ_min = 2 and ℓ_max = 29, a product of one-dimensional functions. One then uses realisations generated according to a set of theory models to fit parametric forms to each of these one-dimensional sampling distributions. This is done for each ℓ at the input theory C_ℓ values that happen to occur in the models considered. These fits are then evaluated at a realised C_ℓ value set equal to that of the real data Ĉ_ℓ. Finally, a further fit to these numbers, now as a function of the theory C_ℓ, gives the effective likelihood function at that multipole.
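The per-multipole construction can be illustrated with a toy at a single ℓ, replacing the realistic simulations by draws from a scaled χ² sampling distribution: simulate Ĉ_ℓ over a grid of theory values, estimate the one-dimensional density at the observed value, and read off the effective likelihood as a function of the theory value. All numbers below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
ell, nsims = 5, 100000
nu = 2 * ell + 1                          # modes at this multipole

C_obs = 1.2                               # 'observed' hat{C}_ell (toy units)
theory_grid = np.linspace(0.4, 3.0, 40)   # theory C_ell values scanned

log_like = []
for C_th in theory_grid:
    # Toy sampling distribution: hat{C} ~ C_th * chi^2_nu / nu.
    sims = C_th * rng.chisquare(nu, nsims) / nu
    # Crude density estimate at C_obs from a narrow histogram bin.
    h = 0.05 * C_th
    p = np.mean(np.abs(sims - C_obs) < h) / (2 * h)
    log_like.append(np.log(p + 1e-300))
log_like = np.array(log_like)
C_ml = theory_grid[np.argmax(log_like)]   # lies near C_obs for this toy
```

In the real procedure, smooth parametric forms are fitted to the simulated densities rather than raw histogram estimates, which suppresses the Monte Carlo noise visible in this sketch.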
As the mask and noise do correlate the multipoles even with QCS, one may ask where such effects manifest themselves in the above procedure. At a given ℓ = ℓ₀, say, one can imagine computing the marginalised one-dimensional sampling distribution for Ĉ_{ℓ₀} for a specific theory power spectrum. Because of the couplings, in general this will differ between models even if they happen to share the same value of C_{ℓ₀}. The fitting procedure above then effectively averages over these distributions. One expects the variations to be relatively modest for plausible models, and hopes that the theory models used to generate the realisations are close enough to reality not to lead to significant errors in the effective averaging.
We therefore need to generate full maps of the observed CMB on which to measure the spectra. Considering the theory C_ℓ's for each mode over a large region of the parameter space would be computationally costly, since one would typically spend much time exploring low probability regions. We thus only explore the region of power spectrum space around the theory C_ℓ's of interest.
Our implementation of the simulation-based likelihood largely follows that of Planck Collaboration (2020c). The main difference is that we use Gaussian realisations of our noise fit to the Planck simulations, rather than using the outputs of the noise simulations directly, given the limited number of the latter. This allows us to use many more independent noise realisations throughout the procedure.
We generate 191 theoretical power spectra, C_ℓ(τ, θ), uniformly sampled over a range of τ values from 0.01 to 0.2 inclusive with a step size of Δτ = 0.001, where θ denotes all the cosmological model parameters defined in Sec. 6. Only A_s is varied along with τ, to keep the product 10⁹ A_s e^{−2τ} fixed at 1.870, consistent with a high-ℓ likelihood constraint. For each of these theoretical power spectra, we generate 10000 Monte Carlo realisations of the CMB.

Footnote 11: We implement our own version of the publicly available likelihood SimLow (PSRoll1) and call it C-SimLow; differences in implementation will be highlighted below.
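As a concrete check of this grid, a minimal sketch (ours, not the Planck pipeline; variable names are hypothetical) that constructs the 191 τ values and the correspondingly rescaled A_s:

```python
import numpy as np

# Build the grid of 191 theory models described above: tau = 0.01 ... 0.2 in
# steps of 0.001, with A_s adjusted so that 10^9 * A_s * exp(-2*tau) stays
# fixed at 1.870.
TAU_MIN, TAU_MAX, DTAU = 0.01, 0.2, 0.001
FIXED = 1.870  # the fixed value of 10^9 * A_s * exp(-2*tau)

taus = np.round(np.arange(TAU_MIN, TAU_MAX + DTAU / 2, DTAU), 3)
A_s = FIXED * 1e-9 * np.exp(2.0 * taus)  # invert the constraint for A_s
```

Each (τ, A_s) pair would then seed a batch of Monte Carlo CMB realisations.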
Using the fitted NCM discussed in Sec. 3.2, we generate Gaussian noise realisations that capture noise and systematics for the relevant frequency and detector set maps. At the map level, we combine the signal and noise maps. With this large suite of simulations we set up the simulation-based likelihood in the following way:
(i) Compute low-ℓ QCS estimates of the simulations, Ĉ_ℓ^sim, from given theory power spectra that have values C_ℓ.
(ii) Compute the quasi-conditional distribution P(Ĉ_ℓ^sim | C_ℓ) by fitting, ℓ-by-ℓ, a model to the distribution of spectra. We perform an unbinned maximum likelihood fit to the log-distribution, ln P(Ĉ_ℓ^sim | C_ℓ), as a function of Ĉ_ℓ^sim, using a third-order polynomial for the central part of the distribution and a first-order polynomial for its tails. Using Lagrange multipliers, we impose smoothness and continuity at the boundaries, ensuring a good fit, f_ℓ(Ĉ_ℓ^sim | C_ℓ), to the conditional.
(iii) We evaluate the above fits across the models at Ĉ_ℓ, the QCS spectra for the data, fitting the same functional form as that used above but this time for the C_ℓ dependence.
(iv) Finally, we construct our likelihood by combining the fits for each multipole as given in Eq. (31), yielding

L(C) = ∏_{ℓ=2}^{29} f_ℓ(Ĉ_ℓ | C_ℓ).

The method will be called SimBaL throughout the paper and we shall denote our implementation of it by C-SimLow. In the present paper, we use C-SimLow with only polarisation data, for the reason discussed above of the difficulties in handling correlations between variables.
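The four steps above can be sketched on a toy problem. Here (our simplifications, not the paper's implementation) the QCS estimate is modelled as a chi-squared-distributed full-sky spectrum at a single multipole, and the piecewise-polynomial fit of step (ii) is replaced by a simple Gaussian density estimate built from simulation moments:

```python
import numpy as np

rng = np.random.default_rng(0)

ell = 15
nu = 2 * ell + 1  # modes per multipole on the full sky

def simulate_chat(C_theory, n_sims):
    """Step (i): draw simulated full-sky spectra for a given theory C_ell."""
    return C_theory * rng.chisquare(nu, n_sims) / nu

C_grid = np.linspace(0.8, 1.6, 17)  # theory models around the value of interest
C_data = 1.2                        # pretend QCS measurement

lnP_at_data = []
for C in C_grid:
    sims = simulate_chat(C, 20000)
    mu, var = sims.mean(), sims.var()
    # step (iii): evaluate a fitted conditional density at the data value
    lnP_at_data.append(-0.5 * (C_data - mu) ** 2 / var - 0.5 * np.log(var))

# step (iv): a smooth fit in the theory C_ell gives the effective likelihood
coeffs = np.polyfit(C_grid, lnP_at_data, 3)
C_ml = C_grid[np.argmax(np.polyval(coeffs, C_grid))]
```

The maximum of the fitted curve lands near the measured value, as the real pipeline's per-ℓ fits do at much higher fidelity.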

Likelihood approximation scheme
The General Likelihood Approximate Solution Scheme (GLASS; Gratton 2017) was developed to allow a principled Bayesian analysis of data even in situations where the sampling distribution for the data is not fully known. Our low-ℓ analysis of the CMB polarisation is a case in point: because of difficulties in quantifying the noise in the maps, we choose to use quadratic cross-spectra for robustness. However, the joint distribution of the multipoles of such spectra, computed on a masked sky, does not have a simple analytic form. Instead, GLASS assumes one can compute certain moments of functions of the data, here the spectral multipoles, as a function of the parameters of the model under investigation. It then uses a maximum entropy construction to compute a least-presumptive sampling distribution consistent with these moments. We now introduce the method, briefly summarising Gratton (2017), to which the reader is referred for a fuller presentation.
For an initial illustration, consider a situation in which one can obtain the mean x̄(θ) and variance σ²(θ) of some function x of the data in terms of a model parameter θ, ideally analytically but also potentially via forward simulations. Then, one maximises the entropy of the system subject to the constraints on x̄(θ), σ²(θ) and the normalisation of P(x), imposed via Lagrange multipliers. The required values of the multipliers, λ(θ) and μ(θ), may be solved for, in general numerically, as a function of θ. The sampling distribution is then given as

P(x | θ) = P(x) exp[−λ(θ) x − μ(θ) x²] / Z(θ),

where Z is a theory-dependent normalisation constant (also called the evidence or partition function). Equivalently, this can be understood as solving for the action (or negative log-likelihood) S,

S(x; θ) ≡ −ln P(x | θ) = λ(θ) x + μ(θ) x² + ln Z(θ) − ln P(x).

The method naturally extends to multiple statistics x_i, i ∈ [1, …, n], fitted to models with multiple parameters θ_a, a ∈ [1, …, m], and using higher moments:

S(x; θ) = λ_i(θ) x_i + μ_ij(θ) x_i x_j + … + ln Z(θ) − ln P(x),    (36)

where summation is implied over repeated indices. However, the evaluation of the Lagrange multipliers quickly becomes very expensive, requiring the numerical computation of many multi-dimensional integrals. We can avoid this cost by instead computing more moments. To simplify notation, we introduce a 'meta-index' A to first run over all indices i, then all pairs of indices ij, and so on. X_A then runs over the x_i, then the x_i x_j, and so on, and similarly λ_A runs over the λ_i, then the μ_ij, and so on. We can express the moments of the X_A as derivatives of the evidence Z with respect to the Lagrange multipliers:

⟨X_A⟩ = −∂ ln Z / ∂λ_A.    (37)

Next, we differentiate Eq. (37) with respect to the parameters θ_a:

∂⟨X_A⟩/∂θ_a = −M_AB ∂λ_B/∂θ_a,    with M_AB ≡ ⟨X_A X_B⟩ − ⟨X_A⟩⟨X_B⟩.    (39)

Now, differentiating the action in Eq. (36) yields

∂S/∂θ_a = (∂λ_A/∂θ_a) X_A + ∂ ln Z/∂θ_a = (∂λ_A/∂θ_a)(X_A − ⟨X_A⟩).    (40)

We can solve Eq. (39) for the ∂λ_A/∂θ_a in terms of the derivatives of the first moments and the second-order cumulants M_AB. Substituting into Eq. (40), and adopting a vector/matrix notation to avoid explicitly writing meta-indices, we obtain

∂S/∂θ_a = −(∂⟨X⟩ᵀ/∂θ_a) M⁻¹ (X − ⟨X⟩),    (41)

which does not depend explicitly on the prior or the Lagrange multipliers. So, obtaining the moments by calculation or simulation, we can compute the gradient of S.
This gradient can then be integrated between two points in parameter space to find the difference in approximate log-likelihood between the two models.
For an instructive if overly simple example, consider applying the above procedure to a one-dimensional problem in which the prior P(x) is uniform and the first two moments of x happen to be calculable as ⟨x⟩ = θ and ⟨x²⟩ = θ² + σ², where θ is a variable parameter of the model and σ² is fixed. Assume we work to linear order in x for S. The meta-index A then ranges over just one element, with X₀ simply being x. Knowing up to second moments in x is then sufficient to evaluate the single component of M, which is ⟨x²⟩ − ⟨x⟩² = σ². Eq. (41) simply reads

∂S/∂θ = (θ − x)/σ²,

which in this case we can integrate by inspection to find S = (θ²/2 − θx)/σ² up to a constant. For inference of θ this may of course be rewritten as S = (x − θ)²/(2σ²) + const, the exact Gaussian result that one might have anticipated from the form of the moments.
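This one-dimensional example can be checked numerically: integrating the gradient from Eq. (41) over a grid of θ should reproduce the Gaussian action up to a constant. A small self-contained sketch (ours, with illustrative numbers):

```python
import numpy as np

# With <x> = theta and Var(x) = sigma^2, Eq. (41) reduces to
# dS/dtheta = (theta - x)/sigma^2; integrating it must recover the
# familiar Gaussian log-likelihood difference between models.
x_obs, sigma = 0.7, 0.3

def grad_S(theta):
    dmean = 1.0             # d<x>/dtheta
    M = sigma ** 2          # the single component of M: <x^2> - <x>^2
    return -dmean * (x_obs - theta) / M  # Eq. (41) in one dimension

theta = np.linspace(0.0, 1.0, 2001)
# trapezoidal cumulative integral of the gradient: S(theta) - S(0)
S_num = np.concatenate(([0.0], np.cumsum(
    0.5 * (grad_S(theta)[1:] + grad_S(theta)[:-1]) * np.diff(theta))))
S_exact = (x_obs - theta) ** 2 / (2 * sigma ** 2)
S_exact = S_exact - S_exact[0]
```

Because the gradient is linear in θ, the trapezoidal rule is exact here and the two curves agree to machine precision.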
For a more complicated example, consider observing a number N of vectors y of Gaussian-distributed components, with the vectors being independent of each other but allowing components within a single vector to be correlated with each other according to a covariance matrix C. The sampling distribution is then

P({y} | C) ∝ |C|^{−N/2} exp[−(N/2) Tr(C⁻¹ Ĉ)],    Ĉ ≡ (1/N) Σ_i y_i y_iᵀ,    (43)

and we see that the components of the observed covariance matrix Ĉ serve as sufficient statistics for learning about the components of C. The scheme recovers the posterior associated with Eq. (43) simply by working to linear order in the components of Ĉ for S: after calculating their first and second moments, with some work Eq. (41) recovers the optimal result

S = (N/2)[ln |C| + Tr(C⁻¹ Ĉ)] + const.

Hence we see how to use GLASS to compute an approximate likelihood, which we name momento, for the cross-spectra in our problem. One can in principle compute all of the intra- and inter-ℓ cumulants between all the (TT, TE and) EE cross-spectra up to some required degree of approximation, assuming the underlying maps are Gaussian-distributed (around some offset noise template). In practice this is relatively easily manageable up to fourth order in the spectra for moderate ℓ_max. Such moments are the natural generalisations of the following formulae for a single multipole of a single cross-spectrum Ĉ_ℓ^{12} on the full sky with isotropic noise (and no unsubtracted map mean noise template):

(2ℓ + 1) ⟨ΔĈ²⟩ = C² + (C + N₁₁)(C + N₂₂),
(2ℓ + 1)² ⟨ΔĈ³⟩ = 2C³ + 6C(C + N₁₁)(C + N₂₂),
(2ℓ + 1)³ ⟨ΔĈ⁴⟩_c = 6C⁴ + 36C²(C + N₁₁)(C + N₂₂) + 6(C + N₁₁)²(C + N₂₂)²,

where ΔĈ ≡ Ĉ_ℓ^{12} − C, C is the signal power, N₁₁ and N₂₂ are the noise powers of the two maps, and the fourth-order expression is the connected (cumulant) part. Taking the theory power components themselves as parameters of the theory, using such cumulants we can numerically integrate ∂S/∂θ_a along a path from a fiducial model to a model in question, for a selection of degrees of approximation (linear and quadratic, requiring from quadratic up to fourth-order moments of the spectra). We find that even the linear approximation performs well and so typically use this in our work.
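The single-multipole cumulant formulae above are easy to check against a toy Monte Carlo, treating each of the (2ℓ + 1) modes as a pair of correlated real Gaussian variables with covariance [[C+N₁₁, C], [C, C+N₂₂]] (an illustrative full-sky, isotropic-noise model, not the masked-sky pipeline):

```python
import numpy as np

rng = np.random.default_rng(1)

ell, C, N11, N22 = 10, 1.0, 0.5, 0.8
nmodes = 2 * ell + 1
nsims = 100000

cov = np.array([[C + N11, C], [C, C + N22]])
Lc = np.linalg.cholesky(cov)
a = Lc @ rng.standard_normal((2, nmodes * nsims))
# cross-spectrum estimate per simulation: average of a1*a2 over the modes
chat = (a[0] * a[1]).reshape(nsims, nmodes).mean(axis=1)

d = chat - chat.mean()
var_mc, k3_mc = np.mean(d ** 2), np.mean(d ** 3)

# theoretical second and third cumulants from the formulae above
var_th = (C ** 2 + (C + N11) * (C + N22)) / nmodes
k3_th = (2 * C ** 3 + 6 * C * (C + N11) * (C + N22)) / nmodes ** 2
```

The Monte Carlo estimates agree with the analytic variance and third cumulant to within sampling noise.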
To summarise, momento uses QCS power spectra, constructed using a reasonable fiducial model with power spectrum C_ℓ^fid to build the appropriate matrices. Then, for each likelihood evaluation, momento:
(i) takes as input a set of theory C_ℓ's,
(ii) computes the difference ΔC_ℓ ≡ C_ℓ − C_ℓ^fid between the theory and the fiducial model,
(iii) uses Romberg integration to compute the change in S going from the fiducial C_ℓ^fid's to the theory C_ℓ's along the line C_ℓ^fid + λ ΔC_ℓ in power spectrum space, with λ being a parameter ranging from zero to one. We choose to use a step size in λ of 0.25.
(iv) This requires the computation of gradients of S with respect to λ at four new positions in power spectrum space for every new likelihood evaluation (at λ = 0.25, 0.5, 0.75, 1), as those computed at the fiducial model (λ = 0) can be reused.
(v) The gradients of S with respect to λ are linear combinations of those of S with respect to the associated C_ℓ's, and
(vi) these ∂S/∂C_ℓ's are computed via Eq. (41), which requires both the QCS power spectra of the data and
(vii) the multidimensional moments/cumulants of the QCS power spectra evaluated at the theory model corresponding to each λ, via the appropriate multidimensional generalisations of the single cross-spectrum cumulant formulae above.
In general, GLASS is a very flexible scheme for computing principled posteriors where likelihoods are challenging to compute, whether for reasons of computational efficiency or more fundamental ones. Since the approach is physically motivated, we do not have the 'black box' behaviour seen in DELFI or other neural-network-based approaches. In contrast to the two other methods, which depend on the use of simulations to train their models, momento only needs simulations for the computation of suitable NCMs, which are then used in the computation of moments as required.
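Steps (i)-(vii) can be illustrated with a toy path integration. The sketch below (ours) uses the linear-order gradient with the full-sky Gaussian variance standing in for the multidimensional QCS cumulants, and a Simpson rule on the λ step of 0.25; all spectra and noise levels are made up for illustration:

```python
import numpy as np

ells = np.array([2, 3, 4, 5])
N11, N22 = 0.3, 0.4                        # illustrative noise levels

C_fid = np.array([1.0, 0.8, 0.6, 0.5])     # fiducial model
C_th = np.array([1.3, 0.9, 0.7, 0.45])     # model being evaluated
C_hat = C_th.copy()                        # pretend the data match the model

def dS_dlam(lam):
    C = C_fid + lam * (C_th - C_fid)       # position along the path
    var = (C ** 2 + (C + N11) * (C + N22)) / (2 * ells + 1)
    dS_dC = -(C_hat - C) / var             # Eq. (41) at linear order
    return np.sum((C_th - C_fid) * dS_dC)  # chain rule for d/dlambda

# quadrature on the lambda grid with step 0.25 (cf. steps iii-iv)
lams = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
weights = np.array([1, 4, 2, 4, 1]) * (0.25 / 3)  # Simpson's rule
delta_S = float(np.sum(weights * np.array([dS_dlam(l) for l in lams])))
```

Moving along the path towards a model that matches the data lowers the action (delta_S < 0), and the gradient vanishes at λ = 1 where theory and data coincide.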

Density-estimation likelihood-free inference
The final method takes an alternative approach to constructing a simulation-based likelihood and is called the 'likelihood-free' approach. In Sec. 4.1 we fitted a functional form to the likelihood L(C_ℓ | Ĉ_ℓ) and obtained a total likelihood by assuming that each ℓ is independent. In this section we instead seek an invertible remapping of the measured cross-spectra, Ĉ_ℓ, to a set of variables z_ℓ, such that the resulting variables are statistically independent, Gaussian random variables. This mapping is obtained using a neural network (NN). With such a mapping the likelihood can be trivially evaluated, as it is a multidimensional Gaussian combined with an appropriate Jacobian. The challenge with this approach is to obtain the mapping when the functional form of the distribution of the Ĉ_ℓ is unknown (see footnote 12).
More precisely, we start with a set of variables, z_ℓ, that are unit-variance normal variables when conditioned on the parameters, i.e. z_ℓ | θ ∼ N(0, 1), where θ denotes the conditional parameters (in our case τ). We wish to find an invertible function f such that z_ℓ = f(Ĉ_ℓ). Given such a mapping we see that

P(Ĉ_ℓ | θ) = P(z_ℓ | θ) |det(∂z_ℓ/∂Ĉ_ℓ)|,    (50)

where the density for z_ℓ, P(z_ℓ | θ), is just a normal distribution.
Finding the mapping f(Ĉ_ℓ), with a tractable Jacobian, seems a daunting task in general; the mapping from the multi-dimensional normal distribution to potentially multimodal distributions is likely non-trivial. To solve this challenge we use a second technique: the above remapping can be expressed as a series of simpler mappings, i.e.

f = f⁽¹⁾ ∘ f⁽²⁾ ∘ … ∘ f⁽ᴷ⁾,

with the only modification that Eq. (50) is changed to a product of Jacobians. The intuition is that we decompose the complex mapping into a series of simple transformations that slowly deform the probability density into a distribution that approaches that of the complex mapping (see footnote 13). We consider a particularly simple series of mappings. For the j-th mapping we perform the following transformation:

z_i → [z_i − μ_i(z_1, …, z_{i−1})] / σ_i(z_1, …, z_{i−1}),

i.e. the i-th output of the mapping is obtained by subtracting an offset from, and scaling, the i-th input, with both the offset μ_i and scale σ_i being functions of all the previous (i − 1) inputs (see footnote 14). This has the nice property that the Jacobian in Eq. (50) for each transformation has the trivial triangular form of a product of the factors 1/σ_i. Thus, the desired distribution can be written as the unit normal density of the final variables multiplied by the product of these simple Jacobians. This expression of the problem has shifted the complexity from fitting a functional form for the likelihood to identifying a suitable series of mapping functions μ_i and σ_i.

Footnote 12: We remind the reader that for the SimBaL approach an explicit functional form for the distribution of spectra was assumed.
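A minimal sketch of one such autoregressive mapping and its inverse, with hand-picked μ_i and σ_i standing in for the neural network (our toy functions, not pydelfi's):

```python
import numpy as np

def mu(prev):
    """Toy shift function of all previous inputs."""
    return 0.1 * np.sum(prev)

def sigma(prev):
    """Toy scale function; exponential keeps it positive."""
    return np.exp(0.05 * np.sum(prev))

def forward(x):
    """Map x -> z, returning z and the log of the (triangular) Jacobian."""
    z = np.empty_like(x)
    log_det = 0.0
    for i in range(len(x)):
        s = sigma(x[:i])
        z[i] = (x[i] - mu(x[:i])) / s
        log_det -= np.log(s)  # Jacobian is the product of the 1/sigma_i
    return z, log_det

def inverse(z):
    """Invert sequentially: x_i depends only on z_i and the previous x's."""
    x = np.empty_like(z)
    for i in range(len(z)):
        x[i] = z[i] * sigma(x[:i]) + mu(x[:i])
    return x
```

Because each output depends only on earlier inputs, the Jacobian is triangular and its determinant is the product of the scale factors, so both the density evaluation and the inversion are cheap.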
To proceed we consider a family of functions parametrised by weights w, and optimise the weights to find the appropriate mapping. Equivalently stated, we solve a variational inference problem: we have a parametrised form for the likelihood P(Ĉ_ℓ | θ; w) and we wish to optimise the parameters so that we approximate the true likelihood as accurately as possible.
To find the best-fitting weights of the NN, we minimise the Kullback-Leibler divergence KL(P*‖P) between the true distribution P*(Ĉ_ℓ | θ) and the parametric distribution P(Ĉ_ℓ | θ; w), defined as

KL(P*‖P) = ∫ P*(Ĉ_ℓ | θ) ln [P*(Ĉ_ℓ | θ) / P(Ĉ_ℓ | θ; w)] dĈ_ℓ.    (54)

The Kullback-Leibler divergence is a measure of the difference between probability distributions; it is a non-negative function that is zero only when the two distributions are identical. By minimising this function, we minimise the mismatch between our parametric conditional distribution and the true conditional distribution. As we do not have access to the true distribution, only samples, we perform a Monte Carlo approximation of the Kullback-Leibler divergence using

−(1/N_s) Σ_{i=1}^{N_s} ln P(Ĉ_ℓ^{(i)} | θ^{(i)}; w),

which, as the number of samples N_s tends to infinity, approaches Eq. (54) up to an additive constant. In principle, any family of functions can be used to perform the parametric fitting in this approach. However, if an overly restrictive set of functions is chosen we will have a poor approximation of the true distribution. Thus, in this work we choose the set of functions to be representable by a NN. NNs are universal approximators: given sufficient data they can represent any function, meaning we have a sufficiently flexible class of functions.

Footnote 13: It has been shown that, for some sufficiently flexible mappings, arbitrary distributions can be modelled via such series of transformations (Huang et al. 2018; Jaini et al. 2019).
Footnote 14: We use the leading notation μ and σ for these functions as this form can equally be thought of as stating that the distribution for the i-th output follows a Gaussian distribution, conditioned on all the previous inputs.
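The Monte Carlo objective above can be illustrated with a toy conditional-density fit: below we fit a parametric Gaussian P(x | θ; w) to samples from an assumed 'true' simulator by gradient descent on the average negative log-density (a stand-in for training the pydelfi network; all functional forms and numbers are ours):

```python
import numpy as np

rng = np.random.default_rng(2)

# assumed 'true' simulator: x | theta ~ N(2*theta, 0.5^2)
thetas = rng.uniform(0.0, 1.0, 5000)
xs = 2.0 * thetas + 0.5 * rng.standard_normal(5000)

# parametric conditional density: P(x|theta; w) = N(w0 + w1*theta, exp(w2)^2)
w = np.zeros(3)
lr = 0.1
for _ in range(4000):
    mu = w[0] + w[1] * thetas
    sig = np.exp(w[2])
    r = (xs - mu) / sig
    # gradients of the mean negative log-density with respect to w
    g = np.array([np.mean(-r / sig),
                  np.mean(-r * thetas / sig),
                  np.mean(1.0 - r ** 2)])
    w -= lr * g
```

Minimising the Monte Carlo cross-entropy drives the parametric conditional towards the true one: the fitted slope, intercept and width approach 2, 0 and 0.5.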
We use the pydelfi implementation of this method to construct a polarisation-only and a joint temperature-polarisation likelihood, with the precise configuration given in App. C. When dealing with a high-dimensional problem (≳ 30 dimensions) in likelihood-free inference, score compression is required both to reduce computational cost and to give stable results. The greater the degree of compression, the more sub-optimal the likelihood. As a consequence, the results of our pydelfi likelihood have larger uncertainties, as discussed in detail in App. C3 (for a more general discussion, see Alsing et al. 2018).

LIKELIHOOD VALIDATION ON SIMULATIONS
In this section we present tests of the three likelihood methods discussed in Sec. 4 against simulations. The aim is to investigate whether there are biases or significant differences in their performance. We also test what happens if the input value of τ in the simulations is changed to be higher or lower than the fiducial value of 0.06 used to construct the covariance matrices required for the QCS estimator.

Fig. 7 (caption): Maximum likelihood values of τ from the simulations of Fig. 6, for the three pairs of likelihood methods (C-SimLow, momento and pydelfi). The grey shaded area shows the typical posterior width σ(τ). The black, red and green dashed lines show linear fits to each set of points; their correlation coefficients (slopes of the linear fits) are given in the legend. There is a high degree of correlation between all three methods, especially between C-SimLow and pydelfi.
Since end-to-end simulations are used to model noise, we have taken care to use an independent subset of the simulations for the tests described in this section. Thus, the noise estimation steps used all but 100 end-to-end simulations for each of SRoll1 and SRoll2, leaving the remaining simulations available for likelihood validation. We also generated 100 new Gaussian CMB realisations for each of three values of τ: τ = 0.05, 0.06 and 0.07, using the same CMB realisations for SRoll1 and SRoll2. We use the same masks for the tests as for the data analysis (illustrated in Fig. 1).

Fig. 6 compares the performance of the three likelihood methods for polarisation-only (EE) inferences on τ for the SRoll2 simulations (the SRoll1 case is very similar). The figure shows posteriors for each of the 100 test simulations using the τ = 0.06 CMB realisations. All three likelihoods perform satisfactorily, without significant bias. The mean maximum likelihood values ⟨τ_ML⟩, the mean posterior widths ⟨σ(τ)⟩ and the standard deviation of the maximum likelihood values σ(τ_ML) found for each likelihood are given in Table 2 (which also lists values for the τ = 0.05 and 0.07 CMB realisations).
As well as investigating the average behaviour of the likelihoods, we have also compared them realisation by realisation. Fig. 7 shows scatter plots of the maximum likelihood values for each pair of likelihoods for SRoll2 with an input τ of 0.06. To guide the eye, the grey shaded area shows the range expected for a ±1σ error of σ(τ) = 0.006. There is a high degree of correlation between all three likelihoods, especially between C-SimLow and pydelfi, as discussed in Sec. 4.3. This behaviour is expected since the pydelfi approach is effectively a generalisation of the simulation-based likelihood C-SimLow (fitting a set of Gaussians to the conditional distributions, instead of using a pre-defined functional form). The scatter between C-SimLow and pydelfi is about a sixth of a sigma, and between either C-SimLow or pydelfi and momento it is roughly half a sigma (see footnote 15). Applied to the same simulations, methodological differences in the likelihood implementations lead to differences in the maximum likelihood value of τ of less than a standard deviation. Table 2 also gives results for simulations in which the CMB realisations are generated from models with both lower (0.05) and higher (0.07) values of τ than the fiducial value τ = 0.06 used to compute the QCS estimates. No bias is seen for any of the likelihoods, confirming that the methods are insensitive to the choice of fiducial cosmology.

Table 2 (caption): Summary of likelihood tests performed using Gaussian realisations of CMB signal maps and 100 end-to-end SRoll2 simulations for the simulation-based likelihood (C-SimLow), the likelihood approximation scheme (momento) and the density-estimation likelihood-free (pydelfi) method. Here we make inferences on τ using low-ℓ polarisation data only. For each likelihood the mean maximum likelihood value ⟨τ_ML⟩, the mean posterior width ⟨σ(τ)⟩ and the standard deviation of the maximum likelihood values σ(τ_ML) are computed. The input τ for the CMB realisations is denoted by τ_in. No evidence for any bias in the recovered τ_ML's is seen, even when the CMB is drawn from a distribution that does not match the fiducial model with τ = 0.06.
We note that the posteriors on τ determined from momento are about 10% tighter than those determined from either C-SimLow or pydelfi. The distribution of maximum likelihood values of τ, shown in Fig. 6, is also tighter for momento. We have therefore chosen to use momento as our default low-ℓ likelihood in Sec. 6 when combining with the high-ℓ likelihood. Finally, all methods give average posterior widths that are slightly smaller than the scatter of their maximum likelihood values. The distribution of maximum likelihood values of τ should be closely related to the width of the posterior distribution but is not guaranteed to be the same. The agreement is, however, close enough to demonstrate that the widths of the posterior distributions are not seriously in error.

RESULTS
In this section, we use the three likelihoods described in Sec. 4 to derive constraints on τ from both the Planck 2018 legacy maps (SRoll1) and the SRoll2 maps. In Sec. 6.1 we present results for the cross-correlation of the 100 GHz and 143 GHz full-mission maps, since this choice of maps was used in PCP18 and Pagano et al. (2020) to derive the results quoted in Eqs. (3a) and (3b). Sec. 6.2 presents a more extensive investigation of the SRoll2 data set, comparing all six combinations of 100 GHz and 143 GHz detector set maps to test whether the results are sensitive to different splits of the Planck data. To reduce the computational burden, in Secs. 6.1 and 6.2 we perform one-dimensional parameter scans in τ, allowing A_s to change according to a fixed value of 10⁹ A_s e^{−2τ} = 1.870. The other parameters of the base ΛCDM cosmology are fixed to H₀ = 67.04, Ω_b h² = 0.0221, Ω_c h² = 0.12, Ω_ν h² = 0.00064, 100θ_* = 1.0411, n_s = 0.96 (PCP18). In Sec. 6.3, we relax the constraint on 10⁹ A_s e^{−2τ} and perform a full Monte Carlo exploration of the six-dimensional cosmological parameter space using momento in conjunction with the high-ℓ CamSpec v12.5HM likelihood (EG19).

Footnote 15: The quoted σ-shifts are calculated as a fraction of the scatter between the τ_ML measurements, in other words of σ(τ_ML).

Table 3 (caption): Summary of τ constraints for 100 × 143 full-mission cross-spectra obtained using a simulation-based likelihood (C-SimLow), a likelihood approximation scheme (momento) and a density-estimation likelihood-free (pydelfi) approach. For the polarisation-only (EE) case we measure τ using only the low multipole polarisation data, and for the joint (TTTEEE) case we compute a joint likelihood for temperature and polarisation data.

Constraints using 100 × 143 full-mission QCS

Table 3 summarises the results from foreground-cleaned 100 × 143 full-mission maps. These results are plotted in Fig. 8 and compared with the results of Eqs. (3a) and (3b) (shown as the blue and red dashed lines respectively, together with 1σ errors shown by the red and blue shaded areas).

For all three likelihood approximations, our results reproduce the upward movement in τ between SRoll1 and SRoll2 (as noted in Delouis et al. (2019)). Furthermore, the results from all three likelihoods are consistent with each other. There are, however, some interesting features that are worth noting:

(i) The pydelfi TTTEEE results have larger error bars than the EE results alone, even though additional data are included in the likelihood (though the best-fit value of τ hardly changes). As outlined in Sec. 4.3, one of the drawbacks of likelihood-free inference is that higher-dimensional problems (≳ 30 dimensions) require additional data compression. In our implementation we compressed the 84 power spectrum components for 2 ≤ ℓ ≤ 29 into three summary statistics, one each for TT, TE and EE, whereas for the EE-only implementation we were able to avoid compression entirely, using all 28 power spectrum multipoles. The larger pydelfi error is a consequence of lossy compression, which actually degrades the EE block. In fact, we found that a score-compressed EE posterior from pydelfi gives a maximum likelihood value for τ that is lower by ∼ 0.25σ compared to the results without compression. With compression, the TT and TE data actually pull the posterior upwards by ∼ 0.3σ, largely cancelling the effects of compression on τ. This is discussed further in App. C3.
(ii) For momento, adding TT and TE spectra to EE causes upward shifts in τ of approximately 0.002 (∼ 0.4σ) for both SRoll1 and SRoll2. The posterior width is reduced by a modest ∼ 5%. This behaviour is consistent with the parameter-shift criteria developed by Gratton & Challinor (2020).
(iii) The SRoll2 likelihoods consistently yield higher values of τ, with slightly tighter errors, than the corresponding SRoll1 likelihoods.
(iv) The errors on τ from our application of C-SimLow to SRoll1 are about 20% smaller than those quoted in Eq. (3a). There are two reasons for this: (a) the results of PCP18 used a sub-optimal noise model based on Planck FFP8 end-to-end simulations for the QCS computations; (b) we subtracted a smoothed noise template at the map level (see Eq. (30)), which reduces the posterior widths by ∼ 5-10% for SRoll1, as explained in more detail in App. B.
Recently, a new set of Planck maps for the LFI and HFI frequency bands has been developed (Planck Collaboration 2020e, hereafter NPIPE). As with SRoll1 and SRoll2, a set of systematic templates is fitted as part of the map-making stage. However, amongst other differences, NPIPE retains the CMB Solar dipole in each map and uses foreground polarisation priors at 30, 217 and 353 GHz to break parameter degeneracies. The use of polarisation priors leads to a suppression of the polarisation signal at low multipoles, necessitating the calibration of power spectrum transfer functions from end-to-end numerical simulations. The transfer function corrections are quite large for the multipoles ℓ = 2-7 that contain most of the information on τ. The analysis of the NPIPE 100 × 143 spectrum presented in Planck Collaboration (2020e) gives a value for τ that is lower by 1.2-1.6σ compared to the results of Table 3. The results from NPIPE are nevertheless broadly in agreement with those from SRoll1 and SRoll2.

The optical depth from inter- and intra-frequency detector set combinations of SRoll2 maps
To assess the robustness of the results of the previous section, we have analysed the inter-and intra-frequency spectra computed from SRoll2 detector set maps. The SRoll2 cross-spectra used in this sub-section are shown in Fig. 5.
We stress that the likelihoods for each map pair have been computed/trained afresh using the appropriate detector set noise covariance matrices constructed from the relevant simulations. The results of the cross-checks, for each of the three likelihoods, are presented in Table 4. Each column lists the mean value for and the associated posterior width for the indicated spectrum combination.
Focusing on the results from momento, the posterior widths for 143ds1×143ds2 are about 5% smaller than for 100ds1×100ds2, which is expected because the 143 GHz maps are less noisy than the 100 GHz maps. (The reduction in errors is, however, smaller for the other two likelihoods.) More significantly, at multipoles ℓ = 3-5 the 143ds1×143ds2 spectrum lies above the τ = 0.055 theoretical curve. As a consequence, for all likelihoods, the 143ds1×143ds2 value of τ is higher than that for the 100 × 143 spectra by about 1.4-2σ. The 100ds1×100ds2 combination also shows a preference for higher values of τ compared to the 100 × 143 spectra, but to a lesser extent. This is suggestive of correlated residual systematics in the 100 GHz detset and 143 GHz detset maps which partially cancel when 100 GHz detset maps are cross-correlated against 143 GHz detset maps (as discussed by Pagano et al. 2020). The effects are relatively small but, in agreement with the conclusions of Pagano et al. (2020), our results suggest that the 100 × 143 spectra are likely to provide the most reliable constraints on τ.
The last row in Table 4 shows the mean of the τ values for all of the inter- and intra-frequency cross detector set spectra constraints (ignoring correlations). The 100ds1×100ds2 and 143ds1×143ds2 results pull the means to slightly higher values of τ compared to the 100 × 143 full-mission results, but only by about 0.5σ. Thus, while there is some evidence of small systematic-related biases in the 100ds1×100ds2 and 143ds1×143ds2 values, the net effect of residual systematics on the 100 × 143 full-mission results is probably at the level of a standard deviation or less. This statement depends on the fidelity of the noise and systematics simulations.
Columns 3 and 4 of Table 4 illustrate the impact of polarised synchrotron cleaning on τ. (This test was done only for the C-SimLow likelihood.) As expected, the effect on τ is most pronounced for 100ds1×100ds2, with synchrotron cleaning lowering τ by about 1σ and bringing it into closer agreement with the 100×143 full-mission result. However, the effects of synchrotron cleaning on the 100 × 143 and 143ds1×143ds2 spectra are significantly smaller, leading to changes in τ of ∼ 0.3σ. Synchrotron cleaning, while non-negligible, is not a critical factor in the constraints from the 100 × 143 spectra.

Full Markov Chain Monte Carlo Parameter Exploration
We explore the full ΛCDM cosmological parameter space by combining our full-mission 100 × 143 momento EE and TTTEEE likelihoods at low multipoles, for both SRoll1 and SRoll2, with the high-ℓ CamSpec v12.5HM (TTTEEE) likelihood (which uses SRoll1 maps). The momento EE likelihood uses the multipole range 2 ≤ ℓ ≤ 29 and is combined with the Planck 2018 low-ℓ temperature likelihood over the same multipole range. The momento TTTEEE likelihood uses the multipole range 2 ≤ ℓ ≤ 10 to speed up the low multipole likelihood evaluations. The multipoles ℓ = 11-29 have very little constraining power on τ (see footnote 16), and so little information on τ is lost by truncating the momento TTTEEE likelihood at ℓ = 10.

Footnote 16: We tested this by constraining τ on simulations for the reduced multipole range (2 ≤ ℓ ≤ 10) compared to the full low multipole range (2 ≤ ℓ ≤ 29) and found that the posteriors were almost identical.

Table 4 (caption): Summary of τ posteriors derived from SRoll2 maps. Synchrotron and dust emission have been removed from the maps through template fitting, except for the column indicated, which shows results obtained when only dust was subtracted. Hence the two C-SimLow columns illustrate the difference at the parameter level caused by the removal of synchrotron emission at the map level. The bottom row assumes that the six detector-based likelihoods are highly correlated and shows the mean of these results. The pydelfi (TTTEEE) likelihood score-compresses the power spectra before inferring the optical depth, which results in wider posteriors than for the uncompressed pydelfi (EE)-only likelihood.

Table 5 (caption): Cosmological parameter constraints for the ΛCDM cosmology obtained using momento with 100 × 143 full-mission spectra at low-ℓ and the CamSpec v12.5HM (TTTEEE) likelihood at high-ℓ, with 68% confidence levels. We compare the effect on the parameter constraints of combining our polarisation-only or joint temperature-polarisation low-ℓ likelihood momento with the high-ℓ (ℓ ≥ 30) likelihood CamSpec. The redshift of reionization z_re is defined in the same way as in PCP18.

The momento TTTEEE likelihood is used with the Planck 2018 low-ℓ temperature likelihood over the multipole range 11 ≤ ℓ ≤ 29, so that there are no multipole gaps in the likelihood. Table 5 lists the results of a full MCMC exploration of the parameters of the base ΛCDM cosmology. These results are similar to those summarised in Table 3 for the one-dimensional scans. The SRoll2 results for τ are about 1σ higher than those from SRoll1, and the momento TTTEEE likelihoods give values for τ that are about 0.3σ higher than those using the momento EE likelihoods.
Our results using momento (TTTEEE) are higher by about 0.3−0.5σ and (formally) have slightly smaller error bars. Fig. 9 illustrates the changes to the constraints caused by switching from SimLow to momento (EE) and then to momento (TTTEEE). This figure shows contours in the ln(10^10 A_s)−τ plane for base ΛCDM. In each case we use the same high-ℓ CamSpec v12.5HM likelihood for 30 ≤ ℓ ≤ 2500, so only the low-ℓ likelihoods change: grey contours for SimLow, red contours for SRoll1 momento (EE) and blue for SRoll1 momento (TTTEEE). Interestingly, the momento (TTTEEE) likelihood disfavours values of τ ≲ 0.04 that are already excluded by the Gunn-Peterson test (see Eq. (2)). In other words, the posteriors of the momento (TTTEEE) likelihood are consistent with what we know about the intergalactic medium.
The main results of this section are summarised in Fig. 10. The plot on the left shows the marginalised posterior distribution of τ for CamSpec combined with the SRoll1 likelihoods as described above. This shows the small shifts in τ when we use the SRoll1 momento likelihoods in place of the Planck 2018 low-ℓ likelihood. The right-hand plot shows how the posteriors change if we use the SRoll2 momento likelihoods. The momento constraints shift to higher values of τ as a consequence of the changes to the HFI map-making. We take as our 'best' estimates of τ and the redshift of reionization, z_re, the results from the combined CamSpec + SRoll2 momento (TTTEEE) likelihood: τ = 0.0627 +0.0050 −0.0058, z_re = 8.51 ± 0.52, slightly higher than the result of Eq. (56b).

SUMMARY AND CONCLUSIONS
The determination of the optical depth to reionization from the CMB is extremely challenging, yet of great importance for our understanding of the intergalactic medium and the formation of the first stars and galaxies. Improvements in the HFI map-making algorithms described in PSRoll1 and PSRoll2 have led to maps which have low levels of residual systematics in polarisation, resulting in low values of τ (Eqs. (56a) and (56b)). Producing high fidelity polarisation maps is only one part of the story, however. Deriving accurate constraints on τ requires an accurate likelihood. The construction of a likelihood is not straightforward at low multipoles for maps with complex noise properties and partial sky coverage. There is no analytic guide to help create such a likelihood, particularly if the likelihood is built around quadratic cross-spectra. In PSRoll1 and PSRoll2, an EE likelihood was constructed based on a relatively small number of end-to-end simulations, leading to the results of Eqs. (56a) and (56b).
In this paper we have developed and compared three likelihood techniques on the Planck SRoll1 and SRoll2 maps. The first is a variant of the SimBaL scheme described in PSRoll1 and PSRoll2, but using more accurate simulation-based noise covariance matrices to construct quadratic cross-spectra and to generate a large number of independent noise realisations. The second (momento) is based on the maximum entropy approach developed by Gratton (2017) and the third (pydelfi) is a density-estimation 'likelihood free' scheme that follows closely the implementation described by Alsing et al. (2019). The momento and pydelfi approaches can be generalised to construct low multipole TTTEEE likelihoods. (Though not explored in this paper, it is straightforward to adapt these schemes to develop likelihoods incorporating other low multipole spectra.) Our main conclusion is that all three likelihood methods are in good agreement and support the conclusions on τ reported in PCP18 and Pagano et al. (2020); we do, however, see small differences between the likelihoods as summarised in Tables 3 and 4. Using only the EE spectra at low multipoles, our results tend to give higher values of τ than those using SimBaL (Eqs. (3a) and (3b)) by up to ∼ 0.8σ. However, if we include the high multipole CamSpec likelihood, the result for τ using the SRoll2 momento (TTTEEE) likelihood is very close to that given in Eq. (56b), though with a smaller formal error.
We constructed low multipole TTTEEE likelihoods using momento and pydelfi. For momento, using a TTTEEE likelihood leads to smaller errors on τ than using EE alone, as expected. However, we had to apply data compression to produce a TTTEEE pydelfi likelihood that was numerically fast and robust enough for likelihood analysis. This resulted in a loss of information and in constraints with slightly larger errors using pydelfi (TTTEEE) compared to pydelfi (EE), though with no evidence of any bias; see App. C3 for a detailed discussion.
We also made a thorough analysis of different detector-set data splits at 100 and 143 GHz, as summarised in Table 4. For all likelihoods, the 100ds1×100ds2 and 143ds1×143ds2 spectra give values of τ that are higher than those from the baseline 100 × 143 full-mission analysis by between 0.8σ and 2σ. There is therefore evidence that within a frequency band there remain correlated systematic effects (as is apparent visually from Fig. 3) that bias τ high by ∼ 0.01. The series of null tests described in PSRoll2, together with the absence of any statistically significant B-mode signal at low multipoles, suggests that the 100 × 143 full-mission cross-spectra should provide unbiased estimates of τ. The changes between PSRoll1 and PSRoll2 suggest an upper bound of about 1σ on biases in τ caused by residual systematics.
As noted above, the likelihood techniques explored here have wider applications and can be adapted to other problems involving low multipole polarisation maps, particularly if the maps have complex noise properties. An obvious example is the measurement of the tensor-to-scalar ratio r, in addition to τ, from the forthcoming CMB satellite LiteBIRD (Sugai et al. 2020).
Initiative (WPI), MEXT, Japan and the Center for Computational Astrophysics of the Flatiron Institute, New York City. The Flatiron Institute is supported by the Simons Foundation.
This work was performed using the Cambridge Service for Data Driven Discovery (CSD3), part of which is operated by the University of Cambridge Research Computing on behalf of the STFC DiRAC HPC Facility (https://dirac.ac.uk). The DiRAC component of CSD3 was funded by BEIS capital funding via STFC capital grants ST/P002307/1 and ST/R002452/1 and STFC operations grant ST/R00689X/1. DiRAC is part of the National e-Infrastructure.

APPENDIX A: ANALYTIC SOLUTION TO THE NOISE INFERENCE PROBLEM
In order to minimise Eq. (28) analytically for M, we first consider the variation in S caused by small changes in M. Omitting the 1/2 prefactor, at the minimum we require the variation δS = 0 for arbitrary variations δΨ. We then use the generalised Sherman-Morrison-Woodbury formula (adding and subtracting M^{-1}) to rewrite Y M^{-1}; substituting this expression and its transpose into Eq. (A5) and rearranging for M then gives the desired Eq. (29).
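The Sherman-Morrison-Woodbury identity used in the manipulation above can be verified numerically. The sketch below checks the standard form of the identity on small random matrices; the matrices are illustrative stand-ins, not the covariance blocks of the paper.

```python
import numpy as np

# Numerical check of the Sherman-Morrison-Woodbury identity,
#   (A + U C V)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1},
# the matrix-inversion lemma invoked in Appendix A. All matrices here are
# random, well-conditioned stand-ins for the paper's covariance blocks.
rng = np.random.default_rng(0)
n, k = 6, 2
A = np.diag(rng.uniform(1.0, 2.0, n))   # invertible base matrix
U = rng.normal(size=(n, k))
C = np.diag(rng.uniform(1.0, 2.0, k))   # invertible "core" matrix
V = rng.normal(size=(k, n))

lhs = np.linalg.inv(A + U @ C @ V)
Ainv = np.linalg.inv(A)
inner = np.linalg.inv(np.linalg.inv(C) + V @ Ainv @ U)
rhs = Ainv - Ainv @ U @ inner @ V @ Ainv

assert np.allclose(lhs, rhs)
```

The identity is useful in this context because it trades the inversion of a large n × n matrix for the inversion of a much smaller k × k one.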

APPENDIX B: SMOOTHED TEMPLATE SUBTRACTION
This appendix discusses the effect of smoothed template subtraction. The ADCNL effectively leads to CMB-independent offsets in Planck HFI maps. The CMB signal can then suffer chance correlations with these offsets, leading to additional scatter in power spectra. This effect is seen in simulations and, as discussed in Sec. 3.2, can potentially be mitigated by computing a smoothed estimate of the offset for each map and then subtracting the appropriate estimate from each of the QCS input maps. Applying this prescription to SRoll1 spectra leads to a ∼ 10% reduction in the posterior width for τ from SRoll1 and to a ∼ 5% reduction from SRoll2. This is consistent with the hypothesis that SRoll2 better reduces large-scale residuals than SRoll1 and so sees less of an improvement. Note that the procedure of removing smoothed templates from real data only leads one closer to the truth if the simulations from which the smoothed templates are computed are relatively accurate not only in the power spectrum domain but also in the map domain.

Figure B2. QCS estimates for the TE, EE and BB 100 × 143 full-mission foreground-cleaned cross-spectra computed for the difference between SRoll2 and SRoll1 maps. The black points show the spectra computed before any smoothed templates are removed, while the red points show the spectra computed after the maps have had their corresponding templates subtracted. The error bars for the cross-spectra are estimated from the difference spectra themselves, as the signal has been removed. The residual spectrum is not shown as it is consistent with zero. Note that the errors plotted on the EE spectrum are smaller than the points at low ℓ; there are significant changes (relative to the noise) between SRoll1 and SRoll2 in EE.
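The smoothed-template prescription described in this appendix can be illustrated with a one-dimensional toy: estimate the large-scale offset of a noisy signal by keeping only its lowest Fourier modes, then subtract that smooth template. This is purely an analogue of the spherical-harmonic (ℓ ≤ 4) smoothing of the real maps, with all quantities invented for illustration.

```python
import numpy as np

# Toy 1-d analogue of smoothed template subtraction: estimate a slowly
# varying 'offset' by low-pass filtering, then subtract it before any
# spectrum estimation. The real procedure works on sphere maps and keeps
# harmonic modes up to ell = 4; here we keep the lowest Fourier modes.
rng = np.random.default_rng(2)
n = 256
x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
offset = 0.5 * np.cos(2.0 * x)             # slowly varying systematic offset
signal = rng.normal(scale=0.2, size=n)     # toy 'CMB + noise' signal
data = signal + offset

# Smoothed template: keep only the Fourier modes with |k| <= 4.
fk = np.fft.rfft(data)
fk[5:] = 0.0
template = np.fft.irfft(fk, n)

cleaned = data - template
# The residual offset power should drop substantially after subtraction.
print(np.std(data - signal), np.std(cleaned - signal))
```

As expected, the cleaned residual is far smaller than the raw offset; the small remainder comes from the low-frequency part of the noise leaking into the template, which mirrors the caveat above about template accuracy.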

B1 Comparison between SRoll1 and SRoll2
Power spectra of the difference maps computed between SRoll1 and SRoll2 are shown in Fig. B2. The error bars are based on the error estimates obtained from the difference spectra, since no signal should be left in the difference maps. Both maps have been foreground cleaned with the respective maps, as detailed in Sec. 3.1. The red points show the spectra computed from maps that have had the corresponding smoothed templates (one for SRoll1 and one for SRoll2) subtracted. The black points show the difference at the map level without subtracting the templates. We note that the subtractions only have large effects for ℓ ≤ 4 (as one would expect, given that this is the maximum multipole retained in the templates) and, except for the quadrupole, reduce the residuals. With or without the template subtractions, the residual cross-spectra show differences at the very low multipoles, below ℓ ∼ 10, for TE, EE and BB. We do not show the TT spectrum since it is signal dominated and, as shown in Fig. 4, the TT spectra are almost identical for SRoll1 and SRoll2. This test confirms that the SRoll2 map-making algorithm indeed alters large-scale polarisation features relative to SRoll1.

APPENDIX C: TECHNICAL DETAILS FOR DENSITY-ESTIMATION LIKELIHOOD-FREE INFERENCE
In this Appendix we elaborate on the technical details of the likelihood-free inference code pydelfi (Alsing et al. 2019), based on theoretical groundwork by Leclercq (2018), Alsing et al. (2018, 2019), Lueckmann et al. (2018) and Papamakarios et al. (2018). These methods provide a means of performing Bayesian inference when it is possible to produce high accuracy simulations of the observables in question, even when it is not possible to write down an explicit analytic expression for the likelihood. Examples of such problems include measuring cosmological parameters from type Ia supernovae (Weyant et al. 2013), weak lensing peak counts (Lin & Kilbinger 2015; Jeffrey et al. 2020), analysing the galaxy-halo connection (Hahn et al. 2017), inferring photometric redshifts and the size evolution of galaxies (Carassou et al. 2017), measuring cosmological redshift distributions (Kacprzak et al. 2018), estimating the ionising background from the Lyman-α and Lyman-β forests (Davies et al. 2018) and inferring the sum of the masses of Andromeda and the Milky Way (Lemos et al. 2021).
The key principle behind LF methods is the understanding that a simulated observable, d, and the input parameters, θ, are a sample from the joint probability distribution P(d, θ|M) conditioned on the theoretical model. For compactness we will hereafter suppress the conditional aspect of the distribution and make the assumption that our simulations accurately represent the true physical process 17. With a sufficiently large number of samples from P(d, θ) we can obtain a posterior on the parameters by selecting the samples that have d_data = d_sim. In many situations it is essentially impossible to obtain d_data ≡ d_sim, and thus the condition |d_data − d_sim| < ε is used instead (for some suitable metric).
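The acceptance step described above is classic rejection ABC. The sketch below demonstrates it on a deliberately simple toy simulator (a Gaussian-mean model; all names and numbers are hypothetical stand-ins for the CMB simulations of the paper), keeping only the parameter draws whose simulated data fall within ε of the "observed" data.

```python
import numpy as np

# Rejection step of likelihood-free inference on a toy problem:
# draw (theta, d_sim) pairs from the joint distribution, then keep the
# thetas whose simulated observable lies within epsilon of the data.
rng = np.random.default_rng(1)
d_data = 0.5                  # 'observed' data point
epsilon = 0.05                # acceptance tolerance

theta = rng.uniform(-2.0, 2.0, size=200_000)             # prior samples
d_sim = theta + rng.normal(scale=0.3, size=theta.size)   # simulator: d = theta + noise

accepted = theta[np.abs(d_sim - d_data) < epsilon]
# The accepted thetas approximate the posterior P(theta | d_data); for this
# toy model its mean should sit close to d_data = 0.5.
print(accepted.size, accepted.mean())
```

Note how wasteful the scheme is: only a few percent of the 200 000 simulations are accepted, and the fraction shrinks further as ε decreases or the data dimension grows — which is exactly the motivation for the density-estimation methods introduced next.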
An immediate challenge to the above idea is that, for many situations, including cases in cosmology, it can require an extremely large number of simulations to obtain samples from the posterior, especially if one requires a small tolerance ε. This is particularly true when the dimension of the observable is large. This challenge can be addressed with density-estimation likelihood-free inference (DELFI). The conceptually simplest methods use a density estimator to obtain a parametric form for the joint distribution P(d, θ|w), where w are the parameters of the density estimator. These methods have been found to require many orders of magnitude fewer simulations. In this work we use a slight variation: we use a density estimator to model the conditional distribution P(d|θ, w), utilising the fact that, when conditioned on the parameters, d_sim is a sample from P(d|θ). This variation means that one can be agnostic about the properties of the chosen simulation points; see Alsing et al. (2019) for a more detailed discussion. The likelihood is then obtained by evaluating this conditional distribution at the observed data d_0. Given the likelihood and a prior, we then use MCMC sampling to obtain posterior samples.

C1 Masked Autoregressive Flows
We require the density estimator to be flexible enough to allow us to accurately approximate the true conditional distribution, and also computationally tractable both to fit and to use. Recent work by e.g. Papamakarios & Murray (2016) and Uria et al. (2016) has shown that Masked Autoregressive Flows (MAFs) provide one such method.

17 Note that a claim often made against LF methods is their explicit strong dependence on the accuracy of the simulations: missing or inaccurate components in the simulations can bias inferences. Whilst this is the case, it is similarly true for other approaches that the validity of their results depends on the correctness of the chosen model.
MAFs rely on two steps to achieve these goals. First, they utilise the chain rule to express the multidimensional conditional distribution, P(d|θ), as a series of one-dimensional conditionals,

P(d|θ) = ∏_{i=1}^{N} P(d_i | d_{1:i−1}, θ) . (C1)

A model utilising this decomposition is known as an autoregressive model. Next, a form is chosen for the one-dimensional conditional distributions. We assume the conditionals are one-dimensional Gaussian distributions whose means, μ_i, and standard deviations, σ_i, depend on d_{1:i−1} and θ. To achieve the desired flexibility we use neural networks, with parameters w, to parametrise these functions as μ_i(d_{1:i−1}, θ|w) and σ_i(d_{1:i−1}, θ|w).
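The factorisation in Eq. (C1) can be checked on a case where the exact answer is known. The sketch below uses the analytic conditionals of a correlated two-dimensional Gaussian (standing in for the μ_i, σ_i that the MAF would learn with neural networks) and verifies that the product of one-dimensional conditionals reproduces the joint density.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Check the autoregressive factorisation P(d1, d2) = P(d1) * P(d2 | d1)
# for a correlated 2-d Gaussian, where the conditional mean and standard
# deviation are known in closed form.
rho = 0.8
cov = np.array([[1.0, rho], [rho, 1.0]])
d = np.array([0.3, -1.2])

# Exact joint log-density.
log_joint = multivariate_normal(mean=[0.0, 0.0], cov=cov).logpdf(d)

# Autoregressive form: the conditionals play the role of mu_i, sigma_i.
log_p1 = norm(0.0, 1.0).logpdf(d[0])
mu2 = rho * d[0]                     # conditional mean of d2 given d1
sigma2 = np.sqrt(1.0 - rho**2)       # conditional standard deviation
log_p2 = norm(mu2, sigma2).logpdf(d[1])

assert np.isclose(log_joint, log_p1 + log_p2)
```

In a MAF the two conditionals would be replaced by network outputs μ_i(d_{1:i−1}, θ|w) and σ_i(d_{1:i−1}, θ|w), but the bookkeeping is identical.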
As was shown in Germain et al. (2015), this setup can equivalently be formulated as a mapping from d to the variables

u_i = (d_i − μ_i(d_{1:i−1}, θ|w)) / σ_i(d_{1:i−1}, θ|w),

where the u_i are independent zero-mean, unit-variance Gaussian random variables. This means the density estimator has the simple form of a product of standard Gaussians in the u_i divided by the standard deviations σ_i, where the product over the standard deviations is the Jacobian of the transformation from u to d. We use an efficient implementation of this setup, the Gaussian Masked Autoencoder for Density Estimation, and hereafter refer to this method as MADE. There are two main limitations to using MADEs: first, they depend sensitively on the order of the factorisation and, second, the assumption of Gaussian conditionals may be overly restrictive. To mitigate these shortcomings we stack a series of MADEs to make a MAF: the output u of each MADE is the input of the next one, so the conditional density estimator is given by the composition of the individual MADE transformations. Thus we obtain a conditional distribution that is both analytically tractable (a simple product of Gaussians) and highly flexible. Finally, we fit the weights of the neural network as detailed in Sec. 4.3 around Eq. (54). By minimising the loss function in Eq. (55) we train the neural density estimators (NDEs) with respect to the network weights. The problem of overfitting is mitigated by applying the standard machine-learning procedures of early stopping, dropout and the random selection of training and testing subsets.
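The change-of-variables form above can likewise be verified on the two-dimensional Gaussian toy case, with the exact conditionals standing in for the learned networks: the density is the product of standard normals in u_i times the 1/σ_i Jacobian factors.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Check the (Gaussian) MADE density formula
#   P(d) = prod_i N(u_i; 0, 1) / sigma_i,  u_i = (d_i - mu_i) / sigma_i,
# where the 1/sigma_i factors are the Jacobian of the map from d to u.
rho = 0.8
d = np.array([0.3, -1.2])

mu = np.array([0.0, rho * d[0]])                  # mu_1, mu_2(d_1)
sigma = np.array([1.0, np.sqrt(1.0 - rho**2)])    # sigma_1, sigma_2(d_1)
u = (d - mu) / sigma                               # standardised variables

log_density = np.sum(norm.logpdf(u) - np.log(sigma))

cov = np.array([[1.0, rho], [rho, 1.0]])
assert np.isclose(log_density,
                  multivariate_normal(mean=[0.0, 0.0], cov=cov).logpdf(d))
```

This is the same identity a stacked MAF exploits, with each MADE contributing one layer of the overall transformation to independent unit Gaussians.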

C2 Architecture of pydelfi
For the density estimation, we select an ensemble of NDEs, all of them MAFs, with different numbers of MADEs, hidden layers and neurons per layer, as shown in Table C1. All the NDEs use a tanh activation function and are trained using the Adam stochastic gradient optimiser (Kingma & Ba 2014). To shorten training times of the NNs, we trained the models on graphics processing units. To avoid overfitting, we use one tenth of the training set at each training cycle to validate our posterior (i.e. perform 'early stopping') and use a learning rate of 0.001.
The results from the density estimation for the different NDEs are shown in Fig. C1. This is a powerful cross-check of whether our architecture was chosen appropriately, remembering that there is no general procedure for choosing NN architectures. Each NDE yields a consistent posterior, as also shown in Table C1. All the NDEs are then stacked and weighted by the loss evaluated during training, following Alsing et al. (2019); the stacked posterior is also illustrated in Fig. C1. This is motivated by the demonstration of Smyth & Wolpert (1998, 1999) that stacking multiple NDEs is typically more robust than using an individual one. (In fact, the high degree of consistency between the NDEs makes us suspect that using fewer of them would still be sufficient to constrain τ, saving training time.)

C3 Score and data compression
To compute a joint likelihood using the low multipole temperature and polarisation QCS spectra (TT, TE, EE), an additional compression step is required to reduce the dimensionality of the problem. The full N-dimensional data set D ∈ R^N is therefore first compressed to quadratic cross-spectra, i.e. summary statistics d ∈ R^M, with M < N. Then we score compress the vector of power spectrum measurements d into a vector of n components t ∈ R^n, with n < M < N. To inform the choice of statistic for the score compression, an approximate form of the log-likelihood function L is assumed. The statistic t is then the gradient of the approximate log-likelihood, evaluated at some fiducial parameter values θ_*, i.e. d ↦ t = ∇_θ L|_{θ_*}.
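Score compression of this kind can be sketched with a toy approximate likelihood. The example below uses a Gaussian log-likelihood with a parameter-dependent mean — a deliberately simplified stand-in for the variance-gamma form used in the paper — and all model names and numbers are hypothetical; it shows how a multi-dimensional data vector collapses to one number per parameter.

```python
import numpy as np

# Minimal sketch of score compression: map the data vector d to the gradient
# of an approximate log-likelihood evaluated at a fiducial parameter value.
# Here ln L = -0.5 (d - m(theta))^T Cinv (d - m(theta)), a simplified
# stand-in for the paper's variance-gamma likelihood.
def mean_model(theta):
    # Hypothetical model: the expected 'spectrum' scales linearly with theta.
    return theta * np.arange(1.0, 6.0)

def score_compress(d, theta_star, cov, eps=1e-6):
    Cinv = np.linalg.inv(cov)
    # dm/dtheta at the fiducial point, by central finite differences.
    dm = (mean_model(theta_star + eps) - mean_model(theta_star - eps)) / (2.0 * eps)
    resid = d - mean_model(theta_star)
    return dm @ Cinv @ resid      # t = grad_theta ln L at theta_star

cov = np.eye(5)
d_obs = mean_model(0.06) + 0.1    # data slightly above the fiducial model
t = score_compress(d_obs, theta_star=0.06, cov=cov)
print(t)                          # one number summarising the 5-d data vector
```

For this linear toy the compressed statistic is exactly the weighted sum of the residuals, so t is positive whenever the data sit above the fiducial model — the one-dimensional analogue of the t^TT, t^TE and t^EE statistics described below.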
In the main body of the paper we compressed the TT, TE and EE cross-spectra separately for computational simplicity. This is suboptimal, as power spectrum elements between spectra are correlated. However, this does not bias our cosmological constraints, and pydelfi does capture residual correlations between the individual TT, TE and EE statistics. The approximate log-likelihood we use for the score compression step is the analytic result that can be derived for cross-spectra computed on the full sky with Gaussian isotropic noise, namely the variance-gamma distribution (Pearson et al. 1929), Eq. (C5). Here Γ(x) is the gamma function, K_ν(x) is the modified Bessel function of the second kind and ρ_ℓ is the correlation coefficient of the two maps. Using a fiducial model with τ = 0.06, we differentiate Eq. (C5) to obtain t = ∇_τ ln P(Ĉ_ℓ)|_*, one statistic each for TT, TE and EE. We then fit the parameters of the pydelfi likelihood using the compressed statistics of the simulations and the compressed data vector. For the case where we only consider EE, the compressed statistic is related to the maximum likelihood estimate for τ; see Alsing et al. (2018) for a more detailed discussion. The limitations of the compression steps are discussed further in Alsing et al. (2018).

C4 Effect of score compression on the maximum likelihood value of and comparison between pydelfi and momento
This section considers in more detail the properties of the score-compressed pydelfi likelihoods, both amongst themselves and in comparison with momento. First, in Fig. C2 we illustrate the score-compressed posteriors obtained with 100 × 143 spectra from SRoll2 frequency maps using pydelfi. The posteriors from the individual EE (dashed line), TE (dash-dotted line) and TT (dotted line) statistics are shown, along with that from the combined three-statistic likelihood (solid red line). These all use the full multipole range (2 ≤ ℓ ≤ 29) for the construction of the statistic(s). The EE posterior is the most constraining, the TE posterior somewhat so, whereas the TT one hardly varies over the scan. In the three-statistic likelihood, the addition of TE information moves the mean value of τ upwards by ∼ 0.25σ relative to the EE-statistic likelihood. This upwards shift is also seen for momento in Table 3 but is not apparent there for pydelfi. This is because the table shows results from the non-score-compressed pydelfi (EE) likelihood and from the score-compressed three-statistic pydelfi (TTTEEE) likelihood: the upwards shift from adding TE partially cancels the downward movement caused in passing from a full likelihood to a score-compressed one.

Figure C3. Comparison of posteriors from 100 × 143 SRoll2 spectra using pydelfi with (dotted line) and without (dashed line) score compression, and using momento (solid line). For this test all likelihoods use the same multipole range of 2 ≤ ℓ ≤ 10. We note that avoiding score compression for pydelfi shifts the maximum likelihood value of τ upwards by ∼ 0.5σ compared to that from the score-compressed likelihood, leading to a similar result to momento.
Fig. C3 illustrates these posteriors and compares them to the momento posterior. This confirms the role of the score compression discussed above in affecting constraints on τ.

This paper has been typeset from a TEX/LATEX file prepared by the author.