21-cm Signal from the Epoch of Reionization: A Machine Learning upgrade to Foreground Removal with Gaussian Process Regression

In recent years, a Gaussian Process Regression (GPR) based framework has been developed for foreground mitigation from data collected by the LOw-Frequency ARray (LOFAR), to measure the 21-cm signal power spectrum from the Epoch of Reionization (EoR) and Cosmic Dawn. However, it has been noted that through this method there can be a significant amount of signal loss if the EoR signal covariance is misestimated. To obtain better covariance models, we propose to use a kernel trained on the {\tt GRIZZLY} simulations using a Variational Auto-Encoder (VAE) based algorithm. In this work, we explore the abilities of this Machine Learning based kernel (VAE kernel) used with GPR, by testing it on mock signals from a variety of simulations, exploring noise levels corresponding to $\approx$10 nights ($\approx$141 hours) and $\approx$100 nights ($\approx$1410 hours) of observations with LOFAR. Our work suggests the possibility of successful extraction of the 21-cm signal within 2$\sigma$ uncertainty in most cases using the VAE kernel, with better recovery of both shape and power than with previously used covariance models. We also explore the role of the excess noise component identified in past applications of GPR and additionally analyse the possibility of redshift dependence on the performance of the VAE kernel. The latter allows us to prepare for future LOFAR observations at a range of redshifts, as well as compare with results from other telescopes.


INTRODUCTION
The Epoch of Reionization (EoR) follows the period of Cosmic Dawn, when the first stars, galaxies, black holes and other astrophysical objects formed.These objects began to radiate photons that ionized the neutral gas across the Universe.Understanding this period, which spreads over redshifts  ≈ 5 − 15, is crucial to learn more about the nature, timing and mechanism of the formation and evolution of the aforementioned astrophysical objects, as well as their impact on the interstellar (ISM) and intergalactic (IGM) medium surrounding them (Ciardi & Ferrara 2005; Morales ★ E-mail: anshuman@mpa-garching.mpg.de& Wyithe 2010; Pritchard & Loeb 2012;Furlanetto 2016;Liu & Shaw 2020).From observations of the Gunn-Peterson troughs of high- quasars (Becker et al. 2001;Fan et al. 2006) and the optical depth for Thomson scattering of the Cosmic Microwave Background (CMB) radiation (Planck Collaboration et al. 2016), we can deduce that most reionization took place in the range 6 ≲  ≲ 10, with recent observations suggesting an end of reionization at  < 6 (see e.g.Becker et al. 2015 andBosman et al. 2022).
There exist multiple indirect probes to study this period.For example, the evolution of the observed Lyman- emitter luminosity function at  > 6 (Clément et al. 2012;Schenker et al. 2013), and Lyman- absorption profiles to distant quasars (Mortlock 2016; Greig et al. 2017; Davies et al. 2018).However, the most sensitive probe to study the EoR is through the fluctuations of the redshifted 21-cm line of neutral Hydrogen against the CMB (Hogan & Rees 1979;Madau et al. 1997;Shaver et al. 1999;Tozzi et al. 2000;Ciardi & Madau 2003;Zaroubi 2013).A statistical detection of the strength of these 21-cm brightness temperature fluctuations can allow us to constrain our models of the early Universe and the formation of the first stars and galaxies.For this, a number of interferometric low-frequency radio telescopes have been designed, such as LOFAR1 , HERA2 , MWA3 and PAPER4 , which over the years have been providing increasingly tighter upper limits.For example, The HERA Collaboration et al. (2022) reported Δ 2 ( = 0.34 ℎMpc −1 ) ≤ 457 mK 2 at =7.9 and Δ 2 ( = 0.36 ℎMpc −1 ) ≤ 3496 mK 2 at =10.4 from 94 nights of observation, and Mertens et al. (2020, hereafter M20) reported Δ 2 ( = 0.075 ℎMpc −1 ) < 5329 mK 2 from 141 hours (≈10 nights) of observation with LOFAR at =9.1.
One of the major challenges faced in the detection of the 21-cm signal is the fact that it is buried under foregrounds (synchrotron and free-free emissions from the Milky Way and other galaxies) that are several orders of magnitude stronger.To address this issue, The HERA Collaboration et al. (2022) uses the "foreground avoidance" technique (Kerrigan et al. 2018;Morales et al. 2019) by focusing on regions in Fourier space which are mostly foreground free, while the LOFAR EoR Key Science Project (KSP) team uses foreground modelling and removal, which allows the maximization of scales explored, as well as boosts the sensitivity up to an order of magnitude (Pober et al. 2014).
The most stringent constraints obtained with LOFAR data were presented in M20, where Gaussian Process Regression (GPR, as described in Mertens et al. 2018;Gehlot et al. 2019;Hothi et al. 2021) was used for hyperparameter optimization with different Matern class functions (Eq.6, see below) chosen as covariance kernels for modelling different components of the observed data and then recovering the fitted datacube.However, Kern & Liu (2021) pointed out some issues with this approach.Primarily, they found that given the choice of normalization and bias correction in the power-spectra estimation used in M20, misestimation of the covariance kernel for the EoR signal could lead to significant signal loss.This can have severe ramifications on the astrophysical interpretations of the estimated 21-cm signal power spectrum.Further, they show that alternative choices for normalization and weighting schemes could reduce the dependence on the choice of covariance priors, thus reducing its impact on the estimation of the 21-cm signal.However, here we focus on improving the covariance prior, while in future works we plan to explore other normalization and bias correction schemes to further upgrade the overall analysis pipeline.
To improve the covariance kernel, we refer to Mertens et al. (2023).They propose a Machine Learning (ML) based approach to GPR, where the covariance kernel for the 21-cm signal is obtained by implementing a machine learning based algorithm that learns from simulations.The results obtained can then be compared against runs of the same simulation code, to constrain the physical parameters used in it.As the covariance kernel is trained over a range of physical parameters, this would significantly reduce chances of misestimation, and thus it can be reliably used to derive astrophysical parameters necessary for the same simulation code to generate similar power spectra.
In this paper, we use GRIZZLY (Ghara et al. 2015(Ghara et al. , 2018(Ghara et al. , 2020) for generating the training, test and validation datasets.This code has been employed previously (see Ghara et al. 2020) to constrain astrophysical parameters based on the results obtained in M20.As this combines -body simulations with 1D radiative transfer, it is more physically precise than semi-numerical algorithms, while not being as computationally expensive as codes that use 3D radiative transfer (while still performing reasonably well, as shown in Ghara et al. 2018).Here, we test the performance of this ML based kernel versus covariance kernels used with GPR in previous work.In Section 2, we discuss a range of simulations used to generate mock 21-cm datasets, as well as introduce the ML trained 21-cm kernel.We also provide a short introduction to GPR.We report the results and comparisons between kernel performances in Section 3. Finally, we discuss the role of the excess noise component found in M20 and the overall performance of the ML based kernel in Section 4. In a companion paper we will apply the new pipeline to 10 nights of LOFAR data at =9.1, as was done in M20.

METHODOLOGY
In this section, we introduce the pipeline used to implement GPR to recover the 21-cm signal from mock datasets, comparing the performance of an ML based kernel versus kernels used in M20.

Simulations of the 21-sm signal
The 21-cm differential brightness temperature relative to the CMB for any patch of the IGM is given by (see Furlanetto et al. 2006): where  HI is the fraction of neutral hydrogen,  B is the fractional overdensity of baryons,  S is the spin temperature,  CMB is the temperature of the CMB photons at that redshift, Ω m is the total matter density, Ω B is the baryon density,  is the redshift and ℎ is the Hubble constant in units of 100 kms −1 Mpc −1 .In this equation, the parameters affecting the large-scale fluctuations of  b are  HI ,  S and  B .
We consider a variety of simulations to generate mock 21-cm differential brightness temperature maps as discussed below.In Section 2.1.1,we employ maps generated using GRIZZLY, where we focus on variations tied to fluctuations in  HI and  S , while assuming that the fluctuations due to  B are small and can thus be ignored.In Section 2.1.2,we do not make this assumption and employ maps generated using the reionisation simulation code CRASH.In Appendix A we also consider the additional case of using 21cmFAST (Mesinger & Furlanetto 2007;Greig & Mesinger 2015) to generate the 21-cm differential brightness temperature maps.

GRIZZLY simulations
GRIZZLY (Ghara et al. 2015(Ghara et al. , 2018(Ghara et al. , 2020) ) employs a 1D radiative transfer scheme in combination with cosmological density fields and halo catalogues obtained from an N-body simulation to produce brightness temperature maps of the 21-cm signal at different redshifts for a given source model.The algorithm has been shown to reproduce results similar to those obtained with 3D radiative transfer schemes with the same N-body simulation, while being at least 10 5 times faster (Ghara et al. 2018).Because of this, we can run a large number of GRIZZLY simulations without the process being too computationally expensive.Furthermore, it has a wide range of physical parameters that can be varied, thus allowing us to explore the role of different physical processes in generating different models of the 21-cm signal.The density fields, velocity fields and the halo lists used in this work are obtained from the same Nbody simulation (500 ℎ −1 cMpc box length, 6912 3 particles, with a mass resolution of 4.05 × 10 7 M ⊙ ) which was used in Ghara et al. (2020).In this study, we consider two major GRIZZLY models presented in Sections 3.1 and 3.2 of Ghara et al. (2020).Similar to their implementation, we use four physical parameters to model the sources: the ionization efficiency (), the minimum mass of the UV emitting halos ( min ), the minimum mass of the X-ray emitting halos ( min X ) and the X-ray heating efficiency (  X ).The emission rate of ionizing photons and X-rays per unit stellar mass from a halo are  × 2.85 × 10 45 s −1 M −1 ⊙ and  X × 10 42 s −1 M −1 ⊙ , respectively.Further, the X-ray spectral index  is fixed at 1.2, as done in Ghara et al. (2020).Lastly, they note that all other IGM properties can be derived from these parameters, and thus using just these to define the simulation is sufficient.The properties of the two models adopted are listed below: •  HI fluctuation dominated model: here, we assume a uniform Ly background strong enough to allow the spin temperature  S to be fully coupled to the gas temperature  K .Further, we adopt the following parameters:  = 7.0,  min =  min X = 10 9 M ⊙ and f X = 100, which makes the gas temperature (and in turn,  S ) significantly high compared to  CMB due to strong X-ray heating.This assumption of  S ≫  CMB ensures that  b becomes insensitive to the (1 −  CMB  S ) term from Equation 1.Thus, all variability of  b is tied to the fluctuation of the neutral hydrogen fraction  HI .
•  S fluctuation dominated model: in this case, while we continue to have the assumption of a strong, uniform Ly background to ensure coupling of  S and  K , we change our parameters to relax the condition of  S ≫  CMB .This is done by reducing the X-ray heating and ionization efficiency.Thus, we adopt the following parameters:  = 3.0,  min = 10 9 M ⊙ ,  min X = 10 10 M ⊙ and  X = 1.This allows for greater variability tied to  S , with regions of partial reionization and heating forming in the IGM.

CRASH simulations
As a reference, we also use the simulations of reionization described in Eide et al. (2018Eide et al. ( , 2020) ) and Ma et al. (2021).These are obtained by post-processing the large-scale, high-resolution hydrodynamical simulation Massive Black-II (Khandai et al. 2015; box length 100 ℎ −1 cMpc, 2×1792 3 gas and dark matter particles, corresponding to a resolution of 2.2 × 10 6 ℎ −1 M ⊙ and 1.1 × 10 7 ℎ −1 M ⊙ respectively) with the multi-frequency 3D radiative transfer code CRASH (Ciardi et al. 2001;Maselli et al. 2009;Graziani et al. 2013Graziani et al. , 2018;;Glatzle et al. 2019).Here, we make use of the "Stars" simulation run (which includes only stellar type sources) to generate the mock 21-cm signal data at =9.18.We refer the reader to the original papers for more detailed information on the simulations.

Gaussian process regression
Gaussian Process Regression (GPR, Rasmussen & Williams 2006;Aigrain & Foreman-Mackey 2023) can be used to model a noisy observation y =  (x) + , with  Gaussian noise having variance  2 noise .This is achieved by modelling the Gaussian Process as a joint probability distribution for y = {  } =1,...,  , as f=  (x), which is fully defined by its mean vector (m) and covariance matrix (K, also called covariance "kernel") as: for a set of points x (independent parameters).Here, the covariance matrix K gives the covariance between the function values at any two points and can be written as    = (  ,   , ) +     2  , where (  ,   , ) can be optimised by the choice of hyperparameters represented by , and    is the Kronecker-delta function.
When applying it to radio data to extract the 21-cm signal, we split this function into a foreground component,  fg , and the 21-cm signal,  21 , giving: where  is the observed data and  is the frequency.Next, following M20, we further split the foreground component into the intrinsic sky emission component (  sky ), which comes from the confusion-limited extragalactic sources and from the Milky Way, and the mode-mixing contaminants component  mix , which has contributions from the instrument chromaticity and calibration errors.Beyond the foreground, we also model the noise (represented by  in Equation 3) using estimates of the noise variance for ≈10 nights of observation from M20.In addition to this, M20 found a significant spectrally-correlated residual, and thus we inject this "excess noise" component (  ex ) into our model as well.This gives an updated version of Equation 3: For the sake of simplicity and utilising the additive property of matrices, we can reduce Equation 4 to  =  () + , and represent  () with its corresponding covariance kernel  (from Equation 2) as: M20 modelled each of these kernels using the best possible fit Matern-class functions (Eq.6; Stein 1999): Note that in the Matern-class function,  is the absolute difference between the frequencies of two sub-bands,   is the modified Bessel function of the second kind and Γ is the Gamma-function.M20 obtained the best possible fit Matern-class function by taking different values of the hyperparameter , maximising the marginal likelihood (also known as the evidence) and obtaining estimates for the coherence scale hyperparameter  and the variance  2 .Then, for each kernel, they chose the  that led to the highest evidence by calculating the analytical integral over f which is the log-marginal-likelihood (LML, see Section 2.3 in Mertens et al. 2018) and choosing the kernel that maximises its value.For calculating the hyperparameters (listed in the second column of Table 1), M20 used a gradient-descent based optimization algorithm for maximising the LML.

Machine learning trained 21-cm kernel
The limitations of GPR as pointed out by Kern & Liu (2021) mainly boil down to the choice of covariance kernel for the 21-cm signal.While the choice of hyperparameters allows a variety of functions to be accessed, the same function might not work equally well across the -space.Having a function obtained by employing ML trained on power spectra of simulations where the sources of ionization are modelled using parameters that sample a wide range of values, allows greater flexibility, and reduces chances of misestimation.Further, it allows for a direct comparison with physical quantities, as we can reliably derive the source parameters necessary to generate simulations that produce the power spectra estimated by the MLbased kernel.
To achieve this, Mertens et al. (2023) uses a Variational Auto-Encoder (VAE, Kingma & Welling 2013, 2019) algorithm.Simply put, an Auto-Encoder (AE, Goodfellow et al. 2016) is an unsupervised neural network which compresses data by reducing the number of independent parameters used to describe it into what is referred to as a "latent space" of hyperparameters.Thus, it is primarily used for data compression by filtering out independent parameters that are deemed to be unnecessary because they only slightly affect data recovery.This is a two steps process, where the first step of reducing the number of independent parameters into the latent space is called encoding, while the step of recovering the data given the latent space parameters is called decoding.Instead of taking an input of just a set of parameters  1 , ... ,   , a VAE (Pinheiro Cinelli et al. 2021) uses probability distributions of each parameter, thus allowing to interpolate in the latent space, and to generate a large range of new samples of reconstructed data (in our case 21-cm signal models), which are not limited by the data that the encoder was trained on.However, this also means that the VAE encoder has an inherently larger error than an AE encoder.
Thus, reconstructing the training data using the decoder with the latent space generated by the VAE encoder as input would not be an exact match to the original training data.However, the VAE is designed to optimize a trade-off between reconstruction accuracy and the fidelity of the latent space representation, by minimising the KL (Kullback-Leibler) divergence loss (Kullback & Leibler 1951), which is a measure of the divergence between the distribution of reconstructed data and the training data.So while the reconstructed data and training data might not be an exact match, if their overall distribution is similar (i.e., divergence is minimised), the training is considered successful.Thus, in the training of a VAE algorithm, the following sources of error exist: (i) Encoder: the error due to sampling from a distribution for each independent parameter to build a latent space.Sampling from a distribution is expected to be noisier than choosing point values.
(ii) Decoder: the error due to deriving the independent parameters, given some value of the latent space parameters.As this does not involve sampling from distributions, the contribution to the total error is expected to be smaller.Mertens et al. (2023) shows that GPR can be used to estimate the values of the latent space parameters, after which the decoder of the VAE kernel can be used to estimate the independent parameters, and in turn to obtain the recovered 21-cm signal.In this case we would then use only the decoder part of the VAE algorithm.However, as minimising KL divergence loss requires to train the encoder and decoder together, we proceed as follows.
We start by using two hyperparameters,  1 and  2 , and an associated variance, as the latent space parameters.We train the VAE algorithm on a dataset of ≈ 1500 simulations (the training set), and validate it against an independent dataset of ≈ 150 simulations (the validation set).We refer to this as the VAE kernel.The training and validation datasets are generated by running GRIZZLY simulations with a wide range of values for the parameters introduced in Section 2.1.1.We sample them randomly at the same redshift as the targeted 21-cm signal (𝑧 = 9.16, 8.30 and 10.11) and in the following ranges:  = [−3, 3],  min = [9, 12],  min X = [9, 12] and  X = [−3, 2] in the log space.We choose these ranges to be significantly broader than necessary (i.e., the performance of the VAE kernel does not show any appreciable difference even if they are reduced by multiple orders of magnitude), and also note that the performance of the kernel is not impacted if we fix one of the parameters to a specific value and vary the remaining three.This ensures that the sampling range does not induce any major bias.Next, to train the VAE we use 2000 iterations, as we find that the KL divergence loss and the reconstruction loss for both the training and validation datasets stabilise after ≈500 iterations.The reconstruction loss is defined as the total error made when using the encoder to obtain specific values of the latent space parameters, and then employing the decoder to retrieve the data from those parameters.We then evaluate the ratio between the output and input data as a function of the wave number , and finally calculate the median ratio, which is ≈1 across all -bins.However, as discussed earlier, the 68% confidence interval is significant, being ≈10% for  = [0.06,0.12] ℎMpc −1 , ≈35% for  = [0.12,0.43] ℎMpc −1 and ≈27% for  = [0.43,1.11] ℎMpc −1 at = 8.30, 9.16 and 10.11.
We note a similar reconstruction error for training sets of sizes ranging from 1000 to 5000 simulations (with the validation set scaling as ∼ 10% of the training set in size), while the errors become worse when using less than 1000 simulations.We also tried to used three hyperparameters, but saw no significant improvement in the recovery error.We thus choose to use two hyperparameters to avoid overfitting.
A high reconstruction error was to be expected, as it includes also the error due to the encoding process.While this is not required for our purpose, we do need to evaluate the exact contribution from the decoder.For this, we create a testing set of ≈ 150 simulations, and ensure that it also includes "extreme" models from GRIZZLY (which we define as cases where the power spectrum differs by at least an order of magnitude from the mean of the power spectra from the training and validation datasets), along with some injected stochastic noise.We then apply GPR for hyperparameter optimization (using MCMC, as discussed in Section 2.5) to estimate  1 ,  2 and the associated variance.This is then used with the decoder to obtain the recovered signal.The median ratio of the recovered and input data across all -bins is again ≈1.But now we obtain a 68% confidence interval of ≈0.5% for  = [0.06,0.12] ℎMpc −1 , ≈1% for  = [0.12,0.43] ℎMpc −1 and ≈3% for  = [0.43,1.11] ℎMpc −1 at = 8.28, 9.16 and 10.11, respectively.By comparing to the total error estimated above, we note that even where decoder error is high, i.e., for the highest -bins, it is still a minor contributor.Therefore, as the overall error is ≲ 5%, we accept the VAE kernel as a reliable ML-based kernel for the 21-cm signal, and proceed to use it with GPR for signal recovery.

Generating mock datasets
To build a mock dataset in the gridded  −  cube domain of radio observations, we derive a full dataset  by adding each term on the right hand side of Equation 4. For this, we adopt the values of ,  and  2 for  sky ,  mix and  ex given in the second column of Table 1 to generate their corresponding  sky ,  mix and  ex .These values are obtained from the results of M20, where they used  sky = +∞ for the intrinsic sky,  mix = 3/2 for the mode-mixing contaminants, and  ex = 5/2 for the excess noise to fix Matern class functions for each of these components, and then employed GPR to obtain the coherence-scale hyperparameter  and its associated variance by adopting  21 = 1/2 for the 21-cm signal.Note that while  sky and  mix are not independent quantities, their recovery with GPR treats them as such.In this paper, we also generate them as independent components to build our mock datasets (unlike in real data, where the mode-mixing component depends on the true sky signal).Thus, the overall quality of the recovered  sky and  mix is better from the mock datasets, than from real data.However, the effect on the accuracy of recovery is not expected to be severe, as the impact of not factoring in the inter-dependency is insignificant, as compared to the overall power of these components.That is, even without assuming their inter-dependence, GPR can recover them with reasonably high accuracy.
Next, we also assume the value  2 noise ∼ 74 × 10 3 mK 2 (from M20) to simulate the noise component  for ≈10 nights of observation with LOFAR scaled to the Field of View (FoV) corresponding to the GRIZZLY simulations, which is 3.03 • × 3.03 • .The variance for the intrinsic sky-emission, mode-mixing and excess noise components scales by the noise variance as indicated in Table 1.
We also consider the case of ≈100 nights of observation, to provide estimates for future observations of similar duration with LOFAR.In this case, the noise variance is expected to be reduced by a factor of 10, assuming all effects to scale uniformly from ≈10 to ≈100 nights of data.However, it should be noted that the data used in M20 was plagued with issues such as RFI flagging and bad ionospheric conditions, and it would be of little scientific value to assume the same conditions to persist for a 10 times longer observational duration.For this reason, we assume ideal conditions (for example, picking observation nights with good ionospheric conditions) and assume that the thermal noise is reduced by a factor of 20 instead, obtaining  2 noise ∼ 3.7 × 10 3 mK 2 .As we expect the excess noise to behave in a similar fashion, we also reduce its variance  2 ex by a factor of 20.M20 noted that the mode-mixing contaminants were not decreasing when integrating over more nights of data.However, the parts of this component due to effects of ionosphere and calibration errors are uncorrelated from night to night, and thus are expected to decrease with longer integration, leading to a reduction of the mode-mixing term.Additionally, this should also decrease because of the improved UV-coverage.Here, we thus consider an ideal scenario to understand how the performance of the VAE kernel compares with previously used kernels when also the mode-mixing term's variance  2 mix is reduced by a factor of 20.However, we scale down the intrinsic sky component only by a factor of 2. This is justified because (i) the confusion noise limit is set by the resolution of the LOFAR telescope, so that unresolved point sources cannot be subtracted even when integrating over more nights; (ii) modelling an increasing number of point sources is not a trivial task, and doing it accurately does not seem feasible in the short-term.Thus, we limit ourselves to an assumption of modest improvements.Because of this, we assume that the variance of  sky is scaled by a thermal noise variance of  2 noise ∼ 37 × 10 3 mK 2 , corresponding to a factor of 2 reduction from the noise variance in the ≈10 nights case.
Following the procedure outlined above, we generate two sets of foreground, noise and excess noise components, i.e. for ≈10 and ≈100 nights of observation.We then add the 21-cm signal (see Section 2.1) to obtain a complete mock dataset.It consists of the foreground component (intrinsic sky + mode-mixing contaminants; pink dashed), the excess noise (magenta dashed), the noise (yellow dashed) and the 21-cm signal at  = 9.16 (grey dashed).We also show the 1 - upper limit (dark-yellow dotted) achievable if the dataset is thermal noise dominated, i.e. any signal below this line has SNR < 1.The top (bottom) panel refers to a case with the noise corresponding to 10 (100) nights of observation.
The input power spectra for the overall mock dataset using GRIZZLY is shown in Figures 1 and 2 for the  HI and  S fluctuation dominated case, respectively.The ratio between the 21-cm signal and the 1- uncertainty of the noise power spectrum is a measure of the Signal to Noise Ratio (SNR).Thus, for -bins where the 21-cm signal's power is greater than the 1- error of the noise, there is a chance of detectability with 1- confidence.From Figure 1 we can thus conclude that for ≈10 nights of observation a detection of the  HI fluctuation dominated 21-cm signal is unlikely, as the SNR is ≪ 1 for all -bins.However, the chances of detectability improve for ≈100 nights of observation, as SNR ≈ 1.In Figure 2, we see that detecting the  S fluctuation dominated signal should be possible also for ≈10 nights of observations, as SNR is > 1 for all -bins, while for ≈100 nights it is ≳ 10, assuring a detection as long as the covariance kernel used to model the 21-cm signal is correctly estimated.
While we first focus on the 21-cm signal at  = 9.16 in order to make a direct comparisons with M20, we also consider  = 8.30 and 10.11 to prepare for future LOFAR results at these redshifts.

Recovery with MCMC
As discussed in Section 2.2, M20 used a gradient-descent method to maximise the LML.However, in this paper we instead use MCMC sampling (Foreman-Mackey et al. 2013) to estimate the hyperparameters by sampling their posterior distributions.The benefit of MCMC sampling is that it allows us to also have a measurement of the uncertainty on the hyperparameters, which can then be prop- agated.We build the posterior distributions adopting flat uniform priors with broad ranges as used in M20.For the coherence-scales, we provide a smaller range for the uniform prior () to improve convergence time, as done by M20.Thus, as listed in the third column of Table 1, we use a range of (10,100) for  sky and of (1,10) for  mix .To ensure that the converged value for  ex remains in the 1- confidence interval, we used a range of (0.1, 0.8) rather than of (0.2,0.8) as in M20.While we note that a Gamma prior for the variances of the different component leads to faster convergence, we still adopt flat priors in the logarithmic scale over several orders of magnitude (thus effectively an uninformed prior) to minimise the chances of bias.
In addition to this, we note that the coherence scale and the variance for the 21-cm signal are dependent on the baseline length.So, while to recover the intrinsic sky, mode mixing and excess noise components we continue to employ the same Matern class functions used to generate them (i.e. with the same  values of the input, see Table 1), when using Matern class functions to recover the 21-cm signal, we modify the hyperparameter  and the variance as: and where  0 and  2 0 are the coherence-scale parameter and variance used in M20 for baseline length  (where  min is the minimum baseline length), but now we introduce the additional parameters   and  2  to fully define the coherence-scale hyperparameter and the variance.Further,  2 norm is chosen such that the mean of the variance over all baselines is  2 0 .The two additional parameters (i.e.,   and  2  ) thus allow us to encode the dependence on the baseline length into the covariance kernel for the 21-cm signal.
To recover the 21-cm signal component, we use Matern class functions with two specific values of : the Exponential Matern class function with  = 1/2 (which, as shown in M20, maximises the evidence), and the Matern32 function with  = 3/2.We then compare their performance for recovery to the VAE kernel using GPR.For the hyperparameters  1 and  2 , we again take an uninformed flat prior in the linear space.

RESULTS
Here we discuss a qualitative comparison between the results which are shown in Figures 3 and 4 using the three aforementioned kernels (Exponential, Matern32 and VAE).We then analyse the estimated coherence-scale hyperparameter and variance values for each of the components of the mock datasets defined in Section 2.1.1 in Table 1.Further, we qualitatively compare the results obtained by using the full simulations of reionization (Section 2.1.2) rather than the mock datasets generated with GRIZZLY.Lastly, we explore the role of redshift on the performance of the VAE kernel, by testing cases at  = 8.30 and 10.11, and by comparing them against the results obtained for  = 9.16.

𝑥 HI fluctuation dominated model
In Figure 3, we compare the results from the three kernels (VAE in blue solid, Exponential in orange dashed-dotted and Matern32 in green dotted) in recovering the power spectrum of the 21-cm signal at  = 9.16 (in grey dashed).We also show the 2- uncertainty on the recovery from different kernels to compare their performance.We note that if the lower bound of the 2- uncertainty of the recovered signal is below the uncertainty on the thermal noise (as shown in Figure 1) the recovery qualifies as an upper limit, otherwise it is referred to as a detection.Based on this, we note that in this case, the recovery from all kernels is going to provide upper limits, as the thermal noise uncertainty is higher than the 21-cm signal for ≈10 nights, and comparable to it for ≈100 nights.
We note that for ≈10 nights of observations, while the VAE kernel has uncertainty bands wider than the Matern-class function based kernels, the input 21-cm signal is contained within its constrained region.Thus, while the recovered signal for the three kernels is comparable, the VAE kernel is robust enough to compensate for the overestimation and to contain the signal within the 2- limits of the error, although it is still an upper limit rather than a constrained detection.As discussed in Section 2.4, the reason for this non-detection when using Matern class functions based kernels and broad uncertainty bands for the VAE kernel is due to the low SNR, which is < 0.1 across all -bins.This, however, improves to an average SNR of ≈ 1 for ≈100 nights of observations, for which, as expected, we obtain tighter constraints and an improved prediction of the actual value.We see, though, that the recovered signal from Matern-class function based kernels is still over-predicted.In particular, for  > 0.5 ℎcMpc −1 the Exponential and Matern32 kernels are unable to contain the signal even in their 2 uncertainty bands.On the other hand, the VAE kernel contains the signal in its 2 uncertainty bands across all -bins, despite the recovered signal being about an order of magnitude higher than the input signal.The VAE kernel also does a much better job in recovering the overall shape of the power spectrum.By comparing to the estimated power at  = 0.075 ℎMpc −1 from M20 (cross), it is clear that in this case the VAE kernel is also capable of significantly improving the 21-cm upper limits estimate.However, we highlight that the VAE kernel applied to real data is still likely to provide upper limits higher than what has been shown here, because of the more complex noise, and thus the improvement provided by the kernel might be lesser.
In the 4 th and 5 th columns of Table 1 we show the MCMC estimates for the coherence-scale hyperparameter and the variance 16.The 2 uncertainty on the recovered signal for each kernel is shown as a shaded area in the corresponding colours.The top (bottom) panel refers to a case with the noise corresponding to 10 (100) nights of observation.We also plot the estimated upper limits of power at  = 0.075 ℎMpc −1 from Mertens et al. (2020, cross).Note that this value can be significantly higher than the signal due to more complex noise in real data.
obtained by applying GPR to the input power spectrum of the mock dataset (in indigo, Figure 1).covariance kernels for the intrinsic sky, mode mixing and excess noise components are the same as those used to generate them (i.e., the value of  is fixed), while we adopt the VAE kernel for the 21-cm signal.Note that the variances for all components are scaled down by the corresponding value of  2 noise .This is equal to 74 ×10 3 mK 2 for all components for ≈10 nights of observations (see M20), while for ≈100 nights this corresponds to scaling down by a factor of 2 (i.e.,  2 noise = 37 × 10 3 mK 2 ) for the intrinsic sky component, and by a factor of 20 (i.e.,  2 noise = 3.7 × 10 3 mK 2 ) for other components (as discussed in Section 2.4).From the MCMC estimates, we note that the measurement of the coherence-scale for the  sky and  mix improves from ≈10 to ≈100 nights of observation.However, the variance estimates do not show an improvement, and even slightly worsen for the excess noise component.The estimates of  1 and  2 and their associated variance  2 21 agree within error limits for both cases.

𝑇 S fluctuation dominated model
Here we perform a comparison between covariance kernels for the spin temperature fluctuation model, similarly to what done in the previous section.The results are shown in Figure 4.As seen in Figure 2 and discussed in Section 2.4, the SNR is larger than 1 also for ≈10 nights of observations, suggesting better chances of detectability and smaller uncertainty ranges.Indeed, all three covariance kernels contain the signal within the 2 uncertainty bands around their recovered signal, and the uncertainty range for the VAE kernel is ≈ 2 orders of magnitude smaller than the equivalent case for the  HI fluctuation dominated model.As discussed in the previous section, we compare the lower bounds of the recovered signal with the thermal noise uncertainty from Figure 2 to classify the recovery as a detection or an upper limit.
For ≈100 nights of observations, the SNR is ≫ 10 across all -bins, so that the recovered signal is expected to reproduce the input signal with a significantly narrower 2 uncertainty range, provided that the covariance kernel chosen is a reliable estimate of the covariance of the input 21-cm signal.Indeed, from Figure 4 we note that the VAE kernel fully recovers the signal with less than one order of magnitude uncertainty.However, the Exponential kernel contains the 21-cm signal only in the lowest -bins and shows significant bias in the estimated power at higher -bins.Its broad 2- uncertainty shows that the recovery just provides upper limits even in the low -bins.On the other hand, the Matern32 kernel performs better, and provides a successful detection, albeit with broader uncertainty ranges than the VAE kernel recovery.This suggests that the Exponential kernel is definitely not a good match for the covariance of the input 21-cm signal, and, as highlighted by Kern & Liu (2021), would lead to significant errors in the estimated physics, if used.The Matern32 kernel is better, however the VAE kernel improves upon it even further.This problem with the Exponential kernel appears in the ≈100 and not in the ≈10 nights of observation due to the similarity of power and shape of the excess-noise and 21-cm signal components.Thus, a covariance kernel which is not a reliable estimate of the covariance of the input 21-cm signal would not be able to distinguish between the two, and may either ignore both equally, or identify one over the other purely by chance.It can also be argued that the only reason for any "detection" at low -bins using the Exponential kernel could possibly just be the detection of the excess noise component, wrongly interpreted as the 21-cm signal one.
The recovered values for the coherence-scales and variances for the intrinsic sky, mode mixing and excess noise components, as well as those for the hyperparameters  1 and  2 and associated variance for the 21-cm signal are listed in the 6  ℎ and 7  ℎ columns of Table 1 along with their 68% confidence intervals.As expected, we find an improvement in recovery of the input values for ≈100 nights of observation in comparison to ≈10 nights, particularly for the coherence-scale hyperparameter.
While we note better estimates for  sky and  2 sky / 2 noise for ≈100 nights of observations in both 21-cm signal models, in the  HI fluctuation dominated model the input  2 sky / 2 noise is not included within the estimate error of the recovered values.A similar behaviour is observed in the recovery of the variance for  mix and  ex in the  HI fluctuation dominated model for ≈100 nights of observations.Lastly, we also note that the  1 and  2 hyperparameters and associated variance for the 21-cm signal in both models agree within the error estimates.While the estimated variance for ≈100 nights of observations is higher, it also has a broad 68% confidence interval.

CRASH simulations
The power spectra resulting from the recovery using the three kernels in the case of the CRASH simulation are shown in Figure 5.As now the input 21-cm signal has a power which lies in between the two GRIZZLY models, this translates into an intermediate SNR across -bins.Due to this, we are able to successfully contain the Table 1.We use the results of M20 as our input (2  column) for the hyperparameters (1  column) to generate a mock dataset composed of the  sky ,  mix and  ex components, in addition to the  HI ( S ) fluctuation dominated models from GRIZZLY to generate the 21-cm signal component at  = 9.16.For the recovery using GPR, we keep the value of  fixed for the  sky ,  mix and  ex components, and estimate the median and the 68% confidence intervals for the coherence-scales and variances from the MCMC based LML maximization (4  ℎ to 7  ℎ columns) using the priors as in M20 (3   column), i.e. we either use linear scale uniform priors , or logarithmic scale indicated as -.As the latter is over several orders of magnitude, it has not been listed and can be assumed to be an uninformed prior from −∞ to +∞.For ≈10 nights of observation we use  2 noise = 74 × 10 3 mK 2 (as obtained in M20).For ≈100 nights of observation we scale it down by a factor of 2 for the  sky component and by a factor of 20 for the rest of the components (see Section 2.4).The values for  2 noise in each case are listed in the first row for each component.Note that  2 noise has been scaled to the field of view of the GRIZZLY simulations, which is 3.03  input signal in the 2 ranges of the recovered signals in most -bins for ≈10 nights of observation with all kernels, but given the thermal noise uncertainty power, it is still classified as an upper limit.However, for ≈100 nights, we note that while the VAE kernel does an excellent job of recovering the signal with narrow 2 uncertainty bands, they still indicate that the recovery is an upper limit.On the other hand, the Matern class functions based kernels underestimate the signal and do not contain the input 21-cm signal within 2 uncertainty bands for some -bins, despite them being significantly larger.When comparing to the thermal noise uncertainty, we note that they only provide upper limits for the signal in most bins.This result is similar to that with the  S fluctuation dominated model with GRIZZLY, and thus is due to the same reasons discussed in Section 3.2.

Redshift dependence
To evaluate the performance of the VAE kernel at different redshifts, we use the simulations at  = 8.30, 9.16 and 10.11 of both GRIZZLY models, and compare the performance of the VAE kernel for noise levels equivalent to ≈100 nights of observations.It should be noted that we use the VAE kernels trained for the respective redshifts to avoid making the assumption of the kernels being redshift agnostic.
To analyse our recovery technique, we check how accurately it recovers the input signal, and how precise the results it reports are.For this purpose, we use the quantities defined below, with their values listed in Table 2: (i) Average bias, ⟨PS rec /PS in ⟩  : this is the average bias given as the ratio between the recovered (PS rec ) and input 21-cm signal (PS in ) power spectra, averaged across all -bins.An accurate recovery has a value close to 1, with higher (lower) values indicating an over(under)-estimation of the input 21-cm signal.We note that the deviation of average bias from 1 increases significantly with redshift for the  HI fluctuation dominated model.However, the trend is not so clear for the  S fluctuation dominated signal, as the power and slope of the excess noise component match closely those of the input 21-cm signal at  = 8.30 and  ≤ 0.2 hMpc −1 , which makes differentiating between them more difficult.However, we still get an average bias of 0.70, suggesting only a minor under-estimation, which is primarily observed at the lowest -bins (see Figure 6).
(ii) Average scaled uncertainty, ⟨Err rec /PS in ⟩ k : the 2 uncertainty (given by the shaded region in Figures 3, 4, 5, and referred to as Err rec hereafter) on the recovered power spectrum (PS rec ) gives the precision of recovery.However, except for cases of extremely poor recovery, the absolute value of Err rec generally depends on the absolute value of the corresponding PS rec .Thus, to compare different recoveries for a given PS in , we need to convert it into a unitless quantity.We first tried to do this by calculating the ratio Err rec /PS rec as a measure of the precision of the recovery.However, this ratio contains no information on the accuracy of PS rec .Indeed, one could have a precise recovery, i.e. a low Err rec /PS rec , but an inaccurate PS rec , i.e. a bias which deviates significantly from 1 (see point above).This ratio, then, is not useful, as it does not quantify the overall quality of recovery.To overcome this issue, we use the ratio Err rec /PS in instead.As the magnitude of Err rec depends on that of PS rec , it also carries the information of the bias in recovery.Further, as we divide by PS in , the scaled uncertainty becomes independent of the specific PS in being recovered.This allows us to use it to compare between cases with different PS in , such as at different redshifts (see top panel of Figure 6 where the power spectrum varies due to the time evolution of ionising bubbles) and different physical models (as discussed in Sections 2.1).We use this generalised comparison, and call it the scaled uncertainty for each -bin.Averaging this across all -bins provides a handy quantity to compare the precision of recovery for input 21-cm signals with different physical properties.For example, in Figure 6 we observe that in the  HI fluctuation dominated model, while small-scale variability due to partial reionization is restricted, it still has variability tied to the large-scale distribution of neutral Hydrogen, which increases for lower redshifts.This boosts the power at large scales, corresponding to the low -bins.The same can also be seen in the  S fluctuation dominated model, although its overall power is boosted, as it allows small-scale variability in  b as well.Using the scaled uncertainty, we can compare the precision of recovery across redshift for both cases.The difference in ⟨Err rec /PS in ⟩ k is quite significant for the  HI fluctuation dominated model, going from ≈ 15 at  = 8.30 to ≈ 80 at  = 10.11.
(iii) z-score, ⟨z-score⟩  : The z-score (Kirch 2008) is a popular quantity to evaluate quality of recovery.In our case, it is defined as or PS rec −PS in Err rec /2 at each -bin, and it measures how much the recovered signal deviates from the input 21-cm signal, in units of standard deviation of the recovered signal.The z-score is thus a more explicit method to combine into a single quantity the information provided by the bias along with that of the uncertainty.The only possible issue is that (= Err rec /2) depends on PS rec and thus their ratio would mask the accuracy of recovery as discussed in the point above (see below for an example case).Further, we note that we cannot just report an average of z-scores across all -bins, as the distribution of z-scores is not necessarily Gaussian.Indeed, we find that while it is approximately Gaussian for the  HI fluctuation dominated model, this is not the case for the  S fluctuation dominated model.Thus, we report the minimum and maximum z-scores (z-score min and z-score max ) along with the average (⟨z-score⟩  ).When ⟨z-score⟩  >(<)0, its exact value quantifies the extent of over(under)-prediction.We note that in the  HI fluctuation dominated model, the average z-score worsens with increasing redshift, consistently with the behaviour of the average bias and average scaled uncertainty.This trend is not detected for the  S fluctuation dominated signal, due to the same reasons discussed above for the average bias in (i).We also note that at  = 10.11,⟨z-score⟩  ≈ −2 and +1 in the  S and  HI fluctuation dominated model respectively, naively suggesting a better recovery for the latter.This, though, is not correct, but simply a consequence of the very broad error bars and the inverse proportionality of ⟨z-score⟩  with the error.This reasoning exposes the limitation of the z-score.Indeed, by looking at ⟨PS rec /PS in ⟩  and ⟨Err rec /PS in ⟩ k for recovery of the  HI fluctuation dominated signal (see Table 2), we note that the deviation from zero bias (obtained when ⟨PS rec /PS in ⟩  = 1) and zero uncertainty (when ⟨Err rec /PS in ⟩ k = 0) is significantly higher than for the  S fluctuation dominated model.In fact, these numbers suggest a better quality of recovery in the latter case, which is understandable as the SNR in this model is higher.Thus, while we report the z-score due to its popularity, we recommend using the average bias and scaled uncertainty for evaluating the quality of recovery.
The trends in various quantities discussed above are linked to the physical nature of the 21-cm power spectrum and its redshift evolution.The drop in the SNR with increasing redshift (due to a decrease in signal power as shown in Figure 6 and explained in (ii)), leads to a worsening of the average bias and scaled uncertainty, especially for the  HI fluctuation dominated model.As already mentioned, when the excess noise and the input 21-cm signal have similar power and slope (as at  = 8.30 for the  S fluctuation dominated case), we observe limitations in the capability of differentiating among the two, but the effects are minor and the trends of average bias, scaled uncertainty and z-score for the  HI fluctuation dominated model are also observed in the  S fluctuation dominated model when going from  = 9.16 to  = 10.11.Thus, we find that the VAE kernel does not add significant biases, with its recovery and associated uncertainty largely scaling with the physical properties of the 21-cm signal.

Role of the excess noise component
In M20 it was noted that the excess noise component was poorly constrained, and thus the combined excess noise and 21-cm signal components were jointly recovered, as separating them was not statistically justifiable.Thus, it is important to understand how well constrained the excess noise component has to be, in order to separate the 21-cm signal from it.
To explore this, we looked at the  HI fluctuation dominated model of the 21-cm signal at  = 9.16.We note that in the ≈100 nights case for the chosen excess noise component, the power spectrum recovered with the VAE kernel is slightly overestimated, with the 2 bands on both sides of it spanning ≈2 orders of magnitude.We then generated a range of input excess noise components (by varying either the coherence-scale hyperparameter or the variance), and analysed the effects on the recovery of the 21-cm signal power spectrum using the VAE kernel by looking at the average bias, scaled uncertainty and z-score, as defined in Section 3.4.The results are reported in Table 3, where f var is the factor by which the coherencescale hyperparameter and the variance from the results of M20 were scaled.We see that, overall, the average bias and scaled uncertainty are reduced when increasing the coherence-scale hyperparameter or decreasing the variance, while no substantial difference is observed in the average z-score.This is possibly because the bias and scaled uncertainty decrease at the same rate, and thus their effects roughly cancel out.

Overall performance of the VAE kernel
Usually it is expected that the recovery of the 21-cm signal from an overall dataset with foreground components, noise and signal is not possible if SNR<1.In Section 3.1 we have indeed shown that Matern class functions based kernels are unable to contain the input 21-cm signal within their 2 uncertainty bands when SNR<1.However, the VAE kernel is not only able to do so, but also to recover the overall shape of the power spectrum, as seen in the top panel of Figure 3. Further, as highlighted by Kern & Liu (2021), misestimation of the covariance kernel can significantly hamper signal detection given the currently used normalization and bias correction schemes.This means that as the Matern class functions model the 21-cm signal only approximately, their results can be significantly biased for more complex models of the 21-cm signal, as given e.g. by the  S fluctuation dominated model of GRIZZLY and the 21-cm signal model from the CRASH simulations.Indeed, in these cases the Matern class functions based kernels fail to recover the signal also for noise levels equivalent to ≈100 nights of observation, even when ⟨SNR⟩ k ≈ 10.The VAE kernel does not suffer from such a limitation and performs well also when used with an input signal from the CRASH simulations.This shows that the VAE covariance kernel is a more robust estimate of the covariance of the 21-cm signal, and can successfully report a detection within 2 uncertainty regardless of the exact physical properties of the observed 21-cm differential brightness temperature power spectrum.Lastly, we note that it performs well across all redshifts analysed here.This reconfirms the robustness of the VAE kernel in constraining the 21-cm signal, with an increase/decrease in uncertainty tied to the ⟨SNR⟩ k of the signal itself.
Overall, the 2 uncertainty bands given by the VAE kernel contain the signal in all cases discussed here.We consider the recovery limit of the VAE kernel in terms of SNR averaged over -bins to be that from ≈10 nights of observation in the  HI fluctuation dominated model, i.e. ⟨SNR⟩ k = 5 × 10 −2 .For values lower than this, we do not expect the VAE kernel based recovery to contain the input signal within its 2 uncertainty bands across  = [0.05,1.00] ℎMpc −1 .We explore one such case in Appendix A, and indeed find that the recovery does not contain the input signal within its 2 uncertainty bands, but instead provides upper limits.Yet it still outperforms the Matern class functions based kernels in recovering the shape of the power spectrum, spread of uncertainty on recovery, as well as the reported upper limits.

Limitations
In this work, we reduce the possibility of biases in the EoR covariance kernel by incorporating a more physically motivated covariance.We showcase the robustness of the generated VAE covariance kernel by testing it against not just mock 21-cm signals obtained with GRIZZLY, but with signals generated using very different frameworks as shown in Section 3.3 and in Appendix A. This has also been demonstrated in Mertens et al. (2023).
However, biases are still possible, especially if the true signal, and thus its covariance, is fundamentally different from what we obtain with our simulation codes.One way to further minimise this bias is to use mock data obtained from different codes to train the covariance kernel, which we plan to do in further upgrades of our pipeline.In the future, we will also investigate methods to reduce the dependence on the prior by using different normalization and bias correction schemes as suggested by Kern & Liu (2021).

SUMMARY
The LOFAR Epoch of Reionization (EoR) KSP team strives for a successful detection of the 21-cm signal from the EoR at  ≈ 7 − 11.Past theoretical models indicated that at least 1000 hours of observation would be necessary to lead to a successful detection (Mertens et al. 2018), while Mertens et al. (2020) provided upper limits using 141 hours (≈10 nights) of observation.In this respect, an optimal choice of the covariance kernel for the 21-cm signal component is crucial.Indeed, as shown in Kern & Liu (2021), given the currently used normalization and bias correction scheme, a mismatch between the adopted and the actual covariance kernel of the 21-cm signal can induce a significant signal loss, which can in turn lead to incorrect astrophysical interpretations from any "successful" detection.
To improve the choice of the 21-cm signal covariance kernel, Mertens et al. (2023) introduce a Machine Learning method which employs a Variational Auto-Encoder (VAE) based algorithm.As the training done using VAE is not limited by the form of the specific function (as e.g. in the case of Matern class functions), nor by the kernels of the training datasets (as in the case of a simple Auto-Encoder), it allows to reproduce the covariance kernel of the 21-cm signal with a greater flexibility.This is showcased in terms of the robustness of the VAE based kernel's performance in comparison to previously used kernels based on Matern class functions.
We show that the result on using the VAE kernel is able to contain the input 21-cm signal within its 2 uncertainty band in all cases explored where ⟨SNR⟩ k ≳ 5 × 10 −2 .It is also usually better than the results from Matern class functions based covariance kernels in recovering the overall shape of the power spectrum of the signal.A key result in this paper is that Matern class functions based kernels are unable to recover the 21-cm signal for the  S fluctuation dominated model even for ≈100 nights of observation, for which ⟨SNR⟩ k ≈ 10, while a recovery with the VAE kernel is successful.A similar result is obtained also with a 21-cm signal generated using the CRASH simulations, thus clearly indicating that the Matern class functions based kernels do not correctly match the covariance of more complex models of the signal.Thus, this analysis suggests that using the VAE kernel can mitigate to a significant extent the issues highlighted by Kern & Liu (2021) given no change to the normalization and bias correction schemes.
Further, we show that the behaviour of the VAE kernel is consistent across all redshifts of interest, with changes in its performance strongly correlating with the neutral hydrogen distribution, which changes the strength of the resultant power spectrum, and thus the ⟨SNR⟩ k .This suggests that the VAE based kernel can be used for any choice of redshift without additional correction factors, making the algorithm developed here directly applicable to LOFAR data at  ≈ 8 to 10, whose results can then be compared with results from telescopes like HERA.
We also explore the effects that the properties of the excess noise component identified in M20 have on the recovery of the 21cm signal.As expected, we find that having a higher coherence scale or a lower variance for the components leads to better recovery.
In companion papers we will apply the VAE kernel to the ≈10 nights of LOFAR data used in M20, and explore the range of theoretical models which are consistent with the upper limits provided by the VAE kernel, as done in Ghara et al. (2020).Applying the VAE kernel to observations much longer than ≈10 nights requires a significant improvement in the modelling of the intrinsic sky component, which would eventually be limited by the confusion noise due to the angular resolution of LOFAR.Further improvements, such as noise mitigation, can be implemented by choosing data from nights with better ionospheric conditions and lesser contribution from RFI flagging.All these aspects are currently being explored by the LOFAR EoR KSP team.discussions.FM acknowledges support of the PSL Fellowship.RG acknowledges support by the Israel Science Foundation (grant no.255/18).LVEK and SM acknowledge the financial support from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (Grant agreement No. 884760, "CoDEX").The post-doctoral contract of IH was funded by Sorbonne Université in the framework of the Initiative Physique des Infinis (IDEX SUPER).QM is supported by the National Natural Science Foundation of China (Grant No. 12263002).GM acknowledges support by the Swedish Research Council grant 2020-04691.Nordita is supported in part by NordForsk.Lastly, we would like to thank the referee for their helpful comments that helped improve this paper significantly.

Figure 1 .
Figure 1.Power spectrum of the mock dataset (purple solid line) generated using the  HI fluctuation dominated model from GRIZZLY.It consists of the foreground component (intrinsic sky + mode-mixing contaminants; pink dashed), the excess noise (magenta dashed), the noise (yellow dashed) and the 21-cm signal at  = 9.16 (grey dashed).We also show the 1 - upper limit (dark-yellow dotted) achievable if the dataset is thermal noise dominated, i.e. any signal below this line has SNR < 1.The top (bottom) panel refers to a case with the noise corresponding to 10 (100) nights of observation.

Figure 2 .
Figure 2. As Fig. 1 for the  S fluctuation dominated model.

Figure 3 .
Figure 3. Power spectrum recovered using the Exponential (orange dasheddotted line) and Matern32 (green dotted) Matern class functions based covariance kernels, and the VAE-based kernel (blue solid), together with the  HI fluctuation dominated model 21-cm signal (grey dashed) and noise (yellow dashed) at  = 9.16.The 2 uncertainty on the recovered signal for each kernel is shown as a shaded area in the corresponding colours.The top (bottom) panel refers to a case with the noise corresponding to 10 (100) nights of observation.We also plot the estimated upper limits of power at  = 0.075 ℎMpc −1 fromMertens et al. (2020, cross).Note that this value can be significantly higher than the signal due to more complex noise in real data.

Figure 4 .
Figure 4.As Figure 3 for the   fluctuation dominated model.

Figure 5 .
Figure 5.As Figure 3 for the CRASH simulation of reionization.

Table 2 .
Average bias, ⟨PS rec /PS in ⟩  , average scaled uncertainty, ⟨Err rec /PS in ⟩ k , and average z-score, ⟨z − score⟩  , at various redshifts.These values are obtained when the signal recovery is done employing the VAE kernel with the  HI (left) and the  S (right) fluctuation dominated model and ≈100 nights of observations.Top panel: input 21-cm signal for the  HI (dashed lines) and the  S (solid) fluctuation dominated model are shown for  = 8.30 (blue), 9.16 (orange) and 10.11 (green).Middle panel: recovered 21-cm signal and its associated uncertainty divided by the input 21-cm signal to give the bias and scaled uncertainty (the average values over -bins for these quantities are listed in Table2) for the  HI fluctuation dominated model.For comparison, the line of zero bias (i.e., recovered signal perfectly matching the input signal) is also shown.Bottom panel: same as the middle panel, but for the  S fluctuation dominated model.

Table 3 .
Average bias, ⟨PS rec /PS in ⟩  , average scaled uncertainty, ⟨Err rec /PS in ⟩ k , and average z-score, ⟨z − score⟩  , for the  HI fluctuation dominated model with ≈100 nights of observations at  = 9.16.These values are obtained when the signal recovery is done employing the VAE kernel and multiplying the coherence-scale hyperparameter,  ex , or the variance,  2 ex , listed in the second column of Table 1 (and as used in M20) by a factor  var .