Detecting the HI Power Spectrum in the Post-Reionization Universe with SKA-Low

We present a survey strategy to detect the neutral hydrogen (HI) power spectrum at $5<z<6$ using the SKA-Low radio telescope in presence of foregrounds and instrumental effects. We simulate observations of the inherently weak HI signal post-reionization with varying levels of noise and contamination with foreground amplitudes equivalent to residuals after sky model subtraction. We find that blind signal separation methods on imaged data are required in order to recover the HI signal at large cosmological scales. Comparing different methods of foreground cleaning, we find that Gaussian Process Regression (GPR) performs better than Principle Component Analysis (PCA), with the key difference being that GPR uses smooth kernels for the total data covariance. The integration time of one field needs to be larger than $\sim 250$ h to provide large enough signal-to-noise ratio (SNR) to accurately model the data covariance for foreground cleaning. Images within the primary beam field-of-view give measurements of the HI power spectrum at scales $k\sim 0.02\,{\rm Mpc^{-1}}-0.3\,{\rm Mpc^{-1} }$ with SNR $\sim 2-5$ in $\Delta[{\rm log}( k/{\rm Mpc^{-1}})] = 0.25$ bins assuming an integration time of $600$ h. Systematic effects, which introduce small-scale fluctuations across frequency channels, need to be $\lesssim 5\times 10^{-5}$ to enable unbiased measurements outside the foreground wedge. Our results provide an important validation towards using the SKA-Low array for measuring the HI power spectrum in the post-reionization Universe.


INTRODUCTION
The standard model of cosmology, the Λ cold dark matter (ΛCDM) model, helps us describe and understand the observed Universe. In particular, measurements of the cosmic microwave background (CMB, e.g. Planck Collaboration et al. 2020a) and the large scale structure (LSS, e.g. Alam et al. 2021) can be well fitted by the ΛCDM model, producing precise, per-cent level constraints on the model parameters. However, as we reach further into the realm of precision cosmology, potential inconsistency between different probes arises in the form of cosmological tensions. Namely, measurements of the Hubble parameter in the local Universe using tip of the red-giant branch (TRGB) and Type Ia Supernovae (e.g. Riess et al. 2022) have significant discrepancies ∼ 5 with the measurements made using the CMB (e.g. Planck Collaboration et al. 2020b). There also exists a tension of ∼ 2.7 between the measurements of the amplitude of the dark matter clustering 8 from the CMB and from the LSS (e.g. Amon et al. 2022).
The disagreements between different cosmological observations highlight the need for understanding the evolutionary history of the Universe. The CMB captures the cosmic structure at the last scattering surface ∼ 1100 (Dodelson & Schmidt 2020) while the local measurements are made at 2.0, missing a large part of the observable Universe in between. One promising approach to fill the ★ E-mail: zhaoting.chen@manchester.ac.uk gap is neutral hydrogen (H ) intensity mapping (e.g. Battye et al. 2004;Chang et al. 2008;Mao et al. 2008;Wyithe & Loeb 2009;Battye et al. 2013;Kovetz et al. 2017). It uses the emission line of the H atoms, at the rest wavelength of ∼ 21 cm, as a tracer of the underlying dark matter distribution. Neutral hydrogen is the most abundant element in the Universe after recombination as predicted by the Big Bang nucleosynthesis (Alpher et al. 1948;Dodelson & Schmidt 2020). The formation of dark matter structures, i.e. dark matter halos, attracts baryonic matter to fall into the halos and produces luminous stars and galaxies during the cosmic dawn (Schaerer 2002). The ultra-violet radiation produced by these objects ionized the initially neutral inter-galactic medium (IGM), a process known as the cosmic reionization (Furlanetto et al. 2006). The 21 cm emission is dominated by the H inside the IGM during the cosmic reionization, after which the majority of the remaining H resides in the dark matter halos (Rahmati et al. 2013). Therefore, the H signal traces different cosmic structures during different epochs and can be used to probe cosmology across a wide range of redshifts.
The spectroscopic nature of the 21 cm line allows the measurement of the matter clustering across the history of structure formation from the cosmic Dark Ages, to the Epoch of Reionization (EoR), and all the way to the low-redshift Universe. However, the H signal is inherently weak, and resolving H sources requires deep integration time even for observing the H galaxies in the local Universe (e.g. Haynes et al. 2018). Without the need to resolve individual sources of the H emission, intensity mapping is a technique that maps the 21 cm emission across a large area of the sky with relatively coarse angular resolution, allowing efficient surveys of large cosmological volumes suitable for testing the ΛCDM model. Ongoing experiments targeting different redshifts include MeerKAT (Santos et al. 2016), Canadian Hydrogen Intensity Mapping Experiment (CHIME; CHIME Collaboration et al. 2022), Tianlai (Xu et al. 2015), Hydrogen Epoch of Reionization Array (HERA; DeBoer et al. 2017), Low-Frequency Array (LOFAR; Patil et al. 2017), Murchison Widefield Array (MWA; Tingay et al. 2013) and more, covering ∼ 0.0 − 10.0. In the future, the Square Kilometre Array Observatory (SKAO) will further enable detections of the neutral hydrogen clustering, with SKA-Low observing at 50MHz to 350MHz, covering the redshift range from the cosmic Dark Ages ∼ 27 down to the post-EoR Universe ∼ 3.0 (Koopmans et al. 2015), and SKA-Mid observing at 350MHz to 15.4GHz covering 3 (Square Kilometre Array Cosmology Science Working Group et al. 2020).
The biggest challenge of H intensity mapping is measuring the signal against the foregrounds which are several orders of magnitude brighter than the H . In order to measure the H signal, extreme accuracy in the instrument calibration is necessary to model the foregrounds (Barry et al. 2016). The desired calibration accuracy calls for a thorough understanding of the sky (e.g. Trott & Wayth 2017;Murray & Trott 2018), the beam (e.g. Thyagarajan et al. 2015;Ewall-Wice et al. 2016b), and the systematics (e.g. Trott et al. 2018). Techniques of foreground mitigation can then be utilised to extract the H signal. The spectral smoothness of the foregrounds contrasts with the H which is discretely structured in frequency since, for the HI, different frequencies correspond to different redshifts, and therefore different line-of-sight distances. Fourier transformation along the frequency direction to the delay time space for individual baselines, a technique called the 'delay transform', can thus be used to isolate modes of the power spectrum where the H dominates (Morales & Hewitt 2004;Parsons et al. 2012a,b). The region of the wavenumber -space where H signal can be measured is called the 'observation window' whereas the region dominated by the foregrounds is the 'foreground wedge' (Datta et al. 2010;Morales et al. 2012;Liu et al. 2014). Measuring the H power spectrum in the observation window is usually referred to as 'foreground avoidance', which is one approach among ongoing efforts of measuring the EoR signal. Alternatively and/or additionally, Blind signal separation (BSS) techniques can also be applied on the foregrounds, or the residuals of them after sky model subtraction. These techniques work mostly on the frequency-frequency covariance of the data, such as fast independent component analysis (fastICA, Chapman et al. 2012;Wolz et al. 2014), generalized morphological component analysis (GMCA, Chapman et al. 2013), correlated component analysis (CCA; Bonaldi & Brown 2015), gaussian process regression (GPR; Mertens et al. 2018) and more (see Chapman & Jelić 2019 for a review). For H observations targeting the post-reionization Universe, foreground removal using BSS methods is typically used to recover the H signal, with transfer function corrections of signal loss (e.g. Switzer et al. 2015;Cunnington et al. 2023b).
Using the methods mentioned above, progress has been made at different redshifts towards the detection of the H power spectrum. For single dish experiments targeting the low-redshift Universe, crosscorrelation detections of the H signal with optical galaxies have been made by the Green Bank Telescope (Masui et al. 2013;Switzer et al. 2013;Wolz et al. 2022), the Parkes telescope (Anderson et al. 2018) and the MeerKAT telescope (Cunnington et al. 2023a). A similar cross-correlation measurement has also been made by the CHIME telescope using stacking (CHIME Collaboration et al. 2023). The first auto-correlation detection has been made by using the MeerKAT telescope as a radio interferometer (Paul et al. 2023). For experiments targeting EoR, upper limits on the H power spectrum have been found by the MWA (Ewall-Wice et al. 2016a;Trott et al. 2020) and HERA (The HERA Collaboration et al. 2022) using the delay transform and foreground avoidance, and by LOFAR using map making with GMCA and GPR foreground removal (Patil et al. 2017;Mertens et al. 2020).
In light of the recent progress, in this paper we explore the possibility of measuring the H power spectrum at 5 < < 6 using SKA-Low. While this redshift range is within the frequency coverage of the instrument, it has been largely neglected since it is not in the interests of the primary goal of H science for SKA-Low, which mainly focuses on the EoR (Koopmans et al. 2015). Despite probing different physics, observations of the post-reionization Universe can benefit significantly from the wide frequency range of the SKA-Low telescope, as the deep observations of the EoR fields will provide accurate modelling of the radio continuum and the instrument. Furthermore, it has been suggested that the Universe may still be partially ionized at ∼ 5.5 (Bosman et al. 2022), in contrast with conventional constraints on the end of reionization to be at ∼ 6 (e.g. Fan et al. 2006). Using the H power spectrum at 5 < < 6 provides a unique method of constraining the end of reionization. However, measuring the H clustering at these redshifts has its own challenges. The H signal at the quasi-linear scales probed at 5 < < 6 will be lower than the signal at the EoR. Meanwhile, the low-frequency band contains more foreground contamination than the L-band typically used for intensity mapping at lower redshifts. It is important to quantify the signal and foreground level at these frequencies as well as the instrument effects, to verify if these redshifts can be used for cosmology.
In this paper, we present an end-to-end pipeline including simulations of the sky signals and the interferometric observations, the foreground mitigation, and the power spectrum estimation to provide a proof-of-concept study for measuring the H power spectrum at 5 < < 6 using SKA-Low. Using the simulation pipeline with different settings, we explore different levels of foreground residual and noise level to find the requirements on integration time and foreground modelling needed. Methods for residual foreground removal are investigated focusing on the comparison between Principle Component Analysis (PCA) and GPR, with quantitative investigations into the differences in the performance of these two methods. We present our forecasts for future SKA-Low surveys on the power spectrum measurements. Impacts of systematics are also briefly discussed to provide an estimation of the requirements on levels of the systematics.
The paper is organised as follows: The simulation of the sky signal is described in Section 2. Simulations of the interferometric observations to get the images and subsequent power spectrum estimation from the images are discussed in Section 3. The presence and the structure of the foreground wedge, with foreground mitigation methods applied, are quantified in Section 4. The robustness of the foreground mitigation methods is tested in the presence of thermal noise and systematic effects in Section 5. We present the concluding remarks in Section 6. Throughout this paper, we assume the ΛCDM cosmology from Planck Collaboration et al. (2020b).

SIMULATIONS OF THE RADIO SKY
In this section, we outline the simulations of the sky signal which consist of the H signal and the foregrounds at 5 < < 6, corresponding to ∼ 200 − 240MHz. The SKA-Low instrument is de- Figure 1. The input sky simulations of different foreground components at 220MHz as described in Section 2. The simulation of the synchrotron radiation is shown in the top panel. The simulation of the free-free emission is shown in the middle panel. The simulation of the extragalactic radio sources are shown in the bottom panel. The pixel size of the figure is (21 arcsecond) 2 and the total size of the signal simulation is (10.5 deg) 2 . Note that the extragalactic signal shown in the bottom panel is simply for illustration with the sources plotted as point sources. Values larger than 100K are masked for better presentation. When simulating the observations, the radio sources are directly put in as a source catalogue instead of a map, as described in Section 2.2. signed to have a maximum channel resolution of 5.4kHz (Braun et al. 2019). Since we are only interested in the H intensity mapping which uses large voxels to map the distribution of the H emission, we reduce the simulated data volume by assuming the redshift bin is covered by 66 frequency channels with a channel bandwidth of 510kHz. The coarser frequency resolution of 510kHz corresponds to ∼ 0.4 Mpc −1 . While increasing the frequency resolution gives access to higher where the foregrounds are weaker, the small scales beyond BAO wiggles are difficult to model for cosmological inferences. We leave simulations with the full frequency resolution for future work.
The primary beam field-of-view (FoV) for SKA-Low at these frequencies is ∼ 3 degrees (Braun et al. 2019). We simulate (10.5deg) 2 sky areas around the pointing centre for all the components of the sky signal. While the sky area only extends to the -20dB (1%) sidelobes of the primary beam, we find that there is no sharp features in the cylindrical power spectrum from simulated foreground residuals (see Appendix B). As discussed later in Section 3, we perform the power spectrum estimation using only the centre (1.5 deg) 2 and therefore the (10.5deg) 2 sky area is sufficient. The pointing centre is at the EoR0 field (Lynch et al. 2021) at RA=0h, Dec=-27deg. The methods for generating the components are described as follows.

Diffuse Galactic radiation
The diffuse Galactic radiation at these scales is dominated by the synchrotron radiation. We use the all-sky 'Haslam map' of synchrotron radiation at 408MHz (Haslam et al. 1981(Haslam et al. , 1982 with the updated version described in Remazeilles et al. (2015). The map is then extrapolated to the frequencies of interest using the Global Sky Model (Zheng et al. 2017) at 1.4GHz and 2.3GHz to calculate the spectral indices of the map pixels. The curvature of the spectral indices (see e.g. Irfan et al. 2022) is neglected for simplicity.
The pixel size of the input Haslam map is (1.72 arcmin) 2 , corresponding to HEALPIX (Górski et al. 2005;Zonca et al. 2019) NSIDE=2048. An image of (10.5 deg) 2 around the pointing centre is created with a pixel size of 21 arcsec. The image is then Gaussian smoothed with a resolution of 1.75 arcmin. The input synchrotron radiation at the central frequency of our simulation 220MHz is shown in Fig. 1.
Free-free emission from the Galactic electrons also contributes to the diffuse Galactic radiation. Following Lian et al. (2020), we use 21 1 to simulate the Galactic free-free emission. It is based on the H intensity map in Finkbeiner (2003). The free-free emission in the frequency range of our interest is several orders of magnitude smaller than the synchrotron as shown in Fig. 1.
As discussed later in Section 3.1, we make image cubes of the observations to perform residual foreground removal and power spectrum estimation. In interferometric observations, during the calibration and imaging process, the diffuse emission is largely subtracted and no visible structure is left in the image cube (see e.g. Rajohnson et al. 2022). Therefore, in our work, we assume the majority of diffuse emission has been removed and model the diffuse foreground residual amplitude as 0.1% of the original emission of our simulation. Although this approach will require accurate modelling of the sky signal, it is fully within the power of SKA-Low. Note that while we are only simulating 66 frequency channels from 200 to 240 MHz, a much wider frequency range, from 50MHz to 350 MHz, will be utilised in future SKA-Low observations to provide accurately modelling of the continuum emission. As we show later in Section 3.1, the output foreground image cube fluctuates on the scale of ∼ 2mJy per point spread function (PSF), corresponding to the overall fluctuation of roughly 80mJy, consistent with the flux density level of residual image cubes from existing EoR observations (see e.g. fig. 2 of Mertens et al. 2020). Thus, the assumption for the level of foreground residual is representative for SKA-Low. It is beyond the interest of this preliminary work to simulate the entire frequency range and produce the sky model for visibility subtraction.
Note that there are other sources of foregrounds that are of Galactic origins, such as supernovae remnants . Since the dominant component of the foregrounds is the synchrotron, we expect that the Galactic foreground simulated in our work is enough to capture the amplitude and the structure of the diffuse emission and leave other components of the Galactic foregrounds for future study.

Extragalactic radio sources
Apart from the Galactic diffuse emission, extragalactic radio sources also contribute to the overall foreground emission. While the Galactic foregrounds are mostly diffuse, the extragalactic foregrounds are typically individual sources of finite size. Understanding the properties of the radio galaxies is a major scientific goal for radio surveys. For example, both continuum and H science results have been produced using the same fields of the MIGHTEE survey (Heywood et al. 2022;Sinigaglia et al. 2022); Observations of EoR0 field from the MWA are used to produce both the upper limits on the reionization power spectrum and the source catalogue (Beardsley et al. 2016;Trott et al. 2020;Lynch et al. 2021).
For future observations using SKA-Low, we expect a good understanding of the radio sources in the fields which will be iteratively improved as the observations themselves will further help build more complete catalogues. Here we use the source catalogue from the LO-FAR Two-meter Sky Survey observations of the ELAIS-N1 (EN1) field (Sabater et al. 2021) and rotate the centre of the field to our pointing centre as shown in Fig. 1. The EN1 catalogue covers slightly less than the (10.5 deg) 2 sky area used for simulating the diffuse foregrounds. As discussed later in Section 3, we only image the central (1.5 deg) 2 fields so the smaller input sky area for the radio sources has negligible impacts on the intensity of the foreground emission in our image cubes. In real observations, the bright sources in the beam sidelobe pose challenges to the data calibration which we do not consider in this work. These issues can be mitigated by techniques such as secondary and direction-dependent calibrations (see e.g. Patil et al. 2017;Mertens et al. 2018;Heywood et al. 2022).
In the source catalogue, we impose a flux density cut of 10mJy assuming all sources above this flux density can be perfectly peeled. The 10mJy limit is fairly conservative and can be set lower given the high sensitivity of SKA-Low. For example, using 12 nights of LOFAR-EoR data observing the North Celestial Pole (NCP), Mertens et al. (2020) produced source-subtracted images with fluctuations at 50mJy level. The source model of the NCP field has also been built iteratively over the years down to sources with flux density down to ∼ 3mJy (Yatawatta et al. 2013). The depth of the sky model for the EoR0 field simulated in this work can also be expected to reach mJy level. Furthermore, we expect the sources below this flux density to be modelled with 90% accuracy. This is again a conservative estimate, as relatively short observations of only 13 h used in Patil et al. (2017) reports ∼ 5% error in recovering the flux density of a known bright source. As we discuss later, we focus on deep observations with ≥ 300 h of observation and therefore it is expected that the flux of the sources around 1mJy can be accurately modelled with below 10% errors. We assume no position errors for the sky modelling.

The H signal
H resides mostly inside the dark matter halos after the EoR at 6. The collapse of the cold gas leads to star formation, creating strong correlations between the star formation rate and the molecular (H 2 ) gas content of the galaxies (Leroy et al. 2008). Therefore, the clustering of H can be related to the star forming properties of the galaxies and can be used to constrain the galaxy astrophysics (e.g. Wolz et al. 2016;Chen et al. 2021). At higher redshifts beyond cosmic noon > 2, the fraction of H within galaxies start to drop (Villaescusa-Navarro et al. 2018) and the distribution of the H tilts more towards the massive halos (Spinelli et al. 2020). Due to the lack of direct observations on these H emission sources at higher redshifts, the properties of the H within halos are not well understood, which can be dramatically improved by future H intensity mapping experiments.
The large sky area of (10.5 deg) 2 , and the 5 < < 6 redshift bin, result in a light cone of ∼ 1500Mpc in the transverse direction and ∼ 500Mpc in the line-of-sight (los) direction. For our purposes of exploring the detectability of the signals, instead of using a full hydrodynamical simulation, we use semi-analytical simulations based on dark matter simulations and H Halo Occupation Distribution (HOD, Cooray & Sheth 2002). It allows us to efficiently simulate the large volume required. The detailed steps of our H simulation are as follows: • Assuming the Planck18 cosmology (Planck Collaboration et al. 2020b), we use 2 (Monaco et al. 2002(Monaco et al. , 2013 to simulate nine boxes of dark matter distributions, each with a volume of (620 Mpc) 3 . The total volume of 9 × (620 Mpc) 3 is to ensure that the lightcone falls well within the simulated volume, avoiding edge effects. The total volume is divided into 9 sub-boxes to avoid computational difficulties.
• Each sub-box has 1850 grid points per side, resulting in a mass resolution of ∼ 3.25 × 10 9 M /h. Note that this mass resolution is likely not enough to resolve all the H -rich halos (see e.g. Villaescusa-Navarro et al. 2018). However, it is enough to capture the bias of the H clustering which is sufficient for our purposes.
• Each sub-box is simulated across the 5 < < 6 redshift bin with a snapshot taken at each observing frequency channel, equalling a total of 66 snapshots (see Section 3 for specifications of the observations). The halo positions relative to the centre of the box in comoving space, the velocities, and the mass of the haloes are taken.
• The 9 sub-boxes are then put together onto 3x3 grids with the centres of the boxes re-positioned. We take the observer to be at (0,0,0) and the centre of the 5th box is at (0,0,X cen ) where X cen is the comoving distance at the centre of the 5 < < 6 redshift bin. The halos are re-positioned accordingly.
• The peculiar velocities of the halos are calculated given the 3D halo velocities and the position vectors. The halo positions are modified to redshift space according to the Kaiser effect (Kaiser 1987).
• Each halo is assigned an H mass according to the H HOD of the IllustrisTNG simulation in Villaescusa-Navarro et al. (2018). The H HOD follows h the halo mass. We adopt the parameter values at = 5, with 0 = 1.9×10 9 ℎ −1 , min = 2.0×10 10 ℎ −1 and = 0.74. All HI masses are put into the halo centres, since we are only interested in large scales < 0.5 Mpc −1 and hence the halos are unresolved. The H masses are then multiplied by a constant factor so that at each redshift the HI mass density, Ω H , equals to 10 −3 . This is consistent with the observation of Crighton et al. (2015) and ensures that the clustering amplitude is realistic.
• The distances between the halos and the observer are calculated. For snapshot corresponding to frequency channel , the line-of-sight comoving distance range [X i min , X i max ] is calculated according to the channel bandwidth and central frequency. Only halos in the distance range are selected.
• A rotational matrix along axis to rotate the -plane is applied to the halos so that the centre of the simulation corresponds to the pointing centre RA=0h and Dec=-27deg. The halo positions are converted to angular coordinates. The H mass is converted to the flux density assuming that the flux is distributed as a step function across the frequency channel. This is a reasonable assumption given that the velocity resolution of SKA-Low at 5 < < 6 is not high enough to resolve the emission profiles of the H sources.
To validate our H simulation, we compute the H power spectra for the 9 sub-boxes, and compare to the central (1.5 deg) 2 area of the light cone which we will use for imaging later. The resulting average H power spectra for the boxes and for the central input image is shown in Fig. 2.
We emphasize that the variance of the H signal, shown as the shaded area in Fig. 2, is underestimated. This is due to the fact that we assume a deterministic relation between the H and halo mass, ignoring the scatter of the relation (see e.g. Fig.4 of Villaescusa-Navarro et al. 2018). The scatter comes from the assembly bias of halos, which can be introduced by the inhomogenous reionization history (e.g. Long et al. 2022). In our case of investigating the detectability in thermal noise dominated case, this effect is negligible and we leave more realistic simulations for future work.
Note that, the (1.5 deg) 2 image size corresponds to a maximum length scale equivelant to ∼ 0.03 Mpc −1 . Scales larger than this can not be probed by the image, as one can see from the top panel of Fig. 2. At smaller scales, > 1 Mpc −1 , the H power spectrum hits the shot-noise plateau. This is not accurate and the actual shot noise should be much lower. In our simulation, the H is directly put as point sources in the halo centres, so that the number density of H sources is underestimated (see Spinelli et al. 2020 for a discussion of this). The actual shot noise should be much lower and requires more in-depth modeling of the H halo model (Wolz et al. 2019;Chen et al. 2021). As we will discuss in Section 3, the minimum scale probed in our simulation is ∼ 0.3 Mpc −1 and therefore we are not affected by this insufficient modelling. The cylindrical power spectrum shown in the bottom panel of Fig. 2 indicates that the H power spectrum from our simulation gives the correct isotropic features, and therefore can be reliably used to study the detectability of the H power spectrum in the presence of the foreground wedge.

SIMULATIONS OF OBSERVATIONS
In this section, we describe the simulation of the SKA-Low interferometer to observe the input sky signal discussed in Section 2, the imaging routine to produce the image cube within the primary beam FoV, and the power spectrum estimation.

From sky signal to image product
The SKA-Low array will consist of 131,072 log-period dipole antennas within 512 stations covering the southern sky from 50 to 350 MHz. Since the specific station layout and specifications are not finalised, we use the v3 station layout (de Lera Acedo et al. 2020) assuming a frequency channel bandwidth of 510kHz. We only take the central area with 296 stations with a maximum baseline length of 3.15km. The longest baselines are not of cosmological interest and are thus neglected to reduce data volume. The frequency range we simulate is from 202.56MHz to 235.76MHz, covering redshift 5 to 6 with 66 frequency channels. The station layout is shown at the left panel of Fig 3. The visibility data are simulated to represent one night of observation at the EoR0 field. We assume a total integration of 12 h with a time-resolution of 180 s in one tracking. The resulting u-v coverage of the baselines is shown in the right panel of Fig 3. The u-v coverage shown is dense within |u| < 1000m (which corresponds to the physical scale 0.5 Mpc −1 ). Choosing the u-v grid length to correspond to our image size, we find no loss of u-v grid sampling, justifying the usage of a relatively coarse time resolution.
Following the observational specifications discussed above, we use the OSKAR 3 package (Mort et al. 2010) to generate the visibility data. OSKAR takes in the telescope specifications, sky model and observation strategy to simulate the primary beam, the u-v coverage and the visibility data. It can also be used to generate dirty images, which we use to produce the image cube. The sky area for the imaging output is determined by the primary beam size. In the calculation of the power spectrum, the primary beam attenuation is squared since  the power spectrum is the Fourier density field squared (see e.g. Parsons et al. 2014). To image within the primary beam field-of-view, we take the limit where the power-square beam attenuation reaches ∼ 0.5. The primary beam is largely Gaussian near the pointing centre as shown in Fig. 4, resulting in power-square beam having half the full-width-half-maximum (FWHM) comparing to the actual beam. The image size is accordingly set to be (1.5 deg) 2 and we choose the pixel size to be (0.45 arcmin) 2 with 200 × 200 grids. We apply the W-projection algorithm (Cornwell et al. 2008) with natural weighting to the baselines to produce the image cube. The power-square beam and the synthesized beam (PSF) are shown in Gaussian random noise are added to the visibility data to simulate the thermal noise. The amplitude of the thermal noise is determined by the radiometer equation (Wilson et al. 2013 where k B is the Boltzmann constant, sys is the system temperature, e is the effective collecting area, is the frequency channel bandwidth, is the time resolution. We follow Braun et al. (2019) and set the natural sensitivity e / sys = 1.235 m 2 K −1 to generate random complex Gaussian on every baseline. The images at the central frequency for the foregrounds, the H , and the thermal noise are shown in Fig. 5. All images are dirty images with no cleaning routine applied. Throughout this paper, we use 'Jy/PSF' and 'kelvin/PSF' units for the images before deconvolution with the PSF. The 'PSF' refers to the integrated PSF area in steradian ∫ d d PSF( , ). 'Jy/PSF' is more commonly referred to as 'Jy/beam'. We use 'Jy/PSF' to avoid confusion with the primary beam.
In Section 5.1 when we discuss residual foreground removal, the thermal noise is rescaled by a factor of √︁ sim / int , where sim = 12 h is the observation time for the simulated one tracking and int is the total integration time set to 360, 480 and 600 h for different scenarios. The rescaling mimics coherent averaging of the visibility data over multiple nights. The thermal noise power spectrum is ∼ 4 orders of magnitude larger than the H power spectrum as we show in Fig. A1 in Appendix A.

Simulating systematics
Real observations will contain a wealth of systematics, including the radio frequency interference (RFI), gain instabilities, calibration errors and more. While it is beyond the scope of this paper to properly take into account all of the systematics, we aim to simulate the effect of systematics that can lead to spectral instability in a simplistic way. The systematics are simulated using where i true is the visibility data of the i th baseline without the systematics and the thermal noise. TN is the thermal noise visibility.
follows a Gaussian distribution with zero mean and only depends on the frequency channel. We simulate with different standard deviations from 10 −5 up to 10 −4 . The systematic errors are multiplied to the full visibility data before the assumed sky model subtraction. This effect is a crude approximation for bandpass calibration error averaged across all timesteps, creating fluctuations on small frequency scales which will leak foreground power into the observation window and bias the foreground removal techniques as we discuss in Section 5. Note that, the calibration errors are complex and have smooth structures in frequency for H observations (see e.g. figs 2 and 3 of Byrne et al. 2019). In our case, we focus on the blind removal of residual foreground after calibration and choose Gaussian errors so that the foreground scatter is present across the delay space (see Appendix B).
It is worth pointing out that the 200-240 MHz frequency range hosts several prominent sources of RFI. Around 220MHz there are the RF11 and RF12 bands of digital TV (see e.g. fig. 2 of Offringa et al. 2015), which can be identified through flagging algorithms (e.g. Offringa et al. 2010;Wilensky et al. 2019). The larger end of the frequency range ∼ 240MHz sits right next to military satellite band (242-272 MHz) which may cause complete data loss of the entire frequency range (see fig. 4 of Sokolowski et al. 2015). The presence of this RFI forbids us to go below redshifts < 5. Overall we expect that the 200-240MHz frequency range can be observed without substantial loss of data.

H power spectrum from the imaging route
The image cube can be used to estimate the H power spectrum. We compute the HI power spectrum from the imaged data instead of measuring the delay power spectrum directly from the visibilities for two reasons. First, the cosmological quantities such as the Hubble parameter and the comoving distance have significant evolution across the large redshift bin Δ = 1, making the delay power spectrum estimation very difficult especially with regards to deconvolving wprojection kernel and primary beam attenuation. Second, if we can verify the detectability of one field in image space, we can probe larger cosmological scales through image mosaicing of overlapping fields.
To calculate the H power spectrum, we first transform the flux density ( , , ) in the image cube into Fourier space brightness temperaturẽ is the physical co- ordinate corresponding to the sky coordinate ( , ) and observing frequency . V is the comoving volume of the image cube.
( ) is the comoving distance at redshift . c is the centre of the redshift bin and = 21 / − 1 is the redshift corresponding to the frequency where 21 is the rest frequency of the 21-cm line. ( ) is the primary beam attenuation. is the observing wavelength. The transverse coordinates for each voxel are assigned assuming an effective comoving distance, which is important to ensure that the operators for residual foreground removal and Fourier transformation are commutable as we discuss in Appendix A.˜( ⊥ , ) is in the units of kelvin/PSF. The H power spectrum in 3D -space is where PSF( ⊥ , c ) is the 2-D Fourier transform of the PSF at the central frequency c In the calculations above, several approximations have been made. The frequency evolution of the PSF is assumed to be negligible over the frequency bandwidth of the simulated observation. The physical coordinates of the voxels are assigned assuming an effective comoving distance. The flat-sky approximation is also used. While these assumptions may not be accurate enough for precision cosmology, as we show in Fig. 6, it can be seen that the output H power spectrum is within the 1-region of the input. It is sufficiently accurate for studying the detectability of the signal. The scales probed are from ∼ 0.03 Mpc −1 , limited by the size of the image, to ∼ 0.3 Mpc −1 , limited by the image resolution due to the PSF. In the power spectrum results shown hereafter, a Blackman-Harris frequency taper is also applied to minimise potential leakage of foregrounds and systematics, with the details discussed in Appendix A.

QUANTIFYING THE FOREGROUND WEDGE
In this Section, we use the image cube from H and foreground visibility data without the thermal noise to explore the limits of reducing foreground contamination. Without the thermal noise and any systematics, the H and foreground-only case showcases the best possible scenario for residual foreground removal. It helps us understand the requirements for sky modelling to enable detection and locate the observation window in the ⊥ − plane. We particularly focus on scales of cosmological interest < 0.2 Mpc −1 , especially the largest scale that can be probed using our image cube ∼ 0.03 Mpc −1 . If these scales can be probed with little foreground contamination, future surveys using wide-field imaging and mosaicing can further extend the scales larger than the first Baryon Acoustic Oscillation (BAO, Eisenstein & Hu 1998) peak at ∼ 0.04 Mpc −1 to the linear scales for cosmological analysis.

Observation window using only foreground avoidance
We first use the H -only image cube and foreground-only image cube to estimate the power spectra for the H and the foregrounds to compare them in cylindrical -space. The cylindrical power spectra for the H and the foregrounds are shown in Fig. 7. Comparing the ratio between the H power spectrum and the foreground power spectrum as shown in the top-left panel of Fig. 8, the foreground power spectrum is larger than the H power spectrum at 0.12 Mpc −1 , leaving no observation window at linear and BAO scales. In Fig. 8, the region where foreground power dominates does not have a clear wedge structure. This is due to the fact that the bright sources in the primary beam side-lobes, which contributes mostly to the wedge structure at high ⊥ , are assumed to be already removed in our simulation. Without the strong foreground emissions coming from large angular extent (high delay time), the wedge structure at high ⊥ no longer exists. The lack of wedge feature can also be seen from observations (e.g. LOFAR observaions shown in Mertens et al. 2020;Hothi et al. 2021). As we demonstrate in Section 4.2, the wedge structure reappears after foreground cleaning is applied to the data. This is due to the fact that removing residual foregrounds reduces the foreground power near the pointing centre, making the foreground emission at larger angular distance comparatively brighter.
If we relax the 10% modelling residual as described in Section 2.2 to an extreme 0.1%, 0.05 Mpc −1 scales are still lost as shown in the top right panel of Fig. 8, which invalidates the usage of the observations for cosmology. The result suggests that even with extreme level of calibration and sky modelling accuracy, it is unlikely that foreground avoidance can be used to measure the H power spectrum at cosmological scales at 5 < < 6 due to the = 0.24 ⊥ calculated according to Eq. (11). Bottom right panel: The ratio between the H power spectrum and the residual power spectrum after GPR cleaning. In all the panels shown, values below 1 are set to 1 for better presentation. The darkest end of the color scale corresponding to the region where the foreground power is larger than the H . weakness of the H signal at these redshifts. We can use foreground removal methods to mitigate the contamination at large scales as we show in the following sections.

Residual foreground removal
In order to suppress foreground contamination down to the wedge and create an observation window at large scales, we explore methods of blind source subtraction to remove the residual foregrounds. We focus on two methods commonly used, namely the Principle Component Analysis (PCA, e.g. Spinelli et al. 2022) and Gaussian Process Regression (GPR, e.g. Mertens et al. 2018;Soares et al. 2022). Following Chen et al. (2023), with the observation window enlarged due to the foreground cleaning we can choose a criteria for the power spectrum estimation in 1D -space where is a constant to be set. The value for can be found by iteratively testing with larger values to the point where the 1D power spectrum results converge.
We write out the general formalism for frequency-frequency covariance based foreground removal methodŝ where X is the mean-centred image cube which has dimensions of (N , N p ) with N the number of frequency channels and N p the number of pixels in one frequency channel.Ĉ fg is an estimation of the covariance matrix for the foregrounds andĈ −1 is the inverse of the estimation of the total data covariance. For different methods such as PCA and GPR, different choices ofĈ fg andĈ are used, producing different reconstructed foregrounds which we discuss in detail in Section 5.1.

Foreground removal using PCA
The PCA method separates the foregrounds by using the eigenvalue decomposition of the frequency-frequency data covariance matrix (e.g. Cunnington et al. 2021) The eigenvalues and eigenvectors of the covariance matrix are then calculated. An estimation of the foregrounds can be extracted from the data matrix usinĝ Here, v i is the eigenvector corresponding to the i th largest eigenvalue and a total of fg modes are removed. To link it to Eq. (7), we can rewrite Eq. (9) aŝ where it is straightforward to see that, in the case of PCA,Ĉ PCA =Ĉ d andĈ PCA fg = AA TĈ d . In our case, the eigenvalues of the data covariance reach a plateau after the third eigenvalue, suggesting that N fg = 3 is a good choice for cleaning the foregrounds and avoiding overcleaning the signal. The ratio between the H power spectrum and the residual power spectrum after cleaning is shown in the bottom left panel of Fig. 8. Throughout the paper, the residual power spectrum is defined as the power spectrum of the residual foreground image X res = X fg −X fg , where X fg is the image of the input foregrounds andX fg is the removed foreground by either PCA or GPR.
Comparing Figs 7 and Fig. 8, the cleaning efficiently enlarges the observation window at small ⊥ . If the foreground contamination is optimally mitigated, the foreground wedge can be located using the 'horizon limit' (Liu et al. 2014 where ( ) is the Hubble parameter, is the speed of light and 0 is the angular extent of the instrument beam. As a crude approximation we choose 0 = 2 √︁ Ω beam / where Ω beam is the integrated primary beam which gives h = 0.24. From the bottom-left panel of Fig. 8, we can see that the foreground wedge is close to the horizon limit which is marked by the red dotted line, showing that the foreground cleaning is efficient. Iteratively increasing the threshold we find that the 1D power spectrum converges at = 0.3, which we use from now on in this paper.

Foreground removal using GPR
GPR constructs the foreground component by fitting parameterized kernels to the data covariance. Suppose we have the H kernel K H , the foreground kernel K fg and the thermal noise kernels K n fitted, then the estimated foreground can be written as (e.g. Mertens et al. 2018) It is straightforward to see that, in the case of GPR,Ĉ GPR = K fg + K n + K H andĈ GPR fg = K fg . The H and the foreground covariance matrices are shown in Fig.  9 for reference. The H covariance is highly diagonal, due to the discrete and uncorrelated nature of the H along the line-of-sight. On the other hand, the foreground covariance is smooth and shows a clear spectral feature along the frequency direction, corresponding to the negative spectral indices of the radio sources. Due to the spectral evolution of the foreground covariance, the conventional choice of a Matérn kernel (?) does not describe the foreground covariance well. Instead, we use Markov chain Monte Carlo (MCMC) to fit the kernels using the following steps: • In each step, a random value n is sampled and a diagonal kernel K n = 2 n K ij is calculated where K is the Kronecker delta. In this section, K n is the H kernel. Following Soares et al. (2022), in Section 5 when thermal noise is included, K n is the sum of the H and the thermal noise covariance matrices.
• The total data covariance matrix is then subtracted by the diagonal kernel K n . A third-order polynomial fitting is then performed on every row of the subtracted result, creating a fitted kernel K fit . The kernel is then symmetrised to get the foreground kernel K fg = (K fit + K T fit )/2. • The parameters for the kernels are then fitted by maximizing the log-marginal likelihood log = −(X T K −1 X + log|K| + log2 )/2, where is the number of data points sampled and K is the sum of the kernels K = K fg + K n .
• The MCMC fitting is then performed with 20 random walkers with 2000 iterations to make sure the chains converge. The initial guess of n is taken to be the square root of the trace of the data covariance. The final kernels are the 50% percentile of the K n and the K fg samples in the chains excluding the first 100 steps.
Note that after GPR cleaning, a bias correction can be applied as shown in Mertens et al. (2018). We follow the quadratic estimator formalism of Kern & Liu (2021) and show in Appendix A that the bias correction term in our case is negligible. The resulting foreground residual power spectrum compared to the H power spectrum is shown in the bottom right panel of Fig. 8. Comparing the foreground wedge in the GPR case with the horizon limit and with the PCA case, we can see that in the absence of thermal noise, GPR is slightly more efficient in cleaning the foregrounds and both methods do well enough to enable the detection of the H at large scales < 0.1 Mpc −1 . At the largest spatial scales of the image, there is negative residual power from overcleaning. The differences between these two methods are discussed later in Section 5.1.
The success of the methods in cleaning the foregrounds indicates that we can measure the H power spectrum from the SKA-Low observation at 5 < < 6, as we show using the 1D power spectrum in Fig. 10. As mentioned, both methods can enable the measurements of the H power spectrum from ∼ 0.05 Mpc −1 up to ∼ 0.3 Mpc −1 .

FORECASTS FOR SKA-LOW
In this section, we further explore the detectability of the H power spectrum for SKA-Low observations by including different levels of thermal noise in the simulation. In particular, to enable the measurement of the H power spectrum, the robustness of the foreground removal methods in the presence of the thermal noise must be tested. Furthermore, we simulate systematics by generating stochastic errors along the frequency direction to test the limits of level of systematics allowed.

Robust foreground cleaning with low SNR
In Section 4.2, we show that the foreground removal methods can suppress the foreground wedge to the horizon limit. However, this result is based on the fact that the empirical data covariance is 'clean', i.e. the covariance is purely a combination of the H and the foregrounds. Therefore, the distinctive features of the H can be extracted from the signal using PCA and GPR. In reality, the data covariance is likely to contain a high level of thermal noise as well as systematics, making it difficult to construct the covariance of the foregrounds. We test PCA and GPR in the presence of different levels of thermal noise. As described in Eq. (1) in Section 3.1, we simulate the thermal noise for the 12h tracking and rescale it to match 360, 480 and 600 h of integration time. We first show the results for the 360 h case and compare the effects of foreground removal methods. The PCA and GPR routines are kept the same as in Section 4.2 with the observation window = 0.3. The ratio between the underlying H power spectrum and the foreground residual after removal in cylindrical -space is shown in Fig.  11. In contrast with the results shown in Fig. 8, the amplitude of the residual power increases significantly. For the PCA case, the observation window is heavily contaminated by the foregrounds while the contamination is less severe in GPR. Note that, this difference is not visible in the residual image cube as we show in Fig 12. The amplitude of the fluctuation of the residual is roughly the same with no indications of the different levels of foreground contamination.
The difference between PCA and GPR can be seen using the formalism in Section 4.2. Comparing Eq. (10) and Eq. (12), we can see that GPR uses the fitting result to obtain smooth kernels of the H and the foregrounds for cleaning. On the other hand, PCA directly operates on the total data covariance, which contains a fluctuation around zero in the non-diagonal elements because of the thermal noise. The fluctuation of the thermal noise leads to small scale oscillations in the residual covariance. For comparison, we calculate the covariance of the 'estimated' H , i.e. the total image subtracted by the removed foreground and the noise component Comparing the covariance ofX PCA H andX GPR H as shown in Fig.  13, we can see that while the amplitude of the covariance is roughly the same and close to the true H shown in the left panel of Fig. 9, the PCA case has large fluctuations across the frequency channel, leading to the stripe-like features in the frequency-frequency covariance matrix. While this fluctuation is also present in GPR, its amplitude is much smaller and the dominating component is still the diagonal H covariance. For PCA, however, this fluctuation introduces a smallscale fluctuation that spills foreground power into the observation window, resulting in severe signal loss at all scales including scales  where the foreground power is originally already lower than the H as shown in the upper left panel of Fig. 8. To further illustrate the small-scale contamination, we calculate the covariance matrices for the residual foregroundX res for PCA and GPR and compare them with the H covariance as shown in Fig. 14. Both methods have clear foreground residual structure over large frequency scales. However, the PCA residual has a much larger small-scale fluctuation with the amplitude larger than the diagonal H . The small-scale fluctuation results in severe contamination in high modes inside the observation window as shown in Fig. 11.
When comparing PCA and GPR, we assume full knowledge of the true H , thermal noise and foregrounds in our simulation to perform quality checks on the foreground removal methods. It is important to note that the foreground removal and power spectrum estimation routines do not rely on knowing the underlying components. The foreground removal is performed blindly and the H power spectrum is estimated by subtracting a thermal noise covariance as discussed in Appendix A. We choose logrithmically distributed k-bins from 0.01 to 1 Mpc −1 with Δ[log( /Mpc −1 )] = 0.25 and show the resulting 1D power spectrum for 360 h of integration time for both PCA and GPR in Fig. 15. Throughout this paper, the error bars on the 1D power spectrum are calculated by calculating the sampling variance of the  3D powers that fall into the 1D -bins. The resulting measurement errors on the power spectrum are where std[ ( ∈ )] denotes the standard deviation among the 3D powers that belongs in the th -bin and modes denotes the number of -points in the th -bin. As shown in Fig. 15, the foreground contamination leads to overestimation for the GPR case from ∼ 0.03 to 0.3 Mpc −1 . The severe contamination of foregrounds results in signal loss on most scales for the PCA case. In conclusion, we find that in the presence of high thermal noise, PCA induces foreground contamination into the observation window due to the small-scale fluctuation in the data covariance matrix. On the other hand, GPR does not introduce sizable foreground leakage into the observation window and mitigates the foregrounds to enable the measurements of the H power spectrum at large scales.
Using GPR, we present our forecasts for the H power spectrum measurement for SKA-Low observations of the EoR0 field assuming 360, 480 and 600 h of integration time in Fig. 16. The power spectrum results converge to the input H as the noise level decreases. For 360 h of integration time, all bandpower measurements are within the 1error of the input H with the bandpower at ∼ 0.05 Mpc −1 slightly overestimated due to foreground contamination. While not shown, we also tested that decreasing the integration time to 250 h results in the bias exceeding the 1-error. We conclude that the integration time of one field needs to be greater than 250 h to enable unbiased measurements of the H power spectrum. In the case of 480 h, the H power spectrum can be measured with ∼ 3 signal-to-noise ratio from ∼ 0.03 to 0.3 Mpc −1 . Further increasing the integration time to 600 h, we find that the bias further decreases and the error bar scales as √ int , suggesting that the thermal noise is the dominant source of the measurement errors.

Impact of systematics on foreground removal
Interferometric observations contain various systematics, such as RFI, gain fluctuations, calibration errors, etc. These systematics impact the H power spectrum measurement in various ways. For example, the data loss coming from RFI requires inpainting or novel Fourier transform methods which leaves residuals in the power spectrum (Trott et al. 2016;Pagano et al. 2023). Imperfect calibrations leaks the foreground power into the observation window (Barry et al. Figure 17. The H power spectrum measured from the residual foreground removed image cube as described in Section 5.2 for different levels of systematics (10 −5 , 5 × 10 −5 , and 10 −4 ). The error bars on the horizontal axis denote the width of the -bins and the error bars on the vertical axis denote the errors of the bandpower estimation. The shaded region denotes the input H power spectrum ('Input HI'). The centres of the k-bins for the 10 −5 and 10 −4 systematic effect cases are misplaced by 5% in -direction for better presentation. 2016). Gain and phase errors contribute to the foreground contamination in the H power spectrum (Mazumder et al. 2022). As a proof of concept, we are aiming to give a qualitative assessment of the impact of the systematics. Following Eq. (2), we set std( ) to 10 −5 , 5 × 10 −5 , and 10 −4 to check the resulting power spectrum estimation. All foreground removal and power spectrum estimation steps are kept the same as Section 5.1 and we choose the integration time to be 600 h to isolate the impact of the systematics. Previous literature suggest that < 10 −5 level of systematic error is needed for the measurement (Barry et al. 2016;Mazumder et al. 2022). Note that, however, as we simulate the systematics as a random error on the true signal, it acts as a small frequency-scale fluctuation on the residual foregrounds which can be partially mitigated. Therefore, using methods such as GPR, we can still recover the observation window with the level of systematics higher than 10 −5 .
In Fig. 17, we show the H power spectrum measured from the > 0.3 ⊥ window with the signal perturbed by the 10 −5 , 5 × 10 −5 , and 10 −4 systematic error. For a very small level of 10 −5 systematic effects, the GPR foreground removal method is unaffected by the systematics and removes the residual foreground sufficiently. As we increase the level of systematics to 5 × 10 −5 , the foreground starts to leak into the observation window especially at small scales, leading to overestimation of the H power spectrum. Increasing the systematics to 10 −4 the contamination becomes severe and leads to biased estimation of the H power spectrum at all scales.
Similar to Section 5.1, we can use the covariance matrices to show that the systematic effects break the frequency smoothness of the foreground, leading to biased foreground removal results. The covariance matrices of the 'estimated' H in presence of different levels of systematics are shown in Fig. 18. When no significant systematic effects are included as shown in the top panels, the reconstructed H covariance matrix is largely diagonal, suggesting that no sizeable foreground leakage is present. However, as we increase the level of systematics to 5 × 10 −5 , the small scale stripes similar to the ones discussed in Section 5.1 appear. For the 5 × 10 −5 case, we can see that the diagonal component is still dominant and indeed as shown in Fig. 17 the H power spectrum is still accurately measured. Increasing the level of systematics to 10 −4 , we can see that the covariance is completely dominated by the contamination from the systematics, leaving no observation window for the H at all. In conclusion, the level of residual systematics needs to be contained at < 10 −4 and ideally 5 × 10 −5 for accurate measurement of the H power spectrum.

CONCLUSIONS
In this paper, we present the first proof of concept for measuring the H power spectrum at 5 < < 6 using SKA-Low. We have presented an end-to-end simulation and data analysis pipeline, generating the sky signal, the interferometric observation, performing the imaging and the power spectrum estimation. We use the pipeline to generate realistic simulations consistent with deep observations of the EoR0 field using SKA-Low and test foreground mitigation methods to present our forecasts for future SKA-Low observations.
We start by simulating the input sky signal including the H and the foregrounds. Galactic foregrounds are generated based on templates from observed maps of the radio sky and extrapolated to the frequency range of our interests. We use a realistic radio source catalogue to simulate the extragalactic radio sources. The H clustering signal is generated by using large-volume, realistic dark matter halo simulations with an H HOD inpainting. Generating the sky sig-nal with different levels of foreground residuals compared with the underlying H signal, we find that: • Assuming a realistic amplitude for foreground residuals after sky model subtraction to be at ∼ 80mJy in the image cube, the foregrounds reside mainly at low 0.1 Mpc −1 , leaving an observation window at high for estimation of the H power spectrum. Residual foregrounds need to be subtracted using blind source separation methods to enable the measurement of the H power spectrum at large cosmological scales < 0.1 Mpc −1 .
• Testing PCA and GPR to remove the residual foregrounds, we find that if bright sources with flux density > 10 mJy are subtracted with the rest of the sources being modelled to 90% accuracy, removing the residual foreground can enable detections of the H power spectrum at large scales. The foreground wedge is consistent with the instrinsic foreground power coupled with the instrument chromaticity, with the wedge corresponding to the primary beam size.
• Assuming no contribution from thermal noise and systematic effects, the empirical data covariance matrix calculated from the image cube reflects the true underlying covariance of the sky signal. Therefore, PCA and GPR can both sufficiently remove the foregrounds with trivial differences between these two methods.
• From the image cube with (1.5 deg) 2 sky area within the primary beam FoV, we can measure the H power spectrum from ∼ 0.02 Mpc −1 to ∼ 0.3 Mpc −1 .
The results suggest that measuring the H power spectrum at 5 < < 6 for cosmological analysis using SKA-Low is viable and will open up a new window for cosmology in the near future. Using widefield imaging and/or mosaicing, we can probe linear cosmological scales ∼ 0.01 Mpc −1 to quasi-linear scales ∼ 0.3 Mpc −1 . The wide range of clustering scales probed can be used to constrain cosmology (Pourtsidou 2023).
The detection of the H signal at large cosmological scales depends heavily on the robustness of foreground mitigation strategies. Simulating different level of depths for the observation, we find that: • In general, future observations using SKA-Low contain a high level of thermal noise fluctuations. The effects of the thermal noise on the data covariance are visible even for deep observations > 250 h.
• The thermal noise fluctuations in the empirical data covariance matrix make residual foreground removal more difficult. Thermal noise creates numerical features on the foreground-removed image cube on small frequency scales, breaking the spectral smoothness of the data covariance.
• As a result of the spectral fluctuations, foreground removal methods induce numerical artefacts on small frequency scales. The numerical artefacts leak power into the observation window which leads to significant bias on the H power spectrum estimation. Even scales > 0.1 Mpc −1 which can be probed with just foreground avoidance can be contaminated.
• Comparing PCA and GPR, we find that GPR performs much better in the presence of thermal noise. The key factor is that GPR uses smooth kernels to model the signal and apply the fitted kernels instead of the actual data covariance matrix for the foreground removal. For observation with integration time > 250 h, GPR can sufficiently remove the foregrounds and allow unbiased estimation of the H power spectrum for > 0.3 ⊥ regions. • For the integration time of 600 h, SKA-Low will be able to measure the H power spectrum in the 5 < < 6 bin from 0.03 to 0.3 Mpc −1 with a signal-to-noise ratio of ∼ 5 across the scales.
In conclusion, the viability of detecting the cosmological H power spectrum at 5 < < 6 using SKA-Low depends on deep observations to preserve the spectral smoothness of the data covariance to facilitate sufficient foreground removal. It will allow accurate measurement of the H power spectrum, on the premise that deep fields with effective integration time 300 h are observed. Our results not only solidify the science case of measuring post-reionization cosmology with SKA-Low, but also provides insights into survey design for maximising the scientific output of the instrument.
Finally, we provide a qualitative study into the systematic effects by introducing spectral fluctuations that can originate from bandpass instabilities and calibration errors. Testing the data analysis pipeline for different levels of systematics we find that: • Systematic effects such as bandpass instabilities will introduce fluctuations in the small frequency interval, breaking the spectral smoothness of the foregrounds. It leads to spillover of the foreground power into the observation window outside the foreground wedge.
• In the image cube averaged across all timesteps, the effective systematic errors acorss the frequency channels need to be small to suppress the contamination. If the level of the systematics is above 10 −4 , the power spectrum measurement will be biased across all scales of interests.
• For systematic errors 5 × 10 −5 , we find that using GPR to perform foreground removal gives unbiased estimation of the H power spectrum.
The requirements on containing the systematic errors below one per cent level again highlight the need for deep observations with good understanding of the sky model and the instrument. With the unprecedented power of the SKA-Low array, we expect that future surveys will be sufficiently systematic-mitigated to enable the detection of the H power spectrum for the high redshift, post-reionization Universe.
Our work strongly favours using the future SKA-Low data for H science after cosmic reionization. We have demonstrated that the H power spectrum can be measured with statistical significance using observational depth that can easily be reached using SKA-Low. Furthermore, we have showcased residual foreground removal using GPR that suppresses the foreground wedge to probe cosmological scales, which is robust in the presence of a reasonable level of systematic effects. The tools presented in this paper can be further used for more realistic simulations of SKA-Low observations to develop the data analysis pipeline towards future detections.
for the th frequency channel The estimation matrix E can be written as where M is the renormalisation matrix, T is the frequency taper, is the selection matrix with all elements being zero except the th diagonal element and F is the 1D discrete Fourier transform kernel along the frequency direction. C , = F † F is the Fourier operator. R is the foreground removal operation. For PCA as described in Eq. (9), the removal matrix is R = I − AA T where I is the identity matrix. For GPR as described in Eq. (12), the removal matrix is The renormalisation matrix can be calculated by taking the expectation value of Eq. (A1) Following Kern & Liu (2021), we can form the quantity and choose M = H −1/2 (Tegmark et al. 2002) to renormalise the estimator.
In order to remove the bias in the power spectrum estimation from the foregrounds and the thermal noise, from Eq. (A4) we can choosê to remove the bias. In reality though, we do not know the underlying thermal noise and the foregrounds. In order to remove the noise bias, we calculate N by simulating 1000 realisations of the thermal noise using the same N . Here, N is assumed to be a known quantity, which is the case for our simulation. In real observations, a good estimation of N can be obtained by calculating the fluctuations of the Stokes V visibility data on long baselines (e.g. Paul et al. 2021).
For each realisation, we pass the visibility data to the same imaging pipeline to generate the image cubes. For each pixel in the image cube, we then calculate the average frequency-frequency correlation across all realisations to obtain an estimation of N. We bin the resulting noise bias tr[NE d ] into cylindrical -space and show the thermal noise power spectrum in Fig. A1 for the case with 360 h of integration time. The vertical stripes follow the baseline densities on different scales. The covariance of the foregrounds can be extracted from the GPR fitted kernel K fg . Note that, since we work on the frequencyfrequency covariance, K fg E d is the same for each pixel, and therefore there is no ⊥ dependence of the bias term as shown in the right panel of Fig. A1. We note that this is not a result of GPR but the result of our simplified quadratic estimator formalism. Nevertheless, it gives us a good estimation of the order of magnitude of the GPR bias correction. As one can see, the correction is at least 2 orders of magnitude smaller than the H signal shown in Fig. 15, and therefore this bias correction is negligible in our case. Finally, we comment on the fact that in the power spectrum estimation, the 2-D Fourier transform shown in Eq. (A2) is applied to the data before the GPR removal R, while the GPR fitting for the kernels are done on the original data vector before the transform. These two operations are commutable, as the GPR removal only operates along the frequency direction, independent of the 2-D Fourier transform on the transverse plane. We verified that there is no visible difference in the resulting power spectrum if these two operations are swapped. Performing the 2-D Fourier transform first allows us to only construct the estimator one pixel at a time, providing massive speed-up.

APPENDIX B: CAVEATS OF THE SIMULATIONS
We discuss the limitations of our simulation settings. Specifically, we quantify the effects of limited (10.5 deg) 2 sky area for the input signal, coupled with the instrument beam which gets cut off at 1% at the 10.5 deg angular extent. Furthermore, we discuss the Gaussian calibration errors simulated in terms of its structure in frequency.
The primary beam of the instrument is shown in Figure B1. Around the centre (10.5 deg) 2 region, the beam only goes down to 10 −2 , introducing sharp features in the simulation. We first note that, as discussed in Section 4, the image power spectrum does not show a clear wedge structure due to the small image size. To investigate the chromatic structure of the data, we instead calculate the delay power spectrum directly from visibility and present it in Figure B2. As shown in the top panel, the full foreground delay power spectrum shows a clear wedge structure. Above the wedge, the effect of sky signal getting cut off at 10.5 deg can be seen as the diamond-shape structures. Assuming the bright sources are removed as described in Section 2, we calculate the delay power spectrum of the residual foregrounds shown in the centre panel of Figure B2. The chromatic features disappear as there is no bright emission coming from the beam sidelobes. Finally, we also present the delay power spectrum of the H signal in Figure B2 to show that the sky cut-off does not affect the H simulation.
The calibration errors for SKA-Low observations are likely smooth in frequency , which are not the Gaussian random fluctuations we use. Using the delay power spectra, we then investigate the assumption of the Gaussian gain errors described in Section 3. For comparison, we simulate another type of error that follows the sine function with a period of 15 frequency channels. The errors are then rescaled so that the standard deviation across the channels is  10 −4 . The sine errors are then compared against the Gaussian errors as shown in Figure B3.
For the sine error case shown in the right panel, the foreground wedge gets lifted into higher delay. Comparing to the Gaussian error case in the central panel, the leakage still concentrates around relatively low delay. This means that the foreground contamination can be easier to remove for GPR, as its structure has large frequency intervals. In the Gaussian case, however, the scatter of the foreground power into higher delay is visible across all scales. The foreground contamination is at the smallest frequency interval, which is difficult to distinguish from the H signal. Therefore, we conclude that the conclusions reached in Section 5.2 are robust, as the foreground contamination is not an optimistic case.
We emphasize that, the smooth frequency structures of the gain errors pose other challenges in sky modelling and continuum subtraction, which are beyond the scope of this paper and left for future work. This paper has been typeset from a T E X/L A T E X file prepared by the author.