Deep learning approach for identification of Hii regions during reionization in 21-cm observations – II. foreground contamination

The upcoming Square Kilometre Array Observatory (SKAO) will produce images of neutral hydrogen distribution during the epoch of reionization by observing the corresponding 21-cm signal. However, the 21-cm signal will be subject to instrumental limitations such as noise and galactic foreground contamination which pose a challenge for accurate detection. In this study, we present the SegU-Net v2 framework, an enhanced version of our convolutional neural network, built to identify neutral and ionized regions in the 21-cm signal contaminated with foreground emission. We trained our neural network on 21-cm image data processed by a foreground removal method based on Principal Component Analysis achieving an average classification accuracy of 71 per cent between redshift 𝑧 = 7 to 11. We tested SegU-Net v2 against various foreground removal methods, including Gaussian Process Regression, Polynomial Fitting, and Foreground-Wedge Removal. Results show comparable performance, highlighting SegU-Net v2 ’s independence on these pre-processing methods. Statistical analysis shows that a perfect classification score with 𝐴𝑈𝐶 = 95% is possible for 8 < 𝑧 < 10. While the network prediction lacks the ability to correctly identify ionized regions at higher redshift and differentiate well the few remaining neutral regions at lower redshift due to low contrast between 21-cm signal, noise and foreground residual in images. Moreover, as the photon sources driving reionization are expected to be located inside ionised regions, we show that SegU-Net v2 can be used to correctly identify and measure the volume of isolated bubbles with 𝑉 ion > ( 10 cMpc ) 3 at 𝑧 > 9, for follow-up studies with infrared/optical telescopes to detect these sources.


INTRODUCTION
Radiation emitted by the first luminous sources drastically influenced the chemical composition and thermal history of the intergalactic medium (IGM), transitioning the Universe from an initial cold and neutral state to a final hot and ionized state (e.g.Furlanetto et al. 2006;Ferrara & Pandolfi 2014;Choudhury 2022).These sources most likely formed at locations where dark matter structures collapsed into gravitational bound structures during redshift 𝑧 ≳ 10 (Abel et al. 2001;Bromm et al. 2009;Pawlik et al. 2011).The newly launched James Webb Space Telescope (JWST) 1 is already providing preliminary results by detecting possible ionizing source candidates at these high redshifts (Castellano et al. 2022;Naidu et al. 2022;Bakx et al. 2022), which will help us understand the conditions for early galaxy formation (e.g.Boylan-Kolchin 2022; Hütsi et al. 2023;Dayal & Giri 2023).
Another way to probe the appearance of these first luminous sources is to observe the evolution of neutral hydrogen (Hi) in the IGM.The ground state spin-flip transition of neutral hydrogen produces a signal with a wavelength of 21 cm in the rest frame, known as the 21-cm signal.The presence of this signal is directly correlated with the number density of neutral hydrogen present in the early Universe, and with the Universe expansion, the 21-cm signal wavelength redshifts into the radio frequency.As the first stars and galaxies formed and began emitting ultraviolet radiation, they started to ionize neutral gas in their surrounding.These primordial sources produce enough photons to escape their hosting environment and propagate into the IGM.As the hydrogen in the IGM becomes ionized, the intensity of the 21-cm signal decreases.Therefore, by observing the 21-cm signal from the early Universe with radio telescopes, we can study the reionization process and learn about the properties of the first luminous sources (e.g.Madau et al. 1997;Furlanetto et al. 2006).Several radio experiments, such as the Low-frequency Array 2 (LO-FAR; e.g.Mertens et al. 2020;Ghara et al. 2020), Murchison Widefield Array3 (MWA; e.g.Trott et al. 2020;Ghara et al. 2021) and Hydrogen Epoch of Reionization Array4 (HERA; e.g.The HERA Collaboration et al. 2022b,a), have been trying to detect this signal during the epoch of reionization (EoR).
Currently, the low-frequency band component of the Square Kilometre Array5 (SKA-Low; e.g.Mellema et al. 2013), which will observe the sky at a frequency range between 50 and 350 MHz, is under construction.SKA-Low will have a field of view covering ∼(10 deg) 2 on the sky (Koopmans et al. 2015).This radio interferometer will be sensitive enough to capture the evolution of the IGM during EoR with images of the 21-cm signal from redshift  = 30 to 5.This sequence of 21-cm maps observed at different frequencies will be stuck together to constitute a three-dimensional set of data, known as the multi-frequency tomographic dataset (e.g.Mellema et al. 2015;Wyithe et al. 2015;Giri et al. 2018a).The 21-cm signal image data produced by the SKA-Low will contain imprints of the ionised regions (or bubbles) caused by the luminous sources (Giri et al. 2018a,b) and neutral regions (or islands) tracing the cosmic voids (Giri et al. 2019).By detecting these bubbles, we can learn about the locations of the first luminous sources (Zackrisson et al. 2020).We can also understand the nature and distribution of the photon sources driving the reionization process by studying the evolution of their sizes and morphology (e.g.Giri et al. 2018aGiri et al. , 2019;;Giri & Mellema 2021;Kapahtia et al. 2019Kapahtia et al. , 2021;;Gazagnes et al. 2021;Elbers & van de Weygaert 2022).However, detecting these ionised bubbles in radio telescope observations is not trivial due to several limitations of the telescope, such as the limited resolution and instrument noise.
To detect these bubbles, previous authors have developed methods using visibilities data smoothed with appropriated filters to represent the sizes and shapes of the bubbles, then a likelihood for Bayesian approach estimates the parameters of the ionized regions filtered (e.g.Datta et al. 2007;Ghara & Choudhury 2020).Other authors employ the image data of radio telescopes.This approach can be intensity-based, where the method filters the image based on a threshold value or region-based, by agglomerate clustering correlated pixels into groups with common traits within the image (e.g.Achanta et al. 2012;Mehra & Neeru 2016;Giri et al. 2018b).This task is a well-known assignment in Artificial intelligence (AI) called segmentation.Therefore, another approach would be to consider a deep learning application.Recent work by Gagnon-Hartman et al. (2021) demonstrated that a combination of foreground avoidance and machine learning techniques enable 21-cm segmentation and bubble detection for experiments that are not necessarily optimized for imaging.Moreover, recently, we presented our first work (see Bianco et al. 2021, hereafter Paper I), where we introduced a deep learning approach to identify the distribution of Hi regions in SKA 21-cm tomographic image using a U-shaped convolutional neural network (U-Net) (Ronneberger et al. 2015).We named our framework SegU-Net and we assessed how this network could process 21-cm images during the EoR contaminated by systematic noise simulated for SKA-Low and segment the images into ionized and neutral regions with an average of 87% accuracy for redshift between 7 and 9.Moreover, we assessed that our network outperforms the Super-Pixel method (Giri et al. 2018a), considered the state-of-the-art algorithm for EoR segmentation, with, on average, 10 to 20% more accuracy.We also demonstrated that SegU-Net could be used to recover the bubble size distributions with a relative difference within the 5% and other summary statistics with the same level of accuracy.Moreover, we provided our method with a per-pixel uncertainty map that provides a confidence interval for its prediction and the derived statistics.We have tested the response of our framework to different noise levels based on a shorter (250 h) and more extended (1500 h) observing time, corresponding to an under-and overestimation of the noise level, respectively.We demonstrated that SegU-Net tolerates noise up to √ 2 times larger than the one employed in the training process, obtaining the same level of accuracy.By studying the uncertainty map and the response to the noise level, we realised that machine learning models are sensitive to the dynamic range and the intrinsic resolution of the simulated images.
While our previous work demonstrated excellent performance in detecting Hi regions from EoR images, it should be considered a proof-of-concept as we consider EoR images with only telescope systematic noise, and we did not include any foreground contamination.The biggest challenge for the SKA-Low observation, just like other radio telescopes, is to separate the 21-cm signal from the undesired extra-galactic and galactic foreground contamination, which outshine the cosmological signal by several orders of magnitude (Jelić et al. 2008;Bowman et al. 2009).The key goal of this work is to develop tools which remove these foregrounds while recovering the regions of Hi during EoR from the 21-cm signal image data.
In this work, we will further develop our deep learning-based method to determine the ionised bubbles in image data with the presence of realistic galactic and extra-galactic foregrounds expected from the SKA-Low.Therefore, here we present SegU-Net v2, which extends the previous work by including foreground emissions of galactic origin and a complete study of its dependency on the foreground mitigation pre-processing step that partially subtracts the foreground signal, thus reducing the dynamic range in the 21-cm images before starting the network training.In the last three decades, several foreground removal methods with different approaches have been developed.Some of the early attempts take advantage of the spectral smoothness of the galactic and extra-galactic contaminants to fit along the line of sight and remove the foreground in either real or  space (e.g.: Morales et al. 2006a,b;Wang et al. 2006;Gleser et al. 2008;Liu et al. 2009b;Wang et al. 2013).However, more recent approaches suggest a non-parametric subtraction (e.g.Harker et al. 2009;Gu et al. 2013;Chapman et al. 2012Chapman et al. , 2013;;Bonaldi & Brown 2015;Mertens et al. 2018) as the frequency smoothness of the foreground spectrum can be corrupted by beam effect and incomplete  coverage (Liu et al. 2009a).Therefore, we perform a complete study of different available approaches for foreground subtraction in the case of the SKA-Low 21-cm tomographic dataset applied to SegU-Net v2.We analyse the effect of the subtraction process on the predicted binary maps so that we can establish if a particular foreground removal method provides a concrete advantage for our task.
This paper is organised as follows.In § 2, we present how we generate the simulated data sets used for this work, including details of our foreground model in § 2.3 and a description of the mock 21-cm observation in § 2.4.In § 4, we describe the design and the training of our neural network.In § 5, we discuss its application to our simulated SKA-Low data sets contaminated by the foreground signal, and we analyse summary statistics such as the mean ionization fraction, power spectra and topological quantities.In § 5.2 we test our framework on a different foreground removal method.We discuss and summarize our conclusions in § 6.Throughout this work, we assume a flat ΛCDM cosmology with the following parameters: Ω Λ = 0.73, Ω  = 0.27, Ω  = 0.046,  0 = 70 km s −1 Mpc −1 ,  8 = 0.82,   = 0.96.These values are based on the WMAP 5 years observation (Komatsu et al. 2009) and consistent with Planck 2018 (Planck Collaboration et al. 2020) results.

21-CM SIGNAL
This section illustrates the process we follow to create 21-cm mock observations of the EoR.Development of the network requires mock 21-cm observations of the EoR for network training, validation and testing, which will be described in § 4.

Simulating the Cosmological 21-cm Signal during EoR
The intensity of the redshifted 21-cm signal emerging from a neutral cloud of hydrogen can be observed by a radio interferometric telescope as the difference against the CMB temperature  CMB , i.e.   ≡   −  CMB .For a given sky angular position n n n and redshift , we can define it to be (e.g.Zaroubi 2012;Mellema et al. 2013) where  HI is the neutral hydrogen fraction,  b is the baryonic overdensity, and  S is the spin temperature.We assume that the IGM is heated well above the CMB temperature ( S ≫  CMB ) at  ≲ 12, which is consistent with theoretical predictions (e.g.Pritchard &  Furlanetto 2007; Ross et al. 2017, 2019, 2021) 6 .In this context, Equation 1 is always positive and can be approximated as   ∝ (1 +   )  HI , while the presence of ionized regions is characterized by a lack of signal,   = 0 mK.The radio interferometer cannot observe the absolute   .Therefore, the ionised regions cannot be identified by finding pixels with zero signal in the 21-cm image data.To model the large-scale cosmological 21-cm signal expected during reionisation, we employ the Python wrapper of the 21cmFAST semi-numerical code (Mesinger et al. 2011;Murray et al. 2020).The code models the dark matter density evolution and gravitational collapse using the second-order Lagrangian perturbation theory (2LPT).From the generated large-scale density field, a region is considered collapsed when it exceeds a defined mass threshold, which can be related to a minimum virial temperature  min vir .The excursion set formalism is then employed to calculate ionised regions (Furlanetto et al. 2004).The code outputs a coeval cube at different redshifts that are then used for constructing 21-cm lightcones.We refer the readers to e.g.Giri et al. (2018a) for more general details on the construction of lightcone from coeval cube simulations.In this work, we simulate the signal in coeval cubes for a total of ∼ 20 snapshot for redshift  = [7, 11] with a mesh grid of 128 3 that is 256 Mpc along each direction.

Systematic Noise
We model the SKA-Low antenna receiver noise by a random Gaussian distribution with mean value zero and variance (Ghara et al. 2017;Giri et al. 2018b) Here  int is the integration time,  daily is the window of observation per day,  sys is the system temperature,  eff is the effective collecting area, Δ is the bandwidth,  uv is the number of measurements that are detected in each cell of the uv-coverage grid.We assume an observation length of  obs = 1000 h.We list the SKA-Low telescope parameters in Table 1.The uv-coverage grid is simulated assuming the current plan for antennae distribution of SKA-Low7 .In the topright panel of Figure 1, we show an example slice of the 21-cm signal and a noise realisation at  = 8.24.As the map is degraded to a resolution corresponding to a maximum baseline of  = 2 km, we can see the large-scale distribution of the neutral and ionised regions.

Foreground Contamination
Between 250 and 30 MHz, the dominant emission comes from the Galactic synchrotron radiation.This emission alone is expected to contribute to the majority of the total foreground contamination of the comic 21-cm signal (Di Matteo et al. 2002Matteo et al. , 2004;;Santos et al. 2005;Datta et al. 2007;Jelić et al. 2008;Kerrigan et al. 2018).Other contributors can include emissions from unresolved extra-galactic point sources, Galactic free-free emissions, supernova remnants and extragalactic radio clusters, which, for simplicity, have been neglected in this study.We based our Galactic synchrotron emission model on the Choudhuri et al. (2014) study.We express the foreground radiation with a Gaussian random field with an angular power spectrum as Here, the parameter for the Galactic synchrotron emission is the power spectra amplitude  150 = 512 mK 2 at the reference frequency  ★ = 150 MHz, the angular scaling  = 2.34, the spectra index  syn = 2.8 and the running spectral index Δ  = 0.1.These quantities are taken from Platania et al. (1998), andWang et al. (2006).We then generate the foreground temperature fluctuations map following the relation the Galactic synchrotron emission added to the 21-cm signal with the systematic.We can notice how the dynamic range is a few orders of magnitude larger and completely outshines the 21-cm signal.For all the differential brightness images, the units are in mK.
Ω SKA is the total SKA-Low solid angle and  = /2.The two quantities   and   are independent random Gaussian variables with mean zero and variance of one, N ∼ (0, 1).By performing twodimensional inverse fast-Fourier transform of Equation 5, we get the spatial distribution of the foreground contamination  frg  ( n n n, ).With each lightcone simulation, we fix the random variables seed for the lowest redshift,  = 7, and compute Equation 4 for the corresponding frequency of the image.

Mock 21-cm Observation
From the simulated coeval cubes described in §2.1, we create 3D lightcones with differential brightness  sim  ( n n n, ) ≡  sim  (, , ) at ,  coordinates for a total box size of 256 cMpc and spatial resolution of Δ = 2 cMpc, both in comoving units, corresponding to an angular mesh-size of 128 2 .This scale corresponds to an angular resolution of Δ = 0.77 arcmin at redshift  = 7.The redshift coordinate is divided into 552 bins at equal comoving distance Δ from  = 11 to 7, corresponding of frequencies from  obs = 118 MHz to 178 MHz and a frequency resolution of approximately Δ ≃ 0.11 MHz.
We select one tomographic simulation from the prediction dataset as our fiducal simulation.In Figure 1, left column, we show a slice of this fiducial lightcone at redshift  = 8.24, corresponding to  obs = 152.90MHz.At this stage, the simulated lightcone is 50% ionised.The top panel show the neutral fraction  HI , with blue and red regions being the neutral and ionised regions, respectively.At the same time, the green colour indicated regions of transitions with  ≃ 0.5.The differential brightness is calculated with Equation 1 with the approximation discussed in §2.1.
From radio interferometry telescope, we can obtain images by gridding the uv-plane and inverse Fourier transform the gridded visibility (Smirnov 2011;Offringa et al. 2014).Image weighting can be applied to the visibilities before the gridding, and in the case of large-scales 21-cm EoR experiment with SKA-Low, the so-called natural weighting is preferable as the more redundant, short baselines ensures the highest signal-to-noise ratio in the image at the expense of a limited image resolution and large side lobes effect (Briggs 1995).In our case, we do not simulate the 21-cm signal from the visibility space but instead work on images already in the real space.Therefore, to mimic the effect of the limited resolution due to the visibility weighting, in the angular direction, we apply a Gaussian kernel,  ( n n n, ), with Full-Width at Half Maximum (FWHM) of 21 cm(1 + )/, where  = 2 km that corresponds to the maximum baseline of SKA-Low.According to the planned SKA-Low design 8 , it will be densely filled within this 2 km providing enough sensitivity to construct images.The bottom panel in Figure 1 shows the differential brightness after smoothing the field with  ( n n n, ).For reference, this interferometric smoothing corresponds to an angular resolution of ∼ 2.9 arcmins at  ≈ 7 and ∼ 4.3 arcmins at  ≈ 11.In the frequency direction, we apply a top-hat bandwidth filter with the same width as the FWHM in the angular direction.We implement the method explained in §2.2 and the parameters listed in Table 1 to simulate the effect of the systematic noise,  noise  ( n n n, ).We create a random field with the same mesh size as the lightcone and add the simulated differential brightness.We then apply the same interferometric smoothing mentioned above, and the result is shown in Figure 1, top right panel.As a reference for the reader, this was the network input in our previous work (Paper I).
In this paper, we want to extend our previous effort as we want to recover the neutral binary map in the presence of contamination due to the synchrotron Galactic foreground,  frg  ( n n n, ).The result of the model described in §2.3 is shown in Figure 1, bottom right panel.As we can see, the dynamic range of the observed changes drastically.Our previous work showed that our method is sensitive to the SNR level between the noise and the 21-cm signal.Therefore, we need to introduce an additional pre-processing step in our framework to mitigate foreground contamination and decrease the dynamic range of the contaminated images before providing them for network training.We will discuss this method in more detail in §3.
We can describe our mock observation pipeline by combining the components and operations described here above as (e.g.Liu & Shaw 2020) For each realization of the lightcone  obs ( n n n, ), illustrated with Figure 1, we calculate the mean along the frequency channels, where   and   are the dimension in the angular-direction of the 128 2 mesh.We subtract this quantity from  obs to account for the effect of the null baseline in interferometry telescopes.For this reason, the colour bar in the figure shows a negative value.W convolve the subtracted term with the Gaussian kernel  mentioned above (8) This result constitutes a realistic mock observation of the SKA-Low interferometric telescope, including systematic noise, Galactic foreground contamination, and telescope limited resolution effect.We employ this pipeline to create the training, validation and random testing set.In §3, we explain how we pre-process this type of data before inputting it into our neural network.
Finally, we create an additional field that serves as the target of the network training.We apply the interferometric smoothing explained 8 The construction document can be found at https://www.skao.int/en/resources/402/key-documents.above to the simulated neutral fraction field  HI (top left panel Figure 1).We then choose a threshold of  th = 0.5 to discern the ionised and neutral regions.The result is a binary lightcone,   HI ( n n n, ), where neutral and ionized regions are classified by 1 and 0, respectively.For a visual comparison, we over-plot the contour of this binary field as a black line in Figure 1 top right panel.

FOREGROUND MITIGATION
As we outlined in §2.4,foreground contamination poses a huge problem in detecting the 21-cm signal, as this signal is several orders of magnitude fainter in comparison.In Figure 2, we illustrate the effect of the foreground contamination on the 2D cylindrical power spectrum for a lightcone sub-volume centred at redshift   = 8.24 and frequency width of Δ ± 10 MHz.This quantity of the 21-cm signal (top panel) is compared with the same signal contaminated by the Galactic foreground signal (bottom panel).We observe that the contamination is visible at  ∥ ⩽ 10 −1 Mpc/h with signal intensity of ⩾ 10 9 mK 2 .The black dashed line in the figure indicates the foreground wedge.We will discuss this line later in §3.2.To reduce the dynamic range of the foreground contaminated images to a level that is manageable for the neural network, we include a pre-processing step on the observed data,   obs ( n n n, ).Hereafter, we refer to the re- sulting images of this pre-process as residual lightcone or images,  res ( n n n, ).
In foreground mitigation, we can consider two methods: foreground subtraction or avoidance (Chapman & Jelić 2019).Here, we consider three of the former cases, namely PCA, GPR and Polynomial fitting, and one of the latter techniques, Wedge removal.In this section, we briefly describe four different pre-processing methods that we test and we provide the residual image in Figure 3 for each method.The top panels show the residual image of the example illustrated in Figure 1, while black contours indicate the ground truth.The bottom panel shows the 2D cylindrical power spectrum for the fiducial lightcone sub-volume centred at   = 8.24 and frequency depth of ±10 MHz.

Principal Component Analysis
Principal Component Analysis (PCA) is a commonly used method to remove foregrounds in 21-cm experiments (e.g.Alonso et al. 2015;Cunnington et al. 2023;Chen et al. 2023a).The method exploits the fact that foregrounds have large amplitude and smooth frequency coherence.PCA simultaneously identifies the largest foreground components and an optimal set of basis functions that describe the frequency structure of the foregrounds.As the foregrounds are highly correlated in frequency, the frequency-frequency co-variance matrix of the foregrounds will have a particular eigensystem where most of the information can be sufficiently described by a small set of very large eigenvalues, the other ones being negligibly small.Thus, we can attempt to subtract the foregrounds by eliminating the components corresponding to the eigenvectors of the frequency co-variance matrix with the largest associated eigenvalues.In practice, we remove 4 components, which captured most of the variance of the foreground modes.PCA is a relatively fast and computationally efficient method that requires no prior assumptions about the foregrounds or the 21cm signal.However, PCA is not well-suited to handle non-linear relationships between the foregrounds and the 21-cm signal, and it can struggle to remove residual foregrounds not well-described by the largest components.
In Figure 3, left column, we show the residual image at   = 8.24, on top.After removing the first four components with PCA decomposition on the 20 MHz sub-volume of the fiducial lightcone, we obtain this image.On the bottom panel, we show the corresponding 2D power spectra.

Wedge Remove
We consider another pre-process that focuses on discarding the Fourier modes dominated by foreground contamination.This method assumes that the contaminated modes are contained in specific regions in the  ⊥ −  ∥ space, named the foreground wedge.These contaminated -modes can be defined by (e.g.Liu et al. 2014;Murray & Trott 2018) where  () is the Hubble parameter and k ⊥ is the Fourier component perpendicular to the line of sight. is the angular size of the field of view, which can be interpreted as the horizon limit angle. is the bias that accounts for the presence of an intrinsic foreground limit at low  ∥ -values.Pessimistic and arguably more realistic assumptions consider the horizon limit to  = 90 • justified by antenna side-lobes effect (Pober et al. 2014;Dillon et al. 2014).In our case, we select  = 2.25 • , corresponding to the field of view (FoV), at redshift  = 7 and comoving size of 256 cMpc, of our dataset.We then select  = 8 × 10 −2 h Mpc −1 based on the 2D cylindrical power spectrum shown in the right panel of Figure 2. The dashed black line indicates Equation 9 for the  and  mentioned before.
In this work, we employ a simplified version of the code developed by Prelogović et al. (2021).Here we give a brief description, referring the reader to the original paper for more details.First, we perform a 2D Fourier transform in the angular direction of a lightcone sub-SegU-Net v2 7 volume, Equation 8, centred at redshift   and with a given frequency depth, ±Δ.Subsequently, an iterating procedure along the line-ofsight axis calculates Equation 9 and sets the -modes that obey the condition to zero.A Blackmann-Harris taper function of the same angular and redshift size is multiplied by the lightcone to avoid artificial ringing in the Fourier space.However, this taper has the limitation that at low  ∥ , it reduces the Fourier-space side lobes, while the opposite effect occurs at high  ∥ .Finally, we do an inverse Fourier transform to regain the real-space lightcone sub-volume.
An example of data with the foreground contamination removed by this algorithm can be seen in the second column of Figure 3. From the residual image (top panel), we see a large portion of the foreground residual is still present.The bottom panel shows the 2D cylindrical power.The dark blue colour indicates the  ⊥ −  ∥ modes where the wedge removes method is applied.

Gaussian Process Regression
The Gaussian regression processes (GPR) method was developed in Mertens et al. (2018) to separate foregrounds from 21-cm signal by modelling the two components as a stochastic process and separating them using a Bayesian approach.The method involves constructing a prior statistical model of the foregrounds and the 21-cm signal and then using the model to estimate the posterior distribution of the 21-cm signal given the observed data.This is done by assuming that the foregrounds and 21-cm signals are realizations of Gaussian processes, fully defined by their covariance.The selection of the prior covariance model in GPR is made under a Bayesian framework by maximizing the marginal likelihood.The Matérn class of covariance functions is commonly used as prior covariance for the different data components.Following Mertens et al. (2018), a Radial Basis Function (RBF) kernel is used as the prior covariance model for the foreground component, while an Exponential kernel is used for the 21-cm signal.This method can effectively remove foreground contamination from the 21-cm signal and has the advantage of being able to incorporate prior knowledge about the signal and foregrounds.However, it requires accurate modelling of the foregrounds and assumptions about the statistical properties of the signal and foregrounds.
In Figure 3, third column, we show the result obtained by the GPR presented here above.Similar to PCA, see §3.1, GPR removes a good portion of the foreground contamination providing a better contrast between the 21-cm emitting regions and the ionized one.For instance, the regions around (, ) = (225, 100)Mpc and (, ) = (150, 200)Mpc.From the 2D power spectra at  ∥ > 3 × 10 −2 we see more signal when compared to PCA pre-process data.

Polynomial fitting
We can also use Polynomial fitting to remove foreground contamination from the 21-cm signal (Wang et al. 2006;Alonso et al. 2015).The method involves modelling the foregrounds as a smooth polynomial function in log space and fitting this function to the observed data,   obs .
Here,  0 is the 21-cm frequency and  fg indicates the polynomial degree.In our study, we consider a fourth-degree polynomial.The resulting fit is then subtracted from the data to remove the foreground contamination  res =   obs −  ( n n n, ).
This approach has the advantage of being simple and computationally efficient but may not be as effective at removing foregrounds as other, more sophisticated methods.One limitation of the polynomial fitting is that it assumes the foregrounds can be well-described by a smooth polynomial, which may not always be the case (e.g.Thyagarajan et al. 2015).Additionally, if the polynomial fit is not in high enough order, it may leave some foregrounds in the data, while an overly high-order polynomial may also remove the signal.The polynomial fitting has been combined with other foreground removal methods in some studies to improve the overall performance of the foreground removal process.
In Figure 3, fourth column, we show the result obtained by the Polynomial fitting.In both cases, from the residual image and the 2D power spectra, visually, we see similar results to GPR, see §3.3, with a more considerable difference between the positive (neutral) and negative (ionized) regions in the residual image, although presenting the same level of residual foreground located at (, ) ∼ (80, 125) Mpc as in the other methods.

U-NET FOR 21-CM IMAGE SEGMENTATION
The network architecture of SegU-Net v2 is the same as in Paper I. The only implementation consists of a simplistic hyper-parameter optimization analysis on seven network hyper-parameters.In §A, we give a brief overview of the hyper-parameter space exploration method we employed and in Table A1, we list the six best-performing setups we found.Moreover, in §B, we present a first attempt to open the black box and performed a Gradient-weighted Class Activation Mapping (Grad-CAM) (Selvaraju et al. 2019) importance analysis to highlight the features in the input image that the network employs to identify and predict the neutral regions from residual images.In Figure B1, we give a visual representation of the Grad-CAM importance analysis we performed.

Network Architecture
Here, we give a brief description of our network architecture.We refer the reader to our previous work for more details.SegU-Net is a U-shaped deep convolutional neural network composed of a contracting (encoder) and an expanding path (decoder).The former has two convolutional blocks, followed by the 2D averaging pooling operation of size 2 2 and a dropout layer with a 5 per cent rate, Encoder-Level=2*ConvBlock+AvrgPool+Drop.A convolutional block consists of a 2D convolutional layer with kernel size 7 2 , followed by batch normalization and Rectified Linear Unit (ReLU) activation function, ConvBlock=Conv2D+BN+ReLU.The latter path consists of transposed 2D convolution followed by the concatenation with the corresponding output of the convolutional encoder block, dropout layer and two convolutional blocks, Decoder-Level=TConv2D+CC+Drop+2*ConvBlock.This structure is repeated four times for both the encoder and decoder.At each level, the pooling operation halves the angular dimension of the input and doubles the number of channels.The network takes as input a redshift slice from the residual lightcone,  res , and outputs the corresponding 2D binary image,   HI .

Dataset
We generated a large set of realisations of the SKA multi-frequency tomographic dataset by changing the initial conditions and the following three astrophysical parameters.We sample the high-redshift galaxy efficiency  and the mean-free path of ionising photons  mfp with a normal distribution with mean and variance N (82, 18) and N (17.5 Mpc, 4.5 Mpc), respectively.At the same time, the minimum virial temperature for star-forming halos  min vir is sampled in logarithmic space with distribution N (4.7, 0.2).We chose this sampling of parameters because we want the global volume-averaged neutral fraction  HI of all data to be at least greater than 90% at redshift  = 11 and less than 10% at redshift 7.Moreover, with this parameter sampling, we can postulate the spin-saturation assumption,   ≫  CMB , which assures that the differential brightness is strictly positive and that neutral hydrogen is correlated with a positive signal in each image.
In this work, we updated the dataset from Paper I for a total of 10,000 samples for the network training and 1,500 for validation.Once the network is trained, we will test its accuracy and generalisation ability on an additional 300 mock observations during the prediction step.We will refer to this dataset as the random testing set.The training dataset is employed during the forward-and backpropagation (Rumelhart & Zipser 1985), while the validation dataset is used to validate the accuracy of network results during training.We want to clarify that we trained SegU-Net v2 on  res data preprocessed only with the PCA eigen-decomposition on the full redshift range,  = 7 to 11, which is explained in §3.1.The testing dataset is an independent set of simulations on which we will validate the final results of the trained network.

Metrics
We consider a true positive detection ( ) to be the number of pixels correctly identified as neutral, while a true negative ( ) is the opposite.False positives () and false negatives ( ) represent the number of pixels wrongly classified as neutral or ionised.Therefore, we can define the Matthews correlation coefficient (MCC) for quantifying the accuracy of our network predictions as Here, this metric indicates how well a model is able to predict the target variable correctly.
This second metric refers to the level of consistency or repeatability of a predicted value.While accuracy and precision are important metrics in evaluating the performance of a network, they may not be sufficient in certain scenarios.For instance, in our binary classification problem, there can be scenarios when neutral regions can be much rarer than ionised regions and vice versa.In this case, accuracy can be misleading as the model may achieve high accuracy by simply predicting the majority class for all instances.Precision and recall are more informative metrics in such cases as they consider the class imbalance.
However, here, we include the third additional metric, the Intersection over Union (IoU), that quantifies how well the predicted neutral region of interest overlaps with the true one.We will use these metrics later in §5.2.
In our case, we are targeting binary maps that indicate the location in the sky at a given redshift as either neutral or ionized.Therefore, an easy way for the reader to interpret the results is in the number of pixels guessed correctly or wrongly.For this reason, we introduce the false positive rate (FPR), also referred to as non-specificity, and the true positive rate (TPR), also known as sensitivity.
The former quantity gives the percentage of neutral pixels (positive case in our context) correctly identified as neutral.A value of   = 1 will indicate that the network identified all the neutral pixels correctly.Otherwise, 1 −   indicates the percentage of pixels falsely classified as ionized.Similarly, the  gives the percentage of pixels falsely detected as neutral.

Per-Pixel Error Estimation
The error calculation uses the same method as in Paper I. In the prediction step, we employ temporal time augmentation (TTA) operations (Perez & Wang 2017;Wang et al. 2020) on the network input data to create several copies of the same realisation but that we modify by rotating and vertical/horizontal flip operation.In this work, we fix the axis of symmetry and rotation to the frequency direction.Thus, the number of manipulations was reduced to a sample of 16 copies.This number corresponds to the maximum independent operations we can apply to an image.SegU-Net v2 then gives a prediction for each modified copy that is then rotated or flipped back to obtain a different prediction of the same input image.We calculated the standard deviation,    , on the 16 copies and obtained a per-pixel uncertainty map as shown in Figure 4, bottom panel.The method is simple but efficient, showing how difficult it was for the network to give the predicted binary field for each pixel in the image.

RESULTS
This section discusses the result obtained with SegU-Net v2 acting on data pre-processed with the PCA foreground removal method as explained in §3.1.Here, we evaluate the result on the predicted binary maps and the network performance on the different methods (illustrated in §3) in §5.1 and §5.2 respectively.Finally, in §5.3, we demonstrate a possible astrophysical application of SegU-Net v2.

Identifying Hii Regions with SegU-Net v2
In Figure 4, we visually evaluate one realisation of the network predicted neutral (red) and ionized (blue) regions.We refer to this simulated lightcone as the fiducial simulation.In the right column, we show a slice at redshift  = 8.24 ( obs = 152.90MHz), corresponding when the global volume average neutral fraction is  HI = 0.5.From top to bottom, we show the residual image after the PCA preprocessing employed as the input of the neural network, the binary map predicted with SegU-Net v2 from the PCA pre-processed data and the derived per-pixel uncertainty, respectively.In the left column, we show the redshift evolution of the same fields along one given direction of the corresponding fields.
First, when we compare the bottom right panel in Figure 1 with  the top right panel in Figure 4, we can notice that the pre-processing step drastically reduces the signal from   ∼ ±10 5 mK to just an observed differential brightness of few tens   ∼ ±40 mK.Nevertheless, some of the foreground contamination is still visible.For instance, in Figure 4 top left panel, we can see that across a few frequency bands at  ≈ 10.8 presents an anomalous feature.Moreover, we can see that foreground residual is still present between 7 ⩽  ⩽ 8.2.This signal excess is self-evident in the per-pixel uncertainty for the same redshift range.Some frequency bands are saturated with considerable uncertainty  std ∼ 0.3.This is because the foreground component is correlated along the frequency direction and is primarily diffused over large angular scales.The foreground residuals thus observe extended features along the  direction over multiple adjacent frequency channels.From the redshift evolution of the predicted binary field (left middle panel), we notice that the network can either falsely detect bubbles when most of the lightcone is still highly neutral,  ⩾ 9.5, or completely miss ionised bubbles that are entirely surrounded by neutral hydrogen.In both cases, the mislabelling is limited to bubbles with sizes close to or smaller than the interferometric smoothing scale, Δ ∼ 9 Mpc, as the network confuses structures with small-scale noise fluctuations.Thus posing a hard limit on the possibility of measuring and detecting the smallest HII bubble close to the instrument resolution.We discuss this further in §5.3.This limitation is visible from the recovered binary field at redshift  = 8.24 (middle right panel).Here, the detection of the bubbles at 180 Mpc ⩽ x ⩽ 210 Mpc is entirely missed.We observe the same outcome for the island of neutral hydrogen at coordinates (, ) ≈ (75, 75) Mpc.These erroneous findings are associated with a moderate to high uncertainty  str ⩾ 0.2.As we mentioned above, the per-pixel uncertainty shows that at the early stage of reionization,  > 9, most of the uncertainty is either situated around small HII volumes,  ⩽ (10 3 Mpc) 3 , or at the border between neutral and ionised regions.On the other hand, at the late stages,  < 8.2, high uncertainty is mostly located in the vast, interconnected ionised IGM.
In Figure 5, we show three statistical analyses for the entire random testing set.In the left panel, we show the correlation plot between the true global averaged neutral fraction xHI,true against the predicted xHI,pred .The dashed green line indicates the 95 per cent data contour, corresponding to a 2 difference from the ground truth.The 2 contour clearly shows a deviation on the left-hand side of the black dashed line (perfect correlation), indicating that the predicted images tend to be considered more neutral than they should be.This trend is more visible at lower redshift  < 8.5 ( xHI,true < 0.4) as more points reside outside of the 95% percentile.This behaviour can be motivated by the presence of residuals from the foreground that the PCA process could not remove.As we mention in §3.1, we consider the first four components to contain most foreground information.These components are most representative at higher frequency as the foreground amplitude increases inversely proportional to redshift, Equation 4. Therefore, for tomographic data with a wide redshift range, the decomposition can under-represent foreground contamination at lower redshift, resulting in more residuals when we reconstruct the image from the remaining components at the corresponding redshift slices.This effect is visible in the uncertainty map in Figure 4.
In Figure 5, middle panel, we show the correlation coefficient against the same quantity as before,  HI,true .Each point corresponds to an image at a redshift indicated by the colour bar.We add the 68 per cent data contour (solid line) on this panel, corresponding to a 1 difference from the ground truth.We first noticed that we obtain a global accuracy that is approximately 15% lower, r  = 0.71, compared to our previous work in Paper I. This lower score with the same network structure and architecture is justified because any signal extrapolation in foreground contamination is extremely arduous compared to forecasting in the presence of just telescope systematic noise.Moreover, as we stated before, we notice that at lower redshift  < 8.5 ( xHI,true < 0.4), a sizable portion of the redshift slices have a difference larger than 2.This behaviour is also evident from the increase of the uncertainty map in Figure 4 for images at  < 8.5.
Lastly, in Figure 5, right panel, we show the correlation between the true positive rate ( ), also known as sensitivity, and the false positive rate (), also known as non-specificity, on the random testing set.In our case, these quantities indicate the percentage of pixels correctly labelled as neutral and the fraction of pixels mislabelled as ionized, respectively.This plot is known as the Receiver Operating Characteristic (ROC) curve, and it is a standard analysis in classification problems as it gives an intuitive overall performance of the method.The results from our network show that most of the realizations with redshift range  ∈ [7.5, 10] are located in the top-left corner, representing the ideal performance or perfect classification.This indicates that most binary maps have high sensitivity and specificity, i.e., neutral and ionised regions are correctly identified.Data points close to the diagonal line indicate that the method performance is not much better than a random classifier.In our case, this is true for the values at the extreme of the redshift range.The data points on the top-right corner have high sensitivity but low specificity, meaning that the network labels correctly neutral regions, from Equation 15, left metric,   ≪  , while misclassifying most of the ionized pixels as neutral,   ≪ .This is the case for images with  > 10; however, at this redshift, the images are mostly neutral; thus, the incorrect detection is limited to a few pixels of the image.The data point in the bottom-left represents the opposite situation where the network has high specificity but low sensitivity.This scenario indicates that the model is not able to differentiate well between neutral and ionized instances, from Equation 15, right function,   ≪   and  ≪  .We see the opposite trend as in the previous case, where images with  ∼ 7 occupy this instance.Another important quantity derivable from the ROC curve is the area under the curve ().This quantity gives an overall evaluation of the classification method.In Figure 5, right panel, we overplot four curves that represent different  scores.In our case, we can see that the network performs well as the random testing set points are mostly located above the 85% line and are well centred around the  = 95%.

Sensitivity to the Choice of Pre-processing Method
We trained SegU-Net v2 on the signal that is pre-processed using the PCA method.Therefore, it is vital to investigate how sensitive the trained model is to the pre-processing method used to mitigate foreground.Here, we test SegU-Net v2 on the foreground mitigation processes we presented in §3.We cannot use the entire lightcone as the GPR module currently available has been validated only for a bandwidth of 20 MHz.From the entire lightcone, we use three subvolume centred at redshift   = 7.68, 8.24 and 8.97 with frequency size of 20 MHz, corresponding to 172, 181 and 186 redshifts bins from  ∈ [7.19, 8.24], [7.68, 8.88] and [8.31, 9.72], respectively.The volume average neutral fraction of these sub-volumes is  HI ≃ 0.25, 0.50 and 0.75, corresponding to the late, middle and early stages of reionization, respectively.
We then apply four different foreground mitigation pre-processing steps to each sub-volume: PCA, Wedge Remove, GPR and Polynomial fitting.From the residual volumes, we predict the neutral/ionised regions from the trained SegU-Net v2, with PCA, pre-processing step as presented in §5.1.By applying different foreground mitiga-  tion processes, we can quantify the robustness and adaptability of our trained network.

Visual Evaluation
We visually compare the middle stage of reionization sub-volume for the four cases in Figure 6.From the left to right column, we have PCA, Wedge Remove, GPR and Polynomial fitting, respectively.The top panels visually compare an image at the sub-volume central red- shift   = 8.24 for the different pre-processes.In the bottom panels, we show the corresponding uncertainty map from the SegU-Net v2.
We notice that for the case of the fiducial simulation, the Polynomial fitting and GPR pre-processing obtain similar results with correlation   (  ) = 0.81 and  (  )  = 0.84, respectively.The former case appears to overestimate the extent of the neutral regions (see at position (, ) ≃ (75, 125) Mpc) as well as falsely detecting the presence of isolated neutral island in the vast ionised region, for instance, see around (, ) ∼ (75, 100) Mpc.The PCA obtains approximately 10% less accuracy,   (  ) = 0.70, its limitation comes forth when predicting the vast ionised region (see at position 50 Mpc ⩽  ⩽ 125 Mpc and 75 Mpc ⩽  ⩽ 125 Mpc) as the network is over-predicting the presence of an interconnected neutral hydrogen region.Wedge Remove method has the lowest performance, with   (  ) = 0.62.In this example, the pre-process forecasts an excess of neutral hydrogen outside the ground truth.On the other hand, this method underestimates its presence within the extensive neutral cloud.In Table 2 third column, we show the resulting   (  ) for each pre-process.
Among the methods presented, the Wedge Remove method appears to be the least efficient for SegU-Net v2.The uncertainty map in Figure 6 shows that the Wedge Remove method has high incertitude in the vast interconnected H ii regions, for  ∈ [0, 125] Mpc and  ∈ [0, 150] Mpc, as well as between nearby H i regions, for instance at (, ) ≃ (120, 160) Mpc.The presence of a higher fore-ground residual compared to the other methods (visible in the same region in Figure 3) indicates that lower performance is attributed to a harsh and perhaps undisclosed subtraction that does not aim at portraying the foreground contamination but rather removes its contribution.Overall, the GPR method, followed by PCA decomposition, appears to give an advantage compared to the other pre-processing.At the same time, all the cases fail to detect ionised or neutral regions of sizes close to the interferometric smoothing scale, Δ ≃ 9 Mpc.

Redshift Evolution
In Figure 7, we show the redshift evolution of the Matthew correlation coefficient   for the four different methods.On each panel, we show the results from the early (  = 8.97, in red), middle (  = 8.24, in green) and late (  = 7.68, in blue) stage of reionization sub-volumes with the corresponding error bar represented by the shadow area.The horizontal dashed line denotes the redshift averaged correlation coefficient,   .In Table 2 fourth column, we show the resulting   for each sub-volume and sub-volume.Based on this quantity, we notice that the ranking goes by the GPR method with   = 0.71 at   = 7.68, 0.67 at   = 8.24 and 0.63 at   = 8.97, followed by the PCA with   = 0.68, 0.67 and 0.62, respectively.Polynomial fitting follows with   = 0.65, 0.62 and 0.60, while Wedge Remove follows with   = 0.18, 0.19 and 0.15, respectively.An important remark: in SegU-Net v2 13 this comparison, we limit the PCA decomposition to the sub-volumes redshift bins (172, 181 and 186), and it is performing slightly worse when compared to the same results in the previous section on the 552 redshift bins.Therefore, we attribute the performance decrease to the reduced number of redshift bins that directly lower the number of orthogonal components with which the data are represented.For the case of PCA in Figure 7, we plot on the same panel the performance of the PCA decomposition on the 552 redshifts (dark blue line).Here, we can notice how the redshift averaged correlation coefficient is substantially higher,   = 0.82 at   = 7.68, 0.80 at   = 8.24 and 0.76 at   = 8.97, hence indicating that the PCA pre-process is preferred if we have at our disposal a tomographic dataset with an extended redshift range.The sharp increase at  ≃ 8.76, the sudden increase at  ⩾ 9 and the constant broadening for  ⩽ 8.1 of the uncertainty error in Figure 7 indicates that the PCA, GPR and Polynomial fitting are sensible to the evolution and distinctiveness of the same structures in the data.
Moreover, all processes, except for PCA, show a slight decrease in accuracy close to the redshift extremities values of the sub-volume.The Wedge removal efficiently helps recover the binary maps only for the selected sub-volume central part, close to the central redshift.While the accuracy decreases rapidly toward the edges as the foreground removal becomes inefficient, in our simplified version of the wedge removal code, we do not include the sliding trough process.See §3.2.Therefore, a comparison between the Wedge removal and the other pre-processing should be strictly limited to the sub-volume central part.

Recovered Neutral Island Size Distribution
In Figure 8, we compare the neutral island size distribution (ISD) derived from the Hi binary field predicted with the different preprocessing methods presented in §3.We employ the Mean-Free Path (MFP; Mesinger & Furlanetto 2007) method to derive the probability density distribution (/) of the neutral region sizes or radius .This size distribution measures the topological evolution of the reionization process (Friedrich et al. 2011;Giri et al. 2018a).See Giri et al. (2019) for a detailed study of ISDs during reionization.
In Figure 8, each panel shows the predicted ISD (solid line) for three sub-volumes centred at redshift   = 7.68 (blue), 8.24 (green) and 8.97 (red) against the ground truth ISD (dashed line).In the bottom part of each panel, we show the difference with the ground truth.Similarly to before, in the case of PCA, the estimated distribution with PCA decomposition on the full redshift range, from 7 to 11, is shown with a darker colour.We show the uncertainty error on the predicted ISD with a shadow area of the same colour.The GPR method and the polynomial fitting from neutral island distribution analysis appear to be the best fit.Differences are visible only at a large scale,  ⩾ 100 Mpc, with a factor ∼ 3 larger for the early and middle reionisation sub-volume stage.The only noticeable difference for the early stage sub-volume is for the extremely large sizes,  ≈ 300 Mpc.The results from the training pre-processing (darker colour) predict an ISD consistently shifted toward a larger scale for the case of   = 7.68 and 8.24.Deviations from the ground truth start to be visible for scale  ⩾ 40 Mpc and  ⩾ 80 Mpc with differences from up to a factor of ∼ 2 and a maximum of 5 at  ≈ 200 Mpc.On the other hand, for the case of the sub-volume centred at   = 8.97, the predicted ISD shows no virtual difference.These results confirm what we concluded in §5.1, with the analysis from Figure 5 (left panel).The PCA performed on the sub-volume redshift range shows the same factorial difference but with an opposite behaviour.Differences are more prominent for the late stage of reionisation sub-volume and get gradually better at the early stage.In this analysis, the Wedge method fails to depict the Hi distribution for all the sub-volumes.For small neutral regions,  ⩽ 20 Mpc, the predicted distribution is a factor 2 larger, while for larger sizes, the distribution can be severely underestimated, with / two orders of magnitude smaller than the ground truth distribution.This performance is an indication that with the Wedge pre-processing, SegU-Net v2 is struggling to connect large neutral regions due to the missing 21-cm signal lying in the foreground wedge region that has been removed along with the foreground.
From the probability density distribution /, we can estimate the mean radius of the neutral islands at a given redshift, defined as In our case, we set the lower limit to the intrinsic resolution of our simulation  min = 2 cMpc.In Table 2, rightmost column, we list this quantity derived from the predicted binary field with the different pre-process.The ground truth average radius is   = 19.89cMpc for the sub-volume centered at   = 7.68,   = 29.54cMpc for   = 8.24 and   = 49.09cMpc for   = 8.97.Based on this quantity, we notice that the GPR method and Polynomial fitting produce a better prediction for the late and middle EoR sub-volumes, with a difference to the ground truth below the cMpc, while for the early stage scenario, they tend to underestimate of a few cMpc.In the case of both PCA decompositions, the predicted quantity differs by a few cMpc in excess and deficit, respectively.This trend is also visible from the predicted ISD, as PCA shows a systematic underestimation, while the same decomposition on the entire redshift range shows an overestimation for the same scale,  ⩾ 30 cMpc.Considering the uncertainty, the wedge method seems to work reasonably well only for the late stage of reionization.However, for this scenario, the predicted ISD does not match.At late stages, the Wedge Removal prediction of   can not be trusted, as this quantity differs substantially.Zackrisson et al. (2020) illustrated the possibility of employing SKA-Low tomographic data as a foreshadowing method to identify the region of interest for future and ongoing experiments that aim to observe galaxy formation in the early Universe, such as the JWST, Euclid and Nancy Grace Roman Space Telescope (e.g.Beardsley et al. 2015;Geil et al. 2017).This work demonstrated that there is a simple relation between the volume of isolated HiI bubbles, V ion , and the grand total of ionising photons, N , tot , produced by the primordial sources within the same ionised region.Although we are overlooking relevant instrumental effects (e.g.incomplete uvcoverage, absence of gain error, beam effect and more), we assume that our framework, described in §2.4,produces realistic enough mock observation to demonstrate the challenge of identifying and measuring the sizes of such bubbles and its derived relation.

Relation between ionised volume and total ionising photons
For this analysis, we require the mass and the position of the sources within the ionised bubbles.Therefore, we decided to use a simulation run with the C 2 Ray radiative transfer code (Mellema et al. 2006).In Paper I, we demonstrated how SegU-Net works reasonably well on simulations other than those employed for the training and validation.Moreover, recent works demonstrated the limitations of U-Net when cross-validating different cosmological models (Chen et al. 2023b).Here, we employ the obtained ionised hydrogen and density coeval cubes to calculate the 21-cm differential brightness with Equation 1 and follow the mock observation procedure explained in subsection 2.4.We consider the third axis the frequency direction to create the corresponding network input and target.We use one realisation of the simulated coeval cube at redshift z = 8.89 with box and mesh size of 348 cMpc and 250, respectively.We interpolate the 250 mesh grid into a 166 grid per side to a corresponding intrinsic resolution similar to our Δx = 2.09 Mpc dataset.One of the inputs of the C 2 Ray code is the cumulative halo mass smoothed into the mesh grid.In this way, we can associate an ionised bubble to the sources within the same region by converting the total halo distribution mass  h,tot to the total ionising photon produced N ,tot = f  Ω m /Ω b M h,tot .We refer the reader to Iliev et al. (2006Iliev et al. ( , 2012) ) and Dixon et al. (2016) for further reading on the halo source model.
Though SegU-Net v2 is not trained on simulations produced with C 2 Ray, we still find that the ionised regions are accurately identified.This analysis shows that the trained model is quite general9 and, therefore, capable of finding physical features in real observa-tions.In Figure 9, we show the relation between V ion and N , tot derived from the simulation data (blue crosses) and the predicted binary maps (orange points).We notice that SegU-Net v2 is failing to correctly quantify the number of ionising photons for volumes V ion ≲ (10 cMpc) 3 , vertical black dash line.This limitation corresponds to the 2 km interferometric smoothing scale we apply in our mock observation pipeline.At  = 8.89, the Gaussian kernel has an angular scale of Δ ≈ 3.57 arcmin, corresponding to a comoving size of 9.9 cMpc.This limitation is also consistent with the results in Figure 5, where the correlation between prediction and ground truth slowly decreases,   ⩽ 80%, for higher redshift, z ⩾ 9.

DISCUSSION & CONCLUSIONS
With this work, we improved our previous effort in Paper I and updated our deep learning framework, SegU-Net v2, for the identification of neutral and ionized regions in realistic 21-cm mock observation expected from SKA-Low.One of the advantages of our network is the possibility to provide per-pixel uncertainty maps on its predictions.In §2.4,we introduced our extended mock observation pipeline by including synchrotron Galactic foreground contamination, presented in §2.3.Additionally, we performed machine learning hyper-parameter optimisation.We show the best-performing hyperparameters setup we analysed in §section A.
In this work, we combine our network with a foreground mitigation method that pre-processes the input data and reduces, in part, the foreground contribution.We trained SegU-Net v2 on 10, 000 lightcones with 552 redshift slices from  = 7 to 11 pre-processed with PCA on 4 components for the full redshift range.We chose this pre-processing method as it is the most commonly used method for foreground contamination and provides fast and efficient mitigation.In §5.1, the analysis on a random sample dataset, composed of 300 lightcone with the same redshift extent and bins, shows that the updated version of our network works well, with an average correlation of 71%, on 21-cm images contaminated and pre-processed by a foreground contamination method.This level of accuracy is almost ∼ 20% less than our previous results and is attributed to the added complexity due to the presence of the Galactic foreground.We show that SegU-Net v2 recovered binary fields that tend to be considered more neutral at  ⩽ 8.5.We attribute this to the under-subtraction of the PCA pre-processing method employed during training.This trend is confirmed by the increase of the uncertainty map for the same redshift extent that saturates entire frequency channels (see the bottom panel in Figure 4).
In §5.2, we compared the binary maps predicted with SegU-Net v2 on different pre-processing foreground mitigation and one avoidance method.We consider three sub-volume of the fiducial simulation with frequency width Δ = ±10 MHz centred at redshift   = 7.68, 8.24 and 8.97, representing a late, middle and early stage of reionisation.In this work, we consider PCA decomposition ( §3.1), Wedge removal ( §3.2), Gaussian Process Regression ( §3.3) and Polynomial fitting ( §3.4).We demonstrated that SegU-Net v2 is able to recover Hi regions with varying accuracy for all the pre-processing methods we tested.In our case, the network is able to generalize enough and work with the same level of accuracy as the training case on preprocessing methods that were not employed during its training (see summary statistics in Table 2).Moreover, in §5.2.3, we study the island size distribution (ISD) of the predicted binary maps.GPR and Polynomial fitting work better in recovering the ISDs, as well as the average distribution size   of neutral regions, than the two cases of the PCA pre-processing (applied on the full redshift range and the sub-volume redshift range).
Therefore, we can conclude that SegU-Net v2 is the preprocessing method agnostic, providing accurate predictions independent of the pre-processing method, as long as the foreground mitigation provides reasonable residual images of the original 21-cm signal.Another conclusion is that PCA decomposition on lightcone data with a wide redshift range, e.g.frequency depth of the order of 60 MHz or larger, is to be preferred.In the case of smaller available sub-volumes, with frequency depth between 20 MHz and 30 MHz, other methods such as GPR or Polynomial fitting are to be preferred as they provide better prediction when compared to PCA on the same redshift range.
Finally, we provided a concrete use case of SegU-Net v2 in the context of 21-cm SKA-Low tomographic observation.Previous work demonstrated that a linear relation could be derived between the size of the ionised volume and the grand total number of ionising photons produced by the hosted source.In §5.3, we demonstrated that our network could recover with precision the linear relation for ionised volumes that are resolved.Here, we stipulate the limited resolution of the SKA-Low layout by the interferometric smoothing scale for the maximum baseline of  = 2 km, which corresponds to an angular scale of approximately 3.57 arcmin at redshift  = 8.89, corresponding to an early stage of reionisation scenario, xHI = 0.75.
The current version of SegU-Net v2 is trained using seminumerical simulations, known for their non-conservation of photons (e.g., Hutter 2018;Choudhury & Paranjape 2018).This discrepancy arises when the number of photons emitted by the sources does not match the number of IGM ionisations.However, it is important to highlight that SegU-Net v2 does not exhibit sensitivity to the model linking the sources and sinks in the simulations.Instead, it learns the ionization patterns present in the 21-cm signal distribution.Consequently, the framework successfully predicts the accurate volume of ionized regions in simulations generated by C 2 Ray, a numerical simulation code that conserves photons ( §5.3).In future work, we plan to retrain the network on models from photon-conserving frameworks, such as Beorn (Schaeffer et al. 2023) and pyC 2 Ray (Hirling et al. 2023).
In this paper, SegU-Net was trained on one NVIDIA® Tesla® P100 with 16GB for a total computational cost of approximately 12 GPU hours.When comparing the pre-processing method, we also consider the computational time required to compute the foreground mitigation/avoidance method.In our setup, one lightcone sub-volume of frequency depth 20 MHz with 200 redshift bins takes about 7 s CPU time to compute with PCA and 2 s with Polynomial fitting.Wedge remover provides faster pre-processing with 230 ms but inefficient foreground mitigation.On the other hand, GPR provides slow but reliable mitigation with a computing time of ∼ 1.2 CPU hours.
The Grad-CAM importance score analysis conducted in §B shows that the network decoder convolutional layer starts by identifying and grouping the region with the strongest positive emission.In the bottleneck of the U-Net model, the low-dimensional latent space then uses the encoded information to identify the threshold that defines the boundary of the neutral regions.The decoder layers use the compressed information and the U-Net skip connection with the encoder layer to define the location of the borders.Finally, the last convolutional layer further refines the decoder output.However, the analysis showed that the network struggled to correctly identify the residual foreground when this signal is similar to the 21-cm intensity.This explains why the final predictions include a positive detection of 21-cm signal regions and a false negative due to the noise or foreground residuals.
In our case, SegU-Net is a deterministic deep learning model.Recently, a series of works have imported probabilistic models in radio astronomy and astrophysics (Friedman & Hassan 2022;Wang et al. 2023;Sortino et al. 2023, e.g.).This approach inherently handles noise and variability in the data compared to the deterministic case.At the same time, they can learn the underlying probability distribution of the data, which can help for a better interpretation.On the other hand, deterministic models like U-Net often have the advantage of being computationally efficient and easier to train.In future work with SegU-Net, we consider converting the model to be probabilistic.
Our analysis shows that using image data from SKA-Low, SegU-Net v2 accurately determines the ionization fraction at different stages of reionization.Additionally, we have identified how the ionized regions detected by SegU-Net v2 can be used as markers for locating the galaxies responsible for driving the reionization process.These findings demonstrate the potential of our framework for synergy studies with other telescopes, such as the JWST, Euclid and Nancy Grace Roman Space Telescope.to minimise the loss.We employed the in bold text for the results presented in this paper.Although we monitored the validation loss to select the best setup, we noticed that the first and second models gave the worst prediction for images at the edges of the redshift range  ∼ 7 and 11.The fifth model provided the most balanced result, with an overall   ≈ 0.7 score, as shown in Figure 5 central panel.Moreover, in contrast to the findings by Li et al. (2018), we observe that setting the dropout rate to zero enhances accuracy only for the third-ranked setup.Meanwhile, other configurations exhibit improved performance when both batch normalization and dropout are included.

APPENDIX B: INSIDE THE BLACK BOX
The trained model we presented in this paper is able to recover the ionized field from noisy images with residual foreground contamination.This is an indication that the network learns to identify the regions of interest from important hidden features that maximize the recovery.However, the machine learning model's complexity, high dimensionality and non-linearity make them difficult to interpret and regulate, so these applications are often referred to as a black box.
Here, we present a first attempt to open and look inside SegU-Net black box.A standard tool to visualize and understand the decisions made by a general convolutional neural network is the Gradientweighted Class Activation Mapping (Grad-CAM) technique (Selvaraju et al. 2019).This method applied in segmentation and object classification highlights the features of an input image employed in predicting a particular class.Grad-CAM achieves this by computing the gradients of the predicted class   ≡   HI score with respect to the feature maps   of all the  > 0 convolutional layers up to the layer in the network under study.A weighted combination of these feature maps gives the importance score  The right column visualises the hidden layer output, with the number of sub-panels corresponding to the number of channels.From top to bottom, we have the output of the convolution block at the second level of the encoder after two convolutional layers and a pooling operation.The hidden state has angular and channel dimensions (64,64,32).We can see that in the encoder, the network focuses on the regions with the highest intensity, which, thanks to the preprocess presented in §3, are mostly located within the neutral region.The different channels in the hidden layer show a similar conclusion, with the convolutional operation capturing the large-scale region that produces 21-cm signals.In the second row, the bottom of the U-Net, known as the low-dimensional latent space, with dimension (16, 16, 128), gives a compressed representation of the input image, and it appears to focus on location in the image with the highest and lowest values.Our interpretation is that the network focuses on these extreme values to quantify the "threshold" value that sets the boundary between neutral and ionized regions.This interpretation is also supported by the 128 hidden layer plots in the right panel, as the compressed data shows different constant values across the channels.In the last row, we show the importance score from the final convolution before the binarization of the output.Here, it appears that the network uses the threshold value defined in the bottom layer of the U-Net to locate the delimitation that defines the neutral regions, as the shadow is located along the contour of the ground truth (black solid line).We notice that some locations in the image with substantial foreground residuals are wrongly included.The hidden layer plot shows that the network struggles to remove the foreground residual completely.

Figure 1 .
Figure 1.An example of a slice through the sky-plane used during the network training.Top Left: the neutral hydrogen fraction at simulation resolution when the reionisation process is halfway complete.Bottom Left: the simulated 21-cm signal after the interferometric smoothing with a maximum baseline of  = 2 km and matching frequency resolution.We then subtract the frequency mean signal to mimic the effect of the lack of a zero baseline.Top Right: systematic noise added to the 21-cm signal for an observing time of 1000 hours.A solid black line indicates the neutral field after the same interferometric smoothing scale.Bottom right: the Galactic synchrotron emission added to the 21-cm signal with the systematic.We can notice how the dynamic range is a few orders of magnitude larger and completely outshines the 21-cm signal.For all the differential brightness images, the units are in mK.

Figure 2 .
Figure 2. Cylindrical power spectra for a lightcone sub-volume centered at redshift   = 8.24 and frequency depth of ±10 MHz.Top Panel: 2D Power spectra from the simulated 21-cm signal only.Bottom Panel: Same quantity but with the galactic foreground contribution.The black dashed line indicates the wedge slope with  = 2.25 • and  = 8 × 10 −2 h Mpc −1 .

Figure 3 .
Figure 3.Comparison between different foreground mitigation methods.From left to right, we have PCA, wedge removal, GPR and polynomial fitting.First row, a visual example at redshift  = 8.24 of the residual image after the corresponding method.Second row, the cylindrical power spectrum for a lightcone sub-volume centred at   = 8.24 and frequency depth ±10 MHz.

)
This metric can have values between −1 ⩽   ⩽ 1, quantifying the quality of binary field (two-class) classifications.A negative value indicates anti-correlation, zero represents a completely random classification, and positive values indicate a positive correlation.For a direct comparison with previous studies on segmentation of 21-cm image data (e.g.Gagnon-Hartman et al. 2021), we define three additional statistical metrics as follows Accuracy =   +     +  +   +   .

Figure 4 .
Figure 4. Visualisation of the different fields for our fiducial lightcone.Top Left: for a given position on the x-direction, the redshift evolution of the residual lightcone after the PCA pre-processing step.Top Right: residual image at redshift  = 8.24 ( HI = 0.5).Same image as in Figure 1.Middle Left: redshift evolution of the predicted neutral (red) and ionised (blue) lightcones.Middle Right: predicted map at the corresponding redshift.Bottom Left: the corresponding per-pixel error lightcone, orange colour indicates the intensity of the uncertainty.Bottom Right: the corresponding per-pixel error map.For all panels, we over-plot contours that represent the ground truth.

Figure 5 .
Figure 5. Statistical analysis of the predicted binary maps for the testing dataset.Each point indicates an image at a given redshift in the colour bar.Left Panel: correlation plot between the ground truth volume average neutral fraction,  HI, true , against the predicted,  HI, pred .Right Panel: Matthew correlation coefficient   against global volume-averaged neutral fraction.The dashed blue line indicates the redshift averaged   .Here, solid green lines indicate the 68 per cent (1) and dashed green lines the 95 per cent (2) data contour.Right Panel: Receiver Operating Characteristic curve for the same dataset.The dashed line of different blue shades indicates the percentage of reliability of the prediction.

Figure 6 .
Figure 6.Comparison of the recovered binary field from different foreground mitigation pre-processes.We have PCA, wedge removal, GPR, and polynomial fitting from left to right.Top panels: a visual example of the recovered binary map at redshift  = 8.24 after the mentioned pre-processing step.The red/blue indicates the predicted neutral/ionized regions, while the green contour indicates the ground truth.Bottom panels: the corresponding per-pixel uncertainty map derived by SegU-Net v2.The orange indicates the intensity of the uncertainty, defined as a general standard deviation.The title includes the resulting   at this redshift.

Figure 7 .
Figure 7. Redshift evolution of the r  correlation coefficient for the different tested pre-processing step.Each panel shows the result on three lightcone sub-volumes centred at   = 7.68 (blue), 8.24 (green) and 8.97 (red) with a ±10 MHz frequency depth.These redshifts correspond to the late, middle and early stages of reionization, respectively.Solid lines indicate the   coefficient for the predicted binary maps.Shadow areas indicate the error due to the uncertainty map.Horizontal dashed lines indicate the redshift averaged   coefficient.For the case of PCA, we plot the decomposition executed on the full redshift range (dark blue) as a reference.

Figure 8 .
Figure 8. Island size distribution for the different pre-processing steps.Each panel shows the predicted size distribution   / (top section) and the difference to the ground truth (bottom section).The colours indicate the lightcone sub-volume at the late (  = 7.68, blue), middle (  = 8.24, green) and early (  = 8.97, red) stage of reionization.The results from the neutral regions in the predicted fields are shown with solid lines and the ground truth with dashed lines.For the case of PCA, we plot as a reference the predicted size distribution with a dot-dashed line.

Figure 9 .
Figure 9. Relation between the volume of ionised region versus the grand total of ionising photons within the same region.For a coeval cube at redshift  = 9 (x HI = 0.75) and box size of L box ≈ 348 cMpc.Relation derived from the ground truth is represented with blue cross data, while orange circle points are derived from SegU-Net prediction.The dashed red line corresponds to the linear fit of the ground truth data points.The vertical line indicates the 2 km baseline smoothed resolution.

(
,  )  ∈ [0, 1] which indicates the importance of the feature in the image at location (, ) for the class th.This score is given as   =    is the normalization factor, corresponding to the sum of the th weights, while   is the weight corresponding to the feature map   .In our case, we focus on the neutral regions, categorized with a value of  = 1 in our maps.Values close to one indicate high importance, while in the opposite case, it indicates irrelevance.In FigureB1, we show the result for three hidden layers.The first column shows the input image and the   score represented by dark shadows, indicating the location in the image employed in the classification of the neutral regions.Solid line contours indicate the ground truth.The central column shows the Grad-CAM filtered region obtained by element-wise multiplication of the input image with the importance score.This plot shows us what features the network emphasizes in the image for identifying the neutral region.

Figure B1 .
Figure B1.Region of interest detected by Grad-CAM for three hidden layers.Left panels: Input image with shadow areas that indicate the region of attention detected by the Grad-CAM method.Black solid contours indicate the ground truth for comparison.Central panel: The filtered Grad-CAM image element-wise multiplication between the input and the   filter.Right panel: A visualisation of the hidden layer output.The number of sub-panels indicates the channel size of the hidden layer.

Table 1 .
The telescope parameters used in this work.For the frequency channel width, we indicate the quantity at  = 7 and 11.

Table 2 .
Result summary of the predicted binary field for the tested pre-processing step on the three lightcone sub-volume at representative stages of reionization.

Table A1 .
SegU-Net hyper-parameter optimization analysis for the best-performing setups of seven parameters with optimisation on the validation loss.Ranking Activation Channels latent space Depth Dropout Final activation Kernel Pooling type   [%] Validation loss [×10 −2 ]