Cosmic variance suppression in radiation-hydrodynamic modeling of the reionization-era 21-cm signal

The 21-cm line emitted by neutral hydrogen is the most promising probe of the Epoch of Reionization (EoR). Multiple radio interferometric instruments are on the cusp of detecting its power spectrum. It is therefore essential to deliver robust theoretical predictions, enabling sound inference of the properties of the coeval Universe. The nature of this signal traditionally required the modelling of $\mathcal{O}(10^{7-8} \, {\rm Mpc}^3)$ volumes to suppress the impact of cosmic variance. However, the recently-proposed Fixed & Paired (F&P) approach uses carefully-crafted simulation pairs to achieve the same statistical precision in smaller volumes. In this work, we thoroughly test the applicability of, and improvement granted by, this technique to different observables of the 21-cm signal from the EoR. We employ radiation-magneto-hydrodynamics simulations to ensure the most realistic physical description of this epoch, greatly improving over previous studies that used a semi-numerical approach without accurate galaxy formation physics and radiative transfer. We estimate the statistical improvement granted by the F&P technique on predictions of the skewness, power spectrum, bispectrum and ionized-region size distribution of the 21-cm signal at redshift $7 \leq z \leq 10$ (corresponding to $\geq80\%$ of the gas being neutral). We find that the effective volume of F&P simulations is at least 3.5 times larger than that of traditional simulations. This directly translates into an equivalent reduction of the computational cost (in terms of time and memory). Finally, we confirm that a combination of different observables like skewness, power spectrum and bispectrum across different redshifts can be utilised to maximise the improvement.


INTRODUCTION
One of the most important probes for studying the Epoch of Reionization (EoR; the period of the Universe when the first astrophysical objects formed and emitted UV photons that ionized the neutral hydrogen of the intergalactic medium) is the brightness fluctuation of the 21-cm line of neutral hydrogen as observed in emission or absorption against the cosmic microwave background radiation (CMB; Field 1959; Hogan & Rees 1979; Madau et al. 1997; Shaver et al. 1999; Tozzi et al. 2000; Ciardi & Madau 2003; Zaroubi 2013). The 21-cm signal arises from the highly-forbidden hyperfine spin-flip transition of the ground state of neutral hydrogen. Thanks to the long lifetime of the excited state, combined with the abundance of neutral hydrogen in the intergalactic medium (IGM) at the inception of the EoR, it can be used to trace the progressive ionization of the gas during the EoR. A statistical detection of the strength of the fluctuations of its brightness temperature can allow us to constrain our models of the early Universe and the formation of the first stars and galaxies.
Multiple interferometric low-frequency radio telescopes have been designed to search for this signal, like LOFAR (Low-Frequency Array, http://www.lofar.org), HERA (Hydrogen Epoch of Reionization Array, https://reionization.org/), MWA (Murchison Widefield Array, http://www.mwatelescope.org), PAPER (Precision Array to Probe EoR, http://eor.berkeley.edu) and the upcoming SKA (Square Kilometre Array, https://www.skao.int/en). While the signal is yet to be detected, upper limits on the 21-cm power spectrum have been obtained (e.g., Mertens et al. 2020; The HERA Collaboration et al. 2022; Trott et al. 2020). These have allowed us to rule out some extreme astrophysical models by comparison with simulations (e.g., Ghara et al. 2020; Mondal et al. 2020; Greig et al. 2021a,b; Abdurashidova et al. 2022). Further efforts are being made to minimise observational errors by improving signal extraction techniques (e.g. Mertens et al. 2023; Acharya et al. 2023). It is therefore imperative to concurrently improve numerical modeling, readying it for the detection of the 21-cm signal. As of now, the main limitation is the trade-off between simulated volume and achieved resolution, forced by limits in computational power.
In fact, as the thermal noise is proportional to the wave-modes (Koopmans et al. 2015), the best results from upcoming surveys with LOFAR, HERA, MWA and, subsequently, SKA are expected at the lowest k values, i.e., 0.15 ≤ k/(h cMpc⁻¹) ≤ 1.5, after sufficient mitigation of foreground contamination. This corresponds to large physical scales, and indeed e.g. Iliev et al. (2014) found that simulations with box sizes greater than 100 h⁻¹ cMpc are necessary to accurately model the evolution and distribution of ionized regions. Kaur et al. (2020), instead, found that even larger boxes, i.e. ≳ 175 h⁻¹ cMpc, are necessary. However, a simulation with a box size corresponding to k ∼ 0.15 h cMpc⁻¹ is still limited by sample variance. The straightforward solution to this issue is to employ larger boxes in order to increase the sampling of the observationally relevant long wave modes. For example, Ghara et al. (2020) simulated a side length of 500 h⁻¹ cMpc, while the semi-numerical approaches of Mesinger et al. (2016) and Greig et al. (2022) have modeled Gpc scales. Such large boxes also allow one to study the topology of large ionized regions (≳ 10 cMpc), which is especially important to understand the role of quasars. However, even these approaches have issues, as they require extremely large computational resources for running the simulations and storing the generated data, and running a large number of such simulations would only be computationally affordable with semi-numerical approaches. While the latter can be appropriate for some analyses, they also have their own limitations (see e.g. Choudhury & Paranjape 2018, for details).
Additionally, there are other limitations on the extent of physics that can be explored with such simulations. For example, mini-halos (≲ 10⁸ M⊙) are expected to play a significant role in reionization (see Iliev et al. 2005; Haiman et al. 2001), but their inclusion would either require a mass resolution high enough to capture them in large boxes, or running significantly smaller boxes. The former exacerbates the computational requirements, while the latter limits the wave modes that can be sampled. A possible approach to tackle these problems is to use simulations with different box sizes, and combine their results. A recent example of this is the analysis of galaxy populations at z ≥ 8 in Kannan et al. (2023), which, however, does not provide a truly combined picture, as the different boxes are manually re-scaled to account for resolution differences without actual convergence between simulations. There have also been suggestions to use techniques like deep learning to increase the resolution of large-box simulations (see Kodi Ramanah et al. 2020; Li et al. 2021, for applications to dark-matter-only N-body simulations); however, these have not been carefully tested for EoR redshifts.
Another approach is to run computationally-cheaper boxes that incorporate the properties of the more-expensive, full-physics ones. Recently, Sinigaglia et al. (2021) developed a method to map baryonic properties of the IGM onto DM-only simulations, but its applicability to the reionization epoch is unclear.
An alternative approach is to simulate the full physics of reionization in small boxes that are augmented with techniques that compensate for the limited volume. One way to do this is the Fixed and Paired (F&P hereafter) approach described in Angulo & Pontzen (2016) and Pontzen et al. (2016). It has been shown to substantially reduce the cosmic variance in the matter power spectrum (see Maion et al. 2022; Klypin et al. 2020; Villaescusa-Navarro et al. 2018, for examples) with respect to the traditional approach for the same box size, bypassing the need to run a large number of smaller boxes. However, this reduction is substantial only for statistical properties (e.g., the stellar mass function), and not for local ones like the gas distribution of individual galaxies (see Villaescusa-Navarro et al. 2018, for more details). The 21-cm signal is influenced by both the large-scale distribution of neutral hydrogen and the local properties of the ionized regions around galaxies. Exploring the extent of improvement (if any) in the effective volume for estimating the summary statistics of the 21-cm signal using the F&P approach can be helpful in minimising the required computational resources. This idea was explored in Giri et al. (2023) by running a large number of realizations of the F&P approach using 21cmFAST (Mesinger & Furlanetto 2007; Greig & Mesinger 2015). By comparing them with randomly generated simulations, they found that F&P boxes could obtain the same precision in the 21-cm signal power spectrum as traditional boxes twice their size, allowing at least a factor of 4 reduction in computing costs. However, 21cmFAST does not take into account baryonic hydrodynamics, thus implicitly assuming that baryons track dark matter. It also does not include a proper implementation of radiative transfer (RT) or galaxy properties, and thus becomes unreliable at scales below 1 cMpc. Hence it cannot provide an accurate picture of galaxy-scale effects on the 21-cm signal.
To increase confidence in the applicability of the F&P method for improving EoR 21-cm signal studies, it is necessary to use a more realistic framework, i.e. one which includes baryonic hydrodynamics and RT, and which models galaxy properties more accurately. For this, we employ a setup similar to the THESAN simulations (Kannan et al. 2022; Garaldi et al. 2022; Smith et al. 2022; Garaldi et al. 2023), detailed in Section 2. We discuss the results of various summary statistics in Section 3, and the improvement from using the F&P approach in Section 4, while we give our conclusions in Section 5.

The 21-cm signal
The brightness temperature fluctuations of the 21-cm signal relative to the CMB temperature are given, for any patch of the IGM, as (see Furlanetto et al. 2006):

$$\delta T_{\rm b} \approx 27\, x_{\rm HI}\, (1+\delta_{\rm B}) \left(1-\frac{T_{\rm CMB}}{T_{\rm S}}\right) \left(\frac{1+z}{10}\,\frac{0.15}{\Omega_{\rm m} h^2}\right)^{1/2} \left(\frac{\Omega_{\rm B} h^2}{0.023}\right) {\rm mK}, \tag{1}$$

where x_HI is the fraction of neutral hydrogen, δ_B is the fractional overdensity of baryons, T_S is the hydrogen spin temperature, T_CMB is the temperature of the CMB photons at redshift z, Ω_m is the total matter density, Ω_B is the baryon density, and h is the Hubble constant in units of 100 km s⁻¹ Mpc⁻¹. Above we assume that the spin temperature is coupled to the gas temperature, i.e., T_S = T_gas, where T_gas is the gas temperature self-consistently calculated in the simulations.
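For concreteness, the following minimal Python sketch evaluates Equation 1 on gridded fields (the function and variable names are illustrative, not part of our pipeline; it assumes T_S is everywhere well-defined and expressed in K):

```python
import numpy as np

# Cosmological parameters from Planck Collaboration et al. (2016),
# as adopted in Section 2.2.
h, Omega_m, Omega_B = 0.6774, 0.3089, 0.0486

def brightness_temperature(x_HI, delta_B, T_S, z):
    """Differential 21-cm brightness temperature in mK (Equation 1)
    for gridded neutral-fraction, baryon-overdensity and spin-
    temperature fields at redshift z."""
    T_CMB = 2.725 * (1.0 + z)  # CMB temperature at redshift z [K]
    return (27.0 * x_HI * (1.0 + delta_B)
            * (1.0 - T_CMB / T_S)
            * np.sqrt((1.0 + z) / 10.0 * 0.15 / (Omega_m * h**2))
            * (Omega_B * h**2 / 0.023))
```

Applied cell-by-cell to gridded simulation outputs, this is, in essence, how the mock δT_b cubes analysed in this work can be produced.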
In the next section, we introduce the simulations used to generate mock differential brightness temperature (δT_b) maps.

Simulations
Our setup is inspired by the THESAN simulations. We run a suite of radiation-magneto-hydrodynamic simulations that utilize the moving-mesh hydrodynamics code AREPO (Springel 2010; Weinberger et al. 2020), which includes a gravity solver based on the hybrid Tree-PM method (Barnes & Hut 1986), a quasi-Lagrangian Godunov-method (Godunov & Bohachevsky 1959) hydrodynamics solver implemented on an unstructured Voronoi mesh (Vogelsberger et al. 2020), and the radiative transfer extension AREPO-RT (Kannan et al. 2019) for a self-consistent treatment of ionizing radiation. We include the production and propagation of ionizing photons in three energy bins relevant for hydrogen and helium photoionization ([13.6, 24.6, 54.4, ∞] eV). Further, we employ a non-equilibrium thermochemistry solver to model the coupling of radiation fields to gas. For the luminosity and spectral energy density of stars, we use a complex function of age and metallicity calculated using the Binary Population and Spectral Synthesis models (BPASS v2.2.1; Eldridge et al. 2017), modeling the unresolved birth cloud with a uniform escape fraction of f_esc = 0.37. We note that we do not recalibrate this parameter with respect to THESAN since (i) it mostly impacts the final phases of reionization, while we are interested in the initial ones, and (ii) our goal is to compare methods for initial-condition generation, so a slightly-inaccurate reionization history is not expected to affect our results at all. For further details on the THESAN simulations, see Kannan et al. (2019, 2022).
All our simulations have a box size of L = 95.5 cMpc and N = 2 × 525³ particles, giving a dark matter and baryonic particle mass of m_DM = 2.0 × 10⁸ M⊙ and m_gas = 4.7 × 10⁶ M⊙, respectively. The gravitational softening length for the star and dark matter particles is set to 6.0 ckpc, which is also the minimum value for the adaptively softened gas cells according to cell radius. The cosmological parameters are taken from Planck Collaboration et al. (2016) as h = 0.6774, Ω_m = 0.3089, Ω_Λ = 0.6911, Ω_b = 0.0486, σ_8 = 0.8159 and n_s = 0.9667. In the initial conditions the gas is assumed to follow the DM distribution perfectly, with primordial hydrogen and helium fractions of X = 0.76 and Y = 0.24, respectively. We start the simulations from z_ini = 49, generating 54 snapshots between z = 20 and z = 7.
Due to the chosen particle number, our dark matter halo masses are ≳ 10¹⁰ M⊙, leading to reionization being driven by relatively massive galaxies. The resulting topology of reionization therefore somewhat differs at small scales from the one that would be produced by including lower-mass, more abundant galaxies. This will impact the 21-cm signal and slow down the reionization process (which is driven by M_star ∼ 10⁷ M⊙ galaxies, as shown in Rosdahl et al. 2022; Yeh et al. 2023; Kostyuk et al. 2023). However, these effects will equally affect all simulations, and so will be factored out by our comparative analysis. With a higher mass resolution, it would be immensely computationally expensive to run the number of simulations necessary to perform the statistical analyses discussed in subsequent sections of this work. As discussed earlier, the necessity of simulating large physical scales in order to compare with upcoming surveys prevents us from employing smaller (and thus computationally cheaper) boxes.
Once a significant fraction of neutral hydrogen is reionized, the improvement from using the F&P approach to study the 21-cm signal is expected to saturate. This is depicted in Figures 8 and A1, showing the improvement factor calculated according to the methodology of Section 4.1. Thus, we only run the simulations down to z_fin = 7, which corresponds to ⟨x_HI⟩ ≈ 0.8. We note that at z = 7, ⟨x_HI⟩ ranges between 0.72 and 0.82 for the GIC simulations.
Lastly, as done in Kannan et al. (2022), we also save Cartesian data output at a higher cadence (243 outputs between z = 16 and z = 7) by depositing the simulation data onto a regular Cartesian grid with a first-order Particle-In-Cell approach. We use a 256³ grid, i.e. each cell is ∼ 372 ckpc in size.
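As an illustration of such a deposit, a minimal cloud-in-cell (first-order PIC) sketch in Python is given below; it treats gas cells as point tracers at their centres (an approximation of our actual Voronoi cell data) and assumes periodic boundaries:

```python
import numpy as np

def cic_deposit(pos, mass, box_size, n_grid=256):
    """First-order (cloud-in-cell) deposit of point masses onto a
    periodic n_grid^3 Cartesian grid. pos has shape (N, 3) in the
    same length units as box_size; mass has shape (N,)."""
    grid = np.zeros((n_grid,) * 3)
    cell = pos / box_size * n_grid            # positions in cell units
    i0 = np.floor(cell - 0.5).astype(int)     # lower neighbouring cell
    frac = cell - 0.5 - i0                    # offset from its centre
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                # CIC weight: (1 - frac) for the lower cell, frac for the upper.
                w = (np.abs(1 - dx - frac[:, 0])
                     * np.abs(1 - dy - frac[:, 1])
                     * np.abs(1 - dz - frac[:, 2]))
                idx = (i0 + np.array([dx, dy, dz])) % n_grid
                np.add.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), mass * w)
    return grid
```

Dividing the deposited mass grid by its mean (and subtracting one) then gives the overdensity field entering Equation 1.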

The Fixed & Paired approach
The F&P approach was first proposed by Angulo & Pontzen (2016) and is based on the creation of a special pair of initial conditions (ICs) that, when employed together, significantly reduce the impact of cosmic variance. In the traditional approach to IC creation, an initially uniform and isotropic distribution of tracer particles is perturbed (following e.g. Zel'dovich 1970) to induce density perturbations:

$$\delta(\mathbf{k}, z_{\rm ini}) = \sqrt{P(\mathbf{k}, z_{\rm ini})}\, e^{i\theta_{\mathbf{k}}}, \tag{2}$$

where the phase θ_k is sampled from a flat distribution in the range [0, 2π], and the power spectrum amplitude P(k, z_ini) is sampled from a Gaussian distribution centered on its expectation value E[P(k, z_ini)] (necessary to produce a Gaussian random field).
In the F&P approach, the power spectrum modes are fixed to their expectation values. This produces the fixed initial conditions, while the paired ones are obtained by reversing the phase associated to each particle displacement. By combining simulations run with these two sets of ICs, the variance-induced fluctuations (which in traditional ICs mainly affect the large-scale modes, where the sampling of the power spectrum amplitude is scarce and therefore more susceptible to deviations from its expectation value) are suppressed up to the fourth perturbative order (Angulo & Pontzen 2016). Additionally, despite breaking the Gaussianity of the generated field, this approach does not induce unwanted features, as shown by e.g. Angulo & Pontzen (2016).

In this work, we generate 5 such pairs of F&P simulations, in order to explore the effects of the random sampling of density perturbations. To minimise the effect of randomness, for each pair we fix the seed of the random number generator used for stochastic algorithms like the star formation prescription. Further, we run all the simulations described here on the same machine with the same hardware configuration. As AREPO is coded to be binary identical in such conditions, this allows us to avoid floating-point errors building up and biasing our results.
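A minimal sketch of how a fixed and paired set of overdensity fields can be generated in Fourier space is given below (Python; the normalization of the input power spectrum and the back-scaling to z_ini are glossed over, and the function names are ours):

```python
import numpy as np

def fixed_paired_overdensity(power_spectrum, n_grid, box_size, seed=42):
    """Sketch of Fixed & Paired overdensity fields (Angulo & Pontzen
    2016): mode amplitudes are fixed to their expectation value
    sqrt(P(k)) while the random phases are kept; the paired field
    reverses every phase (theta -> theta + pi), which amounts to
    flipping the sign of the fixed overdensity field."""
    rng = np.random.default_rng(seed)
    # FFT of a real white-noise field: random, Hermitian-symmetric phases.
    noise_k = np.fft.fftn(rng.standard_normal((n_grid,) * 3))
    # |k| on the FFT grid.
    k = 2 * np.pi * np.fft.fftfreq(n_grid, d=box_size / n_grid)
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    k_mag = np.sqrt(kx**2 + ky**2 + kz**2)
    k_safe = np.where(k_mag > 0.0, k_mag, 1.0)    # avoid evaluating P(0)
    # Fix every amplitude to its expectation value, keep the phase.
    fixed_k = np.sqrt(power_spectrum(k_safe)) * noise_k / np.abs(noise_k)
    fixed_k[k_mag == 0.0] = 0.0                   # remove the mean mode
    delta_fixed = np.fft.ifftn(fixed_k).real
    delta_paired = -delta_fixed                   # phase reversal by pi
    return delta_fixed, delta_paired
```

In a traditional (GIC) realization, the amplitude of each mode would instead be drawn randomly around sqrt(P(k)); fixing removes that scatter, and pairing cancels the leading residual fluctuations when the two runs are averaged.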
We compare the averages of these 5 F&P pairs against 35 simulations based on traditional Gaussian-sampled initial conditions (hereafter referred to as GIC). As an example, in Figure 1 we show the reionization histories of the 5 pairs (magenta, orange, red, blue, green dashed), as well as those of the GIC simulations (grey solid). From the figure, we note that the reionization histories of the F&P averages cluster close to the average of those of the GIC simulations. This is a consequence of the fact that the F&P method ensures that the F&P averages closely match the expected halo mass function, while GIC simulations can spuriously have an excess or dearth of very bright sources.
In Figure 2, we show maps of δT_b at different redshifts (z = 10, 8.3, 7.6, 7; these correspond to ⟨x_HI⟩ = 0.99, 0.95, 0.9, 0.8) for one of the fixed simulations, its corresponding pair, their average, and the average of two randomly chosen GIC simulations. We choose two GIC simulations at random to provide a visual comparison of the results obtained from averaging them as opposed to averaging an F&P pair. Note that, due to the phase inversion used for generating the initial conditions of the pair, the regions of high δT_b in the fixed simulation roughly overlap with regions of low δT_b in the pair. This is evident in their average, which does not have specific regions of high (or low) δT_b, unlike the average of the two GIC simulations. With decreasing redshift, we see such regions beginning to form for the F&P average as well, but a closer analysis of the δT_b summary statistics is necessary to analyse the difference from GIC simulations.
In the next section, we compare the ensemble average of the GIC simulations (taken to be the "true" value) against the F&P average simulations. We show qualitative comparisons between two of the F&P averages and the true value in Section 3, and a more quantitative analysis using all five of the averaged simulations in Section 4.

ANALYSIS AND RESULTS
In this section, we compare the various summary statistics of the 21-cm signal for the GIC simulations and their ensemble average versus the F&P averages. In particular, we focus on the skewness, the power spectrum, and the bispectrum. In Sections 3.1, 3.2 and 3.3 we analyse their behaviour at various redshifts, focusing on z = 10, 9, 8, and 7. We also compare the ionized bubble size distribution across these redshifts in Section 3.4.

Skewness
The skewness (μ̃_{3,b}) is a statistical measure of asymmetry in a distribution, i.e., as the name suggests, it quantifies how skewed a distribution is around its mean value; μ̃_{3,b} = 0 indicates a symmetric distribution. Here, we use the definition of skewness from Ma et al. (2021), given as:

$$\tilde{\mu}_{3,b} = \frac{\mu_3}{\mu_2^{3/2}} = \frac{E\left[\left(\delta T_{\rm b} - E[\delta T_{\rm b}]\right)^3\right]}{E\left[\left(\delta T_{\rm b} - E[\delta T_{\rm b}]\right)^2\right]^{3/2}}, \tag{3}$$

where μ_n is the n-th central moment, and E[·] is the expectation value at each snapshot where we calculate the skewness; in our case, this is the volume average.
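Operationally, with the volume average as the expectation value, the skewness of a δT_b cube can be computed as in the following sketch (equivalent to scipy.stats.skew applied to the flattened cube):

```python
import numpy as np

def skewness(dTb):
    """Skewness of the brightness-temperature field (Equation 3):
    third central moment over the 3/2 power of the second, with
    expectation values taken as volume averages over the box."""
    delta = dTb - dTb.mean()
    mu2 = np.mean(delta**2)   # second central moment (variance)
    mu3 = np.mean(delta**3)   # third central moment
    return mu3 / mu2**1.5
```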
In Figure 3, we compare the evolution of the skewness versus redshift from z = 10 to 7 for the 35 GIC simulations, their ensemble average, and the 5 F&P average simulations. We note that the F&P averages are a good estimate of the true value of the skewness in most cases. For the 2nd F&P average, the deviation is larger at z ≲ 8, which is indicative of more inhomogeneity in δT_b. We find that this anomalous skewness is caused by the fixed simulation in the second pair having a chance association of galaxies with strong black hole feedback and high output of ionizing photons. While such cases are rare, they are physically possible and highlight how the relevance of galactic processes in the production of the 21-cm signal hinders the improvements granted by the F&P approach. We also note that such cases are not completely captured by e.g. the 21cmFAST simulation used in Giri et al. (2023).
This anomalous skewness, however, is not expected to have a major effect on other summary statistics like the power spectrum and bispectrum, because the effect is localised around a few simulated galaxies. Thus, not only are the scales involved significantly smaller than the scales we consider in the next sections, but this effect is also driven by a statistically-insignificant number of objects. To ensure this, we checked that replacing the volume average in Equation 3 with the median makes the anomalous behavior vanish. To make this more evident, in all following figures concerning summary statistics (0.15 ≤ k/(h cMpc⁻¹) ≤ 2.0) we elected to show the F&P pair that most closely follows the ensemble-averaged skewness and the pair that most deviates from it. There is no significant difference between these two.

Power Spectrum
The power spectrum is expected to be the first detectable statistic of the 21-cm signal during the EoR, and is thus of particular interest, making it important to analyse the improvement (if any) in modelling it with the F&P method. The power spectrum of δT_b as defined in Equation 1 is given by:

$$\langle \widetilde{\delta T_{\rm b}}(\mathbf{k})\, \widetilde{\delta T_{\rm b}}^{*}(\mathbf{k}') \rangle = (2\pi)^3\, \delta_{\rm D}(\mathbf{k} - \mathbf{k}')\, P(\mathbf{k}), \tag{4}$$

where δ_D is the Dirac delta function, δT̃_b(k) is the differential brightness temperature in Fourier space, and ⟨...⟩ is the ensemble average. In this work, we use the normalized form of the power spectrum given by:

$$\Delta^2(k) = \frac{k^3}{2\pi^2}\, P(k). \tag{5}$$

In Figure 4 we show the power spectra of the GIC simulations (in grey), their ensemble average (in black), and the F&P averages (magenta and orange) for 2 pairs (purple and green, with the fixed simulations in dashed and their corresponding pairs in dotted lines) out of the 5 generated in Section 2.2.1. While we have analysed the results for all five generated F&P pairs (see Section 4.1 for a quantitative comparison), for the sake of visual clarity we show only two of them. We explicitly choose the second F&P pair average (orange) to check if its deviation in skewness leads to any difference in behaviour as compared to one of the other four pairs (magenta).
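A minimal estimator of Δ²(k) from a gridded δT_b cube is sketched below (Python; the FFT normalization follows the conventions of Equations 4 and 5, while the binning choices and unit handling are illustrative assumptions):

```python
import numpy as np

def dimensionless_power_spectrum(dTb, box_size, n_bins=20,
                                 k_min=0.15, k_max=2.0):
    """Spherically averaged Delta^2(k) = k^3 P(k) / (2 pi^2) of a
    gridded dTb cube with side length box_size."""
    n = dTb.shape[0]
    # Approximate the physical Fourier transform with an FFT.
    dTb_k = np.fft.rfftn(dTb) * (box_size / n)**3
    power = np.abs(dTb_k)**2 / box_size**3        # P(k) in mK^2 volume units
    # |k| for every retained Fourier mode.
    k = 2 * np.pi * np.fft.fftfreq(n, d=box_size / n)
    k_half = 2 * np.pi * np.fft.rfftfreq(n, d=box_size / n)
    kmag = np.sqrt(k[:, None, None]**2 + k[None, :, None]**2
                   + k_half[None, None, :]**2)
    # Logarithmic bins over the window analysed in this work.
    bins = np.logspace(np.log10(k_min), np.log10(k_max), n_bins + 1)
    k_cen = np.sqrt(bins[1:] * bins[:-1])
    which = np.digitize(kmag.ravel(), bins)
    Pk = np.array([power.ravel()[which == i].mean()
                   if np.any(which == i) else np.nan
                   for i in range(1, n_bins + 1)])
    return k_cen, k_cen**3 * Pk / (2 * np.pi**2)
```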
We note that the F&P averages match the ensemble average across all redshifts. While their deviation from the ensemble average increases with decreasing redshift, it is still smaller than that of the individual GIC simulations (see lower panels of Figure 4). This is understandable as a combination of two effects. First, the F&P approach is expected to yield improvements on the predictions of features that are primarily governed by the large-scale structure (as they are tied to the matter power spectrum), while it does not provide a statistical improvement on the prediction of galaxy properties (Villaescusa-Navarro et al. 2018), such as the ionizing photon output, as they are entirely dominated by local physics. Secondly, the period of emergence of ionized regions is one where the 21-cm signal becomes increasingly non-Gaussian. Thus, the information content of the power spectrum is reduced in this period as compared to earlier and later redshifts, which have a homogenized distribution of neutral and ionized hydrogen, respectively. While the contribution from the large-scale structure remains the same, the fluctuations of δT_b are increasingly affected by the presence of ionized gas, which in our simulations is dominated by large and isolated ionized regions. While this mostly affects the power spectrum at small scales, the cumulative fluctuations can also show up over large scales. Thus, for the F&P method we observe deviations in the 21-cm signal power spectrum which are larger than those in the matter power spectrum as redshift decreases. However, we note that these deviations are still smaller than those of the individual GIC simulations. Nevertheless, it is necessary to check if an F&P pair average power spectrum is more likely to minimize cosmic variance as opposed to an average of two random GIC simulations. Thus, we quantify the improvement from using the F&P method in Section 4.1, where we consider all our F&P pair averages. While discrepancies between our results and Giri et al. (2023) could possibly be due to the difference in methodologies adopted for comparing the F&P method versus the traditional generation of initial conditions, contributions from the better handling of galaxy-driven physics in our simulations are also possible. It is, however, difficult to discern the extent of the effect galaxy-driven physics has on our results. Finally, it is also possible that our ensemble average is biased because it is an average of just 35 simulations. Using a significantly larger number of GIC simulations may rule out this issue. However, we note no appreciable difference in the ensemble average once more than 20 GIC simulations are used, and thus refrain from running additional simulations.

Bispectrum
As discussed in the previous section, the non-Gaussianity of δT_b increases with the growth of ionized regions as redshift decreases. This means that a statistic that focuses only on the Gaussian part of the signal, like the power spectrum, encapsulates less and less information as we move to lower redshift, into the regime of ⟨x_HI⟩ ≲ 0.9. Therefore, we turn to higher-order statistics to assess whether the F&P average is still a good approximation of the ensemble average of simulations in this regime, since they are able to capture the non-Gaussian features of δT_b. While the skewness discussed in Section 3.1 shows some broad non-Gaussian features of the signal, it is still a one-point statistic, i.e. it does not quantify the correlation of the signal between different Fourier modes. Thus, we now focus on the bispectrum as defined in Majumdar et al. (2018):

$$\langle \widetilde{\delta T_{\rm b}}(\mathbf{k}_1)\, \widetilde{\delta T_{\rm b}}(\mathbf{k}_2)\, \widetilde{\delta T_{\rm b}}(\mathbf{k}_3) \rangle = (2\pi)^3\, \delta_{\rm D}(\mathbf{k}_1 + \mathbf{k}_2 + \mathbf{k}_3)\, B(\mathbf{k}_1, \mathbf{k}_2, \mathbf{k}_3), \tag{6}$$

where k_1, k_2, k_3 are the Fourier-space wave numbers and δ_D is the Dirac delta function, which enforces the closed-triangle condition k_1 + k_2 + k_3 = 0. The bispectrum is a measure of the number of triangles (weighted by the δT_b values at their vertices) of different configurations formed by the wave numbers k_1, k_2 and k_3. The different triangles can be formed by varying the magnitude of the wave numbers.
To evaluate the bispectra from our simulations, we use the publicly available BiFFT package (Watkinson et al. 2017), which employs a Fourier-transform-based technique (as described in Scoccimarro 2015; Sefusatti et al. 2016) that is much faster than the more traditional approach of counting individual triangles, while still providing consistent results. Following Majumdar et al. (2020), we normalize the bispectrum. As done in Figure 4, we focus on wave numbers between 0.15 ≤ k/(h cMpc⁻¹) ≤ 2 and show only two of the five F&P averages for the sake of visual clarity. Further, we consider only two reference cases:

• Equilateral triangles (B_equi): Here we set k_1 = k_2 = k_3 = k, where k goes from 0.15 to 2 h cMpc⁻¹. This allows us to explore the non-Gaussian features of the signal across various physical scales (a minimal sketch of an FFT-based estimator for this configuration is given after this list). The results are shown in Figure 5, with the same colours and linestyles as Figure 4. We note that the F&P averages are a close match to the ensemble average across all redshifts. The apparent large deviations seen at some wave modes of the F&P average in comparison to the ensemble average (lower panels of Figure 5) arise because at those scales the bispectra approach 0; the normalisation by the ensemble average thus exaggerates the small differences significantly. Giri et al. (2023) carried out a similar analysis for z = 9 (corresponding to ⟨x_HI⟩ = 0.8 in their simulations), and found the F&P averages to be a close match to the ensemble average for k > 0.1 h cMpc⁻¹; our results are thus consistent with their conclusions. We also quantify the improvement from using the F&P method for B_equi in Appendix A, using the methodology of Section 4.1 and considering all our F&P pair averages.
• Isosceles triangles (B_isoc): We set k_1 = k_2 = k = 0.2 h cMpc⁻¹ ≠ k_3 to explore large physical scales more thoroughly. We plot B_isoc versus the opening angle between the vectors k_1 and k_2, given as θ = cos⁻¹[k_1 · k_2/(k_1 k_2)], in Figure 6, with the same colours and linestyles as Figure 4. As expected, the GIC simulations show large sample variance, which reduces with decreasing redshift. However, the F&P averages are an even closer match to the ensemble average as compared to the GIC simulations, and thus continue to provide an improvement across all four of the redshift bins used. Similarly to what is observed for B_equi, the large deviations seen at some scales at z = 10 are due to B_isoc approaching 0.
We also repeat the process for other B_isoc, varying k_1 = k_2 = k from 0.15 to 1.5 h cMpc⁻¹, and find that the trends hold across all physical scales. This is an interesting result, as it provides a useful statistic for modelling the 21-cm signal using F&P averages at redshifts with ⟨x_HI⟩ ≤ 0.9. It thus works well for comparing with observations too, as at these redshifts the bispectrum is a more useful statistic than the power spectrum.
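For reference, the core of such an FFT-based estimator can be written compactly for the equilateral configuration, as in the sketch below (this is our own illustration of the technique of Scoccimarro 2015 and Sefusatti et al. 2016, not the BiFFT implementation; the normalization is derived for numpy's FFT conventions and should be checked against the adopted ones):

```python
import numpy as np

def equilateral_bispectrum(dTb, box_size, k_center, dk=0.1):
    """FFT-based estimate of the equilateral bispectrum B(k, k, k):
    filter the field to the shell |k| in [k_center - dk/2,
    k_center + dk/2], sum the cube of the filtered field over the
    box, and normalize by the triangle count obtained from an
    identically filtered field of ones."""
    n = dTb.shape[0]
    k = 2 * np.pi * np.fft.fftfreq(n, d=box_size / n)
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    shell = (np.abs(np.sqrt(kx**2 + ky**2 + kz**2) - k_center)
             < dk / 2).astype(float)

    field = np.fft.ifftn(np.fft.fftn(dTb) * shell).real  # shell-filtered dTb
    count = np.fft.ifftn(shell).real                     # mode-counting field

    # V^2 / N^9 converts the raw FFT sums into B (mK^3 volume^2 units).
    return box_size**6 / n**9 * np.sum(field**3) / np.sum(count**3)
```

Isosceles configurations follow analogously, summing the product of two fields filtered to the k_1 = k_2 shell and one filtered to the k_3 shell.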
However, it is necessary to verify that the improvement in estimating the "true" bispectrum when using an F&P average indeed correlates with statistical estimates of the physical properties (i.e., the distribution of neutral hydrogen) at the observed redshifts. A good way to check this is to analyse the properties and distribution of the sources of the non-Gaussian features of δT_b, i.e., the ionized regions.

Bubble size distribution
There is no universal consensus on how to identify ionized regions, with several methods having been used in the literature (see Giri et al. 2018, for a detailed comparison of different methods). Here, we first choose to define cells with x_HI < 0.5 as ionized. Next, we use a Friends-of-Friends-like algorithm based on the ndimage package of SciPy (Virtanen et al. 2020) to identify clusters of such ionized cells, which are then considered as "bubbles". From their volume, we derive the radius of an equivalent sphere and use it as an effective bubble radius (R_eff). The threshold value of x_HI = 0.5 chosen to identify a cell as neutral or ionized is arbitrary; however, we find that varying this value does not affect our qualitative results.
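A sketch of this procedure using scipy.ndimage.label (which performs the connected-component search; the treatment of periodic boundaries is omitted here for brevity) is:

```python
import numpy as np
from scipy import ndimage

def bubble_radii(x_HI, box_size, threshold=0.5):
    """Identify ionized regions as connected clusters of cells with
    x_HI below the threshold and return the effective radius R_eff
    of the sphere with the same volume as each cluster."""
    n = x_HI.shape[0]
    cell_volume = (box_size / n)**3
    ionized = x_HI < threshold
    # Label connected components of ionized cells ("bubbles").
    labels, n_bubbles = ndimage.label(ionized)
    # Volume of each bubble from its cell count (label 0 is the
    # neutral background, hence the [1:]).
    volumes = np.bincount(labels.ravel())[1:] * cell_volume
    # Radius of the equivalent sphere.
    return (3.0 * volumes / (4.0 * np.pi))**(1.0 / 3.0)
```

Histogramming the returned radii in the range quoted below directly gives the bubble size distributions shown in Figure 7.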
The maximum value of R_eff is determined by the box size, while we ignore bubbles equivalent to an individual cell as they are resolution limited. We then find radii in the range 0.16 ≲ R_eff/(h⁻¹ cMpc) ≲ 40. As our simulations are run only down to z = 7 and quasars are not included as sources of ionizing photons, we do not expect many bubbles with R_eff > 5 h⁻¹ cMpc, and thus do not need to worry about the other extreme of the resolution range. We therefore limit our analysis to the range 0.3 ≤ R_eff/(h⁻¹ cMpc) ≤ 5, and build histograms of the number of bubbles (N_bubbles) binned according to R_eff. In Figure 7, we construct violin plots (cyan) for the different values of N_bubbles for the GIC simulations, and also show the ensemble average (black circles). To compare this with the two F&P averages used in Section 3.3, we plot the averages of the number of bubbles for the individual fixed and paired simulations using the same colour scheme (magenta and orange triangles). Again, we only show two out of the five generated averages for the sake of easy visual comparison, as all five give similar results. We note that both F&P averages are a good match for the ensemble average, and even when they deviate, they remain well within the violins showcasing the ranges of the GIC simulations. This confirms that the improvement in the bispectrum noted when using the F&P average is obtained because the F&P averages closely match the number and sizes of the ionized regions of the ensemble average.
Further, as expected, we see that the violin plots get narrower as we approach lower redshifts, as the small-scale variability between the GIC simulations grows with decreasing redshift. Lastly, we note that the number of smaller bubbles grows when going from z = 10 to z = 7.6, but by z = 7 many of them have begun merging, leading to a fall in the number of bubbles with 1.0 ≤ R_eff/(h⁻¹ cMpc) ≤ 2.0.

Advantage of the F&P method
Qualitatively, we note that the F&P average provides a closer estimate of the ensemble average, as compared to any individual GIC simulation, for statistics like the skewness and the power spectrum, at least for ⟨x_HI⟩ ≥ 0.9. For ⟨x_HI⟩ ≲ 0.9, the improvement for the power spectrum is reduced and the bispectrum becomes the better statistic to use.
However, it is necessary to quantify this improvement as compared to the average of multiple GIC simulations. For this reason, here we lay out the methodology to find the number of GIC simulations needed to match the performance of one F&P average with respect to the power spectrum. Here, we define "performance" as the extent of deviation of the average of multiple GIC simulations, or of an F&P pair, from the power spectrum of the ensemble average. If the number of GIC simulations required to match the performance of an F&P average is greater than 2, more of them need to be run to achieve the same performance; in this case, using the F&P average, which only requires 2 simulations to be run, reduces computational costs.
Ideally, for this comparison one should run a large number of GIC simulations as well as F&P averages, and compare the extent of their deviation from the ensemble average at specific wave-modes. As this is computationally unfeasible for full RMHD simulations, we instead sum the power spectra over the window 0.15 ≤ k/(h cMpc⁻¹) ≤ 2 and compute the standard deviation of the average of m randomly sampled GIC simulations, ⟨P_ws⟩_m^GIC, as a function of m (grey curve in Figure 8). Assuming the five F&P averages to be 1-σ (or, more conservatively, 2-σ) away from the ensemble average ⟨P_ws⟩_35^GIC, we locate their closest match on this curve, which gives the equivalent number N_eq of GIC simulations; the improvement factor is then N_imp = N_eq/2, since an F&P average requires only two simulations. While the F&P average may be performing worse at lower redshifts, it still does better than an average of a few GIC simulations. In fact, we note that ⟨N_imp⟩ is ∼ 6 at z = 10 (thus one F&P average is better than running 12 GIC simulations), but ∼ 8 at z = 7 (equivalent to 16 GIC simulations) for the case of F&P averages being 1-σ away. The lowest average improvement is 3.5 at z = 7.6, indicating that at least 7 GIC simulations are needed to match an F&P average.
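The sketch below illustrates this procedure for one redshift (a schematic reconstruction: the resampling and interpolation details of our actual implementation may differ, and the function names are ours):

```python
import numpy as np

def improvement_factor(ps_ws_gic, dev_fp, n_draws=2000, seed=0):
    """Estimate N_eq, the number of GIC simulations whose average
    window-summed power spectrum deviates from the ensemble average
    as much as one F&P average, and N_imp = N_eq / 2. ps_ws_gic
    holds the window-summed power spectra of the 35 GIC runs;
    dev_fp is the assumed deviation (e.g. 1-sigma) of an F&P
    average from the ensemble mean."""
    rng = np.random.default_rng(seed)
    n_sims = len(ps_ws_gic)
    m_grid = np.arange(1, n_sims)
    # Standard deviation of the mean of m randomly sampled GIC runs.
    sigma_m = np.array([
        np.std([rng.choice(ps_ws_gic, size=m, replace=False).mean()
                for _ in range(n_draws)])
        for m in m_grid
    ])
    # The curve decreases with m; invert it to find where it equals dev_fp.
    n_eq = np.interp(dev_fp, sigma_m[::-1], m_grid[::-1])
    return n_eq, n_eq / 2.0   # an F&P average costs two simulations
```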
As we sum the power spectra across the aforementioned wave-mode window, a direct comparison between our results and those of Giri et al. (2023) is difficult. Nevertheless, we note that in the range 10 ≥ z ≥ 7 we obtain N_imp ≥ 3.5, which agrees with their result of an improvement of at least a factor of 4 at k = 0.1 h cMpc⁻¹ at z = 9 (⟨x_HI⟩ = 0.8). The worst possible value of N_imp is 0.5, which corresponds to the case where a single GIC simulation performs better than an F&P average. This is expected, as pure randomness does allow such chance events. However, as the average improvement is above 2, we believe this still supports the use of F&P averages for modelling the 21-cm signal power spectrum rather than running GIC simulations.
A similar analysis can be run for the equilateral-triangle bispectrum. We present this in more detail in Appendix A, where we find N_imp ≥ 5.0. This result showcases that the F&P method is even better for the bispectrum, and thus using multiple summary statistics can allow us to maximise the interpretation of the 21-cm signal without requiring large effective volumes. Note that similar analyses can be performed for bispectra generated for different values of k_1, k_2 and k_3, by using a window over the opening angle instead of the wave-mode.

As discussed, the F&P approach does not improve the prediction of galaxy properties, as they are dominated by local processes and environment. Therefore, the improvement granted on statistical quantities (e.g. the power spectrum and bispectrum) does not necessarily mirror exactly onto different observables that might be more affected by individual objects (see e.g. the discussion in Section 3.1). We foresee numerous applications of this technique. For instance, the lower computational cost required (with respect to traditional approaches) entails e.g. that larger volumes, as well as a broader range of cosmologies and/or physical models, can be explored with accurate simulations, greatly improving the reliability of predictions and, eventually, inference from 21-cm signal data. While basing the entire inference process on RMHD simulations remains prohibitive, they will be needed in order to confirm and improve constraints obtained through computationally-cheaper, less-accurate methods. Additionally, they are necessary to explore the coupling between small and large scales, e.g. spatial correlations between galaxies and the 21-cm signal, that the advent of SKA will enable. Our results suggest that such accurate predictions can be obtained from RMHD simulations even in statistically-significant volumes of the Universe. Finally, the F&P approach operates orthogonally to super-resolution techniques (Kodi Ramanah et al. 2020; Li et al. 2021, as discussed in Section 1); therefore these two approaches should be considered complementary rather than in opposition.

SUMMARY
Running simulations of the EoR is computationally very expensive. This is exacerbated when aiming at resolving low-mass galaxies (whose importance has recently been shown observationally by Atek et al. 2023) in volumes large enough to be statistically significant for reionization studies. In this work, we explored the Fixed & Paired (F&P) approach (Angulo & Pontzen 2016; Pontzen et al. 2016) to investigate the possibility of reducing the number of simulations needed (and thus the overall computational expense) to produce unbiased models of the 21-cm signal. While past efforts have used semi-numerical approaches to implement this, we have shown more rigorous results by using radiation-hydrodynamic simulations that model more accurately the galaxy-scale effects on the 21-cm signal. We focus on wave modes in the range 0.15 ≤ k/(h cMpc⁻¹) ≤ 2, as the best measurements from present and upcoming radio telescopes like LOFAR, HERA, MWA and SKA are expected in this regime. Further, we focus on redshifts 10 ≥ z ≥ 7, which in our case correspond to 1.0 > ⟨x_HI⟩ ≥ 0.8.
To explore the improvement with respect to various 21-cm signal statistics obtained by adopting the F&P approach rather than running simulations based on Gaussian-sampled initial conditions (GIC), we use a setup similar to that of the THESAN project (Kannan et al. 2022; Garaldi et al. 2022; Smith et al. 2022; Garaldi et al. 2023). In particular, we investigate the impact on the skewness, power spectrum, bispectrum and the ionized-region size distribution, and introduce a novel method to quantify the improvement.
We find that the skewness and power spectrum are well estimated by the F&P averages for ⟨x_HI⟩ ≥ 0.9, and their performance is good also in the range 0.9 > ⟨x_HI⟩ ≥ 0.8, with an improvement in computational cost better than 3.5 for the generation of the power spectrum. We find that the bispectrum is well estimated for ⟨x_HI⟩ ≥ 0.8, with the scaled deviation between the F&P averages and the ensemble average being below 1. The only exceptions are those modes where the ensemble-average bispectrum approaches zero, artificially increasing the small differences between the F&P average and the ensemble average. In fact, we find that the improvement in computational cost is better than a factor of 5 for the equilateral-triangle bispectrum. This confirms that the F&P average bispectrum is a great complement to the power spectrum for studies of the 21-cm signal when ⟨x_HI⟩ ≥ 0.8. Finally, we show that in this regime the F&P averages also provide a good estimate of the HII region size distribution, with the F&P averages being within a 2-σ deviation of the ensemble average for bubbles with radius ≤ 5 h⁻¹ cMpc.
Thus, the F&P method can be used to model the 21-cm signal summary statistics with significantly reduced effective volume and computational expense.

Figure 1 .
Figure 1. Evolution of the volume-averaged neutral hydrogen fraction versus redshift for the GIC simulations (grey solid), their ensemble average (black solid) and the five F&P averages (orange, magenta, red, blue and green dashed). We note that at z = 7, ⟨x_HI⟩ ranges between 0.72 and 0.82 for the GIC simulations.

Figure 2 .
Figure 2. Middle slices of the δT_b maps of (from left to right): one of the fixed simulations, its corresponding pair, their average, and the average of two GIC simulations, at z = 10, 8.3, 7.6, 7 (with ⟨x_HI⟩ = 0.99, 0.95, 0.90, 0.80, respectively). The average of the 2 GIC simulations shows clear regions of high (and low) δT_b, unlike the F&P average. With decreasing redshift, such regions begin to form for the F&P average as well, but a closer analysis of the δT_b summary statistics is necessary to analyse the difference from GIC simulations.

Figure 4 .
Figure 4. Top: Power spectra of the GIC simulations (grey solid), their ensemble average (black solid), two F&P pairs (purple and green, dashed and dotted) and their averages (magenta and orange, solid) for z = 10, 8.3, 7.6, and 7, from left to right. Bottom: The normalised deviation of each GIC simulation and F&P average from the ensemble average.

Figure 8 .
Figure 8. The standard deviation curve (grey), generated by interpolating the standard deviations of ⟨P_ws⟩_m^GIC (grey squares) versus the number m of sampled GIC simulations, going from 1 to 35, for z = 10, 8.3, 7.6 and 7, clockwise from top left. The five F&P averages are plotted as triangles (circles) under the assumption of them being 1-σ (2-σ) away from ⟨P_ws⟩_35^GIC, using the same colours as in Figures 1 and 3. Their closest matches on the standard deviation curve, shown with black dotted vertical lines, give N_eq.

Table 1 .
N_imp is the factor of improvement from running an F&P average over running multiple GIC simulations. We report the minimum, maximum and average value of this quantity for the two cases discussed in (iv), for the five F&P averages.

[Accompanying figure caption:] Redshift evolution of the improvement factor in computational expense, N_imp, when using the F&P approach. The errorbars show the range from N_imp,min to N_imp,max for the F&P averages being 1-σ away (yellow) or 2-σ away (blue) from an assumed distribution of F&P averages, and the points show the average, ⟨N_imp⟩. The top axis shows the corresponding ⟨x_HI⟩ and the right axis shows the number N_eq of GIC simulations corresponding to N_imp. The black dashed line indicates no improvement.

Table A1 .
As Table 1, showing the N_imp improvement factor for the equilateral-triangle bispectrum (B_equi).