A General Bayesian Framework to Account for Foreground Map Errors in Global 21-cm Experiments

Measurement of the global 21-cm signal during Cosmic Dawn (CD) and the Epoch of Reionization (EoR) is made difficult by bright foreground emission which is 2-5 orders of magnitude larger than the expected signal. Fitting for a physics-motivated parametric forward model of the data within a Bayesian framework provides a robust means to separate the signal from the foregrounds, given sufficient information about the instrument and sky. It has previously been demonstrated that, within such a modelling framework, a foreground model of sufficient fidelity can be generated by dividing the sky into $N$ regions and scaling a base map assuming a distinct uniform spectral index in each region. Using the Radio Experiment for the Analysis of Cosmic Hydrogen (REACH) as our fiducial instrument, we show that, if unaccounted for, amplitude errors in low-frequency radio maps used for our base map model will prevent recovery of the 21-cm signal within this framework, and that the level of bias in the recovered 21-cm signal is proportional to the amplitude and the correlation length of the base-map errors in the region. We introduce an updated foreground model that is capable of accounting for these measurement errors by fitting for a monopole offset and a set of spatially-dependent scale factors describing the ratio of the true and model sky temperatures, with the size of the set determined by Bayesian evidence-based model comparison. We show that our model is flexible enough to account for multiple foreground error scenarios, allowing the 21-cm sky-averaged signal to be detected without bias from simulated observations with a smooth conical log spiral antenna.


INTRODUCTION
One of the remaining unmeasured periods in the history of our Universe is Cosmic Dawn (CD), the period in which the first stars are formed. These first generation stars give rise to the Epoch of Reionization (EoR), the stage in which the neutral hydrogen in the intergalactic medium (IGM) is ionized by the first galaxies. CD and the EoR take place after the formation of the Cosmic Microwave Background (CMB), but precede observations of the modern universe (Morales & Wyithe 2010; Furlanetto et al. 2006; Loeb & Furlanetto 2013; Loeb & Barkana 2001). The EoR and CD therefore represent the missing link between the early universe and the structure we see today. Despite its importance, a direct measurement of the EoR has been difficult due to a lack of direct probes. Thus far we have only indirect detection of its existence based on Lyman-α absorption and CMB optical depth measurements (Planck Collaboration et al. 2016).

★ E-mail: michael.pagano@mail.mcgill.ca
† E-mail: psims@physics.mcgill.ca
One of the most promising methods to make a direct detection of CD and the EoR is to use the 21-cm hyperfine transition of hydrogen, in which a 21-cm wavelength photon is emitted when the electron flips its spin relative to the hydrogen nucleus. The 21-cm line thus directly probes the neutral hydrogen in the IGM during CD and the EoR. The emitted 21-cm photon is then redshifted by cosmic expansion, which also allows for tomographic measurements of the neutral hydrogen along the line of sight. The 21-cm radiation falls in the radio portion of the electromagnetic spectrum and is measured in contrast to the CMB. The temperature difference between the 21-cm photons and the CMB is referred to as the differential brightness temperature $\delta T_b$ (Furlanetto et al. 2006; Liu & Shaw 2020).
There have been two main approaches employed by experiments to measure $\delta T_b$ during the EoR. One approach uses interferometers, which are sensitive to the spatial fluctuations of $\delta T_b$. This has been the approach of collaborations such as the Hydrogen Epoch of Reionization Array (HERA, DeBoer et al. 2017), the Low-Frequency Array (LOFAR, van Haarlem et al. 2013), the Murchison Widefield Array (MWA, Lonsdale et al. 2009), the Precision Array to Probe the Epoch of Reionization (PAPER, Parsons et al. 2010), and the Square Kilometre Array (SKA, Dewdney et al. 2009). An alternative approach is the 21-cm global signal measurement, in which the evolution of the spatially averaged $\delta T_b$ is measured as a function of frequency. This has been the approach of experiments such as Broadband Instrument for Global HydrOgen ReioNisation Signal (BIGHORNS, Sokolowski et al. 2015), Experiment to Detect the Global EoR Signature (EDGES, Bowman et al. 2008), Large aperture Experiment to detect the Dark Ages (LEDA, Price et al. 2018), Probing Radio Intensity at high-Z from Marion (PRIZM, Philip et al. 2019), The Giant Metrewave Radio Telescope (GMRT, Paciga et al. 2013), Radio Experiment for the Analysis of Cosmic Hydrogen (REACH, de Lera Acedo et al. 2022), and Shaped Antennas to measure the background RAdio Spectrum (SARAS, Singh et al. 2017). In this case, global signal experiments measure the mean of the differential brightness temperature, i.e., $\overline{\delta T_b} \equiv \langle \delta T_b \rangle$, given approximately by (Furlanetto et al. 2006):

$\overline{\delta T_b}(z) \approx 27\,\bar{x}_\mathrm{HI}(z)\left(1 - \frac{T_\mathrm{CMB}(z)}{\bar{T}_S(z)}\right)\left(\frac{\Omega_b h^2}{0.023}\right)\left(\frac{0.15}{\Omega_m h^2}\,\frac{1+z}{10}\right)^{1/2}\,\mathrm{mK},$

where $\bar{x}_\mathrm{HI} = 1 - \bar{x}_\mathrm{HII}$ is the mean neutral fraction of hydrogen ($\bar{x}_\mathrm{HII}$ being the mean ionized fraction), $\bar{T}_S(z)$ is the mean spin temperature, $T_\mathrm{CMB}(z)$ is the CMB temperature, $h$ is the dimensionless Hubble parameter, $\Omega_m$ is the normalized matter density, $\Omega_b$ is the normalized baryon density, and $z$ is the redshift, related to the frequency $\nu$ of the redshifted 21-cm photons by $1 + z = \nu_0/\nu$, where $\nu_0$ is the rest frequency of a 21-cm photon.
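The scalings in this standard approximation (Furlanetto et al. 2006) can be checked numerically. The sketch below evaluates the expression for illustrative parameter values (the function name, fiducial densities and spin temperature are ours, chosen for illustration, and peculiar-velocity terms are neglected):

```python
import numpy as np

NU_0 = 1420.405751  # MHz, rest frequency of the 21-cm line

def delta_Tb_mK(z, x_HI, T_S, omega_b_h2=0.0224, omega_m_h2=0.142):
    """Mean 21-cm differential brightness temperature in mK, using the
    standard approximation and neglecting peculiar-velocity terms."""
    T_cmb = 2.725 * (1.0 + z)  # CMB temperature at redshift z, in K
    return (27.0 * x_HI * (1.0 - T_cmb / T_S)
            * (omega_b_h2 / 0.023)
            * np.sqrt(0.15 / omega_m_h2 * (1.0 + z) / 10.0))

z = 17.0               # roughly the EDGES absorption-trough redshift
nu = NU_0 / (1.0 + z)  # corresponding observed frequency in MHz
print(f"nu = {nu:.1f} MHz, dTb = {delta_Tb_mK(z, 1.0, 5.0):.1f} mK")
```

A spin temperature well below $T_\mathrm{CMB}$ gives a negative (absorption) signal, while gas heated far above $T_\mathrm{CMB}$ gives emission, as discussed below.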
The mean differential brightness temperature (hereafter referred to as the global signal) depends on the spin temperature $T_S(z)$ of the neutral hydrogen gas, which measures the relative number of HI atoms in the excited versus ground hyperfine states. The global 21-cm signal is an important probe of CD and the EoR. After the time of recombination, $T_S$ and the neutral hydrogen gas temperature $T_K$ are coupled due to collisional excitations. The free electrons remaining from recombination scatter off CMB photons, keeping $T_K$ coupled to $T_\mathrm{CMB}$ and leading to no 21-cm signal. After redshifts of $z \sim 150$, $T_S$ and $T_K$ decouple from $T_\mathrm{CMB}$. As the Universe expands, the temperature of the neutral hydrogen gas $T_K$ cools adiabatically. Thus, due to collisional coupling and the cooling gas temperature, the spin temperature $T_S$ drops along with the gas temperature $T_K$, leading to the dark ages absorption trough. As the Universe expands further, collisions become more infrequent and eventually $T_S$ decouples from $T_K$ and once again falls into thermal equilibrium with $T_\mathrm{CMB}$. Once CD is underway, the first generation stars begin emitting Lyman-α photons, which again couple $T_S$ to $T_K$ through the Wouthuysen-Field effect (Wouthuysen 1952; Field 1958). The spin temperature is again driven to the gas temperature $T_K$, which creates an absorption signature relative to $T_\mathrm{CMB}$. The redshift, duration and amplitude of the absorption trough due to the Lyman-α coupling depend directly on the details of reionization. Detection of this absorption trough was first reported by EDGES (Bowman et al. 2018). However, the reported profile had an unexpectedly large amplitude and flattened shape, considerably different from what was predicted by astrophysical and cosmological models (Reis et al. 2021). This has spurred theoretical interest in explaining the profile with excess radio background models, millicharged dark matter models and non-ΛCDM cosmology (Ewall-Wice et al.
2018; Fialkov & Barkana 2019; Barkana 2018; Hill & Baxter 2018). Other works have shown that the profile may be the result of an unaccounted-for systematic (Singh et al. 2022; Hills et al. 2018; Singh & Subrahmanyan 2019; Bradley et al. 2019; Sims & Pober 2020a; Bevins et al. 2021). As has been discussed by Tauscher et al. (2020a); Spinelli et al. (2022); Anstey et al. (2022); Sims & Pober (2020a), improper modelling of the beam in the analysis of the data is one route through which such a systematic can be introduced. Finally, as the first generation stars begin emitting X-rays, the gas temperature is heated, causing $T_S$ to create an emission spectrum relative to $T_\mathrm{CMB}$ (Pritchard & Loeb 2012; Liu & Parsons 2016; Mirocha 2019; Mirocha et al. 2021; Furlanetto 2006; Fialkov et al. 2020; Cohen et al. 2017; Liu et al. 2013; Bernardi et al. 2016; Harker et al. 2012). Clearly, much of the physics of CD and the EoR is encoded in precision measurements of the global signal.
One of the difficulties in detecting the global 21-cm signal is the radio foregrounds (hereafter referred to simply as foregrounds), which are 2-5 orders of magnitude brighter than the expected 21-cm signal within the expected redshift range of CD and the EoR. The effect of the foregrounds on global signal experiments has been well characterized in previous works (Bernardi et al. 2009, 2010; Liu et al. 2009a; Shaver et al. 1999). The foregrounds are mostly due to synchrotron radiation, which causes them to be smooth as a function of frequency; several studies have taken advantage of this spectral smoothness to extract the 21-cm signal (Liu et al. 2009b). Several techniques employ a model-and-subtract strategy, where the global signal is extracted by removing the foregrounds (Gleser et al. 2008). More generally, a diverse set of foreground mitigation techniques has been well developed (Tauscher et al. 2018; Rapetti et al. 2020; Tauscher et al. 2020b, 2021; Harker et al. 2009; Bowman et al. 2009; Sims et al. 2023).
The Radio Experiment for the Analysis of Cosmic Hydrogen (REACH) has embraced a forward modelling approach to mitigate the foregrounds in order to detect the global 21-cm signal. A forward model approach requires modelling the instrument, the atmospheric environment, as well as the individual components of the radio sky, i.e., the galactic and extragalactic radio foregrounds and the cosmic 21-cm signal (Anstey et al. 2022, 2021; Shen et al. 2021). These radio components and all relevant errors are propagated through the analysis. Forward modelling the foregrounds requires modelling the spatial distribution of the foregrounds as well as their spectral distribution. This was previously explored in Anstey et al. (2021). To model the spatial distribution of the foregrounds, Anstey et al. (2021) used radio sky models such as Landecker & Wielebinski (1970); Remazeilles et al. (2015); de Oliveira-Costa et al. (2008); Zheng et al. (2017). These radio sky maps were measured without detailed error maps, with only quoted uncertainties on the offset and scale errors in the map (for example, at 150MHz it has been shown by Singh et al. (2017); Bowman et al. (2018) that the uniform scale errors are at the ∼5%-11% level). Thus, forward modelled foregrounds relying on these maps will contain spatial temperature perturbations (or spatial "amplitude" perturbations) relative to the dataset that is observed by the instrument. In this paper we extend the foreground model in Anstey et al. (2021) to account for these spatial errors in the foregrounds.
In this paper we build on the work of Anstey et al. (2021) and show that perturbations in the amplitude of these radio maps can cause systematics in the analysis of global 21-cm signal experiments which, if unaccounted for, will bias or prevent recovery of the true signal. As yet there is no accepted error model for these radio sky maps. In this work we therefore study a bracketing range of error scenarios and morphologies.
We also study the effect that temperature offsets in radio sky maps have on our analysis.We then introduce a framework by which we can account for the amplitude and offset errors within the analysis.We show that our framework can recover the 21-cm signal without bias.We perform the analysis of this framework using the REACH experiment as our fiducial instrument; however, this framework is applicable to any global 21-cm experiment.
This paper is structured as follows. In Section 2 we introduce our fiducial global 21-cm experiment, REACH, and discuss our forward model of the foregrounds as well as our procedure to create simulated datasets. In Section 3 we discuss the range of possible error scenarios in radio sky maps and the systematics that they cause in global 21-cm analysis. We also introduce our most realistic foreground map error scenario and use it as our fiducial foreground error realization in our dataset. In Section 4 we introduce the amplitude scale factor framework that we use to account for foreground amplitude systematics. In Section 5 we apply this framework to simulated datasets containing our fiducial foreground errors but assuming perfect knowledge of the spectral structure of the foregrounds. In Section 6 we again apply this framework to our simulated datasets containing our fiducial foreground errors, but now also assume imperfect a priori knowledge of their spectral structure and jointly fit for it, in the manner described in Anstey et al. (2021). We conclude in Section 7.

FIDUCIAL INSTRUMENT AND ANALYSIS PIPELINE
The Radio Experiment for the Analysis of Cosmic Hydrogen (REACH, de Lera Acedo et al. 2022) is a global 21-cm signal experiment located in the South African Karoo Desert. REACH will have observation windows covering frequencies 50MHz to 170MHz, corresponding to redshifts $z \simeq 28-7$. This is the expected redshift range for CD and EoR signals. REACH forward models the 21-cm signal, antenna beam and the galactic foregrounds as a function of frequency $\nu$. The observational data taken over a set of LSTs are then fit with a forward modelled sky plus an additional signal model component. In this section we describe our simulated datasets, foreground model and Bayesian analysis of the simulated datasets.

Bayesian Evidence
Given a model $M$ of the sky which is parameterized by a set of cosmological parameters $\theta$, we can infer the probability distribution of $\theta$ given the measured dataset $d$ for our particular model, i.e. $P(\theta|d, M)$. This is the posterior distribution in Bayes' theorem:

$P(\theta|d, M) = \frac{P(d|\theta, M)\,P(\theta|M)}{P(d|M)},$

where $P(d|\theta, M)$ is the likelihood, $P(\theta|M)$ is the prior on the parameter set $\theta$, and $P(d|M)$ is the Bayesian evidence. Throughout this paper we denote the Bayesian evidence as $Z$. The Bayesian evidence is the normalization factor which ensures that the posterior of Bayes' theorem is a properly normalized probability distribution function. To numerically estimate the posterior probability distributions $P(\theta|d, M)$ of the parameters, one draws samples from the prior distribution $P(\theta|M)$. Bayes' theorem can also be used for model selection, where one can determine the relative probabilities of models for the data via calculation of the Bayesian evidence,

$Z = P(d|M) = \int P(d|\theta, M)\,P(\theta|M)\,\mathrm{d}\theta.$

In order to compute the Bayesian evidence for each model we use the nested sampler PolyChord (Handley et al. 2015a,b; Skilling 2004), which greatly decreases the computational cost required to compute the Bayesian evidence. PolyChord operates by placing $n_\mathrm{live}$ live points within the prior volume and then slice sampling the prior subject to the constraint that the new point has a higher likelihood than the lowest-likelihood live point. That lowest point is then discarded and the process repeats, generating new live points at each iteration. The sampling thus proceeds from regions of low likelihood to high likelihood (see Handley et al.
(2015a,b) for more details). The Bayesian evidence is implicitly computed as the sampler gradually makes its way to the region of high likelihood. This sampling technique is also more effective at sampling multi-modal posteriors than Markov chain Monte Carlo (MCMC) methods. The relevant PolyChord parameters for our analysis are $n_\mathrm{prior}$, the number of samples drawn from the prior; $n_\mathrm{fail}$, the number of consecutive failed sample draws at which the algorithm should terminate; and $n_\mathrm{repeats}$, which sets the number of repeats of the slice sampling procedure. Throughout our analysis we use the default values for these parameters: $n_\mathrm{live} = 25 n_\mathrm{dim}$, $n_\mathrm{prior} = 25 n_\mathrm{dim}$, $n_\mathrm{fail} = 25 n_\mathrm{dim}$ and $n_\mathrm{repeats} = 5 n_\mathrm{dim}$, where $n_\mathrm{dim}$ is the number of parameters in the inference.
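To make the role of the evidence concrete, the following toy example (not part of the REACH pipeline; all names and numbers are illustrative) numerically evaluates $Z = \int P(d|\theta, M)P(\theta|M)\,\mathrm{d}\theta$ on a grid for two simple models of noisy data and compares them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: a constant 2 K offset plus Gaussian noise of known sigma.
sigma = 1.0
data = 2.0 + sigma * rng.standard_normal(50)

def log_likelihood(mu):
    """Gaussian log-likelihood of the data under a constant model mu."""
    return (-0.5 * np.sum((data - mu) ** 2) / sigma**2
            - 0.5 * data.size * np.log(2 * np.pi * sigma**2))

# Model A: mu free, with a uniform prior on [-10, 10].
mu_grid = np.linspace(-10.0, 10.0, 4001)
dmu = mu_grid[1] - mu_grid[0]
log_L = np.array([log_likelihood(mu) for mu in mu_grid])
prior_density = 1.0 / 20.0
# Z_A = integral of L(mu) * prior(mu) d(mu), evaluated in log space.
log_Z_A = np.log(np.sum(np.exp(log_L - log_L.max())) * prior_density * dmu) + log_L.max()

# Model B: no free parameters (mu fixed at 0), so Z_B is just the likelihood.
log_Z_B = log_likelihood(0.0)

# ln B > 0 favours the model with the free offset, despite its Occam penalty.
log_B = log_Z_A - log_Z_B
print(f"ln Z_A = {log_Z_A:.1f}, ln Z_B = {log_Z_B:.1f}, ln B = {log_B:.1f}")
```

Grid integration of this kind is intractable for the tens of parameters in the foreground models considered here, which is why a nested sampler such as PolyChord is used instead.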

Data Simulations
In this section we describe our simulated dataset. The temperature of the sky within our observation band is modelled as the sum of the radio foregrounds, the CMB and the cosmological signal:

$T_\mathrm{sky}(\nu, \theta, \phi) = T_{21}(\nu) + T_\mathrm{FG}(\nu, \theta, \phi) + T_\mathrm{CMB}.$

The temperature of the radio foregrounds is dominated by synchrotron and free-free emission, which are individually well described by spatially dependent spectral power laws. Following Anstey et al. (2021), we form a model for these foregrounds as

$T_\mathrm{FG}(\nu, \theta, \phi) = (T_\mathrm{base}(\theta, \phi, \nu_0) - T_\mathrm{CMB})\left(\frac{\nu}{\nu_0}\right)^{-\beta(\theta, \phi)}, \quad (5)$

where $\beta(\theta, \phi)$ is the spatially varying spectral index and $\nu_0$ is the reference frequency of the foreground model, which we set to 408MHz. A map of the spectral index variation across the sky was previously derived in Anstey et al. (2021); the resulting spectral index map is shown in Figure 1. $T_\mathrm{base}$ is the temperature of the sky measured at the reference frequency $\nu_0$. This map, which we refer to as the foreground "basemap", is then extrapolated in frequency according to $\beta(\theta, \phi)$. Since REACH is a drift scan experiment, the resulting temperature of the sky is then rotated for the time and observation location of the instrument, and the dataset as seen through the beam is obtained by convolving the rotated sky with the beam at each frequency. This simulation framework has previously been used in related REACH analyses (Scheutwinkel et al. 2022b), and to model select across noise models (Scheutwinkel et al. 2022a).
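The basemap extrapolation of Equation 5 amounts to a per-pixel power-law scaling. A minimal numpy sketch (the map values and spectral index below are illustrative toys, not the actual Haslam data):

```python
import numpy as np

T_CMB = 2.725  # K
NU_0 = 408.0   # MHz, reference frequency of the basemap

def foreground_model(T_base, beta, nu):
    """Extrapolate a basemap (K, measured at NU_0) to frequency nu (MHz)
    with a per-pixel spectral index map beta, as in Equation 5."""
    return (T_base - T_CMB) * (nu / NU_0) ** (-beta)

# Toy 8-pixel basemap (two bright "galactic plane" pixels) and a
# plausible uniform synchrotron spectral index.
T_base = np.array([20.0, 25.0, 400.0, 30.0, 22.0, 1500.0, 28.0, 35.0])
beta = np.full(8, 2.55)

T_70 = foreground_model(T_base, beta, 70.0)  # model sky at 70 MHz
print(T_70.round(1))
```

At 70 MHz the power law boosts the basemap by nearly two orders of magnitude, which is why even percent-level basemap errors matter for a millikelvin-level signal.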

Spectral Sky Model
In the previous section we introduced our fiducial map of the spatially varying spectral index, which was used to create the dataset. Since the true spatial variation of the spectral index is not known, the spectral indices must be fit for within our foreground model. This spectral model was the main focus of Anstey et al. (2021), which we briefly describe here. Fitting for the spectral index at each pixel on the sky is computationally infeasible; in order to model the spatial variation of the spectral index in a computationally feasible manner, the spectral map is split into $N_\mathrm{reg}$ regions of uniform $\beta$. The regions are selected such that the spectral indices are similar within each region. This framework becomes a closer approximation to the true spectral sky as the number of regions $N_\mathrm{reg}$ is increased. Our foreground model can then be written as

$T_\mathrm{FG}(\nu, \theta, \phi) = \sum_{n=1}^{N_\mathrm{reg}} M_n(\theta, \phi)\,(T_\mathrm{base}(\theta, \phi, \nu_0) - T_\mathrm{CMB})\left(\frac{\nu}{\nu_0}\right)^{-\beta_n}, \quad (10)$

where $n$ is the index of the $n$th spectral region. The values $M_n(\theta, \phi)$ are masks that take on the value of 0 or 1, thereby determining whether a pixel is part of the $n$th spectral region, which has spectral index $\beta_n$. In the left column of Figure 2 we show example regions where the spectral map from Figure 1 has been split into 4 regions (top) and 8 regions (bottom). The foregrounds are then rotated onto the coordinates of the beam at the specified LSTs (corresponding to the observational times), and the sky regions as measured by the instrument are computed by convolving each masked, rotated region with the beam. In practice, uncertainties in the beam model are also liable to introduce additional instrumental chromaticity (e.g. Sims & Pober 2020b; Spinelli et al. 2022), which can bias recovery of the 21-cm signal. Accounting for beam uncertainty in the context of the analysis pipeline presented here will be explored in future work (Anstey et al., in prep).
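A sketch of how such uniform-$\beta$ regions might be constructed from a spectral index map, and how Equation 10 is then evaluated. Simple quantile binning is used here for illustration; the actual region-selection procedure is that of Anstey et al. (2021):

```python
import numpy as np

def split_into_regions(beta_map, n_regions):
    """Group pixels with similar spectral indices into n_regions bins,
    returning an (n_regions, n_pix) boolean mask array M_n."""
    # Bin edges at quantiles of beta, so each region holds similar indices.
    edges = np.quantile(beta_map, np.linspace(0.0, 1.0, n_regions + 1))
    edges[-1] += 1e-9  # include the maximum value in the last bin
    labels = np.digitize(beta_map, edges) - 1
    return np.array([labels == n for n in range(n_regions)])

def regional_foreground(T_base, masks, betas, nu, nu0=408.0, T_cmb=2.725):
    """Evaluate Equation 10: one power law per region, summed over regions."""
    scalings = (nu / nu0) ** (-np.asarray(betas))  # shape (n_regions,)
    return np.sum(masks * (T_base - T_cmb)[None, :] * scalings[:, None], axis=0)

beta_map = np.array([2.4, 2.5, 2.6, 2.7, 2.45, 2.65])
masks = split_into_regions(beta_map, 2)
print(masks.astype(int))  # each pixel belongs to exactly one region
```

Each $\beta_n$ then becomes a single fitted parameter shared by all pixels in region $n$, which is what keeps the inference tractable.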

FOREGROUND ERRORS
Detailed foreground sky maps, such as those derived from the Haslam sky survey at 408 MHz (Remazeilles et al. 2015), the Landecker & Wielebinski (1970) sky survey at 150MHz, and the Guzmán et al. (2011) sky survey at 45MHz, are essential tools in analyses aimed at detecting redshifted 21-cm emission. Historically, sky surveys have not been accompanied by detailed error models characterizing the full covariance information in the map, which complicates detailed statistical forward modelling of the signals.
Subsequent analyses comparing these maps to absolutely calibrated spectrometer data have sought to characterize their offset and scale factor uncertainties (e.g. Patra et al. 2015; Monsalve et al. 2021). Additionally, recent surveys such as the LWA1 Low Frequency Sky Survey between 35 and 80 MHz (Dowell et al. 2017) and the EDA2 map of the southern sky at 159 MHz (Kriele et al. 2022) include more detailed uncertainty characterization, but do not have full-sky coverage and are lower in resolution than the Haslam sky survey at 408MHz. We will show in Section 5.2 that the correlation length (which here we equate with map resolution) of the uncertainties in the base map used in our forward model strongly affects reconstruction accuracy for a given complexity of foreground model. In the context of modelling the 21-cm global signal, fewer regions are required to model the foregrounds at a level sufficient for unbiased 21-cm signal recovery when the errors have shorter correlation lengths, and vice versa. Thus, increasing the correlation length of the uncertainties in the base map has the effect of requiring more degrees of freedom to model the foregrounds accurately. As such, here we make use of the ∼1° resolution Haslam sky survey at 408MHz as our fiducial base model $T_\mathrm{base}$. We then consider multiple error scenarios $T_\mathrm{error}$, bracketing the possible systematic errors caused by amplitude perturbations in the foreground maps. This allows us to study the systematics created in the analysis in each of these cases and characterize the dependence of the model complexity on both foreground error level and correlation length (see Section 5.2 for details). In Section 3.1 we describe our method to assess whether the foreground error realisations produce significant systematics in our analysis. In Section 3.2 we build intuition on different foreground error scenarios, while in Section 3.3 we introduce our method for generating more realistic foreground amplitude errors.

Methodology For Assessing Foreground Errors in the Analysis
To assess whether a particular foreground error scenario produces significant systematics in our analysis, we create a simulated dataset using the procedure described in Section 2.2, but with our noise realization added to the basemap in Equation 5, i.e.

$T^\mathrm{true}_\mathrm{base}(\theta, \phi) = T_\mathrm{base}(\theta, \phi) + T_\mathrm{error}(\theta, \phi),$

where $T_\mathrm{error}(\theta, \phi)$ is the temperature map of the noise realization and $T^\mathrm{true}_\mathrm{base}$ is the true temperature of the sky. To isolate the effect that the temperature perturbation errors have on the data analysis pipeline, we do not add a 21-cm signal to the dataset. Note that for brevity we will refer to the spatial temperature perturbations in the basemap as amplitude errors. This condition essentially makes our assessment of the foreground errors a null test: there is no 21-cm signal inside the dataset, and so any recovered 21-cm signal is due to a systematic caused by the foreground amplitude errors. The null test is passed if the amplitude of the recovered 21-cm signal is consistent with zero at 1σ. Note that one can also use the Bayesian evidence to test whether a signal is preferred, by comparing the evidence for a model with a signal component versus one without. Our dataset is constructed by substituting $T^\mathrm{true}_\mathrm{base}$ for $T_\mathrm{base}$ in Equation 5; the remainder of the simulated dataset procedure remains the same as discussed in Section 2.2. The simulated dataset includes the estimated noise realization, while the model does not include noise.
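The pass/fail criterion above can be expressed compactly: given posterior samples of the recovered 21-cm signal amplitude, the null test passes when zero lies within the 1σ range. A schematic check (the posterior samples here are mocked up with Gaussian draws, not taken from the real pipeline):

```python
import numpy as np

def passes_null_test(amplitude_samples, n_sigma=1.0):
    """True if the recovered signal amplitude is consistent with zero
    at the n_sigma level (|posterior mean| < n_sigma * posterior std)."""
    samples = np.asarray(amplitude_samples)
    return abs(samples.mean()) < n_sigma * samples.std()

rng = np.random.default_rng(1)
unbiased = rng.normal(0.00, 0.025, 5000)  # recovered amplitude ~ 0 +/- 25 mK
biased = rng.normal(0.20, 0.025, 5000)    # a 200 mK spurious "signal"

print(passes_null_test(unbiased), passes_null_test(biased))
```

In the biased case the spurious amplitude is many σ from zero, so the null test fails even though the fit itself converges.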
In our foreground model we fix the number of spectral regions to $N_\mathrm{reg} = 10$, which (for this antenna and these observation settings) has been shown by Anstey et al. (2021) to lead to unbiased recovery of the 21-cm signal. Thus $N_\mathrm{reg} = 10$ provides sufficient flexibility to account for the spatial variation of the spectral index in the sky. By fixing $N_\mathrm{reg} = 10$, we avoid adding additional spectral components in the foreground model to fit for amplitude perturbations in the foreground basemap. This assumption allows us to better isolate the effect of the amplitude errors in the analysis. Note that we only fix $N_\mathrm{reg} = 10$ in this section. In the following section we consider a range of scenarios for $T_\mathrm{error}(\theta, \phi)$ and, using this methodology, study their effect on the analysis.

Error Scenarios
Since we do not precisely know the nature or morphological structure of the errors present in the foreground map, we consider a range of error scenarios and amplitudes with the aim of bracketing the possibilities for the systematics. Roughly, the errors we have considered fall into two categories: (i) Noise Scenario 1: homoscedastic errors; (ii) Noise Scenario 2: heteroscedastic errors. Scenario 1 corresponds to error realisations where the standard deviation of the errors is uniform over the sky. In Scenario 2 we consider the case where the standard deviation of the errors is proportional to the temperature in each pixel of the basemap. In each scenario, we assess the systematics produced by the presence of these errors in the analysis through the methodology and observation settings discussed in Section 3.1. To produce the errors in Scenario 1, we add random Gaussian noise with mean $\mu = 0$ and standard deviation $\sigma_\mathrm{noise} = \Delta \cdot \bar{T}_\mathrm{sky}$, where $\bar{T}_\mathrm{sky}$ is the mean temperature of the map and $\Delta$ is a dimensionless parameter which we vary over [0.01, 0.05, 0.1], corresponding to random Gaussian error fluctuations from 1% to 10% of the mean temperature in the map. To imprint correlation structure into our error realisation, we smooth the map by convolving the error realisation with a symmetric Gaussian beam. By modifying the FWHM of the Gaussian beam we can produce error realisations which are correlated on a physical length scale on the sky, $\theta_\mathrm{FWHM}$, measured in radians. We consider noise correlation scales corresponding to full width at half maximum (FWHM) values ranging from $\theta_\mathrm{FWHM} = 0$ to $\theta_\mathrm{FWHM} = 10°$, as well as the limiting case $\theta_\mathrm{FWHM} \to \infty$. The extreme correlation scenarios $\theta_\mathrm{FWHM} \to \infty$ and $\theta_\mathrm{FWHM} \to 0$ correspond to a uniform offset in the map and uncorrelated random Gaussian noise, respectively.
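The two scenarios differ only in how the per-pixel standard deviation is set. A minimal sketch on a flat pixel grid (the paper works with spherical maps and a spherical Gaussian beam; the flat grid and separable smoothing here are simplifications for illustration):

```python
import numpy as np

def gaussian_kernel(fwhm_pix):
    """1D Gaussian kernel with the given FWHM, in pixels."""
    sigma = fwhm_pix / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    x = np.arange(-int(4 * sigma + 1), int(4 * sigma + 1) + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smooth(err_map, fwhm_pix):
    """Separable Gaussian smoothing to imprint a correlation length."""
    k = gaussian_kernel(fwhm_pix)
    rows = np.apply_along_axis(np.convolve, 1, err_map, k, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="same")

rng = np.random.default_rng(42)
T_sky = rng.uniform(15.0, 40.0, size=(64, 64))  # toy basemap, K

delta = 0.05
homo = rng.normal(0.0, delta * T_sky.mean(), size=T_sky.shape)   # Scenario 1
hetero = rng.normal(0.0, 1.0, size=T_sky.shape) * delta * T_sky  # Scenario 2

homo_corr = smooth(homo, fwhm_pix=8)  # correlated version of Scenario 1
```

Note that smoothing suppresses the per-pixel variance of the realisation; Section 3.3 discusses restoring the total power after smoothing.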
In the left panel of Figure 3 we summarize the results for error Scenario 1. For each noise scenario we indicate whether the error realisation has passed or failed the null test according to our criteria above. We find that uncorrelated Gaussian errors with fluctuations up to 5% of the mean temperature of the map do not produce significant enough systematics for our analysis to fail the null test described in Section 3.1. As we progressively increase the smoothing scale of the errors from $\theta_\mathrm{FWHM} = 0$ to $\theta_\mathrm{FWHM} = 10°$, we find that the amplitude of the systematics introduced into the data increases. In the limit that the errors are smoothed with $\theta_\mathrm{FWHM} \to \infty$ (i.e. a uniform offset), even errors at the 1% level produce a sufficiently large systematic for the null test to fail at the 25mK noise level in the data considered here.
In Scenario 2 we consider the heteroscedastic version of Scenario 1, in which the random Gaussian noise has spatially dependent fluctuations, i.e. $\sigma_\mathrm{noise}$ acquires a spatial dependence. We again produce a range of error scenarios with temperature fluctuations $\sigma_\mathrm{noise} = \Delta \cdot T_\mathrm{sky}(\theta, \phi)$, where $T_\mathrm{sky}(\theta, \phi)$ is the temperature of the Haslam map at location $(\theta, \phi)$ on the sky and $\Delta$ is a unitless parameter which takes the same values as in Scenario 1. For each value of $\sigma_\mathrm{noise}$ we progressively smooth the map by convolving the errors with Gaussian beams spanning the same range of FWHM values as in Scenario 1. In the right panel of Figure 3 we summarize the results of these tests. Qualitatively, we find that increasing the amplitude of the error fluctuations $\Delta$ or the correlation length $\theta_\mathrm{FWHM}$ of the errors produces larger systematics in the analysis. Quantitatively, we find that the systematic created by the heteroscedastic noise realization with amplitude $\Delta > 5\%$ and correlation length $\theta_\mathrm{FWHM} \geq 1°$ is large enough to prevent recovery of the 21-cm signal. In Figure 4 we show an example of a systematic created by the scenario $\Delta = 5\%$, $\theta_\mathrm{FWHM} = 1°$. The blue curve represents the reconstructed 21-cm signal using the 21-cm model parameters described in Equation 8, while the red curves represent the dataset (see Equation 9) with the fitted foreground removed (Equation 10). The red curves can thus be interpreted as containing any 21-cm signal and residuals remaining after the fit. In an ideal outcome for the null test that we are conducting in this section, the blue curves are consistent with the true 21-cm signal (which is zero in this case, since it is a null test) and the red contours are reduced to the noise level of the dataset. Also note that the lighter blue and red parts of the plot represent the 3σ level of their respective contours. For fixed $\Delta$, increasing $\theta_\mathrm{FWHM}$ increases the likelihood that the null test fails. When $\theta_\mathrm{FWHM} \geq 1°$, all error scenarios lead to systematics such that the null test outlined in Section 3.1 is not passed. The largest systematics
are caused by scenarios with  FWHM = ∞ and Δ = 10%.We further test this error scenario with longer observation settings (6hr integration time) as well as observation times where the galactic plane is predominantly below the horizon.The systematics caused by this type of error scenario are present in all observation configurations but are lower for longer integration times and when the galaxy is below the horizon.
We consider the errors generated in Scenario 2 to be the most realistic, i.e. we make the assumption that the fluctuations of the errors depend on the local temperature. We also assume that the errors have structure with some unknown correlation length $\theta_\mathrm{FWHM}$. In the following section we build on the intuition developed here and introduce the fiducial error scenarios used in our analysis.

Fiducial Error Models
In the absence of an error structure for the radio foreground basemap, we need to construct our own estimate of a realistic error scenario. Our construction of the errormap is motivated by the expectation that the errors likely have spatial structure, i.e. they are not entirely uncorrelated, and that the magnitude of the errors is proportional to the absolute temperature of the basemap. Thus our expectation is that the error fluctuations are largest in the galactic plane, where the temperature is largest. As discussed in Remazeilles et al. (2015), the absolute calibration of the Haslam 408MHz map is thought to be accurate to better than 10%, with an overall zero-level uncertainty of ±3K. As a conservative estimate of the monopole error, here we assume that our map has an offset of 3 K. We do not expect our error realization to exactly match reality; however, we use it as our fiducial error realisation to study the effect that such noise scenarios have on our data analysis pipeline.
We now discuss our procedure to generate our error realization. This is a two step process: the first step is to construct the spatial error fluctuations, while the second step is to adjust the mean of the map to 3 K. To construct the spatial error fluctuations via a Gaussian-generated heteroscedastic errormap with amplitude proportional to the temperature at each location on the sky, we use the extended Global Sky Model (eGSM). Specifically, we use the eGSM absolute errors of the 408MHz Haslam map derived from Kim et al. (in prep), which we show in Figure 5. The eGSM is a data-driven model similar in spirit to previous global sky models published by de Oliveira-Costa et al. (2008) and Zheng et al. (2017), in that it takes empirical data from multiple surveys and interpolates over gaps in frequency and sky coverage using dimensional reduction methods such as principal component analysis. Importantly, the eGSM includes the Monte Carlo propagation of measurement errors in the input surveys through the interpolation process. The errors roughly trace the galactic morphology. Note that many low frequency surveys are missing data at the southern pole, which influences the eGSM model in this region. Admittedly, the final error bars are only rough estimates, given that many of the input surveys do not come with published uncertainties, making it necessary to simply assume ∼10% errors in some input maps. Nonetheless, Figure 5 likely captures some of the spatial patterns in the errors of typical sky models and is therefore sufficient for substantiating at least the qualitative lessons of this paper.
To go from the map of error bars in Figure 5 to an error realization, we first draw a noise realization from the absolute error map at 408 MHz. To do this, at each pixel in the basemap we draw a random value from a Gaussian distribution with mean μ = 0 and standard deviation σ_pixel proportional to the value of the absolute error in that pixel. This produces a heteroscedastic noise realisation of zero mean, with error fluctuations proportional to the absolute errors at 408 MHz. The fractional error fluctuations in our noise realization are largest in the Galactic plane and smaller away from the Galactic centre. In the upper left of Figure 6 we show the resulting error map. Note the areas of high foreground map uncertainty at the southern pole (lower right portion of the basemap); as noted above, many low-frequency surveys do not cover the southern pole, which influences the eGSM model in this region. In practice, with particularly poor empirical constraints in those parts of the sky, it would be sensible for experiments to avoid those regions. This may impose further limitations on global signal experiments that are unable to avoid the southern pole; we intend to address this in future work. We select our observation time such that these regions are below the horizon at the time of observation, setting our observation to start at 6 pm on 2021-01-01. Unless otherwise stated, we use this observation time for the remainder of our analysis.
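As a sketch, the per-pixel draw can be written as follows (using numpy; the error-bar values below are illustrative stand-ins for the eGSM error map, which in practice would be a HEALPix array):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-in for the eGSM per-pixel absolute-error map at 408 MHz
# (in practice a HEALPix map; values here are purely illustrative).
error_bars = np.array([1.0, 2.0, 5.0, 10.0, 0.5])  # Kelvin, one per pixel

# One heteroscedastic noise realization: a zero-mean Gaussian draw per
# pixel, with standard deviation set by that pixel's absolute error.
noise_realization = rng.normal(loc=0.0, scale=error_bars)

# Sanity check over many realizations: the per-pixel scatter of the
# draws should track the input error bars.
samples = rng.normal(loc=0.0, scale=error_bars, size=(200_000, error_bars.size))
```

The resulting realization has zero mean by construction, while its spatial variance follows the absolute-error map, as described above.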
To implement correlations into our noise realization, we smooth the map by convolving our noise realization with a Gaussian beam of FWHM θ_FWHM. Since we do not know the true correlation structure of the errors, we consider multiple smoothing scales chosen to encompass a range of physical scenarios. These correlation scales are chosen to roughly correspond to the resolutions of the 150 MHz and 408 MHz empirical foreground maps from Landecker & Wielebinski (1970) and Remazeilles et al. (2015), which are approximately 5° and 1° FWHM respectively. We make the assumption that the correlation lengths of the error features roughly match the resolutions of these maps. Note that smoothing the map obscures the features of the original eGSM map: for small smoothing scales the morphology of the underlying eGSM map is still evident, but with increasingly large smoothing scales its features are entirely obscured. In the extreme case θ_FWHM → ∞, the errors are totally independent of the eGSM map (assuming they all have the same mean). Convolving our error realisation with a Gaussian beam reduces the total spatial power in the map, which we wish to preserve. To quantify how the total power in the map is affected, we first expand the original (pre-smoothed) error map in spherical harmonics,

T(θ, φ) = Σ_{ℓ=0}^{∞} Σ_{m=−ℓ}^{+ℓ} a_{ℓm} Y_{ℓm}(θ, φ),   (17)

where the index ℓ = 0, ..., ∞ denotes the multipole, which represents the angular scale on the sky, −ℓ ≤ m ≤ +ℓ, and the Y_{ℓm} are the spherical harmonics (built from the associated Legendre polynomials). The harmonic coefficients are given by

a_{ℓm} = ∫ T(θ, φ) Y*_{ℓm}(θ, φ) dΩ.

We can define the power spectrum C_ℓ of the temperature fluctuations in the map as the variance of the harmonic coefficients,

C_ℓ = ⟨|a_{ℓm}|²⟩,

where the angular brackets represent an ensemble average taken over many independent realizations of the map. Averaging over the different m modes, we can estimate the angular power spectrum as

Ĉ_ℓ = (1 / (2ℓ + 1)) Σ_{m=−ℓ}^{+ℓ} |a_{ℓm}|².

The total power Σ_{ℓ=0}^{ℓ_max} (2ℓ + 1) Ĉ_ℓ of the smoothed error map will be decreased relative to the original (before the convolution). This decrease in total power dampens the error fluctuations, reducing the amplitude of the errors relative to our original noise realization. To correct for the decrease in total angular power of the map and conserve the power of the error fluctuations, we scale the harmonic coefficients by √(P_0 / P_smooth), where P_0 is the total power of the error realisation before the smoothing process and P_smooth is the total power of the map after smoothing. We then compute the inverse transform of Equation 17 to return to configuration space. Finally, we re-adjust the mean by adding a uniform offset of 3 K, chosen to match the zero-level offset in Remazeilles et al. (2015). We do not expect this error realization to exactly match reality; however, we use it as our fiducial error realisation to study its effect on our data analysis pipeline. Note that our framework is applicable to any error scenario; thus the effectiveness of our framework in precisely recovering the 21-cm signal, as shown in Section 6, is not contingent on our specific choice of noise realization.
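The smooth-then-renormalize step can be illustrated with a one-dimensional analogue (an FFT convolution standing in for spherical-harmonic smoothing, which on the sphere would use e.g. healpy; all scales and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1-D stand-in for the error map: heteroscedastic Gaussian noise with
# spatially varying per-pixel error bars (all values illustrative).
n = 4096
sigma_pix = 1.0 + 0.5 * np.sin(np.linspace(0.0, 2.0 * np.pi, n))
noise = rng.normal(0.0, sigma_pix)

# Smooth by convolving with a Gaussian kernel of chosen FWHM
# (FFT-based, periodic boundaries).
fwhm_pix = 32
sigma_k = fwhm_pix / (2.0 * np.sqrt(2.0 * np.log(2.0)))
freqs = np.fft.rfftfreq(n)                              # cycles per pixel
kernel_ft = np.exp(-2.0 * (np.pi * freqs * sigma_k) ** 2)

noise_ft = np.fft.rfft(noise)
smoothed_ft = noise_ft * kernel_ft

def total_power(ft, n):
    """Total spectral power (Fourier analogue of the sum over C_l)."""
    return np.sum(np.abs(ft) ** 2) / n

p0 = total_power(noise_ft, n)            # power before smoothing
p_smooth = total_power(smoothed_ft, n)   # power after smoothing (reduced)

# Rescale all coefficients by sqrt(p0/p_smooth) to restore the total
# power, then transform back to configuration space.
error_realization = np.fft.irfft(smoothed_ft * np.sqrt(p0 / p_smooth), n)
```

Because the rescaling multiplies every coefficient by the same global factor, the total power of the smoothed map is restored exactly, while the correlated (smoothed) structure is retained.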

MODELING AMPLITUDE PERTURBATIONS IN OUR FOREGROUND MODEL
Our foreground model described in Section 2.3 relies on radio sky maps to construct our basemap. Since these radio sky maps are derived from measurements with no detailed error maps, the true amplitude of the radio sky will deviate from that of our foreground basemap by an amount that, given current uncertainties on low-frequency sky maps, will (almost) certainly preclude unbiased recovery of the 21-cm signal within a forward-modelling analysis of the type described in Section 2.2. To account for these temperature fluctuations relative to our model, as well as any global temperature offsets in our map, we introduce a new foreground model with amplitude correction factors (amplitude scale factors). Along with a linear offset term, we show that our model can account for these observational uncertainties. In this section we introduce our amplitude scale factor model, discuss our optimization procedure, and note possible limitations of our framework.

Amplitude Scale Factors
In this section we introduce our amplitude scale factor model. Recall from Equation 10 that the spectral map was split into N_β regions, each with a uniform spectral index β_j which is then fit for. The spectral map used to define these regions was shown in Figure 1, and example regions for N_β = 4 and N_β = 8 are shown on the left side of Figure 2. Our approach to accounting for amplitude perturbations in the basemap is analogous to this framework, but now applied to fitting for the true basemap in place of the spectral structure on the sky. To implement amplitude correction factors (hereafter amplitude scale factors) in our foreground model, we split the foreground basemap into N_s sub-regions, with each region having an associated multiplicative scale factor a_i that adjusts the temperature of the foreground basemap in that sub-region. Note that the N_s amplitude scale factor regions are not coincident with the N_β spectral regions, and thus the spectral masks do not align with the scale factor regions; this is a consequence of not using the spectral map in Figure 1 to define the N_s scale factor regions. On the right side of Figure 2 we show example amplitude scale factor regions defined using the Haslam basemap. Although the two sets of regions appear similar, since both follow the Galactic morphology, they are not identical. Having non-coincident N_β and N_s regions gives our model more flexibility, leading to higher fidelity fits. Our updated foreground model, including these amplitude scale factors, is given by

T_model(ν, θ, φ) = Σ_{i=1}^{N_s} Σ_{j=1}^{N_β} a_i M_i(θ, φ) M_{β,j}(θ, φ) T_base(θ, φ) (ν/ν_0)^{−β_j} + T_offset,   (20)

where the M_i(θ, φ) are masks applied to the foreground basemap which take the value 1 or 0 depending on whether a pixel is part of the i-th scale factor region. All pixels in the map comprising a scale factor region are then multiplied by a_i; thus any amplitude perturbations in the data relative to the foreground basemap can be adjusted for by scaling the model in that region by a_i. Similarly, N_β is the number of spectral regions and the M_{β,j}(θ, φ) are the associated masks for the spectral regions (as discussed in Section 2.2), where the β_j are the associated spectral indices for each spectral sub-region. There are thus N_β spectral parameters and N_s amplitude scale factor parameters in our foreground model. Note that in the special case where the N_s and N_β regions are perfectly coincident, the two sets of regions refer to the same areas of the sky, but one still has N_s and N_β parameters to adjust in the foreground model. The term T_offset is a constant added to the foreground model to adjust for any temperature offset in the basemap relative to the data. Note that an error scenario consisting only of a uniform offset corresponds to the errors in the rightmost columns of the left and right panels of Figure 3. The value of the amplitude scale factor a_i in each sub-region is adjusted, along with the offset T_offset, the β_j parameters, and the 21-cm signal parameters, using the Bayesian inference framework described in Section 2.1 with the likelihood in Equation 13, where T_model now refers to our modified foreground model (i.e. Equation 20). To determine the optimal number of regions N_s required to fit the data, we perform the inference analysis described in Section 2.1 for a range of models with different N_s. For each model we compute the Bayesian Evidence described in Section 2.1. The suitable number of regions to split the prior map into is then determined post-analysis by selecting the values of N_s and N_β that maximize the Bayesian Evidence. The Bayesian Evidence is penalized when additional complexity is added without substantially improving the fit (MacKay 1992). Determining N_s and N_β using this method also allows us to avoid using more amplitude scale factors than are required to fit the data, thus preventing overfitting. For other applications of this methodology in the field of 21-cm cosmology, the reader is encouraged to see e.g.
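A schematic numpy sketch of this kind of model may help fix ideas; the region assignments, parameter values, and the omission of beam weighting are all illustrative assumptions, not the exact form of Equation 20:

```python
import numpy as np

# Schematic sketch of a scale-factor foreground model: each pixel belongs
# to one of N_s scale-factor regions and one of N_beta spectral regions.
npix, nfreq = 1000, 151
nu = np.linspace(50e6, 200e6, nfreq)   # observing frequencies [Hz]
nu0 = 408e6                            # base-map reference frequency (Haslam)

rng = np.random.default_rng(1)
T_base = rng.uniform(20.0, 2000.0, npix)   # base-map temperatures [K]
scale_region = rng.integers(0, 4, npix)    # N_s = 4 region labels
spec_region = rng.integers(0, 3, npix)     # N_beta = 3 region labels

a = np.array([0.95, 1.02, 1.00, 1.08])     # amplitude scale factors a_i
beta = np.array([2.5, 2.7, 2.9])           # spectral indices beta_j
T_offset = 3.0                             # monopole offset [K]

# Per-pixel model spectrum a_i * T_base * (nu/nu0)^(-beta_j), followed by
# a simple sky average plus the offset (beam weighting omitted here).
per_pix = (a[scale_region, None] * T_base[:, None]
           * (nu[None, :] / nu0) ** (-beta[spec_region, None]))
T_model = per_pix.mean(axis=0) + T_offset  # shape (nfreq,)
```

In the real pipeline, the region labels come from the prior-map binning, the sky average is replaced by a beam-weighted integral, and the a_i, β_j, and T_offset are sampled rather than fixed.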

Prior Maps & Limitations
In the previous section we introduced a foreground model with the flexibility to account for multiplicative errors in the true temperature of the foregrounds relative to our model. In this section we discuss the importance of an appropriate error prior map used to define the regions. To define the regions on the sky, we require a map which describes the spatial arrangement and amplitude of the errors. This map essentially acts as a prior on the amplitude scale factors, and we thus refer to the map used to define the regions as the "error prior map". To construct the regions, we bin the pixels of the error prior map into N_s segments, where N_s is the number of amplitude scale factors. The boundaries of the i-th bin are determined by computing the temperature corresponding to the 100i/N_s percentile in error, with each bin having a width of 100/N_s percentiles; the i-th bin therefore corresponds to pixels in the map lying between the 100(i − 1)/N_s and 100i/N_s percentiles. The pixels in each bin are then mapped to their corresponding locations on the sky. Note that this procedure can be applied using any map, so in principle any map can be used to define the N_s sub-regions. However, since the scale factors are multiplicative temperature factors, grouping regions of the sky with similar temperature perturbations increases the effectiveness of the amplitude scale factors within this framework. In Section 5.2 we show that defining the N_s sub-regions to coincide with the true fractional errors in the foreground map optimizes this approach by both reducing the number of regions and improving the fits. Using our best estimate of the foreground errors as our error prior is thus an important step in our procedure. We proceed in our analysis by presenting two scenarios which act as limiting cases. In the conservative approach, we use the 408 MHz Haslam basemap to define the regions; in the right column of Figure 2 we show example regions using the Haslam basemap, with the upper right panel corresponding to N_s = 4 and the lower right panel to N_s = 8. Using this map as our prior map is conservative, since the morphology of the absolute error map roughly follows the Galactic structure, and so it represents a scenario where we have not included any additional information about the errors in our framework. In our second scenario, we consider the limiting case of having perfect knowledge of the spatial structure (but not necessarily the amplitude) of the errors in the sky model. For this, we use the true fractional error map to define the regions, defined as

Δ(θ, φ) = δT(θ, φ) / T_base(θ, φ),   (21)

where T_base refers to the unperturbed basemap and δT refers to the error realisation contained in the basemap. Information on the per-pixel covariances in the eGSM map may become available in the future and would inform our models regarding the correlation structure of the errors. For now, we use this perfect prior map to illustrate the best-case performance of this approach. Unless otherwise mentioned, we define the N_s scale factor regions using the Haslam basemap from Remazeilles et al. (2015).
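The percentile binning described above can be sketched as follows (a minimal numpy implementation; the lognormal "prior map" is a synthetic stand-in):

```python
import numpy as np

def define_regions(prior_map, n_regions):
    """Assign each pixel an integer region label by binning the prior map
    into n_regions equal-percentile bins: the i-th bin spans the
    100*(i-1)/N to 100*i/N percentile of the map values."""
    edges = np.percentile(prior_map, np.linspace(0.0, 100.0, n_regions + 1))
    # Digitizing against the interior edges yields labels 0 .. n_regions-1.
    return np.digitize(prior_map, edges[1:-1])

rng = np.random.default_rng(0)
prior = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)  # synthetic prior map
labels = define_regions(prior, 8)
```

By construction each region contains (almost exactly) an equal number of pixels, grouping pixels with similar prior-map values regardless of their position on the sky.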
In the previous section we discussed how the Bayesian Evidence can be used to determine the optimal number of regions. One limitation of this framework for defining the regions is that adjacent N_s models are not subsets of one another, i.e. a model with N_s regions is not a subset of a foreground model with N_s + 1 regions. As a result, increasing the complexity of the foreground model by increasing N_s does not build on the previous model; instead, each model is an entirely new configuration of regions. This affects how the Bayesian Evidence of each model depends on N_s. In general, one would expect that as more parameters are added to the model, the fit to the data should improve, increasing the Bayesian evidence until the additional complexity is no longer required to describe the data, at which point the Bayesian Evidence decreases with N_s. Since, for most values of N_s, the amplitude scale factor regions are not nested, we no longer expect a strictly monotonic increase of the Bayesian Evidence as a function of N_s. This effect is more pronounced when the error prior map used to define the regions does not coincide with the true temperature perturbations in the data. In this scenario the placement of the regions with respect to the true fractional errors is suboptimal, and so certain models will have scale factor regions that are, by chance, better aligned with the true errors. In contrast, in the perfect error-structure information scenario, the region placement is already aligned with the errors in the map, and so the variation in ln(Z) from model to model is smaller. Moving forward, we refer to this scenario as having an ideal prior map.

Map Reconstruction
Outside the context of a global 21-cm signal experiment, one may use this approach to recover a model map of the radio sky without the error perturbations, i.e. to perform foreground map reconstruction. To do this, one uses the mean posterior model map as the best-fitting model of the sky as seen by the instrument. However, this framework should be used with discretion for map reconstruction. As discussed in Section 4, the mean temperature on the sky is the key metric of our model that is compared to the dataset within our Bayesian inference framework. As a result, our inference framework finds the optimally fitting value of a_i in each sub-region of our foreground model such that the mean temperature of the sky in our model fits the mean temperature of the simulated dataset. One therefore finds that the optimal value of a_i in the i-th sub-region is the spatially averaged value of all the amplitude perturbations within that sub-region. However, because each sub-region contains a distribution of temperature perturbations, and only a mean value of a_i is chosen, the boundaries between adjacent sub-regions will be disjointed. Our framework thus only adjusts the mean in each region and does not prioritize smoothness across region boundaries; additional operations would be required to smooth the map. Increasing the number of scale factor regions N_s will, however, create a smoother interface at the boundaries, improving the model reconstruction.
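The tendency of the best-fitting a_i toward a region-averaged perturbation can be illustrated with a toy calculation (all quantities synthetic; a simple weighted least-squares amplitude estimator stands in for the full inference):

```python
import numpy as np

rng = np.random.default_rng(3)
npix = 10_000
T_base = rng.uniform(20.0, 2000.0, npix)        # unperturbed basemap [K]
perturb = 1.0 + 0.06 * rng.normal(size=npix)    # multiplicative errors (~6%)
T_true = T_base * perturb                       # "data" sky
labels = rng.integers(0, 8, npix)               # N_s = 8 region labels

# Per-region best-fit scale factor: minimizing sum T_base*(p - a)^2 over a
# gives the temperature-weighted average of the perturbations in the region,
# a_i = sum(T_true) / sum(T_base).
a_fit = np.array([T_true[labels == i].sum() / T_base[labels == i].sum()
                  for i in range(8)])
```

Since the perturbations here average to unity, every fitted a_i lands close to 1; the scatter of individual pixels about a_i within each region is what produces the disjointed boundaries discussed above.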
The performance of map reconstruction is also dependent on the error prior map used to define the scale factor sub-regions. Using the true fractional error map will decrease the disjointedness at the boundaries between regions and improve the map reconstruction. This is because any region defined using the true fractional errors contained in the data will span a smaller range of amplitude perturbations, creating a more seamless transition across the boundary. Note that disjointed boundaries between regions do not impact the performance of our framework or introduce systematics into our analysis, because the temperature of each sub-region of the sky is summed according to Equation 20 without any data analysis procedure operating on the boundaries.

ISOLATED AMPLITUDE ERRORS
In this section we use our foreground model to fit a dataset constructed using a basemap containing the error realizations derived in Section 3.3. A 21-cm signal is included in the dataset using Equation 8, with amplitude, standard deviation and central frequency A_21 = 0.155 K, σ_21 = 15 MHz, and ν_21 = 85 MHz respectively, equivalent to the fiducial signal model in Anstey et al. (2021). Note that it has been shown by Anstey et al. (2022) that varying the amplitude and central frequency of the fiducial signal can affect the bias in the recovered signal in the REACH pipeline. We expect similar qualitative results to hold in our amplitude scale factor framework. Since the focus of this paper is to introduce our amplitude scale factor framework, and not to exhaustively explore practical signal recovery, we keep A_21, σ_21, and ν_21 fixed throughout the analysis. Note that we include a model of the conical log spiral beam when constructing our simulated datasets and when forward modelling the data to recover the 21-cm signal estimates (see Figures 3 and 4 in Anstey et al. 2021 for the details of our beam model). For example, in Equation 7 we take the sky, rotate it to zenith, multiply by the beam B at each frequency channel and at every point in space, and then integrate across the sky. In Equation 11 we likewise include the beam in our forward model of the sky. Thus even a flat spectrum would appear chromatic in our analyses. Note, however, that improper modelling of this antenna beam can also lead to biased recovery of the 21-cm signal. This will be explored in upcoming work (Anstey et al. in prep), where it will also be shown that parametrizing the beam can account for errors in the beam modelling and recover unbiased estimates of the 21-cm signal. Thus, in a realistic observational pipeline, one must jointly fit for the beam parameters and the N_s amplitude scale factors. In this paper we assume that the antenna beam is known without error in order to isolate the effect of amplitude scale factors on the data analysis pipeline. We also make the simplifying assumption that there is no spatial dependence to the spectral index: we fix β = 2.7 in the simulated dataset as well as in the foreground model used to fit it. We thereby isolate the effect that amplitude perturbations have on the analysis, without the effect of spectral errors; we consider the case of a spatially varying spectral index in the following sections. In Section 5.1 we apply our amplitude scale factor framework to a single error scenario from Section 3.3 to highlight the general trends between the number of amplitude scale factor regions and the Bayesian Evidence. We compute the optimal number of regions to describe the simulated dataset and show the recovered 21-cm signal for that model. In Section 5.2 we show how the optimal number of amplitude scale factors changes with the error properties of Section 3.3. In Section 5.3 we discuss the effect of using the true fractional error map discussed in Section 4.2 as a prior map to define the N_s regions.

Spectrally Uniform
In the previous section we introduced our amplitude scale factor framework. In this section we select one error scenario from Section 3.3 to illustrate our model and the general trends between the number of scale factors N_s and the Bayesian evidence. In the following section we apply our framework to all error scenarios introduced in Section 3.3. In Figure 7 we show the Bayesian evidence as a function of N_s for Δ = 6% and θ_FWHM = 1°. In the left panel a 21-cm signal is included in the dataset, while in the right panel no 21-cm signal is included. For each case, the dataset is fit with a foreground model that has N_s amplitude scale factors, one offset parameter T_offset, and either a signal model (red curves) or no signal model (black curve). Starting from N_s = 1 in the left panel, we see that as more degrees of freedom are added to the model, ln(Z) increases rapidly up to N_s = 16, which maximizes the Bayesian Evidence. The residuals for N_s = 1 and N_s = 16 are shown in the first column of Figure 8.
From this figure, we can see that the systematic created by the amplitude perturbations is substantially reduced as we increase the number of scale factors. Since one scale factor is equivalent to adjusting the mean of the foreground map, this result implies that the spatial fluctuations of the foreground errors about the mean of the error map also contribute to the bias in the signal recovery. Our amplitude scale factor framework can accommodate these fluctuations with more precision by adding more regions to the foreground model: the regions shrink and can begin to accommodate smaller features in the error map. In Figure 8 we show the residuals for the N_s = 3 model (first row, second column); the bias in the signal recovery has shrunk relative to the N_s = 1 model (first row, first column). In the lower left of Figure 8 we show the residuals for the model which maximizes the Bayesian evidence (N_s = 16). In this case, the amplitude, central frequency and standard deviation of the reconstructed signal are consistent with the true values at 1σ. Referring back to Figure 9, we see that as we continue to increase the number of regions beyond N_s = 16, the Bayesian Evidence slowly decreases. In the lower right panel of Figure 8 we show the residuals for N_s = 32. The amplitude and standard deviation of the systematics have decreased further; however, the additional flexibility added beyond N_s = 16 is not required to fit the data, and as a result there is a statistical cost to adding more parameters. This follows from Equation 4: adding parameters increases the prior volume, which penalizes the Bayesian evidence. Note that, as discussed in Section 3.2, observation times when the galaxy is above the horizon produce larger systematics in the data than when the galaxy is below the horizon. We observe that configurations with the galaxy above the horizon require more regions N_s to maximize the Bayesian evidence than galaxy-down configurations. Our model is applicable in each case; however, one should expect the N_s required to maximize the Bayesian Evidence to depend on observation time. Also note the scale difference between the red and black curves in the left and right panels of Figure 7: a Gaussian 21-cm model component can be used as an extra degree of freedom to fit foreground systematics (see Figure 13), but a foreground model does not have the flexibility to fit a Gaussian 21-cm signal. Thus (i) if there is a 21-cm signal in the data, we will notice a large difference in evidence between sky models which do and do not include a 21-cm component;
(ii) if there is no 21-cm signal in the data, we will notice smaller differences in evidence between sky models which do and do not include a 21-cm component.
Note that a more flexible 21-cm model might therefore covary with systematics, making it more difficult to separate each component in the modelling. We plan to investigate this further in future work.
To quantify the fit of the recovered signal to the true signal, we compute the Figure of Merit (FoM) introduced in Anstey et al. (2021), defined as

FoM = s_amp [ (1/N_ν) Σ_{i=1}^{N_ν} (δT_true(ν_i) − δT_fit(ν_i))² ]^{−1/2},   (22)

where N_ν = 151 is the number of frequencies, s_amp = 0.155 K is the fiducial amplitude of the signal, and δT_true and δT_fit are the true and recovered signals respectively. A larger FoM indicates a more accurately recovered 21-cm signal. In the bottom panels of Figure 7 we show the FoM as a function of N_s for our models. Note that because the regions are entirely reconfigured between two consecutive models N_s and N_s + 1, sky models with N_s regions are not a subset of those used in an analysis with N_s + 1 regions. Thus there is no requirement that the FoM or the Bayesian Evidence ln(Z) monotonically decrease (or increase) with N_s, and fluctuations in the FoM and ln(Z) beyond the Evidence-maximizing model are expected.
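A minimal sketch of this FoM, under our reading that it is the fiducial amplitude divided by the RMS residual between the true and recovered signals (the test spectra below are synthetic):

```python
import numpy as np

def figure_of_merit(t_true, t_fit, s_amp=0.155):
    """FoM sketch: fiducial amplitude over the RMS residual between the
    true and recovered signals, so a larger FoM means a better recovery."""
    rmse = np.sqrt(np.mean((t_true - t_fit) ** 2))
    return s_amp / rmse

nu = np.linspace(50.0, 200.0, 151)                            # MHz, 151 channels
t_true = -0.155 * np.exp(-0.5 * ((nu - 85.0) / 15.0) ** 2)    # fiducial Gaussian
t_fit_good = t_true + 1e-3 * np.sin(nu / 10.0)                # small residual
t_fit_poor = t_true + 2e-2 * np.sin(nu / 10.0)                # large residual
```

With these synthetic recoveries, the smaller-residual fit yields the larger FoM, matching the interpretation used in the text.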

Effect of the Amplitude of Fluctuations Δ and Correlation Length θ_FWHM of the Errors on the Analysis
In Section 3.3 we introduced our fiducial foreground error model, which was parametrized in terms of the amplitude of the fluctuations Δ and the FWHM of correlated features θ_FWHM. In this section we apply the amplitude scale factor foreground model described in Section 4.1 to a dataset constructed using a foreground map which contains these fiducial error scenarios. In Figure 9 we show the Bayesian Evidence for each scenario. We see that the number of amplitude scale factor regions required to maximize the Bayesian Evidence depends on the level of the fluctuations Δ and the θ_FWHM of the error realisation in the datasets. From Figure 9 it is evident that, for fixed θ_FWHM, increasing the Δ of the errors in the dataset increases the optimal number of amplitude scale factors required to fit the data. This is an expected result. Consider a foreground model with N_s scale factor regions. Any systematics remaining in the analysis are due to small fractional differences in temperature of the foreground dataset relative to the foreground model. Thus, uniformly increasing the mean fractional differences of the entire foreground map used to construct the dataset relative to the basemap will scale any systematics remaining in the analysis, and scaling the temperature fluctuations therefore requires more precision in the foreground model to reduce the systematic. This trend is independent of how the scale factor regions are defined, i.e. of the prior map used to define the regions.

[Displaced figure caption (likely Figure 7): ... FoM (Equation 22) for the recovered signal model (red curve); a larger FoM suggests a more accurately recovered signal. Right panel: no 21-cm signal is included in the dataset, in which case a sky model without a 21-cm signal component is preferred, since the maximum value of ln(Z) occurs for the black curve. The blue curves correspond to a model where an ideal prior map (i.e. the fractional errors in Equation 21) is used to define the regions. The Δ ln(Z) curve is the difference in Bayesian evidence between the black and red curves. We use a one-hour observation duration in these simulations, chosen to occur when the area of high foreground map uncertainty near the southern pole (lower right of the basemap in the upper left of Figure 6, present in each map but most pronounced in the unsmoothed one, and due to many low-frequency surveys missing the southern pole) is mostly below the horizon. Since regions below the horizon are removed during our simulated observation, this area of high uncertainty does not enter our analysis; regions with the least foreground uncertainty are most promising for 21-cm signal recovery, reducing both the potential for systematics and bias and the foreground model complexity required to model the 21-cm signal to high fidelity.]
It is also clear from Figure 9 that increasing the FWHM of the error structures requires more scale factors to maximize the Bayesian evidence. To understand this, consider a region of the sky containing many small correlation structures, as in Figure 6. This leads to a scenario in which many independent error features can be contained within a single region and thus tend to average out, i.e. the noise is driven towards Gaussian random noise. However, as the FWHM of the correlation structures increases, a single error feature may overlap multiple regions, thus requiring more regions to isolate and localize independent error features.
Note that we do not consider multiple values of the global temperature offset in the foreground errors, since the linear offset term T_offset in Equation 20 makes our model robust to these scenarios. Under our model, the analysis of an error scenario with a 3 K offset is thus equivalent to one with a 0 K offset.

Limiting Cases
In the previous sections, we used the 408 MHz Haslam basemap to define the scale factor regions. The regions thus followed the Galactic morphology, which does not share the structure of the true fractional errors in the basemap. In this section we illustrate the effect that using a suitable prior map to define the N_s regions has on the performance of our framework. We use the fractional error map, as defined by Equation 21, to define the N_s regions. These two scenarios effectively bracket the extremes: one in which we know precisely the morphology of the amplitude perturbations, and one in which we simply assume the errors follow the Galactic morphology. In the future, information on the per-pixel covariances in the eGSM map may become available; for now, we use this perfect prior map to illustrate the best-case performance of this approach.
In Figure 7 (blue curve) we show the evolution of the Bayesian evidence as a function of N_s, using regions derived from the ideal prior map, for the error scenario Δ = 6% and θ_FWHM = 1°. The Bayesian evidence as a function of N_s using the ideal prior to define the regions shows similar qualitative trends, with two quantitative differences: (i) the maximum-evidence model has a larger Bayesian Evidence when the fractional error map is used to define the regions, and (ii) more regions are required for models using the fractional errors.
These quantitative differences are due to the morphological differences between the absolute error map (which follows the Galactic morphology) and the actual amplitude perturbations in the dataset. Since the fractional errors are better aligned with the true multiplicative errors in the map, the number of regions required to optimize the Bayesian Evidence increases from N_s = 16, using the absolute error map to define the regions, to N_s = 20, using the fractional errors. The maximum Bayesian Evidence is also larger when using the fractional errors to define the regions.
We show the residuals for the N_s = 20 maximum-evidence model in Figure 10. The residuals are smaller and more "noise-like" when using the fractional error map to define the regions, as compared to using the Galactic morphology. When using the Galactic morphology, the amplitude scale factor regions do not line up properly with the true fractional errors in the map. As a result, each region defined using the Haslam basemap encompasses a larger distribution of fractional errors and cannot capture the finer error features in the foreground error map; more regions are then required to describe the error structure present. We therefore continue to see a more substantive improvement in the residuals between the reconstructed 21-cm signal and the true signal as we go to large N_s, compared to what we saw for large-N_s models with the fractional error prior (where the benefit to the fits was smaller).
The blue curve in the right panel of Figure 7 shows the Bayesian evidence for a sky model without a 21-cm model component, using the fractional errors to define the regions. Here we again see that this increases the number of scale factor regions required to maximize the Bayesian evidence, from N_s = 28 (black curve) to N_s = 32 (blue curve).

SPATIALLY VARYING SPECTRAL INDEX
In this section we discuss our more realistic observation scenario, in which the foregrounds in the simulated dataset have a spatially varying spectral index as well as foreground basemap errors. To simulate a realistic data analysis where we do not have access to the true fractional error, we use the 408 MHz Haslam basemap to fit our simulated data. We thus add the formalism introduced in Anstey et al. (2021), described in Section 2.3, to our analysis. That is, we split the spectral index map in Figure 1 into N_β sub-regions, each with uniform β, in accordance with Equation 20. Since the spectral regions need not coincide with the N_a amplitude scale factor regions, there are N_a + N_β + 1 parameters to be optimized in the foreground model. We again use PolyChord to compute the Bayesian evidence for each model. In the interest of reducing computational cost, we use only one error scenario from Section 5.2. In Section 6.1 we select our error realisation and in Section 6.2 we show our results.
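As a concrete illustration of this parameter count, the per-pixel foreground model can be sketched as follows. This is a schematic only: the region maps, function name, and exact placement of the offset and CMB terms are simplified stand-ins for the model defined by Equations 7, 20 and Section 4, which are not reproduced here.

```python
import numpy as np

T_CMB = 2.725  # K

def foreground_model(nu, nu0, T_base, amp_region, beta_region, a, beta, offset):
    """Schematic per-pixel foreground sky at frequency nu (MHz).

    T_base      : basemap temperature per pixel at reference frequency nu0 (K)
    amp_region  : per-pixel region index (0..N_a-1) for the scale factors
    beta_region : per-pixel region index (0..N_beta-1) for the spectral indices
    a, beta     : the N_a scale factors and N_beta spectral indices
    offset      : monopole offset (K)
    """
    galactic = a[amp_region] * (T_base - T_CMB) + offset
    return galactic * (nu / nu0) ** (-beta[beta_region]) + T_CMB

# Toy sky: 8 pixels, N_a = 4 amplitude regions, N_beta = 2 spectral regions.
T_base = np.array([20.0, 25.0, 40.0, 60.0, 120.0, 300.0, 800.0, 1500.0])
amp_region = np.array([0, 0, 1, 1, 2, 2, 3, 3])
beta_region = np.array([0, 0, 0, 0, 1, 1, 1, 1])
a, beta, offset = np.ones(4), np.full(2, 2.5), 0.0

n_params = a.size + beta.size + 1  # N_a + N_beta + 1 = 7
sky_408 = foreground_model(408.0, 408.0, T_base, amp_region, beta_region, a, beta, offset)
# With unit scale factors and zero offset, the model reproduces the basemap at nu0.
```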

Error Selection
In this section we select the most realistic error scenario from Section 3.3. To do so, note that we expect the correlation length of the errors in the basemap to be inversely proportional to its resolution. Since in this work we use the Haslam map to construct our foreground model, we select the correlation length of our fiducial error scenario to match the 1° FWHM resolution of the Haslam map. We therefore use the ΔT = 6%, θ_FWHM = 1° scenario from Section 5.2. Furthermore, since the resolution of a basemap is related to its base frequency, maps at lower frequencies will have lower spatial resolution. For example, the empirically derived 150 MHz map introduced in Landecker & Wielebinski (1970) has a 5° FWHM resolution, compared to 1° for Haslam at 408 MHz. Under this assumption, we would expect more correlated errors in the Landecker map. For a fixed level of error power in the map, we would expect this increased correlation to introduce more significant foreground systematics and, correspondingly, larger biases in the recovered 21-cm signal in the absence of error modelling. In the context of the sky map error fitting formalism we present in this paper, we would expect a more complex error model to be required to explain the data when using the Landecker map as our basemap in an analysis of instrumental data.
Recall from Section 3.2 that increasing the FWHM of the errors increases the number of amplitude scale factor regions required to maximize the Bayesian evidence. Provided there is negligible change in the morphology of the foregrounds as a function of frequency, one can therefore strategically use higher-frequency basemaps in the foreground model to reduce the systematic impact of foreground amplitude errors in the analysis. However, increasing the base frequency of the foreground basemap might also increase the number of spectral regions N_β required to model the spatial variation of the spectral index. To see why, recall that the REACH observational frequencies range from 50 MHz to 170 MHz. Using a basemap at a higher frequency, i.e. Haslam at 408 MHz, requires more spectral precision in the foreground model, since the spectral model is extrapolated over a larger frequency range. Thus more spectral regions N_β are required relative to using a basemap set at lower frequencies. (This assumes that the error structure follows the Gaussian-generated errors discussed in Section 3.2.)
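The construction of our simulated error scenarios (a heteroscedastic Gaussian draw smoothed to a chosen correlation length, Section 3.2) can be sketched as below. The paper works on the sphere with a Gaussian beam; this flat-sky `scipy` version is only a sketch, with the FWHM given in pixels and a uniform stand-in for the absolute error map.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def error_realisation(sigma_map, fwhm_pix, rng):
    """Draw heteroscedastic Gaussian errors (per-pixel std sigma_map) and
    smooth them to a correlation length set by fwhm_pix (FWHM in pixels)."""
    noise = rng.normal(0.0, 1.0, sigma_map.shape) * sigma_map
    sigma_pix = fwhm_pix / (2.0 * np.sqrt(2.0 * np.log(2.0)))  # FWHM -> std dev
    return gaussian_filter(noise, sigma_pix)

rng = np.random.default_rng(0)
sigma_map = np.ones((64, 64))  # stand-in for the 408 MHz absolute error map
errors = error_realisation(sigma_map, 4.0, rng)
# Smoothing correlates neighbouring pixels and suppresses the per-pixel variance.
```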

Correlations Between N_a and N_β
The dataset we consider in this section contains spectral complexity as well as errors in the amplitude component of the basemap. To select the model which maximizes the Bayesian evidence, we must therefore optimize N_a and N_β over a two-dimensional grid of models. From Section 5.1 we know that, without spectral errors, N_a = 16 maximizes the Bayesian evidence. To gain intuition regarding how many spectral parameters might be required to sufficiently describe the spectral component of the dataset without amplitude scale factors in the basemap, we can perform the same analysis shown in Figure 10 of Anstey et al. (2021). Note that the analysis procedure in our work has two modifications relative to Anstey et al. (2021). The primary difference is that we use the percentile splitting methodology discussed in Section 4.2 to define the spectral regions. Secondly, we construct our foreground model using the Haslam basemap (at 408 MHz) rather than the GSM (at 230 MHz). For completeness, we perform the analysis for datasets which contain, and do not contain, a 21-cm signal. In Figure 11 we show the Bayesian evidence as a function of the number of spectral parameters N_β when the sky signal (i.e. the simulated dataset) does not contain a 21-cm signal (left) or contains a 21-cm signal (right). The red curves correspond to a sky model that contains a 21-cm signal component, while the black curves do not. Note that in the left panel, the highest-evidence models occur in the black curve, while in the right panel the red curves correspond to models with higher Bayesian evidence. In the left panel (without a 21-cm signal in the dataset) we see that the model N_β = 10 maximizes the Bayesian evidence, while in the right panel (with a 21-cm signal in the dataset) the model N_β = 12 maximizes the Bayesian evidence.
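The percentile splitting used to define the spectral regions can be sketched as follows: sort the map values (e.g. the spectral index map of Figure 1) into N_β equal-percentile bins and assign each pixel the index of its bin. The function name and details below are illustrative, not taken from our pipeline.

```python
import numpy as np

def percentile_regions(values, n_regions):
    """Assign each pixel of `values` to one of n_regions equal-percentile bins."""
    edges = np.percentile(values, np.linspace(0, 100, n_regions + 1))
    # Inner edges only, so digitize yields region indices 0..n_regions-1.
    return np.digitize(values, edges[1:-1])

# Toy map with 100 pixels split into 4 regions of equal pixel counts.
values = np.arange(100, dtype=float)
regions = percentile_regions(values, 4)
```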
The simulation we consider in this section has spectral errors in addition to basemap errors. To get a sense of where the Bayesian-evidence-maximizing model might reside on a two-dimensional grid of N_a and N_β models, recall that in Figure 7 we found that N_a = 16 regions maximized the Bayesian evidence without spectral complexity in the dataset. Similarly, in Figure 11 we see that N_β = 12 maximizes the Bayesian evidence when the dataset does not contain amplitude errors. Thus we would expect that N_a ≤ 16 and N_β ≤ 12 might maximize the Bayesian evidence. The "≤" is used instead of "=" since we allow for the possibility that N_a and N_β are correlated.
In Figure 12 we show the Bayes' factor, B ≡ ln(Z_max) − ln(Z), as a function of the number of amplitude scale factor regions, N_a, and spectral regions, N_β, used to construct the foreground model. The darkest regions of the grid correspond to models with the largest Bayesian evidence and thus smallest Bayes' factor B; in each square we denote the value of B. Qualitatively, we see that the Bayesian evidence is highest along a diagonal strip of models, affirming that there is a degree of correlation between N_a and N_β. The model N_a = 8, N_β = 7 maximizes the Bayesian evidence and is indicated with a 0 on its grid-point square. If multiple models had comparable Bayesian evidences, a model averaging technique, in which a weighted average is formed using their evidences, should be employed. In Figure 12, the model with the next highest evidence is N_a = 7, N_β = 12, which is Δln(Z) = 9 below the highest evidence. This 9 log-unit disfavouring relative to N_a = 8, N_β = 7 makes any model averaging technique equivalent to selecting N_a = 8, N_β = 7. In Figure 12, models with ln(Z) < 0 are labelled with "−". Models with this level of Bayesian evidence have systematics multiple orders of magnitude larger than the signal, making recovery of the 21-cm signal impossible. Note that models with N_a = 0 correspond to foreground models without any amplitude scale factors. Referring to the first column of Figure 12 (N_a = 0), it is evident that increasing the number of spectral regions, even to N_β = 18, cannot compensate for the systematics caused by the foreground amplitude errors. Thus amplitude errors in the foreground basemap cannot be accounted for by a foreground model with only spectral and linear offset parameters.
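The grid comparison above amounts to the following bookkeeping (with placeholder ln(Z) values): compute B = ln(Z_max) − ln(Z) over the grid and, if model averaging were needed, weight each model by its evidence. A gap of 9 log units gives the runner-up a weight of e^(−9) ≈ 10^(−4), which is why averaging is effectively equivalent to selecting the maximum-evidence model.

```python
import numpy as np
from scipy.special import logsumexp

# Placeholder evidence grid ln(Z)[i, j] over (N_a, N_beta) models.
ln_z = np.array([[265.0, 256.0],
                 [250.0, 241.0]])

bayes_factor = ln_z.max() - ln_z          # B = ln(Z_max) - ln(Z); 0 at the best model
weights = np.exp(ln_z - logsumexp(ln_z))  # posterior model probabilities (equal priors)

# The runner-up at B = 9 carries weight ~e^-9, so model averaging here is
# effectively equivalent to picking the maximum-evidence model.
```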
Similarly, examining the bottom row of Figure 12 (i.e. N_β = 1), we see that increasing N_a to 32 regions does not account for the spectral errors in the dataset. Qualitatively, we find that above threshold values of N_a and N_β there is a trade-off between the number of spectral regions and the number of amplitude scale factor regions.
In Figure 13 we show the recovered 21-cm signal for N_a = 8, N_β = 7, the model which maximizes the Bayesian evidence. Recall from Section 2.3 that our fiducial signal model has amplitude A_21 = 0.155 K, standard deviation σ_21 = 15 MHz and central frequency ν_21 = 85 MHz. The mean amplitude, standard deviation and central frequency of the recovered 21-cm signal are 0.136 K, 16.1 MHz, and 85.6 MHz, respectively. Thus we find that all of the signal parameters are within 1σ of the true values.
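For reference, the Gaussian signal model and this comparison can be written out explicitly. The absorption-trough sign convention and the function name are our own shorthand for the fiducial model of Section 2.3.

```python
import numpy as np

def gaussian_21cm(nu, a21, nu21, sigma21):
    """Gaussian absorption trough: T21(nu) = -a21 * exp(-(nu - nu21)^2 / (2 sigma21^2))."""
    return -a21 * np.exp(-0.5 * ((nu - nu21) / sigma21) ** 2)

nu = np.linspace(50.0, 170.0, 121)                # REACH band, 1 MHz steps
truth = gaussian_21cm(nu, 0.155, 85.0, 15.0)      # fiducial parameters (K, MHz, MHz)
recovered = gaussian_21cm(nu, 0.136, 85.6, 16.1)  # mean recovered parameters

depth = truth.min()  # trough depth of -0.155 K at nu = 85 MHz
```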

SUMMARY & CONCLUSION
Detection of the global 21-cm signal is key to our understanding of CD and the EoR. Unfortunately, foreground contaminants are orders of magnitude brighter than all theoretical models of the cosmological signal. While scientifically interesting constraints can be made even in the face of imperfect foreground removal (especially when combined with other probes; Bégin et al. 2022), realizing the full potential of global 21-cm signal measurements requires a robust foreground mitigation program. Global 21-cm signal experiments which forward model the instrument, the systematics and the radio contaminants have the potential to provide the strongest constraints on the global 21-cm signal. However, forward models of the foregrounds rely on observations of the radio sky which lack detailed uncertainty characterisation, so approaches that forward model the foregrounds are potentially limited by our empirical knowledge of the temperature of the radio sky. In this paper we have introduced a framework which can account for temperature deviations of the true radio sky relative to our model foreground map in global 21-cm experiments. We have constructed a foreground model in which the sky is segmented into different regions, each with an associated multiplicative scale factor which adjusts the amplitude of the foregrounds in that sub-region of the sky. By fitting for these amplitude scale factors within our Bayesian framework, we can account for temperature perturbations in the true radio sky relative to our foreground model. We select the number of sub-regions in our model by computing the Bayesian evidence over a large range of N_a and selecting the model which maximizes the evidence. Though we use the REACH experiment as our fiducial 21-cm experiment, our method is applicable to any global 21-cm experiment.
In this paper we used a 1 hr (in LST) observation with the conical log spiral antenna. We include a model of the conical log spiral beam in our dataset and forward model (see Equations 11 and 7), assuming the beam of the antenna is exactly known. We also assume that the mean 21-cm signal is Gaussian in frequency. Each of these simplifying assumptions will impact the performance of our foreground model insofar as REACH will observe over a larger LST range with a dipole plus conical log spiral antenna (de Lera Acedo et al. 2022; Anstey et al. 2023), and the true beam may deviate from the modelled beam. Thus, an investigation of the impact of LST range, beam parameters, realistic 21-cm models, antenna choice and joint analysis with multiple antennas on our posterior inference (which encompasses the foreground sky and the global 21-cm signal) provides an interesting direction we intend to pursue in future work.
Since there are no definitive models describing the amplitude and morphology of the errors in radio sky maps, we parametrically produce a range of simulated foreground errors by drawing a Gaussian noise realisation from the 408 MHz absolute error map and then smoothing the map with Gaussian beams of various FWHM. We perform our analysis with and without spectral complexity in the sky. In either scenario, our framework is effective at reducing the systematics due to the presence of the foreground map errors, allowing us to recover the 21-cm signal without bias. Our approach is limited by our knowledge of the nature of the errors in the radio sky; more work on modelling the foreground errors is therefore needed to maximize its effectiveness. Going forward, we may have access to a full error covariance, in which case this approach can be improved: with information regarding the correlation structure of the foreground errors, we can construct regions that more effectively match the error structure in the dataset. Since the scale factors are essentially multiplicative factors in the map, defining the regions using a map which informs the model about the location of the amplitude perturbations is the ideal scenario. In the absence of existing work on foreground error maps, we use the morphology of the Galaxy to define our regions. This conservative approach demonstrates how well our model can do without detailed knowledge of the true error structure. We show that even when using such a conservative error map to define the regions, our method is able to construct a foreground model of sufficiently high fidelity for unbiased recovery of the global 21-cm signal. Thus, the basemap error fitting framework presented here, in combination with the spectral structure fitting methodology presented in Anstey et al. (2021), represents a powerful tool for detecting the global 21-cm signal.

Figure 1. The spatial variation of the spectral index β, shown in Galactic coordinates. Bright yellow regions correspond to larger spectral indices, while darker regions correspond to smaller spectral indices. The morphology of β roughly follows the Galactic morphology.

Figure 2. In each panel we show the division of the regions on the sky (displayed in Galactic coordinates). The left column corresponds to splitting the spectral index map shown in Figure 1 into 4 regions (top) and 8 regions (bottom) (see Anstey et al. 2021). The right column corresponds to amplitude scale factor regions (see Section 4), obtained by splitting the Haslam map into 4 regions (top) and 8 regions (bottom).

Figure 3. Summary of the null test results described in Section 3.1 for the error scenarios described in Section 3.2. The horizontal axis corresponds to the FWHM of the error correlation length, while the vertical axis corresponds to the amplitude of the fluctuations. A red panel indicates that the error scenario leads to a systematic in the analysis large enough that the null test fails; a blue panel indicates that it does not. The left panel shows error scenarios in which the standard deviation of the errors is uniform across the sky, while the right panel shows error scenarios with a spatially dependent standard deviation.

Figure 4. Signal recovery plot for the heteroscedastic error realisation with ΔT = 5% and θ_FWHM = 1°. The blue curve represents the reconstructed 21-cm signal using Equation 8, while the red curve is the dataset (Equation 9) with the fitted foreground removed, i.e. Equation 10 with the best-fitting foreground parameters. The lightly shaded red and blue regions represent the 3σ regions for their respective contours.

Figure 5. Absolute errors in the temperature of the sky at 408 MHz, shown in Galactic coordinates. Note that the errors roughly trace the Galactic morphology. Also apparent is that many low-frequency survey datasets are missing the southern celestial pole, which influences the eGSM model in this region.

Figure 6. Our fiducial error scenarios at 408 MHz, each displayed in Galactic coordinates. In the top left we show a heteroscedastic error realisation generated by drawing values from a Gaussian distribution of zero mean and standard deviation equal to the absolute errors at 408 MHz. Moving clockwise from the top left panel, the error realisation is smoothed by a Gaussian beam of FWHM equal to 1° (top right), 5° (bottom left), and 10° (bottom right).

Figure 7. The Bayesian evidence, ln(Z), as a function of the number of amplitude scale factor regions N_a. The black curve indicates a scenario where no 21-cm signal is included in the sky model, while the red curves indicate sky models in which a 21-cm signal is included. Left panel: a 21-cm signal is included in the dataset. A 21-cm signal model component is preferred, since the maximum value of ln(Z) originates from the red curve. The Δln(Z) curve is the difference in Bayesian evidence between the red curve and the black curve. The dotted blue curve indicates the FoM (see Equation 22) for the recovered signal model (red curve); a larger FoM suggests a more accurately recovered signal. Right panel: no 21-cm signal is included in the dataset, in which case a sky model without a 21-cm signal component is preferred, since the maximum value of ln(Z) occurs in the black curve. The blue curves correspond to a model where an ideal prior map (i.e. the fractional errors in Equation 21) is used to define the regions. The Δln(Z) curve is the difference in Bayesian evidence between the black curve and the red curve. We use a one-hour observation duration in these simulations. Note the area of high foreground map uncertainty in the lower right portion of the basemap in the upper left of Figure 6 (this area is present in each map, but is most pronounced in the unsmoothed map). This area of high foreground map uncertainty is due to many low-frequency surveys missing the southern pole, which influences the eGSM model. We choose the one-hour observation to occur at a time when this area is mostly below the horizon. Since regions below the horizon are removed during our simulated observation, this area of high uncertainty in the lower right of the eGSM basemap does not enter our analysis. In particular, we avoid this part of the map since regions with the least foreground uncertainty are most promising for 21-cm signal recovery (this reduces the potential for systematics and bias in the analysis, which also reduces the foreground model complexity required to model the 21-cm signal to high fidelity).

Figure 8. Each plot shows the results of fitting our foreground model to the simulated data discussed in Section 5.1. The upper left panel corresponds to N_a = 1; moving clockwise, we have the models N_a = 3, 16, 32. The blue curves correspond to the fitted 21-cm signal, while the red curves correspond to the residuals of the dataset after the fitted foregrounds are removed.

Figure 9. The Bayesian evidence, ln(Z), as a function of the number of amplitude scale factor regions N_a for each error scenario described in Section 3.3. The black curves indicate sky models in which a 21-cm signal is included, while the blue curves indicate sky models with no 21-cm signal. Note that a 21-cm signal is included in the dataset, leading to the black curves having larger ln(Z) than the blue curves. We observe that the number of amplitude scale factor regions required to maximize ln(Z) tends to increase with the error fluctuation amplitude ΔT and correlation length θ_FWHM.

Figure 10. The reconstructed signal (blue) and residuals (red) for the maximum Bayesian evidence model, N_a = 20, corresponding to the blue curve on the left panel of Figure 7. In this model the regions were defined using an ideal prior, i.e. the fractional errors in the basemap as defined by Equation 21. Using this prior represents a best-case scenario where the true fractional errors in the amplitude of the basemap are known. Note that the residuals are smaller, and more consistent with the true signal (green), as compared to Figure 8.

Figure 11. The Bayesian evidence as a function of the number of spectral parameters N_β. The red curves indicate sky models that include a 21-cm signal component, while the black curves indicate sky models with no 21-cm signal. Left panel: no 21-cm signal is included in the dataset. A model without a 21-cm signal component is preferred, since the maximum value of ln(Z) originates from the black curve; quantitatively, the maximum value of ln(Z) for the black curve is 3 log units larger than for the red curve. Right panel: a 21-cm signal is included in the dataset, in which case a sky model with a 21-cm signal component is preferred, since the maximum value of ln(Z) occurs in the red curve.

Figure 12. The logarithm of the Bayes' factor as a function of the number of spectral parameters N_β and amplitude scale factors N_a. The gray shading in each square indicates the value of the Bayes' factor, i.e. ln(Z)_max − ln(Z), where ln(Z)_max = 265 corresponds to the model N_a = 8, N_β = 7 which maximizes the Bayesian evidence. All models with Bayes' factor > 240 or negative Bayesian evidence are saturated at 240. Notice that models with N_a = 1 are disfavoured even with large values of N_β, and vice versa; thus multiple spectral and scale factor regions are required. The model which maximizes the Bayesian evidence is highlighted in yellow.