FlexKnot and Gaussian Process for 21 cm global signal analysis and foreground separation

The cosmological 21 cm signal is one of the most promising avenues to study the Epoch of Reionization. One class of experiments aiming to detect this signal is global signal experiments measuring the sky-averaged 21 cm brightness temperature as a function of frequency. A crucial step in the interpretation and analysis of such measurements is separating foreground contributions from the remainder of the signal, requiring accurate models for both components. Current models for the signal (non-foreground) component, which may contain cosmological and systematic contributions, are incomplete and unable to capture the full signal. We propose two new methods for extracting this component from the data: Firstly, we employ a foreground-orthogonal Gaussian Process to extract the part of the signal that cannot be explained by the foregrounds. Secondly, we use a FlexKnot parameterization to model the full signal component in a free-form manner, not assuming any particular shape or functional form. This method uses Bayesian model selection to find the simplest signal that can explain the data. We test our methods on both synthetic data and publicly available EDGES low-band data. We find that the Gaussian Process can clearly capture the foreground-orthogonal signal component of both data sets. The FlexKnot method correctly recovers the full shape of the input signal used in the synthetic data and yields a multi-modal distribution of different signal shapes that can explain the EDGES observations.


INTRODUCTION
21 cm cosmology is the endeavour to use the hyperfine transition of neutral hydrogen to probe the Universe at high redshifts (5 ≲ z ≲ 30), from the Dark Ages, when the Universe was neutral, via Cosmic Dawn, when the first stars formed, to the Epoch of Reionization (EoR), when early galaxies ionized the intergalactic medium (IGM). This is one of the last unexplored frontiers of cosmology, and measurements of the 21 cm signal of neutral hydrogen will allow us to study the properties of the IGM and the first stars and galaxies (Madau et al. 1997; Furlanetto et al. 2006; Pritchard & Loeb 2012; Barkana 2016).
The 21 cm signal is the emission or absorption line corresponding to the hyperfine transition of neutral hydrogen. When the spin temperature of the hydrogen atoms is higher than the background radiation temperature, the atoms emit, creating a peak in the differential brightness temperature. When it is lower, the hydrogen atoms absorb the background radiation, causing a trough in the spectrum. We generally expect a trough during Cosmic Dawn and potentially a peak towards lower redshifts when the IGM is heated above the background temperature. We can observe this signal as a distortion in a smooth radio background (typically assumed to be the Cosmic Microwave Background, CMB) at a frequency of ν_21 = 1420.4 MHz/(1 + z), depending on the redshift z of the corresponding hydrogen cloud.
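As a quick aid for translating between the two quantities, the frequency-redshift mapping above can be written as a minimal helper (a sketch, not part of any analysis pipeline):

```python
# Minimal sketch of the frequency-redshift mapping for the 21 cm line,
# using the rest-frame frequency of 1420.4 MHz quoted in the text.
NU_21_REST = 1420.4  # MHz, rest-frame 21 cm frequency


def nu_observed(z):
    """Observed 21 cm frequency in MHz for gas at redshift z."""
    return NU_21_REST / (1.0 + z)


def z_of_nu(nu):
    """Redshift probed by an observed frequency nu in MHz."""
    return NU_21_REST / nu - 1.0
```

For instance, an observed frequency of 78 MHz corresponds to z ≈ 17.2, in the middle of the redshift range quoted above.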
★ E-mail: heimersheim@ast.cam.ac.uk

The two main types of 21 cm experiments are global signal experiments, which measure the sky-averaged 21 cm signal, and interferometers, which measure the spatial fluctuations of the signal. Global signal experiments include EDGES (Bowman et al. 2008), SARAS (Patra et al. 2013), SCI-HI (Voytek et al. 2014), BIGHORNS (Sokolowski et al. 2015), LEDA (Price et al. 2018; Garsden et al. 2021), PRIZM (Philip et al. 2019), MIST (Monsalve et al. 2023), and REACH (de Lera Acedo et al. 2022). Of particular interest are EDGES, which has claimed a detection of the 21 cm signal (Bowman et al. 2018), and SARAS 3, which reported a non-detection (Singh et al. 2022) of the signal claimed by EDGES. Future experiments (e.g. REACH and MIST) are expected to provide independent global signal measurements to answer this question. Measurements of fluctuations in the 21 cm signal have been performed by several interferometers. While no detection has been reported so far, these measurements have set upper limits on the 21 cm power spectrum. The strongest limits so far have been set by the HERA telescope (Abdurashidova et al. 2022; The HERA Collaboration et al. 2022); other interferometers include LOFAR (Patil et al. 2017; Mertens et al. 2020), MWA (Dillon et al. 2014; Trott et al. 2020), NenuFAR (Mertens et al. 2021), PAPER (Kolopanis et al. 2019), and GMRT (Paciga et al. 2013). Although fluctuations in the cosmological signal contain unique information about the first sources of light and the underlying cosmology, in this paper we choose to focus on the global 21 cm signal measurement.
The 21 cm global signal is challenging to observe. Not only does a measurement require sufficient sensitivity (the signal reaches a contrast of up to 280 mK under the assumption that the background is the CMB; Cohen et al. 2017), but it is also contaminated by astrophysical foreground sources (Chandrasekhar 1960; Rogers et al. 2015), radio frequency interference (RFI; Rogers et al. 2005), ionospheric effects (Vedantham et al. 2014; Datta et al. 2016; Shen et al. 2021, 2022), and instrumental effects such as beam chromaticity (e.g. Monsalve et al. 2017b; Anstey et al. 2021) or analogue receiver properties (Jishnu Nambissan et al. 2021; de Lera Acedo et al. 2022). While instrumental challenges can be addressed by careful design and calibration (Monsalve et al. 2017a), the unknown astrophysical foreground sources have to be constrained jointly with the signal. Thus, we fit the measured sky temperature as a sum of the foreground and signal components, T_sky(ν) = T_fg(ν) + T_signal(ν), following Bowman et al. (2018).
The claimed detection of the 21 cm signal by Bowman et al. (2018) showed a much stronger absorption trough than expected from physical models that assume the CMB as the background radiation and standard astrophysical heating and cooling processes (Cohen et al. 2017; Reis et al. 2021). This has sparked discussion about the choice of foreground model (Hills et al. 2018; Singh & Subrahmanyan 2019; Bevins et al. 2021), the question of instrumental calibration (Sims & Pober 2020), and the possibility of non-standard cosmological signals (Barkana 2018; Ewall-Wice et al. 2018; Feng & Holder 2018; Muñoz & Loeb 2018; Slatyer & Wu 2018; Fialkov & Barkana 2019; Mirocha & Furlanetto 2019). However, all previous work either assumed the signal to have a particular shape that does not necessarily correspond to the actual signal (including the original work by Bowman et al. 2018), or used physical models, typically based on simulations (such as Furlanetto et al. 2004; Iliev et al. 2006; McQuinn et al. 2007; Mesinger et al. 2011; Fialkov et al. 2014a; Ross et al. 2017; Semelin et al. 2017), that are limited by our understanding of the underlying astrophysics (Majumdar et al. 2014). This mismatch between the signal model and the true signal will affect the quality of the foreground separation and prevent the extraction of the correct signal from the data.
In this work, we propose a new way to model the signal component that is agnostic to the precise form of the signal, allowing for an unexpected cosmological signal shape or an undetected systematic contribution. From this point on, we will refer to any non-foreground part of the measurement as the signal component, and do not address the question of separating the cosmological signal from systematics. This question is considered in the forthcoming work by Shen et al. (2023). Concretely, we propose two techniques to describe this signal component: a non-parametric Gaussian Process, and a model-independent FlexKnot parameterization. These free-form approaches allow us to find out what (not necessarily cosmological) signal is hidden in the data. This signal can then be interpreted, either to learn about systematics in the data or about astrophysical processes. The latter will, however, require physical models to connect the observed signal properties to physical properties, re-introducing model-dependence.
Gaussian Processes (GPs; Rasmussen & Williams 2006) are a standard tool for modelling functions in a non-parametric fashion. Taken as a prior over smooth functions, GP regression enables Bayesian inference over a regression function with an unknown parametric form and has successfully been used in a wide array of applications (Almosallam et al. 2016; Ghosh et al. 2020; Li et al. 2021; Mertens et al. 2023). Utilizing a GP to model the unknown signal allows us to place a prior distribution over the signal without assuming a known shape. Gaussian Processes have a small number of parameters specifying the mean and covariance, but these only encode priors on the signal smoothness and strength and do not impose any functional form on the signal. However, since GPs are very flexible function approximators, they need to be sufficiently constrained to avoid confounding with the foreground model. We achieve this with a rather strict constraint, orthogonalizing the GP with respect to the foreground fit, but other options are possible as well.
In this work, we utilize the concept of a foreground-orthogonal basis that has been previously considered in the literature (Liu & Tegmark 2012; Vedantham et al. 2014; Switzer & Liu 2014; Tauscher et al. 2018, 2020; Rapetti et al. 2020; Liu & Shaw 2020; Bassett et al. 2021). We say a signal function is orthogonal to the foregrounds if no amount of foreground terms can reduce the norm of the signal. If we consider the foreground terms and signal as vectors containing their values at each of the observed frequencies, then this orthogonality is equivalent to the signal vector being orthogonal (zero dot product) to all foreground term vectors. This concept is particularly useful here because the foreground-orthogonal component of a measurement is unambiguous (although still dependent on the foreground model) and has to be fully contained within the signal component (which could be of cosmological origin or caused by systematics).
In addition to the Gaussian Process, we explore a second method to fit the signal in a model-independent way. The FlexKnot parameterization is based on a free-form method that was introduced by Bridges et al. (2009) for modelling the primordial power spectrum. It has been used in the CMB community for several years, constraining the primordial power spectrum and the dark energy equation of state (Vázquez et al. 2012a,b, 2013; Abazajian et al. 2014; Aslanyan et al. 2014; Hee et al. 2016; Planck Collaboration et al. 2016; Hee et al. 2017; Finelli et al. 2018; Handley et al. 2019; Planck Collaboration et al. 2020a,b; Escamilla & Vazquez 2023). It was also used to model the pressure profile of galaxy clusters (Olamaie et al. 2018). Millea & Bouchet (2018) adapted the method for parameterizing the reionization history, giving it the name FlexKnot. More recently, the use of FlexKnot has been extended by Heimersheim et al. (2022) to explore reionization history constraints from hypothetical high-redshift Fast Radio Bursts. Their implementation forms the basis of the implementation adopted in this work.
The FlexKnot method differs from the Gaussian Process in that it is a parametric method and allows for a variable amount of complexity. The parameterization works by specifying a set of interpolation points (knots) as frequency and temperature coordinates (the parameters) that allow us to describe the signal function by interpolating between these knots. The number of interpolation knots is itself also a parameter, leading to the key difference between the Gaussian Process and FlexKnot methods: we can compare interpolations of different complexity levels and use Bayesian model selection to find the simplest signal that can explain the observations. At the same time, we still obtain a free-form signal fit that is not limited to a particular shape.
With this paper, we introduce the concept of a free-form model for the global 21 cm signal. We highlight what the Gaussian Process and FlexKnot methods can and cannot do, and demonstrate their use on both synthetic data and the EDGES low-band measurements from Bowman et al. (2018). We describe both methods in section 2, present the results for both cases in section 3, discuss advantages and limitations in section 4, and give our conclusions in section 5.

METHOD
The quantity observed by a global 21 cm signal experiment is the sky-averaged brightness temperature as a function of frequency, T_sky(ν). The actual measurement is a spectral intensity, but it is typically expressed as the corresponding blackbody temperature. The sky temperature is dominated by contributions from foregrounds but contains a small cosmological signal that we are interested in. Additionally, the measured temperature may contain systematic contributions. In this work, we focus on separating the foregrounds from the remaining signal, T_signal(ν), and will not distinguish between cosmological and systematic contributions (this is addressed in the forthcoming work by Shen et al. 2023). Thus, for our purposes, the sky-averaged observed brightness temperature is

T_sky(ν) = T_fg(ν) + T_signal(ν), (1)

where T_fg(ν) is the foreground component and T_signal(ν) contains the remaining contributions (including the cosmological signal and possible systematics).
In section 2.1 we briefly describe the foreground model we adopt from Bowman et al. (2018). Section 2.2 describes the synthetic data we generate to validate our methods, and section 2.3 summarizes the likelihood function adopted from Bowman et al. (2018). Finally, we give a detailed description of the Gaussian Process and FlexKnot methods in sections 2.4 and 2.5, respectively.

Foreground model
The core advances discussed in this work concern the modelling of the signal component; hence we only briefly describe the foreground model and use a simple model derived in Bowman et al. (2018). However, our signal modelling approach can be used with any parameterized foreground model (e.g. the alternatives considered in Hills et al. 2018).
For concreteness, we assume the following parameterized foreground model. It describes the foreground contribution to the sky temperature as a sum of polynomial terms,

T_fg(ν) = Σ_{i=1}^{5} a_i (ν/ν_c)^{i−3.5}, (2)

where the a_i are free parameters and ν_c is set to 75 MHz. We note that this model corresponds to equation (2) rather than (1) of Bowman et al. (2018), but Bowman et al. (2018) themselves report consistent results for both models. We assume wide and nearly uninformative priors on the foreground parameters, as we discuss in sections 2.4.2 and 2.5.3 for each method.
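For illustration, a minimal implementation of this foreground model could look as follows; the exponent convention (five power-law terms around a ν^−2.5 scaling at ν_c = 75 MHz) is our reading of equation (2) of Bowman et al. (2018) and should be checked against the original:

```python
import numpy as np

NU_C = 75.0  # MHz, reference frequency


def foreground(nu, a):
    """Polynomial foreground model: sum_i a[i-1] * (nu/NU_C)**(i - 3.5).

    The exponent convention is an assumption based on equation (2) of
    Bowman et al. (2018); `a` holds the five free parameters a_1..a_5.
    """
    x = np.asarray(nu, dtype=float) / NU_C
    return sum(a[i - 1] * x ** (i - 3.5) for i in range(1, 6))
```

At ν = ν_c every basis term equals unity, so the model there reduces to the sum of the coefficients.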

Synthetic data
We generate a synthetic data set with a known ground truth to validate our methods. We use the foreground model from equation (2), and a signal component corresponding to a flattened Gaussian (from Bowman et al. 2018),

T_signal(ν) = −A [1 − exp(−τ e^B)] / [1 − exp(−τ)], with B = (4 (ν − ν_0)² / w²) log[−log((1 + e^{−τ})/2) / τ], (3)

with amplitude A, central frequency ν_0, full-width at half-maximum w, and flattening factor τ. Finally, we add Gaussian noise with a standard deviation of 25 mK to emulate the thermal noise level of the EDGES experiment (Bowman et al. 2018). The synthetic data are generated with foreground parameters (a_1, a_2, a_3, a_4, a_5) = (1570, 620, −1000, 700, −175) and signal parameters (A, ν_0, w, τ) = (500 mK, 78 MHz, 19 MHz, 7); both sets were chosen to be similar to those found in Bowman et al. (2018). Note that we use equation (3) not only to generate the synthetic data (with the parameter values mentioned above) but separately also to fit the EDGES low-band data. In the latter case, we fit the parameters of equation (3) in order to reproduce the method of Bowman et al. (2018) as a comparison to our methods. We emphasize that these are two separate uses of equation (3), with different signal parameters and different foregrounds.
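The synthetic-data recipe can be sketched as below; the flattened-Gaussian expression is our transcription of the Bowman et al. (2018) profile, the frequency grid is an assumed EDGES-like band, and the foreground term is omitted for brevity:

```python
import numpy as np


def flattened_gaussian(nu, A, nu0, w, tau):
    """Flattened-Gaussian absorption profile (transcribed from Bowman et al.
    2018): a trough of depth -A at nu0, FWHM w, flattening factor tau."""
    B = (4.0 * (nu - nu0) ** 2 / w ** 2
         * np.log(-np.log((1.0 + np.exp(-tau)) / 2.0) / tau))
    return -A * (1.0 - np.exp(-tau * np.exp(B))) / (1.0 - np.exp(-tau))


# Synthetic signal: assumed frequency grid, signal parameters from the text,
# plus 25 mK Gaussian thermal noise (the foreground term is left out here).
rng = np.random.default_rng(0)
nu = np.linspace(51.0, 99.0, 123)                                      # MHz
T_signal = flattened_gaussian(nu, A=500.0, nu0=78.0, w=19.0, tau=7.0)  # mK
T_noisy = T_signal + rng.normal(0.0, 25.0, nu.size)
```

The trough reaches its full depth of −A at ν = ν_0 and decays to zero towards the band edges.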

Likelihood function
To compare a given model m = T_fg + T_signal (a vector composed of temperature values at the observed frequencies) to the measured data d (a vector of measurements at every frequency) we use a Gaussian likelihood, following Bowman et al. (2018). The likelihood accounts for Gaussian thermal noise with a free parameter σ_noise that determines the standard deviation:

L(d | m, σ_noise) = Π_i (2π σ_noise²)^{−1/2} exp[−(d_i − m_i)² / (2 σ_noise²)]. (4)

We use this likelihood for both our synthetic data and the EDGES data. Since the EDGES low-band data are measured in 123 frequency channels (Bowman et al. 2018) ν_1, . . ., ν_123, the data as well as the signal and foregrounds will be 123-dimensional temperature vectors corresponding to those frequencies.
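A minimal implementation of this likelihood (in log form, for numerical stability) could look like:

```python
import numpy as np


def log_likelihood(d, m, sigma_noise):
    """Gaussian log-likelihood of equation (4): independent thermal noise
    with standard deviation sigma_noise in every frequency channel."""
    r = np.asarray(d, dtype=float) - np.asarray(m, dtype=float)
    n = r.size
    return (-0.5 * n * np.log(2.0 * np.pi * sigma_noise ** 2)
            - 0.5 * np.sum(r ** 2) / sigma_noise ** 2)
```

The log-likelihood is maximal when the model matches the data exactly and decreases quadratically with the residuals.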

Gaussian Process signal model
We now proceed with introducing the methods to represent the signal component, starting with the Gaussian Process approach.
GPs are a common way to model smoothly varying functions. A GP is a stochastic process, i.e. a collection of random variables, defined such that the joint distribution of any finite sub-collection is a multivariate normal. Specifically, we consider the continuous Gaussian Process GP(μ(ν), k(ν, ν′)) to describe the signal at a frequency ν, where μ(ν) is the mean function calculated at frequency ν, and k(ν, ν′) is the covariance function calculated between two frequencies ν and ν′. We generate the GP signal vector as a finite realization of a Gaussian Process. It follows a multivariate normal distribution [T_GP(ν_1), . . ., T_GP(ν_N)]ᵀ ∼ N(μ, K), where μ = [μ(ν_1), . . ., μ(ν_N)]ᵀ denotes the mean vector constructed by evaluating the mean function at each input location. Correspondingly, the covariance matrix K is defined such that K_ij = k(ν_i, ν_j) for all frequency pairs (ν_i, ν_j). The GP is fully specified by its mean and covariance functions. Of particular importance is the covariance function, sometimes denoted as the kernel of the GP, which controls many features of the resulting function, such as the level of smoothness (correlation) across the input locations.
We choose the mean function to be zero, μ = 0, to not bias the signal towards any particular shape. As the covariance function, we choose a simple squared exponential kernel, k(ν, ν′) = σ_K² exp[−(ν − ν′)²/(2ℓ²)], with amplitude σ_K² and length scale ℓ. The amplitude controls how far the GP samples are allowed to move away from the mean, and the length scale controls the overall correlation length of the samples across frequency space.
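Prior draws from this GP can be generated in a few lines; the grid and hyperparameter values below are illustrative only:

```python
import numpy as np


def se_kernel(nu, sigma_K, ell):
    """Squared-exponential kernel: sigma_K^2 * exp(-(nu - nu')^2 / (2 ell^2))."""
    d = nu[:, None] - nu[None, :]
    return sigma_K ** 2 * np.exp(-0.5 * (d / ell) ** 2)


nu = np.linspace(51.0, 99.0, 123)          # illustrative frequency grid, MHz
K = se_kernel(nu, sigma_K=1.0, ell=15.0)   # illustrative hyperparameters
# Draw one zero-mean sample via a (jittered) Cholesky factor of K.
L = np.linalg.cholesky(K + 1e-8 * np.eye(nu.size))
T_GP = L @ np.random.default_rng(1).standard_normal(nu.size)
```

Larger ℓ yields smoother curves; larger σ_K allows larger excursions from the zero mean.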
In addition to this smooth variation, we allow for thermal noise as described in section 2.3. Taking equation (4) with m = T_fg + T_GP yields the corresponding likelihood for observational data d. In order to draw samples from the Gaussian Process, we use a formulation that explicitly includes the noise term n_i, i.e. we draw expected measurements at frequency ν_i from the model T_fg(ν_i) + T_GP(ν_i) + n_i, with n_i being a Gaussian random variable with zero mean and variance σ_noise².

Orthogonal Gaussian Process
It should be noted that combining the parametric foreground model with the GP with no further constraints yields a solution characterised by degeneracy between the estimates of the foregrounds and the signal. To illustrate this point, consider the signals in Figure 1. All of the curves shown contain the same cosmological signal but differ in the foreground component, and are thus equally preferred by the data (as any 'foreground-shaped' addition can be compensated by slightly different foreground parameters). This leads to confounding (Hodges & Reich 2010), as the fit is jointly appropriate, but there is no incentive for the GP to separate the signal from the foregrounds. This degeneracy leads to large uncertainties on both the GP signal and the foreground parameters.
To control the degeneracy between the GP and the foreground model, the GP must be penalized away from functional forms that are similar to the foreground. One way to do this is through penalization of the kernel hyperparameters ℓ and σ_K, for example a prior choice for σ_K that encourages small deviations from the prior mean at zero. That choice, however, could also negatively affect the result, e.g. by discouraging a better-fitting signal that requires larger deviations from the prior mean.

Figure 2. Foreground-orthogonal projection, shown for illustrative purposes. The different lines correspond to different settings of the X-ray production efficiency in early galaxies (f_X from 10⁻¹, blue, to 10³, yellow; see Fialkov et al. 2014a,b; Fialkov & Barkana 2014 for details). The top panel shows T_signal with the recognizable shape, while the bottom panel shows the foreground-orthogonal component T_signal,projected of these signals (note the y-axis scale). We stress that these cosmological signals are shown without any systematics and only for illustrative purposes, while T_signal in our analysis may contain both cosmological and systematic contributions.
Another way to avoid the degeneracy, and the main method pursued in this paper, is to fit only the foreground-orthogonal component of the signal, rather than the full signal. The foreground-orthogonal part contains less information than the full signal because there exist multiple signals that have the same foreground-orthogonal component (e.g. those shown in Figure 1). However, these are exactly the kind of signals that cannot be distinguished by the likelihood. Thus we fit the foreground-orthogonal component, which represents exactly the information that can be extracted from the data alone. In section 3.1 we will see that the Gaussian Process can very accurately estimate this component of the signal.
The foreground-orthogonal signal fit obtained from the GP is generically useful for signal comparison purposes. In particular, it can be used to compare the data to (e.g. physical or systematic) theory signals by simply comparing their foreground-orthogonal components. This can be done by applying the foreground-orthogonal projection to the theory signal and then comparing the resulting foreground-orthogonal theory to the foreground-orthogonal GP fit. We illustrate the projection in Figure 2, showing full theory signals (top panel) and their foreground-orthogonal components (bottom panel).
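The projection itself is a small piece of linear algebra. Here we sketch it with our assumed polynomial foreground modes and a toy Gaussian trough standing in for a theory signal; a QR factorization is used as a numerically stable equivalent of the projector P = Fᵀ(F Fᵀ)⁻¹F:

```python
import numpy as np

nu = np.linspace(51.0, 99.0, 123)
# Assumed foreground modes (powers of nu/75 MHz); Q is an orthonormal basis
# of their span, so Q @ Q.T projects onto the foreground subspace.
F = np.vstack([(nu / 75.0) ** (n - 2.5) for n in range(5)])
Q, _ = np.linalg.qr(F.T)
P = Q @ Q.T


def project_orthogonal(T):
    """Foreground-orthogonal component of a signal vector T."""
    return T - P @ T


# Toy 'theory' signal: a Gaussian trough (illustrative, not a physical model).
T_theory = -0.2 * np.exp(-0.5 * ((nu - 78.0) / 8.0) ** 2)
T_perp = project_orthogonal(T_theory)
```

By construction F @ T_perp vanishes, so comparing T_perp between theory and data discards exactly the part that the foreground fit could absorb anyway.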
Comparing foreground-orthogonal projections is sufficient and lossless (under a Gaussian likelihood as in equation (4), and an unrestricted foreground model as adopted here). If the foreground-orthogonal theory signal is close to the foreground-orthogonal component of the data, this implies that there exists a set of foreground parameters for which the theory model plus a foreground term is equally close to the full data. We give a mathematical derivation of this in the paragraph on analytical foreground marginalization in section 2.5.1.
In the future, it might be practical for signal interpretation purposes to consider the foreground-orthogonal components, rather than analyze the full signals. Projecting an ensemble of physical signals would enable the interpretation of the foreground-orthogonal component in terms of astrophysical properties, just as it is done now with full signals. We discuss this option further in section 4.
To estimate the foreground-orthogonal component, we construct a Gaussian Process that is always orthogonal to the foregrounds, allowing us to fit the parametric foregrounds and non-parametric GP signal simultaneously without degeneracy. While this is generally difficult to do for GPs across the entire input space (see e.g. Plumlee & Joseph 2017), it can easily be achieved at any finite set of locations by conditioning by kriging (Rue & Held 2005). Such processes have been utilized in spatial statistics under the name restricted spatial regression (RSR; Hanks et al. 2015; Khan & Calder 2022).
We can achieve this by first drawing T_GP from the unconstrained multivariate normal distribution N(0, K) and then applying the transformation T_GP,proj = (Id − P_fg) T_GP, using the projection matrix P_fg = Fᵀ(F Fᵀ)⁻¹F with the 5 × 123-dimensional design matrix F_ij = f_i(ν_j) describing the 5 independent foreground modes from equation (2). This can be derived by first considering the joint distribution of the GP T_GP and a linear transformation F T_GP of the GP, which will be multivariate normal. Then the conditional distribution of T_GP | F T_GP = 0 is available through standard formulas, enforcing the orthogonality constraint. To efficiently sample from the posterior distribution, we use the standard trick of marginalizing out the GP from the joint distribution to obtain the following distribution of our data:

d ∼ N(T_fg, K_proj + σ_noise² Id).

That is, we assume our observed data are drawn from a normal distribution centred around the foreground model, with covariance defined by the orthogonal GP plus the noise variance. Conditional on the foreground, kernel hyperparameters, and noise parameter σ_noise², the posterior distribution of the 21 cm signal is available in closed form and is simply a multivariate normal with K_proj = (Id − P_fg) K (Id − P_fg)ᵀ, following the standard formulas of Gaussian Process regression. We sample from this distribution at each step of an MCMC chain to recover the posterior distribution of T_GP,proj itself.
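A sketch of this recipe (unconstrained draw, projection, and the projected covariance K_proj), again with illustrative grid, kernel, and assumed foreground modes:

```python
import numpy as np

rng = np.random.default_rng(2)
nu = np.linspace(51.0, 99.0, 123)

# Assumed foreground design matrix and the projection P_fg onto its span
# (QR used as a stable equivalent of F^T (F F^T)^-1 F).
F = np.vstack([(nu / 75.0) ** (n - 2.5) for n in range(5)])
Q, _ = np.linalg.qr(F.T)
P_fg = Q @ Q.T
M = np.eye(nu.size) - P_fg          # (Id - P_fg), removes foreground modes

# Unconstrained GP draw from N(0, K), then orthogonalized.
d = nu[:, None] - nu[None, :]
K = np.exp(-0.5 * (d / 15.0) ** 2)  # illustrative SE kernel, sigma_K = 1
L = np.linalg.cholesky(K + 1e-8 * np.eye(nu.size))
T_GP = L @ rng.standard_normal(nu.size)
T_GP_proj = M @ T_GP                # foreground-orthogonal sample

K_proj = M @ K @ M.T                # covariance of the projected process
```

Every projected sample has zero overlap with all five foreground modes, so no foreground-shaped component survives in T_GP_proj.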

Gaussian Process priors
We expect the variation of the GP to be of the order of ±1 K, so we set the prior on σ_K to be an exponential distribution with scale 1 K, π(σ_K) ∝ exp(−σ_K/1 K). This standard choice allows deviations from the mean with a preference for smaller fluctuations (we want this preference so that the prior does not introduce artificial fluctuations); for stronger expected signals (e.g. Mirocha & Furlanetto 2019) this choice can be adjusted. As for the smoothness, we are working in the frequency range [50 MHz, 100 MHz] and set the prior on ℓ to be an inverse-gamma distribution with shape α = 5 and scale β = 75 MHz. This is a fairly standard choice in which we penalize the model away from exceedingly small or large length scales, but the GP is not very sensitive to these choices due to the orthogonality constraint. The remaining priors are set as follows. For the foreground component, we place uninformative priors on the parameters of the foreground model using the recommended QR reparametrization trick as described in the Stan manual (Stan Development Team 2023a). The prior for each scaled and transformed parameter is a wide and uninformative normal distribution N(0, 10⁶). For the noise σ_noise we assume a half-normal distribution N⁺(0 K, 1 K) to allow for a wide range covering all reasonable values. We perform full MCMC for all parameters in the model and implement the orthogonal GP model in Stan (RStan version 2.26; Stan Development Team 2023b, 2018), using the No-U-Turn Sampler algorithm (Hoffman & Gelman 2014). The results of estimating the Gaussian Process are shown in section 3.1. Note that we also fitted a non-orthogonalized GP to the data, using the same prior distributions for the hyperparameters as above, but found that it was difficult to avoid degeneracy with the foreground through prior distributions alone.

FlexKnot signal model
The Gaussian Process fulfils one of our goals: it allows us to extract information about the signal without assuming any particular function or shape, and we obtain a tight constraint on the foreground-orthogonal projection of the signal. However, this leaves a wide range of possible full (non-orthogonal) signals compatible with this constraint.
Our goal in this section is to differentiate between these possible full signals and to find the most likely underlying signal.To achieve this we parameterize the signal with the FlexKnot method (Vázquez et al. 2012a;Millea & Bouchet 2018;Heimersheim et al. 2022) and apply Bayesian model selection to find the most probable true signals.
Our final result will be a posterior distribution of signals, showing us which shapes are compatible with the data.
Even though multiple signals may fit the data equally well, we can differentiate them based on their complexity. Simpler, more predictive models are more likely to be correct, whereas complex functions with many free parameters are susceptible to overfitting the data. In Bayesian statistics, this effect is quantified as the Bayesian evidence for a given model, and we make use of this in the FlexKnot method.
We use the FlexKnot method to parameterize the signal T_signal as a function of frequency ν. FlexKnot uses a family of flexible parameterizations that are based on interpolation. The first FlexKnot function is an interpolation between two points, i.e. a straight line. The second FlexKnot function adds a third point and interpolates between the three points, and so on. We refer to these interpolation points as knots. Each function is parameterized by the coordinates {ν_i, T_i} of its knots, and the value at any other point ν is interpolated. We illustrate this in the top panel of Figure 3. The interpolation used between the knots is piecewise cubic Hermite interpolating polynomials (PCHIP), following Millea & Bouchet (2018). Alternative options are other spline variations or even linear interpolation. We have not found meaningful differences between the various interpolation techniques, and chose PCHIP to avoid the sharp features that a linear interpolation would introduce, making the functions easier to interpret. Note that, unlike Heimersheim et al. (2022, modelling the reionization history), we do not fix the first and last interpolation knots to the minimum and maximum y-axis values, as such minima and maxima do not exist for our signal. Instead, we extrapolate the cubic Hermite polynomials before the first and after the last interpolation knot.
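A FlexKnot curve of this kind can be evaluated with scipy's PCHIP implementation; the knot positions below are arbitrary examples:

```python
import numpy as np
from scipy.interpolate import PchipInterpolator


def flexknot_signal(knot_nu, knot_T, nu_eval):
    """Evaluate a FlexKnot curve: PCHIP through the knots, with the cubic
    Hermite polynomials extrapolated beyond the first and last knot."""
    knot_nu = np.asarray(knot_nu, dtype=float)
    knot_T = np.asarray(knot_T, dtype=float)
    order = np.argsort(knot_nu)                  # PCHIP requires increasing x
    interp = PchipInterpolator(knot_nu[order], knot_T[order], extrapolate=True)
    return interp(nu_eval)


nu = np.linspace(51.0, 99.0, 123)
T_fk = flexknot_signal([58.0, 70.0, 78.0, 90.0], [0.0, -0.2, -0.5, 0.0], nu)
```

The curve passes exactly through every knot, and the monotone PCHIP construction avoids the overshoots a standard cubic spline can produce between knots.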
After setting up the FlexKnot parameterization, we sample the parameter space using Nested Sampling (Skilling 2004); specifically, we use the code PolyChord (Handley et al. 2015a,b). We perform multiple sampling runs for each number of knots to ensure our results are not influenced by outlier runs, averaging the posterior samples. In sections 3.2 and 3.3 we show the evidences of the individual runs (specifically Figures 9 and 12) and show that the scatter is small. The parameter space spans all knot coordinates and foreground parameters; however, as discussed further in section 2.5.1, we only need to sample the knot coordinates. We choose Nested Sampling so that we can compute the Bayesian evidence Z for each model, and we choose PolyChord because it is well-suited for the high-dimensional parameter space of FlexKnot models. Every Nested Sampling run gives us the posterior distribution of knot positions for a given number of knots; the lower panel of Figure 3 shows an example for n_knots = 6. We use the package anesthetic (Handley 2019) to analyse the Nested Sampling results. As the final visualization step, we do not show the distribution of knot coordinates but the distribution of functions, obtained by computing the interpolation for every sample. The reason for this is that the knot coordinates are not directly interpretable, while the interpolation function is both what the data are ultimately sensitive to and what we are interested in. For this purpose we use fgivenx (Handley 2018), which yields the functional posterior distributions that we show in section 3.
We explore functions from n_knots = 2 up to a maximum of 15 knots. The cutoff was chosen empirically, as the Bayesian evidence for the models declines after 6 knots, falling to a negligible level of less than 10⁻⁶ of the total evidence after 15 knots. We would like to treat n_knots as a discrete parameter of the model, but it is impractical to sample n_knots as a parameter. Instead, we run separate Nested Sampling runs for each n_knots and average the results weighted by the Bayesian evidence Z of each run. This is equivalent to treating n_knots as a parameter and marginalizing over its value. We can show this by writing the posterior distribution of T_signal given the data d and expanding the n_knots parameter. This shows that P(T_signal | d) is equivalent to the average over the individual posteriors P(T_signal | n_knots, d) weighted by the Bayesian evidence Z(n_knots):

P(T_signal | d) = Σ_{n_knots} P(T_signal | n_knots, d) P(n_knots | d) ∝ Σ_{n_knots} P(T_signal | n_knots, d) Z(n_knots) π(n_knots),

with π(n_knots) the (uniform) prior on the number of knots. The same rewrite can be used for other quantities such as the foreground parameter posterior.
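In practice, the evidence-weighted average reduces to computing normalized mixture weights from the per-run log-evidences (assumed inputs from the Nested Sampling runs), e.g.:

```python
import numpy as np


def evidence_weights(log_Z):
    """Normalized posterior weights for runs with log-evidences log_Z,
    assuming a uniform prior over the number of knots."""
    log_Z = np.asarray(log_Z, dtype=float)
    w = np.exp(log_Z - log_Z.max())   # subtract the max for numerical stability
    return w / w.sum()


# Toy example: three runs; the middle run has the highest evidence.
weights = evidence_weights([-10.0, -8.0, -12.0])
```

Any posterior quantity, e.g. the functional posterior of T_signal, is then the mixture of the per-run posteriors with these weights.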

Orthogonalizing the foregrounds
There is one additional technique we want to introduce in this section.
While the FlexKnot method is not limited to foreground-orthogonal functions, we can make use of the orthogonalization to drastically speed up the sampling. We can analytically integrate out the foreground parameters and reduce the parameter space we need to numerically sample by 5 dimensions. We note that this idea is not limited to FlexKnot but can be applied to any parameterization, such as the flattened Gaussian or even physical models.
In the following, we will show (i) how to find an orthogonal basis for the foregrounds, and (ii) how to analytically solve the foreground parameter marginalization.
Orthonormal foreground basis: We want to express the foreground with new basis functions f̃_i that are orthogonal and normalized. We mark quantities in the new basis with a tilde to distinguish them from the equivalent quantities in the original basis. The new foreground basis vectors are given by f̃_i = (f̃_i(ν_1), ..., f̃_i(ν_123)). The vectors f̃_i and f̃_j are orthonormal if and only if f̃_i · f̃_j = δ_ij. To calculate this basis we define the 5 × 123-dimensional foreground matrix F_ni = f_n(ν_i) and apply a singular value decomposition, F = U Σ F̃, to obtain the desired orthogonal foreground basis F̃ = (f̃_i(ν_j)). We can convert between the original foreground parameters a_n and the corresponding new-basis parameters ã_n via ã = Σ U^T a. Figure 4 shows a comparison of the original foreground basis functions f_n (top) and the new orthogonal foreground basis functions f̃_i (bottom). Note that orthogonal here refers to the functions being orthogonal to each other, not orthogonal to the foregrounds.
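A minimal numerical sketch of this construction, assuming for illustration a 5-term power-law-like basis on 123 EDGES low-band channels (the paper's actual basis functions are those of its equation 2, and the reference frequency here is our own choice):

```python
import numpy as np

nu = np.linspace(50.0, 100.0, 123)   # EDGES low-band channels (MHz)
nu_c = 75.0                           # reference frequency (assumed)

# Illustrative 5-term linear foreground basis f_n(nu); rows of F are the
# basis functions evaluated on the frequency grid, so F has shape (5, 123).
F = np.stack([(nu / nu_c) ** (n - 2.5) for n in range(5)])

# SVD of the foreground matrix: F = U @ diag(S) @ Vt, where the rows of
# Vt are orthonormal and span the same subspace -- the new basis f~_i.
U, S, Vt = np.linalg.svd(F, full_matrices=False)
F_tilde = Vt                          # shape (5, 123), orthonormal rows

# orthonormality check: f~_i . f~_j = delta_ij
assert np.allclose(F_tilde @ F_tilde.T, np.eye(5), atol=1e-10)

# parameter conversion: a^T F = a~^T F_tilde with a~ = Sigma U^T a,
# i.e. a_tilde = a @ (U * S) in numpy broadcasting
a = np.array([1500.0, -20.0, 5.0, -1.0, 0.3])   # example coefficients (K)
a_tilde = a @ (U * S)
assert np.allclose(a @ F, a_tilde @ F_tilde)
```

Any other linear foreground model can be plugged in by replacing the rows of `F`.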

Analytical foreground marginalization:
To fit a signal model to the data one typically needs to simultaneously sample the signal and foreground parameters. However, for Gaussian likelihoods and foreground models that are linear in their parameters, such as our equation (2) or equation (1) of Bowman et al. (2018), there exists a shortcut: we can analytically compute the posterior of the foreground parameters conditional on the signal parameters. This means that we can sample the foreground parameters analytically (at negligible computational cost) and need to sample only the signal parameters numerically. We start by inserting the orthogonal foreground model, equation (8), into the likelihood, equation (4). This yields the likelihood as a function of the FlexKnot parameters θ_FK = {ν_i, T_i} and the foreground parameters θ_fg = {ã_n}, with the residual vector r = d − T_signal(θ_FK) (omitting the θ_FK dependence for brevity). We can simplify the exponent in two steps. First, we decompose r into foreground-proportional terms with coefficients b_i = r · f̃_i and a foreground-orthogonal remainder r_⊥. Secondly, we use the orthonormality of the basis to write the square of the sum as a sum of squares, as all cross-terms vanish. Adding the foreground and FlexKnot prior terms, we note that the prior scale on the foreground parameters, σ_ã = 10^6 K, is much larger than both the noise level σ_noise ≪ 1 K and the range allowed by the likelihood (on the order of 10^3 K). Thus, we can approximate the foreground prior term by a constant. At this point the posterior has factorized into a foreground-dependent Gaussian term and a FlexKnot-dependent term. We can use this both to analytically sample the foreground parameters conditional on the FlexKnot parameters (simply a Normal distribution with mean b_i(θ_FK) and scale σ_noise), and to marginalize over the foreground parameters. We use this form in our implementation, sampling the FlexKnot parameters θ_FK numerically and sampling the foreground
parameters analytically as derived parameters in PolyChord. We store both the orthogonal foreground parameters ã_n and the original foreground parameters a_n in our chains, and will show the posterior distributions of both (Figure D1, and Figures 10 and 13, respectively). We emphasize that, unlike for the GP analysis, this orthogonalization is not essential for the FlexKnot method to work. All it does is significantly improve the sampling efficiency, both by reducing the parameter space by 5 dimensions and by letting us skip sampling over the degenerate foreground parameters.
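The resulting recipe can be sketched in a few lines of Python (function names and the toy setup are ours, not the paper's implementation): the marginal likelihood depends only on the foreground-orthogonal residual, and the foreground parameters can be drawn afterwards as derived parameters:

```python
import numpy as np

def marginal_loglike(d, T_signal, F_tilde, sigma_noise):
    """Log-likelihood with the linear foreground parameters integrated
    out analytically (broad Gaussian prior approximated as flat).

    d, T_signal: data and signal vectors of length N
    F_tilde: (n_fg, N) orthonormal foreground basis (rows orthonormal)
    """
    r = d - T_signal                    # residual vector
    b = F_tilde @ r                     # foreground-parallel coefficients b_i
    r_perp = r - F_tilde.T @ b          # foreground-orthogonal remainder
    N, n_fg = len(d), len(F_tilde)
    # only the orthogonal part constrains the signal; the parallel part
    # is absorbed by the analytically sampled foreground parameters
    return (-0.5 * r_perp @ r_perp / sigma_noise**2
            - 0.5 * (N - n_fg) * np.log(2 * np.pi * sigma_noise**2))

def sample_foreground(d, T_signal, F_tilde, sigma_noise, rng):
    """Draw foreground parameters a~_i conditional on the signal: a
    Normal with mean b_i = f~_i . (d - T_signal) and scale sigma_noise."""
    b = F_tilde @ (d - T_signal)
    return rng.normal(b, sigma_noise)
```

By construction, adding any foreground-proportional term to the data leaves `marginal_loglike` unchanged, which is the degeneracy the marginalization removes from the numerical sampling.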

Mean subtraction
We apply one important post-processing step to the final FlexKnot signals, which is to normalize all functions to have zero mean. This is necessary because, while FlexKnot can eliminate almost the entire foreground degeneracy that affects the shape of the signal, it cannot eliminate the degeneracy between the absolute signal level and the foreground level. This is because (i) the foregrounds can combine to create an almost constant offset in the signal, as demonstrated by the blue line in the lower panel of Figure 4, and (ii) a FlexKnot parameterization can also allow for a constant offset by shifting all knot coordinates vertically. The latter does not incur a complexity penalty, thus creating a degeneracy between the signal level and the foreground level that cannot be removed by the FlexKnot method.
The post-processing step allows us to visualize the good constraint on the shape and contrast of the signal. In terms of scientific results, this means we cannot constrain quantities that depend on the absolute signal level, but inferences about the shape of the signal are unaffected. Constraints on absolute signal levels could be made possible by adding further assumptions, such as fixing the value of the signal at a given frequency based on a physical model, but this is beyond the scope of this work.
Note that a similar technique does not solve the degeneracy of the Gaussian Process fit seen in Figure 7. This is because the GP already penalizes constant offsets of the fit, so an additional mean subtraction is less useful. We show this explicitly in appendix A.
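The mean-subtraction step itself is a one-liner; a sketch, assuming the posterior samples are stored as an (n_samples × n_frequencies) array:

```python
import numpy as np

def mean_subtract(samples):
    """Normalize posterior signal samples to zero mean over frequency,
    removing the unconstrained constant offset.

    samples: array of shape (n_samples, n_freq)."""
    return samples - samples.mean(axis=1, keepdims=True)
```

Because the shift is per-sample, the shape and contrast of every function are preserved exactly.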

FlexKnot priors
We choose simple priors on the FlexKnot parameters, i.e. the coordinates of the interpolation points. We allow the frequency coordinates ν_i to lie between 50 MHz and 100 MHz (the frequency range corresponding to the EDGES low-band observations), and the temperature coordinates T_i to lie between −2 K and 2 K (the largest signal amplitude we consider). The temperature prior is uniform (flat); for the frequency prior we use a forced identifiability prior (Handley et al. 2015c). This means we sample the frequency coordinates as though they were drawn from a uniform (flat) prior and then sorted, so that the knot coordinates are always in ascending order. This gives the same result as a uniform (flat) prior on ν_i, but avoids the combinatorial explosion of having to sample every possible permutation of the knot coordinates. For the number of interpolation knots N_knots, we use a discrete uniform prior from the minimum of 2 knots up to 15 knots; we notice empirically that the evidence for models with more than 15 knots is negligible. As the prior on the foreground parameters we adopt a Normal distribution N(0 K, 10^6 K) over the transformed foreground parameters ã_n (see section 2.5.1). Given that the actual values lie in the ±10^3 K range, this is effectively an uninformative and approximately flat prior. The difference between this choice and a prior on a_n (as chosen for the Gaussian Process) is negligible, as the priors have such a small effect on our results. The noise level σ_noise is a free parameter as well; here we choose the same prior as in the Gaussian Process run, a non-negative Normal distribution N_+(0 K, 1 K).
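The forced identifiability prior can be implemented as an inverse-CDF transform from the unit hypercube, as used in PolyChord-style samplers; the following sketch is our own illustration, mapping uniform unit-cube samples to sorted knot frequencies via the order statistics of a uniform distribution:

```python
import numpy as np

def frequency_prior_transform(u, nu_min=50.0, nu_max=100.0):
    """Forced identifiability prior for the knot frequencies
    (after Handley et al. 2015c): map unit-cube samples u in [0,1]^n to
    coordinates distributed as the sorted order statistics of a uniform
    prior on [nu_min, nu_max], so the knots are always in ascending order.
    """
    n = len(u)
    t = np.empty(n)
    t[-1] = u[-1] ** (1.0 / n)                  # largest order statistic
    for i in range(n - 2, -1, -1):
        t[i] = t[i + 1] * u[i] ** (1.0 / (i + 1))   # recurse downwards
    return nu_min + (nu_max - nu_min) * t
```

The output is sorted by construction, so no explicit sorting (and no permutation degeneracy) enters the sampling.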
In addition to these, we also consider additional priors to encode our prior knowledge about the signal. We consider either: a uniform prior [−2 K, 2 K] on the signal values, which affects the posterior only slightly beyond the existing knot coordinate priors; a Gaussian prior 0 K ± 1 K on the signal contrast (the difference between the highest and lowest signal value); or a Gaussian prior on every signal point. We use the second option, a Gaussian prior on the signal contrast, for the rest of our analysis and show the effects of the other choices in appendix B.

RESULTS
We apply the Gaussian Process and FlexKnot methods both to the synthetic data, where we know the ground truth, and to the EDGES low-band data from Bowman et al. (2018), to test our methods on a real measurement. We will first discuss the GP fit (section 3.1) to both data sets, then the FlexKnot fit to the synthetic data (section 3.2) and to the EDGES low-band data (section 3.3).

Gaussian Process
As discussed in section 2.4, we fit both an unconstrained GP and a foreground-orthogonal GP. Extracting the signal with an unconstrained fit proves difficult, but we can extract the foreground-orthogonal part of the signal with a high degree of confidence. We will start this section with that result, and show the corresponding results for the unconstrained GP in Figure 7.
We show the foreground-orthogonal Gaussian Process posterior fit to the synthetic data in Figure 5. The plot shows the posterior distribution (blue contours) of the Gaussian Process that is (by construction) orthogonal to the foreground component. The shades of blue show the 1σ, 2σ and 3σ contours, corresponding to 68%, 95% and 99.7% credibility intervals. These constraints on the foreground-orthogonal components show us which data features need to be explained by systematics or an astrophysical model. For comparison, we also show the projected ground truth signal (red line) and a projection of the synthetic data (black points) that this fit is based on. These projections are the foreground-orthogonal components of the ground truth signal and data, respectively, using the orthogonalization discussed in section 2.5.1. We see that the GP posterior matches the ground truth signal very well, and its width is consistent with the scatter in the data points. The posterior for the noise level (a free parameter) is σ_noise = 0.027 K ± 0.002 K, which is consistent with the input noise level of 0.025 K. As for the other GP parameters, we give mean values and 95% credible intervals to account for the asymmetry of the distribution; the amplitude has a mean of 0.93 K.

Figure 6. Foreground-orthogonal Gaussian Process fit to the EDGES low-band data (Bowman et al. 2018). We show posterior credible intervals as blue contours, and the projected data points in black. We do not have a ground truth as the real cosmological or systematic signal present is unknown. However, we can show a flattened Gaussian fit (orange dashed line, equation 3) following Bowman et al. (2018) for comparison. This shows us which features in the signal the flattened Gaussian cannot explain, and we see multiple points where the GP and flattened Gaussian fits deviate by more than 3σ. Note that the orange dashed line here is a fit (varying A, ν_0, w, and τ) of equation (3) to the EDGES data, while the red solid line in Figure 5 is the known input signal for the synthetic data. Both look alike because the input signal for the synthetic data was chosen to be similar to the best-fitting signal in Bowman et al. (2018). Also note that the shape of the flattened Gaussian looks unfamiliar because we are only showing its foreground-orthogonal component.
Applying the same method to the EDGES low-band data from Bowman et al. (2018), we obtain the posterior shown in Figure 6. Again we show the posterior contours in blue and the projected data points in black, but in this case we do not know the true signal. An interesting comparison is to see how well a flattened Gaussian signal (as reported in Bowman et al. 2018) can explain the data, or equivalently, how well it can explain the foreground-orthogonal component shown here. We show a (projected) fit of the flattened Gaussian model (equation 3) as an orange dashed line, and see that although the flattened Gaussian fit largely overlaps with the GP posterior, it significantly deviates from the GP contours around 74 MHz, 82 MHz, and 91 MHz.
The Gaussian Process posterior for the noise level is σ_noise = 0.021 K ± 0.002 K, somewhat lower than found in Bowman et al. (2018), which is expected, as the Gaussian Process fits the data more accurately. For the other parameters, we again give the mean value and 95% credible intervals: the amplitude mean is 1.95 K with credible interval [0.4 K, 3.8 K]; the mean kernel length scale is 7.23 MHz with interval ℓ ∈ [5.6 MHz, 8.9 MHz].
We want to comment on the sharp features at 70 MHz and 90 MHz in Figures 5 and 6. These are reminiscent of the features found in the original EDGES analysis (Bowman et al. 2018) and demonstrate an example of a feature that cannot be absorbed by the foregrounds, and thus is present in both analyses. Note however that this feature does not necessarily have to be present as a sharp rise and fall like it is here: in section 3.3 (specifically Figure 11) we will see signal shapes that contain the same foreground-orthogonal component but look visually different. This highlights why the foreground-orthogonal projection is such a useful tool for comparing whether different signals are consistent with the data, or with each other.
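The comparison described here amounts to projecting each candidate signal onto the foreground-orthogonal subspace; a sketch, using an assumed power-law-like foreground basis for illustration (the paper's actual basis is its equation 2):

```python
import numpy as np

# Hypothetical setup: a 5-term foreground basis on the EDGES band,
# orthonormalized via SVD as in section 2.5.1.
nu = np.linspace(50.0, 100.0, 123)
F = np.stack([(nu / 75.0) ** (n - 2.5) for n in range(5)])
F_tilde = np.linalg.svd(F, full_matrices=False)[2]   # orthonormal rows

def project_orthogonal(signal):
    """Remove the part of a signal that the linear foreground model
    could absorb, leaving the foreground-orthogonal component."""
    return signal - F_tilde.T @ (F_tilde @ signal)

# Two signals differing only by a foreground-like term have identical
# foreground-orthogonal projections, so the data cannot tell them apart.
s1 = -0.5 * np.exp(-0.5 * ((nu - 78.0) / 5.0) ** 2)   # a Gaussian trough (K)
s2 = s1 + 0.3 * F[1]                                   # plus a foreground term
assert np.allclose(project_orthogonal(s1), project_orthogonal(s2))
```

Visually different signals can thus be compared by checking whether their projections agree within the GP contours.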
As a point of comparison, we also show the fit of an unconstrained Gaussian Process to both data sets in Figure 7. The posterior constraints show large uncertainties (blue contour bands), which are caused by the degeneracy with the foregrounds. These constraints depend somewhat on the choice of priors, but the qualitative result remains the same.

FlexKnot signal fit to synthetic data
Unlike our Gaussian Process, the FlexKnot method can reliably produce a posterior for the full signal. As discussed in section 2.5, this is possible because the FlexKnot method has an implicit prior towards simpler, more predictive functions due to the Bayesian evidence weighting. This is the distinguishing criterion between multiple signals that fit the data equally well: while their foreground-orthogonal projections can be the same, the full signals differ in complexity. The FlexKnot method will find the simplest signals among the infinite number of signals that are consistent with the data. While the simplest signal is not necessarily the true signal, we expect it to be at least instructive, and we find that the method does in fact recover the ground truth for the synthetic data.
First, we plot the contours of the FlexKnot functional posterior, showing the posterior of the signal value at every frequency. The top panel of Figure 8 shows credible intervals (blue contours) along with the ground truth signal (red line) used to generate the synthetic data. We see that the contours are compatible with the input signal and indeed suggest the correct shape. Overall we see a much larger uncertainty than in the orthogonal Gaussian Process fit, which is expected since here we are fitting the full signal, not just the foreground-orthogonal component. The main source of uncertainty is the degeneracy with the foregrounds.
We make one post-processing adjustment to the posteriors for all FlexKnot plots in this section: we shift all signals to have zero mean. This is because one of the foreground components is an approximately constant signal, and without additional priors we cannot resolve that degeneracy, since the constant does not add any complexity. However, we are mostly interested in the shape of the signal here, not its absolute level, so this is not an issue in our analysis. Note that this is the same effect that causes the FlexKnot knot coordinate posteriors in Figure 3 to be stretched out vertically. In Figure C1 we show the posterior functions without this mean-zero shift. In the same figure, we also show the foreground-orthogonal projection of the FlexKnot posterior for comparison with the GP results.
In the posterior contour plot (top panel of Figure 8) it can be hard to determine the shapes of individual functions, so we additionally look at a representative sample of signals, drawn proportional to the posterior. The bottom panel of Figure 8 shows ≈ 4000 such function samples, allowing us to build an intuition for the functional shapes. Here we can see that almost all functions have a flattened absorption trough, similar to the ground truth signal (red line). We also see the posterior spreading out at the low-frequency tail, and to a lesser degree at the high-frequency tail.
The posterior distribution is marginalized over the number of knots. As discussed in section 2.5, this is an average over all runs, weighted by the Bayesian evidence of each run.

Figure 8. FlexKnot fit to our synthetic data set, visualized as posterior contours (top) and random posterior samples (bottom). These plots use the full signal T_signal, rather than the foreground-orthogonal projection we showed for the GP fit. The only post-processing we apply is to shift the signal to have zero mean, T_signal − ⟨T_signal⟩. Both plots show the input signal (red line) as the ground truth that the synthetic data was generated from. Top: credible intervals of the temperature value at each frequency ν, visualized as blue contours corresponding to the 68%, 95% and 99.7% credible intervals. We see that the ground truth (red line, synthetic data input signal) is recovered within the 68% credible region of the posterior. Bottom: random samples drawn proportional to the FlexKnot posterior. Every sample is a function T(ν) shown as a thin black line. This visualization gives us a better view of the functional shapes contained in the posterior. We see that many samples have a similar shape to the ground truth (red line), although a fraction of posterior samples spread out at low frequencies and do not recover the flat nature of the ground truth. Overall, though, we could infer the underlying signal from observing the posterior samples.

We show the evidence as a function of N_knots in Figure 9. We see that the evidence rises sharply until the peak at N_knots = 6 and falls off continuously after that. Both effects are what we would expect: the sharp rise because low-N_knots models cannot adequately fit the data, and the fall-off because high-N_knots models are more complex than necessary and less predictive, and are thus weakly penalized by the Bayesian evidence. At the largest number we explore, N_knots = 15, the evidence has fallen to Z_15 ≈ e^223, which is smaller than the peak evidence (Z_6 ≈ e^245) by a factor of more than 10^9. Thus, we expect larger numbers of knots to be negligible, and truncate our analysis after N_knots = 15.

Figure 9. For each N_knots value we perform multiple Nested Sampling runs (shown as black dots) to account for outliers, and take the mean (red markers) of the log Z values, since Z is log-normally distributed (Handley et al. 2015c).
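The quoted factor follows directly from the stated evidences:

```latex
\frac{\mathcal{Z}_{6}}{\mathcal{Z}_{15}} \approx \frac{e^{245}}{e^{223}} = e^{22} \approx 3.6 \times 10^{9} > 10^{9}.
```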
Finally, the posteriors of the noise level and the foreground parameters are shown in Figure 10. As expected, the foreground parameters are highly correlated. We note that the transformed foreground parameters ã_n (see section 2.5.1) are significantly less correlated, as demonstrated in Figure D1. In both figures, we see that the foreground parameter posteriors are consistent with the input parameters at the 95% confidence level. We find a noise level of σ_noise = 0.026 K ± 0.001 K, also consistent with the 0.025 K noise added to the synthetic data.
As discussed in section 2.5, we have chosen a Gaussian prior on the largest difference in the signal values for our FlexKnot implementation. Different choices of priors can affect the posterior if the signals differ significantly, but the effect on the synthetic data set is small. We show posterior plots for other prior choices in appendix B, for both the synthetic data and the EDGES low-band data.

FlexKnot signal fit to EDGES low-band data
Now we apply the FlexKnot analysis to the EDGES low-band data from Bowman et al. (2018). In Figure 11 (top panel) we show the resulting posterior contours along with the flattened Gaussian fit for comparison. The posterior contours are more complex than for the synthetic data (which only contains the flattened Gaussian signal, a foreground, and random noise) and appear multi-modal. The credible intervals are rather large, as the different modes cover different parts of the temperature range.
To better visualize this posterior distribution we again turn to a plot of randomly drawn posterior function samples, shown in the bottom panels of Figure 11. In this plot we can clearly distinguish the individual signal modes. We show the same posterior samples in five subpanels of Figure 11: subpanel A shows all samples in black, again with the best-fitting flattened Gaussian for comparison. The next four subpanels (B through E) each highlight a subset of the posterior samples in colour to illustrate the different apparent modes in the posterior. This ad-hoc classification is based on the signal slope between 73 and 84 MHz and should not be treated as rigorous; the numbers we give below are therefore only estimates.
We notice that the majority of the posterior samples (subpanel B, ∼61%) follow an oscillatory pattern with a peak around 65 MHz, a trough around 85 MHz, and another peak around 95 MHz. A smaller fraction of the samples (subpanel C, ∼14%) shows a similar shape with an additional trough around 55 MHz and without the 95 MHz peak. Neither of these two signal modes is compatible with the flattened Gaussian shape, and neither would be detected in an analysis that forcefully assumes such a shape. The third mode (subpanel D, ∼23%) resembles the signal found by Bowman et al. (2018), with a trough at 78 MHz, except that the tails at low and high frequencies are not flat in our case. We also note that the flattened Gaussian best-fitting profile (orange dashed line in subpanel A) shows a much deeper trough that is not recovered by the FlexKnot fits. Finally, the small fourth family of posterior samples (subpanel E, ∼2.8%) follows an inverted oscillatory pattern, with a trough around 65 MHz and a peak around 90 MHz. However, it is much less probable than the other modes, and is only highlighted here for its distinct shape.
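Such an ad-hoc classification can be reproduced with a few lines; here is a hypothetical sketch (the slope thresholds are ours, not the paper's exact cuts):

```python
import numpy as np

def classify_by_slope(samples, nu, lo=73.0, hi=84.0):
    """Ad-hoc mode classification: bucket posterior signal samples by
    their slope between 73 and 84 MHz (rising / falling / flat).

    samples: (n_samples, n_freq) array; nu: frequency grid in MHz.
    The +-0.01 K/MHz thresholds are illustrative only.
    """
    i_lo = np.argmin(np.abs(nu - lo))
    i_hi = np.argmin(np.abs(nu - hi))
    slope = (samples[:, i_hi] - samples[:, i_lo]) / (nu[i_hi] - nu[i_lo])
    return np.where(slope < -0.01, "falling",
                    np.where(slope > 0.01, "rising", "flat"))
```

Counting the labels over all posterior samples then yields mode fractions like the rough percentages quoted above.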
The balance between the different signal modes is directly influenced by the priors chosen for the analysis. As we show in appendix B, the prior choice does not particularly affect the shapes, but does change the relative probabilities of the different modes. In particular, the third mode (subpanel D) has a lower contrast (difference between maximum and minimum of the signal) and is thus preferred by the Gaussian contrast prior (see Figure B1).
We also show the Bayesian evidence as a function of N_knots in Figure 12. The evidence again peaks at N_knots = 6 and falls off after that. However, this fall-off is slightly weaker than for the synthetic data (Figure 9). This could be an indication that the signal underlying the EDGES low-band data is more complex than the flattened Gaussian we used for the synthetic data. However, this trend is only a weak argument, and in both cases the evidence peaks at the same complexity level. The ratio between the peak (Z ≈ e^267) and N_knots = 15 (Z ≈ e^255) is still very large (> 10^6), so we can still safely truncate our analysis at that point.
We show the posteriors of the foreground parameters in Figures 13 (a_n) and D1 (ã_n). The foreground parameter a_n values cover a large range due to their degeneracy, while the transformed ã_n parameters cover a narrower range. The FlexKnot foreground posteriors cover a range near the flattened Gaussian best-fitting point (yellow crosses, best visible in the lower panel of Figure D1) but do not overlap as well as the synthetic data inputs did with their respective posteriors (upper panel of Figure D1). This is not unexpected, as the EDGES data might not be well described by a flattened Gaussian. The noise level of σ_noise = 0.020 K ± 0.001 K is higher than that of the passive load test done by the EDGES team (0.015 K, Bowman et al. 2018), but lower than the noise level of the flattened Gaussian fit, showing that the FlexKnot functions allow a closer fit to the data.

Figure 10. The posterior distribution of the noise and foreground parameters for the FlexKnot signal fit to the synthetic data set. We find the expected degenerate distribution of foreground parameters but recover the input values (this is more visible in the upper panel of Figure D1 in the Appendix). Note that the degeneracy is not a problem for our Nested Sampling due to the foreground orthogonalization discussed in section 2.5.1. The noise posterior distribution, shown in the top-right inset, is consistent with the input noise level (red dashed line).
Figure 13 also shows a multi-modal distribution of foreground parameters.This is expected, as the signal distribution (Figure 11) is multi-modal and the foreground plus signal components must add up to the data.We confirmed that the clusters seen in Figure 13 are indeed the same clusters as in Figure 11.

DISCUSSION
In this section, we want to discuss the strengths and limitations of the proposed methods, specifically the foreground orthogonalization we used and its dependence on the chosen foreground model.
The foreground-orthogonal Gaussian Process fit is a new way to analyze the 21 cm global signal. It is a tool to extract and present exactly the part of the data that cannot be explained by the foregrounds, allowing a direct comparison to theoretical predictions. This does not assume that the theoretical signals are foreground-orthogonal; we simply apply the foreground-orthogonal projection to their theoretical description. As discussed in section 2.4.1, this projection is lossless, allowing us to obtain the likelihood without the need for further sampling and marginalization.
Relying on the Gaussian Process fit to compare observations to theory models requires high confidence that the GP result indeed follows the data. By design, the foreground-orthogonal GP fit is a very robust method, because the data always contain an unambiguous foreground-orthogonal component. We can verify this by plotting this component, i.e. the foreground-orthogonal projection of the data points themselves, along with the Gaussian Process posterior. We have done this in Figures 5 and 6, showing that the GP posterior contours indeed follow the data.
Finally, the foreground orthogonalization is based on distinguishing between foreground-parallel and foreground-orthogonal components of the signal. Any result is therefore dependent on the assumed foreground model. This problem is not unique to this method, but applies to all methods that rely on foreground modelling. With the foreground model choice being an active topic of research (e.g. Hills et al. 2018; Singh & Subrahmanyan 2019; Bevins et al. 2021), we emphasize that our foreground orthogonalization can be applied directly to any other foreground model with unbounded linear parameters, such as equations (4), (7), and (9) in Hills et al. (2018), and equations (2) through (8) in Bevins et al. (2021). The only change required is to recompute the foreground matrix as described in section 2.4.1. We leave expanding the GP method to foreground models with non-linear parameters or bounded parameter ranges for future work, and note that the FlexKnot method can already be applied to any parameterized foreground model without these restrictions.

CONCLUSIONS
In this paper, we have presented two model-independent approaches to explore a 21 cm measurement and find the signal (cosmological or systematic) that is present in addition to the foregrounds. The two methods achieve different and complementary goals.

Figure 12. Again we show individual runs (black dots) as well as the mean (red markers) for every N_knots value. The evidence again peaks at N_knots = 6 but falls off less strongly than for the synthetic data (Figure 9). The evidence still falls off by more than a factor of 10^6, making N_knots > 15 negligible. The slower fall-off suggests that the signal contained in the EDGES data is more complex than the flattened Gaussian in our synthetic data.
Our first method, based on Gaussian Processes, can reliably extract the foreground-orthogonal component of the signal. This is an easy-to-implement and robust method that has so far been under-utilized in the literature. We successfully apply the technique to both a synthetic data set and the EDGES low-band data (Bowman et al. 2018). In the future, we expect this method to be used to compare observational data to theoretical signal predictions. In particular, theorists could explore which types of astrophysical models are especially 'bright' in the foreground-orthogonal space and are thus easier to detect or rule out. Furthermore, one could investigate possible sources of systematics in the same way, i.e. consider which kinds of systematics can explain the foreground-orthogonal part of the observed data.
Transitioning to our second method, we can summarize the gist of our first method as absorbing as much of the observed sky temperature as possible into the foreground component, leaving only the features that cannot be explained by the foreground. Our second method, based on FlexKnot, instead absorbs only as much of the observation into the foreground component as is beneficial to leave the simplest signal remainder.
Our FlexKnot approach is the first free-form parameterization that can fit the entire 21 cm signal present in the data (foreground-orthogonal and foreground-parallel components) without relying on a theoretical model or a pre-defined functional shape. Instead, it utilizes a flexible interpolation scheme and Bayesian model selection to find the simplest signal that explains the data. We successfully demonstrate its ability on a synthetic data set and present a novel analysis of the EDGES low-band data (Bowman et al. 2018). The multi-modal posterior distribution we find suggests that multiple signals are consistent with the data, comparable to individual results in the literature (e.g. Hills et al. 2018).
For both of our methods, we briefly want to discuss the required instrument sensitivity. The Gaussian Process method relies on the foreground-orthogonal component of the data being significantly larger than the noise level (otherwise the Gaussian Process is consistent with zero), and the FlexKnot method indirectly relies on the same criterion (otherwise the simplest FlexKnot model to explain the data will be a constant). Thus we can only expect these methods to provide useful results if the foreground-orthogonal projection of the data is significantly larger than the noise level of the instrument. This is a similar sensitivity requirement to other methods, as a lower sensitivity would always allow a foreground-only fit to explain the data.
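One illustrative way to quantify this criterion (our own sketch, not the paper's test) is to compare the squared norm of the foreground-orthogonal data projection against its noise-only expectation:

```python
import numpy as np

def orthogonal_excess(d, F_tilde, sigma_noise):
    """Illustrative sensitivity check: compare |P_perp d|^2 to its
    noise-only expectation sigma^2 (N - n_fg), returning the excess
    in units of the chi-square standard deviation."""
    d_perp = d - F_tilde.T @ (F_tilde @ d)
    dof = len(d) - len(F_tilde)
    chi2 = d_perp @ d_perp / sigma_noise**2
    return (chi2 - dof) / np.sqrt(2.0 * dof)

# Hypothetical example on an EDGES-like band, with an assumed
# power-law-like foreground basis orthonormalized via SVD.
nu = np.linspace(50.0, 100.0, 123)
F = np.stack([(nu / 75.0) ** (n - 2.5) for n in range(5)])
F_tilde = np.linalg.svd(F, full_matrices=False)[2]

rng = np.random.default_rng(0)
noise = rng.normal(0.0, 0.025, nu.size)
trough = -0.5 * np.exp(-0.5 * ((nu - 78.0) / 5.0) ** 2)
z_noise = orthogonal_excess(noise, F_tilde, 0.025)            # small: no detection
z_signal = orthogonal_excess(noise + trough, F_tilde, 0.025)  # large: detectable
```

When the excess is consistent with zero, a foreground-only fit explains the data and neither method can extract a signal.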
In the future, we expect the foreground orthogonalization to be applied to a multitude of data sets, theoretical signal predictions, and different foreground models. In particular, a combination with recent advances in foreground modelling (Tauscher et al. 2018; Anstey et al. 2021; Shen et al. 2022) could provide the basis for future 21 cm global signal analyses.
We are especially excited about extensions of the FlexKnot method. In particular, while distinguishing between cosmological and systematic contributions is not within the scope of this paper, there are properties we expect of the cosmological signal that can help us distinguish between the two. The FlexKnot method could be expanded to model the cosmological signal specifically (Shen et al. 2023), or to focus on certain types of systematics.

Figure A1. Same as Figure 7 but with the mean centering described in section 2.5.2. The mean centering reduces the degeneracy in the Gaussian Process contours somewhat, but does not solve the issue. The effect is not as strong as for the FlexKnot method (which is discussed in appendix C).

Figure 1. Five different signals that are equivalent up to their foreground-proportional terms. Each of these will have the same likelihood (for different values of the foreground parameters), thus making the signals equally preferred by the data.

Figure 2. Demonstration of the foreground-orthogonal projection with a range of simulated cosmological signals. The particular signals are chosen for illustrative purposes; the different lines correspond to different settings of the X-ray production efficiency in early galaxies (f_X from 10^-1, blue, to 10^3, yellow; see Fialkov et al. (2014a,b); Fialkov & Barkana (2014) for details). The top panel shows T_signal with the recognizable shape, but the bottom panel shows the foreground-orthogonal component T_signal,projected of these signals (note the y-axis scale). We want to stress that we show these cosmological signals without any systematics and only for illustrative purposes, while T_signal in our analysis may contain both cosmological and systematic contributions.

Figure 3. Illustration of the FlexKnot parameterization of T_signal(ν). Top: An example FlexKnot with 6 interpolation points (knots, red) and the corresponding interpolated curve (blue). Only the knots (not the interpolation) are restricted by the priors to lie within the range ν_i ∈ [50 MHz, 100 MHz] and T_i ∈ [−2 K, 2 K]. Bottom: An example posterior distribution of the knot coordinates {ν_i, T_i} for a FlexKnot run. This is the raw data from which the signal posteriors (sections 3.2 and 3.3) are computed.
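The FlexKnot construction described in this caption can be sketched in a few lines: free knot coordinates are sorted in frequency and interpolated onto the measurement grid. The function name and the use of linear interpolation are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

# Hypothetical sketch of a FlexKnot curve: free (nu_i, T_i) knots are sorted
# in frequency and linearly interpolated onto the measurement grid. The priors
# restrict only the knots, e.g. nu_i in [50, 100] MHz and T_i in [-2, 2] K.
def flexknot(nu_grid, knot_nu, knot_T):
    order = np.argsort(knot_nu)  # knots may be sampled in any order
    return np.interp(nu_grid,
                     np.asarray(knot_nu, dtype=float)[order],
                     np.asarray(knot_T, dtype=float)[order])

nu = np.linspace(50.0, 100.0, 256)  # frequency grid in MHz
curve = flexknot(nu, [50.0, 60.0, 80.0, 100.0], [0.0, -0.5, 0.3, 0.1])
```

Sorting the knots before interpolation keeps the curve well-defined for any sampled knot ordering, which is convenient when the knot coordinates are free parameters of a nested sampling run.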

Figure 4. Original foreground basis functions F_i(ν) (top) and orthogonal foreground basis functions f_i(ν) (bottom). The new basis functions are orthogonal and normalized, and thus allow us to easily project signals into a foreground-orthogonal space.
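An orthonormal basis like the one shown here can be obtained from any sampled foreground basis via a QR decomposition. A minimal sketch, where the log-polynomial-style basis, grid, and all names are our own illustrative assumptions rather than the paper's code:

```python
import numpy as np

# Hypothetical sketch: orthonormalize a power-law-style foreground basis
# F_i(nu), sampled on a frequency grid, via QR decomposition.
nu = np.linspace(50.0, 100.0, 123)  # MHz, EDGES low-band range
x = nu / 75.0                        # normalized frequency
F = np.stack([x ** (-2.5 + i) for i in range(5)], axis=1)  # columns F_i(nu)

Q, _ = np.linalg.qr(F)               # columns of Q: orthonormal f_i(nu)

def orthogonal_part(signal):
    """Remove the component of `signal` that the foreground basis can absorb."""
    return signal - Q @ (Q.T @ signal)

# Any pure foreground realization projects to (numerically) zero:
fg = F @ np.array([1000.0, -50.0, 3.0, 0.2, -0.01])
residual = orthogonal_part(fg)
```

Since the columns of Q span the same space as those of F, subtracting Q(Qᵀ s) leaves only the part of a signal s that no foreground parameter choice can mimic.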

Figure 5. Gaussian Process fit to the synthetic data set introduced in section 2.2. The blue contours indicate the 68%, 95% and 99.7% credible intervals of the temperature values T_signal,projected at every frequency, derived from the Gaussian Process posterior. The red line shows the foreground-orthogonal component of the true signal (see section 2.2 for details) and the black dots show the foreground-orthogonal component of the synthetic data vector. The figure shows that the Gaussian Process correctly recovers the input signal shape and that its uncertainty is consistent with the scatter in the data.
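Per-frequency credible bands like those in this figure follow from the standard Gaussian Process posterior mean and variance. A minimal self-contained sketch with a squared-exponential kernel and toy data; the kernel choice, hyperparameters, and data are illustrative assumptions, not the fitted values from the paper:

```python
import numpy as np

# Toy GP regression: condition on noisy "projected signal" data and read off
# the posterior mean and per-frequency variance (illustrative settings only).
def rbf(a, b, amp=0.1, ell=10.0):
    return amp**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

nu_d = np.linspace(50.0, 100.0, 20)   # data frequencies (MHz)
y = 0.05 * np.sin(nu_d / 5.0)         # toy data vector (K)
sigma_n = 0.02                        # assumed noise level (K)

K = rbf(nu_d, nu_d) + sigma_n**2 * np.eye(len(nu_d))
nu_s = np.linspace(50.0, 100.0, 200)  # prediction grid
Ks = rbf(nu_s, nu_d)

mean = Ks @ np.linalg.solve(K, y)     # GP posterior mean
var = np.clip(np.diag(rbf(nu_s, nu_s) - Ks @ np.linalg.solve(K, Ks.T)), 0.0, None)
band68 = mean[:, None] + np.array([-1.0, 1.0]) * np.sqrt(var)[:, None]
```

Stacking such one-sigma (and wider) bands at every frequency produces contour plots of exactly the kind shown in the figure.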

Figure 7. Gaussian Process posterior contours without orthogonalization. The blue contours show the posterior distribution of signals for the synthetic data set (top) and the EDGES low-band data (bottom). The red line (top) indicates the input signal (ground truth) for the synthetic data; the orange dashed line (bottom) shows the flattened Gaussian fit to the EDGES data. The Gaussian Process contours for the synthetic data are consistent with the input, but both Gaussian Process posteriors are rather wide and unconstraining due to their degeneracy with the foreground term.

Figure 8. FlexKnot fit to our synthetic data set, visualized as posterior contours (top) and random posterior samples (bottom). These plots use the full signal T_signal, rather than the foreground-orthogonal projection we showed for the GP fit. The only post-processing we apply is to shift the signal to have zero mean, T_signal − ⟨T_signal⟩_ν. Both plots show the input signal (red line) as the ground truth that the synthetic data was generated from. Top: Credible intervals of the temperature value at each frequency ν, visualized as blue contours corresponding to the 68%, 95% and 99.7% credible intervals respectively. We see that the ground truth (red line, synthetic data input signal) is recovered within the 68% credible region of the posterior. Bottom: Random samples drawn proportional to the FlexKnot posterior. Every sample is a function

Figure 9. Bayesian evidence (logarithmic, base e) as a function of the number of knots for the FlexKnot fit to the synthetic data set. Marginalizing over the number of knots corresponds to weighting runs by their evidence Z, so this plot shows us which runs account for the largest amount of posterior mass. For each N_knots value we perform multiple Nested Sampling runs (shown as black dots) to account for outliers, and take the mean (red markers) of the log Z values, since Z is log-normally distributed (Handley et al. 2015c).
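The weighting described here can be sketched as follows: average log Z over repeated runs per knot number, then weight each knot number by exp(mean log Z) when marginalizing. The log Z values below are illustrative numbers, not results from the paper:

```python
import numpy as np

# Hypothetical per-N_knots nested sampling log-evidence estimates.
# Z is log-normally distributed, so we average in log space per N_knots.
logZ_runs = {
    3: [-120.4, -120.9, -120.6],
    4: [-118.2, -118.5, -118.0],
    5: [-118.9, -119.3, -119.1],
}
mean_logZ = {n: float(np.mean(v)) for n, v in logZ_runs.items()}

# Marginalize over N_knots: weight each knot number by its evidence.
logZ = np.array(list(mean_logZ.values()))
weights = np.exp(logZ - logZ.max())  # subtract max for numerical stability
weights /= weights.sum()             # posterior weight per N_knots
```

Subtracting the maximum log Z before exponentiating avoids underflow when the evidences are strongly negative, while leaving the normalized weights unchanged.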
…, but closer to the expectation than the flattened Gaussian fit (σ_noise = 0.026 K ±

Figure 11. FlexKnot posteriors for the EDGES low-band data set (Bowman et al. 2018), visualized as credible intervals (top) and posterior samples (bottom). The plots show the full signal except for a zero-mean shift, as described in section 3.2. Top: Credible intervals of the temperature T_signal − ⟨T_signal⟩_ν at every frequency (blue contours) derived from the FlexKnot fit, along with the best-fitting flattened Gaussian model (orange dashed line, equation 3) for comparison. While we can recognize the main posterior mode from this plot, the contours indicate a multi-modal distribution and do not describe the posterior in a useful way. Bottom: Random samples drawn from the FlexKnot posterior. We show five subplots in this panel: Subpanel A shows samples drawn from the whole posterior (black), as well as the best-fitting flattened Gaussian model (orange dashed line) again for comparison. The posterior samples (black lines) clearly show the multi-modal nature of the posterior. We can distinguish four modes in the posterior by eye, and colour these in the subpanels below. Subpanel B highlights the largest mode (blue, 61% of the posterior), followed by subpanel C (green, 14%), subpanel D (orange, 23%), and subpanel E (red, 2.8%). In each panel, we show all posterior samples in grey and highlight the respective mode in colour.

Figure 13. Noise and foreground parameter distribution for the FlexKnot signal fit to the EDGES low-band data. The foreground parameters a_i are degenerate, as expected. We notice that the FlexKnot fit is closer to the data and thus produces a lower noise posterior σ_noise (top right inset, compared to the dashed orange line).

Figure A1. Gaussian Process posterior contours with mean-centering for the synthetic data (top) and EDGES low-band data (bottom). This figure is the counterpart to Figure 7, but with the mean-centering as described in section 2.5.2. The mean-centering helps reduce the degeneracy in the Gaussian Process contours somewhat, but does not solve the issue. The effect is not as strong as for the FlexKnot method (discussed in appendix C).