Optimal data compression for Lyman-α forest cosmology

The Lyman-α three-dimensional correlation functions have been widely used to perform cosmological inference using the baryon acoustic oscillation scale. While the traditional inference approach employs a data vector with several thousand data points, we apply near-maximal score compression down to tens of compressed data elements. We show that carefully constructed additional data beyond those linked to each inferred model parameter are required to preserve meaningful goodness of fit tests that guard against unknown systematics, and to avoid information loss due to non-linear parameter dependences. We demonstrate, on suites of realistic mocks and Data Release 16 data from the Extended Baryon Oscillation Spectroscopic Survey, that our compression approach is lossless and unbiased, yielding a posterior that is indistinguishable from that of the traditional analysis. As an early application, we investigate the impact of a covariance matrix estimated from a limited number of mocks, which is only well conditioned in compressed space.


INTRODUCTION
In recent decades, the Lyman-α (Lyα) forest gained popularity as a probe of the distribution of matter at redshifts z > 2. The forest consists of a sequence of absorption lines in high-redshift quasar (QSO) spectra, caused by neutral hydrogen along the line-of-sight, and hence it is a tracer of the intergalactic medium (IGM). Therefore, it contains cosmological information, and in particular Lyα clustering shows the distinct baryon acoustic oscillations (BAO) feature. This feature was first detected in the Lyα auto-correlation function using the Baryon Oscillation Spectroscopic Survey (BOSS) DR9 data (Busca et al. 2013; Slosar et al. 2013; Kirkby et al. 2013), and subsequently extracted from the Lyα cross-correlation with QSOs using DR11 data (Font-Ribera et al. 2014).
The Lyα forest auto-correlation and its cross-correlation with quasars have been widely used to place constraints on the cosmological model (e.g. Aubourg et al. 2015; Alam et al. 2017; Cuceu et al. 2019; Alam et al. 2021; Cuceu et al. 2023a). These two correlation functions are typically computed on a 2D grid in comoving coordinates along and across the line-of-sight, resulting in high-dimensional data vectors, usually 2500 bins long for the auto-correlation and 5000 for the cross-correlation. However, standard BOSS and eBOSS (du Mas des Bourboux et al. 2020; hereafter dMdB20) Lyα forest analyses have so far focused on extracting cosmological information from the BAO peak, which is well localized to a smaller subset of bins. This means that the data vector can be reduced to a smaller dimensionality, encoding the information we wish to capture. Hence, in this context, applying a data compression scheme could be useful to optimize the inference. In addition, the accuracy of the parameter estimates is tightly linked to the covariance matrix of the data vector, under the assumption of a Gaussian likelihood. As the true covariance of the correlation function is inaccessible, standard analyses usually estimate it either from a large set of mocks or analytically from models of the covariance matrix (Kitaura et al. 2016; Wadekar et al. 2020). In Lyα analyses, producing mocks can be a highly computationally expensive process, therefore only a limited number is available, 100 in the case of dMdB20. However, if the number of samples is significantly lower than the number of data points, the estimate of the covariance is singular and has no inverse (Hartlap et al. 2007; Dodelson & Schneider 2013; Taylor & Joachimi 2014; Sellentin & Heavens 2015; Percival et al. 2021).
In the eBOSS DR16 analysis, the covariance matrix C is computed via the sub-sampling method, which, given some dataset, consists of computing the covariance of correlation functions obtained in individual subsamples of the sky. Despite being larger (∼800) than the number of mocks (100), the number of subsamples is still lower than the number of data points (2500-5000); hence, the covariance matrix must be tested. Alternatively, in the same analysis, the authors computed a Gaussian covariance matrix using the Wick approximation (Delubac et al. 2015) and used it to benchmark the covariance computed from the sub-sampling method. The accuracy of the covariance matrix would increase by alleviating the mismatch between the number of bins and the number of mocks. This can be done by applying a data compression algorithm and evaluating the (compressed) data covariance matrix in a new space characterized by a lower dimensionality. In particular, given the available set of a hundred mocks, we reduce each of them to a set of compressed data vectors and compute a newly defined mock sample covariance, which is a good estimator of the true covariance, given that the length of the compressed data vector is now much smaller than the number of mocks. Then, a comparison between the covariance matrix of the data, mapped into the compressed space, and the mock sample covariance, obtained from the compressed vectors, can clarify whether there has been an underestimation or overestimation of the contours in the standard analyses. Moreover, we are interested in obtaining a more sensitive goodness of fit test. The length of Lyα correlation data vectors is of the order of O(10^3), which could easily hide any bad fit in a subset of the data. By reducing the dimensionality of the data vector through compression, we wish to obtain a test that would highlight when a few important points are off.
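The singularity problem mentioned above can be illustrated with a minimal numpy sketch (not from the paper; sizes are hypothetical): a sample covariance built from fewer samples than data bins has rank at most n_samples − 1, so it cannot be inverted.

```python
import numpy as np

# Illustrative sketch: a sample covariance estimated from fewer samples
# than data points has rank at most n_samples - 1, hence it is singular.
rng = np.random.default_rng(0)
n_bins, n_samples = 50, 20          # hypothetical sizes
mocks = rng.normal(size=(n_samples, n_bins))
cov = np.cov(mocks, rowvar=False)   # (n_bins, n_bins) sample covariance

rank = np.linalg.matrix_rank(cov)
print(rank)                          # at most n_samples - 1 = 19
```

The same construction with n_samples well above n_bins (as in the compressed space) yields a full-rank, invertible matrix.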
Driven by these optimization problems, we perform the inference analysis on realistic Lyα×Lyα auto- and Lyα×QSO cross-correlation functions in a data compression framework. The compression algorithm we use is score compression (Alsing & Wandelt 2018), under the hypothesis of a Gaussian likelihood (and hence analogous to the Multiple Optimised Parameter Estimation and Data compression (MOPED) scheme; see Heavens et al. 2000). By construction, the dimensionality of the compressed data vector will be equal to the number of parameters we wish to retain information about, namely O(10).
The paper is structured as follows. We start in Sect. 2 by outlining the method, explaining the computation of the covariance matrix, and introducing the modelling and the basic idea behind score compression. We proceed in Sect. 3 by testing the compression algorithm against loss of information, comparing the inferred posterior distribution for our sampled parameters in the traditional and compressed frameworks. In Sect. 4, we compare the constraining power of the original estimated covariance matrix against the mock-to-mock covariance. We then perform goodness of fit tests in the compressed framework in Sect. 5. Throughout the analysis, a tight prior on the BAO parameters is imposed to overcome the problem of the non-linear relation between these and their corresponding summary statistics components. We relax the prior constraint, and hence make the analysis more generalizable, by extending the framework as described in Sect. 6. An application of our new framework to eBOSS DR16 data is presented in Sect. 7. Conclusions are drawn in Sect. 8.
Making sure the analysis is both optimized and reliable is key to interpreting the vast amount of Lyα forest data which will become available from the Dark Energy Spectroscopic Instrument (DESI).

METHOD
Generically referring to the Lyα auto- and cross-correlations as the data vectors, the goal of this work is to study data compression in the context of Lyα forest 3D analyses. In particular, this means compressing the data down to a set of summary statistics t, which will encode into a shorter vector the information we are interested in. As we have just seen, this also benefits the computation of the covariance matrix. The new 'compressed' framework is tested against the traditional analysis while performing parameter inference. To evaluate posterior distributions we use the nested sampler Polychord (Handley et al. 2015a,b).
We start in Sect. 2.1 by introducing the mocks used in this analysis, with a focus on the computation of the covariance matrix. We then describe the modelling of the Lyα×Lyα and the cross Lyα×QSO power spectra in Sect. 2.2, as implemented in vega (Cuceu et al. 2023b), and the set of randomly generated Monte Carlo realizations of the correlation function in Sect. 2.3. In Sect. 2.4 we finally outline the compression method used, namely score compression.

Synthetic data vector and covariance
In this work we use a set of 100 realistic Lyα mocks, with and without contaminants, which were produced for the Lyα eBOSS DR16 analysis (du Mas des Bourboux et al. 2020). The synthetic Lyα transmitted fluxes are produced using the CoLoRe (Ramírez-Pérez et al. 2022) and LyaCoLoRe (Farr et al. 2020) packages, from the same cosmology for all the mocks. Synthetic quasar spectra are then generated given some astrophysical and instrumental prescriptions, and contaminants are added if requested. Then the mocks run through the same analysis pipeline (picca) as the real data, resulting in measured auto- and cross-correlation functions (dMdB20). These are derived by computing the correlation function in each HEALPix (Górski et al. 2005) pixel - about 880 pixels (subsamples) for the eBOSS footprint (NSIDE=16) - and evaluating the mean and covariance over the full set of pixels of the mock, to be then assigned to the entire survey. In this way, for every i-th mock, there will be a measurement of both the correlation function and the covariance matrix C_i, which will be only an estimate of the true covariance as mentioned above. In each subsample, the correlation has a size of either 2500 (auto) or 5000 (cross) bins, hence the number of subsamples (880 pixels) is significantly lower than the number of data points (2500 or 5000). This means that the covariance should be singular; however, off-diagonal elements of the correlation matrix are smoothed to make it positive definite (dMdB20).
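The sub-sampling estimate described above can be sketched as follows. This is a simplified, unweighted version (the real picca pipeline weights each pixel); the array shapes are toy values, not the actual eBOSS binning.

```python
import numpy as np

# Hypothetical sketch of the sub-sampling estimate: given the correlation
# function measured in each HEALPix sub-sample (shape: n_pix x n_bins),
# the survey measurement is the mean over pixels and its covariance is
# the pixel-to-pixel sample covariance divided by the number of pixels.
def subsample_mean_and_cov(xi_pix):
    n_pix = xi_pix.shape[0]
    xi_mean = xi_pix.mean(axis=0)
    # covariance of the mean: sample covariance over pixels / n_pix
    cov_mean = np.cov(xi_pix, rowvar=False) / n_pix
    return xi_mean, cov_mean

rng = np.random.default_rng(1)
xi_pix = rng.normal(size=(880, 100))   # 880 pixels, toy 100-bin correlation
xi_mean, cov_mean = subsample_mean_and_cov(xi_pix)
```

With the real bin counts (2500 or 5000) the 880 pixels are too few, which is precisely why the off-diagonal smoothing mentioned above is needed.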
Finally, given the same hundred mocks, it is possible to define a stack of them. In particular, the correlation function for the stack of mocks is obtained by collecting all the subsamples (for all the hundred mocks), and computing the mean and covariance of the correlation functions computed in each of them, effectively reducing the noise. We will refer to the contaminated auto- and cross-mock correlations of the stack as stacked correlations.
In this analysis, we use the same scale cuts as in eBOSS DR16 (du Mas des Bourboux et al. 2020), assuming r_min = 10 h^-1 Mpc, up to r_max = 180 h^-1 Mpc. The effective redshift of the correlation functions is z_eff = 2.3.
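A scale cut of this kind amounts to masking the 2D correlation grid by isotropic separation. The sketch below uses an illustrative 50×50 grid of 4 h^-1 Mpc bins (a guess at the auto-correlation layout, not taken from the paper):

```python
import numpy as np

# Sketch of an eBOSS DR16-style scale cut: keep only bins whose comoving
# separation r = sqrt(r_par^2 + r_perp^2) lies between r_min = 10 and
# r_max = 180 h^-1 Mpc. Bin centres below are illustrative.
r_par = np.arange(2.0, 200.0, 4.0)    # line-of-sight bin centres
r_perp = np.arange(2.0, 200.0, 4.0)   # transverse bin centres
rp, rt = np.meshgrid(r_par, r_perp, indexing="ij")
r = np.hypot(rp, rt)

mask = (r >= 10.0) & (r <= 180.0)     # mask for the flattened data vector
print(mask.sum(), "of", mask.size, "bins kept")
```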

Modelling and parameter space
To model the Lyα correlation functions we follow Eq. (27) of du Mas des Bourboux et al. (2020), while applying the same prescriptions as in Gerardi et al. (2022). Given a certain cosmological model and a corresponding isotropic linear matter power spectrum P(k, z), the Lyα auto and Lyα-QSO cross power spectra are computed as

P_Lyα×Lyα(k, μ_k, z) = b_Lyα^2 (1 + β_Lyα μ_k^2)^2 P(k, z) F_nl,Lyα(k, μ_k),   (1)

P_Lyα×QSO(k, μ_k, z) = b_Lyα b_QSO (1 + β_Lyα μ_k^2)(1 + β_QSO μ_k^2) P(k, z) √(F_nl,Lyα) F_nl,QSO(k_∥),   (2)

Table 1. Full set of sampled parameters, alongside the fiducial values used to compute the summary statistics (see Eq. 8), priors and the 1D marginals (68% c.l.). Uniform (U) priors are adopted for the sampling procedure, while we assign a Gaussian prior on β_HCD, where by notation the Gaussian distribution N(μ, σ) has mean μ and standard deviation σ. Results are split into 'Testing the framework (stacked)' and 'Testing the covariance (single mock)', which respectively refer to the setup in Sect. 3 and Sect. 4. The former set of results shows the comparison between the traditional and the compressed inference pipelines using the stacked auto- and cross-correlation mocks, while the second between the compressed method using either the original covariance C (which is mapped into the compressed space) or the mock-to-mock covariance C_mock, for a single mock.
where μ_k = k_∥/k, with k and k_∥ the wave vector modulus and its line-of-sight component, respectively. On one hand, the Lyα×Lyα power spectrum in Eq. (1) depends on the Lyα forest linear bias b_Lyα and RSD parameter β_Lyα = b_η,Lyα f(z)/b_Lyα, where b_η,Lyα is an extra unknown bias, the velocity divergence bias, and f(z) the logarithmic growth rate. The F_nl,Lyα term accounts for non-linear corrections (Arinyo-i-Prats et al. 2015). On the other hand, the quasar parameters that contribute to the Lyα×QSO power spectrum in Eq. (2) are the quasar linear bias b_QSO and the redshift-space distortions (RSD) term β_QSO = f(z)/b_QSO. Finally, we model non-linear effects of quasars and redshift errors following du Mas des Bourboux et al. (2020), using a Lorentzian function F_nl,QSO(k_∥) = 1/[1 + (k_∥ σ_v)^2], where σ_v is the velocity dispersion. The power spectra in Eqs. (1-2) only account for Lyα flux, and in reality this is also contaminated by absorption lines due to heavy elements, generally referred to as metals, and high column density (HCD) systems (Bautista et al. 2017; Font-Ribera et al. 2012). Let us first focus on the modelling of the HCDs. Font-Ribera et al. (2012) showed that their broadening effect along the line-of-sight can be modelled at the level of new effective Lyα bias and RSD parameters, with b_HCD and β_HCD being the linear bias and RSD parameters. F_HCD(k_∥) is a function of the line-of-sight wavenumber, and it is modelled following dMdB20. On the other hand, metals contribute to the final auto- and cross-correlation functions as per

ξ_auto = ξ_Lyα×Lyα + Σ_m ξ_Lyα×m + Σ_{m1,m2} ξ_m1×m2,

ξ_cross = ξ_Lyα×QSO + Σ_m ξ_QSO×m,

where m generically refers to a metal and the sums are performed over all possible metals considered. The modelling of the cross-correlation of a metal with other metals (ξ_m1×m2) and with Lyα (ξ_Lyα×m) and QSO (ξ_QSO×m) follows the modelling of the auto- and cross-correlations of the Lyα, and each metal line has a linear bias b_m and RSD parameter β_m = b_η,m f(z)/b_m. Following dMdB20, we fix all β_m = 0.5, and sample the metal biases.
Based on this modelling, we use the code vega to compute the two-dimensional correlation functions ξ. This same code incorporates both the BAO feature parameters {α_∥, α_⊥}, which shift the peak along and across the line-of-sight, and the Gaussian smoothing (Farr et al. 2020), which accounts for the low resolution of the mocks and is parameterized by the {σ_∥, σ_⊥} smoothing parameters.
For all these parameters we choose uniform priors, which are listed in Tab. 1. The only exception is given by β_HCD, for which, following the previous eBOSS DR16 analysis, we impose an informative Gaussian prior.

Monte Carlo realizations
We here introduce a different kind of simulated data, which we will use later, defined as Monte Carlo realizations. They are correlation functions obtained by adding noise on top of the model, as defined in Sect. 2.2. The noise is given by a covariance matrix from one of the hundred mock correlations that have been seen so far. What this means is that we can imagine every data point to be generated from a normal distribution N(ξ, C), where ξ is the model correlation function and C is given by the covariance of the first realistic mock. Using Monte Carlo simulations comes with two advantages. First, it is possible to generate as many as needed to build any statistics.
Secondly, we have control over the model and there will be clear knowledge of the underlying physics.
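Generating such realizations is a one-line draw from a multivariate normal. The sketch below uses a toy model vector and a diagonal stand-in covariance (the paper uses the covariance of the first realistic mock):

```python
import numpy as np

# Sketch of the Monte Carlo realizations: each synthetic data vector is a
# draw from N(xi_model, C), with xi_model the model correlation function
# and C the covariance matrix. Toy 1D inputs shown here.
rng = np.random.default_rng(2)
n_bins = 100
xi_model = np.exp(-np.linspace(0, 5, n_bins))   # stand-in model vector
C = 0.01 * np.eye(n_bins)                       # stand-in covariance

realizations = rng.multivariate_normal(xi_model, C, size=1000)
print(realizations.shape)                       # (1000, 100)
```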

Score compression
To reduce the dimensionality of our datasets we use score compression (Alsing & Wandelt 2018). Given a known form for the log-likelihood function L, this method corresponds to linear transformations of the data, based on the idea of compressing them down to the score function t = ∇L|_*. The components of the compressed vector are the derivatives of the log-likelihood function, evaluated at some fiducial set of parameters θ_*, with respect to the parameters of interest θ. Under the assumptions that the likelihood function is Gaussian and the covariance C does not depend on parameters, from the data d the compressed data vector is obtained as

t = ∇m_*^T C^{-1} (d − m_*),   (8)

where m_* is the fiducial model. Under these assumptions the compression is identical to the widely used MOPED scheme (Heavens et al. 2000) apart from a bijective linear transformation.
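Eq. (8) can be sketched in a few lines. The finite-difference Jacobian is an implementation choice for this illustration, not necessarily what the paper's pipeline does; the linear toy model makes the compression exactly lossless.

```python
import numpy as np

def score_compress(d, model, theta_star, C_inv, eps=1e-6):
    """Sketch of Gaussian score compression: t = B^T C^{-1} (d - m*),
    with B = dm/dtheta at the fiducial theta_star, here estimated by
    central finite differences (an illustrative choice)."""
    theta_star = np.asarray(theta_star, dtype=float)
    m_star = model(theta_star)
    B = np.empty((len(m_star), len(theta_star)))
    for i in range(len(theta_star)):
        step = np.zeros_like(theta_star)
        step[i] = eps
        B[:, i] = (model(theta_star + step) - model(theta_star - step)) / (2 * eps)
    return B.T @ C_inv @ (d - m_star)

# toy example: 30 data points compressed down to 3 components
X = np.vander(np.linspace(0, 1, 30), 3)
model = lambda th: X @ th
C_inv = np.eye(30)
t = score_compress(X @ np.array([1.0, -2.0, 0.5]), model, [1.0, -2.0, 0.5], C_inv)
print(t.shape)   # 3 compressed components
```

Because the toy data equal the fiducial model exactly, the compressed vector is zero, i.e. the score vanishes at the maximum-likelihood point.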
In our case the model corresponds to the correlation function ξ, described earlier in Sect. 2.2. The corresponding likelihood distribution in compressed space will then be given by

L(t|θ) = (2π)^{−n_t/2} |F|^{−1/2} exp[−(t − t(θ))^T F^{−1} (t − t(θ))/2],   (9)

where n_t is the length of t, t(θ) is the compressed model ξ evaluated at θ, namely t(θ) = ∇m_*^T C^{−1} (ξ(θ) − m_*), and

F = ∇m_*^T C^{−1} ∇m_*   (10)

is the Fisher matrix.
When considering both the auto- and cross-correlations, some parameters will be in common; for this reason, there is the need to build a joint summary statistic. If we define independently the Lyα auto- and cross-data vectors, characterized by the covariances C_auto and C_cross respectively, and given that they do not correlate with each other, in the joint analysis the full covariance matrix will be given by the block-diagonal form C = diag(C_auto, C_cross). Then the resulting summary statistics vector and Fisher matrix will be respectively obtained as t = t_auto + t_cross and F = F_auto + F_cross. This compression method is dependent on the choice of the fiducial set of parameters θ_*, which might not be known a priori. However, Alsing & Wandelt (2018) suggest iterating over the Fisher scoring method for maximum-likelihood estimation until convergence of the full set of parameters. How this is done in our particular case is described at the beginning of Sect. 3. An important note is that this iterative procedure does not take into account parameter priors, which means it can potentially lead to unusual values for those parameters which are not well constrained by the data.
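The additivity of the joint summary statistic follows directly from the block-diagonal covariance, and can be verified numerically on toy inputs (all shapes below are illustrative):

```python
import numpy as np
from scipy.linalg import block_diag

# Sketch: with a block-diagonal joint covariance, the joint compressed
# vector reduces to the sum of the per-dataset compressed vectors.
rng = np.random.default_rng(3)
B_auto, B_cross = rng.normal(size=(20, 3)), rng.normal(size=(30, 3))
C_auto, C_cross = np.eye(20), 2.0 * np.eye(30)
r_auto, r_cross = rng.normal(size=20), rng.normal(size=30)  # residuals d - m*

t_auto = B_auto.T @ np.linalg.inv(C_auto) @ r_auto
t_cross = B_cross.T @ np.linalg.inv(C_cross) @ r_cross

B_joint = np.vstack([B_auto, B_cross])
C_joint = block_diag(C_auto, C_cross)
t_joint = B_joint.T @ np.linalg.inv(C_joint) @ np.concatenate([r_auto, r_cross])

assert np.allclose(t_joint, t_auto + t_cross)
```

The same block structure gives F = F_auto + F_cross for the Fisher matrix.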
In computing the score compression components over the parameters {α_∥, α_⊥}, we realized their relation with their corresponding summary statistics components, namely {t_α∥, t_α⊥}, was not monotonic, as per Fig. 1. This is problematic, as it means the posterior can have more than one peak (Graff et al. 2011) if we sample over the full [0.01, 1.99] interval. We overcame this complexity by imposing a tighter prior on {α_∥, α_⊥} at the sampling step. This prior is designed to allow for α_∥ values in between the minimum and maximum of t_α∥(α_∥). The same prior is imposed on α_⊥. This tightening does not affect the inference when performed on the correlation function of the stacked mock, in which case posteriors are well within this prior, but it proves to be quite important when evaluating the posteriors on the individual mocks. For this reason, we make sure we provide example results for those mocks whose contours are within the prior range.

Fig. 1. t_α∥ as a function of α_∥ (see Eq. 8), against the value of t_α∥ evaluated using α_∥ = 1.00 (see Tab. 1), denoted as 'data'. The remainder of the parameters are set to the fiducial values listed in Tab. 1. This figure highlights a non-monotonic relationship between the two quantities, which would lead to multiple peaks in the posterior if a tight prior is not imposed.
Later, in Sect. 6, we will see how we can remove the tight prior constraint by evaluating the summary statistics components associated with {α_∥, α_⊥} at multiple fiducial values of the BAO parameters, effectively enlarging the compressed vector.

COMPRESSION PERFORMANCE
In this Section we apply the score compression algorithm, outlined in Sect. 2.4, to Lyα auto- and cross-correlations measured from contaminated mocks. The pipeline starts by choosing a fiducial set of parameters for computing the score compressed vector, as per Eq. (8). The fiducial is obtained by iterating over Eq. (12), with θ_0 given by the best fit of the stacked correlation functions. Given this initial guess, we then iterated, assigning to θ_{i+1} the median of the θ values over the hundred mocks at the i-th step.
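The Fisher-scoring update of Alsing & Wandelt (2018) can be sketched as θ_{k+1} = θ_k + F^{-1} t(θ_k); for the toy linear model below it converges in a single step (the paper's pipeline instead iterates over the median across mocks, which this sketch does not reproduce):

```python
import numpy as np

# Sketch of Fisher scoring for a Gaussian likelihood with a linear model:
# theta_{k+1} = theta_k + F^{-1} t(theta_k), with t the score vector and
# F the Fisher matrix at the current fiducial.
def fisher_scoring(d, X, C_inv, theta0, n_iter=5):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        t = X.T @ C_inv @ (d - X @ theta)   # score at current fiducial
        F = X.T @ C_inv @ X                 # Fisher matrix
        theta = theta + np.linalg.solve(F, t)
    return theta

X = np.vander(np.linspace(0, 1, 50), 3)
truth = np.array([0.5, 1.0, -0.3])
theta_hat = fisher_scoring(X @ truth, X, np.eye(50), theta0=[0.0, 0.0, 0.0])
print(theta_hat)   # recovers the input parameters
```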
The likelihood is assumed to be Gaussian, which has a major impact on the final form of the compressed vector, given that the latter is computed as the gradient of the log-likelihood. Based on previous analyses, we assume the data are normally distributed, and this assumption of Gaussianity will also be inherited in the compressed space. In general, when mapping into a compressed space, this property might not be preserved, but given that score compression is a linear transformation, here that is the case. We make a consistency check by running the Henze-Zirkler test (Henze & Zirkler 1990) for multivariate normality in the compressed space. Intuitively, this test measures the distance between the measured and target (multivariate) distribution, and it was shown to perform well in high-dimensional problems. We found that the summary statistics, computed for the hundred mocks at the end of the iterative process, follow a multivariate normal distribution.
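A lightweight companion check (this is not the Henze-Zirkler test itself) is that, for multivariate normal vectors, the squared Mahalanobis distances approximately follow a χ² distribution with p degrees of freedom, which can be tested with a Kolmogorov-Smirnov test. All sizes below are toy values:

```python
import numpy as np
from scipy import stats

# Simple multivariate-normality diagnostic on toy "compressed" vectors:
# squared Mahalanobis distances vs a chi^2 distribution with p dof.
rng = np.random.default_rng(4)
n_mocks, p = 100, 10
T = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n_mocks)

mu = T.mean(axis=0)
S_inv = np.linalg.inv(np.cov(T, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", T - mu, S_inv, T - mu)  # squared Mahalanobis
pvalue = stats.kstest(d2, stats.chi2(df=p).cdf).pvalue
print(pvalue)
```

Note that with an estimated mean and covariance the χ² reference is only approximate; the Henze-Zirkler test used in the text is the more rigorous choice.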
Provided the fiducial model and the Gaussianity checks, we first test the compression method on the stack of the mocks, with results presented in this Section; later, in Sect. 4, we compute the covariance matrix for the summary statistics over the set of hundred mocks and compare it to the Fisher matrix as defined in Eq. (10). It is important to keep in mind that, when referring to the Fisher matrix, we are simply referring to the mapping of the data covariance matrix C into the compressed space.

Fig. 2. Posterior distributions for a subset of the sampled parameters, including {b_HCD, β_HCD} (upper right panel). The green contours refer to the results obtained performing the inference using the full uncompressed data vector, which we denote as 'Traditional analysis', while the blue dashed contours refer to the compressed analysis results, denoted as 'Score compression analysis'.
To test the score compression algorithm against the traditional approach, for simplicity, we employ both the contaminated auto- and cross-stacked correlations, which are almost noise-free. This choice is motivated by the fact that we imposed a tight prior on the {α_∥, α_⊥} parameters to overcome the challenges coming from the non-monotonic relationship between these parameters and their corresponding summary statistics components (see Fig. 1). Thus, experimenting on less noisy mock data facilitates running the test in a case where it is guaranteed that posteriors will not hit the priors.
For both the traditional (uncompressed data) and the compressed frameworks we run the Polychord sampler for the auto- and cross-stacked correlations, while sampling the full set of 15 model parameters {α_∥, α_⊥, b_Lyα, β_Lyα, b_QSO, β_QSO, σ_v, σ_∥, σ_⊥, b_η,SiII(1260), b_η,SiII(1193), b_η,SiIII(1207), b_η,SiII(1190), b_HCD, β_HCD}, and results are presented in Fig. 2. The two methods agree well with each other, leading to almost identical results. The numerical values of the peaks and 1σ confidence intervals of the 1D marginals are presented in Tab. 1 as part of the 'Testing the framework (stacked)' set of columns. From the table, it can be noticed that in some cases the fiducial parameters used to compute the compression are not within the 3σ confidence interval. Despite the fiducial being a first guess, and not necessarily accurate, the contours of the two methods agree well with each other.
We have just demonstrated that the score compression inference pipeline leads to the same results as the standard approach. This shows the linearity of the parameters in the model to a good approximation. However, it is important to bear in mind that, in this case, this only holds locally around the fiducial, because of the non-linearity of the components that relate to α_∥ and α_⊥, on which we imposed a tight prior.

TESTING THE COVARIANCE MATRIX
An interesting application of the compression algorithm consists of evaluating the accuracy of the covariance matrix C by comparing it to the mock-to-mock covariance C_mock, which is the covariance matrix of the summary statistics vectors of the set of hundred mocks. We now showcase this application using a single mock.
We recall that the computation of the standard data covariance happens in a setup where the length of the data vector is larger than the number of samples, which is sub-optimal. The covariance should be singular; however, the off-diagonal elements of the correlation matrix are smoothed to make it positive definite (du Mas des Bourboux et al. 2020). Reducing the dimensionality of the data vector via score compression allows us to compute a new covariance matrix C_mock, which has a dimensionality significantly lower than the number of samples used to compute it, given that the new data vector will be ∼O(10) long. The fact that the number of mock samples is now larger than the number of compressed data points means that we are in a framework where the estimated C_mock is in principle a better estimator of the true covariance in compressed space than the Fisher matrix, which is obtained by mapping the covariance C into this space.
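The conditioning argument above can be made concrete with toy compressed vectors (the shapes mirror the paper's setup — 100 mocks, O(10) compressed components — but the numbers themselves are synthetic):

```python
import numpy as np

# Sketch: with ~100 mocks, a covariance estimated over a ~10-dimensional
# compressed space is full rank and invertible, unlike a covariance
# estimated over thousands of correlation-function bins.
rng = np.random.default_rng(5)
n_mocks, p = 100, 10
t_mocks = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n_mocks)

C_mock = np.cov(t_mocks, rowvar=False)   # mock-to-mock covariance
print(np.linalg.matrix_rank(C_mock))     # full rank: invertible
```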
We now repeat the same experiment as in Sect. 3 over a single mock and evaluate the posterior using Polychord for the full set of parameters {α_∥, α_⊥, b_Lyα, β_Lyα, b_QSO, β_QSO, σ_v, σ_∥, σ_⊥, b_η,SiII(1260), b_η,SiII(1193), b_η,SiIII(1207), b_η,SiII(1190), b_HCD, β_HCD}. This is either done using the original covariance matrix C (mapped into the compressed space, to the Fisher matrix) in the Gaussian likelihood in Eq. (9), or instead using the mock-to-mock covariance C_mock adopting a t-distribution as a likelihood function, as proposed in Sellentin & Heavens (2015). The latter is of the form

L(t|θ) ∝ |C_mock|^{−1/2} [1 + χ²(θ)/(n_m − 1)]^{−n_m/2},   (11)

with χ²(θ) = (t − t(θ))^T C_mock^{−1} (t − t(θ)) and a normalization involving the Gamma function Γ, where n_m is the number of mocks and n_t is the length of the compressed data vector. Once again, the choice of the tight prior on both {α_∥, α_⊥} affected the choice of the set of mocks used to run this second experiment. However, the goal of this second experiment is to provide an example case of testing the accuracy of the subsampling estimation of the covariance matrix. If the method is demonstrated to effectively work over some subset of mocks, it is expected that this will also be the case for the others.

Fig. 3. Posterior distributions for a single mock. The blue filled contours refer to the results obtained performing the inference using the original covariance matrix C (mapped into the compressed space) in the likelihood function, and hence are denoted as 'Original covariance'. On the other hand, the red dashed results, denoted as 'Mock-to-mock covariance', refer to the case in which the mock-to-mock covariance matrix is used instead, while adopting a t-distribution likelihood.
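The Sellentin & Heavens t-distribution likelihood used with the mock-to-mock covariance can be sketched as follows; the normalization below is my reading of their published form and should be checked against the original reference before use.

```python
import numpy as np
from scipy.special import gammaln

# Sketch of the modified t-distribution likelihood (Sellentin & Heavens):
# marginalizing over the unknown true covariance given n_m simulations
# replaces the Gaussian by L ∝ |C|^{-1/2} [1 + chi2/(n_m - 1)]^{-n_m/2}.
def log_t_likelihood(t_data, t_model, C_mock, n_m):
    p = len(t_data)
    r = t_data - t_model
    chi2 = r @ np.linalg.solve(C_mock, r)
    sign, logdet = np.linalg.slogdet(C_mock)
    log_norm = (gammaln(n_m / 2.0) - gammaln((n_m - p) / 2.0)
                - 0.5 * p * np.log(np.pi * (n_m - 1.0)))
    return log_norm - 0.5 * logdet - 0.5 * n_m * np.log1p(chi2 / (n_m - 1.0))

# toy evaluation with a unit residual in a 10-dimensional compressed space
val = log_t_likelihood(np.ones(10), np.zeros(10), np.eye(10), n_m=100)
```

Relative to the Gaussian, this likelihood has heavier tails, which is what broadens the contours reported below.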
The results for the BAO parameters {α_∥, α_⊥} and the Lyα parameters {b_Lyα, β_Lyα} are shown in Fig. 3, while the full set is presented in Sect. A and listed in Tab. 1 ('Testing the covariance (single mock)' columns). In this test case, using the mock-to-mock covariance results in a small enlargement of the posterior for the α_⊥ parameter: while using the original covariance matrix provides α_⊥ = 1.002 ± 0.027, the mock-to-mock covariance results in α_⊥ = 1.004^{+0.029}_{−0.032}. On the other hand, the Lyα linear bias and RSD parameter absolute errors increase by 50% and ∼25% respectively, with final relative errors of about 5-6%. The uncertainties of the vast majority of the other parameters agree remarkably well.
We end this discussion on covariance matrix estimation by noting that the test presented here is meant as a showcase of the usefulness of compressing Lyα forest correlation functions. However, proper testing of the Lyα forest covariance matrices would require a more comprehensive analysis using a larger sample of mocks, and comparison with other estimation methods (see e.g. …).

For simplicity, in the χ² analysis we do not include contamination coming from HCDs, so these features are only the effects of metal lines. Also, in this example, in order to better visualize the difference between the two cases, we generate noise from the covariance matrix of the stacked auto-correlation mock.

Fig. 6. χ² histograms where the summary statistics components related to {b_η,SiII(1260), b_η,SiII(1193), b_η,SiIII(1207), b_η,SiII(1190)} were used to build extra degrees of freedom. In blue, the histograms and χ² distributions for the uncontaminated data; in orange, for the contaminated data. The corresponding χ² distributions (dashed lines) are generated assuming as number of degrees of freedom the mean of the histogram distributions. The first set of histograms, relating to all four extra degrees of freedom, presents a strong shift between the orange and the blue distributions: their corresponding means are 3.89 and 67.51. In the SiII(1260) case, both distributions have a mean of ∼1.1, while in the SiII(1190) case, the mean for the uncontaminated data is 1.01, against 10.04 in the contaminated case.

We want to test whether in this case a bad fit to the contaminated data is apparent as a mean χ² significantly larger than 4.
The χ² histograms are shown in the left panel of Fig. 6: the mean values for the uncontaminated and contaminated cases are respectively 3.89 and 67.51. Considering a χ² distribution with number of degrees of freedom equal to 4, the p-values for the two means are respectively 0.4 and 10^{−14}: the bad fit in the contaminated case has emerged.
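The quoted p-values follow directly from the survival function of a χ² distribution with 4 degrees of freedom:

```python
from scipy import stats

# p-values for the measured mean chi^2 values under a chi^2(4) reference:
# the survival function gives the probability of exceeding each mean.
p_uncont = stats.chi2(df=4).sf(3.89)   # acceptable fit
p_cont = stats.chi2(df=4).sf(67.51)    # bad fit exposed
print(p_uncont, p_cont)
```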
We further experimented with the addition of metals, and we considered adding a single extra degree of freedom at a time, associated with either one of the following metals: SiII(1260) and SiII(1190). The resulting χ² histograms are shown in the middle and right panels of Fig. 6, respectively. These two metal lines were chosen because of how differently they affect the data: while the SiII(1260) contamination happens around the BAO scale along the line-of-sight, the SiII(1190) contributes to the peak at ∼60 h^{-1} Mpc. We run the same exact experiment and find that the addition of b_η,SiII(1190) does bring out the bad fit, while the other does not. Specifically, the two χ² distributions when the extra degree of freedom is given by b_η,SiII(1260) have a mean of ∼1, again equal to the number of degrees of freedom, but they cannot be distinguished. The p-values for both distributions, assuming one degree of freedom, are all above a threshold of 0.01. Both distributions are indicative of an acceptable fit. On the contrary, adding the extra compressed component related to SiII(1190) results in a mean χ² of 1.01 in the uncontaminated case and 10.04 in the contaminated one, with corresponding p-values of 0.3 and 10^{−3} if we consider a target χ² distribution of one degree of freedom. This is perhaps indicative of the fact that, in order to capture a bad fit, adding extra degrees of freedom is not enough: these extra degrees of freedom must be informative about features not captured by the core set of parameters. The SiII(1260) line affects the model at scales of the correlation function which are on top of the BAO peak, which we model for, whereas SiII(1190) effectively adds information on a feature which is completely unmodelled.
In light of this, a possible solution is to add some extra degrees of freedom to the maximal compression vector, which are designed to be orthogonal to the already known components in the compressed space. This would allow the extra flexibility, not captured in the model, to highlight a bad fit in the compressed framework. This is an interesting problem which is left for future work. However, a similar solution has already been implemented in the context of MOPED (Heavens et al. 2020), specifically to allow new physics to be discovered.
Not modelling the SiII(1260) line in the uncompressed traditional framework does not result in any bad fit, which makes this an example of systematics hidden in the large original data vector. At the same time, the fact that the SiII(1260) test in the compressed framework fails to show a bad fit at the level of the χ² is quite problematic, given that this metal line is one of the primary contaminants we have to be careful of in BAO measurements, affecting the peak's scale. The worry is then that, despite constructing an extended framework, there is a chance that some systematics hiding in the signal could be missed. This effectively means that in order to apply data compression, the underlying physics must already be well known to a good extent. Because some systematics can be hard to either model or detect, in this example we deliberately assumed we had no knowledge of the known systematics, where in principle we could have also marginalized over them (Alsing & Wandelt 2019).

ROBUSTNESS TO PARAMETER NON-LINEARITIES
Each component of the score-compressed data vector relates to a specific model parameter, as per Eq. (8), via the gradient. Throughout the analysis, the BAO parameters proved to be a source of non-linearities in relation to their summary statistics components (see Fig. 1), sometimes resulting in a multi-peaked posterior distribution. With the intent of mitigating this effect, we were forced to impose a tight prior on both {α_∥, α_⊥}, which reduces the generalizability of the approach.
Based on the work of Protopapas et al. (2005), we explore extensions to the algorithm by considering an ensemble of fiducial values of the BAO parameters when computing the score-compressed vector components related to {α∥, α⊥},
where the model is evaluated at the extra fiducial BAO values {α∥,extra, α⊥,extra}, keeping the previously defined fiducial values for the other parameters. As these extra components effectively represent an extension of the compressed data set, the Fisher matrix in Eq. (10) is also extended accordingly. We test this extension on the same mock that was used to test the subsampling covariance matrix in Sect. 4, imposing a physically motivated uniform prior of [0.65, 1.35] on both α∥ and α⊥; the results are presented in Fig. 7. The ensemble of extra fiducials is given by the set [{α∥ = 0.8, α⊥ = 1.2}, {α∥ = 1.2, α⊥ = 0.8}, {α∥ = 1.3, α⊥ = 0.7}, {α∥ = 0.9, α⊥ = 1.1}], in addition to the original {α∥ = 1.00, α⊥ = 1.01} (see Tab. 1). From Fig. 7 it can be seen that the constraining power on the BAO parameters of the traditional and compressed methods matches. The same result also holds for the other parameters, not shown here.
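Algebraically, this extension amounts to stacking extra gradient columns, evaluated at each extra BAO fiducial, onto the compression, with the Fisher matrix enlarged to match. A minimal self-contained sketch, with illustrative names (the exact form is set by Eqs. 8 and 10):

```python
import numpy as np

def extended_compression(d, mu_fid, grad_fid, grad_extras, cov):
    """Score compression with an ensemble of BAO fiducials.

    grad_fid    : (N, p) model gradients at the main fiducial
    grad_extras : list of (N, k) gradient blocks of the BAO parameters,
                  each evaluated at one extra fiducial in the ensemble
    Returns the extended compressed vector and extended Fisher matrix.
    """
    # stack all gradient columns: p main components plus the extras
    G = np.hstack([grad_fid] + list(grad_extras))
    cinv = np.linalg.inv(cov)
    t = G.T @ cinv @ (d - mu_fid)   # extended compressed data vector
    F = G.T @ cinv @ G              # correspondingly extended Fisher matrix
    return t, F
```

Each extra fiducial contributes components that remain informative where the single-fiducial gradient has turned over, which is why a handful of well-spread extra points suffices to suppress the secondary posterior peaks.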
We tested the generalizability of this extension by progressively adding extra points to the ensemble, with reasonable spread, and found that with three to four extra fiducial sets of BAO parameters the algorithm effectively removes the secondary posterior peaks and increases the accuracy of the measurement. Hence, assuming multiple fiducials for the BAO parameters, on which we previously had to impose a tight prior, enables us to relax the prior constraints.

APPLICATION TO REAL DATA
The score compression framework has so far been tested on realistic mocks; it is therefore straightforward to apply the same algorithm to real eBOSS DR16 Lyα data, for which we refer to du Mas des Bourboux et al. (2020). The set of nuisance parameters is now extended to also include the contamination from carbon absorbers, the systematic quasar redshift error Δr∥, the quasar radiation strength ξTP0, and the sky-subtraction parameters Asky,Lyα and σsky,Lyα. The results presented in Sect. 6 motivate a direct test on the real data of the whole extended framework, which removes the need for the tight prior. The green filled contours refer to the results obtained by performing the inference using the full uncompressed data vector, which we denote as 'Traditional analysis', while the blue dashed contours refer to the compressed analysis results, denoted as 'Score compression analysis'. The framework of the latter is extended here to the assumption of multiple fiducial values for {α∥, α⊥} when performing the compression. The ensemble of BAO parameter fiducial values is given by {α∥ = 1.05, α⊥ = 0.96}, the best-fitting values obtained through the traditional analysis, together with the set found to be effective in Sect. 6. The fiducial values of the other parameters are set to the best fit found with the standard uncompressed analysis. In Fig. 8, we present the agreement of the extended framework with the traditional approach at the level of {α∥, α⊥, bη,Lyα, βLyα, Δr∥, bQSO, σv}. The nuisance parameters are also found to be in excellent agreement.

CONCLUSIONS
Standard analyses of the Lyman-α (Lyα) forest correlation functions focus on a well-localized region corresponding to the baryon acoustic oscillation (BAO) peak. However, these correlation functions usually have dimensions of 2500 or 5000, which means the cosmological signal is extracted from a small subset of bins. Reducing the dimensionality of the data vector, while retaining the information we care about, could therefore be a step forward in optimizing the analysis. At the same time, as extensively explained in Sect. 2, the covariance matrix C used in Lyα correlation analyses is estimated via sub-sampling. However, the dimensionality of the correlation functions is much larger than the number of data samples used to estimate the covariance. Reducing the dimensionality of the data vector to O(10) allows for a reliable estimate of the covariance matrix. Given these premises, the goal of this work is to apply and explore a data compression algorithm for realistic Lyα auto- and cross-correlation functions.
We reduced the dimensionality of the data vector to a set of summary statistics t using score compression. We assume a Gaussian likelihood, test its validity, and show that this assumption is preserved in the compressed space as well, as the compression is a linear transformation. In the compressed space the covariance can be either given by the mapped traditional covariance or by a covariance estimated directly in that space. We tested the compressed framework against the traditional approach at the posterior level, when using the original covariance C, and found that the two agree and no bias is introduced. We then showcased a test example of covariance matrix evaluation in the compressed space, which is a key benefit of the approach, enabling a comparison to the covariance matrix obtained in the traditional suboptimal framework. Because of the non-linear relationship between the BAO parameters and their summary-statistic components, throughout the analysis we adopted a tight prior on {α∥, α⊥}. Later in the analysis, with the aim of increasing the generalizability of the approach while relaxing the prior constraint, we successfully tested extensions to the framework by assuming an ensemble of fiducial values for these problematic parameters.
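The two covariance options summarised here can be sketched as follows, with `B` the linear compression matrix and `t_mocks` the compressed mock realisations (illustrative names, assuming the compression is linear as stated above):

```python
import numpy as np

def compressed_covariances(B, cov, t_mocks):
    """Two routes to a covariance in compressed space:
    (i)  map the original data covariance C through the linear
         compression, giving B C B^T;
    (ii) estimate it directly from mock compressed vectors, which is
         well conditioned once the number of mocks exceeds the O(10)
         compressed dimension, unlike in the original high-dimensional
         space.
    """
    mapped = B @ cov @ B.T                               # route (i)
    sampled = np.cov(np.asarray(t_mocks), rowvar=False)  # route (ii)
    return mapped, sampled
```

Route (ii) is the one that becomes feasible only after compression: with ~100 mocks and thousands of bins the sample covariance is singular, whereas in the O(10)-dimensional compressed space it is invertible.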
We then further examined the compressed framework by testing the inference against unmodelled effects, and we find that if information about unmodelled features in the correlation function is not captured by the compressed data vector, this can potentially lead to biases that do not emerge at the level of the χ² goodness-of-fit test. Hence, we advise against performing goodness-of-fit tests in compressed space, unless the compressed vector is extended to include extra degrees of freedom, analogous to what is done in Heavens et al. (2020). Extending the framework in this sense is left for future work.
We applied our extended compression framework to DR16 data from the Extended Baryon Oscillation Spectroscopic Survey and demonstrated that the posterior constraints are accurately recovered without loss of information. A step change in constraining power, and thus in accuracy requirements, is expected for forthcoming Lyα cosmology analyses by the ongoing DESI experiment (see e.g. Gordon et al. 2023), which will observe up to one million high-redshift quasars at z > 2. Optimal data compression as proposed in this work will facilitate these analyses through inference that is complementary to the traditional approach and through additional consistency and validation tests.
program of the Generalitat de Catalunya. For the purpose of open access, the authors have applied a Creative Commons Attribution (CC BY) licence to any Author Accepted Manuscript version arising.

Figure 1. The behaviour of the summary component tα∥ as a function of α∥, the parameter to which it is related as per Eq. (8), against the value of tα∥ evaluated at α∥ = 1.00 (see Tab. 1), denoted as 'data'. The remaining parameters are set to the fiducial values listed in Tab. 1. The figure highlights a non-monotonic relationship between the two quantities, which would lead to multiple peaks in the posterior if a tight prior were not imposed.

Figure 3. Triangle plots of the BAO parameters of interest {α∥, α⊥} and the Lyα parameters {bLyα, βLyα} for one set of the Lyα auto- and cross-mock correlations. The blue filled contours refer to the results obtained by performing the inference using the original covariance matrix C (mapped into the compressed space) in the likelihood function, and hence are denoted as 'Original covariance'. The red dashed contours, denoted as 'Mock-to-mock covariance', refer to the case in which the mock-to-mock covariance matrix is used instead, while adopting a t-distribution likelihood.

Figure 4. This wedge plot, for |μ| = |r∥/r| between 0.95 and 1.0, shows the effect of adding metals (in orange) to the correlation model without metals (in blue) along the line of sight. For simplicity, in the χ² analysis we do not include contamination from HCDs, so these features are solely the effect of metal lines. In addition, to better visualize the difference between the two, we generated noise from the covariance matrix of the stacked auto-correlation mock.

Figure 5. χ² histograms (left panel) for the maximal compression, and corresponding best-fitting value histograms for the Lyα parameters (right panels), where blue refers to the uncontaminated case and orange to the contaminated one. In the maximal compression setup t = tmax = {tα∥, tα⊥, tbLyα, tβLyα, tσ∥, tσ⊥}. The black dashed lines in the two right-hand panels correspond to the true values used to generate the Monte Carlo realisations.

Figure 6.