Clustering of binary black hole mergers: a detailed analysis of the EAGLE+MOBSE simulation

We perform a detailed study of the cosmological bias of gravitational gave (GW) events produced by binary black hole mergers (BBHM). We start from a BBHM distribution modeled inside the EAGLE hydrodyamical simulation using the population synthesis code MOBSE. We then compare our findings with predictions from different Halo Occupation Distribution (HOD) prescriptions and find overall agreement, provided that the modeled properties of host galaxies and halos in the semi-analytical treatment match those in the simulations. By highlighting the sources of these discrepancies, we provide the stepping stone to build future more robust models that prevent the shortcoming of both simulation-based and analytical models. Finally, we train a neural network to build a simulation-based HOD and perform feature importance analysis to gain intuition on which host halo/galaxy parameters are the most relevant in determining the actual distribution and power spectrum of BBHM. We find that the distribution of BBHM in a galaxy does not only depend on its size, star formation rate and metallicity, but also by its kinetic state.


INTRODUCTION
Future GW interferometers, such as the Einstein Telescope (ET) (Punturo et al. 2010;Sathyaprakash et al. 2012;Maggiore et al. 2020) and Cosmic Explorer (CE) (Evans et al. 2021), are expected to detect ∼ 10 4 -10 5 merger events per year.In this regime, it will be possible to study the spatial distribution and clustering properties of GW sources in similar fashion as done for galaxies.These GW catalogues will contain a significantly smaller number of objects with respect to galaxy surveys and will have relatively poor sky localization, making the smallest scales unaccessible.
Nevertheless, at least for binary black hole mergers, they will probe very large cosmic volumes, since they will be able to collect events from almost the full sky, with so high sensitivity that their horizon will reach very high redshifts ( ∼ 10).Therefore, it will be possible to use them for high precision measurements of cosmological parameters, either by considering their angular power spectrum (Namikawa et al. 2016;Libanore et al. 2021;Namikawa 2021;Libanore et al. 2022;Mukherjee et al. 2021a;Zhu et al. 2022;Yang & Hu 2022), or cross-correlating their distribution with that of other tracers (Scelfo et al. 2018(Scelfo et al. , 2020;;Balaudo et al. 2022;Mukherjee et al. 2022;Scelfo et al. 2023;Libanore et al. 2023).Note that GW surveys do not allow for a direct measurement of the redshift ★ E-mail: matteo.peron@unipr.itat which the merger took place: thus, a tomographic analysis of the sources distribution has either to assume the background cosmology (as it is done e.g. for LIGO-Virgo-KAGRA catalogues Abbott et al. 2016;Abbott et al. 2019;Collaboration & Collaboration 2021;Collaboration & the Virgo Collaboration 2021), to use cross-correlations or other statistical techniques (e.g., in Del Pozzo 2012;Scelfo et al. 2022;Mukherjee et al. 2021a), or to rely only on luminosity distance (adopted, e.g., in Namikawa et al. 2016;Libanore et al. 2021) as a radial position indicator; all these approaches have been so far considered in the literature and have their specific advantages and drawbacks.Methodologies that infer the redshift and combine it with the luminosity distance of GW events also allow for interesting generalizations of the Alcock-Paczynski test (Mukherjee & Wandelt 2018).Finally, it has been pointed out that GW spatial clustering could be used to detect the presence of Primordial Black Hole (PBH) merger events in a catalogue (Raccanelli et al. 2016;Scelfo et al. 2018Scelfo et al. , 2020;;Libanore et al. 2023).
Since they entail the study of GW sources as Large Scale Structure (LSS) tracers, a key ingredient in all these applications is a detailed understanding of the GW cosmological bias properties, defining how the sources cluster with respect to the underlying dark matter (DM) field.The bias  m of astrophysical merger events is linked with the underlying bias of both their host galaxies and DM halos.Based on how this link is modeled, a plethora of different prescriptions for  m has been proposed in the literature.They range from assuming linear dependency on the host galaxy bias  g (Oguri 2016), to redshiftdependent power laws (e.g., (Mukherjee et al. 2021b,a;Vĳaykumar et al. 2020;Libanore et al. 2022;Cañas Herrera et al. 2021)), to more refined parameterizations, accounting for the properties of the binary and/or its host galaxy, e.g., (Mukherjee & Silk 2020;Calore et al. 2020;Bellomo et al. 2022;Boco et al. 2019;Scelfo et al. 2020;Libanore et al. 2021;Capurri et al. 2021).
This significant variety in bias models and assumptions is an issue that can propagate to generate systematic and statistical uncertainties in cosmological studies of GW clustering.In this work, our main goal is to address and clarify this issue, focusing on a detailed study of the cosmological bias of black hole merger events.In order to capture the complexity of the binary formation process as accurately as possible, we rely on state-of-the-art results, obtained by combining binary black hole catalogues from population synthesis models and the outputs of hydrodynamical, cosmological simulations.More specifically, our dataset is based on the eagle cosmological, hydrodynamical simulation (Schaye et al. 2015) populated with black hole (BH) binaries through the mobse population synthesis code (Mapelli et al. 2017;Giacobbo et al. 2018) and evolved to get the final distribution of binary mergers, as described by Mapelli et al. (2017); Mapelli & Giacobbo (2018); Artale et al. (2019b); Artale et al. (2020) (hereafter, A19, A20).We use these data to build models and predictions for the bias of Binary Black Hole Mergers (BBHM), in different ways.First, we implement a suitable estimator of the merger power spectrum and directly measure the bias from the simulations; second, we use a Halo Occupation Distribution (HOD) approach to build a semi-analytical model of the bias (following Libanore et al. 2022), based on the measured abundance of mergers in the simulations, as a function of the galaxy stellar mass and star formation rate.After testing the consistency of the results for internal validation purposes, we also compare them with the GW bias model of Scelfo et al. (2020) and we make an accurate assessment of how the starting assumptions in the different methodologies impact the final bias prediction.Finally, we use the input simulations to train a neural network (NN) to map the relation between the number of GW sources and parameters describing the properties of their galaxy or DM halo hosts.The main purpose in this case is to understand whether the general models used in semi-analytical approaches capture all the relevant Physics or could be improved by including extra parameters.
The paper is organized as follows.Section 2 contains a summary description of the simulations used for our analysis; in Section 3 we briefly review some key aspects of the bias model that we consider in this work; in Section 4 we describe our bias estimator, discuss the formulation and calibration of semi-analytical models for GW bias and give details on the implementation and architecture of our neural network.Results and conclusions can be found in Sections 5 and 6.
The self-consistent algorithm we developed to estimate the (more than tree-level) power spectrum and analyse the clustering properties of the eagle+mobse data-set is publicly available at https: //github.com/MatPeron/powerbias.

MOCK DATA SET
Our study is performed on a simulated dataset of BBHM events obtained populating two of the eagle simulations (Schaye et al. 2015) with merger events drawn from the mobse population synthesis code (Giacobbo et al. 2018) as described in A19, A20.
We mainly use the eagle simulation with a box size of 100 Mpc and 1504 3 gas and dark matter particles initially, in which the cosmology is fixed to the Planck 2013 results (Ade et al. 2014).We consider 7 redshift snapshots, specifically at  ∈ {0, 1, 2, 3, 4, 5, 6}.For each we use: • the full DM particle snapshot (namely, the position of all the DM particles in the box) to estimate the matter density field in section 4.1; • the host galaxy catalogue (describing the physical properties of the galaxies that host the mergers), which is a key ingredient for the semi-analytical model described in section 4.2 and enters the implementation of the machine learning algorithm in section 4.3.
Then we consider a BBHM catalogue as well, obtained by post processing the previously described snapshots with the population synthesis code mobse.While mobse alone provides binary catalogues only based on the population synthesis models, the post-processing procedure combines it with the eagle stellar particles.This allows us to populate the galaxies inside the hydrodynamical simulation with BBHM (details in A19, A20); we refer to the output catalogues of this procedure as eagle+mobse.We use the BBHM data-set (which includes position and physical properties of all the merger events) to estimate the auto-and cross-power spectra of the tracer in section 4.1, and train and test our machine learning algorithm in section 4.3.Note that in the former case we marginalize over all the properties of the mergers (e.g., progenitor masses, spins etc), while for the latter we include them explicitly; In the first part of our analysis, we develop a self-consistent algorithm to estimate the power spectrum and clustering properties (i.e., the bias) in a mock data-set.Due to the huge computational cost required to run the full hydrodynamics, the eagle suite provides simulations with volumes of (100 Mpc) 3 , or smaller.This is limiting for our cosmological application, since the large linear scales are inaccessible from the simulation and non-linear effects become non-negligible, particularly at low redshift.As we discuss in section 3, this makes it necessary to go beyond tree-level and jointly measure the linear bias  1 together with higher order bias terms (see e.g., the review in Desjacques et al. 2018) We validate the results of our algorithm by applying it to the study of DM halo clustering in the Quijote simulation (Villaescusa-Navarro et al. 2020).Quijote provides N-body DM-only snapshots at  ∈ {0, 0.5, 1, 2, 3} inside a  = 1000 ℎ −1 Mpc comoving box.Using Quijote we have access to large, linear scales, which allows us to directly extract  1 .In this way we can compare the results of our  1 measurement at non-linear scales with tree-level power spectrum and standard bias estimators, checking the reliability of our pipeline.More detail on this validation test can be found in appendix C.
To calibrate the semi-analytical models in the second part of our analysis, we also use the eagle+mobse catalogues, in this case to extract the probability density functions (PDF) of galaxies and merger events, as a function of the stellar mass and star formation rate.
Finally, to train and test the neural network that reproduces the map between the number of mergers and properties of their host, we use the aforementioned  = 100 Mpc eagle+mobse data-set and a smaller realization based on the 25 Mpc side with initial 752 3 gas and dark matter particles.In the following, we refer to them as EAGLE100 and EAGLE25, respectively.

A SHORT OVERVIEW OF COSMOLOGICAL BIAS
Forthcoming GW detectors will access almost the full sky, but they will be characterized by a low sky-localization precision (see e.g., the latest results by Iacovelli et al. 2022).Therefore, a power spectrum analysis of GW clustering will be able to probe only large scale modes, which are in the linear regime.On such scales, for standard Gaussian initial conditions, the bias of a general DM tracer well described by a single constant coefficient,  1 .In this work, we want to directly measure  1 for GW sources from cosmological, hydrodynamical simulations, to provide a benchmark for future studies.
The simulation we analyse, however, is necessarily limited to small cosmological volumes and contains scales ranging from mildly to fully non-linear: the size of the simulated box sets  min = 10 −2 ℎMpc −1 , while  max is set by the Nyquist frequency, which in turn depends on how finely the mass distribution is sampled within the simulation.To avoid too large non-linearities, as we describe in the next section, we only take into account scales up to ∼ 0.3 ℎMpc −1 .At these scales, perturbative bias models become significantly more complex and must include several higher-order parameters, that we have to measure and marginalize over, if we want to estimate  1 . 1  In this work, we adopt the bias convention detailed in Desjacques et al. (2018), which collects earlier results from e.g.McDonald & Roy (2009); Chan et al. (2012).In this framework, the expansion of the halo density contrast  ℎ up to third order can be written as where, for convenience, we dropped the time  and scale  dependencies of the matter density field  and of the bias operators and parameters.The first three lines captures correspondingly increasing orders in the bias expansion and contain local ( =1,2,... ,   2 ), nonlocal ( ∇ 2  ,  td ) and stochastic (  ) contributions.This functional form actually holds not just for halos but for a generic LSS tracer field   , irrespective of its nature; for this reason in the following we replace ℎ → .
From Eq. (1), the tracer auto-power spectrum,   (), and its cross power spectrum with the matter field,  DM (), can be directly derived up to next-to-leading order in the expansion.These include terms up to fourth order in the  perturbative expansion, and can be written as 1 A more refined way to estimate  1 from mock data, achieving higher precision and without including higher order coefficients, would be to rely on a separate Universe approach (Mead & Peacock 2014;Wagner et al. 2015;Li et al. 2016) where   ,  13 and  22 are the tree-level and 1-loop contributions to the matter power spectrum.With respect to Eq. ( 1), we here introduced   2 =  2 /2.All the other terms define the tree-level and 1-loop contributions to the matter power spectrum; we explicitly define them in appendix A (for a complete derivation see Desjacques et al. (2018) and references therein).
We refer to the model in Eqs. ( 2) and ( 3) as EFTofLSS and we use it as starting point for the mock-data bias analysis in section 4.1.We also consider a simpler model, in which third order and scaledependent contributions are neglected.This is defined as This model is labelled LIMD (Local In Matter Density), as it accounts only for local contributions of the matter field to the power spectrum.In both Eqs. ( 2) and (4), the   term represents the first order stochastic contribution.This is scale independent and in first approximation corresponds to the shot noise n−1  , where n represents the tracer density (Dekel & Lahav 1999).To reduce the dimensionality of the problem, we neglect higher order, subdominat, stochastic terms by setting   =   2 =   2 = 0 and   = n−1  .We also rewrite   2 = −2( 1, − 1)/7 to further reduce the number free parameters in the analysis (Baldauf et al. 2012).The interested reader can find further details on this part of the analysis in appendix B.
Higher order correlators (e.g., the bispectrum Matarrese et al. 1997) are useful to measure  1, on large scales (as it is done e.g. in Jung, Gabriel et al. 2022), and could be useful in this case as well to break degeneracies between bias parameters at different orders.However, because of the range of scales that we can safely access from the simulation (10 −2 ℎMpc −1 ≲  ≲ 0.3 ℎMpc −1 ), they should be expanded beyond the tree-level as well.This would make their implementation particularly challenging; moreover, their application to real data analysis would be limited by the lack of sky-localization in GW detection.For these reason, we leave further investigation on this point for a follow-up work.

METHODOLOGY
The linear bias parameter of binary black hole mergers can be estimated from the data-sets presented in section 2 by determining the tracer power spectra from the catalogues and fitting them with the models discussed in section 3. We present our  1, estimates in section 4.1 and we test our results against semi-analytical models in section 4.2, as a counter-check of the reliability of this approach.
In this way, we are also able to understand which physical parameters are crucial to determine the clustering of the sources.Such analysis can then be compared with the outputs of the machine learning algorithm described in section 4.3, in which the relative importance of the parameters is tested inside the simulation itself.

Direct estimation on mock data
To measure the bias of any given tracer  in our mock data-sets, we first extract the tracer power spectrum (), by relying on a straightforward estimator that can be summarized as follows: (i) We estimate the number density field ñ() by weighing the particle positions with a Cloud-In-Cell (CIC) scheme.
(iii) We deconvolve the CIC filter function from the estimated δ( ) to recover the actual Fourier transform of the field, ( ).
(iv) We measure the auto-or cross-tracer-DM power spectrum estimator, P , by averaging   ( ) *  ( ), where ,  can be either dark matter(DM) or the tracer (), inside bins in Fourier space and estimate the 1 error bars on P () by taking the square root of the Gaussian variance of our results.
As a counter-check, we also implement a second estimator, based on the FKP prescription Feldman et al. (1994).Further details on the technical implementation of the two estimators are discussed in appendix B.
Having obtained estimates P () and PDM (), we then fit the models of Eqs. ( 2)-( 5) to our results via Markov Chain Monte Carlo.We assume a Gaussian likelihood for P() and consider the parameter vectors ( 1, ,  2, ,  td, ,  ∇ 2 , ) and  = ( 1, ,  2, ), for the EFTofLSS and LIMD models, respectively; we then take a uniform prior in the interval  1, ∈ [0, 10] for the linear bias parameter and uniform improper priors on all other parameters.We combine the auto-spectra corrected for the shot noise (that is, we subtract   = n−1  ) with the cross-spectra and truncate the analysis at a given scale,  NL (specified later), chosen to exclude the highly nonlinear regime where perturbation theory fails.Posterior sampling is performed with the emcee2 package (Foreman-Mackey et al. 2013), while for the computation of the bias operators   ,   and   ′ that appear in the EFTofLSS and LIMD models (see appendix A) we use velocileptors3 (Chen et al. 2020(Chen et al. , 2021)).To constrain  1, , we treat all the non-linear bias parameters as nuisance parameters and marginalize over them.Our final results for are maximum a posteriori estimates and we use the 18th and 84th percentiles to define our  1, credible interval (see Figs. C4 and C5).The reliability of our bias estimator is preliminarly tested on DM halo catalogues from the Quijote simulations, for which we can easily compare our results with theoretical expectations (see appendix C).

Semi-analytical model
To validate the results of the previous section, we also evaluate the bias with a different approach, which exploits information extracted from the simulations as an input to build a Halo Occupation Distribution model (HOD, see e.g., Peacock & Smith 2000;Berlind & Weinberg 2002;Karagiannis et al. 2018;Libanore et al. 2021) and connect the BBHM bias with the bias of their host galaxies and DM halos.
In Libanore et al. (2021), a similar technique was firstly applied to the study of merger bias.With respect to the usual HOD approach (Peacock & Smith 2000;Berlind & Weinberg 2002)aimed at modeling the bias of galaxies from the distribution of their host DM halos -this now requires an extra step to account for the binary black hole distribution as a function of the host galaxy features; all the relevant distributions are calibrated using our mock eagle+mobse dataset, also analysed in section 4.1.Compared to Libanore et al. (2021), we now adopt an updated simulation setup, using the  = 100 Mpc side eagle box (while Libanore et al. 2021 used data from the 25 Mpc box).We also use the full mock data-set (the same as in section 4.1), while in Libanore et al. (2021) specific selection effects due to the Einstein Telescope signal-to-noise ratio were implemented.
We now discuss the aforementioned procedure more in detail.First of all, we need to compute the bias of the host galaxies, which requires the following three ingredients: (i) the halo mass function ()/ ℎ in each snapshot, i.e., the number of halos per mass bin.We consider 6 halo mass bins with  ℎ uniformly spaced in log-space between 10 9  ⊙ and 4 × 10 14  ⊙ .
(ii) the linear halo bias  1,ℎ ( ℎ , ), for which we rely on the parametrization derived in (Tinker et al. 2010) using the peakbackground split approach.In appendix C -Figs.C3 and C2 -we compare this parametrization with the actual estimates of  1,ℎ from simulations at  = {0, 3}, obtained from the algorithm described in the previous section.
(iii) the galaxy HOD ⟨  ( * , , )| ℎ ⟩, namely the probability of finding   galaxies having stellar mass  * and star formation rate  inside a halo of mass  ℎ in the  snapshot.The number per bin is directly estimated from the simulations; we consider 15 stellar mass bins between  * = 10 7  ⊙ and 10 12  ⊙ and 15 star formation rate bins between  = 10 −5.5  ⊙ yr −1 and 10 2.5  ⊙ yr −1 .We note that in Libanore et al. (2021) a similar procedure was applied by fitting data with the Tinker et al. (2010) halo mass function and the eagle HOD analytical prescription (Artale et al. 2018).We verified that such prescription holds as a first-order approximation, but using the halo and galaxy PDFs directly measured on simulations is more reliable when considering bins with a small number of data.
By combining all these ingredients, we compute the galaxy linear bias per stellar mass and star formation rate bin at each redshift as where the numerical integration is performed between the values of   ℎ and   ℎ that define the halo mass bins.As mentioned, the semi-analytical model we build to describe the merger bias adds an extra layer to the analysis discussed up to this point, namely the galaxy occupation distribution ⟨  ()| * , ⟩, which we define as the probability of finding   mergers in a galaxy as a function of its stellar mass  * and star formation rate .We estimate ⟨  ()| * , ⟩ from the eagle+mobse simulation and finally model the linear bias of the merger as and also in this case the numerical integration is performed between the  * and  bins boundaries.The galaxy number density   ()/ *  is computed by dividing the number of galaxies in each bin   ( * , , ) by the width of the Δ * Δ bin and the volume of the simulation box,  3 .We checked that marginalizing over the star formation rate dependence before computing the galaxy and merger bias (i.e., dividing the galaxies and merger only in dependence of  ℎ ,  * ) has a small impact on the final results.Indeed, as we discuss in detail in section 4.  6) and ( 7) is negligible compared to the errorbars provided by the numerical estimate of  1,  in the previous section.The internal comparison between the semi-analytical linear bias of the mergers estimated from Eq. ( 7) and the results from section 4.1 is displayed in Fig. 1.

Machine learning methodology
The numerical simulations allow us to verify which physical parameters are the most relevant in the modelling of BBHM distribution and their spatial clustering.The former was previously studied in A19, which our results are in agreement with.To do so, we train a neural network to paint BBHM events on the simulations on a galaxy by galaxy basis depending on the physical properties of their host, in a similar fashion to how HOD are routinely used, following the approach established since Kamdar et al. (2016).For example, knowing the characteristics of a certain galaxy (e.g., stellar mass, star formation rate, metallicity...), the NN determines how many BBHM it is likely to contain.In this way, our NN can be used to study the host parameters relative importance for the presence of BBHM, as well as can be applied to quickly paint them on N-body or cosmological, hydrodynamical simulations.It is known (Kamdar et al. 2016;Agarwal et al. 2018;De Santi et al. 2022) how this procedure tends to successfully reproduce the average properties of the target parameters, but fails in reproducing their distributions.In our case, the BBHM have an intrinsic scatter due to both the probabilistic nature of the events, and the baryonic physics that is not easily captured by galaxy properties (Stiskalek et al. 2022).Therefore, we will compare the performance of a NN trained to assign to each galaxy a number of BBHM in a deterministic way, with the performance of a NN trained to assign to each galaxy the BBHM PDF, as done in Nix & Weigend (1994); Villaescusa-Navarro et al. (2021a); Stiskalek et al. (2022), based on their ability to reproduce the BBHM global PDF and power spectrum.In the following we will dub the two methods deterministic and statistical, respectively.To account for the strong dependence of the BBHM number on the galaxy mass  ★ ,5 we fit the pairs ( ★ , BBHM number) of the training set with a cubic polynomial and calculate the fractional residuals dividing the BBHM number data by the interpolant, in order to remove the main scaling.
The NN is then trained on the fractional residuals.This is not only motivated by speeding up the NN training, but also from the fact that the prediction from the simple cubic interpolation already leads to a reasonable approximation of the power spectrum (see section 5).
We randomly divide the catalogues from EAGLE100 into training (60%), validation (20%) and test (20%) sets.This leads to a spatially inconsistent distribution of the galaxies of each set; while we can assume that each set is a fair sample of the BBHM PDF, none of them can be used by itself to calculate a meaningful power spectrum.Thus, the EAGLE25 is used to test the power spectrum reconstruction instead.We test different parametrizations of the BBHM PDF and the PDF of their fractional residuals, finding that a Gaussian approximation of the fractional residuals offers the best results. 6oth the deterministic NN and the statistical NN share the same architecture for the input and hidden layers, while the output layers differs.In the former case we have a single node output with elu+1 activation both when we are fitting for the BBHM number and the fractional residuals.In the case of normal and lognormal distributions, we use a two node output: the first is interpreted as mean and the second as standard deviation of the variable (or logarithm of the variable, respectively).In the case of Poisson distribution, the single output is the mean of the distribution.In all output nodes an elu+1 activation function is used again, with the excepton of the mean of the lognormal distribution which has a linear activation.For the deterministic network, we use the mean square errors as loss function, while in the statistical network the likelihood of the parametrized distribution is used instead.As optimizer, we employ Adam (Kingma & Ba 2014) with a cyclical learning rate (Smith 2015), and we apply a dropout (Srivastava et al. 2014) to all hidden layers.We tune the number and hidden layers, nodes, dropout rate and activation func- Since our goal is to verify how all the information on the BBHM distribution is encoded in a few parameters, we use a rather agnostic approach in choosing our baseline feature set.This includes the comprehensive list of halo and galaxy parameters available in the eagle simulation, shown in Tab. 2, some of which we expect to have negligible effect on the BBHM distribution.

Bias estimates
In section 4.1 and 4.2, we have presented two independent tools to estimate the bias of BBHM from simulations, the former fully based on mock data-sets, while the latter requiring some analytical prescription to connect the mergers to the properties of their host galaxies.Tab. 1 collects our results on the direct estimates, while Fig. 1 compares them with the semi-analytical results (whose error bars are negligible with respect to the numerical model ones, as we described in section 4.2).Results are in good agreement between the two methods and with the close-to-linear trend previously obtained by Libanore et al. (2021Libanore et al. ( , 2022)).
For practical purposes and applications in follow-up analysis, we provide the following fit to the bias estimated from simulations: The parameters in the previous equation are calibrated on the LIMD results, since we find that the LIMD model is favored by the Bayes Information Criterion over the EFTofLSS one.

Comparison with the merger-rate-weighted bias
The main goal of our analysis is to understand how much the BBHM bias is affected by the properties of the hosts.This is a key information to understand whether the clustering can be used to disentangle different host populations (and possibly formation channels), as well  7) (red circles) compared with the merger rate-weighted biases computed via Eq.( 9) under different prescriptions.The blue thick line uses the same setup as S20, while dashed lines change the lower (orange) and upper (green) halo mass integration limits.We obtained the dotted red line by re-calibrating the star formation rate and abundance matching results on the eagle catalogues; the purple dotted line relies on a different time delay distribution.More detail in the text.
as to check how much our results depend on the specific astrophysical models assumed in the simulations.With this in mind and to verify the robustness of our results we re-estimated the bias following a completely independent approach with respect to the one adopted in previous section.This alternative methodology is based on the analysis depicted in Boco et al. (2019); Scelfo et al. (2020) (a similar approach is also followed by Bellomo et al. (2022)), where the BBHM bias is computed by weighting the bias of the hosts through the merger rate, which in turn relates with the host properties, modelled from theoretical assumptions and real data.
As we will discuss in detail at the end of this section, the parametrization used in Boco et al. (2019); Scelfo et al. (2020) (hereafter B19, S20) is different with respect to the assumptions made by the authors of A19 to simulate the BBHM catalogues we rely on.This propagates to the bias computation, leading to discrepancies between the results of the two procedures (particularly at high redshift).Indeed, the bias in S20, as well as the bias in Scelfo et al. (2018), is at face value consistent neither with each other, nor with our results.
To shed light on this possible issue, we focus on S20 and investigate the reason for the discrepancy.We model the bias according to the procedure described in S20, to which we refer the reader for details, while we limit ourselves here to briefly recap the necessary building blocks.In the same spirit as Eqs.( 6), ( 7), the bias is calculated convolving the halo mass function and bias with an appropriate kernel, describing the number distribution of BBHM, conditioned over the halo masses.
We model the chirp mass distribution of the merging black holes as a function of the mass of the progenitor star and the metallicity of the host galaxy, as it is done in B19.We adopt the initial mass function from Chabrier (2003) and we relate it to the black hole mass that the star forms as in Spera & Mapelli (2017).The galaxy distribution is modelled according to the galaxy star formation rate function (GSFRF) parametrized as a Schechter function in Mancuso et al. (2016).
All the quantities in the analysis are estimated at the binary forma- tion time: this requires us to introduce a reasonable distribution for the time delay   between this moment and the time of the merger.
The isolated binary evolution predicts the time delay distribution (  ) ∝  −1  , with a minimum time delay cutoff of  ,min = 50 Myr, which is used in n {B19, S20}.In general, the exact form depends on several physical processes such as mass transfer, kicks, stellar winds, and orbital separation, and early type galaxies present a flatter distribution compared to early type galaxies (Santoliquido et al. 2022).In our analysis, therefore, we test two independent cases: (  ) ∝  −1  (to be consistent with S20), and a uniform distribution (  ) ≡ const that recalls the one implemented in the simulations by A19 analysed in section 4.1.The specific choice of (  ) may largely affect the final results on the black hole merger rate (Santoliquido et al. 2021), but how this propagates to the clustering of events, has not been discussed yet.We anticipate that our analysis shows that different choices of (  ) have a limited impact on the bias (see Fig. 2).
In this bias model, the host halo and host galaxy mass are related to the star formation rate via the abundance matching relations (AMR) of Aversa et al. (2015).
Having said that, we estimate the black hole merger rate R  (, log 10 , M  ) as in {B19, S20} by marginalizing over the chirp mass M  of the events and the star formation rate  of the host galaxies.Finally, we estimate the merger bias through where the galaxy bias as a function of  is compute as in Aversa et al. (2015), i.e.,   (, ) =  1,ℎ ( ℎ , ), where the halo bias is the same we discussed in the previous section.Different prescriptions on the models and parameters in Eq. ( 9) can lead to very different bias results.We highlight this is Fig. 2, in which the blue thick line has been recomputed using the same prescriptions as S20, the red circles are the bias values estimated in section 4.2, and we obtained all the other lines by tuning Eq. ( 9).
First of all, results in S20 include detector selection effects via a signal-to-noise ratio distribution; to relate to our analysis (which involves the intrinsic signal of the full merger population), we need to drop this term.Then, in S20 the lower bound of the star formation rate integral is directly related to a lower bound on the mass of the host galaxy via the abundance matching relation.At every redshift this lower bound is chosen such that the integral converges.Instead, in eagle+mobse at every redshift only galaxies with stel-lar mass ≳ 10 7  ⊙ are included because of mass resolution limits (see, e.g., the discussion in Mapelli et al. 2018) If the same cut is imposed in S20 (orange dashed line in Fig. 2) the characteristic flattening S20 had at high redshift partially disappears, since the least biased objects are removed from the distribution.The opposite effect, i.e. a decrease in the bias due to the absence of the most biased objects, takes place when we remove host galaxies with mass ≳ 10 12  ⊙ .This is not due to a physically motivated effect, but to the fact that the eagle simulation is quite limited in size, and does not contain larger objects.To circumvent this limitation, larger cosmological, hydrodynamical simulations or self-consistent extrapolation to higher massess are required.Finally, we have to take into account that both the galaxy star formation rate function of Mancuso et al. (2016) and the abundance matching relations provided in Aversa et al. (2015) do not fully agree with the corresponding quantities extracted from eagle.Re-calibrating these functions on the eagle catalogues, the bias estimated from Eq. ( 9) turns out to be in broad agreement with our HOD results from Eq. ( 7) (red dotted line in Fig. 2).As anticipated, both to test how strong the dependence of the bias on the time delay distribution is, and to check consistency with the eagle+mobse simulation, we repeat the calculation using a different (  ).We pick a flat distribution with a minimum time delay of  ,min = 10 Myr from the distributions considered in Santoliquido et al. (2021), obtaining the purple dotted line in Fig. 2. The two bias distributions ((  ) ∝   , and (  ) ≡const) braket the one used in eagle+mobse.The difference between the two resulting bias is negligible; therefore, as we anticipated, we conclude that the BBHM clustering is only marginally affected by the time delay specific choice.

Validation of the NN and its architecture
We start by discussing the necessity to move past the description of the average quantity in the context of ML-based painting of simulations, as also pointed out in Stiskalek et al. (2022).As anticipated, a limited amount of physical properties of the host halo and, in our case, of the host galaxy, fail to describe the single realization of the painted label.Instead, what they characterize is the label probability density function: this is not at all dissimilar to the standard HOD approach.Employing the NN to learn the PDF, rather than the single realization of the label, not only allows us to better reconstruct its variance, which is often underestimated (Kamdar et al. 2016;Agarwal et al. 2018;De Santi et al. 2022), but also to better recover properties related to the average label distribution.First, we can test the NN ability to reproduce a BBHM distribution that is statistically equivalent to the test set one, dubbed true in the following.In the two panels of Fig. 3 we compare how well the two architectures described in section 4.3 reproduce the true distribution as a function of the host galaxy mass.The host galaxy is singled out among all features because it correlates the most with the target BBHM distribution (see the discussion in the following and Fig. 6).In each host galaxy mass bin (we use 50 logarithmically spaced bins), the average number of events is reproduced successfully both using the deterministic and the statistical network.However, in most bins, the former generates a distribution which underestimates the true scatter in the bin.In practice, galaxies that host an extreme (either high or low) number of events are more likely to be populated by the deterministic NN with a number of BBHM closer to the average.The statistical network, instead, increases the variance of the proposal distribution from which the predictions are drawn when the population of events falls in a broad spectrum.This means that, on a galaxy-by-galaxy basis, the prediction of this NN might differ substantially from the true one, but in such a way that the whole distribution is more closely followed.The same behaviour appears in the global distribution of BBHM per galaxy shown in Fig. 4. To provide a quantitative metric of this intuition we use the -sample Anderson Darling test (Scholz & Stephens 1987), whose score (ADS) quantifies how close the distributions underlying two samples are to each other.In other words, the ADS is related to the likelihood that the two samples are drawn from the same distribution.In the following we will always calculate the ADS score of a sample against the true one.While the true sample and the one predicted from the deterministic network are fixed, the samples generated by the statistical network change depending on the seed.While the general trends of all the statistical-network-generated samples are consistent between runs, their ADS scores varies.To produce consistent results, we produce 100 realizations, all based on the same galaxy test set, and we concatenate them to calculate the ADS.For the deterministic network we find ADS = 80, making the significance of the two sets following the same distribution vanishingly small.On the other hand, for the statistical network we calculate ADS = 1.2, thus we cannot discern it from the true distribution at <95% C.L.
Second, we are interested in analysing the difference in performance at the power spectrum level.In Fig. 5 we show, as a function of wavemodes, the fractional difference between the merger power spectrum extracted from EAGLE25 (our test sample) and the merger power spectrum of the same simulation painted with the different NN architectures.We highlight how (due to the smaller box size) the scales represented here only marginally coincide with those used in our bias analysis, and with the scales relevant for Cosmology.However, in this context the power spectrum ought to be interpreted as a generic signal that we can use to test our painting scheme, without venturing into any claim about what cosmological information could be extracted from it.While the deterministic NN architecture already achieves a ≤ 5% agreement, which on large scales is negligible with respect to sample variance, the statistical NN architecture leads to a factor ≈ 2 improvement.
Having demonstrated the better performance of the statistical network, we will from now on only use this architecture.

Feature importance
We test the relative importance of features by re-training the NN with different sets of features, comparing each network performance.Since most of the features are strongly correlated, this is a safer procedure than simply calculating the permutation importance, even if more costly.In the remainder we use the Spearman's rank correlation coefficient  as a measure of correlation (see e.g.Daniel 1990).We notice that random-forest-based conditional permutation importance (Strobl et al. 2008) would be a valid alternative, that we will consider in future work, to assess every feature separately.Since, on the contrary to the the previous section, the loss function employed by the network is now fixed, we can use as our performance metric not only the ADS, but also the likelihood used as loss function itself.
While the ADS is sensitive to the global distribution (regardless of, e.g., host galaxy mass, SFR, metallicity, etc.), the likelihood instead is local in feature space, providing complementary information.We start by assessing how important the information about the progenitors is.Training the statistical NN on the parameters pertaining to the snapshot where the merger happen, we calculate ADS = 46.This is to be expected (Scelfo et al. 2020;Bellomo et al. 2022) as what is relevant is the physical state of the host at the time of the binary formation.After the formation, the evolution of the binary, until its eventual merging is only slightly influenced by changes in the surrounding environment.
Having assessed that information about the host progenitors is crucial, we turn our attention to 5 different (groups of) parameters:  ★ ,   , , ,  max .In each group, say  ★ , we include the value of that parameter for the host, and for the progenitors of the host at z=1, 2, 3, say ( ★ ( = 0),  ★ ( = 1),  ★ ( = 2),  ★ ( = 3)).Different determinations of the halo mass are not included as   strongly correlates with them at each redshift ( ≥ 0.95), acting as a reliable proxy.The same goes for the parametes related to the kinetic state of the galaxy and  max (|| ≥ 0.92).An exception is made for  and  which are both included despite their correlation ( ≥ 0.84), because of their physical importance and relevance usually addressed to them in the literature.We train 5 networks, each of which only receives as input one single group of variables, and select the best according to ADS and likelihood.In all cases investigated here the two metrics produced the same rankings.The most relevant feature is fixed and 4 networks are trained receiving in input that one and another feature, and the process is repeated.
The progression is shown in Fig. 6.The host galaxy mass is the most relevant feature to determine the BBHM number distribution (similar to what discussed in A19 and Santoliquido et al. 2021), together with the mass of its first-progenitors at different snapshots.We find that the mass of the host halo and  max , which we remind is a proxy of the kinetic state of the galaxy, are also extremely relevant.Deeming this worth of further investigation we test if this is driven by   ,  max and  ★ acting in concert as a proxy of the relevant information contained in  and .If this were the case, a network trained with , Z, and  ★ would perform better than one trained with only  ★ ,   , and  max .However, as shown in Fig. 6, this does not hold.Therefore, we confirm that the large-scale environment in which the host galaxy forms-meaning its position in the cosmic web, if it is a central or satellite galaxy, etc.-drives the amount of BBHM.This is akin to how the large-scale environment of an halo influences the number and type of galaxies it hosts (Artale et al. 2018).
and  rank as the least important variables, and the addition of one over the other brings very little additional information.This is not to say that they are physically irrelevant, but rather that since they are extremely correlated to each other, and quite correlated to  ★ they are partially redundant.In fact, this is specifically exploited in e.g.S20 through the use of the AMR.
As for the power spectrum, we find it to be a quite forgiving observable.In fact, even assigning to each galaxy the number of BBHM using the cubic polynomial interpolant (i.e.setting all the fractional residual to 1), the error on the power spectrum is ≲ 10%, while training the statistical network just with the host galaxy mass as input, the deviation is ≲ 4% (Fig. 5).Training the network on the full set of parameters does not lead to an improvement.Our analysis highlights a difference in the relative importance of the additional parameters to fully characterize the power spectrum and BBHM distribution.The existence of a relation that adequately links the sole mass of the halo to the BBHM power spectrum, while incredibly useful to populate relatively cheap N-body simulations, is not helpful in understanding the underlying physics.If one's goal is the latter, and they wish to do so by building an HOD, the use of the BBHM distribution cannot be bypassed.In this sense, our analysis informs how bias models shall be built.

Apply the NN to paint BBHM
The machinery demonstrated in the previous sections can in principle be readily used to paint hydrodynamic simulations.The resulting BBHM catalogues are statistically equivalent to those obtained with the procedure described in section 2, but can be produced in a few tens of seconds using a single GPU.However, a few caveats are due.By virtue of a separation of scale argument, we can expect that the formation of BH binaries and their subsequent evolution should mostly be influenced by their galactic environment and not from the cosmological background.Thus, changes in the cosmological parameters should have negligible effect on the BBHM content of any given galaxy, so we can expect that the NN can be safely generalized to other simulations, even when they assume (at least slightly) different cosmologies.However, this hypothesis ought to be verified prior to any application.Moreover, it is known (e.g.Villaescusa-Navarro et al. 2021b) how different hydrodynamical models produce qualitatively different simulations, and as such the network might fail to populate simulations which are not created with the same algorithm, as seen in e.g.Villaescusa-Navarro et al. (2021a).

CONCLUSIONS
With the fourth LIGO-Virgo-KAGRA observational campaign approaching (Abbott et al. 2020) and 3rd generation detectors planned for the next decade (Kalogera et al. 2021;Branchesi et al. 2023), it is becoming more and more important to build accurate models to understand the properties of GW sources.These include their redshift distribution and clustering properties, which, in the case of astrophysical binary mergers, ultimately depend on the way their progenitors form according to the properties of their host halo and galaxy.
In this work, we focused on the study of the clustering of astrophysical binary black hole mergers (BBHMs).We relied on cosmological, hydrodynamical simulations (Artale et al. 2019b;Artale et al. 2020) to estimate the cosmological linear bias of such events at different redshifts.On one side, we developed a fully-numerical estimator (publicly available at: https://github.com/MatPeron/powerbias)which can be applied to mock or real GW catalogues.Because of the small scales accessible by the size of the simulation adopted, we needed to account for tree-level and 1-loop contributions to the power spectrum and to marginalized over nuisance, higher order bias parameters.Our algorithm was tested by comparing the theoretical DM power spectrum and halo bias with their estimates on larger N-body simulations, in which linear scales can be accessed and leading order truncation can be performed.On the other side, we implemented a semi-analytical HOD-based approach (Libanore et al. 2021(Libanore et al. , 2022) ) to estimate the BBHM bias as a function of the properties of their host galaxies.The method was calibrated on the same simulations as the fully-numerical estimator and the two approaches show internal consistency.We also compared our results with an independent semi-analytical model (Scelfo et al. 2020) based on merger rate-weighting of the host galaxy bias.Our finding is that the different approaches are consistent, provided that the underlying assumptions on the astrophysical modelling and the properties of the host halos and galaxies are in agreement.While this seems to be obvious, the important message is that, if we want to use the GW bias as a tool to perform precision cosmological studies, we first of all need to understand how to properly account for uncertainties in the formation channels and host properties.As a first step in this direction, we implemented a neural network aimed at studying the relative relevance of the host parameters with respect to the BBHM distribution and bias.While for the former it is important to account for the host galaxy mass, as well as the host halo mass and the galaxy properties at the time of the binary formation, the latter (which in the end depends on how the power spectrum behaves) seems to suppress the relevance of parameters other than the host galaxy or host halo mass.Our neural network can also be applied to paint BBHM on N-body or cosmological, hydrodynamical simulations quickly and with a low computational cost.To improve its reliability, the neural network should be tested on simulations accounting for different cosmologies and astrophysical models, which are not available yet in the context of GW sources.
To conclude, with our work we compared the different GW bias prescriptions available on the market and we provided bias estimates based on the state-of-the-art situation in BBHM simulations.Our results in Tab. 1 and the fit in Eq. ( 8) can be used as priors to perform cosmological studies.Improving these findings will be more and more important as the number of GW observations will increase.Once observed GW catalogues will allow to perform statistical studies up to cosmological distances, the clustering will be the perfect tool to perform a wide variety of tests, from the progenitor formation channels (here we only dealt with astrophysical binary black holes, while primordial binary black holes, if they exist, should have a completely different bias (Raccanelli et al. 2016;Scelfo et al. 2020;Libanore et al. 2023)), to modified gravity models (Scelfo et al. 2023;Balaudo et al. 2022) or redshift-space distortion effects (Mukherjee & Wandelt 2018;Mukherjee et al. 2021a).To achieve this goal, the next challenges will be to understand how to access larger scales with the mock catalogues and how to include selection and observational effects, so to realize mock data-sets as close as possible to the real ones.Once these will be available, it will be possible to updated the tools we realized in order to make them applicable to real estimates.

DATA AVAILABILITY
The self-consistent algorithm developed to estimate the (more than tree-level) power spectrum and analyse the clustering properties of eagle+mobse data-set is publicly available at https://github.com/MatPeron/powerbias.The trained neural network models are available upon reasonable request.mobse is an open source code available at https://gitlab.com/micmap/mobse_open.

APPENDIX A: 1-LOOP BIAS OPERATORS
In Eqs. ( 2), (3) we define the auto-and cross-power spectrum expansions adopted in the EFTofLSS estimator.The tree-level and 1-loop contributions that enter such equations are defined as in Desjacques et al. (2018) and presented here for ease of reference: where

APPENDIX B: POWER SPECTRUM ESTIMATOR
The snapshots and catalogues we used in our analysis contain   point-like particles (DM or tracer) distributed inside a cubic box of size .Each particle is localized within the box by  = (, , ), where the origin (0, 0, 0) is at one of the box vertices.We define the density as () ≡    =1   ( −   ) and the density contrast as where ρ =  n =   / 3 is the spatially averaged density within the box.On a theoretical level, the power spectrum is then defined by taking the Fourier transform ( ), computing its square modulus |( )| 2 = ( ) * ( ), and averaging the resulting modes inside  bins.This procedure assumes that the fair sample hypothesis holds.
Since we are dealing with discrete data-sets, the density field is unevenly sampled in space.Therefore, before applying the fast Fourier transform (FFT) we need to smear () onto a regular grid within the box by convolving it with a proper filter (Tegmark et al. 1998).
In this way, we interpolate δ(); to recover ( ), one then has to deconvolve the filter, which in Fourier space is simply done by dividing the Fourier transform of the filter to δ( ).
With this basic recipe in mind, we can discuss in more detail the steps we take in order to arrive to an estimator P().
(i) The filter function  () we adopt is the standard Cloud-In-Cell (CIC) weighting scheme (see equation A.2 of Chaniotis & Poulikakos 2004), which smears a single particle across a cube of eight neighbouring cells.Chosen a number of grid cells per side   , we use it to compute the weights  (  ) for each particle at each grid's node position   .
The weights are then summed across the entire catalogue to estimate the number density field ñ(  ) = ρ(  )/.
(iv) The Fourier transform of the CIC filter function  () is computed analytically and divided to δ( ) in order to recover the actual Fourier transform of the field, ( ).
(v) The proper normalizations given by the FFT and the discretization of the Dirac deltas are applied to ( ).
(vi) We compute P() by averaging |( )| 2 inside bins in Fourier space.Bins are defined inside the interval [ f ,  Nyq ) and are  f -wide, where  f = 2/ is the fundamental mode, while  Nyq =   / is the Nyquist mode, i.e., the maximum mode supported by the grid.
(vii) We compute error bars for the estimator by taking the square root of the Gaussian variance, √︁ Var( P())|  = √︁ 2/  P(), where   is the number of independent Fourier modes within the bin (Scoccimarro et al. 1999).We neglect non-Gaussian contributions related with the four-point function of ( ) (namely, the trispectrum) since they are only significant at even smaller scales than the ones we consider (Mohammed et al. 2017).
Since we are computing P() from a discrete collection of particles, shot-noise arises from their auto-correlations.We can assume that the particles are Poisson sampled from the underlying density field; under this assumption, the shot noise contribution is equal to n−1 (see appendix A of Feldman et al. 1994).In our analysis, we subtract it before making any measurement of the bias parameters.
Another consideration to make is that by choosing to smear the particles onto a regular grid, we are effectively shifting all the power associated to the modes  >  Nyq inside the range of modes supported by the grid.This effect is known as aliasing and it causes a spurious increase in power at modes  ≳  Nyq /2 (Colombi et al. 2009).Aliasing can be alleviated (Sefusatti et al. 2016) or outright removed under certain assumptions on the shape of () (Jing 2005).We implemented the interlacing approach from Sefusatti et al. (2016) to deal with this effect (see Fig. C1).It is important to note that interlacing, while useful if seeking accurate estimation from large scale simulations, is not necessary for small-side simulations such as eagle.In fact, the range of modes where the perturbative prescription of Eq. ( 1) is valid is very close to the fundamental mode, where aliasing has no effect.For this reason, we only use interlacing to analyse Quijote data in the validation test performed in appendix C.
The procedure we depicted up to now is optimal to estimate the power spectrum on ideal, mock data-sets; with the prospect of applying the same methodology to real data, however, selection effects should be taken into account.For this reason, in our algorithm we also included the Feldman-Kaiser-Peacock (FKP) estimator (Feldman et al. 1994), in which the data-set is compared with a synthetic random catalogue, in order to optimize the variance.In our case, this catalogue is generated by uniform sampling a set of particles inside the same box in which the simulation has been run.The implementation follows the same general scheme presented above, with some extra-steps related with the synthetic catalogue and the optimal weighting that characterizes the FKP estimator.In particular, step (i) in our algorithm is applied twice, once on the mock catalogue and another on the synthetic, random catalogue.Then, before step (ii), we define the optimal weights () = [ n() 0 () + 1] −1 from a initial guess of the power spectrum  0 () = const.These are used to compute the weighted number density field  (): where the  subscript refers to the synthetic random catalogue, while  ≲ 0.1 is the ratio between the density of the data catalogue and the synthetic, random one.The algorithm then proceeds using  () in place of ().The presence of  0 ( ) in () implies that the weights can be optimized iteratively by using the resulting PFKP ().Therefore, by starting from  0 () = const, we get a first estimate PFKP () which is used to improve the estimate of the optimal weights bin-by-bin; for each , we then compute a new set of weights () using  0 () = PFKP (), and obtain a new estimate PFKP ().This is iterated up to some convergence criterion.

APPENDIX C: VALIDATION TESTS
As we anticipated in section 2, the small volume of the eagle simulation (Schaye et al. 2015) forces us to estimate the power spectrum and bias on small, non-linear scales.Theoretical models in this regime are complicated and depend on large sets of parameters.Before using our algorithm on these scales, therefore, we tested it in a "simpler" configuration, namely on the large, linear scales provided by the Quijote simulation (Villaescusa-Navarro et al. 2020).
We now describe the validation tests we performed both on the power spectrum and bias estimators.

C1 Power spectrum estimator
We validate our power spectrum estimator on the dark matter field from the Quijote simulations (Villaescusa-Navarro et al. 2020).We consider five different realizations with the same cosmological parameters (the fiducial ones).For each, we compute the DM power spectrum for every available redshift snapshot, namely  ∈ {0, 0.5, 1, 2, 3}, interpolating its density contrast field onto a grid of   = 512 cells per side.The resulting 25 power spectra are then averaged across the five realization, obtaining five estimated power spectra, one for each redshift.We then use CAMB7 (Lewis et al. 2000;Howlett et al. 2012) to compute the theoretical non-linear matter power spectrum given by the halofit model (Mead et al. 2021) for the given cosmology at each redshift.
Figure C1 compares the estimated and theoretical power spectra at  = 0. Up to  ∼ 0.2 ℎ Mpc −1 , the power spectrum measured from the simulations agrees within 5% with the theoretical one.At smaller scales a systematic shift in power is observed, which can be explained by aliasing (Sefusatti et al. 2016;Jing 2005).This behavior is observed at all redshifts.Using the interlacing algorithm introduced in Sefusatti et al. (2016) improves the estimate considerably, pushing the agreement up to  ∼ 0.4 − 0.5 ℎ Mpc −1 .

C2 Bias estimator
We test the MCMC estimator described in section 4.1 on DM halos from Quijote simulations (Villaescusa-Navarro et al. 2020).
We consider the same five realizations used in the previous section, sorting the halos into six mass bins uniform in log-space.For each mass bin and redshift, we compute the interlaced power spectrum using a grid of   = 512 cells per side to interpolate the density field.We obtain a total of 118 power spectra (we are unable to compute 32 of them due to the absence of halos at high redshifts and masses).Again, we average the different realizations, obtaining 24 estimated power spectra, one for each redshift and mass bin.
Finally, we obtain estimates for the linear bias parameter  1,ℎ by fitting the models of Eqs. ( 2)-(5) to the estimated power spectra.As in section 4.1, we use a Gaussian likelihood and uniform priors, taking care in setting  1,ℎ ∈ [0, 50] to provide enough parameter space for the chains to explore the posterior.We combine auto and cross-power spectra corrected for the shot noise and truncated at a scale  NL , which we set at 90% the non-linear threshold scale for matter described in Smith et al. (2003); Fonseca et al. (2019), that is  NL = 0.9 • 0.2( + 1) 2/(2+  ) ℎ Mpc −1 with   = 0.9624.
We compare these estimates with the theoretical expectations of  1,ℎ from the Peak-Background Split (PBS) method, assuming the theoretical halo mass function distribution described by Tinker et al. (2010) and computed through the hmf python package.The PBS gives us  1,ℎ ( ℎ ) as a function of the halo mass  ℎ ; we use the halo mass function to weigh it inside each mass bin, so to find a single value per bin.Figs.C2 and C3 compare the EFTofLSS and LIMD halo bias estimates with the PBS expectation values weighted in each mass bin.Using the Tinker halo mass function to weigh the PBS bias gives results that are in overall agreement with the empirical estimations at linear scales, with some small deviations due to differences between the actual halo mass function of the simulation compared to the Tinker prescription.
As a further test, we estimate the linear bias truncating the power spectra at  ∼ 0.08 ℎ Mpc −1 .This leaves us only with the linear scales, where the bias expansions can be truncated at leading order:  1,ℎ can easily be obtained by reversing these equations.These estimates are then averaged together across the various -modes to obtain a final estimate of  1,ℎ , with 1 errorbars given by their standard deviation.This empirical bias estimate is compared with the MCMC EFTofLSS and LIMD results in Figs.C2 and C3.MCMC estimates obtained with the LIMD model agree really well with the empirical ones, with some noise at high redshifts explainable by the small amount of halos in this range.On the other hand, EFTofLSS results while following the same trend as the other estimator, are noisier, which may be explained by the higher dimensionality of the model itself.It is important to note that, while these results favor the LIMD model over the EFTofLSS one, the binary black hole merger catalogues on which we recover the final result of this paper contain far more objects than the dark matter halos used here, meaning that the measured power spectra will be less noisy.Indeed, section 4.1 shows that LIMD and EFTofLSS are in perfect agreement when it comes to binary black hole mergers.This paper has been typeset from a T E X/L A T E X file prepared by the author.
. 1, which respectively show and plot the values of  1, and the corresponding errors in the chosen redshift bins.

Figure 1 .
Figure 1.Bias estimates computed with the different methods described in the text: MCMC results from section 4.1 and Tab. 1 in the LIMD (black circles) and EFTofLSS (black triangles) cases, semi-analitycal HOD model from section 4.2 (red circles) and updated merger rate-weighted results from section 5.2 (S20, blue line).
Figure 2. HOD bias values from Eq. (7) (red circles) compared with the merger rate-weighted biases computed via Eq.(9) under different prescriptions.The blue thick line uses the same setup as S20, while dashed lines change the lower (orange) and upper (green) halo mass integration limits.We obtained the dotted red line by re-calibrating the star formation rate and abundance matching results on the eagle catalogues; the purple dotted line relies on a different time delay distribution.More detail in the text.

Figure 3 .
Figure 3. BBHM number distribution as a function of the host galaxy mass.The black line describes the "true" distribution from the test set, while the blue line describes the NN result.The top panel refers to the deterministic architecture, while the bottom panel relates with the PDF (statistical) one.

Figure 4 .
Figure 4. BBHM number PDF per galaxy, marginalized over all the host galaxy and halo properties.The deterministic architecture fails in recovering the "true" distribution at the low and high number ends of the distribution.

Figure 5 .
Figure5.Fractional difference between the N-body power spectrum and different population methods.A simple power-law (blue squares) is already capturing the power spectrum at 10% level.Using the statistical NN approach (red dots) improves the emulation compared to methods not describing the spread of the population (orange x's).Basing the training on just the host halo mass (green crosses) does not compromise the results.

Figure 6 .
Figure 6.Anderson-Darling score and log-likelihood for different (incremental) sets of features.Each set includes all labels on its left plus the labels indicated on the axis.

Figure C2 .
Figure C2.Halo bias at  = 0 and  = 3 estimated with MCMC based on EFTofLSS (black circles), compared with the PBS expectation value weighted by the halo mass function (blue crosses) and the empirical linear bias estimator (red diamonds).

Figure C3 .
Figure C3.Halo bias at  = 0 and  = 3 estimated with MCMC based on LIMD (black circles), compared with the PBS expectation value weighted by the halo mass function (blue crosses) and the empirical linear bias estimator (red diamonds).

Figure C4 .
Figure C4.Posterior and marginalized distributions of  1 and  2 obtained through a joint Markov Chain Monte Carlo sampling of the auto power spectrum of binary black hole mergers, and cross power spectrum between mergers and matter computed at redshift  = 0.The red lines indicate the maximum a posteriori estimate of the parameters, while the red diamond indicates the initial guess.

Table 2 .
Parameters in the eagle catalogue used as input features for theNN (McAlpine et al. 2016).If the parameter is used in the text, the string used in the eagle database is paired with the symbol we use.The merger tree info are the value of said parameters for the object progenitors (if present) along the main branch within the 16 snapshots preceding the merger.