Improving and extending non-Poissonian distributions for satellite galaxies sampling in HOD: applications to eBOSS ELGs

Halo Occupation Distribution (HOD) models help us to connect observations and theory, by assigning galaxies to dark matter haloes. In this work we study one of the components of HOD models: the probability distribution function (PDF), which is used to assign a discrete number of galaxies to a halo, given a mean number of galaxies. For satellite galaxies, the most commonly used PDF is a Poisson Distribution. PDFs with super-Poisson variances have also been studied, allowing for continuous values of variances. This has not been the case for sub-Poisson variances, for which only the Nearest Integer distribution, with a single variance, has been used in the past. In this work we propose a distribution based on the binomial one, which provides continuous sub-Poisson variances. We have generated mock galaxy catalogues from two dark-matter only simulations, UNIT and OUTERIM, with HOD models assuming different PDFs. We show that the variance of the PDF for satellite galaxies affects the one-halo term of the projected correlation function, and the Count-In-Cells (CIC) one point statistics. We fit the clustering of eBOSS Emission Line Galaxies, finding a preference for a sub-poissonian PDF, when we only vary the parameter controlling the PDF variance and the fraction of satellites. Using a mock catalogue as a reference, we have also included both the clustering and CIC to constrain the parameters of the HOD model. CIC can provide strong constraints to the PDF variance of satellite galaxies.


INTRODUCTION
The nature of dark matter and dark energy are two of the greatest mysteries in cosmology.The large scale structure of the Universe is a powerful tool to investigate these two components, ant it has gained significant attention in the last 20 years (Colless et al. 2001;Alam et al. 2021;Abbott et al. 2022).During this period of time, large scale surveys have increased significantly the volume of the Universe that has been mapped.Cosmological simulations must adapt to this growing volumes (Heitmann et al. 2016(Heitmann et al. , 2019;;Chuang et al. 2019;Ishiyama et al. 2021;Maksimova et al. 2021).
In general, the largest cosmological simulations only include dark matter particles information and gravity-only evolution (e.g.Heitmann et al. 2019;Maksimova et al. 2021), whereas other smaller simulations may include gas matter particles with hydrodynamical evolution (gas cooling, AGN feedback, etc.) as well (e.g.Pillepich et al. 2018;Schaye et al. 2023).Thus, for the large dark-matter only simulations a connection between dark matter and tracers such as galaxies needs to be applied (Somerville et al. 2001;Wechsler & Tinker 2018).Several methods have been developed for this purpose, such as Semi-Analytical Models, which encapsulate the processes that we understand to be most important for the formation and evolution of galaxies into coupled differential equations (e.g.Baugh 2006).A simpler model consists on Sub-halo Abundance Matching ★ E-mail: mnras.org.uk(KTS) (SHAM), which links observed galaxies with simulated haloes, relying in a monotonic correspondence between halo mass functions and luminosity functions, with a certain level of scatter (e.g.Favole et al. 2016;Yu et al. 2023).
An other relatively simple approach is the Halo Occupation Distribution (HOD) modelling (e.g.Benson et al. 2000;Seljak 2000;Cooray & Sheth 2002;Berlind et al. 2003;Zheng et al. 2005;Hearin et al. 2016).This can be applied to large simulations without the need to be complete in the subhalo mass function and without requiring resolving too much internal properties of the haloes.The parameter range can be explored fast, and they are typically used for very large cosmological simulations (Zehavi et al. 2005;Manera et al. 2013;Carretero et al. 2015;Avila et al. 2020;Yuan et al. 2023;Rocher et al. 2023).
One key ingredient of HOD models is the mean HOD, which controls the expected number of galaxies to be placed in a halo of a given mass.Another key ingredient is the probability distribution function (PDF) used to sample that mean HOD.The Poisson distribution is the most commonly employed approach in the literature to place satellite galaxies within haloes (e.g.Zehavi et al. 2005;Carretero et al. 2015;Avila et al. 2018;Rocher et al. 2023).However, PDF with sub-Poisson and super-Poisson variances have been used in the literature (Jiménez et al. 2019;Avila et al. 2020).In the case of super-Poisson variances, the negative binomial distribution has been used, which can parametrize the variance continuously.In the case of sub-Poisson variances, the Nearest Integer function gives a single variance, which is the smallest possible one.Nonetheless, there remains a range of variances between the Nearest Integer and Poisson distributions that has yet to be explored.
In this work we propose and validate a full prescription for the HOD PDF covering all the range in variances.The minimum variance is given by the Nearest Integer, then we cover the sub-Poissonian range with binomial distribution and an extension that we propose in Section 4, we also include trivially the Poisson distribution and we cover the super-Poissonian space with an improved prescription of the Negative Binomial distribution.Ongoing galaxy surveys such us Euclid (Laureĳs et al. 2011) or DESI (Collaboration et al. 2016) will be able to constrain the satellite PDF for different tracers.On the one hand, these constraints can give a unique insight to the physics of galaxy formation that rule the galaxy-halo relation.On the other hand, we will show that the PDF variance can affect severely several clustering statistics (see also Avila et al. 2020).Hence, if this is not accounted for, it may bias our cosmological interpretation when studying galaxy clustering (see appendix B of Avila et al. 2020).
In general, two-point correlation measurements are usually used for constraining HOD parameters, but in the scientific community an interest started for searching alternative statistics to constrain the HOD such as 2k-Nearest Neighbours Yuan et al. (2023), which have demonstrated to be at least as good as two-point statistics, providing complementary information and reducing possible degeneracies between parameters.
In this work, we introduce Count-in-Cells (CIC) as an alternative statistics for constraining HOD and we will show that it has a very promising constraining power.
The dark matter simulations and observational data from eBOSS used in this work are described in Section 2. Section 3 describes the components of the HOD model.In Section 4, we introduce the binomial and extended binomial distributions and we evaluate their performance in Section 5.In Section 6 we investigate the impact the probability distribution function (PDF) has on two-point statistics and we also fit the galaxy catalogues to eBOSS data in Subsection 6.2.In Section 7 we evaluate the constraining possibilities of Counts-in-Cells and we use it to reproduce a model catalogue of galaxies in Subsection 7.1.Finally we summarize the results in Section 8.

SIMULATIONS AND OBSERVATIONS
In this work we make use of two dark matter-only simulations, unit and outerrim ( § 2.1).The cosmology and the properties of both simulations are summarized in Table 1.The clustering of model galaxies obtained from these simulations is later compared with SDSS-IV/eBOSS data ( § 2.2).

Dark Matter Simulations
The unit simulations (or UNITsim, Chuang et al. 2019) are full Nbody dark matter-only simulations.Their initial conditions are set using the Zel'dovich Approximation (Zel'Dovich 1970) provided by the FastPM code (Feng et al. 2016).The UNITsims were implemented with the fixed & paired technique, where the Fourier-mode amplitudes are fixed to the ensemble-averaged power spectrum (fixed) and two pairs of simulations were run with a  offset on the phases within each pair (paired) (Angulo & Pontzen 2016).Both techniques reduce the wavemodes variance considerably, raising the effective volume of the simulations up to  eff = 150 h −1 Mpc.Nevertheless, we only use one of the four unit simulations in this work.The evolution of particles up to redshift  = 0 is done by L-Gadget, a version unit outerrim of the gravity solver code Gadget2 (Springel 2005).The dark matter haloes of unit are obtained using the rockstar halo finder (Behroozi et al. 2013), and halo masses are defined using the virial theorem.
The other simulation we use in this work is outerrim (Heitmann et al. 2019).It was run using the Hardware/Hybrid Accelerated Cosmology Code (HACC) (Habib et al. 2016) .Haloes were obtained using the Friends-of-Friends (FoF) halo finder (Davis et al. 1985) with a linking length of b = 0.168.FOF linking lengths define halo masses.
Here we analyse single simulation snapshots.For unit we consider  = 0.8594, which is the closest snapshot to the eBOSS ELG effective redshift  eff = 0.845 (Raichoor et al. 2020).In the case of outerrim, the closest snapshot corresponds to  = 0.865.For both simulations, we only consider haloes with at least 21 particles.

The bias function
Following Avila et al. (2020), we first compute the bias and halo mass functions, which are used to set constrains in the HOD model as we will see in Subsection 3.1.For both functions we use halo mass bins that are narrow for small halo masses and become wider for higher masses to account for the lower number of haloes (see Figure 1).In Subsection 3.1 we relate those quantities to the HOD parameters.
We calculate the halo bias function for outerrim following the same procedure as in Avila et al. (2020), in order to allow a direct comparison.We find differences in the halo bias function below 1 per cent with respect to Avila et al. (2020).
However, in the case of unit, we obtain the halo bias function slightly different, by computing the power spectrum for each mass bin.We find this method provides more stable results than when considering the two-point correlation function used in Avila et al. (2020).For obtaining the bias from the power spectrum, we use the following limits  min = 2/ box + /2 and  max = 2 grid / box + /2, with linear spacing  = 2/ box and  grid = 512.For each mass bin we obtain the correspondent bias minimizing the  2 function: where  th () is the linear matter power spectrum computed from unit dark matter particles,  ℎ, () is the halo power spectra for each mass bin  and  is the linear bias.We use  cut = 0.1 Mpc −1 h, since the effect of non-linearities is non-negligible beyond this scale.The error in the power spectrum Δ ℎ, () 2 has the following expression: with the halo number density is at the mass bin , and  ℎ, as the number of haloes at that mass bin.
We finally obtain the linear bias for each halo mass bin considering  2 min with a 1 confidence interval given by Δ 2 = 1.In Figure 1 we represent the halo mass and the halo bias functions for outerrim and unit simulations.Higher-mass haloes are less frequent and they also have a higher bias.
For both simulations we fit the resulting bias function with a fifth order polynomial (represented by a red dashed line in Figure 1).The polynomial fit is able to encapsulate the simulation measurements within 1, except for the two most massive bins of the unit simulation.Nevertheless, we expect this effect to be negligible, as very few haloes are found within those two massive bins.

Observational data
In this work we generate mock catalogues with the number density and linear bias fixed to the Emission-Line Galaxies (ELG) from the extended Baryon Oscillation Spectroscopic Survey (eBOSS) from the DR16 Sloan Sky Digital Survey (SDSS) IV (Dawson et al. 2016;Ross et al. 2020;Alam et al. 2021).These galaxies have redshifts in the range 0.6 <  < 1.1.eBOSS ELGs are mostly star-forming galaxies with strong spectral emission lines that enable a fast and reliable determination of their redshift (Raichoor et al. 2020).
We use HOD models to populate with galaxies the unit and outerrim simulations, which assume different cosmologies (see Table 1).Below we explain how we obtain the needed number density and linear bias from the observations.

Number density
The number density,  gal , is the total number of observed galaxies per observed volume.To take into account the incompleteness due to the photometric target selection, redshift failures and other effects such as fiber collision, a weight is associated with each galaxy (Raichoor et al. 2020;Ross et al. 2020).In order to compute the total number of observed galaxies we take into account those weights.
Since the comoving survey volume depends on the assumed cosmology,  gal varies with cosmology.The survey volume is calculated as follows: where () is the comoving distance, which also depends on cosmology. eff =   +   , with   () = 369.451(357.546)deg 2 is defined as the effective areas for the North (South) Galactic caps covered by eBOSS.Finally,  min = 0.6 and  max = 1.1 are the minimum and maximum redshifts of the observed ELGs considered in the LSS catalogues (Raichoor et al. 2020).

Linear Bias
The linear bias,  gal , can be defined as the ratio between the overdensity of galaxies and the overdensity of dark matter at large scales.We calculate the bias using the Kaiser factor (Kaiser 1987) to relate the observed galaxies and the dark matter monopoles  0 while accounting for Redshift Space Distortions (RSD).
we use the approximation  () = (Ω M ()) 0.545 (Peebles 1980;Linder 2005) for the growth rate of structure, evaluated at  eff = 0.845, and  th is the monopole of the matter two-point correlation function.
For both simulations we consider the range: 15 <  < 75 h −1 Mpc, with linear binning: Δ = 5 h −1 Mpc.For the unit simulation we calculate the monopole of dark matter particles for 0.5 per cent of the total particles 1 .For the outerrim simulation, we use the monopole provided by camb linear theory evaluated at the outerrim cosmology since we do not have enough information on the particles.The monopoles from unit camb linear theory and dark matter particles are compatible for scales  = 15 − 75 h −1 Mpc, with errors below 4 per cent.
Finally, for the unit simulation we need to include the simulation and survey volume ratio to correct the amplitude of the precision matrix in the  2 function.We compute the  2 as follows: with  box the volume of the simulation and  −1 (,  ′ ) is the inverse of the covariance or precision matrix.The covariance matrix is calculated using  EZ = 1000 Effective Zeldovich (EZ) Mocks (Zhao et al. 2021) for both simulations.The bias is found by minimising the  2 , with a 1 confidence interval computed using Δ 2 = 1.We obtain  gal = 1.37 ± 0.03 for unit and  gal = 1.33 ± 0.02 for outerrim cosmologies.

MODEL GALAXIES
Halo Occupation Distribution (HOD) models populate dark matter haloes at a given redshift with galaxies using analytical equations that relate the probability of finding a galaxy of a certain type with the mass of the host halo.We distinguish between two types of galaxies: centrals and satellites.
Central galaxies are placed at the center of host haloes and share their velocity.Due to this definition, a particular halo will not host more than one central galaxy.
On the other hand, satellite galaxies do not have to share the position and velocity of its host halo.They are associated to haloes with a given radial and velocity profile.
For a generic HOD model, equations are chosen to describe the following properties: In the bottom we show the ratio between the polynomial and the halo bias function.Right: same for outerrim simulation.We find around 1 per cent differences in the halo bias function, when comparing outerrim case with Figure 1 of Avila et al. (2020).
• Probability distribution function (PDF) for both the number of central and satellite galaxies within a halo ( § 3.2).We will extend the satellite PDF in Section 4.

Mean Halo Occupation Distribution
The shape of the mean HOD, parametrises how many galaxies will be hosted, on average, by a halo of a given mass.Usually, two analytical expressions for the shape of the mean HOD distribution are used: one for satellites and another for centrals.
Those expressions for the mean distribution of galaxies can be used to calculate the total number density and bias of the galaxy catalogue resulting from an HOD, given we know the halo mass function and halo bias functions (see Subsection 2.1): We can also calculate the fraction of satellites as follows: We fix the number density of galaxies and the galaxy bias to the observed values of eBOSS ELG data calculated in Subsection 2.2.The fraction of satellites is set as free parameter.
In this work we follow the assumptions presented in (Avila et al. 2020) for the modelling of satellite and central galaxies.
Central galaxies follow an asymmetric Gaussian distribution.This description is motivated by semi-analytical models of galaxy formation and evolution (e.g.Gonzalez-Perez et al. 2018).For central galaxies we assume the following shape: where  is the halo mass and   determines the amplitude of the Gaussian, which has mean  and variance  2 .For log  ≥ ,  < 0 controls the sharpness of the decaying power law.
For satellite galaxies, we assume they follow a power-law: where   controls the fraction of satellites and  > 0 provides how steep is the power law.The average number of satellites is zero if  ≤  0 and it increases as the halo mass increases starting from this point.If   = 1 and  0 ≪  1 ,  1 represent the mass of haloes in which we expect 1 galaxy satellite on average.In this work we fix the mass parameters  0 and  1 with a relation to  given by log (M 0 ) =  − 0.05 and log (M 1 ) =  + 0.35 (Gonzalez-Perez et al. 2018).We can see the shape of the average number of satellites per halo mass in the black solid line of Figure 2. HOD parameters ,  and  are fixed throughout this work to the values shown in Table 2. ,  ,  are determined by the number density  gal , the linear bias  gal from eBOSS ELG observations and also the fraction of satellites  sat (free parameter of our model), using Equation 6, Equation 7and Equation 8, and the latter.We define a default HOD in Table 2 that will be used several times in this work, in which we fix  gal = 6 •  eBOSS ,  gal = 1.37 and  sat = 0.3.

Probability distribution function
In order to place galaxies into haloes we need to use a discrete probability distribution function (PDF) that will determine how many galaxies of a given type, , we place given a mean number ⟨⟩ determined by the mean HOD described above.
For all discrete probability distribution functions considered in this work, the number of central or satellite galaxies each halo will host is determined by drawing a random number  ∈ [0, 1).Then, using the cumulative probability distribution function  C () =   (), we determine the value  such that  C ( = ) <  and  C ( =  + 1) > .Then, the number of central or satellite galaxies hosted by the halo are  = .
In the case of galaxy centrals, the Nearest Integer Distribution is always used as haloes can host either one or none central galaxy (for mean values between 0 and 1, the Nearest Integer distribution is identical to a Bernouilli distribution).
In the case of satellite galaxies, PDFs with different variances have been used in the literature: Poisson distribution, the standard case for most HOD models ( §3.2.1), super-Poisson through the Negative Binomial distribution (NB, §3.2.2) and sub-Poisson.So far in the literature, for this last case a Nearest Integer (NI) Distribution ( §3.2.3) has been used.However, this involves a single value for the sub-Poissonian variance.We will introduce in Section 4 two more PDFs that increase the flexibility when assigning satellites to dark matter haloes: the Binomial distribution ( §4.1) and an extended Binomial distribution ( §4.2).This is central focus of this work.
Some examples of non-Poisson PDFs in the literature are described here.Jiménez et al. (2019) uses a super-Poissonian PDF for satellites considering a Negative binomial distribution (NB).This was motivated for model star forming galaxies.Although a Nearest Integer distribution (NI) is used only for centrals, Berlind et al. (2003) found NI in better agreement with galaxies selected by Smoothed Particle Hydrodynamics (SPH) and Semi-Analytic models, independently of them being centrals or satellites.Several best fits on eBOSS data found in Avila et al. (2020) also show preference for NI and NB.However, there is a significant gap in the variance of the PDF from the Poisson distribution to the only sub-Poisson function considered (NI), motivating the modelling of a continuous distribution.
In this work we quantify the deviations of the PDF variance with respect to that of a Poisson distribution by the parameter   : The parameter   is defined differently for super and sub-Poisson variances, as it is detailed below and in Section 4. Equation 11 is adequate for a range of variances that can be continuous.Nevertheless, as it is detailed in Subsubsection 3.2.3, the Nearest Integer distribution is the only distribution considered in this work that has a single variance and its expression differs from the above equation.

Poisson Distribution
The most commonly used probability distribution to determine how many satellite galaxies are placed in a particular halo of a given mass is the Poisson distribution.A Poisson PDF can be written as follows for a halo with an average satellite galaxy  = ⟨ sat ⟩: For this distribution, the variance is equal to the mean 2 =  = ⟨ sat ⟩.The shaded black region in Figure 2 represents the theoretical Poisson variance for a particular HOD model, and the green lines represent the observed variance after 20 realizations of galaxy catalogues, which are computed with a Poisson distribution.As expected, the black shaded region and green lines are in agreement.

Super-Poisson distribution: negative binomial
The probability of getting an integer random variable,  sat , for the negative binomial (NB) PDF can be written as follows: where  traditionally describes the number of successes, hence is defined as a natural number.This quantity, , can be naturally extended to positive real numbers despite losing its original meaning:  ∈ R + .0 <  < 1 is the probability of success and  sat the number of failures, which is the random variable of this PDF.In that traditional interpretation  sat +  − 1 would represent the total number of trials, from which  can inherit the nature of a free parameter.In the context of HOD models,  sat is the number of satellite galaxies, and  and  are the PDF parameters that determine its mean, and standard deviation, In this case, the parameter that controls the variance deviations with respect to Poisson,   , is defined as   = 1  > 0 to parameterise.
In the limit  − → ∞ we recover the Poisson distribution, corresponding to   = 0. Equation 11 also represent the variance of the binomial distribution (See Subsection 4.1) and the extended binomial distribution (See Subsection 4.2) for   < 0.
As we will see in Subsection 4.1 our code will be implemented with a free parameter , which although it is motivated to follow the behaviour of   in Equation 11 for the negative binomial distribution, in some parts of our parameter space using sub-Poissonian distributions this is not possible.We represent the 1 −  contours using the negative binomial distribution for  = 0.5 2 in the blue lines of Figure 2, computed using 20 galaxy catalogue realizations.
The negative binomial distribution used in Jiménez et al. ( 2019) and Avila et al. (2020) had a slightly different parametrization of the variance, with  2 =  (1 + ) where  =   .

Sub-Poisson distribution: Nearest Integer
The probability of getting an integer random variable,  sat , with  = ⟨ sat ⟩, for the Nearest Integer (NI) PDF is: in which () is the closest lower integer to .The variance of this distribution is: with  ′ =  − trunc(), which is the smallest possible variance.This PDF was the only sub-Poisson PDF considered in Avila et al. (2020), 11.648 0.0368 0.03583 0.9 0.12 −1.4  with the drawback that there is only one possible variance for each value of .This implies that  2 is not an independent parameter from , in contrast with the Negative Binomial case (which was already used in Avila et al. (2020) as a super Poisson PDF.) ( Equation 15).We represent the Nearest Integer 1 deviations by red solid lines in Figure 2, computed using 20 galaxy catalogue realizations.The Nearest Integer Distribution is also represented by a black dashed line in Figure 4. We note that in the case when 0 < ⟨⟩ < 1, the NI function is identical to the Bernouilli function, often quoted in HOD models.

Spatial and velocity distributions of satellites
In this work, positions of satellite galaxies within haloes are assigned following a Navaro-Frenk and White (NFW) radial profile (Navarro, Frenk & White 1997).The velocity distribution of satellite galaxies are calculated considering the virial theorem, following Bryan & Norman (1998).The implementation of these two components is described in detail in Avila et al. (2020).

EXTENSIONS TO THE SATELLITE PDF
So far, in the literature, sub-Poisson variances for satellite galaxies have been only described with a Nearest Integer (NI) PDF (Berlind et al. 2003;Zheng et al. 2005;Jiménez et al. 2019;Avila et al. 2020).
The NI PDF only admits one possible variance which is the smallest possible.This limits the parameter space that can be explored to find the best description of galaxies.In this section we introduce the extensions proposed in this work to sample sub-Poisson variances continuously.We also propose a solution to mitigate numerical errors affecting non-Poisson distributions in general, derived from Γ functions with large arguments.

Binomial distribution
The binomial distribution, B, provides a discrete range of possible sub-Poissonian variances.For this distribution, the probability of getting a certain number of satellite galaxies in a halo,  sat , which is the random variable of the PDF, is given by: B(N sat ; q, p) = Γ (q + 1) with the parameter  ∈ N defined traditionally as the number of trials, which is the maximum number of satellites that can be placed with non-zero probability, and 0 <  < 1 the probability of success, i.e. of actually placing a satellite galaxy in a halo.The possible number of satellite galaxies has to be less than the number of trials,  sat ≤ .Unlike in the negative binomial distribution, the possible values of  cannot be extended to real numbers, since negative probabilities could arise.
Given the parameters  and , the mean of the binomial distribution, B(N sat ; q, p), is: As we can see in Figure 2 and Equation 10, in our model, the mean number of satellite galaxies  = ⟨ sat ()⟩ increases with the halo mass, .The variance of the binomial distribution is: Here we define   ≡ − 1  as the parameter to control the variances lower than Poisson3 , following what we have previously done for the negative binomial distribution, NB.Unlike the Nearest Integer distribution, the binomial distribution has sub-Poissonian variances that not only depend on the mean number of satellites, ⟨ sat ⟩, but also on   .As it happened for the negative binomial distribution, in the limit  − → ∞ we recover the Poisson distribution (  = 0).The variance of the distribution has exactly the same expression as in Equation 11.However, since for the binomial distribution  ∈ N, the only possible input values of   are discrete: We extend this range by introducing a new PDF in Subsection 4.2.
The skewness of the binomial distribution has the following expression: We have provided expressions for the third first moments of the binomial distribution: the mean (Equation 19), the variance (Equation 20), and the skewness (Equation 21).These first three moments are properly defined with the above equations for  ≥ 1,  ≥ 2 and  ≥ 3, respectively.That is, if  < , the -th moment of the distribution may follow another expression.
So far, we have introduced   to parametrize super-Poisson (Subsubsection 3.2.2) and sub-Poisson (Subsection 4.1) variances.Now we need to introduce , which will be the parameter that is input to the HOD model to control the variance of the satellite PDF.This parameter follows the behaviour of   , however for some regions in the parameter space  > 1,  < − 1  negative probabilities arise from Equation 18for certain  sat (see red shaded area in Figure 3).We describe and address this limitation below.

Extension to avoid unphysical negative probabilities
Since the mean number of satellites increases with halo mass, massive enough haloes will be able to host several satellite galaxies, ⟨ sat ⟩ > 1.The exact number of haloes that fulfill ⟨ sat ⟩ > 1 also depends on other HOD parameters, such as the fraction of satellites  sat .
If we consider our default HOD presented in Table 2, which corresponds to a fraction of satellites  sat = 0.3, we have that 3.4 (11.2) per cent of all galaxies (satellite galaxies) are attached to haloes with ⟨ sat ⟩ > 1.
For those haloes that contain, on average, one or more satellite galaxies, if  < −1/ (we remind that  is related to the variance) the PDF becomes negative and the expression of the binomial variance described by Equation 20 will give negative values of the variance,  2 < 0 (red area in Figure 3).
To avoid unphysical cases with negative probabilities the HOD model must correct the input value of  for those haloes, enhancing the variance only what is strictly necessary: in which  is the input parameter for the HOD model used for the entire catalogue, and   is the value that will be finally used for a given halo mass .We will use by default   throughout this paper, except when it is an input parameter on the code.In this way, it is guaranteed that we always recover the correct mean number of satellites and a positive variance when we populate haloes with satellites.
We represent in Figure 2 the mean number of satellites  = () in a black solid line.We also focus on the orange solid lines, which represent the  ± 1 contours, in which  follows Equation 20 for  = −0.5.Those lines are computed using 20 galaxy catalogue realizations.
In Figure 3, we represent the parameter space  = 1   ,  .We show the limit when the PDF starts to be negative (red line), together with the limitation we have set in the parameter space (green staggered line).Above the green staggered line we do not need corrections in the parameter space and have   =  (green area and blue area, the latter representing  < 1).Below the red line we represent the region in which corrections have to be made: Considering the default HOD model, an 88.8 per cent of satellites reside in the blue region, in which the binomial distribution does not give unphysical negative variances.The points leading to unphysical negative variances, cover the red region below the red line, including it.The red region above the red line is out of the parameter space, since  is an integer number.We use the ladder  = −1/  = trunc (⟨N sat ⟩) + 1 as the preferred boundary for corrections, ensuring: 1) the minimum effective variance to be positive and 2) ⟨  sat ⟩ to be recovered.We show an example correcting a point in the red region of the parameter space:  = 2.5,  = − 1 2 to the nearest point in the green region of the parameter space with the same mean ⟨  sat ⟩:  = 2.5, (red area).Despite green and red regions being filled continuously,  is an integer.Therefore, the region between red and green lines is not populated by the binomial distribution.
Figure 4 shows a set of binomial variances given by different input values of  and mean number of satellite galaxies,  = ⟨ sat ⟩.They are compared with Poisson and Nearest Integer variances.As we can see, binomial variances are lower than the Poisson variance and higher than the Nearest Integer one: for low  we see the inverted arc shape which is the natural shape of Equation 20.For high  there is a serrated behaviour arising from the application of Equation 22to avoid the unphysical negative values of the variance.This serrated behaviour is directly related to the green ladder in Figure 3.

Extended Binomial distribution
As we described in Subsection 4.1,   is limited to discrete values in the binomial distribution, since   = −1/ and  range is limited to natural numbers.Hence, the variance  2 for a particular ⟨ sat ⟩, given by Equation 11, is also limited to discrete values.In this section our objective is to provide an extension to continuous values of   allowing the variance to take continuous values for a given ⟨ sat ⟩.For this purpose, we define a new probability distribution function, the extended binomial distribution  ext : where  ( sat ; , ) is the binomial distribution. q ( sat ; ,   ) is introduced to extend our range of sub-Poissonian variances since our input parameter   can now take continuous values: −1 ≤   < 0.
We have that  ≡ ceil (−1/  ), in which ceil(x) is the closest upper integer to  (we justify in appendix A1 this relation between  and   ).Note that  ∈ N as required. q ( sat ; ,   ) is also computed such that  ext matches the first three binomial moments (see the detailed calculations in appendix A).
The calculation of the analytic expression of  q ( sat ; ,   ) is detailed in appendix A.  q ( sat ; ,   ) has to be calculated separately for each  solving  + 1 equations: a normalization equation (such that the PDF sums to unity), an equation for the mean number of satellites  and −1 equations for the subsequent central moments.We solve for  = 1 (trivial case,   = −1),  = 2   ∈ −1, − 1 2 and  = 3   ∈ − 1 2 , − 1 3 , by solving the corresponding equations.For  ≥ 2, we can choose the variance to have the expression of Equation 11 4 .For  ≥ 3, we choose the skewness to match the expression given by Equation 21 5 .For  ≥ 4, one would need to specify the following higher order central moments, becoming increasingly complicated.In appendix A we propose a generalisation of our  = 2 and  = 3 solutions for  ≥ 4. We find this generalisation to work well in most of the parameter space, but we also find a small part of the parameter space in which negative probabilities arise (   < 0).This has a very small impact in our parameterisation as we will see in Section 5.
Note that the binomial distribution is a particular case of  ext , with   = −1/.(See Subsection 4.1).In those cases  q = 1.

Mitigating numerical limitations of the satellite PDF
Γ functions are used in the negative binomial distribution, Equation 13, the binomial distribution, Equation 18, and its extension, 4 The expression of the variance for  = 1, which cannot be imposed, coincides with Equation 11, see appendix A 5 The expression of the skewness for  = 1 and  = 2 cannot be imposed, and for  = 2, it does not coincide with Equation 21(see gray region of Figure 6 and appendix A) Equation 23.Numerical errors for these Γ functions may arise for very large arguments: for Γ( ≥  Γlim ) an overflow is produced.
Then, the implementation of the probability distribution function fails in the task to place satellite galaxies in the haloes, obtaining a galaxy catalogue without satellites, thus decreasing the input number density of galaxies.Γ( ≥  Γlim ) imposes a limitation on the parameter space (,  sat ): where ( sat ) is the remaining Γ argument.
In this work, we compute Γ functions using tgamma from the ℎ.ℎlibrary in C. With this particularities,  max = 171.7.Since  is inversely proportional to   , Equation 24 sets a lower limit in   .
As the Γ functions enter in both Equation 13and Equation 18 as a division, we define the following function: The above function fails for Γ( ≥  Γlim ) when Equation 24is not satisfied due to numerical error.To avoid this we propose to use the product function, which is mathematically equivalent to Equation 25and does not fail when Equation 24 is not satisfied: With this substitution, we can rewrite the binomial and negative binomial with products as follows: B Π (N sat ; q, p) = Except for the discussion introduced in this section, throughout this work we will always use Equation 27 and Equation 28 when Negative binomial and binomial functions are needed, respectively.In this section, they are denoted as NB Π and B Π , but in further sections we will use the notation NB and B for simplicity.
In our model, this correction is only applicable for both the negative binomial distribution and for the binomial distribution.The extended binomial distribution also suffers from Γ overflows on the   function, but it cannot be corrected, since this correction is only applicable when two Γ( + ( sat )) functions stay at both sides of a division.This is not the case of  ext (See Equation 23, Equation A26 and Equation A27).
We now quantify the effect of this change on the PDFs.When   > 0 we use the negative binomial distribution and when   < 0, we use the binomial distribution.
We present in Figure 5 a comparison between the input number density  gal and the HOD model output for this quantity, as a function of  6 .We have computed model galaxy catalogues with Figure 5. Ratio between the recovered and the target number density multiplied by 10 3 for clarity, as a function of  > 0. The orange line is obtained using NB(N sat ; q = 1/) Equation 13.When Γ (  >  Γ ) in Equation 13, numerical computation of this function fails and no satellite galaxies are assigned to dark matter haloes, and thus the recovered number density decreases abruptly.The red line is obtained considering NB(N sat ; q = /), which has a smoother decay when  − → 0. Green and blue lines are obtained using NB Π (N sat ; q = 1/) and NB Π (N sat ; q = /), respectively.For both cases we recover correctly the number density in the whole range of .Similar results are found for  < 0 using B(N sat ; q = −1/), compared to B Π (N sat ; q = −1/).We use the default HOD model to compute the galaxy catalogues (Table 2).
Figure 5 shows the negative binomial (Equation 27, green solid line) and the negative binomial using only Γ functions (Equation 13, orange solid line).In the latter case, since numerical errors imply losing satellite galaxies, we do not recover the input number density.We observe an abrupt decaying of  gal at  = 0.006.A similar behaviour is found for the binomial distribution using products (Equation 28) and only Γ functions (Equation 18), respectively.
In this work, for the negative binomial distribution we use the definition  = 1/  , however in the literature  has been also defined as  = ⟨ sat ⟩/ = 1/  (Avila et al. 2020;Jiménez et al. 2019).In the first case we use   to control the variance of the PDF and in the second case they use  for the same purpose.The variance can be expressed as  2 =  (1 + ).We represent in Figure 5 the same comparison between the input number density and the HOD output, as a function of .We observe similar results as before: the use of negative binomial with products implies recovering the number density in the whole range (See blue dashed line), while in the case of using only Γ functions we observe a decay, which is smoother since  depends on ⟨ sat ⟩ ∝ .

COMPUTATIONAL IMPLEMENTATION OF THE SATELLITE PDF
Here we describe the algorithm we follow to choose a PDF for satellite galaxies given a target variance.This is quantified by an input parameter, , to our halo occupation distribution (HOD) model.This free parameter quantifies how far the variance of the satellite PDF is from that for a Poisson distribution.Given any input , unphysical negative probabilities can arise for some haloes containing more than one satellite galaxies, when  < −1/(trunc(⟨N sat ⟩) + 1) (Subsubsection 4.1.1).To prevent this from happening, we modify the input parameter  when needed, following Equation 22.We let denote   the input parameter with the possible correction.
Depending on the value of   , the code for the HOD model will use different PDFs for satellite galaxies: • If   = 0 we assume a Poisson PDF (Equation 12), with variance  2 =  ≡ ⟨ sat ⟩.
• If   < 0 we have two options for the satellite PDF, with a sub-Poisson variance: -If   < −1 we assume a nearest integer PDF (Equation 16), which has the smallest possible variance.
The  sub−P PDF is based on the (extended) binomial distributions introduced in Section 4. By default, we use the new PDF that we have introduced in Subsection 4.2 to allow for continuous values of   .This new PDF is the extended binomial,  ext =  q  (Equation 23).Note that  ext , is reduced to the binomial one, , when  q = 1.This occurs when   = 1/, ∀ ∈ N + .
We do not use the  ext for the  sub−P PDF, in one case.When   > −1/ Γlim we use ( = ceil (−1/  )) (Equation 28).For this range of   ,  q cannot be evaluated due to the numerical limitations of using Γ functions (see Subsection 4.3 and Section A).And thus,  ext =  q  cannot be computed in this range.
Below, we evaluate the numerical performance of  sub−P .

Performance for a uniform distribution
We evaluate the  sub−P implementation introduced in Section 4 as a mathematical tool that can be applied when a continuous range of sub-Poisson variances are needed.For this purpose, we generate 3500 points distributed randomly with  ∈ [−1, 0) and  = [0, 10] (this range is motivated in Subsection 5.2).This uniform distribution is shown in the top left panel of Figure 6.Such number of pairs {, } can provide a reliable estimation of how  sub−P behaves when  q < 0 (red points in the top left of Figure 6).
In the upper-left plot of Figure 6 we distinguish between four types of points, representing the three cases described above for  sub−P , that is, the use of  ext , green and red points in the top panels of Figure 6; ( = trunc (−1/  )), orange points; and (  ), dark blue points.The red points in Figure 6, correspond to negative probabilities due to  q ( sat ; ,   ) < 0. Points with  q < 0, represent potentially problematic regions in the parameter space, where errors could arise when recovering the input mean, variance and higherorder moments of the distribution.This region is centered around ( > 1,   > −1/4).We evaluate the impact of these below.
In the upper panels of Figure 6 we show the effect of correcting the input , dark blue points uniformly distributed on the left, when  < −1/(trunc(⟨N sat ⟩) + 1) (Subsubsection 4.1.1).The resulting   parameter has a serrated behaviour, as shown by the dark blue points in the upper-right panel of Figure 6.
We address the errors introduced by the corrections mentioned above by studying 10 6 realisations of  sub−P for each pair (  , ) shown in the top right panel of Figure 6.We compute the first three central moments: the mean  recov , the variance  recov and the skew-ness  3,recov of those realisations (for the skewness we only consider  > −0.5 since this is the range of applicability of Equation 21).We also compute  ,recov from  recov ,  recov inverting Equation 11.Finally, we compare the results with the inputs ,   ,  and  3 .
In the middle and lower panels of Figure 6, we represent for each one of the points the following quantity: | recov / − 1|, for  = ,   , ,  3 , which represents the differences between recovered and input moments.As expected, high values of | recov / − 1| are found in the region containing points with  q ( sat ; ,   ) < 0, in which  > 2 and   > −1/3.We get the largest errors when we consider the skewness  3 and the lowest errors when we consider the mean .Errors arise also in the  =   subplot for (,   − → 0): This is found to be numerical noise, and the relative errors get smaller as we increase the number of realisations.Finally, we also see deviations in the  =  3 subplot, near to the  3 = 0 curve, again simply due to statistical limitations.To sum up, in those two last cases there is not an intrinsic bias, since errors simply decrease when we consider a larger number of realizations.Intrinsic bias in the computation of moments arise in the region of the parameter space corresponding to  q ( sat ; ,   ) < 0.
Since our main interest is to recover correctly the mean number of satellites and the variance, in Figure 7 we show the differences between recovered and input values of  and .As we can see, only red points (  q < 0) have appreciable errors in our analysis.As we can see for those points, less of them ends up with a higher error values and higher discrepancies between  and  recov involve also higher discrepancies between  and  recov (Note that both parameters are related by Equation 11).In addition, we find  recov <  while  recov > .
We discussed so far where we can find in our parameter space differences between recovered and input values of our parameters ,   ,  and  3 .Now we can focus on determining how frequent those errors are.Considering  = ( = ), a 2 (2.8) per cent of the points in the parameters range uniformly explored in Figure 6 have errors greater than 0.5 per cent.In contrast, if we consider  =  3 , a 9.5 per cent of the points have errors greater than 0.5 per cent.

Performance for galaxy catalogues
In the previous subsection we evaluated how  sub−P performed in a uniform {, } space.Now, we evaluate the performance of  sub−P for galaxy catalogue generation considering a reduced parameter space to isolate points in which   < 0. First, let us recall that we are considering a growing power law for ⟨ sat ()⟩ (Equation 10, Figure 2).At the same time, the halo mass function is a decreasing power law: we have much less massive haloes than lighter ones.This translates to a lesser number of haloes with a higher average number of satellites.If we consider Figure 6, the difference we make is to consider progressively less number of points with higher ⟨ sat ⟩.
We consider at first instance 1,000,000 points in the following parameter space:  ∈ − 1 3 , 0 ,  = ⟨ sat ()⟩ , in which  is obtained from 1, 000, 000 unit7 halo masses selected randomly applying Equation 10 and considering  sat = 0.658 .All the evaluated points stay at 0 <  < 10, motivating the limits imposed in the parameter space considered in Subsection 5.19 .Note that we skipped  ∈ −1, − 1 3 , since in those cases  q ≥ 0 ∀ sat .We motivate below further cuts to isolate the problematic region (   < 0) found in the last subsection, to avoid a large number of unnecessary  sub−P evaluations.
We exclude points with  ≤ 1, since it can be demonstrated mathematically that  q ( sat ;  ≤ 1, )) ≥ 0. This is also demonstrated numerically in our previous test (Subsection 5.1).Finally, we also exclude points with  < −1/(trunc() +1) (equivalent to blue points in Figure 6).For those points we remind that Equation 22 is applied and the input value of  is substituted by   = −1/(trunc() + 1).All values of   = −1/(trunc() + 1) fall in the range of the binomial distribution, and then we have that   = 1 for all those points (that is,  ext = ).We remind that those selections we made are motivated to isolate the problematic region of the parameter space in which   < 0. We have that 851 points (from the initial 1, 000, 000 points) passed the selection.We focus on those points to determine errors in the central moments.
In particular we compute, as in Subsection 5.1, 10 6 realizations of  sub−P for all selected points.We use the results to compare the recovered and fiducial values of ,   ,  and  3 .In this test, we focus now on quantifying how frequent errors on  = ,   ,  and  3 are.In this particular test, we are interested only in the errors arising when  q ( sat ; ,   ) < 0. Thus, the statistics are obtained considering the number of points with errors greater than | recov / − 1| ( = ,   , ,  3 ) for the 851 selected points, but we normalize the result adding zero errors for the remaning points in which we already knew that   ≥ 0 (In total, 1, 000, 000 points).Note that in order to obtain a reliable statistic for galaxy catalogue generation, we need to consider all the original points.The selection was done to compute the errors only in the isolated region of the parameter space in which   < 0 was possible, thus avoiding a high computational cost.
In Figure 8 we present the number of  sub−P realizations that have an error greater than the | recov / − 1| values on the x-axis: only 6 per million points have and error greater than 0.5 per cent recovering the mean.In the same way we find that only 17 per million points have an error greater than 0.5 per cent recovering  and only 131 per million points recovering the skewness.The discrepancies between the fiducial and recovered values arise at high values of , which correspond to very massive haloes, that are much less frequent in nature.To sum up, in the particular task of mock generation, numerical errors are not expected to have an impact in the resulting galaxy distribution of the catalogues.

2PCF AND THE SATELLITE PDF
We expect the satellite PDF to affect the pair-counts of galaxies within a halo, i.e., the 2-point correlation (2PCF) 1-halo term.On the other hand at large scales (in the 2-halo term), the 2PCF would be unaffected by the PDF itself.This intuition is indeed confirmed when we measure the 2PCF HOD catalogues generated with different PDF variances.
In Figure 9 we can see how the variance of the probability distribution used to populate haloes with satellite galaxies affects the projected correlation function  p ( p ).
In the upper panel we see the general shape of  p ( p ).As we can We distinguish green points:  q (  sat ; , ) > 0 ∀  sat , red points, in which  q (  sat ; , ) < 0 for some  sat .We also consider points with  < −1/(trunc() + 1) and finally those with  >  Γlim , represented in orange.Top right: correcting  in favor of   for blue points.Middle left: we compare the mean recovered  recov with its respective fiducial value  for all points.Middle right: we make the same comparison for   .Lower left: same for .Lower right: we do the same analysis for the skewness  3 .We represent 3290 points, since we exclude points placed in the grey area and we also represent  3 = 0.
see, it follows the typical decaying power law as a function of the projected distance.
In the middle panel we divide all  p ( p ) considered by the Poisson  p,P ( p ).We can see that small scales are affected as expected: clustering rises when the variance considered is larger.A larger variance favor more instances with more than one satellite in the same halo, increasing precisely the 1-halo term.Since the linear bias has been fixed, intermediate and large scales are mostly not affected since galaxy pairs comes from different haloes and now former differences are averaged out.
In the lower panel we can see the difference between all  p and the Poisson  p,P divided by the error calculated using jackknife resampling.We detail the calculation of jackknife errors in the following subsection.We show that variations in the one-halo term of the projected correlation function induced by changes in the PDF are clearly significant, showing the constraining power that  p ( p ) has on the variance on small scales.
We do not show it here but we find that the dependence of the monopole and quadrupole on the PDF variance is more modest, in line with the results from Avila et al. (2020).Since we use mostly the one-halo term of the projected correlation function when fitting galaxy catalogues with eBOSS data, we also use the linear scales of   0 () and  2 () in our fits.In this regime, the monopole depends more strongly on the linear bias, and there is no appreciable dependence in the quadrupole, which is more affected by the velocity distribution.

Jackknife resampling
To estimate the covariance of the two-point functions y = ( p ,  0 ,  2 ) introduced in this section, and also for the Count in Cells estimator introduced in Section 7 we use jackknife resampling.
In the case of two-point functions using unit simulation, we construct  box = 1000 copies of a catalogue of galaxies, with  =  eBOSS and (,  sat ) = (-0.8,0.4).To each copy we extract a different sub-box of  cell = 100 h −1 Mpc.We compute   = ( p,i ,  0, ,  2, ) for each copy and we apply: We finally need to rescale the variance, comparing the volume of the simulation and the eBOSS ELG volume: In the case of unit simulation,  box = 1 h −1 Gpc.Nevertheless, when considering two-point function using outerrim simulation, we have that  box = 3 h −1 Gpc.In that case, we divide the latter box in 27 sub-boxes of 1 h −1 Gpc, we follow the steps outlined above for each one of the sub-boxes and we finally average the errors.
Finally, we compute the jackknife covariance for Count-in-cells following the steps described above, but using  box = 125.

Application to model eBOSS ELGs
As an application we fit the eBOSS ELG data using HOD models with a range of satellite PDFs.This is done as an exercise and thus, we will only vary 2 parameters: the fraction of satellites  sat and the PDF variance by setting free .We get independent fits for unit and outerrim simulations.
We consider the following PDFs: Nearest Integer (NI),  sub−P with  ∈ [−1, −0.05], Poisson ( = 0) and Negative binomial  ∈ [0.05, 0.5].We consider a step Δ = 0.05 for the  sub−P and the NB PDFs.This parameter space is shared by the two simulations used in this work.
We also vary the fraction of satellites.For unit, we consider  sat ∈ [0.05, 0.75] with Δ  sat = 0.05.We exclude  sat > 0.75 since in this case  <  min = 10 10.42  ⊙ h −1 , implying a model in which most of the central galaxies have to be placed at small-mass unresolved haloes.We point out that this issue is particular to the HOD considered in this work.Other HOD models may allow more freedom in the fraction of satellites range.We exclude as well  sat = 0, corresponding to a galaxy catalogue populated only with central galaxies.For outerrim, we consider  sat ∈ [0.05, 0.6], since  sat > 0.6 is implying  <  min = 10 10.58  ⊙ h −1 .All remaining parameters are set to their default values (See Table 2).To increase the signal-tonoise ratio, the galaxy catalogues constructed have a number density  gal = 6 eBOSS .
We compute the  2 to compare the reference data catalogue with each one of our galaxy catalogues.The expression of the  2 as follows: in which  = ( p ,  0 ,  2 ) ,  = (  sat , ) and  is the covariance matrix computed in Subsection 6.1.
Finally, we find the best fit galaxy catalogue extracting (  sat , ) values that correspond to the minimum  2 ,  2 min .In the top panel of Figure 10 we show a comparison between our unit-based galaxy catalogues and eBOSS ELG data as our reference catalogue through the evaluation of  2 (  sat , )−  2 min .As we can see, there is some degeneracy between the two parameters.If we increase the variance of the distribution  or the fraction of satellites  sat it also increases the one halo term of the projected correlation function.Both effects seem to compensate each other.Besides that, we find that the preferred galaxy catalogue has (  sat ,) = (0.35, −0.65), corresponding to a catalogue in which the 35 per cent of galaxies are satellites, and satellites are placed with a sub-Poisson variance, modelled with  ext ( = −0.65).
In the bottom panel of Figure 10 we show  2 (  sat , ) −  2 min for galaxy catalogues computed using outerrim simulation.The best fit prefers satellite galaxies to be distributed following a Nearest Integer distribution, with 40 per cent of the total number of galaxies as satellites (  sat = 0.4).
We can see that outerrim  2 (  sat , ) function is smoother than the same quantity computed using unit simulation, since the volume of the simulation is 27 times bigger (See Table 1).unit fixed technique suppresses variance on large scales (Chuang et al. 2019), but since small scales have an important weight in our  2 fitting, it is not possible to benefit from this suppression.The results are not fully consistent between the two simulations.In particular, the difference of the best fits found for both simulations, considering outerrim simulation is around √︁ Δ 2 ≈ 3.5.Nevertheless, this inconsistency is not surprising since both simulations are run with different cosmologies (See Table 1).
The results obtained are just a first test aiming to give a first use to the  sub−P extension in a simple context of two free-parameter fitting (  sat , ) to eBOSS ELG data, using standard clustering measurements ( p ( p ),  0 (),  2 ()), following Avila et al. (2020).
We also want to see possible simulation-dependent differences in the results, as the use of different volumes and cosmologies on the  2 fitting results.The actual best fit values are not so important in this work: here the focus is on the probability distribution function, and thus we fixed most HOD parameters.

COUNT IN CELLS
In Avila et al. (2020) the authors explored the HOD parameter space using clustering statistics, and degeneracies appeared between some of the parameters.We find similar degeneracies in this work in Figure 10.Given that the CIC statistics is indeed evaluating the PDF of galaxy number counts in cubic cells, we expect it to have a strong constraining power on the PDF of satellites within a halo.
The CIC estimator we consider,  CIC  gal , is obtained following a similar approach as (Yang & Saslaw 2011): the simulation box is divided in cubic cells of side  cell = 5 h −1 Mpc, then we count how many galaxies have each cell and finally we find out how many cells have  gal galaxies.Finally, we normalize over the number of cells: CIC ( sat ) is estimated using 100 realizations of galaxy catalogues computed from unit simulations, with  sat = 0.3 and  =  eBOSS .We point out that Counts-in-Cells, unlike two-point statistics, depends on the number density of the galaxy catalogue taken into account.If the number density is higher, we will have more cells filled with a higher number of galaxies, and less cells with a lower number of galaxies.
In Figure 11 we can see how counts in cells vary with respect to .We remind that  is related to the variance of the distribution (See Equation 11).
In the upper panel we see the general trend of  CIC ( gal ): fewer cubic cells in the box are expected to contain more galaxies.This decay is roughly exponential.
The middle panel of Figure 11 shows better the variation on the number of cells depending on the value of .All lines are divided by Poisson  CIC,P ( = 0).For  gal / cell ≥ 2 we start to see that for  > 0, more number of cells are filled with higher number of galaxies.The trend is opposite for  < 0.
In the lower panel we can see the potential of CIC constraining .The highest constraining is achieved considering cells with 2 or 3 galaxies inside.We can see also that noise increases for cells filled with higher number of galaxies.

Proof of concept: using CIC to reproduce a model catalogue of galaxies
In this section we compare the constraining power of Count-in-Cells and two-point statistics, using unit simulation.We use as our reference catalogue a galaxy catalogue with (,  sat ) = (−0.8,0.4).We consider the same parameter space as in Subsection 6.2.For two-point statistics, the galaxy catalogues constructed have a number density  gal = 6 eBOSS , as in Subsection 6.2.In the case of CIC, since the results depend on the number density, we have constructed 10 realizations with number density  gal =  eBOSS for each point in the parameter space to achieve a similar resolution.We consider the same two-point statistics as before: ( p ( p ),  0 () and  2 ()), but now we compare our galaxy catalogues to the reference galaxy catalogue generated with  = −0.8, sat = 0.4, instead of eBOSS data.For the Count-in-Cells information, we use  CIC ( gal ) evaluated at  gal / cell from 0 to 5, since beyond some catalogues have no cells with  gal / cell > 5. Finally, we also join all statistics.
In the top panel of Figure 12 we show the  2 of the combined projected correlation function, monopole and quadrupole.As it can be seen,  sat and  are partially degenerated.That is, points far from each other in the parameter space may share similar  2 near to the minimum.
In the middle panel we show the performance of the Count-in-Cells and in the bottom panel we sum CIC + 2PCFs.CIC turns to be promising when fitting the HOD parameters, since it has more constraining power on (,  sat ) than two-point functions.Also, it complements 2PCFs, providing a different orientation of the degeneracy contour, helping to break previous degeneracies between  sat and .
We also expect other statistical methods based on counting galaxies in different volumes such as counts in cylinders and 2D k-Nearest Neighbours (Yuan et al. 2023) to be promising to break degeneracies in the parameter space.
Nevertheless, despite CIC has potential as an statistic to describe the distribution of galaxies, more work needs to be done to properly account for data systematics such as the survey window, masks or redshift evolution of the number density () and linear bias (), when fitting our galaxy catalogues to data surveys (Salvador et al. 2019).Moreover, the use of lightcone catalogues would also be more appropriate for this type of measurements.
considering  =  eBOSS .We have applied our extension to find the best HOD model describing the clustering of eBOSS Emission-Line Galaxies (ELGs) at  = 0.6 − 1.1.For this exercise, only two parameters have been set free: , which controls the variance of the PDF for satellite galaxies; and  sat , which controls the fraction of galaxies that are satellites ( § 6.2).We have fitted simultaneously the observed monopole,  0 (), quadrupole,  2 () and the projected correlation function,  p ( p ).For the error bars we have used jackknife covariance matrices ( § 6.1).A sub-Poisson distribution together with a high fraction of satellites are preferred to reproduce the clustering of eBOSS ELGs.In particular, for unit we find:  = −0.65, sat = 0.35; and for outerrim:  < 1 (minimal variance, corresponding to a nearest integer distribution),  sat = 0.4.As an exercise to show the potential of CIC to constrain the PDF variance, and using a mock catalogue as a reference, we have included both the clustering and CIC to constrain the HOD model parameters.Again, we have only set free  and  sat .CIC has more constraining power over the variance of the PDF for satellite galaxies.This estimator can reduce the degeneracies found between  and  sat when fitting only the clustering ( § 7.1).However, more work needs to be done to fit mock catalogues to observed CIC, as several other aspects need to be taken into account that were beyond the scope of this article.These include observational systematic errors, survey window function, the effect of density and bias evolution ((), ()), etc.
In summary, we have developed a robust way to generate HOD galaxy catalogues with a range of variances for the PDF of satellite galaxies within dark matter haloes.We have shown that this variance has a large impact on one and two point summary statistics.In particular, CIC is a promising statistic to constrain the distribution of galaxies within dark matter haloes.Likewise, the PDF of satelllites needs to be understood well, if the CIC statistics is to be used to constrain cosmology.
Our extension to the HOD modelling and the proposal to use the observed CIC in the future give us the opportunity to gain insight of the galaxy-halo connection with existing and upcoming data.Reciprocally, it is fundamental to understand the impact of galaxy-halo connection ingredients on different galaxy clustering statistics, since some of these ingredients could be degenerated with cosmological parameters.These topics are of great relevance now that we have entered the stage-IV of cosmological surveys with the data acquisition by the Dark Energy Spectroscopic Instrument (DESI) and Euclid.

DATA AVAILABILITY
The data that support the findings of this study are available on request.

Figure 1 .
Figure1.Left: at the top we represent the unit halo mass function, in the center the halo bias function (black points) with 1 confidence interval calculated from Δ () 2 = 1 and a fifth order polynomial (red dashed line) fitting the halo bias function.In the bottom we show the ratio between the polynomial and the halo bias function.Right: same for outerrim simulation.We find around 1 per cent differences in the halo bias function, when comparing outerrim case with Figure1ofAvila et al. (2020).

Figure 2 .
Figure 2. Mean and standard deviation of satellite galaxies as a function of halo mass .The solid black line shows the mean default HOD, and the black shade is the theoretical 1 contour considering a Poisson distribution ( =  P ).Solid colored lines represent the measured 1 standard deviation around the mean, for different values of .

Figure 3 .
Figure 3.The parameter space (,  ≡ −1/  ) of the binomial distribution.Considering the default HOD model, an 88.8 per cent of satellites reside in the blue region, in which the binomial distribution does not give unphysical negative variances.The points leading to unphysical negative variances, cover the red region below the red line, including it.The red region above the red line is out of the parameter space, since  is an integer number.We use the ladder  = −1/  = trunc (⟨N sat ⟩) + 1 as the preferred boundary for corrections, ensuring: 1) the minimum effective variance to be positive and 2) ⟨  sat ⟩ to be recovered.We show an example correcting a point in the red region of the parameter space:  = 2.5,  = − 1 2 to the nearest point in the green region of the parameter space with the same mean ⟨  sat ⟩:  = 2.5,   = − 1 3 .

Figure 4 .
Figure 4. Top: Theoretical standard deviation  2 of the Nearest integer (NI), Poisson (P) and binomial (B) distributions as a function of the mean , for several input values of .Bottom: ratio between all distributions and Poisson.

Figure 6 .
Figure6.Top left: 3500 random points with  ∈ [−1, 0) and  ∈ (0, 10).We distinguish green points:  q (  sat ; , ) > 0 ∀  sat , red points, in which  q (  sat ; , ) < 0 for some  sat .We also consider points with  < −1/(trunc() + 1) and finally those with  >  Γlim , represented in orange.Top right: correcting  in favor of   for blue points.Middle left: we compare the mean recovered  recov with its respective fiducial value  for all points.Middle right: we make the same comparison for   .Lower left: same for .Lower right: we do the same analysis for the skewness  3 .We represent 3290 points, since we exclude points placed in the grey area and we also represent  3 = 0.

Figure 7 .
Figure 7. Difference between the recovered and input values of  and , divided by its respective input values.We represent 3500 points, divided in four flags as in the upper plots of Figure 6.The most relevant errors which introduce a real bias in the results are found for  q (  sat ; ,   ) < 0. The errors for points corresponding to the other flags are best shown in the zoom panel.Those are smaller and due to the limited number of realizations taken into account to compute  and .

Figure 8 .
Figure 8.We represent the normalised number of  sub−P realization errors (multiplied by 10 4 due to axis clarity) higher or equal to |  recov / −1| divided by the total number of points considered,  = 10 6 , with colours as indicated by the legend

Figure 9 .
Figure 9. Projected correlation functions calculated from unit simulations  p considering a fraction of satellites  sat = 0.3.We consider Negative Binomial distributions ( > 0), Poisson distribution ( = 0),  sub−P (−1 ≤  < 0) and Nearest Integer distribution.Top: projected correlation function  p .Middle: ratio between all  p with Poisson  p,P ( = 0).Bottom: difference between all  p and Poisson  p,P divided by the error.

Figure 10 .
Figure 10.Top: comparison between eBOSS ELG data and galaxy catalogues obtained form unit simulation with  and  sat as free parameters.Specific ranges of the projected correlation function, the monopole and the quadrupole are used for the comparison.Each point represents the  2 −  2 min for a particular point of (,  sat ) parameter space. 2 −  2 min = 0 represented by a red point is the best fit to the data, corresponding to ( ,  sat ) = (−0.65,0.35).Since  < 0, a sub-Poisson distribution is preferred, parametrised by the extended binomial distribution.Bottom: same comparison using outerrim galaxy catalogues.The best fit prefers a Nearest Integer distribution and a fraction of satellites  sat = 0.4.

Table 1 .
. unit and outerrim simulation parameters.Ω M is the dimensionless dark matter density, obtained dividing the dark matter density by the critical density  crit .Taking into account a negligible radiation density Ω  and the assumption of an Euclidean Universe Ω  = 0, Ω M is directly related to the dark energy density Ω Λ by Ω M = 1 − Ω Λ . 8 is the amplitude of the density fluctuations;   is the spectral index;  box is the volume of both cubic simulations;  P is the number of dark matter particles; and  P the mass of dark matter particles.

Table 2 .
Default mean HOD used in this work.The parameters ,   and   are obtained considering  sat = 0.3,  gal = 1.37 and  gal = 6 •  eBOSS , in the context of unit simulation.