Differentiable Stochastic Halo Occupation Distribution

In this work, we demonstrate how differentiable stochastic sampling techniques developed in the context of deep Reinforcement Learning can be used to perform efficient parameter inference over stochastic, simulation-based, forward models. As a particular example, we focus on the problem of estimating parameters of Halo Occupation Distribution (HOD) models, which are used to connect galaxies with their dark matter halos. Using a combination of continuous relaxation and gradient reparameterization techniques, we can obtain well-defined gradients with respect to HOD parameters through discrete galaxy catalog realizations. Having access to these gradients allows us to leverage efficient sampling schemes, such as Hamiltonian Monte Carlo, and greatly speed up parameter inference. We demonstrate our technique on a mock galaxy catalog generated from the Bolshoi simulation using the Zheng et al. 2007 HOD model and find nearly identical posteriors to those from standard Markov Chain Monte Carlo techniques, with a ~8× increase in convergence efficiency. Our differentiable HOD model also has broad applications in full forward-model approaches to cosmic structure and cosmological analysis.


INTRODUCTION
Over the past twenty years, there has been significant observational and theoretical progress in connecting galaxies to their cosmic environments (Peacock & Smith 2000; van den Bosch et al. 2003; Kravtsov et al. 2004; Wechsler et al. 2006; Neistein et al. 2011). Understanding this connection is critical for understanding galaxy formation and evolution (Crain et al. 2009; Zehavi et al. 2011), as well as for using galaxies as biased tracers of the underlying mass density in cosmological analyses (Benson et al. 2000a; Desjacques et al. 2018). Studying this connection is a key component of many upcoming galaxy surveys, including the Prime Focus Spectrograph (Takada et al. 2014; Tamura et al. 2016), the Dark Energy Spectroscopic Instrument (DESI Collaboration et al. 2016; Abareshi et al. 2022), and the Nancy Grace Roman Space Telescope (Spergel et al. 2015; Wang et al. 2022).
A key theoretical tool for these studies has been the Halo Occupation Distribution (HOD; Lemson & Kauffmann 1999; Benson et al. 2000b; Seljak 2000; Scoccimarro et al. 2001; Berlind & Weinberg 2002; Wechsler & Tinker 2018), a framework that specifies how collapsed dark matter halos (Press & Schechter 1974; Bond et al. 1991; Cooray & Sheth 2002) are populated with galaxies. This is in contrast to "environmental" biasing schemes, such as Eulerian or Lagrangian biasing (Mann et al. 1998; Desjacques et al. 2018), common in cosmological analyses of galaxy survey data (Cuesta et al. 2016; Ivanov et al. 2020; To et al. 2021; Beutler et al. 2017). Unlike environmental biasing schemes, which only model summary statistics such as the power spectrum of the galaxy field, a well-formulated HOD model provides direct physical insight into galaxy formation physics through its parameters, which are related to critical mass scales in the galaxy-halo relation. For example, this allows direct measurement of HOD parameters by comparing observed galaxy populations with dynamical mass measurements, such as X-ray clusters (Zheng et al. 2009; Mehrtens et al. 2016).
In standard HOD implementations (e.g. Zheng et al. 2007), the HOD model specifies the probability distribution of the number of galaxies, N, hosted by a dark matter halo given its properties, such as halo mass: P(N | M_halo). A semi-analytical halo-model approach can include HOD parameters to predict two- and higher-point functions (Cooray & Sheth 2002), but those predictions are often not accurate enough for analysing modern datasets. Alternatively, a more precise Monte Carlo approach is often used to stochastically assign galaxies to halos in a large simulation box following the HOD prescription, and then the galaxy power spectrum or other quantities of interest are directly measured from the simulation.
In practice, Markov Chain Monte Carlo (MCMC) methods have primarily been used to fit HOD parameters to mock or actual data (e.g. White et al. 2011; Rodríguez-Torres et al. 2016; Sinha et al. 2018). However, these methods scale poorly with the number of parameters to be fit. As novel decorated HOD models increasingly add assembly bias parameters to accurately capture small-scale observations, this exercise can become challenging, especially if there are unforeseen degeneracies in the parameter space. These challenges can be overcome more easily with parameter inference methods that rely on gradient information, i.e. where we can estimate the response of the observations to the underlying parameters of the model, such as Hamiltonian Monte Carlo (Duane et al. 1987a), Variational Inference (Peterson 1987; Beal 2003; Blei et al. 2016), or combinations thereof (Gabrié et al. 2022; Modi et al. 2022). However, these methods are not applicable to current HOD models, as the implementation of their stochastic galaxy assignment schemes makes them classically non-differentiable.
A differentiable HOD framework will enhance dynamical forward-modelling frameworks that seek to reconstruct latent cosmological fields (e.g. Seljak et al. 2017), which are constrained to use gradient-based methods for optimization due to the high dimensionality of the inference problem. These frameworks generally rely on perturbative bias models that are accurate only on large scales (Schmidt et al. 2019; Modi et al. 2019) or heuristic neural network models with a large number of latent parameters (Modi et al. 2018). A differentiable HOD approach will allow one to push to smaller scales with a well-understood, physically motivated model that has only a handful of parameters.
An alternative to HOD models that maintains the requisite differentiability is to use differentiable emulators (Kwan et al. 2015; Wibking et al. 2020) or fitting functions of the observables, like the one proposed in Hearin et al. (2021) for galaxy assembly bias. However, these are efficient only for the particular summary statistics and cosmological parameters on which they are trained, and hence require a new training set once these are varied. Depending on the parameter space of interest, this could come at prohibitive computational cost. In addition, separate emulators must be trained for each summary statistic of interest, as such methods do not match the galaxy observations at the field level.
Motivated by this, here we adopt a different approach and aim to make the HOD sampling itself differentiable. Our aim is to be able to compute gradients of any observable with respect to HOD parameters through a particular realisation of a galaxy catalog. Common wisdom states that differentiating through stochastically sampled discrete random variables, such as the number of satellites in a given halo, is not possible. However, modern Reinforcement Learning has spurred the development of techniques to deal with these types of categorical variables in the context of deep neural network training via backpropagation. In particular, we use the Gumbel-Softmax or CONCRETE method (Jang et al. 2016; Maddison et al. 2016), which utilises continuous distributions to approximate the sampling process of discrete stochastic variables, such as galaxies, in a differentiable fashion. It relies on two insights: 1) a re-parameterization of a discrete (or categorical) distribution in terms of the Gumbel distribution (referred to as the "Gumbel trick"; Maddison et al. 2014), and 2) making the corresponding function continuous by using a continuous approximation that depends on a temperature parameter, which in the zero-temperature limit degenerates to the discontinuous, original expression.
In this paper, we will implement the Gumbel-Softmax method in the context of HOD models and apply it to mock datasets. In Sec. 2 we will describe our HOD model and the methods used to allow differentiability of its categorical outputs. In Sec. 3, we apply this technique to a Monte Carlo analysis of a mock galaxy catalog constructed from the Planck-Bolshoi simulation. In Sec. 4 we compare the differentiable HOD model to the standard approach and discuss its applications.

METHOD
In this section, we provide some background on the various components of our HOD model and detail our strategy for making this model differentiable. We implement our model using Tensorflow Probability (Tran et al. 2016; Morgan 2018), particularly the Tensorflow Distributions package (Dillon et al. 2017).

HOD Model
To describe the population of galaxies in our halos, we use the standard Zheng et al. (2007) HOD model, in which the probability of a given halo hosting N galaxies is dictated solely by its mass, P(N | M). The model separately populates central and satellite galaxies, motivated by theoretical studies (Kravtsov et al. 2004; Zheng et al. 2005), and has five free parameters with physical significance that can be related back to well-studied mass-luminosity relationships.

Central occupation
For central galaxies, the mean occupation function is step-like with a soft cutoff to account for natural scatter between galaxy luminosity and host halo mass. There are two free parameters controlling this function: the characteristic minimum mass of halos hosting central galaxies above some luminosity threshold, M_min, and the width of the cutoff profile, σ_logM:

⟨N_cen(M)⟩ = (1/2) [ 1 + erf( (log M − log M_min) / σ_logM ) ],   (1)

where erf is the standard error function and M is the halo mass. Given the mean occupation for a halo of a given mass, central galaxies are assigned to halos by sampling a Bernoulli distribution:

N_cen ∼ Bernoulli( ⟨N_cen(M)⟩ ).   (2)
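As a concrete illustration, the central mean occupation and its Bernoulli sampling can be sketched in a few lines of Python (a minimal sketch, not the Tensorflow Probability implementation used in this work; the function names are ours and the default parameter values are the fiducial ones quoted in Sec. 3):

```python
import numpy as np
from math import erf

def mean_ncen(log_m, log_m_min=12.02, sigma_logm=0.26):
    """Mean central occupation <N_cen>: a smoothed step in log halo mass."""
    return 0.5 * (1.0 + erf((log_m - log_m_min) / sigma_logm))

def sample_centrals(log_ms, rng):
    """One Bernoulli draw per halo, with probability <N_cen>(M)."""
    return [int(rng.uniform() < mean_ncen(lm)) for lm in log_ms]

rng = np.random.default_rng(0)
occupancy = sample_centrals([11.0, 12.02, 13.5], rng)  # one 0/1 flag per halo
```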

Satellite occupation
Simulations suggest that satellites follow an approximately power-law distribution with a slope close to unity at the high-mass end. At lower masses the shape of the distribution changes, and the overall distribution can be parameterized as

⟨N_sat(M)⟩ = ( (M − M_0) / M_1 )^α,   (3)

where α is the power-law slope at high masses, M_0 is the characteristic mass of the change-over and M_1 is the characteristic amplitude. This mean number of satellites for a given mass is then used to define the intensity λ of a Poisson distribution, from which a particular number of satellites is drawn for each halo.
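The mean satellite occupation can be sketched similarly (our own Python sketch; the default values are the fiducial Sec. 3 parameters, and setting the mean to zero below M_0 is our reading of the standard convention):

```python
import numpy as np

def mean_nsat(m, m0=10**11.38, m1=10**13.31, alpha=1.06):
    """Mean satellite occupation: a power law ((M - M0)/M1)^alpha, zero below M0."""
    excess = np.clip(np.asarray(m, dtype=float) - m0, 0.0, None)
    return (excess / m1) ** alpha

# The mean defines the Poisson intensity from which satellite counts are drawn:
rng = np.random.default_rng(0)
n_sat = rng.poisson(mean_nsat([1e13, 1e14, 1e15]))
```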

Spatial satellite distribution
In the Zheng et al. (2007) HOD, central galaxies are located at the center of their host halo and satellite galaxies are distributed according to a Navarro et al. (1997) profile (hereafter NFW). To sample the satellite galaxy positions, we utilize the Robotham & Howlett (2018) implementation, which constructs an efficient mapping from a random sample to the full NFW profile via the calculation of the quantile function, i.e. the inverse Cumulative Distribution Function (CDF). This can be written analytically as

r / R_vir = Φ⁻¹(q, c) = −(1/c) [ 1 + 1 / W_0( −e^{−q p(c) − 1} ) ],  with  p(c) = ln(1 + c) − c/(1 + c),   (5)

where W_0 is the principal branch of the Lambert-W function, R_vir is the virial radius and c is the concentration parameter. Using this inverse CDF, we can now randomly draw radial distances for satellites by sampling q from U[0, 1] and mapping it to radii as r/R_vir = Φ⁻¹(q, c). The angular position of the satellite is sampled uniformly in this isotropic model.
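This quantile-function mapping can be implemented directly with SciPy's Lambert-W (a sketch of our reading of the Robotham & Howlett (2018) closed form; the function name is ours):

```python
import numpy as np
from scipy.special import lambertw

def nfw_inverse_cdf(q, c):
    """Quantile function of the NFW profile: maps q in [0, 1] to r/R_vir
    for concentration c, via the principal-branch Lambert-W closed form."""
    p_c = np.log(1.0 + c) - c / (1.0 + c)        # unnormalised enclosed-mass profile at r = R_vir
    w = lambertw(-np.exp(-q * p_c - 1.0)).real   # W_0 branch; argument lies in (-1/e, 0)
    return -(1.0 / c) * (1.0 + 1.0 / w)

rng = np.random.default_rng(0)
radii = nfw_inverse_cdf(rng.uniform(size=5), c=10.0)  # r/R_vir for 5 satellites
```

Note that q = 1 maps exactly to r = R_vir, since p(c) is the (unnormalised) enclosed-mass profile evaluated at the virial radius.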

Differentiable stochastic sampling
In this section, we review the key ideas behind differentiable stochastic sampling, which will form the building blocks of DiffHOD.

Stochastic backpropagation by reparametrisation
One of the most common approaches for backpropagation through stochastic sampling is the so-called reparametrisation trick, extensively used for instance in Variational Auto-Encoders (Kingma & Welling 2013;Jimenez Rezende et al. 2014).
The key idea of this approach is to rewrite samples x from a given parametric distribution P_θ as a deterministic and differentiable transformation g_θ applied to samples ε from a fixed distribution P_ε:

x = g_θ(ε),  with  ε ∼ P_ε.

This reparametrisation of the samples allows us to side-step having to take derivatives of the stochastic variable x when computing derivatives of some downstream function h with respect to the distribution parameters θ:

∇_θ E_{x∼P_θ}[ h(x) ] = E_{ε∼P_ε}[ ∇_θ h(g_θ(ε)) ].

In the right-hand side of this expression, the derivative now only involves taking gradients of a deterministic function of θ.
To provide a simple concrete example of such a reparameterisation, let us consider a Gaussian distribution of mean μ and standard deviation σ. One can express a sample x ∼ N(μ, σ²) as x = μ + σε with ε ∼ N(0, 1), making it trivial to take derivatives of the samples with respect to the parameters of the distribution (μ and σ).
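This Gaussian example can be checked numerically in a few lines (our own illustration): with h(x) = x², the reparametrised Monte Carlo gradient estimate of d/dμ E[h(x)] should approach the analytic value 2μ:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7
eps = rng.standard_normal(200_000)   # fixed noise, independent of (mu, sigma)
x = mu + sigma * eps                 # reparametrised samples: x ~ N(mu, sigma^2)

# Chain rule: d(x^2)/dmu = 2x * d(mu + sigma*eps)/dmu = 2x, so a Monte Carlo
# estimate of d/dmu E[x^2] is simply mean(2x); analytically it equals 2*mu.
grad_mu = np.mean(2.0 * x)
```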

Gumbel-Softmax trick for categorical variables
The reparameterisation trick as presented above requires the samples to be expressible as a deterministic and differentiable function of a random variable.While this can often be achieved for continuous distributions, it is typically not directly possible for discrete categorical variables.To overcome this limitation, the Gumbel-Softmax trick (Jang et al. 2016;Maddison et al. 2016) introduces a relaxation of a categorical distribution to a continuous distribution, which can then be handled with the reparameterisation trick.
Let z be a categorical variable with class probabilities π_1, π_2, ..., π_k that we wish to sample. We assume that categorical samples are encoded as k-dimensional one-hot vectors, i.e. they are 1×k vectors with all elements 0 except the element corresponding to the sampled class, which is 1. The simplest way to sample z is to draw a class index directly from the categorical distribution and one-hot encode it. A first step towards making these samples differentiable is to use the Gumbel-Max trick (Gumbel 1954; Maddison et al. 2014), which reparametrizes categorical sampling as

z = one_hot( argmax_i [ g_i + log π_i ] ),

where the g_i are i.i.d. random variables drawn from the standard Gumbel distribution, Gumbel(0, 1), with location 0 and scale 1. This reparameterization refactors the sampling of z into a deterministic function of the parameters (π) and some independent noise with a fixed distribution. However, the reparametrized function is still non-differentiable due to the argmax. A continuous, differentiable approximation is given by a softmax function, which relaxes the discreteness of the Gumbel-Max trick and generates a k-dimensional vector z with components

z_i = exp( (log π_i + g_i)/τ ) / Σ_{j=1}^{k} exp( (log π_j + g_j)/τ ),   (11)

where τ is a free parameter sometimes referred to as the "temperature". We recover the true discrete sampling in the limit τ → 0. As this function is analytic in the class probabilities π for τ > 0, we can estimate the gradients of the observed samples z with respect to the parameters parameterizing π.
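A minimal NumPy sketch of one Gumbel-Softmax sample (our own illustration; in this work the Tensorflow Probability implementation is used instead):

```python
import numpy as np

def gumbel_softmax(log_pi, tau, rng):
    """One relaxed categorical sample: softmax((log pi + g)/tau), g ~ Gumbel(0,1)."""
    u = rng.uniform(size=len(log_pi))
    g = -np.log(-np.log(u))               # Gumbel(0,1) noise via inverse-CDF sampling
    y = (np.asarray(log_pi) + g) / tau
    y = np.exp(y - y.max())               # numerically stable softmax
    return y / y.sum()
```

Because the argmax of the relaxed vector coincides with the argmax of the Gumbel-perturbed logits, hardening the samples with an argmax recovers the class frequencies π at any temperature.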
In the remainder of this work, we will make use of the special case where the number of classes is 2, i.e. where the categorical distribution reduces to a Bernoulli distribution. In this special binary case Eq. 11 can be simplified. Maddison et al. (2016) refer to the resulting distribution as BinConcrete; we will refer to it in this work as a Relaxed Bernoulli distribution. Using the fact that the difference of two Gumbel variables follows a Logistic distribution, Maddison et al. (2016) show that the Relaxed Bernoulli can be reparameterized as

z = σ( (log α + L)/τ ),  with  L ∼ Logistic(0, 1),   (12)

where σ is the sigmoid function and α is the odds ratio, α = π/(1 − π), if π is the probability of the corresponding Bernoulli distribution.

DiffHOD Implementation
We now have all the elements needed to build a differentiable HOD (hereafter DiffHOD) model. We describe in this section our strategy for sampling central and satellite occupation, and satellite positions.

Differentiable central occupation sampling
As described in Sec. 2.1, the central occupation is defined by a Bernoulli distribution, with a parameter p = ⟨N_cen(M | M_min, σ_logM)⟩ defined as a deterministic function of the halo mass.
Here, we can directly apply the Gumbel-Softmax trick introduced above in the special case of a binary variable. We therefore sample the central occupation of each halo using the Relaxed Bernoulli distribution at a given temperature τ:

N_cen = σ( (log α + L)/τ ),  with  L ∼ Logistic(0, 1),   (13)

where α is the odds ratio of the Bernoulli probability p. Fig. 1 illustrates the high agreement between the halo occupancy obtained by sampling centrals with this Relaxed Bernoulli distribution and the analytical expectation at our fiducial choice of τ = 0.1.

Differentiable satellite occupation sampling
For satellites, we aim to define a differentiable approach to sampling from a Poisson distribution with intensity λ = ⟨N_sat(M | M_0, M_1, α)⟩, also a deterministic and differentiable function of the halo mass. To build on the Gumbel-Softmax trick, we propose to replace conventional Poisson sampling of the total number of satellites with sampling each satellite individually from a Bernoulli distribution.
Let us consider a halo with a Poisson rate λ for its satellite occupancy. We assume the halo can have a maximum of N satellites; then for each potential satellite we sample from a Bernoulli distribution with probability p = λ/N whether this satellite will be included in the halo. The resulting statistics of the number of satellites under this procedure will be Binomial (as N draws from i.i.d. Bernoulli variables).
More formally, we propose to approximate the Poisson distribution with intensity λ of a standard HOD by a Binomial distribution with N trials and probability p = λ/N:

N_sat ∼ Binomial(N, λ/N).

By construction, this Binomial distribution will yield the same mean number of satellites as the Poisson distribution; however, the variances of the two distributions differ:

Var_Poisson = λ,  Var_Binomial = Np(1 − p) = λ(1 − λ/N).

From this expression, one can foresee that the Binomial distribution will be a close approximation to the Poisson distribution when the ratio λ/N is small, i.e. when the number of trials is large compared to the expected number of satellites. This is known as the law of rare events: at fixed λ = Np, the Binomial distribution Binomial(N, p) converges to a Poisson distribution as the number of trials N → ∞.
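A quick numerical check of these moments (our own sketch, using the fiducial N = 48 adopted below and an illustrative intensity λ = 5):

```python
import numpy as np

rng = np.random.default_rng(1)
lam, n_trials = 5.0, 48          # illustrative Poisson intensity; N = 48 is the fiducial choice below
draws = rng.binomial(n_trials, lam / n_trials, size=400_000)

# The Binomial matches the Poisson mean exactly, but its variance is shrunk
# by a factor (1 - lam/N) relative to the Poisson variance lam.
mean_est, var_est = draws.mean(), draws.var()
expected_var = lam * (1.0 - lam / n_trials)
```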
In practice, we will need to limit the number of trials N to some finite value; Fig. 2 compares the shapes of the satellite distributions under a Poisson model versus a Binomial model for two different choices of N and for different halo masses. A higher value of N can improve accuracy but also increases memory costs, which scale linearly with the maximum number of satellite galaxies encoded in our one-hot embedding. Meanwhile, too low an N can bias results by artificially reducing the variance of the satellite occupation distribution or, worse, truncating the satellite galaxies of the most massive halos.
Our fiducial choice in this work is N = 48. In Fig. 1 (bottom), we can see that the satellite population reaches a maximum of ∼40 galaxies for the most massive, ∼10^15 M_⊙, halos. For the mass range considered here, we find N > 40 does not significantly improve the statistical match in summary statistics (correlation function, power spectrum, etc.) of the resulting galaxy fields, while larger N will further increase memory requirements and computational time. If the end statistic of interest is particularly sensitive to galaxy populations in the most massive clusters, a higher N might be needed. If one is limited by memory or implementing the HOD for halos with a broader mass range, it may be more efficient to use multiple halo mass bins, each with a different maximum number of allowed satellites N.
This Binomial assumption for the sampling of satellites brings two concrete advantages: (i) having restated satellite sampling as draws from Bernoulli distributions, we can make the procedure differentiable by using the Relaxed Bernoulli, similarly to centrals; (ii) using a fixed number N of potential satellites gives us a practical way to handle a varying number of satellites per halo.
Concretely, for each candidate satellite i ∈ [1, N] of a halo, we sample whether the satellite will be included in the halo using:

z_i = σ( (log α + L_i)/τ ),  with  L_i ∼ Logistic(0, 1),   (17)

where α = p/(1 − p) is the odds ratio of the inclusion probability p = λ/N and τ is the temperature. As a result, for each halo we obtain a vector z of size N which encodes the active satellites for that halo.
In all downstream computations, this vector z can be interpreted as a weight between 0 and 1 to apply to each of the N satellites of each halo, for instance in the computation of two-point correlation functions.

Figure 2. Comparison of satellite occupation distributions for different halo masses (m_halo = 10^14.0, 10^14.4, 10^14.8), under various assumed distributions: Poisson (solid line), Binomial with 100 trials (dashed line), Binomial with 30 trials (dotted line). By construction, the Binomial approximation recovers the mean number of satellites, but for massive halos limiting the number of trials N leads to a departure in the spread of the distribution compared to a Poisson distribution. Note that for the m_halo = 10^14.8 halo the mean number of satellites is above 30, so the Binomial distribution is not well defined.

Differentiable sampling from NFW distribution
The last step to complete our HOD implementation is to sample the positions of satellites from an NFW profile centered at the halo position. There are two difficulties here: 1) sampling positions for a varying number of satellites, and 2) making the sampled positions differentiable with respect to the NFW parameters.
The first question, dealing with a varying number of satellites, is solved by our Binomial model with a fixed number of trials N. For each halo, we sample the same number N of coordinate sets, one for each potential satellite. Whether these coordinates actually contribute to downstream computations depends on the per-halo satellite occupation vector z introduced above.
The second question, of the differentiability of stochastic coordinates, is again solved by applying the re-parameterisation trick to the NFW profile. From Eq. (5) we know the inverse CDF of the NFW profile, meaning that for each satellite i ∈ [1, N] of a given halo we can sample the halo-centric satellite radial distance as:

r_i / R_vir = Φ⁻¹(q_i, c),  with  q_i ∼ U[0, 1],

where Φ⁻¹ is a differentiable function of the concentration parameter c.
In our adaptation of the Zheng et al. (2007) model, we assume an isotropic NFW distribution for satellites, so to retrieve the halo-centric cartesian coordinates x_i, y_i, z_i of a given satellite, we first sample a point on the unit sphere and then multiply its coordinates by the r_i value sampled above. This is nothing more than another reparameterisation step, and the resulting cartesian coordinates remain fully differentiable with respect to the NFW parameters.
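The isotropic scaling step can be sketched as follows (our own illustration; uniformity on the sphere requires sampling cos θ, not θ, uniformly):

```python
import numpy as np

def satellite_offsets(radii, rng):
    """Halo-centric cartesian offsets: a uniform point on the unit sphere per
    satellite, scaled by its sampled radial distance (a reparameterisation step)."""
    radii = np.asarray(radii, dtype=float)
    phi = rng.uniform(0.0, 2.0 * np.pi, size=radii.shape)
    cos_theta = rng.uniform(-1.0, 1.0, size=radii.shape)  # uniform in cos(theta) for isotropy
    sin_theta = np.sqrt(1.0 - cos_theta**2)
    unit = np.stack([sin_theta * np.cos(phi),
                     sin_theta * np.sin(phi),
                     cos_theta], axis=-1)
    return radii[..., None] * unit
```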
We note that, contrary to the sampling of centrals and satellites, which is a differentiable approximation to a standard HOD (due to the discrete variables involved), this differentiable implementation of NFW sampling is exact.

Impact of temperature parameter 𝜏
In the DiffHOD model, we introduce the temperature, τ, as a free parameter (Eq. 13). Depending on the context/implementation, it may be beneficial to anneal (i.e. reduce) τ over the course of an optimization such that τ → 0. For instance, in the original papers (Jang et al. 2016; Maddison et al. 2016), the Gumbel-Softmax trick was used in the context of training a neural network where only the optimal network weights were of interest, and hence τ was reduced to nearly zero over the course of the training. However, in our case we are interested in stochastic sampling, not optimization, so we pick a single temperature that well approximates the target distribution (i.e. is unbiased) while maintaining reasonable derivative properties (i.e. less noisy gradients).
In Fig. 1, we present the central (top) and satellite (bottom) occupancy distributions of our DiffHOD model for different temperatures: τ = 0.02 (red), 0.1 (green), 0.5 (orange), and 1 (blue). We include the occupancy distributions for the standard HOD model for comparison (star). For this work, we take an experimental approach to determining τ, as advocated in Maddison et al. (2016): the temperature should be set as high as possible while maintaining the desired accuracy of the target distribution. We find that a fixed τ = 0.1 provides high accuracy while maintaining stable gradients both for sampling the number of galaxies in each halo and for sampling the positions of the satellites from the NFW profile (see later Fig. 3).

EXPERIMENTATION
To test our implementation, we construct a fiducial mock galaxy catalog from the Planck Bolshoi simulation halo catalog (Klypin et al. 2011) at z = 0, and treat this catalog as our mock observation. This simulation has a side length of 250 h⁻¹ Mpc and contains 1,367,493 unique halos ranging in mass from 1.1 × 10^15 M_⊙ down to 2.7 × 10^8 M_⊙. We use halotools (Hearin et al. 2016) to generate our fiducial catalog with HOD parameter values: log M_min = 12.02, σ_logM = 0.26, log M_0 = 11.38, log M_1 = 13.31, α = 1.06. These parameters correspond to the best-fit values from Zheng et al. (2007) using an SDSS-based galaxy catalog (Zehavi et al. 2005) with an r-band absolute magnitude threshold of M_r < −20.

Summary Statistics and Derivatives with DiffHOD
Next, we compare the mock observations to parallel galaxy catalogs constructed using DiffHOD (Section 2). The galaxy catalogs are painted onto a grid using a differentiable Cloud-In-Cell (CIC) painting method (Modi et al. 2020), and their real-space power spectra are calculated via a differentiable Tensorflow power spectrum implementation (Horowitz et al. 2021). For DiffHOD, all steps are differentiable, so the overall mapping from the original halo catalog to the final power spectrum is also differentiable via the chain rule. In Fig. 3, we present the DiffHOD derivatives of the central and satellite occupancy distribution functions (top) and the resulting power spectra (bottom) with respect to the HOD parameters for different temperatures: τ = 0.02 (left), 0.1 (middle), and 1 (right). The derivatives are evaluated at the fiducial HOD parameter values. In the center panels, for comparison, we include derivatives of the central and satellite occupancy distribution functions derived analytically, and derivatives of the power spectrum derived using the standard HOD with finite differences (star).
With τ = 0.02, the power spectrum derivatives have significant numerical noise. The τ = 1.0 derivatives are smoother, but we find inaccurate occupancy distributions (Fig. 1) and a significantly biased power spectrum. Meanwhile, with τ = 0.1 there is still some noticeable numerical noise in the derivatives; however, this is sufficiently smooth for our application and for our optimization to be well behaved. We also find that the τ = 0.1 DiffHOD derivatives are in good agreement with analytical derivatives for the occupancy function and with standard HOD derivatives estimated using finite differences for the power spectrum.

Figure 3. Top: Derivatives of the central and satellite occupancy functions with respect to the HOD parameters for various temperatures: τ = 0.02 (left), 0.1 (center), and 1 (right). The derivatives are calculated at the fiducial HOD parameter values. We include derivatives analytically derived from Eqs. 1 and 3 for comparison (star). Bottom: Derivatives of the galaxy power spectrum with respect to the HOD parameters, evaluated at the fiducial HOD parameter values. We include derivatives calculated using the standard HOD with finite differences for comparison (star). Our DiffHOD model with τ = 0.1 provides sufficiently smooth derivatives that are in good agreement with analytical derivatives for the occupancy function and with standard HOD derivatives for the power spectrum.
We compare the differentiable power spectrum from DiffHOD to the power spectrum from the standard HOD model in Fig. 4. We use the fiducial HOD parameter values and estimate the error bars from 500 realizations of the DiffHOD model. We find good agreement (better than 1%) across all scales, with particularly good agreement for central galaxies, which are not sampled from the NFW profile. Low k modes are most sensitive to the most massive halos, whose satellite galaxy populations are truncated by our one-hot distribution (N = 48; Section 2.3); those interested in this regime can further increase N. However, even at k < 0.15 h/Mpc, we find good agreement between the power spectra.

Monte Carlo Analysis with D HOD
Lastly, we demonstrate that we can derive unbiased inference using DiffHOD by comparing the posteriors on HOD parameters derived using DiffHOD to posteriors derived using standard methods with the same mock observations and likelihood. We use the power spectrum measured from the fiducial galaxy catalog as our mock observation, and construct a covariance matrix using 100 galaxy catalog realizations at the fiducial HOD parameters. To this covariance, we added a small constant diagonal term (8.0 × 10⁻⁵) to improve numerical stability. We limit our comparison to k < 1.0 h/Mpc. In both the DiffHOD and standard cases, we use the same halo catalog used to construct the mock observations, so that the only source of error is variation caused by the HOD model. For this analysis we impose wide Gaussian priors, N(μ, σ²), where μ is the mean and σ² is the variance, on our parameter values as follows: log M_min ∼ N(12.0, 0.5), σ_logM ∼ N(0.25, 0.2), log M_0 ∼ N(11.25, 0.5), log M_1 ∼ N(13.20, 0.5), α ∼ N(1.0, 0.2).

Figure 4. Ratio of the DiffHOD power spectrum to that of the standard (halotools) HOD implementation, P_HT. We plot the ratios for central and satellite galaxies, as well as for all galaxies. Error bars represent the standard deviation of the DiffHOD model estimated from 500 realizations of the galaxy sampling. The grey band represents the standard deviation of P_HT for all galaxies. We find good agreement between the differentiable power spectrum and P_HT across all scales (k < 3 h/Mpc), as well as similar distributional properties.

To derive the posteriors using DiffHOD, we sample over the HOD parameters using Hamiltonian Monte Carlo (HMC) (Duane et al. 1987b; Neal et al. 2011). We use the No-U-Turn HMC implementation (Hoffman et al. 2014) in Tensorflow Probability. We use three chains initialized around our fiducial HOD parameters, each run for over 1,000 steps (with 300 steps of burn-in).
For the standard approach, we run the same analysis using the standard HOD. However, since this implementation does not allow easy differentiation, we cannot use HMC and instead use a Markov Chain Monte Carlo (MCMC) analysis. We use the emcee (Foreman-Mackey et al. 2013) implementation with 10 walkers and 6,000 steps. We present the posteriors on the HOD parameters for the DiffHOD (black) and standard (blue) analyses in Fig. 5 and list the median posterior values in Table 1. We mark the fiducial ("true") HOD values in green. As we are using a single galaxy realization for our mock observations, we expect some variation between the best-fit parameters and the true values. The posteriors derived using DiffHOD and HMC are in excellent agreement with the posteriors derived using the standard HOD and MCMC.
Our HMC analysis takes approximately 10 hours on a single Tesla V100-PCIE-32GB GPU. Meanwhile, the standard approach takes substantially more time to obtain comparable results: ∼200 hours on 1 CPU, ∼20× slower than our DiffHOD analysis. Some of this improvement is due to the fact that our DiffHOD implementation is faster per iteration than the standard HOD implementation (∼1 and ∼4 seconds per iteration, respectively). Most of the improvement, however, comes from the fact that DiffHOD allows us to exploit a more efficient gradient-based method to derive the posterior.
We compare the DiffHOD and standard approaches in more detail by comparing the effective sample size of each chain per function evaluation (Gelman et al. 2013). The effective sample size incorporates information about the auto-correlations within a chain, i.e. it accounts for the dependent relationships between the samples. We calculate it from the output of the Markov chain as

N_eff = N / ( 1 + 2 Σ_{t=1} ρ_t ),

where N is the number of samples in the chain and ρ_t is the autocorrelation at lag t. Averaging over all HOD parameters, we find a mean effective sample size of 524.2 for our DiffHOD HMC evaluation and 403.9 for the standard MCMC evaluation. This corresponds to an effective sample size of 0.05 per step for the HMC and 0.006 for the MCMC evaluation.

Figure 5. Posteriors on the Zheng et al. (2007) HOD parameters derived using DiffHOD with Hamiltonian Monte Carlo (black) and the standard HOD with Markov Chain Monte Carlo (blue). We use a single mock galaxy catalog realization at the fiducial HOD parameter values (green) as our observations.
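The effective-sample-size estimate can be sketched as follows (our own illustration; truncating the autocorrelation sum at the first negative ρ_t is a common practical choice and an assumption here, not a prescription from the text):

```python
import numpy as np

def autocorr(chain, lag):
    """Sample autocorrelation rho_t of a 1-d chain at the given lag."""
    x = np.asarray(chain, dtype=float) - np.mean(chain)
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

def effective_sample_size(chain, max_lag=200):
    """N_eff = N / (1 + 2 * sum_t rho_t), truncating the sum of
    autocorrelations at the first negative term."""
    rho_sum = 0.0
    for lag in range(1, max_lag):
        rho = autocorr(chain, lag)
        if rho < 0.0:
            break
        rho_sum += rho
    return len(chain) / (1.0 + 2.0 * rho_sum)
```

For a strongly autocorrelated chain (e.g. an AR(1) process with coefficient 0.9), N_eff is far below the raw chain length, which is exactly the effect that favors the better-mixing HMC chains here.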

CONCLUSIONS
In this work we have constructed a differentiable stochastic HOD model going from a halo catalog to an observed galaxy power spectrum. This allows us to use derivative-based optimization methods to quickly optimize the underlying model parameters. This is the first time differentiable stochastic models have been used in the astrophysics literature. We find the DiffHOD model provides a 4× increase in speed versus the same analysis performed via MCMC with the standard HOD implementation. DiffHOD is an alternative to a number of recent works focusing on emulating galaxy clustering statistics (Kwan et al. 2015; Wibking et al. 2019; Kobayashi et al. 2020; Wibking et al. 2020; Hearin et al. 2022). Unlike emulator-based methods, DiffHOD works at the level of the halo catalog and allows fast, differentiable generation of any summary statistic with respect to the HOD parameters.
In this work, we have focused on the Zheng et al. (2007) HOD model, but our methods can be easily extended to a broad class of models. While standard HODs are based only on halo mass, in general various properties of a halo's environment and formation history could affect its galaxy properties (Zhu et al. 2006; Croton et al. 2007). Galaxy assembly bias has been argued (Feldmann & Mayer 2015; Hadzhiyska et al. 2020) to cause significant deviations between the predictions of standard HOD models and those from hydrodynamical simulations. Decorated HOD models have been introduced to account for assembly bias (Hearin et al. 2016) and have been extended to include other possible effects (Yuan et al. 2018). These models still rely on stochastic discrete sampling for assigning centrals and satellites, so they can be modelled in a differentiable way using the techniques described in this work. As the dimensionality of the problem increases, either with extended HOD models or joint analysis with cosmological parameters, we expect the relative performance of derivative-based methods, like HMC, over pure sampling-based methods to further improve (Neal et al. 2011). Differentiable HOD models have even more apparent applications in the case of dynamical forward-model large-scale reconstructions (Seljak et al. 2017) when paired with efficient differentiable halo-finding methods (Modi et al. 2018, 2020; Kodi Ramanah et al. 2019). While it is possible to perform these reconstructions by interpreting the galaxy field as a biased version of the dark matter field (as in Horowitz et al. 2021), inaccuracies in this prescription will result in biases that would be difficult to account for in cosmological constraints. Through joint inference of the HOD parameters with the initial density field, these astrophysical uncertainties can be rigorously marginalized out. Differentiable models are critical for this application, as the optimization is highly multidimensional (of dimension approximately the number of particles in the simulation) and would be computationally infeasible without gradient-based methods.
While in this work we have highlighted using our D HOD model inside an HMC framework, one can exploit its automatic differentiation in a variety of first-order optimization and parameter inference methods. For example, standard variational inference relies on having well-defined derivatives for the optimization of latent-space parameters describing the likelihood surface (Peterson 1987; Beal 2003; Blei et al. 2016). Variational inference could further accelerate parameter inference when compared to Hamiltonian Monte Carlo or nested sampling methods (Gunapati et al. 2018).
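The reliance of variational inference on derivatives can be made concrete with the reparameterization trick, the same idea underlying the differentiable sampling used here. The toy sketch below (plain NumPy, hypothetical names) estimates the gradient of a Gaussian expectation with respect to the variational mean μ; for the toy objective f(x) = x², the exact gradient is 2μ.

```python
import numpy as np

def reparam_grad_mu(mu, sigma, n_samples, rng):
    """Reparameterization trick: write x = mu + sigma * eps with
    eps ~ N(0, 1), so d/dmu E[f(x)] = E[f'(x)] can be estimated by
    averaging the pathwise derivative over samples.
    Toy objective: f(x) = x**2, hence f'(x) = 2 * x."""
    eps = rng.standard_normal(n_samples)
    x = mu + sigma * eps
    return np.mean(2.0 * x)
```

In a full VI scheme the same pathwise gradient, supplied by automatic differentiation rather than an analytic f', drives the optimization of all variational parameters at once.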

Figure 1. Halo occupancy distribution for central (top) and satellite galaxies (bottom) as a function of halo virial mass for our differentiable HOD model (D HOD, solid). Different colors indicate different temperature values used in the Gumbel-Softmax approximation (see Eq. 11). We include the occupancy distribution from the standard Zheng et al. (2007) HOD model for reference (stars). In this work, we use D HOD with τ = 0.1, which is in good agreement with the standard HOD model throughout the full halo mass range.

Figure 4. Comparison of the power spectrum derived using a standard HOD model ($P_{\rm HT}$) versus our D HOD model. We plot the ratios for central and satellite galaxies, as well as for all galaxies. Error bars represent the standard deviation of the D HOD model estimated from 500 realizations of the galaxy sampling. The grey band represents the standard deviation of $P_{\rm HT}$ for all galaxies. We find good agreement between the differentiable power spectrum and $P_{\rm HT}$ across all scales ($k < 3\,h$/Mpc), as well as similar distributional properties.

Figure 5. Posterior distributions of the Zheng et al. (2007) HOD parameters derived using D HOD with Hamiltonian Monte Carlo (black) and the standard HOD with Markov Chain Monte Carlo (blue). We use a single mock galaxy catalog realization at the fiducial HOD parameter values (green) as our observations.

Table 1. Posterior values from the HOD analyses using the D HOD and standard HOD models. Uncertainties are estimated from the 16% and 84% quantiles.