SIDE-real: Supernova Ia Dust Extinction with truncated marginal neural ratio estimation applied to real data

We present the first fully simulation-based hierarchical analysis of the light curves of a population of low-redshift type Ia supernovae (SNae Ia). Our hardware-accelerated forward model, released in the Python package slicsim, includes stochastic variations of each SN's spectral flux distribution (based on the pre-trained BayeSN model), extinction from dust in the host and in the Milky Way, redshift, and realistic instrumental noise. By utilising truncated marginal neural ratio estimation (TMNRE), a neural network-enabled simulation-based inference technique, we implicitly marginalise over 4000 latent variables (for a set of $\approx 100$ SNae Ia) to efficiently infer SN Ia absolute magnitudes and host-galaxy dust properties at the population level while also constraining the parameters of individual objects. Amortisation of the inference procedure allows us to obtain coverage guarantees for our results through Bayesian validation and frequentist calibration. Furthermore, we show a detailed comparison to full likelihood-based inference, implemented through Hamiltonian Monte Carlo, on simulated data and then apply TMNRE to the light curves of 86 SNae Ia from the Carnegie Supernova Project, deriving marginal posteriors in excellent agreement with previous work. Given its ability to accommodate arbitrarily complex extensions to the forward model -- e.g. different populations based on host properties, redshift evolution, complicated photometric redshift estimates, selection effects, and non-Ia contamination -- without significant modifications to the inference procedure, TMNRE has the potential to become the tool of choice for cosmological parameter inference from future, large SN Ia samples.

On the other hand, detailed dust modelling is a principal component of BayeSN (Mandel et al. 2009; Mandel, Narayan & Kirshner 2011; Thorp et al. 2021; Mandel et al. 2022; Ward et al. 2023a), a probabilistic SN Ia model that furthermore captures the broader-brighter correlation observed in SNae Ia into a trainable spectro-temporal flux distribution, thus obviating the need for Tripp corrections involving the magnitude, stretch, and dust parameters (Phillips 1993; Tripp 1997, 1998). BayeSN also exploits observations in the near-infrared (IR), where extinction is reduced and only weakly dependent on the dust properties, as a convenient 'anchor' for measuring the SNae's brightness (Avelino et al. 2019).
Astrophysical properties of galaxies are also thought to influence the population of SNae Ia they host. Correlations between (standardized) SN Ia absolute magnitudes and colours on one side and the host's overall stellar population age, star-formation rate, metallicity, and stellar mass (Kelly et al. 2010; Sullivan et al. 2010; Childress et al. 2013; Chung et al. 2023), as well as the SN's location within the host and the local host properties (e.g. Rigault et al. 2013, 2015, 2020; Jones, Riess & Scolnic 2015; Moreno-Raya et al. 2016b, a; Hill et al. 2018; Jones et al. 2018; Kim et al. 2018; Roman et al. 2018; Kim, Kang & Lee 2019; Rose, Garnavich & Berg 2019; Kelsey et al. 2021) on the other side have been empirically observed, although a definitive explanation of the causal channels is still outstanding.
Studying populations of SNae Ia has often involved a two-step process whereby first the properties of individual objects are inferred, and then their distributions are examined. Modern analysis methodologies like BAHAMAS (March et al. 2011; Shariff et al. 2016), UNITY (Rubin et al. 2015, 2023), Ma, Corasaniti & Bassett (2016), Steve (Hinton et al. 2019), and BayeSN are instead hierarchical: they include parameters describing the SN Ia population and infer them simultaneously with those of the individual SNae. While the former three use a few highly compressed summary statistics instead of the full light curves, BayeSN models probabilistically the full spectro-temporal flux distribution of SNae Ia. A middle ground is BIRD-SNACK (Ward et al. 2023b), which models hierarchically the light curves around peak with a restricted set of colour-related parameters. The impact of Bayesian hierarchical modelling (BHM) of dust extinction on inferred SN Ia distance estimates (used subsequently for cosmology) was demonstrated by Mandel et al. (2017) and Thorp & Mandel (2022, hereafter TM22). Recently, Grayling et al. (2024) extended the analysis to inferring SN Ia models separately for high- and low-mass hosts, while Thorp et al. (2024) considered an evolution of dust properties by comparing a low- to a higher-redshift sample.
Future surveys, like those performed by the upcoming Roman Space Telescope (WFIRST; Hounsell et al. 2018) and Vera Rubin Observatory (LSST; LSST Science Collaboration 2009; Ivezić et al. 2019), are expected to detect hundreds of thousands of supernovae (Ia and non-Ia), greatly reducing purely statistical uncertainties and highlighting the need for principled treatment of systematic effects that avoids approximations and ad hoc assumptions. Apart from requiring scalable computational methods for their analysis, the orders-of-magnitude increase in the number of SN Ia candidates will also introduce new modelling challenges, mainly related to the inability to have spectroscopic follow-up for any but a small fraction of the detected transients. Among those challenges are complicated (non-Gaussian, multimodal) posterior estimates for the objects' redshifts, contamination of the sample due to mis-classification (Kunz, Bassett & Hlozek 2007), covariate shift (Moreno-Torres et al. 2012; see also Revsbech, Trotta & van Dyk 2018; Autenrieth et al. 2023) between the full sample and a high-quality/spectroscopic sample used to train SN Ia models, and selection effects like Malmquist bias (Malmquist 1922, 1925), whereby the preference for detecting brighter objects skews and shifts the distribution of properties of detected SNae Ia away from that of the whole population.
In principle, these effects can be included in likelihood-based analyses, with two approaches prevailing. Some studies, e.g. the recent photometry-only cosmological analysis of Popovic et al. (2024), rely on simulations to derive various de-biasing and correction factors (Kessler & Scolnic 2017), which, however, are at best correct and de-biasing on average. Furthermore, the focus of this procedure on observed SN properties, inherited from the tradition of standardization, makes it difficult to extend to properties of the hosts (see e.g. Popovic et al. 2021, 2023). On the other hand, hierarchical likelihood-based analyses (Rubin et al. 2015; Hinton et al. 2019; Rubin et al. 2023; see also March et al. 2018) can only handle selection effects by transferring them to the level of the unobserved latent parameters, which necessitates a plethora of 'counter-intuitive' (Rubin et al. 2015) assumptions and approximations to the selection likelihood, which is intractable for realistic selection criteria applied to light curves.
The framework of simulation-based inference (SBI) is a powerful alternative to likelihood-based methods which allows simulations to be used to directly derive quantities of interest: for example, marginal posteriors for the cosmological parameters or those describing the dust population. Stemming from approximate Bayesian computation (ABC; for a review, see e.g. Sisson, Fan & Beaumont 2018), SBI comprises a rapidly expanding collection of techniques centred around the idea of representing the data-generating process not through numerical evaluation of the likelihood but rather through example (mock) data stochastically simulated with known parameters.
Because the pairs of parameters and simulated data implicitly encode the full likelihood, it is possible to define arbitrary parameter sub-spaces in which to perform marginal simulation-based inference. This is especially beneficial when analysing large collections of observed objects, for which likelihood-based methods require exploring a large latent parameter space even if one is only interested in a few global parameters: in the case of SN Ia cosmology, one needs to infer, e.g., the individual redshifts, stretches, colours, etc. of all SNae, only to eventually marginalize them out and obtain a two-dimensional marginal posterior for the cosmological parameters of interest. While all latent parameters still need to be stochastically sampled in the context of SBI, this only adds the linear complexity of simulating all objects, instead of the at least quadratic (but often worse: see e.g. Handley, Hobson & Lasenby 2015, fig. 5) scaling required to map out high-dimensional spaces.
SBI also allows seamless inclusion of many aspects of the model for which a numerical likelihood is intractable or computationally impractical: for example, contamination and selection bias can be represented through contaminated and biased mock catalogues, created by performing on simulated data the same classification procedure and enforcing the same selection criteria as on real observations.Importantly, because the true input parameters to the simulations are known, these catalogues can be used to derive accurate (de-biased and de-contaminated) posteriors.
Different SBI flavours use different procedures to convert parameter-simulated data pairs into posteriors conditioned on the observed data. ABC, for example, accepts or rejects input parameters based on the similarity of simulated and observed data, but this is computationally wasteful and requires bespoke distance measures for all but the simplest types of data. More recent methods employ neural networks (NNs) to interpolate in data space and have thus proven much more efficient in terms of simulation budget. The inference network can be trained to either emulate the likelihood (neural likelihood estimation, NLE), approximate the posterior of interest (neural posterior estimation, NPE), or estimate the likelihood-to-evidence ratio (neural ratio estimation, NRE).1 Whereas NLE and NPE require density estimation, NRE transforms Bayesian inference into a classification task, thus allowing the greatest freedom to the neural network architecture. We use a sequential modification known as truncated marginal neural ratio estimation (TMNRE; Miller et al. 2021), which further optimizes simulation use and network expressivity by focusing on regions of parameter space consistent with the observed data, while trivially composing with marginalization and preserving local amortization (see below).
Neural SBI techniques are also amortized: i.e. rather than the particular posterior given the observed data, they learn how to derive the posterior from any data. This, in combination with a nearly instantaneous evaluation speed once trained, allows neural SBI to be validated on large sets of simulations, either with random parameters from the priors (Hermans et al. 2022) or at fixed parameter values, i.e. in a frequentist context. Amortization can also be used during training to ensure proper Bayesian coverage properties (Delaunoy et al. 2022).
The analysis in SICRET (and the majority of other works mentioned above2) is based on pre-derived summary statistics (e.g. from SALT) and focused on scaling it to the size of near-future surveys (∼10^5 SNae Ia). The present work, in complement, extends the methodology in terms of data and modelling complexity by training a neural network to summarize raw light-curve data in a way that is optimal for the particular inference task, thus circumventing the expensive fitting stage present in all current studies. While our focus is on marginal population-level inference (e.g. of global dust properties and, eventually, cosmology), TMNRE also allows simultaneous inference of all object-level (local) parameters in the BHM: the very ones that serve as summary statistics in downstream tasks. Finally, explicitly modelling the SN light curves is indispensable if one wants to account for selection effects (introduced by criteria defined in terms of raw observations) and non-Ia contamination. In a forthcoming work, we plan to address these two issues, the final hurdle to achieving fully simulation-based SN Ia cosmology.

1 See Cranmer, Brehmer & Louppe (2020) and Lueckmann et al. (2021) for overviews of the methods and references for each, and https://simulation-based-inference.org/ and https://github.com/smsharma/awesome-neural-sbi for references to applications. We list relevant ones below.
2 Except Jennings et al. (2016), who derived a distance measure between sets of light curves that avoids the curse of dimensionality, and Villar (2022), who resorted to Gaussian process (GP) interpolation to analyse light curves using neural networks.
We describe our forward simulator of SN Ia light curves in Section 2 and briefly introduce TMNRE in Section 3, detailing the network we use in Section 3.1. In Section 4, we validate the results from our method against those obtained via Hamiltonian Monte Carlo (HMC) with simulated data that mimics the Carnegie Supernova Project (CSP) survey before presenting results on the real SN Ia light curves in Section 5 and concluding in Section 6.

slicsim: REALISTIC SN IA LIGHT CURVE SIMULATIONS FOR MACHINE LEARNING
SBI analyses are empowered by the realism of the simulations they employ. This work improves in this respect compared to SICRET by utilizing a simulator that generates realistic SN Ia light curves while incorporating uncertainties from three different levels: observational noise, the hierarchy of stochastic parameters, and the residual stochasticity of the SN's spectro-temporal flux surface.
In this work, we present a new light-curve simulation framework, slicsim,3 built from the ground up for close interoperability with modern machine-learning frameworks. Based on PyTorch (Paszke et al. 2019), our simulator is trivially deployable on hardware accelerators like graphics processing units (GPUs), parallelizable, and automatically differentiable.4 SN Ia light-curve simulation involves three main components: a source model, a number of propagation effects, and an instrument model. Its input is a 'survey specification': the time, passband, and a description of the instrument and observing conditions (see Section 2.5) of each pointing comprising the analysed data. The survey specification is kept constant, so that the simulator is implicitly conditioned on it, producing an ordered list of simulated fluxes. This conditioning allows us to use the efficient inference network described in Section 3.1.
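To make the three-component structure concrete, the pipeline can be sketched as below. This is a minimal NumPy sketch with illustrative function names, not the actual slicsim API (which is PyTorch-based); the toy 'instrument' simply averages a flat spectrum over its wavelength grid.

```python
import numpy as np

def simulate_lightcurve(params, survey, source_model, effects, instrument):
    """Sketch of the three-stage pipeline: a source model produces a
    rest-frame spectral flux surface, each propagation effect transforms
    it in turn, and the instrument model converts it into an ordered list
    of simulated fluxes for the fixed survey specification."""
    flux = source_model(params, survey)      # rest-frame F(t, lambda)
    for effect in effects:                   # host dust, redshift, MW dust, ...
        flux = effect(flux, params, survey)
    return instrument(flux, params, survey)  # ordered simulated fluxes

# Minimal toy instantiation: a flat spectrum, a single dimming 'effect',
# and an instrument that averages the flux over its wavelength grid.
survey = {"wavelength": np.linspace(0.3, 1.85, 64)}  # micron
source = lambda p, s: np.full_like(s["wavelength"], p["amplitude"])
dust = lambda f, p, s: f * 10 ** (-0.4 * p["A_V"])
instrument = lambda f, p, s: np.array([f.mean()])

obs = simulate_lightcurve({"amplitude": 1.0, "A_V": 0.5}, survey,
                          source, [dust], instrument)
```

Keeping the survey specification as a fixed input, as above, is what lets the simulator return a consistently ordered flux vector for every parameter draw.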

Intrinsic SN Ia spectral time-series: BayeSN
The simulation starts with the spectral flux distribution (total emitted energy per unit time and unit wavelength interval), F_s(t, λ), of a supernova s ∈ {1, ..., N_SN}, i.e. its brightness at each point in time and at each wavelength in the SN rest frame. We use the BayeSN model (Mandel et al. 2009; Mandel et al. 2011; Thorp et al. 2021; Mandel et al. 2022; Ward et al. 2023a), which decomposes the magnitude difference from the Hsiao et al. (2007) SN Ia template, F_Hsiao(t, λ), as

$$F_s(t, \lambda) = F_\mathrm{Hsiao}(t, \lambda) \times 10^{-0.4 \left[ \delta M_s + W_0(t, \lambda) + \theta_1^s W_1(t, \lambda) + \epsilon_s(t, \lambda) \right]},$$

with t, λ in the SN frame. Note that while W_0 and W_1 are shared among all SNae Ia, the residual perturbations ε_s are not, and this sets BayeSN apart from other models based on functional principal component analysis (PCA) like SALT and SNEMO.

[MNRAS 530, 3881-3896 (2024)]
The principal components W_0(t, λ) and W_1(t, λ) and the residual perturbation surface ε_s(t, λ) are defined via two-dimensional spline interpolation over a fixed grid [t_g, λ_g] in time and wavelength. The spline knots W_k and e_s can be learnt from data after setting suitable priors. Here, we use the pre-trained BayeSN model from Mandel et al. (2022, hereafter M20), based on 79 nearby SNae Ia with high-quality optical and near-IR observations. It defines a 6 × 9 grid spanning the ranges t ∈ [−10; 40] d and λ ∈ [0.3; 1.85] μm. We fix the W_k and the common covariance matrix Σ_e of the e_s to the M20 posterior means. Thus, the 44 free parameters controlling the intrinsic brightness of each SN Ia are δM_s, θ_1^s, and the 42-component array e_s (the perturbations are fixed to zero at the extreme wavelengths of the grid, reducing the trainable parameters to 6 × 7).
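The spline-surface construction can be sketched as follows. The knot values here are random placeholders, not the trained M20 knots, and the helper name `delta_mag` is ours; only the grid dimensions and the edge-wavelength zeroing follow the text.

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

# Knot grids as described: 6 phase knots in [-10, 40] d and 9 wavelength
# knots in [0.3, 1.85] micron. W0, W1, eps values are placeholders.
rng = np.random.default_rng(0)
t_g = np.linspace(-10.0, 40.0, 6)
lam_g = np.linspace(0.3, 1.85, 9)
W0 = rng.normal(size=(6, 9))
W1 = rng.normal(size=(6, 9))
eps = np.zeros((6, 9))
eps[:, 1:-1] = rng.normal(scale=0.05, size=(6, 7))  # zero at edge wavelengths

def delta_mag(t, lam, dM=0.0, theta1=0.5):
    """Magnitude offset from the Hsiao template at (t, lam):
    dM + W0(t, lam) + theta1 * W1(t, lam) + eps(t, lam),
    each surface evaluated by 2D spline interpolation over the knots."""
    surf = lambda W: RectBivariateSpline(t_g, lam_g, W)(t, lam)[0, 0]
    return dM + surf(W0) + theta1 * surf(W1) + surf(eps)

# Multiply the Hsiao template flux by this factor at (t, lam):
flux_ratio = 10 ** (-0.4 * delta_mag(0.0, 0.55))
```

Because the interpolating spline passes through the knots exactly, evaluating at a grid point recovers the knot value itself.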

Propagation effects: dust extinction
In the context of BayeSN, dust extinction from the SN host is modelled with the Fitzpatrick (1999, hereafter F99) law:

$$F_s^\mathrm{ext}(t, \lambda) = F_s(t, \lambda) \times 10^{-0.4\, A_V^s\, \xi(\lambda;\, R_V^s)},$$

where t, λ are still in the SN frame, ξ(λ; R_V) is the F99 extinction curve normalized to unity in the V band, R_V^s is the F99 dust-law parameter, and A_V^s is the optical depth of host-galaxy dust for SN s.
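Applying such a law amounts to a wavelength-dependent magnitude dimming. The sketch below uses a placeholder inverse-wavelength curve in place of the real F99 parametrisation (only the V-band normalisation is faithful), to show the mechanics and the near-IR behaviour invoked earlier as an 'anchor'.

```python
import numpy as np

def xi_toy(lam_um, R_V):
    """Placeholder for the F99 curve xi(lambda; R_V), normalised to 1 at
    the V band (0.55 micron). A simple 1/lambda shape stands in for the
    real parametrisation."""
    curve = 1.0 / lam_um + 1.0 / R_V
    return curve / (1.0 / 0.55 + 1.0 / R_V)

def extinguish(flux, lam_um, A_V, R_V):
    """Dim a rest-frame flux by A_V * xi(lambda; R_V) magnitudes."""
    return flux * 10 ** (-0.4 * A_V * xi_toy(lam_um, R_V))

lam = np.array([0.44, 0.55, 1.6])  # roughly B, V, and a near-IR band (micron)
dimmed = extinguish(np.ones(3), lam, A_V=0.3, R_V=2.6)
```

Even with this toy curve, the near-IR flux is suppressed far less than the optical, illustrating why IR observations are less sensitive to the dust parameters.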

Propagation effects: redshift and distance
The wavelength of light from a supernova is affected by the relative motion of the source and observer, which is made up of the motion of the supernova within its galaxy (which is often neglected), the peculiar velocity of the host galaxy and of the Milky Way with respect to the CMB, and the Sun's and the Earth's own peculiar motions. Additionally, distant objects are affected by the expansion of the Universe and exhibit a cosmological redshift.
The effect of the total redshift5 z_s is simple: it shifts the spectral flux distribution in both wavelength (λ → λ_o = (1 + z_s) × λ) and phase (t → t_o = (1 + z_s) × t) and suppresses its intensity threefold: once because of the redshift of photons to lower energies, once because the flux is a rate (and time is dilated), and once because spectral intervals also get dilated:

$$F_o(t_o, \lambda_o) = \frac{F_s(t, \lambda)}{(1 + z_s)^3}.$$

In this work, we will assume to have perfect estimates ẑ_s = z_s of the total redshifts, as appropriate for spectroscopically observed supernovae (uncertainties in this case are on the order of σ_z ≈ 10^−5). Most supernovae from future surveys, however, will only have redshift estimates from photometric observations of their host galaxy (thus, not including the SN's peculiar motion), which are highly uncertain and expected to deviate significantly from the Gaussian approximations usually employed and to exhibit multimodality (Leistedt, Mortlock & Peiris 2016; see also Autenrieth et al. 2024, fig. 1). While this puts strain on likelihood-based analyses, SBI is completely transparent to the distributions used and so can easily handle realistic photometric redshifts.
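As a minimal sketch, assuming the spectral flux distribution is suppressed by (1 + z)^3 (one factor each for photon energy, time dilation, and wavelength-interval dilation), the redshift transformation is:

```python
import numpy as np

def redshift_sfd(t_rest, lam_rest, flux_rest, z):
    """Shift phases and wavelengths by (1 + z) and suppress the spectral
    flux distribution by (1 + z)**3, per the threefold dilution described
    in the text. Function name and array interface are illustrative."""
    return (1 + z) * t_rest, (1 + z) * lam_rest, flux_rest / (1 + z) ** 3

t_o, lam_o, f_o = redshift_sfd(np.array([0.0]), np.array([0.5]),
                               np.array([8.0]), z=1.0)
```

At z = 1 a rest-frame 0.5 μm bin lands at 1.0 μm with its flux density diluted by a factor of 8.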
On the other hand, the intensity of a supernova's light is affected by its distance D from the observer: since the total flux is spread over a sphere with area 4πD², the spectral flux density (SFD) is simply divided by 4πD². For supernovae located at cosmological distances, the appropriate distance to use (see e.g. Hogg 2000) is the transverse comoving distance6 D_M(z_c, C), which is related to the object's cosmological redshift z_c^s through the cosmological model parametrized by C. Using SNae Ia for cosmological inference therefore requires disentangling the effect of peculiar velocities7 from the cosmological redshift, especially when using low-redshift (z ≲ 0.1) SNae Ia (Davis et al. 2011). While peculiar velocities can be included in the BHM and inferred simultaneously with other SN and host parameters (Rahman et al. 2022), cosmological analyses usually employ a simpler procedure of correcting the redshifts and propagating the associated uncertainty to the observed magnitudes. In Section 5, where we analyse real light curves, we will also use this simpler approach for comparison with previous work, while for the examination of simulated data in Section 4, we will assume no peculiar velocity, leading to an equality of all considered redshifts: z_c^s = z_s = ẑ_s. In future work, we will extend our SBI framework to perform joint inference of SNae and their hosts, thus fully accounting for peculiar velocities.
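For a flat cosmology, D_M reduces to the line-of-sight comoving distance, (c/H_0) ∫ dz'/E(z'). A self-contained sketch, using the flat ΛCDM parameters fixed later in this work (H_0 = 73.24 km/s/Mpc, Ω_m0 = 0.28); the function name is ours:

```python
import numpy as np
from scipy.integrate import quad

C_KM_S = 299792.458  # speed of light, km/s

def transverse_comoving_distance(z, H0=73.24, Om0=0.28):
    """D_M(z) in Mpc for a flat LambdaCDM cosmology. With zero curvature,
    D_M equals the comoving distance (c/H0) * Int_0^z dz' / E(z'),
    where E(z) = sqrt(Om0 (1+z)^3 + (1 - Om0))."""
    E = lambda zp: np.sqrt(Om0 * (1 + zp) ** 3 + (1 - Om0))
    integral, _ = quad(lambda zp: 1.0 / E(zp), 0.0, z)
    return C_KM_S / H0 * integral

d_low = transverse_comoving_distance(0.01)
```

At z ≲ 0.1 this approaches the Hubble-law value cz/H_0, which is why peculiar velocities (hundreds of km/s) matter most for low-redshift samples.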

Propagation effects: Milky Way extinction
Dust extinction in the Milky Way (MW) is similar to that in host galaxies, but the F99 law is evaluated at the observer-frame wavelengths:

$$F_o^\mathrm{ext}(t_o, \lambda_o) = F_o(t_o, \lambda_o) \times 10^{-0.4\, A_{V,\mathrm{MW}}^s\, \xi(\lambda_o;\, R_{V,\mathrm{MW}})}.$$

Following Schlafly & Finkbeiner (2011, hereafter SF11), we assume an isotropic MW dust law with R_V,MW = 3.1 and perfectly measured MW optical depths A_V,MW^s at the sky locations of the SNae, extracted from the SF11 maps. These assumptions can be easily relaxed and the MW dust properties inferred or marginalized with SBI similarly to those of the hosts.

6 [...] where the prefactor 1/(1 + z_c) is cancelled when integrating to obtain bolometric fluxes. This formulation, however, makes dealing with peculiar velocities, which introduce redshift but not distance, cumbersome (Davis et al. 2011), and so we forward simulate the two effects separately.

Instrument model
A telescope measures the photon rate density in a given band b (in the observer frame):

$$R_b(t_o) = \int T_b(\lambda_o)\, \frac{F_o(t_o, \lambda_o)}{hc / \lambda_o}\, \mathrm{d}\lambda_o,$$

where T_b is the filter transmission (including the wavelength dependence of atmospheric absorption), and hc/λ_o is an individual photon's (observer-frame) energy. In slicsim, the integral is performed numerically over a dense grid of wavelengths. Photometric light curves consist of a collection of noisy measurements of R identified, in addition to the SN label s, by i ∈ {1, ..., N_s^obs}, for each of which a time t_{s,i} and band b_{s,i} are provided. The translation invariance of the time axis necessitates the introduction of a time-offset parameter, t_s, for each SN, so that t_{s,i} = t_{s,i}^o + t_s. The instrumental characteristics and amount of atmospheric absorption affecting the observation are summarized into a zero-point magnitude, which is the magnitude of a source which produces a read-out of 1 ADU on expectation. An ADU (analogue-to-digital unit), in turn, is related to photoelectrons, which are the actual Poisson-distributed quantity, through the instrumental gain. Finally, the magnitude system is defined through a zero-point count (rate density) R_{ZP,b_{s,i}}: this is the 'brightness', in the particular band, of an object of nought magnitude.
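The band integral can be evaluated with a simple trapezoidal rule on a dense wavelength grid, as described in the text. The sketch below uses a flat spectrum and a top-hat filter so that the result can be checked against the analytic value (λ₂² − λ₁²)/(2hc); function and variable names are illustrative.

```python
import numpy as np

H = 6.62607015e-27  # Planck constant, erg s
C = 2.99792458e10   # speed of light, cm/s

def photon_rate(lam_cm, flux_lam, transmission):
    """Photon rate density R_b = Int T_b(lam) F(lam) / (hc/lam) dlam,
    integrated with the trapezoidal rule on a dense wavelength grid.
    Units: lam in cm, F in erg/s/cm."""
    integrand = transmission * flux_lam * lam_cm / (H * C)
    return np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(lam_cm))

# Toy example: unit flat spectrum through a top-hat band at 5000-6000 A.
lam = np.linspace(4000e-8, 7000e-8, 20001)  # cm
T = ((lam >= 5000e-8) & (lam <= 6000e-8)).astype(float)
rate = photon_rate(lam, np.ones_like(lam), T)
```

Dividing the energy flux by the per-photon energy hc/λ, rather than integrating the flux directly, is what makes this a photon rate (the quantity a photon-counting detector actually records).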
The expected signal (number of photoelectrons) from SN s at time t_{s,i} in band b_{s,i} is, then,

$$d^\mathrm{src}_{s,i} = \mathrm{gain}_{s,i} \times \frac{R_{b_{s,i}}(t_{s,i})}{R_{\mathrm{ZP},b_{s,i}}} \times 10^{0.4\, \mathrm{ZP}_{s,i}},$$

to which we must add background noise from the electronics, the sky, and the host galaxy (cumulatively, d^bg) before Poisson-sampling the final instrument readout:

$$d_{s,i} \sim \mathrm{Poisson}\!\left(d^\mathrm{src}_{s,i} + d^\mathrm{bg}_{s,i}\right).$$

While R_{ZP,b_{s,i}} is measured in laboratory conditions, and gain_{s,i} is a controllable setting, the zero-point magnitude includes contributions from the weather and other uncertain effects, so for each pointing s, i, the survey data release contains a noisy measurement of ZP_{s,i}, often represented through a mean zero point and a (usually small) zero-point uncertainty obtained from simultaneous observations of photometric standards.
A common simplification of the instrument description, which we will also adopt, is to consider a simple Gaussian model for the observed calibrated flux ('FLUXCAL') with mean d^src and standard deviation FLUXCALERR:

$$\mathrm{FLUXCAL}_{s,i} \sim \mathcal{N}\!\left(d^\mathrm{src}_{s,i},\, \mathrm{FLUXCALERR}_{s,i}^2\right).$$

In this case, linear rescaling does not modify inference, so ZP and gain can be set arbitrarily: we set the ZP to 27. FLUXCALERR combines in quadrature the (Gaussianized) uncertainties from the source, the electronics, the zero-point estimate, and from independent measurements of the sky and the host, which have been subtracted8 from the data to produce FLUXCAL. If only FLUXCALs and FLUXCALERRs are released by a survey, instead of the more detailed information required to use equation (10), one could in principle simulate the noise sources, analyse them to produce 'mock' FLUXCALERRs, and include them as part of the data in an SBI framework. In this study, however, FLUXCALERRs are regarded as part of the instrument and are therefore kept fixed for simplicity, effectively assuming that Poisson noise from the source is a subdominant component, which justifies disregarding its dependence on the true source flux.

Hierarchical Bayesian modelling of SNae Ia
The parameters that describe each individual supernova (θ_1^s, e_s, δM_s), its environment (R_V^s, A_V^s), and other related quantities (A_V,MW^s, z_s, t_s) are, in the context of a Bayesian hierarchical model (BHM), assigned priors which may themselves be parametrized by global hyperparameters to be jointly inferred with the SN-specific properties. This allows for so-called 'borrowing of strength': constraints from individual objects are partially pooled (with weights given by the relative uncertainty in each object), and the final posteriors, for both global and local parameters, benefit from the whole set of observations.
The hierarchical structure of our model is depicted in Fig. 1, and the variables and their (hyper)priors are listed in Table 1. We allow each SN Ia to be affected by dust with host-specific properties, adopting a hierarchical prior distribution of R_V^s governed by hyperparameters as in Thorp et al. (2021), Thorp & Mandel (2022), and Grayling et al. (2024). This hyperprior has the shape of a Gaussian, but its support is restricted to [1.2; ∞) to maintain physicality (the lower limit is set by Rayleigh scattering; see Draine 2003). Thus, the hyperparameters μ_R, σ_R no longer represent the mean and standard deviation, respectively, of the distribution of R_V^s. Correspondingly, this truncation9 modifies the likelihood of μ_R, σ_R by favouring broad R_V^s distributions (i.e. a high σ_R), which would otherwise predict a number of objects with very low, or even negative, R_V^s in contradiction with the data. Further discussion and explanation of the two effects can be found in Grayling et al. (2024).
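The truncated hyperprior can be sampled directly; the sketch below uses `scipy.stats.truncnorm` with illustrative values of μ_R and σ_R (not the ones inferred in this work) and makes visible the effect noted above: the truncation pulls the realized mean of R_V^s above μ_R.

```python
import numpy as np
from scipy.stats import truncnorm

def sample_RV(mu_R, sigma_R, n, lower=1.2, rng=None):
    """Draw host dust-law parameters R_V^s from a Gaussian with location
    mu_R and scale sigma_R, truncated to [1.2, inf) for physicality.
    Note that mu_R and sigma_R are *not* the mean and standard deviation
    of the truncated distribution."""
    a = (lower - mu_R) / sigma_R  # standardised lower bound
    return truncnorm.rvs(a, np.inf, loc=mu_R, scale=sigma_R, size=n,
                         random_state=rng)

samples = sample_RV(mu_R=2.6, sigma_R=0.9, n=10_000,
                    rng=np.random.default_rng(1))
```

With these illustrative values the truncated mean sits roughly 0.1 above μ_R, a small bias that grows as σ_R increases, which is precisely why μ_R and σ_R must not be read off as population moments.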
For this proof-of-concept study, we adopt the following simplifications, which can easily be relaxed in future analyses. First, we assume perfect (spectroscopic) redshift estimates and no peculiar velocities, fixing z_s = z_c^s = ẑ_s (except when analysing real data: see Section 5). We also fix the cosmological model10 (and hence distances D_M^s) to a flat ΛCDM with matter density Ω_m0 = 0.28 and Hubble parameter H_0 = 73.24 km s⁻¹ Mpc⁻¹. We also assume perfectly measured MW optical depth, A_V,MW^s = Â_V,MW^s, and an isotropic fixed MW dust law with R_V,MW = 3.1. Finally, we use a simplified Gaussian description of calibrated fluxes and a toy model for the uncertain time of maximum, a parameter that shifts the whole light curve rigidly in time, by allowing it to fall within a 15-day time window around an initial estimate (which we set as the time origin).

TRUNCATED MARGINAL NEURAL RATIO ESTIMATION

NRE recasts posterior inference as binary classification: a network σ̂(Θ, d) learns to distinguish parameter-data pairs drawn jointly, (Θ, d) ∼ p(Θ, d), from pairs drawn independently from the marginals, (Θ, d) ∼ p(Θ)p(d), and is trained to minimize the binary cross-entropy (BCE) loss

$$\mathcal{L} = -\left\langle \ln \hat{\sigma}(\Theta, d) \right\rangle_{p(\Theta, d)} - \left\langle \ln \left[ 1 - \hat{\sigma}(\Theta, d) \right] \right\rangle_{p(\Theta)\,p(d)},$$

which, as Hermans, Begy & Louppe (2020) show, leads to the neural network approximating the ratio r(Θ, d) ≡ p(d | Θ)/p(d) = p(Θ | d)/p(Θ). An approximate posterior for the parameters of interest can then be obtained either by multiplying the prior density p(Θ), if it is readily available, by r̂(Θ, d_0), evaluated at the observed (as opposed to simulated) data d_0, or more generally, by weighting prior samples, e.g. those used for training/validation, by r̂(Θ, d_0). NRE is an amortized technique, which means that the NN tries to learn r(Θ, d) even for data that does not resemble the observed one. While this allows its performance to be tested and verified, it is usually advisable to re-train using simulations focused on the particular (target) observation d_0. We use the truncation scheme of Miller et al. (2021), whereby global-parameter priors are restricted to regions with significant posterior mass (approximately determined via an initial NRE run), but their shapes are not modified. This preserves the amortization of the ratio estimator within the confines of the truncated priors. We found that truncation was only necessary for the most complicated inference tasks (global dust population parameters), while other posteriors were optimally recovered from their initial priors.
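The joint-versus-marginal classification at the heart of NRE can be demonstrated end-to-end on a toy problem. In the sketch below (all names ours), θ ∼ N(0, 1) and d | θ ∼ N(θ, 1), and a logistic regression on quadratic features stands in for the deep network; for this Gaussian case the true log-ratio is itself quadratic, so the simple classifier suffices.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(theta, d):
    """Quadratic features; for this toy Gaussian problem the true
    log-ratio ln r(theta, d) is a quadratic form in (theta, d)."""
    return np.stack([np.ones_like(theta), theta, d, theta * d,
                     theta ** 2, d ** 2], axis=1)

# Joint pairs (label 1): theta ~ N(0,1), d ~ N(theta, 1).
# Marginal pairs (label 0): the same d values paired with shuffled theta.
n = 20_000
theta = rng.normal(size=n)
d = theta + rng.normal(size=n)
X = np.concatenate([features(theta, d), features(theta, rng.permutation(d))])
y = np.concatenate([np.ones(n), np.zeros(n)])

w = np.zeros(X.shape[1])
for _ in range(2000):                      # full-batch gradient descent on BCE
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - y) / len(y)

# The trained classifier's logit approximates ln r(theta, d).
log_ratio = lambda th, dd: features(np.atleast_1d(float(th)),
                                    np.atleast_1d(float(dd))) @ w
```

After training, data consistent with a given θ receive a higher estimated log-ratio (and hence posterior weight) than inconsistent data, which is exactly the property used to reweight prior samples into a posterior.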

Inference network
An SBI analysis is only as powerful as the inference network employed in it is expressive. A collection of light curves is a peculiar data set since the data d_s ≡ [d_{s,i}]_{i=1}^{N_s^obs} related to each object have different sizes because of the irregular cadence of observations. While NN architectures that accept varying-length inputs exist11 and are presently ubiquitous in areas like natural language processing (a more principled approach that we plan to adopt in the future), in the present work, we use a simpler architecture that we found is fast to train (both in terms of number of steps and time per step) and with present-day data sets achieves the best performance while requiring reasonable resources.
As a key first step, we use a collection of N_SN distinct learnable embeddings of the unequal-length d_s into a space of fixed dimension:

$$\mathbf{e}^s = \mathrm{Embedding}_s(d_s).$$

We found that even a single linear layer (per SN) works well in our particular set-up. The embeddings are stacked along a batch dimension and processed in parallel by a single shared component to derive final featurized representations of each supernova:

$$\mathbf{f}^s = \mathrm{SharedNet}(\mathbf{e}^s).$$

A data set summarizer then combines information from all objects:

$$S = \mathrm{Summarizer}\!\left(\left[\mathbf{f}^s\right]_{s=1}^{N_\mathrm{SN}}\right).$$

We use a simple summarizer that concatenates the (ordered) tuple of inputs [f^s]_{s=1}^{N_SN}, as we previously did in SICRET, where we showed that, while memory- and compute-intensive, this approach does scale to ∼10^5 SNae with current hardware. We prevent overfitting in this layer via stochastic dropout (Hinton et al. 2012).
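The embed-stack-share-concatenate structure can be sketched with plain matrices (the real network is a PyTorch model; the dimensions, layer count, and names below are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
E, F = 16, 8                  # embedding and feature dimensions (illustrative)
n_obs = [34, 51, 20]          # unequal light-curve lengths for 3 toy SNae

# One learnable linear embedder per SN: different input sizes, same output.
embedders = [rng.normal(size=(n, E)) / np.sqrt(n) for n in n_obs]
W_shared = rng.normal(size=(E, F)) / np.sqrt(E)  # single shared layer

def summarize(data_sets):
    """Embed each unequal-length d_s to dimension E, process the stacked
    embeddings in parallel with one shared ReLU layer, then concatenate
    the ordered per-SN features into the data-set summary."""
    stacked = np.stack([d @ W for d, W in zip(data_sets, embedders)])  # (N_SN, E)
    feats = np.maximum(stacked @ W_shared, 0.0)                        # shared layer
    return feats.reshape(-1)                                           # concatenation

summary = summarize([rng.normal(size=n) for n in n_obs])
```

The per-SN embedders absorb the irregular cadence, so everything downstream of the stack operates on a single fixed-shape tensor and can be batched on a GPU.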
Finally, a number of ratio-estimator networks estimate ratios. For a group of global parameters Θ, the estimator operates on the data set summary,

$$\ln \hat{r}(\Theta, d) \approx \mathrm{NN}_\Theta(\Theta, S),$$

and similarly for the local parameters θ_s of object s,

$$\ln \hat{r}(\theta_s, d) \approx \mathrm{NN}_\theta(\theta_s, S, d_s, a_s),$$

where a_s ≡ [ẑ_s, Â_V,MW^s] are auxiliary (constant) inputs that completely identify the object whose parameters are being inferred.12 As we previously noted in SICRET, the presence of S accounts for a posteriori correlations between the parameters: i.e. we infer the posterior p(θ_s | {d_s}) instead of p(θ_s | d_s). In a hierarchical model in which {θ_s} are a priori conditionally independent given global parameters Θ, i.e. p({θ_s} | Θ) = ∏_s p(θ_s | Θ), this corresponds to the marginalization ∫ p(θ_s, Θ | d) dΘ instead of simply p(θ_s | d, Θ) as was done in previous hierarchical SBI analyses of permutation-invariant data (Rodrigues et al. 2021; Heinrich et al. 2024).
To enhance their expressivity, ratio estimators first 'featurize' the raw parameters by passing them through a ParamHead, whose output is concatenated to the data set summary. For latent-variable estimators, θ̃_s ≡ ParamHead_θ(θ_s) is concatenated with a processed S̃_θ ≡ SummaryHead_θ(S), which extracts the relevant summaries from S, with the pre-processed data d_s, and with the auxiliary inputs a_s. Lastly, to enhance constraining power when the posterior is significantly more concentrated than the prior, we use the leaky parity-odd power (POP) activation layer (Jeffrey & Wandelt 2024) on the output of global-parameter ratio estimators. All network components (SN embedders, summarizer, and all ratio estimators) are trained simultaneously using the loss from equation (13) summed over all marginal parameter groups (i.e. representing, in turn, each of the global parameters and each of the local parameters, summing over the SNae).
We implement all the components using multilayer perceptrons (MLPs) with batch normalization and rectified linear unit (ReLU) non-linearities. Details about layer sizes are given in Table 2, and the network is depicted in Fig. 3.

Mock data and survey specification
We generate and analyse mock data designed to mimic the third data release of CSP, as presented in Krisciunas et al. (2017) and included in SNANA (Kessler et al. 2009). We extract the list of observation times (t_{s,i}) and bands (b_{s,i}) for each of the SNae Ia in the data release, their spectroscopic redshift (ẑ_s), and the Milky Way colour excess (which determines Â_V,MW^s, since we assume isotropic R_V = 3.1 for the Milky Way). This constitutes the 'survey specification' part of the input to the graphical model from Fig. 1.
Since the M20 model, which we use both to generate and analyse the mock data, was not trained on u-band observations or outside the rest-frame time range [−10 d, 40 d], we exclude the corresponding entries from our CSP-like set-up. As described in Section 2.1, we keep the spectro-temporal templates W_k and the covariance of intrinsic residual variations fixed in this work (they are inputs to the graphical model of Fig. 1). For the remaining parameters, we set ground-truth values as listed in Table 1, informed by the posterior means reported in M20. Individual light curves are simulated as described in Section 2 after sampling SN-local parameters from their respective priors (also listed in Table 1 and depicted in Fig. 1). We also add a random time shift of up to 5 d (rest-frame) around the time of the brightest observation (SEARCH_PEAKMJD) while, in order to avoid boundary effects in the analysis, we widen the prior to U(±7.5 d).
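The mismatch between the simulated-shift range and the analysis prior can be sketched as follows (variable names are ours, not slicsim's API): shifts are drawn from U(−5 d, +5 d) at simulation time, while the inference prior is the wider U(−7.5 d, +7.5 d), so no true value sits at a prior boundary.

```python
import numpy as np

rng = np.random.default_rng(0)

n_sn = 134
true_shift = rng.uniform(-5.0, 5.0, size=n_sn)   # rest-frame shifts used in simulation
prior_lo, prior_hi = -7.5, 7.5                   # widened analysis prior U(+/-7.5 d)

# Every simulated shift lies strictly inside the inference prior,
# avoiding boundary effects in the posterior.
assert np.all((true_shift > prior_lo) & (true_shift < prior_hi))
```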
In this work, we only simulate and analyse FLUXCAL data since detailed descriptions of the CSP observations (zero-points, gains, and background fluxes) are not available in SNANA. To facilitate the likelihood-based comparison, we increase very small reported FLUXCALERRs to be at least 0.01 mag, as has also been done for the real data.
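A magnitude-space error floor translates into a fractional flux error of roughly σ_m ln(10)/2.5, so flooring FLUXCALERR at 0.01 mag can be sketched as below (the exact convention applied to the CSP/SNANA data may differ in detail):

```python
import numpy as np

def floor_fluxcalerr(fluxcal, fluxcalerr, mag_floor=0.01):
    """Raise very small reported FLUXCALERRs to at least `mag_floor` mag.

    A small magnitude error sigma_m corresponds to a fractional flux
    error of ~ sigma_m * ln(10) / 2.5, so the flux uncertainty is
    floored at that fraction of the flux. (Sketch of the idea only.)
    """
    min_err = np.abs(fluxcal) * mag_floor * np.log(10.0) / 2.5
    return np.maximum(fluxcalerr, min_err)

fluxcal = np.array([1000.0, 500.0])
fluxcalerr = np.array([0.5, 20.0])       # the first is unrealistically small
floored = floor_fluxcalerr(fluxcal, fluxcalerr)
```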
The mock data set contains N_SN = 134 simulated SNae Ia with a total of Σ_{s=1}^{N_SN} N_s^obs = 13 202 flux measurements. Fig. 4 depicts it four times, coloured according to the values of the different SN-local parameters. The impact of A_V and θ_1 is clearly evident as gradients

Figure 3. Architecture of the neural network. Solid lines represent linear connections (followed inside the MLPs by batch normalization and a ReLU non-linearity). Dashed lines, on the other hand, connect layers that are duplicated for presentation (identity operation). When multiple parameter groups are being inferred (at the initial stage before truncation), there are multiple parameter heads, whereas here only one is shown for clarity. Notice the similarity with SICRET, fig. 2, the main addition being the SN-embedding layers: N_SN distinct components (thus coloured diversely) with different input sizes but the same output dimension, which allow the per-SN embeddings to be stacked into a single tensor along a batch dimension and processed in parallel before being flattened out for input into the summarizer.
Table 2. Details about the components of the inference network: their input and output dimensions and their particular implementation in this work. For all components, we use MLPs, indicating the number and size of the hidden layers and the output size as MLP(n_hidden × d_hidden, d_out). Each hidden layer consists of a fully connected layer, batch normalization, and a ReLU non-linearity. Inputs are also whitened (shifted and scaled by the mean and standard deviation of the training set). The size of a global-parameter group is denoted by m: 1, or 2 for the group [μ_R, σ_R]. The inference network is also depicted in Fig. 3.
(shifts) of the light curves, while those of δM and R_V are less so, which has an impact on the inference of the parameters in question. Notice also the reduced spread of SN Ia light curves in the infrared bands, where measurements allow disentangling pure-magnitude from colour-and-magnitude variations (respectively described by δM and R_V).

Training
We implement and train the neural network described in Section 3. During training, we re-sample the instrumental noise (equation (11)), which effectively augments the training set but avoids the expensive part of the simulator, and stochastically 'drop out' 50 per cent of the summarizer input (Hinton et al. 2012). We optimize using Adam (Kingma & Ba 2017) with the default PyTorch momentum settings, a decaying learning rate schedule (γ = 1/1.5 every 10 epochs) over 100 epochs, and a mini-batch size of 128 examples. The results we present below use the checkpoint that performed best on the validation set. Training on one NVIDIA A100 GPU took ≈5 h to converge, in addition to ≈30 min needed to generate the training set. Evaluating a single set of marginal posteriors then takes on the order of milliseconds.
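The stepped learning-rate decay described above (a multiplicative factor γ = 1/1.5 every 10 epochs, as in PyTorch's `StepLR`) can be sketched without any framework; `base_lr` is our illustrative choice, not a value stated in the paper:

```python
# Sketch of the learning-rate schedule: the rate is multiplied by
# gamma = 1/1.5 every 10 epochs over a 100-epoch run.
base_lr = 1e-3          # illustrative starting rate
gamma = 1.0 / 1.5
step = 10

def lr_at_epoch(epoch, base_lr=base_lr, gamma=gamma, step=step):
    # Equivalent to torch.optim.lr_scheduler.StepLR(step_size=10, gamma=1/1.5)
    return base_lr * gamma ** (epoch // step)

schedule = [lr_at_epoch(e) for e in range(100)]
```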

Validation with HMC
To validate our NRE results, we run a likelihood-based analysis (using the hierarchical likelihood that corresponds to the forward model in Section 2) with Hamiltonian Monte Carlo (HMC) and consider the resulting posterior the ground truth. We use the code outlined in Grayling et al. (2024), which is based on the implementation of the No-U-Turn Sampler (NUTS; Hoffman & Gelman 2014) in NumPyro (Bingham et al. 2019; Phan, Pradhan & Jankowiak 2019). We run four chains and draw 500 samples each after 500 burn-in steps, which takes ≈30 min when run in parallel on four NVIDIA A100 GPUs. We verify convergence using the Gelman–Rubin R̂ statistic, the effective sample size, and other standard diagnostics as described in Grayling et al. (2024). We remind the reader that, like any likelihood-based method, this HMC analysis requires sampling the joint posterior of all model parameters, including in this case four global parameters and N_SN × (5 + 42) = 6298 object-specific ones, most of which describe the residual light-curve variations ε_s. In contrast, our SBI methodology implicitly marginalizes over ε_s and estimates three global (since we group [μ_R, σ_R]) and N_SN × 5 SN-specific marginal posteriors.
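For reference, the Gelman–Rubin diagnostic used above compares within-chain and between-chain variance; a minimal numpy implementation of the classic (non-split) version is:

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor R-hat for `chains` of shape
    (n_chains, n_samples). Values close to 1 indicate convergence.
    (Classic version; modern variants split each chain in half first.)
    """
    chains = np.asarray(chains)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # mean within-chain variance
    B = n * chain_means.var(ddof=1)         # between-chain variance
    var_plus = (n - 1) / n * W + B / n      # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(1)
good = rng.normal(size=(4, 500))            # four well-mixed chains
rhat = gelman_rubin(good)
```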

Comparison of marginal posteriors
We plot marginal NRE posteriors, evaluated by weighting prior samples from the validation set by r̂(Υ, d), in Fig. 5, compared with the ground-truth marginalized HMC posteriors and the true parameter values used to generate the mock data.
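The weighting procedure is simple importance weighting: since the ratio estimator approximates p(d | θ)/p(d), weighting prior samples by it yields posterior expectations. A toy sketch with an analytic stand-in for the trained network (the Gaussian `toy_ratio` is our illustration, not the paper's estimator):

```python
import numpy as np

rng = np.random.default_rng(2)
prior_samples = rng.uniform(-5.0, 5.0, size=100_000)   # theta ~ U(-5, 5)

def toy_ratio(theta, d_obs=1.0, sigma=0.5):
    # Stand-in for the learned likelihood-to-evidence ratio r-hat(theta, d):
    # here an unnormalized Gaussian likelihood centred on the "observation".
    return np.exp(-0.5 * ((theta - d_obs) / sigma) ** 2)

weights = toy_ratio(prior_samples)
# Posterior mean as a ratio-weighted average of prior samples:
post_mean = np.average(prior_samples, weights=weights)
```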
We observe excellent agreement between NRE and HMC posteriors for the global parameters τ, σ_0, and [μ_R, σ_R], with similar uncertainties from the two methods and relative shifts of at most about 1σ. Since the ratio estimator for [μ_R, σ_R] is the most challenging, we re-trained it after truncating the global-parameter priors, as described by Miller et al. (2021) and illustrated in the figures.
Similarly, SN-local parameters, with the exception of R_V^s, are very well recovered and in agreement with HMC. A detailed comparison of the first two moments of the one-dimensional marginal posteriors is shown in the top row of Fig. 5. In general, NRE exhibits slightly larger uncertainties for most parameters, as was previously observed in SICRET. It is important to note that R_V^s inference is almost entirely population-driven. Since constraints from individual objects are weak, the hierarchical structure induces 'shrinkage': a statistical effect whereby all marginal R_V^s posteriors concentrate toward the population mean. This is not an artefact of the inference procedure used but rather a feature of the hierarchical model itself, and is observed for both NRE and HMC. Thus, small changes in the μ_R–σ_R posterior shift all the R_V^s marginals coherently, leading to similar offsets from the HMC results for individual objects. We note that, while the full joint posterior can be studied with HMC, with NRE we only derive marginal posteriors.
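Shrinkage has a closed form in the conjugate-normal case: an object's conditional posterior mean is a precision-weighted average of its own estimate and the population mean. The sketch below uses illustrative numbers (μ_R = 2.6, σ_R = 0.5 are our choices, not the paper's posterior values):

```python
def shrunk_mean(rv_hat, sigma_obs, mu_R=2.6, sigma_R=0.5):
    """Conditional posterior mean of one object's R_V under a normal
    population N(mu_R, sigma_R^2) and a normal 'measurement' of width
    sigma_obs centred on rv_hat (conjugate-normal shrinkage)."""
    w = sigma_R**2 / (sigma_R**2 + sigma_obs**2)   # weight on the data
    return w * rv_hat + (1 - w) * mu_R

# A weakly constrained object (sigma_obs = 5) is pulled to mu_R ...
weak = shrunk_mean(rv_hat=4.0, sigma_obs=5.0)
# ... while a well-measured one (sigma_obs = 0.1) stays near its estimate.
strong = shrunk_mean(rv_hat=4.0, sigma_obs=0.1)
```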

APPLICATION TO REAL DATA
We apply the methodology described and validated above to the real light curves from CSP: specifically, the subset of 86 non-peculiar ones identified and analysed by Thorp & Mandel (2022, hereafter TM22). Since this is a low-redshift sample, we use redshifts corrected for peculiar velocity (using the flow model of Carrick et al. (2015), as described in TM22) and thus have separate (fixed) z_s and z_s^c, the former acting to redshift the light curves, while the latter is only used to calculate distances under a fixed cosmological model (flat ΛCDM with Ω_m0 = 0.28 and H_0 = 73.24 km s−1 Mpc−1). As is standard in SN Ia analyses, we account for a ±150 km s−1 uncertainty σ_pec in the peculiar velocity correction by propagating it linearly to magnitudes:

σ_μ^s = (5 / ln 10) × sqrt((σ_pec / c)² + (σ_z^s)²) / z_s^c,

where c is the speed of light, and σ_z^s is the measurement uncertainty (which is small for spectroscopic redshift estimates). This is then included in the BHM as an additional (in quadrature) spread of absolute magnitudes: δM_s ∼ N(0, σ_0² + (σ_μ^s)²). To further match TM22's set-up, we also fix the standard deviation of residual scatter σ_0 = 0.088 and the time offsets t_s = 0 instead of inferring them.
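Numerically, the low-redshift propagation above gives a non-trivial magnitude spread; e.g. at z ≈ 0.01 a 150 km s−1 peculiar-velocity uncertainty contributes roughly 0.1 mag. A sketch (the 5/(z ln 10) prefactor is the standard low-z linearization; the paper's exact expression should be consulted for edge cases):

```python
import numpy as np

C_KM_S = 299_792.458    # speed of light in km/s

def sigma_mu_pec(z_cosmo, sigma_pec=150.0, sigma_z=0.0):
    """Distance-modulus uncertainty from peculiar-velocity and redshift
    measurement errors, linearly propagated (low-z approximation)."""
    sigma_z_tot = np.hypot(sigma_pec / C_KM_S, sigma_z)
    return 5.0 / np.log(10.0) * sigma_z_tot / z_cosmo

# At z = 0.01, 150 km/s is a ~0.1 mag effect:
sig = sigma_mu_pec(0.01)
```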
We present the posterior for μ_R and σ_R in Fig. 6 in comparison with the one from TM22. A full comparison is shown in Fig. 7. The NRE- and HMC-derived posteriors are in good agreement, with about a 1σ offset and similar sizes, as was the case when analysing the simulated data set. Moreover, for completeness, in Appendix A we validate the global-parameter inference on simulated data with parameters randomly drawn from the priors, observing good Bayesian coverage properties, and calibrate the approximate posteriors to construct confidence regions with exact frequentist coverage. Since the NRE is already nearly optimal, this procedure does not lead to results significantly different from those presented in Figs 6 and 7 but serves as reassurance of their correctness.
MNRAS 530, 3881–3896 (2024)

We have presented a fully simulation-based hierarchical analysis of SN Ia light curves that incorporates realistic intrinsic (to each SN) and extrinsic (due to dust properties of the host galaxy) variability. By training a neural network to approximate the likelihood-to-evidence ratio with a training set of simulated light curves based on the Carnegie Supernova Project (CSP), we have derived marginal posteriors for the parameters of the populations of SNae Ia and their hosts: the mean and standard deviation of the dust-law parameters R_V^s, the average optical depth τ, and the residual scatter of SN Ia absolute magnitudes σ_0, and have simultaneously inferred marginally the parameters of all ≈100 SNae Ia. After validating the approach on simulated data, we have analysed the light curves of 86 real SNae Ia observed by the CSP (Krisciunas et al. 2017) and selected by TM22. In both cases, we observe excellent agreement between our SBI results and a baseline likelihood-based analysis as in Thorp & Mandel (2022) and Grayling et al. (2024). Concretely, posteriors for τ and σ_0 from the two methods are in perfect agreement, as are the marginals
for most local parameters (t_s, θ_1^s, A_V^s, δM_s). For the latter, SBI exhibits a slightly bigger uncertainty (by ≈10 per cent). Results for the dust-law parameters R_V^s and their population are also in good agreement, with only a small offset of about 1σ between NRE and HMC. As we illustrate in Figs 4 and 8, R_V^s has a minuscule impact on the data in comparison with the remaining variability, which makes it the hardest parameter to infer and leads to hierarchy-dominated results: i.e. inference of one R_V^s depends on observations of all SNae, regardless of the analysis methodology (likelihood- or simulation-based). In light of this extremely challenging learning task, the neural network exhibits excellent performance, having learned to extract and route the relevant information without access to the full high-dimensional likelihood but solely from training examples.
The precision and accuracy we achieve are largely due to the network architecture we adopt, designed to address the issue that each supernova in a survey has a different number of observations at different times and in different bands (thus, a survey is a tuple: an ordered collection of different-sized objects). Our NN consists of: a single bespoke linear layer embedding each SN individually into a common-dimensional space; a shared fully connected SN post-processing subnetwork applied in parallel to the embeddings of all SNae; and a fully connected summarizer combining the results. It is as expressive and fast to evaluate and train as conventional fully connected networks (taking a few hours to converge with training data generated in ∼30 min, in the same ballpark as highly optimized likelihood codes) but manages to fully extract the relevant information before overfitting.
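The embedding strategy can be sketched in a few lines of numpy: a bespoke linear map per SN (different input sizes, common output dimension) followed by a shared layer applied to the stacked embeddings. Weights here are random for illustration; in the paper they are trained jointly with the ratio estimators.

```python
import numpy as np

rng = np.random.default_rng(3)

d_emb = 32
n_obs = [97, 112, 85]                    # per-SN (flattened) data sizes, illustrative
# One bespoke linear embedder per SN, mapping to the common dimension:
embedders = [rng.normal(size=(d_emb, n)) / np.sqrt(n) for n in n_obs]
data = [rng.normal(size=n) for n in n_obs]

# Different-sized inputs become a single (N_SN, d_emb) tensor ...
stacked = np.stack([W @ d for W, d in zip(embedders, data)])
# ... processed in parallel by a shared layer (here one ReLU layer) ...
shared = np.maximum(stacked @ rng.normal(size=(d_emb, d_emb)), 0.0)
# ... and flattened for input into the summarizer.
summary_input = shared.reshape(-1)
```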
In the present work, we have made a number of simplifying assumptions that do not significantly affect the results in the low-redshift, fairly small-size, high-signal-to-noise case we consider. For example, we employ the simplified instrumental model that summarizes observational uncertainties in a FLUXCALERR; to properly investigate the impact of measurement-related systematics, one must introduce calibration parameters (e.g. ZP_{s,i}) for each data point, which would significantly impact the runtimes of likelihood-based methods. The same applies to the hierarchical parameters we have kept fixed in this study: namely, the MW dust law (and its variation along different lines of sight) and extinction amount, and the total and cosmological redshifts of the supernovae. In contrast, with SBI all extra parameters can be marginalized implicitly by stochastically sampling them in the simulator, or marginally inferred in the same way (and at the same time) as the parameters we have already considered, provided suitable models exist: e.g. Zhang, Yuan & Chen (2023) for MW dust and Rahman et al. (2022) for redshifts induced by peculiar velocities consistent with models of the galactic bulk flow.
Along the same line of thought, one can also consider training the SN Ia template used to simulate light curves (represented in BayeSN through W_k and the residual covariance) together with the properties of individual SNae Ia (notably, distances used for cosmology). Even though this is the main reason for using a hierarchical model, for BayeSN this is yet to be attempted because of subtleties related to selection effects. Likelihood-based analyses are thus typically split in two stages, with the second one (cosmological inference) using a fixed mean/median SN Ia spectral energy distribution (SED) model. To SBI, on the other hand, W_k and the residual covariance are just another set of global parameters implicitly marginalized in the training data. Sampling them from the very general priors used by M22 (and successors) will introduce tremendous and difficult-to-handle variability in the training data, which, however, can be remedied through high-dimensional truncated SBI (see, e.g. Anau Montel, Alvey & Weniger 2023; List, Anau Montel & Weniger 2023). Furthermore, SBI opens up the possibility of using implicit (data-driven) priors and/or SN templates implemented as neural generative models trained (or fine-tuned) simultaneously with the inference, as recently done by Alsing et al. (2024) for galaxy photometry.
Beyond individual objects, numerous qualitative enhancements of the population modelling of SNae Ia and their hosts have been explored in the literature. These follow two main threads: considering correlations of SN properties with those of their hosts, and allowing for their evolution with time (with redshift used as a proxy). Due to the high-dimensional nature of the required analysis, confronting the different models using likelihood-based pipelines has been cumbersome and reliant mainly on visual inspection of the results of parameter inference, e.g. comparing the dust-law distributions of high- and low-mass galaxies or examining trends of inferred properties with redshift. SBI, however, provides an avenue towards principled Bayesian model comparison by giving direct access to marginalized model probabilities (and Bayes factors) even for models with thousands of dimensions like the one in this study, as recently demonstrated by Karchev et al. (2023a). Furthermore, amortization allows for exploring the results as a function of the values of the underlying parameters on mock data, something unthinkable with likelihood-based methods.
Selection effects and non-Ia contamination thus remain the two major hurdles before a fully-fledged application of SBI to SN Ia cosmology becomes viable. Accounting for them requires simulating the way transients are identified and classified into a survey release, leading to different-sized surveys in the training set and transients observed in different time/band configurations. One might imagine two ways to circumvent these problems. One is to take a step back in the data-processing pipeline, simulating, and subsequently feeding to the neural ratio estimator, raw telescope images, which have a set number, order, and characteristics after the survey has been performed (see, e.g. Sánchez et al. (2022) for a similar approach but using the conventional SN cosmology pipeline). This is equivalent to padding the light curves, which would be the standard approach to unequal-length sequences in machine learning, and including light curves for undetected objects. The downside is clear: the network must learn selection effects and contamination from a prohibitively large amount of mostly uninformative data. Alternatively, one might still try to condition the simulator on the observed light curves of detected objects by smartly modifying the hierarchical priors. However, this might cause issues when many objects are considered, as we discuss in Karchev et al. (in preparation).
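The padding alternative mentioned above is the standard machine-learning treatment of unequal-length sequences: bring every light curve to a common length with a sentinel value and record a validity mask. A minimal sketch:

```python
import numpy as np

def pad_light_curves(fluxes, pad_value=0.0):
    """Pad a list of unequal-length flux arrays to a common length and
    return a boolean mask marking the real (unpadded) entries."""
    max_len = max(len(f) for f in fluxes)
    padded = np.full((len(fluxes), max_len), pad_value)
    mask = np.zeros((len(fluxes), max_len), dtype=bool)
    for i, f in enumerate(fluxes):
        padded[i, : len(f)] = f
        mask[i, : len(f)] = True
    return padded, mask

padded, mask = pad_light_curves([np.array([1.0, 2.0]),
                                 np.array([3.0, 4.0, 5.0])])
```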
In light of this, our current approach of conditioning on the number and order of SNae in a survey, and on the number and order of observations of each SN, has limited prospects of solving selection effects. In upcoming work, we will demonstrate cosmological inference in the presence of selection effects and variable-length data by using tools that have already been exploited in the SN literature: e.g. Gaussian process regression for regularizing the light curves (e.g. Revsbech et al. 2018; Boone 2019), in combination with cutting-edge techniques for permutation-invariant SBI (Rodrigues et al. 2021; Campeau-Poirier et al. 2023; Heinrich et al. 2024; Makinen, Alsing & Wandelt 2023).
We make use of Clipppy,16 a Python package based on pyro (Bingham et al. 2019)

Figure 1 .
Figure 1. Graphical depiction of the model, in which parameters to be inferred are represented by shaded squares, the data by a shaded circle, and deterministic components and fixed variables by unshaded nodes. The distributions of SN-local variables (those inside the 'SN' plate) are in dashed boxes, while those for the global parameters are omitted for clarity. See Fig. 2 for an elaboration of the 'instrument' node and Table 1 for a tabular representation of the model.
model that includes parameters of interest Υ and 'nuisance' parameters η, in which we are generally not interested and which we would like to marginalize when analysing data d. That is, we would like to obtain the marginal posterior

p(Υ | d) = ∫ p(Υ, η | d) dη = ∫ p(d | Υ, η) p(Υ, η) dη / p(d).   (12)

Marginal neural ratio estimation uses forward simulations to build a training set with two classes of (Υ, d) pairs: either sampled from the joint distribution p(Υ, d), or from the product of marginals p(Υ) p(d). The former are obtained by first running the full joint model p(Υ, η, d) and simply disregarding η, while for the latter, respectively, d or Υ are further disregarded to have Υ ∼ p(Υ) and d ∼ p(d). A neural network r̂(
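The two classes of pairs are easy to construct in practice: joint pairs come straight from the simulator, and marginal pairs from shuffling the simulated data across the batch, which breaks the parameter-data association while preserving both marginals. The classifier is then trained with binary cross-entropy on its logit, which at the optimum equals log r̂. A toy sketch (the 'simulator' here is a one-line stand-in):

```python
import numpy as np

rng = np.random.default_rng(4)

theta = rng.normal(size=1000)                 # theta ~ p(theta)
d = theta + 0.1 * rng.normal(size=1000)       # toy simulator d ~ p(d | theta)

joint = np.stack([theta, d], axis=1)                       # class label 1
marginal = np.stack([theta, rng.permutation(d)], axis=1)   # class label 0

def bce_loss(logits, labels):
    """Binary cross-entropy on the classifier logit (= log r-hat)."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
```

An uninformative classifier (zero logits everywhere) attains a loss of ln 2, the usual baseline to beat during training.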

Figure 2 .
Figure 2. Two kinds of instrument, as discussed in Section 2.5. In this work, we only use the 'FLUXCAL' simplification from Fig. 2(b) but include Fig. 2(a) as a more principled approach we will adopt in the future.

Figure 4 .
Figure 4. The mock light curves that we generate and analyse, corrected for cosmological distance (but not for redshift). While we work entirely in linear (flux) scale, for presentation purposes, this figure is in magnitudes. Each column shows the same light curves but coloured according to a different SN-local variable, as indicated on the top. Each row is a different CSP band: from bluest (top) to (infra)reddest (bottom). Different SNae might have observations in different sets of the bands. All plots have the same scale and limits: notice that the diversity in redder bands is smaller, owing partly to the weaker effect of dust extinction.

Figure 5 .
Figure 5. Inference results from the mock data set. Top: moments of the marginal posteriors of the local parameters of the N_SN supernovae, as indicated in the top-left corner of each plot. Means (standard deviations) are shown in teal (ochre), with scale indicated below (above) the plot. The abscissa (ordinate) coordinate comes from the HMC (NRE) posterior, so that the diagonal indicates matching moments from the two methods. Middle: the same per-object marginal posteriors (mean ±1 standard deviation) plotted against the true values in the simulation. Only every third error bar is plotted for clarity. Bottom: posterior densities (in the two-dimensional plot, 1- and 2σ credible regions) for the global parameters, as inferred by HMC and NRE, compared with the prior density and the true value used to simulate the mock data. Shaded regions indicate the truncation used for re-training the μ_R–σ_R NRE posterior depicted in the inset (the estimators for τ and σ_0 were not re-trained).

Figure 6 .
Figure 6. Comparison of marginal dust-population posteriors (1- and 2σ credible regions) from the subset of 86 SNae Ia analysed in TM22 (assuming no split according to host mass: upper left in fig. 8 of TM22). The small offset is comparable to the one observed with mock data (Fig. 5) and is due to the minuscule effect that varying R_V^s has on the data (see Figs 4 and 8) and the hierarchical nature of μ_R–σ_R inference, which makes the problem particularly challenging.

Figure 8 .
Figure 8. Variations in rest-frame B and H absolute magnitudes at phase 0 (around maximum), as simulated with the BayeSN trained by M20, induced by varying each of the free local parameters according to its fiducial hierarchical prior, with respect to a reference value with δM = 0, θ_1 = 0, A_V = 0.1, R_V = 3, and vanishing residual variations. Numbers along the bottom specify the standard deviation for the two bands.
and PyTorch (Paszke et al. 2019) for the probabilistic part of our forward simulator, and PyTorch Lightning (Falcon & The PyTorch Lightning team 2023) for training the inference network.

ACKNOWLEDGEMENTS

RT acknowledges co-funding from Next Generation EU, in the context of the National Recovery and Resilience Plan, Investment PE1 - 'Project FAIR Future Artificial Intelligence Research'. This resource was co-financed by the Next Generation EU [DM 1555 del 11.10.22]. RT is partially supported by the Fondazione ICSC, Spoke 3 'Astrophysics and Cosmos Observations', Piano Nazionale di Ripresa e Resilienza Project ID CN00000013 'Italian Research Center on High-Performance Computing, Big Data and Quantum Computing' funded by MUR Missione 4 Componente 2 Investimento 1.4: Potenziamento strutture di ricerca e creazione di 'campioni nazionali di R&S (M4C2-19)' - Next Generation EU (NGEU). MG and KSM are supported by the European Union's Horizon 2020 research and innovation programme under ERC Grant Agreement No. 101002652 and Marie Sklodowska-Curie Grant Agreement No. 873089. BMB is supported by the Cambridge Centre for Doctoral Training in Data-Intensive Science funded by the UK Science and Technology Facilities Council (STFC). CW has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (Grant agreement No. 864035). Part of this work was performed using resources provided by the Cambridge Service for Data Driven Discovery (CSD3) operated by the University of Cambridge Research Computing Service (www.csd3.cam.ac.uk), provided by Dell EMC and Intel using Tier-2 funding from the Engineering and Physical Sciences Research Council (capital grant EP/T022159/1), and DiRAC funding from the Science and Technology Facilities Council (www.dirac.ac.uk).

Figure A2 .
Figure A2. Calibrated frequentist global-parameter inference from the real CSP data set. As purple lines and coloured surfaces, we show the required credibility γ̂ that achieves 1- or 2σ coverage (corresponding to 68.3 and 95.4 per cent in one dimension and ≈39 and ≈86 per cent in two dimensions), as a function of the parameter value. Everywhere, the colour axis represents the difference between the required credibility and the confidence level (empirical coverage). In the top row, using black lines, we depict nominal credibility from the NRE posterior evaluated on the observed data, which is used to derive calibrated 1- and 2σ confidence regions, filled in purple. The difference with the approximate posterior is insignificant, but the confidence region thus constructed has guaranteed coverage.

Table 1 .
SN Ia parameters, (hierarchical) priors, and values used to generate mock data. For local parameters, the support and size of the sampled 'vector' are listed. See also Fig. 1 for a graphical representation of the model.