Albatross: A scalable simulation-based inference pipeline for analysing stellar streams in the Milky Way

Stellar streams are potentially a very sensitive observational probe of galactic astrophysics, as well as of the dark matter population in the Milky Way. On the other hand, performing a detailed, high-fidelity statistical analysis of these objects is challenging for a number of key reasons. Firstly, the modelling of streams across their (potentially billions of years long) dynamical age is complex and computationally costly. Secondly, their detection and classification in large surveys such as Gaia makes a robust statistical description of, e.g., the stellar membership probabilities challenging. As a result, the majority of current analyses must resort to simplified models that use only subsets or summaries of the high quality data. In this work, we develop a new analysis framework that takes advantage of advances in simulation-based inference techniques to perform a complete analysis of complex stream models. To facilitate this, we develop a new, modular dynamical modelling code sstrax for stellar streams that is highly accelerated using jax. We test our analysis pipeline on a mock observation that resembles the GD1 stream, and demonstrate that we can perform robust inference on all relevant parts of the stream model simultaneously. Finally, we present some outlook as to how this approach can be developed further to perform more complete and accurate statistical analyses of current and future data.

Observational Status. In the early 2000s, the first observations of cold (and hot) stellar streams in the Milky Way were obtained by the SDSS survey [34], the most well-known of which is the GD1 stream [3–7]. Since then, many more streams have been discovered in surveys such as SDSS and Gaia [31, 32, 34–37], but perhaps more importantly, the resolution of the observations has improved dramatically. Current observations have revealed, for example, interesting substructure and features in cold stellar streams such as GD1 [6, 7, 38–40]. This is the context in which we want to consistently analyse both the large scale structure of a stream (such as its location on the sky and its track), and the small scale structure that is sensitive to e.g. the dynamics and details of the tidal stripping process, or baryonic/dark matter interactions.
Past Analyses. There are a number of relevant aspects to analysing stellar streams: stream modelling, inference, and observations. Since the main focus of this work is the statistical analysis of streams, we will briefly review its current status and the corresponding claims, although we will return to computational models for streams when we describe our dynamics code below. Previous analyses have typically focused on either the global structure of the stream, see e.g. Refs. [1, 16–19, 21, 29, 41–43] (where it is on the sky and fits to the general stream track), or construct some sort of summary statistic to study perturbations in the stellar density along the stream, see e.g. [5, 7–12, 14, 20, 22, 23, 28, 44].¹ The former of these analysis methodologies is well suited to studying properties and phenomena that are specific to the orbit and evolution of a given stream. For example, one can constrain quantities such as the Milky Way potential [1, 17–19, 21, 23, 41, 42, 46–48], the age of the stream [14,22], or possibly even gain information about close encounters with large perturbers, which can leave large gaps or features in the stream track [7,8,11,12,20,22]. The classic examples that are often quoted in the literature along these latter lines are the so-called "spur" and "gaps" in the GD1 stream [5,6,28,44]. On the other hand, the sub-structure of the stream is better suited to questions about e.g. the physics of tidal stripping mechanisms in the Milky Way, see e.g. [49–53], the internal dynamics and nature of the progenitor, and population-level information about smaller (or more distant) perturbers [20,26,28,29,43]. From the perspective of the dark matter community, both the large and small perturbing objects are of huge significance in the context of the distribution of dark matter subhalos in the Milky Way (and other galaxies). Indeed, one of the key goals of stellar stream analyses is to constrain possible subhalo populations [9–12, 14, 16, 54], or provide a detection of some larger mass (say, 10^7 M_⊙) subhalo [8,44]. The main motivation behind our work is to provide a path towards a robust analysis pipeline to consistently (and simultaneously) analyse all of the above scenarios.

FIG. 1. A schematic illustration of the data analysis pipeline developed in this work. We use the TMNRE algorithm (see Sec. III) to carry out parameter inference on Milky Way stellar streams (see Sec. IV), using our new modelling code sstrax (see Sec. II). We also publicly release the albatross analysis code.

¹ In this regard, the case of GD1 is interesting since there is some evidence that the observed density variations exhibit periodicity along the stream track consistent with the well-known epicyclic variations. See e.g. Fig. 14 in [45], which constructs the power spectrum as a function of wavenumber along the stream track and highlights a clear peak at k_s^−1 ≃ 2.64 kpc.

Statistical Challenge.
Making statistically robust statements about quantities of interest - the gravitational potential of the host, the disruption history, the internal dynamics of the progenitor, etc. - can be extremely challenging [14,25,55]. To do so requires us to have good control over the dynamical history and initial conditions of the stream [42, 56–63], its stochastic interactions with dark matter or baryonic substructures [22,54,64], as well as a reasonable model for foreground and selection effects in the observations, see e.g. [55]. As a result of the large number of free parameters this can introduce, together with relatively costly simulations, classical statistical methods scale quite poorly. Currently, this means that one must instead rely on constructing bespoke summary statistics such as the power spectrum of density perturbations along the stream, significantly reducing the dimensionality of the data via e.g. only considering the stream track, or ignoring a subset of effects in the modelling to lower the simulation overhead. This approach has been used to obtain relevant results regarding e.g. the properties of the Milky Way potential [1, 17–19, 23–25, 65], or the evolution history of progenitors [26–30]. In this paper, we propose using the modern tools and techniques of simulation-based inference [66,67] to analyse stellar streams and overcome some of these challenges.
Simulation Based Inference. Given the context described above, we have briefly argued that the analysis of stellar streams is a problem well-suited to the application of simulation-based inference (SBI) [66,67]. Currently, there is a wide range of available approaches and implementations that have been shown to be successful in a number of settings, such as CMB data analysis [68], point source searches [69], gravitational wave inference [70], and others, see e.g. [14, 71–74]. In general, the advantages of simulation-based inference techniques fall into three categories: (i) a consistent inference methodology for any forward simulator, irrespective of the complexity, stochasticity, or data dimensionality of the model, (ii) the possibility of extremely simulation-efficient inference compared to traditional methods, and (iii) the methods do not require an explicit likelihood to be written down, allowing for arbitrarily detailed physics simulations and observational/detection models. The last point is a particularly interesting prospect for stellar streams, as it opens the possibility to significantly improve the modelling and to investigate the implications of e.g. selection effects, observation strategies, and instrument errors. This could have important implications for inference results based on e.g. small-scale structure in the observed streams, or concrete features such as the GD1 spur and gaps [5,6,40,44].
Key contributions. This work contributes in a number of ways to the problems and analysis challenges identified above. Firstly, and most importantly, we develop and test a brand new analysis pipeline that leverages recent advances in simulation-based inference. We argue that the use of simulation-based inference to study stellar streams is motivated for a number of reasons. In particular, it allows one to make use of the highest fidelity modelling and observational models via the fact that it is an implicit-likelihood framework. It has also been shown in numerous settings to be highly simulation-efficient [68] compared to more traditional methods such as Markov Chain Monte Carlo (MCMC) [75,76]. This is of high relevance to the analysis of streams, since modelling the complex and varied physics can be computationally costly, typically making posterior sampling for high-dimensional models infeasible. One way we overcome this in the current work is to use a specific targeted (in the sense of analysing a particular observation) simulation-based inference algorithm known as Truncated Marginal Neural Ratio Estimation (TMNRE) [78], implemented within the framework of swyft [78,79]. Secondly, we have also developed and will release a public code called sstrax for the modelling of stellar streams in the Milky Way. The current version of the code is designed to be highly modular and extendable in any aspect of stream modelling (e.g. the gravitational dynamics or tidal stripping). It is written in python but is highly accelerated through the use of jax [80], allowing for fast (O(1) s) sampling of realistic forward models. This speed is crucial for performing inference on high-dimensional models. Our implementation of the TMNRE algorithm, coupled to the sstrax modelling code, will also be made publicly available in the package albatross.
Structure of the work. The rest of this work is structured as follows: In Sec. II we describe the physics behind the forward modelling of stellar streams, and highlight our numerical implementation in sstrax. Then, in Sec. III, we describe the use of simulation-based inference for studying and analysing stellar streams, including a detailed explanation of the TMNRE algorithm. In Sec. IV, we demonstrate that our analysis pipeline can reliably perform parameter inference on all of the parameters in our forward model, and discuss the sort of validation tests we can perform on the resulting posteriors. Finally, in Sec. V, we present the key conclusions of the study, as well as some outlook as to the relevant use cases and data analysis challenges.

II. MODELLING STELLAR STREAMS
Arguably one of the most challenging aspects of analysing stellar streams is balancing the complexity of the modelling with the ability to do full parameter inference without resorting to e.g. fixing a number of parameters. One of the key arguments we will make later in this work is that simulation-based inference can be a path towards a highly sample-efficient analysis framework [68]. This opens up the possibility of using higher fidelity forward models for the dynamics and observation of stellar streams. It is for this reason that we decided to simultaneously develop and test a new modelling code for stellar streams, sstrax, that is modular and designed to be extendable in all aspects, with the aim of moving towards highly realistic stream modelling for sampling tasks. For the purposes of this work, we have developed what we believe is a simulator that contains all the key elements for a robust proof-of-principle inference analysis. It will highlight the fact that the analysis and inference pipeline that we develop in later sections is not reliant on particularly symmetric or statistically simple (e.g. at the level of the data likelihood) models. We do note, however, that as far as the analysis methodology is concerned, any forward model could be used (introducing its own set of modelling assumptions, of course), including e.g. the current state-of-the-art models developed in galpy [56,63] or other works [22,42,54,61,64].
In this section, we describe the key components of our modelling code, and discuss in each case some relevant improvements that could be made. The generation of a single stream is split broadly into five steps:

1. Cluster Trajectory. Given some current position x_c and velocity v_c for the disrupted cluster, we trace the trajectory back for some time t_age in the relevant gravitational potential to find the initial conditions.

2. Cluster Mass Loss. We then solve an equation for the evolution of the mass of the cluster M_c(t), due to e.g. tidal disruption events, given its trajectory from Step 1, the gravitational potential, and choices for the parameters in the mass loss model.

3. Star Stripping Times. Given this mass loss history, we can then generate a set of stripping times {t_i}_{i=1..N_stars} for stars released from the cluster. These are chosen as a random sample from a probability distribution that is a normalised version of (−dM_c/dt).

4. Stream Star Evolution. For each stripping time t_i, we generate initial conditions for a star released from the cluster and evolve the star forward in the gravitational potential for a time (t_age − t_i) before noting its final position and velocity.

5. Observation. Given the full set of stream stars, we construct an observation by projecting to a co-ordinate frame relevant for the stream and accounting for errors in the measurements of e.g. the positions and proper motions of the stream stars. We also account for possible background contamination and misidentification that may occur when applying selection cuts.

We will discuss each step in detail below. A concrete example of each step of the analysis process is shown in Fig. 2, along with the mock observation used later in the case study.
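The data flow of these five steps can be sketched end-to-end. The following is a deliberately minimal stand-in, not sstrax itself: the function names and the physics inside them (free motion instead of orbit integration, exponential decay instead of the mass loss ODE) are placeholders chosen only to make the structure of the pipeline concrete.

```python
import numpy as np

# Toy stand-ins for the five sstrax stages (hypothetical names and physics).

def trace_back(xc, vc, t_age):
    # Step 1: trace the cluster back to t = -t_age (free motion stands in
    # for integration in the Milky Way potential)
    return xc - vc * t_age, vc

def mass_history(M0, t_age, xi=1e-3):
    # Step 2: exponential decay as a placeholder for the mass loss ODE
    t = np.linspace(-t_age, 0.0, 512)
    return t, M0 * np.exp(-xi * (t + t_age))

def stripping_times(t, M, n_stars, rng):
    # Step 3: inverse-CDF sampling from the normalised rate (-dM/dt)
    rate = -np.gradient(M, t)
    cdf = np.cumsum(rate)
    cdf /= cdf[-1]
    return np.interp(rng.uniform(size=n_stars), cdf, t)

def evolve_star(x_rel, v_rel, t_strip):
    # Step 4: free evolution from the stripping time until today (t = 0)
    return x_rel + v_rel * (0.0 - t_strip)

def observe(positions, err, rng):
    # Step 5: Gaussian measurement errors on the final configuration
    return positions + err * rng.normal(size=positions.shape)

rng = np.random.default_rng(0)
t_age = 1000.0
x0, v0 = trace_back(np.array([10.0, 0.0, 0.0]), np.array([0.0, 0.2, 0.0]), t_age)
t, M = mass_history(1e4, t_age)
t_strip = stripping_times(t, M, 100, rng)
stars = np.stack([
    evolve_star(x0 + v0 * (ti + t_age), v0, ti)  # release at cluster position
    for ti in t_strip
])
data = observe(stars, 0.1, rng)
```

In the real pipeline, each stage is swapped out for the corresponding sstrax component described in the subsections below, while the overall data flow stays the same.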

II.1. Cluster Trajectory
The first step in the modelling is to take the cluster position⁴ x_c = (x_c, y_c, z_c) and velocity v_c = (v_{x,c}, v_{y,c}, v_{z,c}) at time t = 0 (today) and construct the trajectory for all times t ∈ [−t_age, 0]. In other words, we project the current position and velocity backwards to find the initial conditions of the cluster a time t_age ago. To do so, we need to know the gravitational potential Φ(x, t) and solve the equation ẍ_c(t) = −∇Φ(x_c, t). In terms of implementation, we use the publicly available diffrax differential equation solver library [81], written in jax [80].

In principle, the gravitational potential Φ(x, t) can include all contributions from, e.g., the Milky Way dark matter halo, baryonic structures, dark matter subhalos, dynamical clusters, or dwarf galaxies. In this work, we restrict our attention to a fixed, time-independent Milky Way potential Φ = Φ_MW(x), which consists of a dark matter halo, and baryonic disc and bulge components. Specifically, we choose the MWPotential2014 implementation in galpy [63], whose parameters are given in Tab. 1 of Ref. [63]. We note, however, that it is trivial to include arbitrarily complex potentials in our modelling framework. One reason for this is that we do not need to analytically construct action-angle co-ordinates (although in principle, this could be done numerically, see e.g. [82]). Instead, we take advantage of jax-accelerated differential equation solvers to efficiently evolve the cluster and stars. One should, however, check the level of impact of mild to strong mismodelling in this regard, e.g. if the true potential is not exactly the one with which the simulations are generated. With this choice, it is also simple to include time-dependent potentials that would arise from either the evolution of the Milky Way itself [57,59,83], or through interactions with dynamical objects such as dark matter subhalos or dwarf galaxies [5,7,8,28,44]. These can be modelled without any additional approximations, and are represented simply by an additional term in the gravitational force. It would also be straightforward to let the parameters in the Milky Way potential vary and constrain them at the same time as the other model parameters.

⁴ All of our dynamical modelling is performed in a Cartesian co-ordinate frame (x, y, z) with its origin at the galactic centre.
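As a concrete sketch of this trace-back step, the following integrates ẍ = −∇Φ backwards in time and checks the round trip. It uses scipy in place of the diffrax/jax solvers used in sstrax, and a simple spherical logarithmic halo with illustrative parameter values in place of MWPotential2014.

```python
import numpy as np
from scipy.integrate import solve_ivp

V0, RC = 0.22, 1.0  # illustrative circular velocity and core radius

def grad_phi(x):
    # Analytic gradient of Phi(x) = 0.5 V0^2 log(RC^2 + |x|^2)
    return V0**2 * x / (RC**2 + np.dot(x, x))

def rhs(t, state):
    # First-order system: x' = v, v' = -grad(Phi)
    x, v = state[:3], state[3:]
    return np.concatenate([v, -grad_phi(x)])

def integrate(x0, v0, t_span):
    sol = solve_ivp(rhs, t_span, np.concatenate([x0, v0]),
                    rtol=1e-10, atol=1e-12)
    return sol.y[:3, -1], sol.y[3:, -1]

# Trace the cluster back from today (t = 0) to t = -t_age ...
xc = np.array([8.0, 0.0, 0.0])
vc = np.array([0.0, 0.22, 0.0])
x_init, v_init = integrate(xc, vc, (0.0, -100.0))
# ... and verify the round trip: evolving forward again recovers (xc, vc)
x_back, v_back = integrate(x_init, v_init, (-100.0, 0.0))
```

The round-trip check is a useful sanity test for any solver tolerance settings, since the backward and forward integrations must be consistent for the stream generation to make sense.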

II.2. Cluster Mass Loss
Once we have the trajectory of the cluster x_c(t), we want to solve for the evolution of its mass as a function of time, M_c(t). This mass loss typically occurs for a number of reasons, due to, e.g., disruption as a result of tidal forces, stellar evolution, or dissolution, see e.g. [49–53].
Nonetheless, the vast majority of semi-analytic mass loss models take the generic form dM_c/dt = −f(M_c, r_t)/τ_orb [84–86], where r_t is the instantaneous tidal radius, f is some model-dependent function of the cluster mass and tidal radius, and τ_orb is some characteristic timescale. In this initial implementation of sstrax, we choose to work with a semi-analytic model of this type, given in Ref. [49], where ξ_0 and α are dimensionless parameters that in initial works were fitted to N-body simulations, r_h is the half-mass radius of the cluster, and t_rh is the relaxation time, also given in Ref. [49]. In this expression, m̄ is the average mass of a star in the cluster, and N(t) = M_c(t)/m̄ is the total number of stars in the cluster. In addition, r_t is the tidal radius, which is computed using [42]

r_t = (G M_c / (Ω^2 − ∂^2Φ/∂r^2))^(1/3),

where Ω is the instantaneous angular frequency of the cluster around the galactic centre, r = |x|, and we compute the second derivative of the potential using the auto-differentiation capabilities of jax. This model quantitatively reproduces interesting features of the mass loss, such as the fact that more stars should be stripped near the pericentre of the orbit, which can introduce density variations that are entirely separate from e.g. epicycles in the stream evolution [45,58,60]. Again, as in the case of the gravitational potential, this mass loss model can be improved, either by generalising its form, or through a modern calibration to high-resolution N-body simulations of cluster evolution [30, 50, 87–90]. Of course, there is a "gold standard" approach, which would be to perform an N-body evolution in every simulation. However, we do not expect the cost of this type of computation to drop significantly enough for it to become viable in parameter inference. As such, in any inference analysis, one will almost certainly have to resort to a semi-analytic form.
At the level of implementation, we solve this mass loss differential equation numerically using diffrax [81], taking the (densely interpolated) cluster trajectory solution x_c(t) and the initial cluster mass M_sat as input. As mentioned above, since we directly forward model the mass loss, the code can be modified to use any form of dM_c/dt, including e.g. contributions due to impacts with subhalos or other transient interactions [5,7,8,28,44].
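The tidal radius computation can be illustrated as follows. We use a central finite difference as a stand-in for the jax auto-differentiation, and check the result against the classical Jacobi radius r_t = r (M_c/3M_gal)^(1/3) for a point-mass host on a circular orbit (all parameter values are illustrative).

```python
import numpy as np

G = 1.0  # illustrative units

def tidal_radius(Mc, omega2, d2phi_dr2):
    # r_t = (G Mc / (Omega^2 - d^2 Phi / dr^2))^(1/3)
    return (G * Mc / (omega2 - d2phi_dr2)) ** (1.0 / 3.0)

def second_derivative(f, r, h=1e-4):
    # Central finite difference, standing in for applying jax.grad twice
    return (f(r + h) - 2.0 * f(r) + f(r - h)) / h**2

# Point-mass host: on a circular orbit Omega^2 = G M_gal / r^3, and the
# formula above reduces to the Jacobi radius r (Mc / 3 M_gal)^(1/3)
M_gal, Mc, r = 1.0e6, 1.0e2, 8.0
phi = lambda rr: -G * M_gal / rr
omega2 = G * M_gal / r**3
rt = tidal_radius(Mc, omega2, second_derivative(phi, r))
rt_jacobi = r * (Mc / (3.0 * M_gal)) ** (1.0 / 3.0)
```

Having this closed-form limit available makes it a convenient unit test for any implementation of the tidal radius, whatever differentiation scheme is used.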

II.3. Stripping Times
Once we have obtained the cluster mass M_c(t) as a function of time t, we want to stochastically generate a set of stripping times {t_i}. These times define the moments at which the stars that will ultimately form the final stream are released. To go from the cluster mass to stripping times, we identify (−dM_c(t)/dt) as the instantaneous stripping rate. We could model this faithfully as an inhomogeneous Poisson process; however, a simple approximate scheme, which we outline below, is sufficient for our purposes.
First, we introduce the average mass of a star, m̄, as a new parameter, although we note that this is a somewhat toy simulation parameter, since real systems are known not to have monochromatic mass functions. We then compute the total number of stars that should be in the final stream as N_stars = ∆M/m̄, where ∆M = (M_sat − M_c(t = 0)) is the total mass loss of the cluster. Now, each of the N_stars stripping times can be sampled individually according to the distribution (−dM_c(t)/dt)/∆M. This can be done in a number of ways, but in sstrax we choose to construct the cumulative distribution function and sample uniformly from U[0, 1] before projecting back to t-space.
Note that this scheme does not explicitly use a distribution over star masses. Nevertheless, if one were to compute the differences of the heuristic cluster mass function between stripping times, one would obtain some mass distribution clustered around m̄. The important point to realise is that we already make an approximation in using a continuous cluster mass M_c(t). If the particular distribution of star masses becomes relevant, both the stripping and the cluster mass modelling would have to be replaced by a more realistic framework. This could be achieved, for example, by sampling the next star mass m_s ∼ p(m_s) [91], treating the cluster mass loss in Eq. (2) as the instantaneous event frequency of an inhomogeneous Poisson process, and only reducing the cluster mass by discrete steps m_s whenever a star is released. It would be interesting to explore the implications of these two effects (i.e. star mass distributions and modelling the cluster mass via an inhomogeneous Poisson process instead of a continuous function) in the context of limits derived from density variations along stream tracks, see e.g. [9,12,14]. We have also assumed that the cluster system is collisionless, whereas in some systems, populations of e.g. dark masses towards their centre (see e.g. [92]) or the accretion of halo clusters (see [93,94]) could break this assumption and should be modelled properly before targeting real data.
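A minimal sketch of the more faithful alternative discussed above, i.e. treating star release as an inhomogeneous Poisson process, is the standard Lewis-Shedler thinning algorithm. The intensity function below is an arbitrary placeholder rather than the sstrax stripping rate.

```python
import numpy as np

def thinned_release_times(rate, rate_max, t0, t1, rng):
    # Lewis-Shedler thinning: propose events from a homogeneous process at
    # rate_max, and accept each with probability rate(t) / rate_max. This
    # requires rate(t) <= rate_max on [t0, t1].
    times, t = [], t0
    while True:
        t += rng.exponential(1.0 / rate_max)
        if t > t1:
            return np.array(times)
        if rng.uniform() < rate(t) / rate_max:
            times.append(t)

rng = np.random.default_rng(0)
rate = lambda t: 1.0 + np.sin(t)  # placeholder intensity, bounded by 2
times = thinned_release_times(rate, 2.0, 0.0, 100.0, rng)
```

In the stream context, rate(t) would be replaced by the instantaneous event frequency derived from the mass loss model, with the cluster mass reduced by a sampled star mass at each accepted event.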
Finally, note that our stripping process is stochastic and can therefore lead to different realisations of the density profile along the stream track if the same stream is generated multiple times.

II.4. Stream Star Evolution
The final dynamical step for generating the stream is quite simple: we just need to release stars from near the cluster at the times t_i and evolve them forward in the same gravitational potential Φ(x, t) as the cluster until today, t = 0. The only choice left to be made concerns the initial conditions for the stars, which we choose in accordance with observations made in N-body simulations of tidally disrupted clusters [50, 87–90]. It has been shown that the majority of stars escape from near one of the two Lagrange points x_{1,2} = (1 ± (r_t/r)) x_c of the cluster [42,77] (one on either side of the radial line joining the galactic centre and the cluster centre), where r = |x_c|.
In the sstrax implementation, we generalise this slightly and introduce three additional parameters: λ_rel, λ_match, and p_near. Respectively, these describe how far away from the cluster the star is released, i.e. x_rel = (1 ± λ_rel (r_t/r)) x_c(t_i); at what distance the velocity matching is done (specifically, the velocity is matched so that the angular velocity of the star and the cluster agree at a distance x_match = (1 ± λ_match (r_t/r)) x_c(t_i)); and finally the probability p_near of being released from the closer Lagrange point. To model the velocity dispersion of the cluster itself, we choose the initial velocity of the star to be this matching velocity plus an additional random vector ∆v sampled on the unit sphere and rescaled by a factor √3 σ_v, where σ_v is the velocity dispersion. In much the same way as for the mass loss model, the most realistic way to model this process would be to account for the full dynamics inside the cluster via some N-body approach. For the same reason, however, this is still too costly for parameter inference tasks, so a semi-analytic approach like the one above needs to be used. Again, and in line with the prescription we chose for the mass loss, since we directly forward model the evolution of the stars, the generation of these initial conditions can be tuned arbitrarily to either analytic expectations, or to new high-resolution simulations. In any case, the analysis pipeline will remain the same.
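The release prescription above can be sketched as follows. The sign convention (taking the "near" Lagrange point to be the one towards the galactic centre) and the use of a pure angular-velocity match are our reading of the text, so treat these details as assumptions rather than the exact sstrax implementation.

```python
import numpy as np

def release_star(xc, vc, rt, lam_rel, lam_match, p_near, sigma_v, rng):
    r = np.linalg.norm(xc)
    # Choose the near (-) or far (+) Lagrange point with probability p_near
    sign = -1.0 if rng.uniform() < p_near else 1.0
    # Release position along the radial line through the cluster
    x_rel = (1.0 + sign * lam_rel * rt / r) * xc
    # Match the angular velocity of the cluster at the matching radius
    omega = np.cross(xc, vc) / r**2
    x_match = (1.0 + sign * lam_match * rt / r) * xc
    v_match = np.cross(omega, x_match)
    # Isotropic kick of fixed magnitude sqrt(3) sigma_v for the dispersion
    dv = rng.normal(size=3)
    dv *= np.sqrt(3.0) * sigma_v / np.linalg.norm(dv)
    return x_rel, v_match + dv

rng = np.random.default_rng(0)
xc = np.array([8.0, 0.0, 0.0])
vc = np.array([0.0, 0.2, 0.0])
# sigma_v = 0 so the output is the deterministic matched velocity
x_rel, v_star = release_star(xc, vc, 0.5, 1.0, 1.0, 1.0, 0.0, rng)
```

By construction, the released star then shares the cluster's instantaneous angular velocity at the matching radius, before the dispersion kick is added.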

II.5. Observational Model
An important aspect of simulation-based inference approaches is that the forward model must also include the detector response, observational model, or noise generation. This is in contrast to traditional approaches, where typically some clean signal output of the forward model is input into an explicit data likelihood. In practice, the statistical results should be identical in either formulation. In the context of stellar streams, given some final stream configuration {x_⋆^i, v_⋆^i}_{i=1...N_stars}, we need to model (a) the observational measurement errors, (b) the detection of the stream in the sky, and (c) the contaminating background of other stars.
In this work, we develop a simple initial observational model, meant mostly as a proof of principle. Specifically, we assume that the stream has been "detected" through some form of selection cuts and vetoes in survey data [55, 95–98]. We use this to define an observational window on which we choose to focus (i.e. we do not model the full sky). In the rest of the analysis, we will focus on a mock stream that is supposed to resemble the GD1 stream [3,4,6]. There is a standard set of co-ordinates used in the literature [17] to describe the phase-space structure of this stream. Specifically, there are two angle co-ordinates (ϕ_1, ϕ_2), which are approximately aligned with the stream track at ϕ_2 ≃ 0 deg, the corresponding proper motions (µ_ϕ1, µ_ϕ2), and the radial distance and velocity (d, v_rad). The definitions for these can be found in Appendix A.
Given these definitions, to construct the observation from the list of stream stars, we first define the region of interest in the sky/velocity phase space, i.e. we discard all stars that fall outside the observational window. Then, we add random observational errors to the values generated by sstrax, e.g. by sampling ϕ_1^obs ∼ N(ϕ_1, δϕ_1). Finally, we model two aspects of stream detection and selection effects. In particular, we assume that we have some selection efficiency ϵ_sel that measures how often we accidentally miss a star in a given detection algorithm that should have been correctly classified as part of the stream [55, 95–98]. We also model the fact that there can be contamination from background stars that are not part of the stream, but are nonetheless not removed by the detection algorithm and lie in the observing window [55]. This is quantified by assuming there is some number N_background of background stars, of which we are able to successfully remove a fraction (1 − ϵ_background) via the selection process. We then distribute N_background ϵ_background stars uniformly across the observational window to model the background contamination. Finally, we bin the remaining data into three channels of size (N^x_bins, N^y_bins) each: (ϕ_1, ϕ_2), (µ_ϕ1, µ_ϕ2), and (d, v_rad). All the choices for the particular values of the observational model described here are given in Tab. II.
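A sketch of this observational model for the (ϕ_1, ϕ_2) channel is given below. Here ϵ_sel is treated as the probability that a stream star is retained (a convention we assume), and all numerical values in the usage example are placeholders rather than the entries of Tab. II.

```python
import numpy as np

def observe_channel(phi1, phi2, window, dphi, eps_sel,
                    n_bg, eps_bg, n_bins, rng):
    (lo1, hi1), (lo2, hi2) = window
    # (a) keep only stars inside the observational window
    keep = (phi1 > lo1) & (phi1 < hi1) & (phi2 > lo2) & (phi2 < hi2)
    p1, p2 = phi1[keep], phi2[keep]
    # (b) add Gaussian measurement errors
    p1 = p1 + dphi * rng.normal(size=p1.size)
    p2 = p2 + dphi * rng.normal(size=p2.size)
    # (c) selection: retain each stream star with probability eps_sel
    sel = rng.uniform(size=p1.size) < eps_sel
    p1, p2 = p1[sel], p2[sel]
    # (d) residual contamination: n_bg * eps_bg stars uniform in the window
    n_con = int(n_bg * eps_bg)
    p1 = np.concatenate([p1, rng.uniform(lo1, hi1, n_con)])
    p2 = np.concatenate([p2, rng.uniform(lo2, hi2, n_con)])
    # (e) bin into an image-like channel for the inference network
    img, _, _ = np.histogram2d(p1, p2, bins=n_bins,
                               range=[[lo1, hi1], [lo2, hi2]])
    return img

rng = np.random.default_rng(0)
phi1 = rng.uniform(-10.0, 10.0, 500)
phi2 = rng.uniform(-3.0, 3.0, 500)
window = ((-5.0, 5.0), (-2.0, 2.0))
img = observe_channel(phi1, phi2, window, 0.0, 1.0, 200, 0.1, 32, rng)
```

The same steps would then be repeated for the proper motion and distance/velocity channels to build the full three-channel observation.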
As with the other components, there is significant room for more detailed modelling. For example, we know just from looking at Gaia data that the background stars will not be uniformly distributed across the sky [31,32,99], with higher concentrations near the galactic centre. Similarly, the efficacy of the sort of selection criteria or cuts that are applied based on e.g. metallicity or proper motions is likely to be at least stream- and sky-location-dependent. The extent to which this impacts the inference is a different question, and something that we can actually test in our framework by modifying the observational model.

II.6. Acceleration with jax
The decision to directly forward model the evolution of the stream, rather than either construct some effective description [54], or accelerate the dynamical solutions through action-angle co-ordinate constructions, opens up the possibility of far more general simulation frameworks. On the other hand, it is also potentially much more computationally intensive, e.g. if we include the effect of a large population of subhalos in the future. This is compounded by the additional simulation budget that is potentially required to perform inference on the large number of parameters any augmentation of the model can introduce.
As such, an important component of our implementation is its computational efficiency. We have achieved this by using the jax framework [80], which allows for just-in-time compilation, caching, and highly optimised custom vectorisation.
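A minimal illustration of the jax features this relies on, using a toy potential: automatic differentiation for the forces, vmap to vectorise over stars, and jit so that the batched computation is compiled once and cached.

```python
import jax
import jax.numpy as jnp

def phi(x):
    # Toy spherical potential, standing in for the sstrax Milky Way model
    return 0.5 * jnp.log(1.0 + jnp.dot(x, x))

# acceleration = -grad(Phi), batched over stars and jit-compiled
accel = jax.jit(jax.vmap(lambda x: -jax.grad(phi)(x)))

stars = jnp.ones((1000, 3))
a = accel(stars)  # subsequent calls with this shape reuse the compiled kernel
```

This composability is what makes it cheap to swap in more complex potentials: the force is always obtained by differentiating the scalar Φ, with no hand-written gradients.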

III. SIMULATION BASED INFERENCE FOR STELLAR STREAMS
In this section, we will give a brief review of general simulation-based inference (SBI) methods before describing the specific implementation we use in this work. We will end the section by presenting some of the algorithm design choices that are relevant to stellar stream analysis.

III.1. Overview of Simulation-based Inference
Recently, there have been significant advances in high fidelity physics simulations and machine learning techniques for processing complex data structures, alongside the emergence of increasingly challenging data analysis problems. This has led to the rapid development of "simulation-based inference" as a competitive alternative to traditional techniques as far as scalability, model realism, and unbiased analysis pipelines are concerned [66,67]. At its heart, the field of simulation-based inference asks: given some forward model or simulator, can we perform efficient and correct Bayesian inference? Ultimately, the goal of simulation-based inference is to develop a robust statistical pipeline that can make use of the most realistic and state-of-the-art modelling tools. To be more concrete, suppose we have some forward model p(x, θ) that takes the model parameters θ - which could be a range of physical parameters, effective model components, nuisance parameters etc. - to some data x that resembles the real observed data x_0. In a Bayesian context, we sample θ from some chosen prior distribution p(θ), so that the forward model takes the form p(x, θ) = p(x|θ)p(θ). This expression is at the heart of simulation-based methods, since it formally represents the notion that "running your simulator" is the same as sampling from the (simulated-)data likelihood p(x|θ). Indeed, this is the origin of the terms "likelihood-free" or "implicit likelihood" inference to describe SBI [66,67]. These descriptions are intended to convey the distinction between analytically evaluating some expression to compute the likelihood p(x|θ) and sampling from it.
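This notion can be made concrete in a few lines: drawing from the joint p(x, θ) is nothing more than running the simulator on prior draws, and no density is ever evaluated. The Gaussian-noise simulator below is a trivial stand-in for sstrax.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta, rng):
    # Implicit likelihood p(x|theta): a nonlinear signal plus noise; the
    # density of x given theta is never written down or evaluated
    return np.sin(theta) + 0.1 * rng.normal()

theta = rng.uniform(-1.0, 1.0, size=1000)          # theta ~ p(theta)
x = np.array([simulator(t, rng) for t in theta])   # x ~ p(x|theta)
# The pairs (x, theta) are exactly samples from p(x, theta) = p(x|theta)p(theta)
```

All SBI methods start from a table of such joint samples; they differ only in which quantity (posterior, likelihood, or ratio) they fit to it.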
To understand the different ways in which SBI methods approach the Bayesian inference problem, it is useful to briefly review how the forward model fits into Bayes' theorem. As far as scientific conclusions are concerned, we are typically⁹ interested in computing the posterior p(θ|x) of the parameters given some data x,

p(θ|x) = p(x|θ) p(θ) / p(x).

Here, as above, p(x|θ) is the data likelihood, p(θ) is the prior over our parameters θ = (θ_1, · · · ), and p(x) is the evidence. Given this setup, there are various ways that SBI algorithms tackle posterior estimation given the ability to sample from the forward model (x, θ) ∼ p(x, θ) = p(x|θ)p(θ). Specifically, these can be categorised as follows (with the method we use highlighted in bold):

• Neural Posterior Estimation (NPE). In neural posterior estimation [101,102], the goal is to directly estimate the posterior distribution p(θ|x) by representing it as some flexible parameterised probability density. This has been applied successfully in a number of contexts, e.g. gravitational wave analysis [74], and open source implementations are available [103].

⁹ Of course, there are use cases, e.g. in model comparison or goodness-of-fit tests [100], where computing other quantities such as the data evidence or maximum likelihood is more relevant.
• Neural Likelihood Estimation (NLE). In contrast, neural likelihood estimation [104,105] attempts to construct an estimator for the (simulated-)likelihood function itself, p(x|θ). This can then be used to carry out standard inference techniques such as MCMC [75,76] or nested sampling [106–108] and generate samples from the posterior.

• Neural Ratio Estimation (NRE). Finally, neural ratio estimation targets the likelihood-to-evidence ratio r(x; θ) = p(x|θ)/p(x) directly, typically by training a classifier to distinguish jointly drawn pairs (x, θ) ∼ p(x, θ) from marginally drawn ones. This is the class of methods used in this work, in the form of the TMNRE algorithm [78] described below.

III.2. The TMNRE Algorithm
We will now focus on the specific implementation of simulation-based inference used in this work. This is known as Truncated Marginal Neural Ratio Estimation (TMNRE) [78], and is implemented in the swyft software [79]. We have summarised the method in Fig. 1 for reference; however, there are a number of features we wish to emphasise in terms of its applicability to stellar streams.
• Targeted Inference. TMNRE is both a "targeted" and "sequential" algorithm in the sense that it performs inference on a specific target observation x_0 (as opposed to amortising over all possible model outputs) over a number of discrete rounds. In each round, the prior is truncated based on the inference in the current round (see description below) to avoid simulating in parameter regions which do not contribute significantly to the likelihood ratio, given the fixed target observation.
• Marginal Posteriors. There are a number of quantities of interest in Bayesian inference, including the full joint posterior p(θ|x_0) given some observation x_0, the evidence of a particular observation p(x_0), and marginalised posteriors p(θ_i|x_0) for some individual parameter θ_i or subset of parameters in θ. In TMNRE, a significant portion of the achieved simulation efficiency arises from the fact that we directly estimate the marginal posteriors, rather than marginalising over samples from the full joint distribution.
In combination, these two properties are the key to achieving a highly simulation-efficient inference strategy.
For more discussion along these lines, see e.g. the works discussing the application of TMNRE to CMB [68] and gravitational wave analyses [70].
Although the details of the method can be found in the original literature [78], it is useful to give a brief overview of the setup of the ratio estimation problem. This will highlight the features described above and how they will be beneficial for the analysis of stellar streams. The goal of TMNRE is to estimate the following ratio,

r(x; θ) ≡ p(x|θ)/p(x) = p(θ|x)/p(θ) = p(x, θ)/[p(x)p(θ)],

where the last two equalities follow from an application of Bayes' theorem in Eq. (5) and the definition of the joint distribution p(x, θ) ≡ p(x|θ)p(θ). In other words, access to the ratio r(x; θ) is equivalent to estimating (a) the likelihood-to-evidence ratio p(x|θ)/p(x), (b) the posterior-to-prior ratio p(θ|x)/p(θ), which will be used for parameter estimation, and (c) the joint-to-marginal ratio p(x, θ)/[p(x)p(θ)], which will be the technically important form for performing ratio estimation in practice.
If we focus on the last form, r(x; θ) = p(x, θ)/[p(x)p(θ)], we can make the observation that, given a set of simulations {(x, θ)} from our forward model p(x, θ) = p(x|θ)p(θ), we can construct two distinct classes of sample. The first is simply a sample from the full joint distribution p(x, θ), which amounts to picking an individual simulation pair (x, θ). The second is a sample from the product of marginal distributions p(x)p(θ), which can be obtained by picking two random simulations, then taking x from one and θ from the other. Having these two distinct distributions is the origin of ratio estimation as a binary classification task: given a pair (x, θ), did θ generate x? Intuitively, the relative precision of the posterior distribution in this case reflects how difficult it is to discriminate between joint and marginal samples. For instance, the larger the observational error is, the more overlap there will be between these two classes, and the wider the posterior will be.
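The construction of the two sample classes can be sketched in a few lines. This toy example (with a hypothetical one-dimensional simulator, not the stream model) pairs each x with its generating θ for the joint class, and with a batch-permuted θ for the marginal class.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulations from the forward model: index i is the generating pair (x_i, theta_i).
theta = rng.uniform(-5.0, 5.0, size=1000)
x = theta + rng.normal(0.0, 1.0, size=1000)

# Class 1: joint samples (x, theta) ~ p(x, theta) -- keep the pairing.
joint = np.stack([x, theta], axis=1)

# Class 0: marginal samples ~ p(x) p(theta) -- break the pairing by
# permuting theta across the batch.
marginal = np.stack([x, theta[rng.permutation(1000)]], axis=1)
```

In this toy case the joint pairs are strongly correlated while the shuffled pairs are not, which is exactly the signal the binary classifier learns to exploit.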
More formally, we can frame this binary classification task and ratio estimation as an attempt to minimise the following loss function [79],

L[f_ϕ] = −∫ dx dθ [ p(x, θ) ln σ(f_ϕ(x, θ)) + p(x)p(θ) ln(1 − σ(f_ϕ(x, θ))) ].

Here σ(x) = [1 + exp(−x)]^(−1) is the sigmoid function, and f_ϕ(x, θ) is the classifier with some set of free parameters ϕ that should be optimised. One of the key justifications for the correctness of TMNRE as an inference algorithm is the realisation that this loss can actually be minimised analytically. In particular, one can show that the optimal classifier is given by f*_ϕ(x, θ) = ln r(x; θ) [79]. In other words, if one can successfully minimise the loss in Eq. (7), then one directly obtains the posterior-to-prior ratio r(x; θ).
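This analytic minimiser can be checked numerically on a discretised toy joint distribution. The sketch below (our own illustration, not code from swyft) verifies that f = ln r achieves a lower loss than a perturbed classifier.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(f, p_joint, p_marg):
    # Discretised loss: expectation of -ln sigma(f) under the joint, plus
    # expectation of -ln(1 - sigma(f)) under the product of marginals.
    return -np.sum(p_joint * np.log(sigmoid(f)) + p_marg * np.log(1.0 - sigmoid(f)))

# A toy discretised joint p(x, theta) on a grid (correlated Gaussian).
g = np.linspace(-3.0, 3.0, 60)
xg, tg = np.meshgrid(g, g)                    # xg varies along columns, tg along rows
p_joint = np.exp(-0.5 * (xg - tg) ** 2 - 0.5 * tg ** 2)
p_joint /= p_joint.sum()
# p(theta) p(x) on the same grid, from the two marginals of the joint.
p_marg = np.outer(p_joint.sum(axis=1), p_joint.sum(axis=0))

# The analytically optimal classifier is f* = ln r = ln[p(x,theta)/(p(x)p(theta))];
# any perturbation of it should increase the loss.
f_star = np.log(p_joint / p_marg)
loss_star = bce_loss(f_star, p_joint, p_marg)
loss_pert = bce_loss(f_star + 0.5 * np.sin(xg), p_joint, p_marg)
```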
Practically, this is where the "N(eural)" part of TMNRE is relevant, especially for very high-dimensional data/parameter spaces. Modern machine learning methods, architectures, and hardware allow for very flexible parameterisations of the classifier f_ϕ, and there is a well-established methodology for optimising its parameters ϕ (more commonly called its weights). As far as the analysis of stellar streams is concerned, this gives us access to data representations that are as close as possible to real data from e.g. Gaia, without any need for compression into hand-crafted summary statistics, spline fits, or similar data reductions. In this work, the optimisation is carried out using the swyft software, which is built on top of pytorch.
With (neural) ratio estimation set up this way, we can now see how to directly estimate marginal posteriors in this framework. Suppose we wish to estimate the fully marginalised posterior for a single parameter θ_i in θ. We can start by taking our full suite of simulations (x, θ) ∼ p(x, θ), which (crucially) vary all parameters θ. Then, however, instead of constructing the loss based on all parameters, we "show" the classifier only the single parameter θ_i. This is equivalent to replacing θ → θ_i in Eq. (7) above. Importantly, however, the analytic arguments still hold and allow us to obtain directly the marginal posterior-to-prior ratio p(θ_i|x)/p(θ_i). In contrast to e.g. MCMC, where this marginalisation is performed after obtaining samples from the joint posterior, we implicitly marginalise by varying all parameters in the simulations simultaneously, but construct the marginal posterior directly rather than via the joint. As far as analysing stellar streams is concerned, this is not just a useful trick for quickly obtaining the marginal posteriors, but is crucial in making the algorithm simulation efficient. Looking forwards, if the goal is to perform inference with extremely high-fidelity stream simulations in order to extract the maximum possible information from the data, analysis methods that break the traditional scaling of sampling algorithms such as MCMC or nested sampling will be vital.
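Schematically, estimating a marginal therefore only changes which parameter columns are paired with the data; the remaining columns still vary in the simulations, which is what performs the implicit marginalisation. A toy sketch (with invented dimensions, not the actual sstrax parameters):

```python
import numpy as np

rng = np.random.default_rng(2)

n, d = 1000, 16                       # simulations varying all 16 parameters
theta = rng.uniform(0.0, 1.0, size=(n, d))
x = theta.sum(axis=1, keepdims=True) + rng.normal(0.0, 0.1, size=(n, 1))

i = 0                                 # parameter of interest, e.g. t_age
# The classifier for the 1d marginal of theta_i is only ever shown column i;
# the other 15 columns are simply never passed to it.
joint_i = np.concatenate([x, theta[:, i:i + 1]], axis=1)
marginal_i = np.concatenate([x, theta[rng.permutation(n), i:i + 1]], axis=1)
```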
The final aspect to discuss before we summarise the algorithm and the design choices relevant to stellar streams is the truncation process that allows us to target a particular observation x_0. As far as simulation efficiency is concerned, the idea behind truncation is to minimise the number of simulations performed in regions of extremely low posterior density, since, by definition, these provide almost no information about the parameter estimation problem. Formally, we achieve this by performing the inference sequentially over several rounds. In each round, we generate a set of simulations (x, θ) ∼ p(x, θ) from the full model. Then we train and optimise our classifiers f^i_ϕ(x, θ_i) for the parameters of interest, from which we can obtain marginal posteriors p(θ_i|x_0) for the specific target observation x_0. These will highlight regions of parameter space where the posterior density for θ_i is very high, and of course other regions where the density is low, indicating that, given the observation x_0, this particular set of parameters is unlikely. We use these latter regions to truncate our prior by imposing the condition r_i(x_0; θ_i) < ϵ on the estimated ratios. Then, we resimulate by sampling from this truncated prior, repeat the inference, and truncate again. Eventually, once the posteriors converge to the level of statistical uncertainty, the truncation will simply return the restricted prior and the algorithm will terminate. This truncation process is highlighted below in Fig. 5.
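A minimal sketch of a single truncation step, assuming a ratio estimator has already been evaluated on a parameter grid (the grid, bump width, and threshold value are all illustrative):

```python
import numpy as np

def truncate_prior(theta_grid, log_ratio, eps=1e-5):
    # Keep the region where r(x0; theta)/max(r) >= eps; the next-round prior
    # is the current one restricted to [lo, hi].
    r = np.exp(log_ratio - log_ratio.max())   # normalised so max(r) = 1
    kept = theta_grid[r >= eps]
    return kept.min(), kept.max()

# Toy example: a narrow Gaussian bump for the ratio on a wide prior [-10, 10].
grid = np.linspace(-10.0, 10.0, 2001)
log_r = -0.5 * (grid / 0.5) ** 2
lo, hi = truncate_prior(grid, log_r, eps=1e-5)
```

For this bump, the 20-unit-wide prior is cut down to roughly |θ| ≲ 2.4, so subsequent simulations concentrate where the posterior mass actually lies.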
In summary, the TMNRE algorithm splits into four steps, highlighted in the schematic shown in Fig. 1:

1. Step 1: Sample parameters from the current (possibly truncated) prior and generate a set of simulations (x, θ) ∼ p(x, θ).

2. Step 2: Train a set of classifiers f^i_ϕ(x, θ_i) to obtain an estimate of the ratio r_i(x; θ_i) = p(θ_i|x)/p(θ_i).

3. Step 3: Use these trained ratios to obtain estimates of the marginal posteriors p(θ_i|x_0) for a specific target observation x_0.

4. Step 4: Take these marginal posterior distributions and derive bounds on the prior region to truncate for the next round of inference by imposing the condition p_i(θ_i|x_0)/max_{θ_i} p_i(θ_i|x_0) < ϵ.

Repeat from Step 1 until the truncation procedure stabilises, then take the final round of inference as the set of posteriors p(θ_i|x_0) and terminate the algorithm.
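The round structure above can be sketched as a loop. In this toy version (our own illustration), a one-parameter Gaussian simulator and its analytic likelihood stand in for the forward model and the trained ratio estimator, so only the structure of the rounds, not the network training, is represented.

```python
import numpy as np

rng = np.random.default_rng(3)
truth = 1.7                              # true parameter behind the target x0
x0 = truth + rng.normal(0.0, 0.2)

lo, hi = -10.0, 10.0                     # initial prior bounds
for round_number in range(7):
    # Step 1: simulate from the current (truncated) prior.
    theta = rng.uniform(lo, hi, size=5000)
    x = theta + rng.normal(0.0, 0.2, size=5000)
    # Steps 2-3: a ratio estimator would be trained on (x, theta) here; this
    # toy model has an analytic Gaussian likelihood, so we can evaluate the
    # (unnormalised) log ratio for x0 on a grid directly.
    grid = np.linspace(lo, hi, 2001)
    log_r = -0.5 * ((x0 - grid) / 0.2) ** 2
    # Step 4: truncate the prior where the normalised ratio drops below eps.
    kept = grid[np.exp(log_r - log_r.max()) >= 1e-5]
    lo, hi = kept.min(), kept.max()
```

After the first round the bounds contract to a narrow window around x0 and then stabilise, which is the termination signature described above.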

III.3. Design Choices for Stellar Streams
In order to use the TMNRE algorithm in practice, we must make a number of design choices. These include (a) building or using a pre-implemented forward model that generates the data x (here, a representation of the stellar stream) given parameters θ; (b) designing a neural network architecture that can efficiently process the data format of x and θ; (c) making choices for the prior distributions over the parameters θ; and (d) choosing the hyperparameters relevant to the TMNRE algorithm.
Forward Simulator. To generate stellar stream simulations, we use the implementation of our modelling approach described in detail above (see Sec. II). To very briefly recap, we solve for the full evolution history of the stream, including e.g. the orbit-dependent tidal stripping, in a framework that can accommodate any time-dependent or time-independent gravitational potential.
In addition, we develop a simple observational model that is intended to represent experimental and statistical uncertainties at the level of a current survey. This is implemented in the jax-accelerated modelling code sstrax, which we couple directly to the swyft software [79] in our analysis code albatross. The parameters that we vary in this analysis, θ = (t_age, M_sat, ...), are described in Tabs. I (stream modelling) and II (observation model).
An example output of our simulator (and the case study that we investigate below) is shown in Fig. 2.
Inference Network. As discussed above, the main aim of (neural) ratio estimation is to design a procedure that can reliably train a classifier, or set of classifiers f^i_ϕ(x, θ_i), to distinguish between joint and marginal samples [79,109-112]. To do this in practice, we need a flexible way to parameterise f_ϕ. Although there are arguments from e.g. the loss function in Eq. (7) that any sufficiently flexible parameterisation (i.e. one with enough trainable parameters in ϕ) will be able to optimise the loss, in reality this holds only in some infinite training data limit. As such, we should make sensible design choices regarding the network to take full advantage of the known structure and physics associated with our signal and data format. Empirically, making physics-informed choices at this stage leads to huge increases in performance, robustness, simulation efficiency, and the general applicability of the method.
In our case of stellar streams, the signal is a collection of stars and their properties (positions, velocities, perhaps even metallicity, etc.). As described in Sec. II, we focus on the phase-space information in this work, including e.g. the selection procedure in our observational model. As part of our forward model, we chose to bin the data into a 3-channel image to preserve the spatial structure and morphology of the signal. This data format is then well suited for the application of standard image processing network structures.
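A sketch of this kind of binning is shown below. The choice of channel contents (star counts plus two per-pixel mean kinematic maps), the bin edges, and the mock star sample are all assumptions for illustration; the actual channel definitions are those of the sstrax observational model.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
phi1 = rng.uniform(-90.0, 10.0, size=n)      # stream-aligned sky coordinates (deg)
phi2 = rng.normal(0.0, 0.5, size=n)
mu1 = rng.normal(-10.0, 1.0, size=n)         # two proper-motion-like quantities
mu2 = rng.normal(-2.0, 1.0, size=n)

bins = (np.linspace(-90.0, 10.0, 129), np.linspace(-8.0, 8.0, 65))
counts, _, _ = np.histogram2d(phi1, phi2, bins=bins)
sum_mu1, _, _ = np.histogram2d(phi1, phi2, bins=bins, weights=mu1)
sum_mu2, _, _ = np.histogram2d(phi1, phi2, bins=bins, weights=mu2)

# Stack into a 3-channel image: counts plus per-pixel mean kinematic maps
# (left at zero where a pixel is empty).
safe = np.maximum(counts, 1.0)
image = np.stack([counts, sum_mu1 / safe, sum_mu2 / safe], axis=0)
```

The resulting (channels, height, width) array is exactly the format expected by standard convolutional image-processing layers.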
More precisely, we know that a stream binned at a given resolution can have structure on a range of scales. For instance, the large-scale orbit of the stream is typically governed by e.g. the ambient gravitational potential, as well as the initial conditions of the cluster. On the other hand, the smaller-scale features (gaps, spurs, etc.) are more likely to be impacted by the dynamical evolution history, tidal stripping, or interactions with perturbers. The aim is to analyse both of these classes of signal simultaneously, and as such we should choose a network architecture accordingly. In particular, with this observation, it is clear that applying e.g. a standard convolutional network, which applies the same kernel identically to each part of the image, would be a poor choice and unlikely to be able to extract the small-scale and large-scale information simultaneously. With this in mind, we choose the well-known unet architecture [114], which is well suited to image analysis and segmentation. It is designed to simultaneously analyse the image at larger scales, before performing follow-up analysis on each identified segment and then combining the results.
There is another part of the inference network (a full description and network diagram can be found in Appendix B) which performs the ratio estimation. Schematically, one can understand the overall structure as first performing data compression through the unet and a small linear network to extract an optimal set of summary statistics. These summary statistics (which are automatically learned and optimised during training) are then passed to the default ratio estimator implemented in swyft, along with the model parameters θ. All the details regarding the implementation can be found in the albatross library. In terms of specificity, we expect the network to be broadly applicable to the analysis of any stream model or observation, since it only assumes that the signal has structure on various scales.
Prior Choices. The prior choices for all the parameters of interest are shown in Tab. I. They are chosen either to represent our knowledge about the physics from current astrophysical observations or simulation results (e.g. the mass loss parameter α), or to be maximally uninformative. An example of the latter case are the cluster position and velocity priors, which are chosen in the first instance to span the full observational window.

TMNRE Hyperparameters. There are a number of hyperparameters that need to be set when one runs the TMNRE algorithm. Broadly, these can be categorised as either parameters that control the network training process, or parameters specific to the TMNRE algorithm. For the inference and analysis detailed in this work, the particular choices can be found in Tab. III, as well as in the example configuration files supplied with albatross. Briefly, the training parameters describe how long to train the network for (min./max. training epochs), how many epochs to wait for the validation loss to decrease again (early stopping patience), the split between training and validation data (train : validation ratio), and the batch sizes shown to the network during training (training/validation batch size). During training, we track both the current loss on the training data set and the loss evaluated on a separate validation set. Good performance on the validation set is typically a good strategy to avoid overfitting, and therefore we use it as a metric for whether we are starting to overfit to the training data. The early stopping criterion waits for a specified number of passes through the training data (epochs) over which the validation loss has not decreased before terminating the training; it then re-initialises the network parameters to the state where the minimum validation loss was observed. The TMNRE settings consist of the minimum number of rounds (number of rounds), the schedule for the number of simulations per round (simulation schedule), and the threshold for truncation (ϵ). Finally, we have the "noise shuffling" setting, which splits the data x into stream and background components. In a given batch, it then randomly permutes the background elements, essentially showing the network a brand new example (with the same signal component) every epoch at zero simulation cost. We found this to be an extremely effective way of reducing the possibility of overfitting, especially in the early rounds where the simulation batches are small. Indeed, this strategy should be applicable to any additive noise model; see for example its application to gravitational wave data analysis [70].
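The noise shuffling trick itself is a one-liner for any additive noise model. A sketch (with array shapes invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

# Assumed additive decomposition of each simulation: x = stream + background.
stream = rng.normal(0.0, 1.0, size=(32, 64, 64))       # batch of signal maps
background = rng.normal(0.0, 0.3, size=(32, 64, 64))   # batch of noise maps

def noise_shuffle(stream, background, rng):
    # Re-pair each signal with a randomly chosen background from the batch:
    # effectively a brand new training example at zero extra simulation cost.
    perm = rng.permutation(background.shape[0])
    return stream + background[perm]

batch = noise_shuffle(stream, background, rng)
```

Since the permutation only re-pairs existing arrays, the set of backgrounds shown to the network is unchanged; only their association with the signals varies between epochs.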
In this section, we have discussed the broad field of simulation-based inference and the specific algorithm, known as Truncated Marginal Neural Ratio Estimation (TMNRE), that we have used to build our data analysis pipeline. We argued that the targeted and marginal-focused approach could be a key advantage for stellar stream analysis, including the resulting simulation efficiency, statistical robustness, and the opportunities for increased model complexity. Finally, we discussed some of the design choices that need to be made in order to successfully apply TMNRE to a given problem. In the next section, we will present a case study for a mock stream to illustrate the application of our modelling and analysis strategy.

IV. RESULTS: GD1-LIKE CASE STUDY
Now that we have set up the framework of simulation-based inference, and specifically described the application of the algorithm to the analysis of stellar streams, we can present a case study to highlight its functionality. In this section, we will illustrate the full analysis and validation of a mock observation generated using our stellar stream modelling code sstrax. This allows us to have full control over the reconstruction of the parameters, as well as the physics input to the model. Of course, the longer-term goal is to analyse current and future state-of-the-art spectroscopic and photometric surveys, such as Gaia and the Vera Rubin Observatory [13,31-33].

IV.1. Case Study Description
Perhaps the most well-studied and well-observed Milky Way stellar stream is the GD1 stream [3,4,6]. It was first identified in the SDSS catalog in the early 2000s [3,4], but has recently been observed by e.g. Gaia in significantly more detail [6,31,32]. Indeed, observations are currently at the level where individual substructures (e.g. the so-called "gaps" and "spur") are reasonably well resolved [6,40]. In some sense, the purpose of developing our analysis method is to take full advantage of these improvements and perform inference on streams like GD1 in as realistic a simulation framework as possible.
To illustrate and test our method, we construct a case study that closely resembles the sky location and structure of the GD1 stream. To do so, we choose a stream closely aligned with the ϕ_2 = 0 deg plane in the GD1-specific coordinate system defined in Sec. II. We centre the location of the cluster (remnant) at ϕ_1 ∼ −25 deg, and choose the age of the stream such that it extends across a significant portion of the sky, as in the case of the real GD1 observation [3,4,6]. Similarly, the dominant part of the stream is located around 6-10 kpc away from the galactic centre. The full set of parameter values chosen for the mock observation is shown in Tab. I, along with the priors for the subsequent Bayesian inference. The mock observation that these parameters generate, and the focus of the analysis below, is shown in Fig. 2. We do note, however, that whilst the mock observation presented here has general features representative of a GD1-like stream, some aspects of the modelling could be improved in this regard. One important example is the presence of a clear remnant at the cluster centre, which is not present in the real GD1 data. Concretely, this has the effect that the reconstruction of the cluster position in our analysis may be overly optimistic compared to a more realistic case.

IV.2. Parameter Estimation with TMNRE
We carry out parameter estimation using the priors for the model parameters indicated in Tab. I, the observational model described in Tab. II, and the TMNRE algorithm settings given in Tab. III. The key results for this section are given in Figs. 3, 4, and 5.
Overview. There are a few levels at which to discuss the results of applying the TMNRE algorithm described in Sec. III to the mock GD1-like observation in Fig. 2. The first is simply in the context of robust and faithful inference: in Fig. 3, we show the full set of converged 1d posteriors for all parameters in the model. We see that we reconstruct the true value in all cases, either via a direct measurement (e.g. the final position or velocity of the cluster) or as some clear upper or lower bound (e.g. the mass loss parameter α). Importantly, we can reconstruct with very high precision the age of the stream (t_age), the velocity dispersion of the cluster (σ_v), the current cluster position and velocity ([x_c, y_c, z_c], [v_x,c, v_y,c, v_z,c]), and the relative asymmetry in the tidal stripping (p_near). This sort of constraint is easy to motivate physically via e.g. the length and width of the stream, which are strongly affected by the age and velocity dispersion, as well as the stream's spatial location and orientation, which are controlled by the relative cluster position and velocity.
Degeneracies. Of course, not every parameter is measured with high precision, such as the parameters (ξ_0, α, r_h, m) that control the mass loss rate of the cluster as it orbits the Milky Way. Whilst we can set relevant upper or lower bounds on these parameters, it is also interesting to explore the degeneracies between them. This is where we can use the flexibility of the TMNRE algorithm to efficiently estimate posteriors of choice. Specifically, we only need to estimate the relevant 2d posteriors for exploring the degeneracy structure of the mass loss model. To do so, we train additional ratio estimators r(x; {θ_i, θ_j}) with θ_i, θ_j ∈ (ξ_0, α, r_h, m). The results for these 2d posteriors, along with the corresponding 1d posteriors, are shown in Fig. 4. We see some clear degeneracies highlighted in the parameter inference, such as those between ξ_0 and r_h. Again, whilst we do not investigate these degeneracies in detail, this could be expected from e.g. the scaling of dM_c/dt in Eq. (2).
Precision. As discussed in Sec. III, simulation-based inference has now been used extensively in other fields, and has been shown to qualitatively and quantitatively reproduce known results, as well as results obtained using traditional methods. In the present case, a full comparison to e.g. a traditional method such as MCMC is challenging because we only have a forward simulator for the observational and noise models. This is another way of saying that we do not have an explicit form of the data likelihood. Of course, this is a key strength of the class of simulation-based methods, since it allows for arbitrarily complex data simulators, which can account for complicated aspects of detection and selection in a statistically meaningful way. On the other hand, it means we should consider additional ways to test our results. A simple qualitative test we can perform is to compare the precision (and accuracy) with which we reconstruct the cluster position to the intrinsic observational errors on the stellar positions. To do this, we construct the 3d joint posterior p(x_c, y_c, z_c|x_0) by training a 3d ratio estimator r(x; {x_c, y_c, z_c}) on the final round of simulations. From this joint posterior, we can generate posterior samples in the (ϕ_1, ϕ_2) parameter space for the current position of the cluster, distributed as ϕ_1, ϕ_2 ∼ p(ϕ_1, ϕ_2|x_0). These are shown in the top right panel (and inset) of Fig. 4, along with the observational model errors on the positions of the stars, δϕ_1, δϕ_2. We see that we are able to reconstruct the cluster position to a good degree of accuracy and precision.
Simulation Efficiency. One of the key arguments we made for using TMNRE was that it gives us the ability to use high-fidelity simulators. This is both from a statistical perspective, in the sense that we can perform Bayesian inference without explicit likelihoods, and from the scalability point of view. Indeed, one of the main obstacles to a full analysis of stellar streams is the fact that performing enough simulations to do inference on a large number of parameters is typically infeasible. This is where the marginal and targeted aspects of TMNRE are relevant, as well as the acceleration of the simulator. To be more specific, in the case study described above, we required a total of only 350k simulations to perform inference on all 16 parameters simultaneously. Crucially, this simulation budget was split across a total of 7 rounds, as illustrated in Tab. III. In between the rounds, the truncation procedure described in Sec. III was applied, which ensures that we are targeting the specific observation of interest, and that the variance in the training data is significantly reduced compared to the previous round. This is very important for simulation efficiency, and results in much higher quality inference results on targeted observations compared to e.g. the case where a fixed simulation budget is used in a single round. This truncation process is highlighted in Fig. 5, where we see how the different classes of parameter respond to the truncation process. For example, the first three columns of parameters (one component of the position and velocity, and the velocity dispersion of the stream) are extremely well constrained once the algorithm converges. On the other hand, the last two panels show parameters that are only broadly reconstructed (r_h and λ_rel). For this second class of parameter, however, we see that in the initial rounds, the marginal posterior estimates of e.g. p(r_h|x_0) are quite poor. As the rounds evolve and the well-measured parameters are better constrained, subsequently reducing the training data variance, the posterior estimates on the poorly reconstructed parameters improve significantly. This is a general feature of TMNRE, where convergence and truncation in one set of parameters leads to marked improvements in the inference of other model parameters, even if they themselves are not well measured.
In terms of actual run time, we performed this analysis on a 72-CPU-core cluster node, with a single NVIDIA A100 GPU to train the ratio estimators. The total run time for the analysis was around 19 hours, of which approximately 90% was simulation time. Note that this can therefore be improved immediately by either (a) further speeding up the simulator, or (b) having access to more CPU cores across which the simulations can be further parallelised.

IV.3. Consistency and Validation Tests
The posterior sanity checks and explicit evidence for excellent reconstruction of the true parameter values in our case study are an important step towards developing and testing our analysis pipeline. On the other hand, given that our goal is to target data analysis challenges where no traditional methods are available (either because they scale too poorly with the number of parameters, or because the data likelihood is analytically intractable), we need to develop additional consistency checks to validate our results. This is very much an active field of research in simulation-based inference, and a set of established methods now exists [115,116].
The most common such checks, and the ones we present here, are known as coverage tests [115]. We focus on expected coverage tests of our inference pipeline, which make precise the idea of variations in posterior estimates over observational or statistical fluctuations. In particular, expected coverage tests ask the following question: how often does the x% credible interval contain the true value, averaged over observations generated from the joint distribution (x, θ) ∼ p(x, θ)? By definition, a well-calibrated posterior distribution will contain the true value inside the x% credible interval x% of the time. As such, to carry out this test, we can generate a set of simulations from the truncated prior in the final round of inference (so that we test the most relevant region of parameter space), and then perform inference on each simulation using our trained final-round ratio estimators. For each confidence level x% ∈ [0, 1], we can then count how many simulations have inference results that contain the corresponding true value within this confidence interval. A posterior passes this test if the result is an approximately diagonal line in the expected versus empirical coverage plane. Importantly, since the inference must be done individually for each mock observation, it is typically infeasible to perform this sort of coverage test with fully sequential methods (including e.g. MCMC or nested sampling), especially in scenarios with high simulation cost such as stellar streams. Finally, one should note that this coverage test is diagnostic in the sense that a failure indicates a poorly calibrated posterior estimate, but success does not guarantee that the correct posterior has been found.
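The logic of the expected coverage count can be illustrated with a toy Gaussian model whose exact posterior is known (our own sketch, not the stream pipeline); the empirical coverage of the 1σ (≈68.3%) credible interval should then match the nominal level.

```python
import numpy as np

rng = np.random.default_rng(6)
n, s = 2000, 0.5

# Draw (theta, x) ~ p(x, theta) with a standard-normal prior and a Gaussian
# likelihood of width s; the exact posterior then stands in for the trained
# ratio estimator.
theta = rng.normal(0.0, 1.0, size=n)
x = theta + rng.normal(0.0, s, size=n)
post_mean = x / (1.0 + s ** 2)
post_std = np.sqrt(s ** 2 / (1.0 + s ** 2))

# Empirical coverage of the ~68.3% (1-sigma) credible interval: the fraction
# of mock observations whose true theta lies inside the interval.
coverage = np.mean(np.abs(theta - post_mean) <= post_std)
```

Repeating this count over a range of credible levels and plotting empirical against nominal coverage gives the diagonal-line diagnostic described above.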
We provide the coverage test results for all 16 parameters in the Appendix (see Figs. 8 and 9), but also give a specific example for the age of the stream, t_age, in Fig. 6 opposite. We see that in all cases we achieve good coverage results, which could easily be improved further by allocating a slightly larger simulation budget. This coverage test diagnostic remains applicable irrespective of the forward model or parameter choices, and is one of the key metrics for validating simulation-based inference methods.

V. CONCLUSIONS AND OUTLOOK
In this work, we have presented the development and application of a brand new simulation-based inference data analysis pipeline for modelling (see Sec. II) and analysing (see Secs. III and IV) stellar streams. In this final section, we present our key conclusions and provide some outlook on the classes of analysis challenge we can now attempt to tackle, as well as the steps that would be required to achieve them. The key contributions of this work are as follows:

• Scalable simulation-based inference pipeline. We have developed a brand new simulation-based inference algorithm to analyse stellar streams in the Milky Way (see Sec. III for a discussion of the application of simulation-based methods to stellar streams). In particular, we have implemented the TMNRE algorithm [78] with the aim of developing a scalable inference method. The motivation for choosing this algorithm for the analysis of stellar streams is mainly the simulation efficiency that results from targeting individual observations and focusing on marginals. We showed in Sec. IV that we were able to perform inference on all 16 parameters of our model with only 350k simulations. This sort of performance is the key argument for the ability of our approach to analyse streams with far more realistic forward models.
• Robust and flexible method. Another important aspect of the analysis methodology developed here is its flexibility and robustness to changes in observation strategy or simulation model. By construction, our approach is simulation-based and therefore has the advantages that it does not (i) assume any explicit likelihood for e.g. the observational model, only the existence of a forward simulator, or (ii) assume any particular form of the data output. On the latter point, whilst we have developed the algorithm alongside our modelling code sstrax, the analysis pipeline would remain identical for any simulator. This is the crucial aspect that will allow our method to be used to make direct comparisons between different stream simulation strategies and observational models.
• Public analysis code. We have built our analysis method on top of the swyft software, which is a pytorch-based implementation of TMNRE [78]. Specifically, we have publicly released the albatross code, which is currently coupled to the sstrax modelling code by default. The albatross code is highly modular and can in principle be coupled to any forward model, for example galpy [63], without any change in the analysis methodology. This will eventually allow for direct comparisons of the inference between different modelling strategies.
• Public modelling code. As mentioned above, one of the key motivations for developing the albatross implementation of the TMNRE algorithm was to create a framework that allows for robust, scalable inference on complex models. In the same vein, we developed a new modelling code sstrax that is accelerated through the jax programming paradigm [80]. This allows for fast (around a second per simulation) and realistic forward modelling of streams. We have designed the code to be readily extendable to include additional physical effects such as subhalo impacts, varying gravitational potentials, and higher fidelity tidal disruption models. As above, regardless of the modelling choices, the inference pipeline will crucially remain identical.
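To make the modelling steps concrete, the sketch below illustrates the kind of forward model sstrax implements (progenitor orbit integration followed by stochastic star release). It is written in plain numpy rather than jax, and every name and numerical choice (log_halo_acc, strip_stars, the toy units and parameter values) is our own illustration, not part of the sstrax API:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_halo_acc(pos, v0=0.22, rc=1.0):
    # Acceleration from a spherical logarithmic halo,
    # Phi = (v0^2 / 2) ln(rc^2 + r^2), in toy units (kpc, kpc/Myr)
    return -v0**2 * pos / (rc**2 + np.sum(pos**2))

def cluster_trajectory(pos, vel, dt=0.5, n_steps=2000):
    # Kick-drift-kick (leapfrog) integration of the progenitor orbit
    traj = np.empty((n_steps + 1, 3))
    traj[0] = pos
    for i in range(n_steps):
        vel = vel + 0.5 * dt * log_halo_acc(pos)
        pos = pos + dt * vel
        vel = vel + 0.5 * dt * log_halo_acc(pos)
        traj[i + 1] = pos
    return traj

def strip_stars(traj, n_stars=500, sigma_v=5e-4):
    # Each star leaves the cluster at a random (here uniform) stripping
    # time, then drifts away with a small random velocity kick
    t_strip = rng.integers(0, traj.shape[0] - 1, size=n_stars)
    kicks = rng.normal(0.0, sigma_v, size=(n_stars, 3))
    ages = (traj.shape[0] - 1 - t_strip)[:, None]  # steps since stripping
    return traj[t_strip] + kicks * ages

traj = cluster_trajectory(np.array([8.0, 0.0, 0.0]), np.array([0.0, 0.20, 0.02]))
stream = strip_stars(traj)  # (500, 3) array of mock stream star positions
```

In sstrax itself each of these steps (trajectory, mass loss history, release conditions) is a separate jax-compiled module, which is what enables both the per-simulation runtime and the extendability described above.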

V.1. Outlook
We argued in the introduction that stellar streams are an exciting probe of galactic and dark matter physics. This is particularly true as the quality of observations continues to improve significantly in the eras of Gaia and the Vera Rubin Observatory [13, 31-33]. Taking full advantage of this data is challenging, however, both in terms of robust statistical analysis and the complexity of the simulations required. Ultimately, if we are interested in using stellar streams to analyse scenarios such as the origin and statistics of substructure in the stream [9-12], or the impact of a large population of low mass subhalos on streams in the Milky Way [8, 44, 64], we will have to overcome these hurdles. This is the context we had in mind when developing albatross and sstrax. The aim was to develop a scalable, simulation-efficient framework that does not make any assumption about the complexity of the forward simulator. This is exactly the sort of task that simulation-based inference methods were developed to address. In terms of specific outlook, we believe there are a number of interesting avenues to pursue given the capabilities developed here. On the analysis side, there are several interesting claims in the literature about the origin and characterisation of the gaps and features in streams such as GD1 [3, 4, 6, 40]. More specifically, it would be an extremely important result to classify e.g. the gap in GD1 as being due to a compact object or subhalo collision [5, 6, 40]. To obtain a definitive answer, however, one needs to show that the features cannot (at least to some degree of statistical certainty) arise by chance as a result of some stochasticity in the tidal stripping process, selection effects at the level of stream detection, or a more complex model of the Milky Way potential including known substructure such as dwarf galaxies or globular clusters [20, 28, 43]. Similarly, it would be interesting to provide a conclusive answer as to the relative shape and size of the Milky Way gravitational potential from an analysis of individual or multiple streams [41].
The key advantage of the framework we have put forward here is that one can (and should) ask all of these questions simultaneously. This is a more precise version of the statement in the introduction, where we argued that we would ideally like to analyse the large- and small-scale structures present in stellar streams at the same time. On a more cautionary note, in order to move towards analysing real data in its full complexity with this class of simulation-based inference methods, it will be important to characterise and quantify the sensitivity of the inference method to perturbations or misspecifications in the forward model. This is particularly relevant in the case of stellar streams, where the physics is highly complex, and it is unlikely that a simulation model can be developed that is simultaneously fully self-consistent and fast enough for parameter inference. One step towards this goal could be to analyse mock streams generated from N-body simulations (see e.g. [77]) with identifiable parameters that match those in the model (such as the cluster velocity dispersion, analytic potential, or the age of the stream). One should bear in mind, however, that this is not an issue specific to simulation-based inference: it also affects traditional approaches such as MCMC if e.g. the modelling or data likelihood is miscalibrated.
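As a concrete illustration of the kind of calibration check referred to here, the snippet below runs an expected-coverage test on a toy inference problem where the exact posterior is known analytically. The toy model and all names are ours, not part of the albatross pipeline, but the logic (draw from the joint, count how often nominal credible intervals contain the truth) is the same:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy problem with a tractable posterior: theta ~ N(0, 1) prior and
# x | theta ~ N(theta, sigma^2) likelihood, so the posterior is Gaussian
# with mean x / (1 + sigma^2) and variance sigma^2 / (1 + sigma^2).
sigma = 0.5
post_var = sigma**2 / (1 + sigma**2)

def credible_interval(x, z):
    # Central credible interval of the analytic posterior at z-score z
    mean = x / (1 + sigma**2)
    half_width = z * np.sqrt(post_var)
    return mean - half_width, mean + half_width

def empirical_coverage(z, n_trials=20000):
    # Draw (theta, x) pairs from the joint, then record how often the
    # nominal interval actually contains the true theta
    theta = rng.normal(0.0, 1.0, n_trials)
    x = theta + rng.normal(0.0, sigma, n_trials)
    lo, hi = credible_interval(x, z)
    return np.mean((theta >= lo) & (theta <= hi))

# A well-calibrated posterior reproduces the nominal coverage,
# e.g. about 0.683 at z = 1; a misspecified one does not
cov_1sigma = empirical_coverage(1.0)
```

Replacing the analytic interval with one derived from a trained ratio estimator turns this into the empirical-versus-nominal coverage comparison shown in the coverage figures.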
Of course, to achieve these analysis goals, we must also make progress on modelling. Having a flexible analysis and simulation pipeline that does not assume e.g. symmetry in the Milky Way potential, or uniform stripping times in the evolution of the cluster, motivates us to focus on improving the realism of each aspect. In particular, there are a number of key developments that would place the analysis questions above on a much more solid footing and allow us to analyse real data with confidence. Firstly, we should focus attention on the observational model: in this work we constructed a very simple framework to describe the detection and measurement of Milky Way streams. In reality, however, data such as that from Gaia is significantly more complicated [31, 32], requiring an account of e.g. position-dependent errors, selection effects based upon proper motions and metallicities, and spatially varying background densities [31, 32]. Realistic modelling of these effects will be particularly relevant for robustly studying e.g. small-scale features in streams. Secondly, the dynamics of tidal disruption and the release of stars from the cluster is vital for generating realistic density perturbations along the stream track. Again, since this could be an interesting observable for studying e.g. the collective implications of a population of small perturbers [9, 10, 44, 54], or the internal properties of globular clusters [29], developing the realism of this model will inevitably lead to more informative inference results. Thirdly, we know that on the timescales relevant to stellar streams, the Milky Way and its potential are dynamical, both in terms of the global structure and the large amount of substructure in the form of dwarf galaxies, other clusters, or gas clouds [20, 28, 43]. It would be interesting to take input from e.g. N-body simulations of Milky Way formation and trace the evolution of streams in such a dynamical potential. As we argued above, this could be done without any change in the analysis pipeline.
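To illustrate the kind of observational layer being discussed (and that would need to be made far more realistic for Gaia-like data), here is a toy sketch with position-dependent errors and uniform background contamination. The function, the error law, and all numerical values are invented for illustration and do not reproduce the observational model of this work:

```python
import numpy as np

rng = np.random.default_rng(2)

def observe(phi1, phi2, eps_background=1e-5, n_field=1_000_000,
            field=(-90.0, 10.0, -8.0, 2.0)):
    """Toy observational layer for stream-aligned sky coordinates
    (phi1, phi2), in degrees. Not the model used in the paper."""
    # Position-dependent scatter that grows towards the field edge
    err = 0.05 + 0.02 * np.abs(phi1) / 90.0
    phi1_obs = phi1 + rng.normal(0.0, err)
    phi2_obs = phi2 + rng.normal(0.0, err)
    # Uniform background contamination over the survey footprint, with
    # the expected count set by a (made-up) number of field stars
    lo1, hi1, lo2, hi2 = field
    n_bkg = rng.poisson(eps_background * n_field)
    bkg1 = rng.uniform(lo1, hi1, n_bkg)
    bkg2 = rng.uniform(lo2, hi2, n_bkg)
    return np.concatenate([phi1_obs, bkg1]), np.concatenate([phi2_obs, bkg2])

phi1 = rng.uniform(-60.0, 0.0, 1000)   # mock stream track coordinates
phi2 = rng.normal(0.0, 0.3, 1000)
obs1, obs2 = observe(phi1, phi2)
```

Because the inference pipeline only requires a forward simulator, replacing this layer with a full selection-function model would not change the analysis methodology.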
In summary, the development of a scalable and flexible simulation-based inference approach to analysing stellar streams can allow us to answer important questions about the evolution of, and substructure in, our own galaxy. Aided by high quality observations from the latest surveys [13, 31-33], we can use this to start asking concrete questions regarding the nature of dark matter, the evolution and structure of the Milky Way, or the dynamics of dwarf galaxies and globular clusters. Achieving this will require development from the perspective of modelling both stream dynamics and survey observations. However, having a robust, simulation-efficient inference strategy is strong motivation for moving further towards this ambitious goal.

FIG. 2. Upper panels: Illustration of the various modelling steps (finding the cluster trajectory, mass loss history, final stream formation, etc.) described in Sec. II. Lower panels: Example mock observation generated using the sstrax modelling code with the parameters in Tab. I, analysed as a case study in Sec. IV.

FIG. 3. Full set of 1d marginal posteriors (orange curves) for all parameters in the sstrax stream model described in Sec. II, applied to the mock observation in Fig. 2. The 1σ, 2σ, and 3σ contours are overlaid behind. Finally, the black vertical lines indicate the true injected parameters from Tab. I.

FIG. 4. Corner plot: Follow-up analysis of the mass loss model given the 1d marginals in Fig. 3. The orange contours show the trained 2d posteriors on the parameters relevant to the mass loss model (ξ0, α, r_h, m). Upper right panel and inset: Derived marginal posteriors (orange contours) in the (ϕ1, ϕ2)-plane on the final cluster position, overlaid on top of the mock target observation. In addition, we highlight the observational errors (black error bars) and the true value (yellow star) to be reconstructed.

FIG. 5. Examples of the truncation procedure in TMNRE applied to five of the model parameters. From left to right, we illustrate the evolution of the posterior estimates for z_c, v_{x,c}, σ_v, r_h, and λ_rel. From top to bottom, we show the development over the number of rounds of the TMNRE algorithm. The insets zoom in on the bounded region (blue vertical lines) to highlight the coverage of the true value (vertical black dotted line).

FIG. 6. Example of the coverage tests applied in this work for the age of the stream t_age. Top panel: Empirical (observed) coverage against the nominal or expected coverage. Bottom panel: The same information as the top panel, but plotted in terms of the corresponding p-values. The red lines indicate the actual coverage results, whilst the blue contours represent the 1σ confidence interval on this estimate.

FIG. 8. Coverage results for the case study given in Sec. IV for all parameters in the sstrax stream model. This is the same information as Fig. 9, but with more emphasis placed on the tail regions via the definition $p = \int_{-z_p}^{z_p} \mathrm{d}z \, (1/\sqrt{2\pi}) \exp(-z^2/2)$. The pink curves indicate the average coverage, whilst the blue contours represent the 1σ uncertainty of this estimate.
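For reference, the mapping between a z-score and the nominal coverage probability p used in these coverage figures is the standard normal integral, which reduces to an error function; a minimal check (the helper name is ours):

```python
import math

def nominal_coverage(z):
    # p = \int_{-z}^{z} (1 / sqrt(2 pi)) exp(-z'^2 / 2) dz' = erf(z / sqrt(2))
    return math.erf(z / math.sqrt(2.0))

# Recovers the familiar confidence levels:
# nominal_coverage(1.0) ≈ 0.6827, nominal_coverage(2.0) ≈ 0.9545
```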

TABLE I. Parameters, prior range choices, and injection values for the stream model parameters described in Sec. II.

TABLE II. Choices for observational model parameters described in Sec. II. (Visible row: background contamination rate ϵ_background = 10⁻³ %.)

TABLE III. Choices for the hyperparameters and settings for the TMNRE algorithm in this work, as described in Sec. III. ⋆This is the minimum number of rounds; if the algorithm has not converged, we continue rounds of inference until the truncation procedure terminates.