GalaxyFlow: Upsampling Hydrodynamical Simulations for Realistic Mock Stellar Catalogs

Cosmological N-body simulations of galaxies operate at the level of "star particles" with a mass resolution on the scale of thousands of solar masses. Turning these simulations into stellar mock catalogs requires "upsampling" the star particles into individual stars following the same phase-space density. In this paper, we introduce two new upsampling methods. First, we describe GalaxyFlow, a sophisticated upsampling method that utilizes normalizing flows to both estimate the stellar phase space density and sample from it. Second, we improve on existing upsamplers based on adaptive kernel density estimation, using maximum likelihood estimation to fine-tune the bandwidth for such algorithms in a way that improves both the density estimation accuracy and upsampling results. We demonstrate our upsampling techniques on a neighborhood of the Solar location in two simulated galaxies: Auriga 6 and h277. Both yield smooth stellar distributions that closely resemble the stellar densities seen in the Gaia DR3 catalog. Furthermore, we introduce a novel multi-model classifier test to compare the accuracy of different upsampling methods quantitatively. This test confirms that GalaxyFlow estimates the density of the underlying star particles more accurately than methods based on kernel density estimation, at the cost of being more computationally intensive.


INTRODUCTION
Large, detailed astronomical surveys are revolutionizing our understanding of the kinematic, photometric, and spectroscopic properties of the Milky Way and its satellites. Surveys such as SDSS (York et al. 2000), DES (The Dark Energy Survey Collaboration 2005), and Gaia (Gaia Collaboration et al. 2016, 2018, 2021) have revealed aspects of the Milky Way's merger history, dark matter substructure, stellar streams, satellite dwarf galaxies, and more. As existing surveys continue and new observatories (e.g., JWST (Gardner 2009), the Vera C. Rubin Observatory (LSST Dark Energy Science Collaboration 2012), and the Nancy Grace Roman Space Telescope (Spergel et al. 2015)) join them, our knowledge of the Milky Way's constituents will continue to grow tremendously.
These massive datasets require equally sophisticated simulations in order to fully understand and interpret the underlying cosmological and astrophysical parameters of the Milky Way. Additionally, accurate simulations provide a proving ground for new analysis methods. In particular, given Gaia's unique capability to measure proper motions, accurate and realistic simulations of Gaia observations must match the kinematics of the Galaxy's stars in addition to matching the stars' spatial distribution. Broadly speaking, two approaches exist to generate synthetic Milky Way-like galaxies with a level of detail similar to actual Gaia observations. In the first approach, synthetic stars are sampled from an analytic model for the Milky Way. For example, Rybizki et al. (2018) generated a Gaia-like catalog from the Besançon model of Robin et al. (2003) using the Galaxia code (Sharma et al. 2011). Such analytic models can be closely tuned to observed properties of the Galaxy. Although those models are beneficial when working with Gaia data, the underlying model of the Galaxy consists of completely smooth distributions with no substructure, streams, or merger history. For analyses that target these and other non-equilibrium properties of the Milky Way, this type of simulation may not serve all necessary purposes.

★ E-mail: sunghak.lim@rutgers.edu
† E-mail: kailash.raman@rutgers.edu
‡ E-mail: mbuckley@physics.rutgers.edu
§ E-mail: shih@physics.rutgers.edu
In the second approach, synthetic observations are taken of fully cosmological N-body simulations of Milky Way-like galaxies. State-of-the-art simulations consider dark matter and baryons with smoothed particle hydrodynamics, generating simulated galaxies that appear very similar to our own in many respects, although, of course, the precise merger history and local environment of the simulated galaxies can differ from those of the Milky Way, which may have important effects on the detailed properties of the stellar kinematics. These simulations have shed light on various interesting anomalies, such as the too-big-to-fail problem (Boylan-Kolchin et al. 2011, 2012; Brooks & Zolotov 2014; Papastergis et al. 2015), the missing satellite problem (Klypin et al. 1999; Moore et al. 1999; Brooks et al. 2013; Sawala et al. 2013, 2016; Wetzel et al. 2016a; Despali & Vegetti 2017; Garrison-Kimmel et al. 2017), and others. Unlike analytic models, N-body simulations naturally contain realistic non-equilibrium effects and kinematically-consistent substructure evolved from galaxy mergers.
However, these N-body simulations all operate at the level of "star particles": objects with masses generally on the order of 10^3 − 10^4 M_⊙, each of which represents an entire population of stars. As a result, these simulations are not directly suitable for comparison with Milky Way stellar populations from large survey data. To create a realistic Gaia-like survey catalog from these N-body simulations, star particles must be spread out in phase space via some "upsampling" technique. The goal of any such technique should be to produce a dataset of upsampled stars that follows the same kinematic phase space distribution as the original star particles, so as to preserve the self-consistent kinematics of the cosmological simulation. In particular, a good upsampling technique should not produce kinematic artifacts or any other artificial substructure at length scales below the simulation resolution (which is O(10^2) pc for current state-of-the-art N-body simulations with baryons); instead it should be "smooth" on scales smaller than the simulation resolution.
Several catalogs which simulate the capabilities of the Gaia telescope have been constructed from N-body simulations of dark matter and baryons. Sharma et al. (2011) and Lowing et al. (2015) are early examples, as are the Aurigaia (Grand et al. 2018) and Ananke (Sanderson et al. 2020) catalogs, derived from the Auriga (Grand et al. 2017) and the Feedback In Realistic Environments (FIRE) Latte simulations (Wetzel et al. 2016b; Hopkins et al. 2018), respectively.
The Ananke and Aurigaia catalogs use the EnBiD algorithm (Sharma & Steinmetz 2006) to estimate the phase space density of the star particles and spread them out by adaptive kernel smoothing. The EnBiD algorithm first tessellates the position/velocity space into hypercubes, each containing a single star particle, using an entropy-based criterion. A length scale of each hypercube (such as a side length) can be used as a bandwidth for the (Gaussian) kernel smoothing. The upsampling is accomplished by drawing samples from the kernel.
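As a minimal sketch of the kernel-smoothing step (not the EnBiD implementation itself; the particle positions, bandwidths, and function names here are illustrative), each star particle can be spread into synthetic stars by drawing from a Gaussian kernel centered on it:

```python
import numpy as np

def upsample_kernel(particles, bandwidths, n_per_particle, seed=0):
    """Spread each star particle into synthetic stars drawn from a Gaussian
    kernel centered on it, with a diagonal bandwidth per phase-space axis."""
    rng = np.random.default_rng(seed)
    stars = [x + h * rng.standard_normal((n_per_particle, x.size))
             for x, h in zip(particles, bandwidths)]
    return np.concatenate(stars, axis=0)

# Toy example: 3 particles in 6D phase space, upsampled 100x each.
particles = np.random.default_rng(1).normal(size=(3, 6))
bandwidths = np.full((3, 6), 0.1)
stars = upsample_kernel(particles, bandwidths, 100)
```

In practice the per-particle bandwidths would come from the EnBiD hypercube length scales rather than being constant.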
When using such kernel-based methods, it is important to choose an appropriate bandwidth for the kernel in order to reduce the bias from the smoothing. In particular, if the bandwidth is chosen too small, as was the case in both Ananke and Aurigaia, then the upsampled distributions will be too clumpy. This lack of smoothness can be seen in plots of proper motion and Galactocentric velocity (see Figure 1), which reveal an unphysical blotchiness in the stars of the simulated Gaia catalogs.
To improve on the poor upsampling performance of the existing Gaia mock catalogs, this paper introduces two new upsampling algorithms. Our major result is GalaxyFlow, a novel synthetic star upsampling technique that employs state-of-the-art density estimation methods, specifically normalizing flows (for recent reviews and original references, see e.g. Kobyzev et al. 2021; Papamakarios et al. 2021). Normalizing flows have previously been used to accurately estimate the phase space density of stars in a galaxy. Green & Ting (2020) was an early study demonstrating the idea on a Plummer sphere. An et al. (2021) and Naik et al. (2022) confirm the idea on mock datasets from analytic galaxy models using masked autoregressive flows (MAF; Papamakarios et al. 2017), and show that the estimated density can be used to recover the gravitational potential and acceleration. Buckley et al. (2023) applies a MAF to estimate the density of stars in an N-body simulated galaxy and uses the density estimate to infer the local gravitational acceleration and mass density. MAFs are used in Shih et al. (2022) as part of a model-agnostic "anomaly detector" to identify stellar streams in Gaia data. These successes of flows in accurately estimating stellar phase space densities in both simulated and actual astronomical datasets give us confidence that they should be effective upsamplers of galaxy simulations.
In this work, instead of using a MAF, we will use another type of normalizing flow: continuous normalizing flows (CNF; Chen et al. 2018; Grathwohl et al. 2019), which have an excellent inductive bias for modeling smooth distributions. CNFs have been shown to be effective in recovering the densities of various analytic models (Green et al. 2023). This theoretical and empirical evidence indicates that a CNF can be a good flow model for reducing the clumpiness of the upsampled dataset. In Appendix A, we will explicitly show that a CNF performs better than a MAF by our upsampling evaluation metrics.
In addition to GalaxyFlow, we develop a fine-tuned EnBiD-based upsampler. To mitigate the inherent smoothing bias that limits the EnBiD algorithm's ability to replicate the distribution of the original star particles, we introduce a post-optimization technique to refine the bandwidths given by the original EnBiD algorithm. This method treats the kernel density estimate as a Gaussian mixture model (as they have the same functional form), allowing us to adjust the bandwidths through maximum likelihood estimation of an overall bandwidth scale parameter. This optimized EnBiD algorithm yields a more accurate stellar distribution than both the default EnBiD parameters and the small-bandwidth EnBiD setup used in constructing the Aurigaia and Ananke catalogs. As we will describe, GalaxyFlow and fine-tuned EnBiD have complementary strengths and weaknesses: the latter is computationally lightweight but still suffers from some amount of oversmoothing, while the former is computationally expensive but more faithful to the underlying phase space distribution.
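The bandwidth fine-tuning idea can be illustrated as follows: treating the kernel density estimate as an equal-weight Gaussian mixture, a global scale factor on the bandwidths is chosen to maximize the likelihood of held-out data. This is a simplified sketch, with a grid search standing in for the maximum likelihood optimization described in the paper; all names and values are illustrative:

```python
import numpy as np

def gmm_loglike(data, centers, bandwidths, alpha):
    """Log-likelihood of `data` under an equal-weight Gaussian mixture with
    one diagonal component per star particle and bandwidths scaled by alpha."""
    h = alpha * bandwidths                                        # (N, D)
    diff = data[:, None, :] - centers[None, :, :]                 # (M, N, D)
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * h**2), axis=1)    # (N,)
    log_comp = log_norm - 0.5 * np.sum((diff / h) ** 2, axis=2)   # (M, N)
    m = log_comp.max(axis=1, keepdims=True)                       # log-sum-exp trick
    return float(np.sum(m[:, 0] + np.log(np.mean(np.exp(log_comp - m), axis=1))))

def fit_bandwidth_scale(centers, held_out, bandwidths, grid):
    """Pick the overall bandwidth scale that maximizes the held-out
    likelihood (grid search standing in for a proper MLE optimizer)."""
    scores = [gmm_loglike(held_out, centers, bandwidths, a) for a in grid]
    return grid[int(np.argmax(scores))]
```

A too-small scale makes held-out points fall between the spiky kernels, and a too-large scale oversmooths, so the held-out likelihood peaks at an intermediate value.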
Both of our proposed upsampling methods can be applied to a wide range of N-body simulations. As a concrete demonstration, we use the star particles in a simulated galaxy called Auriga 6 (Grand et al. 2017), the underlying N-body simulation used for building the Aurigaia catalog. Though we use the Auriga 6 simulation for specificity, nothing about GalaxyFlow or the kernel method depends on the details of this simulation. In Appendix C, we show the results applied to the h277 galaxy simulation generated by the N-Body Shop (Zolotov et al. 2012; Loebman et al. 2012).
As this paper aims to be a proof-of-concept demonstration of GalaxyFlow, we will concentrate exclusively on the sampling of stellar distributions in position and velocity space. This allows a straightforward upsampling performance comparison between the normalizing flow and EnBiD algorithms. Furthermore, we will narrow our focus to the subset of stars in a spherical region within a 3.5 kpc radius of the Sun, as going out to farther radii increases the training time dramatically. Limiting ourselves to the Solar neighborhood also allows us to make qualitative comparisons to the subset of Gaia DR3 stars that have full 6D positions and velocities. As we will see, considering stars within this window is sufficient for a detailed and definitive demonstration of the efficacy of the GalaxyFlow method.
In addition to showing that the GalaxyFlow- and kernel-based upsampled stellar distributions replicate the star particle distributions by eye, we develop a novel neural network-based multi-model classifier test to determine quantitatively which upsampling method generates a set of synthetic stars closer to the phase-space density of the original star particles. The classifier is trained to distinguish samples from multiple upsamplers, and its output can be used to assign likelihoods to each upsampler given a reference dataset that is not used for training the upsamplers. With this multi-model classifier test, we will quantitatively see that GalaxyFlow and EnBiD are separable, and that the reference data is much closer to GalaxyFlow than to EnBiD. This holds true even when comparing GalaxyFlow to our optimized kernel-based upsampling method. Note that the effectiveness of the multi-model classifier test can be affected by the characteristics of the classifier used. For example, we employ a multilayer perceptron (MLP) as our classifier model, which is known to be slow to learn high-frequency modes (Rahaman et al. 2019; Xu et al. 2019; Yang et al. 2022). The clumpy distribution generated by the small-bandwidth EnBiD used in the Ananke and Aurigaia catalogs has high-frequency modes. In Appendix B, we address this by setting up a small-bandwidth EnBiD upsampler that generates stellar distributions similar to the blotchy Aurigaia catalog, and we demonstrate the validity and precision of the test when the training efficiency issue is prominent.
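A toy version of the classifier test might look like the following, with a numpy logistic regression standing in for the MLP classifier described in the text (the names, data, and restriction to two upsamplers are illustrative assumptions):

```python
import numpy as np

def train_logistic(xa, xb, steps=2000, lr=0.1):
    """Train a linear classifier separating samples of upsampler A (label 1)
    from upsampler B (label 0); a stand-in for the MLP in the text."""
    X = np.concatenate([xa, xb])
    X = np.hstack([X, np.ones((len(X), 1))])          # append bias feature
    y = np.concatenate([np.ones(len(xa)), np.zeros(len(xb))])
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w += lr * X.T @ (y - p) / len(y)              # gradient ascent on log-likelihood
    return w

def mean_log_odds(w, reference):
    """Average log p(A|x) - log p(B|x) over the reference dataset;
    positive values mean the reference looks more like upsampler A."""
    X = np.hstack([reference, np.ones((len(reference), 1))])
    return float(np.mean(X @ w))
```

The classifier output is a log-likelihood ratio between the upsamplers, so evaluating it on held-out reference data indicates which upsampler the reference distribution is closer to.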
In future work, we will apply GalaxyFlow to full hydrodynamical simulations, in order to produce complete and realistic Gaia mock catalogs. A full realistic synthetic catalog must also sample stars over a range of masses and metallicities (as well as realistic dust extinction). As we will discuss in our conclusions, these do not present insurmountable problems, but we defer a full algorithm including these effects to future studies.
The outline of our paper is as follows. In Section 2, we describe the mock Gaia catalogs and simulations we use in this paper and compare the mock catalogs to the Gaia Data Release 3 (DR3) catalog. In Section 3, we introduce GalaxyFlow and summarize the details of normalizing flows for the upsampling. In Section 4, we explain the optimized EnBiD algorithm with additional bandwidth fine-tuning used in this paper. In Section 5, we introduce our multi-model classifier test for comparing the performance of upsamplers, and we qualitatively and quantitatively compare the upsampled datasets from GalaxyFlow and the EnBiD algorithm for the Auriga 6 simulation. Section 6 shows the computational speed of the upsampling algorithms considered in this paper. We conclude in Section 7 by discussing the limitations of our current upsampling and outlining future steps toward a full upsampling algorithm. Additionally, in Appendix A, we compare our upsampling results using a CNF with those of a MAF. In Appendix B, we revisit the small-bandwidth EnBiD in order to discuss the potential limitations of the multi-model classifier test using MLP-based classifiers. In Appendix C, we demonstrate the generality of our method by applying it to a second hydrodynamical simulation, the h277 galaxy.

CURRENT STATUS OF MOCK Gaia CATALOGS
To present the current status of mock Gaia catalogs, in this section we consider two different mock catalogs of Gaia-like observations of a Milky Way-analogue galaxy: Ananke (Sanderson et al. 2020), based on the FIRE Latte simulations (Wetzel et al. 2016b; Hopkins et al. 2018); and Aurigaia (Grand et al. 2018), derived from the Auriga N-body simulations (Grand et al. 2017).
The Latte simulations (Wetzel et al. 2016b; Hopkins et al. 2018), used to generate the Ananke dataset, were produced using the GIZMO code (Hopkins 2015). The star particle mass in these simulations is initially 7070 M_⊙. The snapshot has an average particle mass of 5000 M_⊙, as the star particle masses can shrink over time due to the life cycle of individual stars in the assumed stellar population. The Auriga suite of simulations was produced using the Arepo magneto-hydrodynamic code (Springel 2010), from comoving 100 Mpc snapshots of the EAGLE project (Schaye et al. 2015). The characteristic dark matter particle mass is 4 × 10^4 M_⊙, and the baryonic particle mass is ∼5000 M_⊙. Auriga 6, chosen due to its similarity to the Milky Way, has a total mass of 1.01 × 10^12 M_⊙ within the virial radius.
These N-body simulations both contain O(10^7) star particles in their Milky Way equivalents. From these particles, the positions and velocities of the O(10^10) individual stars needed for a realistic simulation of the Gaia catalog are drawn through an upsampling process. As discussed in the Introduction, both Aurigaia and Ananke used the EnBiD algorithm as the core of their upsampling method. This code estimates the phase-space volume taken up by each massive star particle; this information is then used to spread out the distribution of upsampled stars using a series of probability distributions (chosen to be Gaussian) centered on each star particle in the sampling step (see Section 4 for additional discussion of the EnBiD algorithm). In addition to kinematic information, the upsampling must also account for stellar masses, ages, metallicities, and observational effects such as dust extinction. In this work, we are primarily concerned with the kinematic sampling and defer these other important stellar properties to later work.
The kinematic sampling approaches used in Aurigaia and Ananke reproduce many features of the real Gaia data, particularly in the spatial distribution of the upsampled stars. However, especially in velocity space, residual traces of the progenitor star particles remain and can be clearly seen in relatively simple plots of the distribution of stars. As an example, in Figure 1 we show, for all stars with measured position and velocity in a 15° observational window centered on the ICRS coordinate (α, δ) = (167.47°, −4.2°), the distribution of proper motions on the sky as well as stellar velocities in Galactocentric coordinates. The center of the 15° window was chosen to be off of the Galactic disk but is otherwise randomly selected. These distributions are shown for real Gaia DR3 data (Gaia Collaboration et al. 2021) as well as for the Ananke and Aurigaia simulations.
The upsampled stars exhibit clustering and streaking in the velocity and proper motion that are not present in the actual Milky Way population as seen by Gaia. We verify in Appendix B that the visually-apparent clusters of simulated stars in proper motion and velocity space correspond to individual parent star particles in the original N-body simulation.
Note that the three datasets have very different numbers of stars in the same angular patch on the sky (139,164; 94,970; and 2,192,188 stars for Gaia, Ananke, and Aurigaia, respectively). The order-of-magnitude difference in statistics between the mock catalogs is mainly because the Aurigaia catalog provides radial velocities of all the stars (Grand et al. 2018), while the Ananke catalog provides the information only for the stars satisfying the magnitude and temperature limits of Gaia DR2 (Sanderson et al. 2020). A similar order-of-magnitude difference can be seen in the total number of sources in Gaia DR3 (Gaia Collaboration et al. 2023) and the number of sources with radial velocities in Gaia DR2 (Gaia Collaboration et al. 2018). As this work is concerned only with the distribution of the stars in phase space, which can be seen regardless of these factors, we do not correct for them.

[Footnote 2: We select stars with measured radial velocities and with observed parallax more than 3σ away from zero parallax to remove poorly measured stars. We further require positive parallax.]
[Footnote 3: Although we compare Gaia DR3 to the Ananke and Aurigaia DR2 mock catalogs, the difference between DR2 and DR3 cannot explain the clustering and streaking shown in the latter panels of Figure 1.]
[Figure 1 caption: Distributions for stars in the observational window with measured radial velocities; we additionally require that the "observed" parallax is more than 3σ away from ϖ = 0, in order to remove stars with poorly measured distances. Distributions are shown for real Gaia DR3 data (top row), for Ananke (middle row), and for Aurigaia (bottom row). The Gaia DR3, Ananke, and Aurigaia datasets have 139,164; 94,970; and 2,192,188 stars satisfying the selection criterion, respectively. Note that the Aurigaia catalog presents radial velocities of all the stars, so the coverage and sample statistics are quite different from the other datasets.]
To further illustrate the comparative smoothness of Gaia, we show the full phase space distribution in ICRS coordinates for the Gaia DR3 stars in Figure 2. The full sample (not requiring a measured radial velocity) contains 979,185 stars. Only a small fraction of stars have measured radial velocities, so the radial velocity plots are statistically limited, with only 135,105 stars (all other plots use the full sample of stars). Observational effects are apparent in the distance histogram, which falls off after 1 kpc instead of continuing to rise as expected for the underlying distribution of stars. Most notably, we see no significant "blotchiness" in the resulting distributions.
Such spurious blotchiness often appears in kernel density estimation when the chosen bandwidth is too small, which appears to be the case in the Aurigaia and Ananke synthetic observations. To address this failure mode, one could use a larger bandwidth for each kernel. When doing so, we must be careful about the smoothing bias that is inherent in kernel density estimation with finite sample sizes: the upsampled stars will follow the smeared phase-space density of the star particles, not the actual phase-space density (García-Portugués 2023). To upsample stars while avoiding both blotchiness and smoothing bias, an alternative approach is needed. In this work, we provide two such methods: GalaxyFlow, using normalizing flows (Sec. 3), and a kernel-based method with an algorithm that selects the bandwidth to avoid clumping without significant oversmoothing (Sec. 4).

Benchmark Datasets for Training Upsamplers
Our goal is to generate upsampled stars from N-body simulations of Milky Way-like galaxies whose stellar kinematics display less clumping than found in the existing suites of mock Gaia datasets (most notably in velocity space) and accurately reflect the true phase-space density of the star particles. To achieve this, we train upsamplers on the kinematics of star particles from fully-cosmological N-body simulations containing both baryons and dark matter, and compare the upsampling quality between algorithms. In the main body of this paper, we focus on the Auriga 6 galaxy (Grand et al. 2017), which is the simulated galaxy used for constructing the Aurigaia mock catalog. The details of the simulation which produced Auriga 6 were described in the previous section. In Appendix C, we work with the h277 galaxy to cross-validate our general GalaxyFlow method and optimized EnBiD algorithm on a different mock dataset.
The Galactocentric coordinate system of Auriga 6 is oriented such that the galactic disk lies in the x-y plane, with the assumed Solar location lying in the −x direction. The z coordinate points out of the disk, oriented so that the local motion of disk stars is in the +y direction.
The star particles within 3.5 kpc of the default Solar location of the Aurigaia catalog, (−8.0, 0, 0.02) kpc, are selected for training our upsamplers. The number of selected star particles in Auriga 6 is 482,412, with a maximum speed of 858.15 km/s. Although we limit the set of star particles to be upsampled, this dataset is sufficient for comparing the upsampling performance of various algorithms. Training upsamplers over a full galaxy will be addressed in future work.
The Aurigaia mock catalog aims to model the real Gaia data accurately and so must sample stellar populations from isochrones, creating synthetic stars with a range of magnitudes and colors. After this sampling, realistic observational effects must be applied. In this work, we are interested only in the sampling of the kinematic phase space, comparing our new techniques using normalizing flows and the optimized EnBiD-based approach. For this comparison, the full simulation of synthetic stars (including isochrones and observational effects) is unnecessary, so we omit these steps.

Normalizing Flows
GalaxyFlow is based on a type of generative model called a normalizing flow. Normalizing flows are a class of neural networks that learn a transformation from a simple base probability distribution to the unknown probability distribution from which the training data was drawn. A fully trained normalizing flow allows both density estimation of a dataset and sample generation from the approximated density distribution. A summary of the idea is as follows.
Let z⃗ and x⃗ be d-dimensional random variables following the base distribution and the training data distribution, respectively. The standard normal distribution is conventionally used as the base distribution because of its tractability. Normalizing flows learn a bijective transformation f between z⃗ and x⃗:

x⃗ = f(z⃗).   (1)

Once the network has learned this transformation, it is straightforward to upsample the dataset using normalizing flows: randomly select samples from a multidimensional standard normal distribution and then transform them to synthetic data in the training space using the above formula. By composing a chain of simple, nonlinear bijective functions f_i,

f = f_K ∘ ⋯ ∘ f_2 ∘ f_1,   (2)

one can build a highly expressive family of bijections f that are capable of modeling non-trivial distributions. For computational efficiency, we require that the inverses and Jacobian determinants of those simple bijections are easy to compute. Training normalizing flows is based on maximum likelihood estimation. If the inverse transformation of f and the Jacobian determinant |∂z⃗/∂x⃗| are easy to compute, the change of variables formula can be used to compute the probability density function of x⃗:

p_x(x⃗) = p_z(f^{-1}(x⃗)) |det(∂z⃗/∂x⃗)|.   (3)

By maximizing log p_x(x⃗) (summed over the training data) with respect to the parameters of f, the normalizing flow can be fit to the data and learn the underlying probability density.
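The change-of-variables machinery can be illustrated with a one-dimensional toy flow consisting of a single affine bijection (the parameters mu and s are illustrative; a real flow learns a chain of such maps):

```python
import numpy as np

def base_logpdf(z):
    """Log density of the standard normal base distribution."""
    return -0.5 * (z**2 + np.log(2 * np.pi))

# A single affine bijection x = f(z) = mu + s * z; a real flow composes many
# such maps with learned parameters.
mu, s = 1.5, 2.0
f = lambda z: mu + s * z
f_inv = lambda x: (x - mu) / s

def flow_logpdf(x):
    """Change of variables: log p_x(x) = log p_z(f^{-1}(x)) + log|dz/dx|."""
    return base_logpdf(f_inv(x)) + np.log(1.0 / s)

# Sampling: draw from the base distribution and push through f.
samples = f(np.random.default_rng(0).standard_normal(100_000))
```

Here the Jacobian factor |dz/dx| = 1/s is trivial; the point of flow architectures is to keep this factor cheap even when f is a deep composition.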
In addition, normalizing flows are able to learn conditional probabilities if the transformation f is conditioned. This capability is especially useful in modeling the phase space density p(x⃗, v⃗), as the full 6-dimensional density need not be learned all at once. Instead, we borrow the well-performing setup from our previous work (Buckley et al. 2023) and use two separate flows, one learning the position density p(x⃗) and one learning the velocity density conditioned on position, p(v⃗|x⃗). The phase space density p(x⃗, v⃗) is modeled by their product as follows.
p(x⃗, v⃗) = p(x⃗) p(v⃗|x⃗).   (4)

[Footnote 5: In Buckley et al. (2023), this decomposition was necessary in order to efficiently evaluate velocity integrals at a given position, which appear while solving the Boltzmann equation and Gauss's law. For the upsampling problem, we need only the phase space density, so this decomposition is not strictly necessary.]

Continuous Normalizing Flows
Within the general class of normalizing flows, we must choose an implementation suited to smoothly upsampling star particles. Continuous normalizing flows (CNF; Chen et al. 2018; Grathwohl et al. 2019) are a good candidate, with an inductive bias suitable for this problem because the learned transformation smoothly deforms the base distribution into the target distribution.
More specifically, a CNF learns the following infinitesimal transformation,

dz⃗(t)/dt = g(z⃗(t), t).   (5)

Here, the function g is a neural network representing the derivative dz⃗/dt of the trajectory of transformed variables at a latent time t. The Jacobian determinant of the transformation can be obtained by solving the following form of the Fokker-Planck equation with zero diffusion (Chen et al. 2018), describing the time evolution of the log probability log p(z⃗(t); t) along the trajectory z⃗(t) at time t:

d log p(z⃗(t); t)/dt = −Tr(∂g/∂z⃗(t)).   (6)

The trace computation in this equation is often a bottleneck during training, so Hutchinson's trace estimator (Grathwohl et al. 2019) can be used to speed up this step. In our case, the cost of evaluating the trace is manageable since we train CNFs on 3D densities; we explicitly evaluate the trace during training.
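The coupled ODE system can be illustrated with a toy linear vector field, for which the Jacobian trace is constant and the result can be checked analytically (this is a numpy sketch, not the Torchdyn-based implementation used in the paper):

```python
import numpy as np

def cnf_push(z0, logp0, A, t1=1.0, steps=20):
    """Integrate dz/dt = g(z) = A z together with the continuity equation
    d log p / dt = -Tr(dg/dz) = -Tr(A): RK4 for the state, and an exact
    update for the log density since the trace is state-independent here."""
    dt = t1 / steps
    z, logp = np.array(z0, dtype=float), logp0
    for _ in range(steps):
        # Classic RK4 step for the state z(t).
        k1 = z @ A.T
        k2 = (z + 0.5 * dt * k1) @ A.T
        k3 = (z + 0.5 * dt * k2) @ A.T
        k4 = (z + dt * k3) @ A.T
        z = z + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        # Log-density bookkeeping along the trajectory.
        logp -= dt * np.trace(A)
    return z, logp
```

For a general neural vector field g, Tr(∂g/∂z⃗) must be evaluated by automatic differentiation (or estimated with Hutchinson's estimator), which is exactly the step discussed above.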
Our CNF is implemented using Torchdyn (Poli et al. 2022) with the PyTorch backend (Paszke et al. 2019). The neural ODEs are solved by the fourth-order Runge-Kutta method in both the forward and backward directions. For the backward direction, we could numerically invert the forward calculation to obtain exact invertibility of the flows (Behrmann et al. 2019; Chen et al. 2019), but we instead use the numerical solver again for fast calculation. We will see that this approximate inverse transform is still accurate enough to yield a good upsampler. We set the distribution at t = 0 to be the base distribution and that at t = 1 to be the target distribution, and the latent time interval is divided into 20 steps. The derivative function g is modeled by a multilayer perceptron (MLP) consisting of 4 hidden layers with 32 nodes each. GELU activations (Hendrycks & Gimpel 2016) are used in order to model a smooth transformation. For our conditional CNF modeling p(v⃗|x⃗), the conditioning variables x⃗ are provided as extra inputs to g.

Training Normalizing Flows for Upsampling Star Particles
We preprocess the positions and velocities of the star particles in order to standardize the inputs and handle the sharp boundary at the edge of the training volume. The CNFs model a smooth transformation in the latent space, so we preprocess to avoid discontinuities. The position vectors x⃗ are preprocessed as follows in order to construct normalizing flows on an open ball in Euclidean space (Buckley et al. 2023):
(i) Centering and scaling: We scale the positional coordinates to transform the 3.5 kpc radius sphere centered on the Sun into a unit ball via the linear transformation

u⃗ = (x⃗ − x⃗_⊙) / (ϵ r_max),

where r_max is the radius of the selection window, x⃗_⊙ is the location of the Sun in the galaxy, and ϵ = 1.000001 is a small constant to avoid putting stars exactly on the boundary.
(ii) Radial transformation: We then expand the unit ball to cover all of Euclidean space. This transformation is designed to have well-defined derivatives at the origin.
(iii) Standardization: We finally rescale the position coordinates by the widths of their distributions:

x̂_i = (x_i − μ_i) / σ_i,

where μ_i and σ_i are the mean and standard deviation of coordinate x_i in the training dataset, respectively.
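Steps (i) and (iii) can be sketched as follows; the linear map in step (i) is a reconstruction consistent with the description above, step (ii) is omitted since its exact form is not reproduced here, and the function names are illustrative:

```python
import numpy as np

R_MAX = 3.5                            # kpc, radius of the selection window
EPS = 1.000001                         # keeps stars off the unit-ball boundary
X_SUN = np.array([-8.0, 0.0, 0.02])    # assumed Solar location, kpc

def to_unit_ball(x):
    """Step (i): map the 3.5 kpc sphere around the Sun into the open unit ball."""
    return (x - X_SUN) / (EPS * R_MAX)

def standardize(u, mean, std):
    """Step (iii): rescale each coordinate by training-set statistics."""
    return (u - mean) / std
```

Both maps are linear and trivially invertible, which matters because the same transformations must be undone when generating samples.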
The velocity distribution does not have a sharp boundary at the edge of the observational sphere, so preprocessing the velocity vectors consists only of standardizing the inputs (Step (iii) above). Two normalizing flows regressing p(x⃗) and p(v⃗|x⃗) in the preprocessed coordinates are trained as follows. The loss function is defined as the negative log-likelihood of the flow evaluated on the data. We use the ADAM optimizer (Kingma & Ba 2014) with a learning rate of 10^−3 to minimize the loss function. We randomly select 20% of the training samples as a validation dataset. The remaining training samples are randomly split into 10 mini-batches for each epoch of training. We stop the training when the validation loss has not improved over 50 epochs. The trained network is further refined by restarting training with a reduced learning rate of 10^−4. We train 10 instances of normalizing flows initialized with different random seeds and take an ensemble average to further improve performance.
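The early-stopping logic described above can be sketched generically (a toy numpy skeleton; the actual training uses PyTorch with ADAM and mini-batches, and the function names here are illustrative):

```python
import numpy as np

def train_with_early_stopping(val_loss, step, params, patience=50, max_epochs=10_000):
    """Stop training when the validation loss has not improved for
    `patience` consecutive epochs, and keep the best parameters seen."""
    best_loss, best_params, stale = np.inf, params, 0
    for _ in range(max_epochs):
        params = step(params)            # one epoch of optimizer updates
        loss = val_loss(params)
        if loss < best_loss:
            best_loss, best_params, stale = loss, params, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_params, best_loss
```

The same loop can be rerun with a smaller learning rate starting from the best parameters, mirroring the fine-tuning stage described above.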
We generate position and velocity samples using the normalizing flows as follows.
(i) Sample Gaussian random numbers from the base distributions.
(ii) Use the trained normalizing flows to transform the random numbers into position and velocity vectors in the preprocessed space.
(iii) Invert the preprocessing transformations to map the vectors in the preprocessed space back to physical position and velocity vectors.
(iv) Remove outliers by discarding samples with distances from the Sun larger than 3.5 kpc, or with speeds exceeding the maximum speed of the training sample.
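The final outlier cut of step (iv) amounts to the following (a numpy sketch with the selection values quoted from the text; the function name is illustrative):

```python
import numpy as np

def cut_outliers(pos, vel, r_max=3.5, v_max=858.15,
                 x_sun=np.array([-8.0, 0.0, 0.02])):
    """Step (iv): discard generated stars farther than r_max (kpc) from the
    Sun or faster than the maximum training-set speed (km/s)."""
    keep = (np.linalg.norm(pos - x_sun, axis=1) <= r_max) & \
           (np.linalg.norm(vel, axis=1) <= v_max)
    return pos[keep], vel[keep]
```

Only a small fraction of samples should be removed this way if the flow has learned the support of the training data well.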

EnBiD DENSITY ESTIMATION
We compare the phase-space distributions generated by the normalizing flows to the distribution generated using the EnBiD algorithm (Sharma & Steinmetz 2006), the adaptive kernel density estimation underlying the upsampling methods used to construct the existing state-of-the-art Gaia mock catalogs. In this paper, we use the EnBiD algorithm as implemented in EnLink 0.1.0 (Sharma & Johnston 2009; Sharma 2020). Here, we briefly summarize how the EnBiD algorithm determines the bandwidth of the kernels and how stars are upsampled afterward. We then describe a method to select an optimal bandwidth for the kernel, preventing the clumpiness that characterises existing synthetic stellar populations constructed using kernel-based methods.
Given a set of star particles in phase space, EnBiD first partitions the phase space into disjoint boxes, each containing exactly one particle. Each box represents a volume with a unit probability of finding a particle, $1/N$, where $N$ is the total number of particles. The boxes are constructed using the following entropy-based criterion:
(i) Start with a box containing all the star particles.
(ii) For a given box containing $n_p$ star particles, calculate the Shannon entropy along each axis. The entropy is estimated by modeling the probability density as a histogram with $n_p$ equal-sized bins.
(iii) Select the axis with minimum entropy.
(iv) Divide the box in two by splitting the selected axis at the midpoint between the nearest upper and lower neighbors of the mean value of the corresponding coordinate.
(v) Repeat the above splitting until all the boxes contain only one particle.
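A minimal sketch of this splitting rule is given below. It is a simplified stand-in for the actual EnBiD implementation (and assumes points in general position), returning one box volume per particle; the inverse volume then serves as the density proxy.

```python
import numpy as np

def shannon_entropy(vals, n_bins):
    """Entropy of a histogram density estimate with n_bins equal bins."""
    counts, _ = np.histogram(vals, bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

def enbid_box_volumes(points):
    """One box per particle via entropy-based binary splitting (sketch)."""
    n, d = points.shape
    vols = np.empty(n)

    def split(idx, lo, hi):
        if len(idx) == 1:
            vols[idx[0]] = np.prod(hi - lo)   # unit-probability box
            return
        sub = points[idx]
        # (ii)-(iii): split along the minimum-entropy axis
        ax = int(np.argmin([shannon_entropy(sub[:, k], len(idx))
                            for k in range(d)]))
        # (iv): cut midway between the nearest neighbors of the mean
        m = sub[:, ax].mean()
        cut = 0.5 * (sub[sub[:, ax] <= m, ax].max()
                     + sub[sub[:, ax] > m, ax].min())
        left, right = idx[sub[:, ax] <= cut], idx[sub[:, ax] > cut]
        hi_l, lo_r = hi.copy(), lo.copy()
        hi_l[ax] = cut
        lo_r[ax] = cut
        split(left, lo, hi_l)   # (v): recurse until one particle per box
        split(right, lo_r, hi)

    split(np.arange(n), points.min(axis=0), points.max(axis=0))
    return vols
```

Since the boxes partition the root bounding box, the volumes sum to the total volume, and $1/(N\,\mathrm{vol}_i)$ estimates the phase-space density around particle $i$.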
Dividing the axis with minimum entropy prioritizes splitting along axes with clustered structure. This box splitting rule results in small box volumes in regions of high density, so the inverse of a box's volume can be used as a proxy for the phase-space density within that box. Each star particle $i$ can then be upsampled by drawing from (Gaussian) kernels with smoothing bandwidths $\vec{h}(i)$ determined by the hypercube length scales. These per-star-particle smoothing bandwidths can be viewed as the ultimate output of the EnBiD algorithm. We note that Aurigaia (Lowing et al. 2014; Grand et al. 2018) and Ananke (Sanderson et al. 2020) appear to have incorrectly extracted the $\vec{h}(i)$ parameters from EnBiD, leading to upsampling bandwidths that are much too small. This explains why their upsampled star distributions are so blotchy in phase space.

Fine-Tuning Bandwidths using a Mixture Model
The clumpiness visible in existing upsampled datasets is not a fundamental limitation of EnBiD and can be alleviated if the bandwidth is appropriately set. Here we describe an algorithm that further refines the EnBiD bandwidths to enhance the upsampling quality. The resulting kinematic distributions are smooth, though we will show that some oversmoothing remains and that samples drawn from the CNF are quantitatively more similar to the underlying data.
In order to fine-tune the bandwidths, we construct a likelihood model for the bandwidth scaling parameter $s_h$ by treating the kernel density estimate with bandwidths scaled by $s_h$ as a Gaussian mixture model. The resulting likelihood function of $s_h$ is
$$\mathcal{L}(s_h) = \prod_{i} \frac{1}{N} \sum_{j=1}^{N} \mathcal{N}\!\left(\vec{w}^{(i)} \,\middle|\, \vec{w}(j),\, s_h\, \vec{h}(j)\right).$$
Here, $N$ is the number of samples in the kernel density estimate, $\mathcal{N}(\vec{w}\,|\,\vec{\mu}, \vec{\sigma})$ denotes a Gaussian kernel with mean $\vec{\mu}$ and bandwidth $\vec{\sigma}$, and $\vec{w}(j)$ is the position and velocity vector of the $j$-th training sample. The scale parameter $s_h$ adjusts the overall scale of the kernel bandwidths $\vec{h}(j)$ given by the EnBiD algorithm using $k = 64$ nearest neighbors. We note that this modeling can be used for fine-tuning the bandwidths given by any adaptive kernel density estimation.
To find the optimal $s_h$, we minimize the negative log-likelihood on the validation dataset. Since this optimization involves a single parameter, we employ the Newton-Raphson method for its efficiency and fast convergence. We present the bandwidth-scale optimization results in Figure 3. For the upsampling using half of the star particles as the validation set, the optimal scale parameter is $s_h = 0.356 \pm 0.001$. We use this scaling parameter to correct the EnBiD bandwidth for each particle. This optimized choice allows us to avoid the clumpiness of the existing upsampled catalogs while not greatly oversmoothing the phase-space density.
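The fine-tuning procedure can be sketched as follows; for simplicity the sketch replaces the Newton-Raphson step with a one-dimensional grid search over the scale parameter $s_h$, which is adequate because the negative log-likelihood is a smooth function of a single parameter.

```python
import numpy as np

def kde_nll(s_h, train, bw, val):
    """Mean negative log-likelihood of validation points under a
    Gaussian-mixture KDE with per-particle diagonal bandwidths s_h * bw."""
    diff = val[:, None, :] - train[None, :, :]        # (n_val, n_train, d)
    var = (s_h * bw[None, :, :]) ** 2
    log_kern = -0.5 * np.sum(diff**2 / var + np.log(2 * np.pi * var), axis=-1)
    log_p = np.logaddexp.reduce(log_kern, axis=1) - np.log(len(train))
    return -log_p.mean()

def optimize_scale(train, bw, val, grid=None):
    """Pick the bandwidth scale minimizing the validation NLL (grid search
    stands in for the Newton-Raphson optimization used in the paper)."""
    if grid is None:
        grid = np.geomspace(0.05, 5.0, 41)
    nlls = [kde_nll(s, train, bw, val) for s in grid]
    return grid[int(np.argmin(nlls))]
```

In toy experiments with unit raw bandwidths the optimum typically lands well below 1, mirroring the finding above that the raw EnBiD bandwidths should be shrunk.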
To further improve the resulting dataset, we perform EnBiD upsampling only after applying the preprocessing outlined in Section 3.3. This preprocessing also helps mitigate the boundary bias inherent in kernel density estimation, which occurs when a kernel placed near the data boundary extends beyond it, causing the density to be underestimated near the edges and generating outliers. The preprocessing steps push the boundary to infinity and thus greatly reduce this form of bias.
Compared to the GalaxyFlow upsampler, our EnBiD upsampler with a fine-tuned kernel is far less computationally expensive (see Sec. 6). However, as we will discuss in the next section, though the resulting upsampled stellar populations appear smooth by eye, GalaxyFlow outperforms our kernel-based method on our full classifier metrics.

Qualitative Comparisons
As Gaia observations cover the full sky, and studies of Galactic kinematics are concerned with the correlations of stellar velocities with their location in position space, we compare the results of upsampling the parent star particles in the same observational window as in Figure 1. For Auriga 6, this patch contains 3,015 star particles. We upsample the star particles by a factor of 500 to obtain statistics comparable to those of the Aurigaia catalog: 2,192,188 stars.
In Figure 4, we plot the proper motions of stars in the patch from the Gaia DR3 catalog and of stars upsampled from the Auriga 6 simulation using both EnBiD and the CNF. The bottom right panel of Figure 4 shows the proper motions of our EnBiD-upsampled dataset. The optimized bandwidth is sufficiently larger than the inter-particle spacing that the clumpy substructures visible in Figure 1 are no longer present. The proper motion distribution is as smooth as the Gaia DR3 plot in the center.
The bottom left panel of Figure 4 shows the proper motions of the CNF-upsampled dataset. The CNF upsampling also generates a very smooth distribution, as the CNF models a smooth transformation from the standard Gaussian to the data distribution. Within this patch, both the optimized EnBiD and the CNF evidently generate high-quality proper motion distributions.
In Figures 5 and 6, we provide the full six-dimensional phase space (in ICRS coordinates) of stars upsampled in the patch by our bandwidth-optimized EnBiD algorithm and by the CNF-based GalaxyFlow, respectively. Overall, both sets of upsampled distributions look similar and sufficiently smooth in all position and velocity components, but the two upsampling methods noticeably differ in the distance histogram. The sudden rise of the distance distribution in the EnBiD upsampling is due to oversmoothing in the kernel density estimation.
The deviation of the EnBiD-upsampled stellar distribution from the star particle distribution is also visible in the whole dataset, as shown in the histograms of the Galactocentric Cartesian coordinate components in Figure 7. The bias is particularly prominent where the density changes rapidly and at the boundary, such as at the edges and peaks of the coordinate histograms. Again, this deviation is due to oversmoothing: the smearing effect of the Gaussian kernel introduces a bias proportional to the second derivative of the density, leading to sizable deviations around density peaks.
In contrast, the CNF aims to estimate the true density directly through maximum likelihood estimation. The histograms of coordinate components upsampled by the CNF, depicted in Figure 7, show smooth and less biased stellar distributions. This qualitative comparison highlights the effectiveness of the CNF in replicating the star particle density without the bias introduced by kernel smearing, offering a more accurate model for upsampling.
We note that these plots lack observational effects, so they differ from the Gaia plots, especially in the distance distribution. The corresponding Gaia distribution is a convolution of the true distances with observational effects such as observed magnitude limits: far-away stars are dimmer and more difficult to observe, so we see fewer distant stars than in the upsampled star distributions. Nevertheless, these observational effects do not wash out the upsampler dependence, so comparing the simulation truth distributions is sufficient for our purpose.

Two-Sample Classifier Metric
While by-eye comparisons of the outputs of the upsampling methods are useful, they are necessarily qualitative, as well as limited to one or two dimensions. For a more quantitative, full-phase-space test of the quality of the generative models, we turn to a novel neural classifier test.
The ideal metric to judge the quality of a generative model would be the Neyman-Pearson (NP) optimal binary classifier between reference and generated samples. The classifier output would be monotonic in the likelihood ratio between the reference and generated distributions. If the two follow the same probability distribution (i.e., $p_{\rm ref} = p_{\rm gen}$), the NP-optimal classifier would be maximally confused, i.e., the area under the receiver operating characteristic curve (AUC) of the classifier would be 0.5. Conversely, the higher the AUC, the worse the generative model is at modeling the underlying probability distribution of the data.
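Concretely, the AUC equals the probability that the classifier scores a random reference sample above a random generated one. A small sketch (using hypothetical score arrays) makes the two limiting cases explicit:

```python
import numpy as np

def auc(scores_pos, scores_neg):
    """Pairwise-comparison estimate of the ROC area: the probability that
    a positive-class score exceeds a negative-class score (ties count 1/2).
    O(n^2) memory; fine for a sketch, not for millions of stars."""
    gt = scores_pos[:, None] > scores_neg[None, :]
    eq = scores_pos[:, None] == scores_neg[None, :]
    return gt.mean() + 0.5 * eq.mean()

rng = np.random.default_rng(0)
same = rng.normal(size=2000)            # generated scores match reference
shifted = rng.normal(size=2000) + 5.0   # generated scores clearly separated
print(auc(rng.normal(size=2000), same))   # near 0.5: maximally confused
print(auc(shifted, same))                 # near 1.0: fully separable
```

A practical pipeline would compute the same quantity with a rank-based implementation, but the interpretation is identical.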
This "ultimate classifier metric" was recently used by Krause & Shih (2021) in a Large Hadron Collider (LHC) application to demonstrate the fidelity of a fast calorimeter simulation based on masked autoregressive flows trained on GEANT4 reference showers. Using this metric, the authors successfully demonstrate that normalizing flows are better at generating synthetic calorimeter responses than generative adversarial networks. Hence, we may also try this classifier test to determine the better star particle upsampler.
However, this "ultimate classifier metric" has many limitations and drawbacks. Chief among these is that it is only strictly reliable when the classifier is close to optimal. We find this is definitely not the case here, because there are simply not enough reference star particles to train a good classifier. In the LHC case, we can always run more GEANT4 simulations to generate more reference data, but here we are limited to the fixed number of star particles in the Auriga simulation. As a result, the classifier training starts to overfit in the very early phase of training, and we end up with AUCs close to 0.5 (0.555 and 0.504 for EnBiD and CNF, respectively). This small AUC difference does not agree with what we found in our qualitative, by-eye comparisons of EnBiD and the reference Auriga simulation data, and indicates the classifier is not optimal.
Another issue with the "ultimate classifier metric" is that the AUC can be misleading, in the sense that a perfectly suitable generative model could nevertheless be 100% separable from the reference data due to some physically irrelevant "tells" (leading to an AUC of 1). Also, the AUC has rather limited expressiveness as a measure of quality in that it is bounded between 0 and 1. For example, two generative models could both have an AUC of 1, but one could be much better than the other, and it would be impossible to tell using the "ultimate classifier metric." For these reasons, in the next subsection, we introduce a new classifier-based evaluation score to judge the relative merits of different upsampling generative models, which circumvents the problems with the "ultimate classifier metric."

Multi-Model Classifier Metric
We dub our new evaluation score the "multi-model classifier metric". This score gives up on judging the absolute quality of a generative model with respect to the reference data, and instead focuses on judging the quality of a generative model relative to a collection of other generative models. In other words, we seek to rank a collection of generative models in terms of which one describes the reference data best.
Given a collection of $N$ generative models, we will first train an $N$-class classifier to distinguish between upsamplings of the same reference data by all the different generative models. Since the training is over upsampled data, we are not limited by training statistics: we are able to learn the differences between the generative models and verify that they are not equivalent. After this training, we utilize the classifier to assign an upsampling performance score to each upsampler and identify the best upsampler as the one whose probability distribution is closest to the reference dataset. Again, this test cannot tell us the absolute quality of a generative model, but it can tell us which generative model is most similar to the reference data.

(For Gaia DR3, we additionally require that the observed parallax is more than $3\sigma$ away from zero, in order to remove poorly measured stars. After this selection, the Gaia DR3 dataset has 979,185 stars.)
We identify the better upsampler using the average of the log-posteriors of each upsampler evaluated on the reference dataset. The log-posterior of a generative model $m$ is defined as
$$\mathrm{LP}(m) = \frac{1}{N_{\rm ref}} \sum_{i=1}^{N_{\rm ref}} \log p\!\left(m \,\middle|\, \vec{x}^{(i)}, \vec{v}^{(i)}\right),$$
where $p(m\,|\,\vec{x}^{(i)}, \vec{v}^{(i)})$ is the classifier output for model $m$, evaluated on the $i$-th sample $(\vec{x}^{(i)}, \vec{v}^{(i)})$ in the reference dataset.
One advantage of using the log-posterior as our metric is that the difference of the posteriors of two upsamplers has a clear probabilistic interpretation. The log-posterior $\mathrm{LP}(m)$ of an upsampler $m$ differs from the log-likelihood $\mathrm{LL}(m)$ (the log-likelihood that the dataset is drawn from the probability distribution of upsampler $m$) only by terms that are the same when evaluated over both upsamplers:
$$\mathrm{LP}(m) = \mathrm{LL}(m) + \log p(m) - \frac{1}{N_{\rm ref}} \sum_{i=1}^{N_{\rm ref}} \log p\!\left(\vec{x}^{(i)}, \vec{v}^{(i)}\right). \quad (15)$$
Note that we train classifiers using a uniform prior $p(m) = 1/N$, so the prior dependence of Eq. (15) cancels out. Therefore, the difference between the log-posteriors of two models $m_1$ and $m_2$ can be interpreted as the log-likelihood ratio of one upsampler relative to the other, making these numbers suitable for systematically comparing the performance of upsamplers.
Another advantage of the log-posterior score is that it is unbounded and remains informative even when comparing generative models that are fully separable from the reference data. In that case, the AUC becomes uninformative (AUC ≈ 1 for both upsamplers), whereas the better model is still apparent from the log-posterior, since our method uses likelihood ratios to determine which model is closer to the reference dataset.
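The score and its likelihood-ratio interpretation can be sketched numerically. Below, two hypothetical "upsamplers" are unit-width Gaussians, one matching the reference distribution and one shifted; the classifier posteriors are computed exactly via Bayes' theorem with a uniform prior.

```python
import numpy as np

def log_posterior_scores(posteriors):
    """posteriors: (n_ref, n_models) classifier outputs p(m | sample),
    evaluated on the reference dataset. Returns LP(m) for each model m."""
    return np.log(posteriors).mean(axis=0)

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 5000)                 # reference samples
q1 = np.exp(-0.5 * x**2)                       # model 1: matches reference
q2 = np.exp(-0.5 * (x - 1.0) ** 2)             # model 2: shifted by 1
post = np.stack([q1, q2], axis=1)
post /= post.sum(axis=1, keepdims=True)        # Bayes with uniform prior
lp = log_posterior_scores(post)
# LP differences reduce to the mean log-likelihood ratio between models:
assert np.isclose(lp[0] - lp[1], np.mean(np.log(q1 / q2)))
print(lp)   # the model matching the reference scores higher
```

In the actual test the posteriors come from the trained multi-class classifier rather than from known densities, but the scoring step is exactly this average of log-posteriors over reference star particles.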
In the rest of this subsection, we will describe our implementation of the multi-model classifier test for $N = 2$ models: EnBiD and CNF. In Appendix A, we will show an example of the multi-model classifier test with $N = 3$, by including another high-performing upsampler based on the MAF family of normalizing flows.
We use a classifier modeled by an MLP taking the position and velocity vectors $(\vec{x}, \vec{v})$ as inputs. The classifier is trained by minimizing the categorical cross-entropy loss function, so that the network returns the posterior probability $p(m\,|\,\vec{x}, \vec{v})$ of the generative model $m$.
The MLP consists of four hidden layers with [2048, 1024, 128, 128] nodes. We use LeakyReLU activations (Maas et al. 2013) in order to avoid the dying-ReLU problem that may appear when the datasets are too similar. As the classifier tests the generalization performance of the upsamplers, we perform the classifier training and hypothesis testing as follows. First, we compare the datasets upsampled by EnBiD and the CNF. As shown in Figure 7, the EnBiD-estimated stellar distribution exhibits biases near edges and peaks, while the CNF-upsampled dataset does not. The classifier notices this difference, leading to distinct classifier output distributions for the two upsampled datasets, as illustrated in Figure 8. The AUC of this classifier is 0.577. The two upsampled datasets are thus distinguishable not only visually but also statistically.
Next, we utilize the classifier to determine which upsampler yields a stellar distribution closest to the original star particles. Given the biases observed in the EnBiD-upsampled stars, we expect the classifier to more often identify test star particles as CNF-upsampled stars. Indeed, the classifier output for the test star particles aligns more closely with the CNF-upsampled dataset, as demonstrated in Figure 8. This suggests that the classifier perceives the test star particles as more similar to those upsampled by the CNF, which was trained on a separate set of star particles.
The results of the multi-model classifier test are detailed in Table 1, with the log-posterior for CNF being larger than that for EnBiD. Specifically, the test gives a log-posterior of −0.683 ± 0.003 for CNF, while EnBiD's log-posterior is −0.724 ± 0.003. The uncertainties associated with the log-posterior scores are determined by conducting the test ten times using different random seeds, covering both the statistical uncertainty of repeated upsampling and the systematic uncertainty of network initialization.
These test results quantitatively confirm that our GalaxyFlow algorithm based on the CNF produces stars that are more kinematically consistent with the underlying star particle distribution. To further validate our findings, we also cross-validate the comparison of EnBiD and CNF using an additional mock dataset from the simulated galaxy h277. The results align with those obtained using Auriga 6, as summarized in Appendix C.

COMPUTATIONAL SPEED COMPARISON
In this section, we compare the computational speed of the upsamplers. Table 2 shows the training and sampling times for each. For EnBiD, once the bandwidth calculation is finished, further optimization does not take much time: we may either use a rule-of-thumb bandwidth scaling to bypass the training, or optimize the bandwidth scale parameter $s_h$, whose training takes only a few epochs. EnBiD sampling is also fast, since we only need to select one of the mixture components and draw a Gaussian random number from it.
For the CNF, training takes more time because it involves a multi-parameter optimization problem, which requires many epochs to find the best-fit solution. Sampling is also slower, because the Gaussian random numbers must be transformed to the data distribution using neural networks.
Although the CNF shows the best upsampling performance, its training and upsampling speeds are the slowest among the upsamplers. This might not be an issue, since the upsampling needs to be performed only once per galaxy simulation, so the computational expense is a one-time cost. Nevertheless, it could be interesting to explore recent innovations in speeding up CNF training from the machine learning community, for example, the probability flow of score-based diffusion models (Song et al. 2021) and conditional flow matching (Lipman et al. 2023; Tong et al. 2023). We leave upsampling with these new machine learning architectures for future studies.

CONCLUSIONS
We introduced two improved star particle upsamplers. Our primary result is GalaxyFlow, based on continuous normalizing flows, which can be used to generate realistic Gaia mock catalogs from $N$-body simulated galaxies. GalaxyFlow captures the phase space distribution of the original star particles more accurately than the existing state-of-the-art EnBiD algorithm, an adaptive kernel density estimation specifically designed for upsampling star particles.

(We also trained a MAF-based upsampler (Buckley et al. 2023); we only show its computational performance here. Although the MAF is faster than the CNF, its upsampling quality is slightly worse. Details of the test results are presented in Appendix A.)

Table 2. Training and sampling times of EnBiD, MAF, and CNF. Training for the EnBiD bandwidth scaling, MAF, and CNF was performed on a Tesla P100, Quadro M6000, and GeForce 2080 Ti, respectively, which have FP32 computing performances of 9.3 TFLOPS, 6.8 TFLOPS, and 13.5 TFLOPS. For EnBiD, the training time is divided into two components: the first number is the bandwidth calculation time on a CPU (Intel Xeon E5-2630), and the second is the bandwidth optimization time on the Tesla P100. To measure the upsampling time, we generate 241,206,000 samples on the Tesla P100. Bold values mark the training and upsampling times of the fastest upsampler.
We also presented a bandwidth optimization for kernel-based upsampling using the EnBiD algorithm. Our optimization avoids clumping in phase space while mitigating oversmoothing.
In order to quantitatively compare the accuracy of GalaxyFlow and our kernel-based method, we developed a novel multi-model classifier test that compares the accuracy of generative models. This test statistically confirms GalaxyFlow's enhanced upsampling accuracy compared to the optimized EnBiD in two examples of upsampling stars in a neighborhood of the Sun in two simulated galaxies: Auriga 6 and h277. The improved accuracy of GalaxyFlow upsampling makes the algorithm suitable for extending the kinematic features of star particles to an upsampled stellar distribution. The bandwidth-optimized EnBiD upsampler also provides visually smooth distributions at a lower computational cost.
Although GalaxyFlow shows excellent upsampling performance, there remains room for improvement. Our upsampling only assumes the smoothness of the flow model and does not impose any physical constraints, such as the equations of motion or any specific physics-based analytic density models. The multi-model classifier test guarantees the kinematic consistency of upsampled data only down to the scale of the star particles: regression artifacts may appear on smaller scales.
To make the learned distribution more reliable at smaller resolution scales, we may impose physical constraints. For example, we may require that the flows satisfy the Boltzmann equation by adding loss terms that penalize solutions incompatible with the equations of motion (Raissi et al. 2019). As our normalizing flows estimate the phase space density well, snapshots of star particles at different times can be used to calculate the time derivative of the phase space density. Since the simulations provide a truth-level gravitational field, the Boltzmann equation can then be utilized to refine the quality of the density estimation for upsampling.
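For a distribution function $f(\vec{x}, \vec{v}, t)$ in a gravitational potential $\Phi$, such a constraint would take the following form (the penalty weight $\lambda$ below is a tunable hyperparameter introduced for illustration, in the spirit of physics-informed neural networks):

```latex
% Collisionless Boltzmann equation for f(x, v, t) in potential Phi:
\frac{\partial f}{\partial t}
  + \vec{v}\cdot\nabla_{\vec{x}} f
  - \nabla_{\vec{x}}\Phi\cdot\nabla_{\vec{v}} f = 0
% A possible penalty term on the flow density f_theta
% (lambda is a tunable weight, an assumption here):
\mathcal{L}_{\rm phys} = \lambda \,\Big\langle \big(
  \partial_t f_\theta
  + \vec{v}\cdot\nabla_{\vec{x}} f_\theta
  - \nabla_{\vec{x}}\Phi\cdot\nabla_{\vec{v}} f_\theta
\big)^2 \Big\rangle
```

Adding $\mathcal{L}_{\rm phys}$ to the negative log-likelihood loss would penalize flow densities incompatible with the equations of motion.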
In this first paper, we applied our upsampler only on a small neighborhood around the Sun, and did not simulate the range of stellar masses or dust extinction that a realistic synthetic star population as seen by Gaia or other observatories must display.
While the second issue of star properties can be straightforwardly addressed using existing methods, the performance when upsampling all the star particles in a simulated galaxy may differ from the results of this paper because of the soft topological constraints of normalizing flows. When the training dataset contains topological structures (for example, multimodality) vastly different from the base distribution, normalizing flows sometimes experience a mild difficulty in modeling the distribution, since they model a continuous bijective transformation that preserves the topological structure of the data. Therefore, simple normalizing flows with a Gaussian base distribution may have difficulty learning galactic substructures, such as stellar streams and satellite galaxies, on top of the smooth distribution of stars. Nevertheless, these difficulties can be mitigated by introducing more expressive flows or a multimodal base distribution. As one potential failure mode, if a flow-based upsampler connects two separated clusters due to a lack of expressive power, it may generate a fake, unphysical stream that does not exist in the simulation. Alternatively, if the upsampler fails to model a real smooth stream and leaves a gap, this may give us false-positive signals of a stream colliding with unknown objects. In future studies, explicit checks on the output of the normalizing flows must be carefully performed to ensure that these structures are preserved, and improvements developed to deal with these or other failure modes.

We show the classifier output distributions of the MAF- and CNF-upsampled datasets in Figure A2. Both distributions peak at 0.5, so the two datasets are consistent at some level. However, the distributions differ in both the lower and upper tails, and the AUC of this classifier is 0.5366, slightly higher than 0.5. This indicates that there are non-negligible differences in the details of the distributions. The classifier output distribution of the test star particles is close to that of the CNF-upsampled stars, indicating that the CNF is the better upsampler of the two.
The log-posteriors of the classifier test for MAF vs. CNF are shown in the second column of Table A1. We note that the base classifier and the shorter training setup give compatible results. The MAF and CNF model smooth distributions, so the regularization issue due to spectral bias does not arise here; therefore, we regard the log-posteriors as estimated log-posteriors rather than log-posterior bounds. The two scores are much closer than for EnBiD vs. CNF, since the classifier output distributions are similar. Nevertheless, there is a statistically significant difference between them, and the CNF shows the better log-posterior.
We conclude that the density estimate learned by the CNF is more accurate than that learned by the MAF. Note that the MAF is essentially a chain of conditional linear transformations that expand and shrink axes, and small wrinkles may appear during the transformation (Huang et al. 2018). Data preprocessing and ensemble averaging may mitigate these effects, but residual artifacts may continue to impact the quality of the density estimation.

A2 Multi-Model Classifier Test: EnBiD vs. MAF vs. CNF
Next, we consider all the upsamplers discussed in this paper (EnBiD, MAF, and CNF) to demonstrate the comparison of multiple upsamplers at once using our multi-model classifier test.
The classifier output distributions are shown in Figure A3, and we see the same results found in the EnBiD vs. CNF and MAF vs. CNF comparisons. The MAF and CNF distributions closely align with the reference dataset, indicating their high accuracy, as both models are free from the smoothing bias inherent in kernel density estimation. Among them, the CNF distribution is closest to the reference dataset, suggesting the CNF is the most accurate of the upsamplers.
The log-posteriors are listed in the third column of Table A1: −1.133 ± 0.004, −1.104 ± 0.003, and −1.090 ± 0.003 for EnBiD, MAF, and CNF, respectively. Since the log-posterior of the CNF is highest, this multi-model classifier test quantitatively concludes that the CNF is the best upsampler of the three.

APPENDIX B: LIMITATIONS OF MLP-BASED MULTI-MODEL CLASSIFIER TEST
In this appendix, we revisit the overly small bandwidth EnBiD setup used in the Ananke and Aurigaia catalogs, which led to blotchiness in the upsampled star distribution, in order to discuss the limitations of using MLP-based classifiers in our multi-model classifier test. The main concern is that the inductive biases of MLPs may influence the test outcome if the classifier is underfitted due to those biases; it is therefore crucial to understand the method's limitations carefully.
The inductive bias mainly considered here is the spectral bias of MLPs: the training process tends to favor lower-frequency modes initially (Rahaman et al. 2019; Xu et al. 2019; Yang et al. 2022). This bias becomes particularly relevant when dealing with the high-frequency modes of the suboptimal EnBiD-upsampled star distribution, namely the blotchiness of the dataset at the bandwidth scale. We first discuss the EnBiD setup that replicates upsampling results similar to those in Aurigaia, and then examine the effect of the spectral bias on multi-model classifier tests involving such data.

B1 EnBiD Upsampling with Small Bandwidth
In order to generate blotchy stellar distributions similar to Aurigaia, we adopt the following procedure for upsampling each star particle, similar to the bandwidth determination step described in Grand et al. (2018):
(i) Use the EnBiD 2.0 package (Sharma & Steinmetz 2006) to determine the phase space hyperrectangle enclosing the star particle.
(ii) For coordinate axis $k$, denote by $l_k$ the corresponding hyperrectangle side length.
(iii) Draw phase space coordinates from a 6D Gaussian centered on the original star particle, with each phase space direction's dispersion set by the corresponding side length $l_k$.
We select upsampled star particles within a 3.5 kpc radius of the Solar location for consistency with the samples generated using normalizing flows.
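The per-particle sampling step can be sketched as below; the proportionality factor between side length and dispersion is an illustrative assumption, as we do not reproduce the exact factor of Grand et al. (2018) here.

```python
import numpy as np

def upsample_one_particle(center, side_lengths, n, factor=0.5, rng=None):
    """Draw n 6D phase-space samples around one star particle.

    center       : (6,) position+velocity of the progenitor particle
    side_lengths : (6,) EnBiD hyperrectangle side lengths
    factor       : dispersion per unit side length (illustrative assumption)
    """
    rng = np.random.default_rng(0) if rng is None else rng
    sigma = factor * np.asarray(side_lengths)
    return np.asarray(center) + sigma * rng.normal(size=(n, 6))
```

Because each particle's kernel is narrow compared to the inter-particle spacing, samples from neighboring particles do not blend, producing the blotchy distributions discussed below.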
Figure B1 shows the true Galactocentric velocities, without smearing from measurement errors, of stars in the Aurigaia catalog and of the small-bandwidth EnBiD-upsampled Auriga 6 within the same sky patch as Figure 1. To ensure a fair comparison, stars in the Aurigaia catalog originating from progenitor star particles beyond 3.5 kpc from the Sun are excluded. We also overlay the progenitor star particles within this analysis volume for reference. In each plot, we see the same increased concentration of stars around the star particles, while the stars in the Aurigaia catalog have a larger spread. Blotches not connected to identified star particles originate from star particles just outside the patch. Since this appendix focuses on the impact of clumpy substructures on multi-model classifier tests, this small-bandwidth EnBiD upsampling method is sufficient for our purpose.

B2 Multi-Model Classifier Test: Small-Bandwidth EnBiD vs. CNF
Here, we present the multi-model classifier test results comparing the small-bandwidth EnBiD and CNF upsampling methods.
Figure B2 shows the classifier output distributions for both upsampling methods; the two distributions are well separated. Stars exhibiting a high CNF posterior probability $p(\mathrm{CNF}\,|\,\vec{x}, \vec{v})$ are mostly located in the gaps between EnBiD's narrow Gaussian kernels, where the likelihood of being EnBiD-upsampled is low. Meanwhile, stars located atop the kernels have a high likelihood of being EnBiD-upsampled, so their CNF posterior probability is smaller. As a result, the two posterior probability distributions are clearly separated. The AUC of this classifier is 0.952, confirming that the classifier effectively distinguishes the two types of upsampled stars. Next, we use the classifier to determine which upsampler generates stars whose distribution most closely resembles that of the original star particles. Considering that the test star particles are mostly positioned away from EnBiD's Gaussian kernels, we expect the classifier to recognize the test star particles as CNF-upsampled stars. Indeed, the classifier output distribution of the test star particles aligns with that of the CNF-upsampled dataset, as shown in Figure B2. This alignment suggests that the CNF more accurately replicates the original star particle distribution.
Although the results of this multi-model classifier test align with our qualitative understanding, we cannot use this classifier for precise log-posterior estimation, primarily due to signs of underfitting: the training continues to improve if we increase the patience of the early stopping. We could continue training to mitigate this issue; however, the training is too slow. The base classifier training already takes 6 days on a Tesla P100 GPU, and pushing the training further is not feasible with our available computational resources.
Furthermore, this slow training speed also complicates the uncertainty estimation. Ideally, the training should be repeated several times with different random number seeds in order to assess the variance of the log-posterior. However, the extensive computational time required for each iteration makes this uncertainty estimation impractical in this case.
This slow training pace is due to the spectral bias of MLPs (Rahaman et al. 2019; Xu et al. 2019; Yang et al. 2022): neural network training tends to prioritize learning lower-frequency modes first. In the above classifier training, the early stopping effectively acts as a regularizer, since the spectral bias significantly slows the learning of high-frequency features, such as the clumpy substructures of the EnBiD-upsampled dataset. The early stopping prematurely terminates the training before the neural network learns all the relevant high-frequency features. Thus, rather than accurately regressing the EnBiD-upsampled distribution, the classifier is constrained to learning a partially smoothed version of the distribution due to this implicit regularization.
Instead of continuing the training indefinitely to obtain more accurate log-posteriors, we accept the underfitting and perform the multi-model classifier test with a shorter classifier training setup, making the mini-batch size 10 times larger. The effective patience in terms of network parameter updates is then one-tenth that of the base setup, so the early stopping terminates the training earlier. This regularized classifier can still distinguish the small-bandwidth EnBiD and CNF datasets (AUC: 0.831), and the training finishes within 10 hours. The decrease in AUC is expected, since the regularized classifier is no longer a linear function of the log-likelihood ratio, i.e., no longer the optimal classifier. We show the log-posteriors of the multi-model classifier test using this regularized classifier in Table B1. Specifically, the log-posterior scores for EnBiD and CNF are −2.48 ± 0.19 and −0.54 ± 0.01, respectively. We interpret these log-posteriors from the regularized classifier as indicative bounds (upper for EnBiD and lower for CNF) relative to their asymptotic values, because the base classifier gives smaller and larger log-posteriors, respectively: the log-posteriors from the base classifier are −6.22 for EnBiD and −0.34 for CNF. The quoted uncertainties on the log-posterior scores are derived by repeating the test ten times with different random number seeds. This uncertainty covers the statistical uncertainty of repeated upsampling and the systematic uncertainty of network initialization.
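The patience-based early stopping that acts as the implicit regularizer here can be sketched as a plain training loop; the callback name and loss values below are illustrative stand-ins, not our actual classifier:

```python
def train_with_early_stopping(validation_loss, max_epochs, patience):
    """Stop once the validation loss has not improved for `patience`
    consecutive epochs; return the epoch index of the best model.
    `validation_loss(epoch)` is a hypothetical per-epoch callback."""
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch in range(max_epochs):
        loss = validation_loss(epoch)
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # early stop: acts as implicit regularization
    return best_epoch
```

With a fixed patience measured in parameter updates, enlarging the mini-batch by a factor of 10 reduces the number of updates per epoch by the same factor, which is why the effective patience of the regularized setup is one-tenth that of the base setup.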
Although a regularized classifier was employed in this analysis, the difference between the two log-posteriors is nonzero and statistically significant. This means that the classifier, when asked whether the dataset of test star particles looked more like the CNF-upsampled data or the EnBiD-upsampled data, determined that it had a higher likelihood of being drawn from the CNF-upsampled dataset than from the EnBiD-upsampled dataset. Therefore, this test result still quantitatively demonstrates that our GalaxyFlow algorithm based on a CNF generates stars more kinematically consistent with the underlying star particle distribution.
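One simple way to turn per-particle classifier outputs into a per-model score of the kind reported above is to average the log of the posterior probability that the classifier assigns to each model over the test star particles. This is only an illustrative sketch of such a scoring step; the array layout is an assumption:

```python
import numpy as np

def log_posterior_scores(posteriors):
    """Mean log-posterior per model over the test star particles.
    `posteriors` has shape (n_particles, n_models); row i holds the
    classifier's posterior probabilities p(model | particle_i).
    Larger (less negative) scores indicate a better-matching model."""
    return np.log(posteriors).mean(axis=0)
```

A model whose upsampled stars resemble the test particles receives posteriors close to 1 and hence a score near 0, while a mismatched model is driven toward large negative values.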
Note that this limitation of the MLP-based multi-model classifier test can be avoided by using a classifier that does not exhibit spectral bias. While classifiers free from spectral bias could potentially offer more accurate log-posterior estimation, introducing such classifiers here falls beyond the scope of this paper.

APPENDIX C: RESULTS ON GALAXY H277
In this appendix, we apply our upsampling method to the galaxy h277 (Zolotov et al. 2012; Loebman et al. 2012) from the N-Body Shop Collaboration. The galaxy h277 was produced using the Gasoline smoothed particle hydrodynamics code (Wadsley et al. 2004), starting with a dark matter-only simulation within a comoving box with side length 50 Mpc, and then re-simulating with identical initial conditions at higher resolution including baryons. This simulation has a force resolution of 173 pc, a dark matter particle mass of 1.3 × 10^5 M⊙, an initial gas particle mass of 2.7 × 10^4 M⊙, and an average star particle mass of 5800 M⊙. The entire galaxy has a mass of 7.95 × 10^11 M⊙ within the virial radius, and was selected as the most Milky Way-like of the galaxies available from this set of simulations. This galaxy currently has no associated mock Gaia catalog, so we here present our upsampler as a first step towards such a catalog. The star particles within 3.5 kpc of the assumed Solar location, (−8.122, 0.0, 0.0208) kpc, are selected for training our upsampler.
The number of selected star particles is 153,724, and their maximum speed is 433.79 km/s. We center and rescale the particles and train a CNF following the procedures outlined in Section 3.
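The centering and rescaling step can be sketched as a per-component standardization of the 6D phase-space coordinates, keeping the parameters needed to map generated samples back to physical units. This is a minimal sketch under that assumption, not the exact preprocessing of Section 3:

```python
import numpy as np

def center_and_rescale(particles):
    """Standardize 6D phase-space coordinates (positions and velocities)
    to zero mean and unit variance per component before flow training.
    Returns the transformed data and the (mean, std) needed to invert
    the map on samples drawn from the trained flow."""
    mean = particles.mean(axis=0)
    std = particles.std(axis=0)
    return (particles - mean) / std, (mean, std)
```

After sampling from the flow in the standardized space, multiplying by `std` and adding `mean` recovers stars in the original kpc and km/s units.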
In Figure C1, we show the phase-space density distributions in ICRS coordinates for stars upsampled in the same angular patch as Figure 1. We additionally overlay the progenitor star particles on each 2D distribution. The patch contains 440 star particles, upsampled to 2,429,709 stars, approximately 5,800 stars per particle, or 1 star per M⊙. The distributions show no star particle substructure, and the CNF learns the bimodal distance distribution. These plots indicate that upsampling with normalizing flows provides consistent results across multiple simulations.
We then performed the multi-model classifier test comparing the EnBiD- and CNF-generated datasets. For EnBiD, we find that the optimal bandwidth multiplier is 0.3348 ± 0.0016, which is compatible with the theoretical estimate from the rule-of-thumb bandwidth ratio, 0.332. The log-posteriors of EnBiD and CNF are shown in Table C1. Again, the EnBiD-upsampled dataset contains kernel smoothing bias, and the log-posterior of EnBiD is small. The log-posterior for CNF is larger, and these results again confirm that our GalaxyFlow upsampling using a CNF is more accurate than EnBiD upsampling.
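The bandwidth optimization behind this multiplier can be sketched as a held-out maximum-likelihood fit: build a kernel density estimate on half of the star particles and pick the bandwidth scale that maximizes the log-likelihood of the other half. The sketch below is a 1D toy version with a Gaussian kernel; the grid, kernel, and data are illustrative assumptions, not the 6D Epanechnikov-type setup used in the paper:

```python
import numpy as np

def heldout_log_likelihood(train, valid, bandwidth):
    """Mean held-out log-likelihood of a 1D Gaussian KDE built on
    `train` and evaluated on `valid`, for a given kernel bandwidth."""
    diff = (valid[:, None] - train[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diff**2) / (np.sqrt(2 * np.pi) * bandwidth)
    return np.log(kernel.mean(axis=1)).mean()

# Scan a grid of bandwidths and keep the maximum-likelihood value.
rng = np.random.default_rng(0)
data = rng.normal(size=2000)           # toy stand-in for star particles
train, valid = data[:1000], data[1000:]
scales = np.linspace(0.05, 1.0, 20)
lls = [heldout_log_likelihood(train, valid, s) for s in scales]
best = scales[int(np.argmax(lls))]
```

Too small a bandwidth overfits the training half (valid points fall into gaps between kernels), while too large a bandwidth oversmooths; the held-out likelihood peaks in between, which is the quantity shown in Figure 3.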
Figure 1. If anything, given the considerably smaller measurement uncertainties on proper motions in DR3, comparing DR3 versions of Ananke and Aurigaia would presumably only exacerbate the discrepancies.

Figure 2. Distributions of a subset of Gaia DR3 stars in ICRS coordinates. We select stars within 15° of (α, δ) = (167.47°, −4.2°) and within 3.5 kpc of the Sun. We discard stars with poorly measured distances, i.e., the observed parallax must be greater than zero by 3× the parallax measurement error. Plots on the diagonal are the marginal distributions of each component; the other plots are the joint distributions of pairs of components. The Gaia DR3 dataset has 979,185 stars satisfying these criteria. Only a small fraction of stars have a measured radial velocity, so the plots including radial velocity have lower statistics, with 135,105 stars.

Figure 3. The negative log-likelihood as a function of the bandwidth scale parameter. The kernel density is constructed using half of the star particles in the training dataset, and the log-likelihood is evaluated on the other half. Each half of this training dataset contains 120,603 star particles. The best-fit value of the scale parameter is 0.356. Green and yellow bands represent the 1σ and 2σ confidence intervals, respectively. The black dashed vertical line is the scale parameter estimated from the ratio between the rule-of-thumb bandwidths of the spherical Epanechnikov kernel and the kernel used in the plot.

Figure 4. Distributions of proper motion for stars in Gaia DR3 (center) and for Auriga 6 stars upsampled by EnBiD using an optimized bandwidth (lower left) and by CNF (lower right), within 15° of the ICRS coordinates (α, δ) = (167.47°, −4.2°) and within 3.5 kpc of the assumed Solar location. For Gaia DR3, we additionally require that the "observed" parallax is more than 3σ away from zero, in order to remove poorly measured stars. After this selection, the Gaia DR3 dataset has 979,185 stars.

Figure 7. Normalized histograms of position components (top) and velocity components (bottom) for selected stars in Auriga 6. The red lines are the histograms for synthetic stars sampled from the CNF, and the blue lines are for stars from EnBiD using an optimized bandwidth. The error bars are the 1σ statistical uncertainties. Below the main plots, we show the pull distributions, i.e., the difference between the histograms of star particles and upsampled stars, normalized by the 1σ statistical uncertainty. The red and blue markers are for CNF and EnBiD, respectively.

Figure 8. EnBiD (green) vs. CNF (orange) classifier output distributions. Black markers show the classifier output histogram of star particles from the test set, which is not used in training the upsamplers. Their vertical error bars are the 1σ statistical uncertainties.
Figure A2. MAF (blue) vs. CNF (orange) classifier output distributions. Black markers show the classifier output histogram of test star particles, which are not used in training the upsamplers. Their vertical error bars are the 1σ statistical uncertainties.

Figure A3. EnBiD (green) vs. MAF (blue) vs. CNF (orange) classifier output distributions. Black markers show the classifier output histogram of test star particles, which are not used in training the upsamplers. Their vertical error bars are the 1σ statistical uncertainties.

Figure B2. Small-bandwidth EnBiD (green) vs. CNF (orange) classifier output distributions. Black markers show the classifier output histogram of star particles from the test set, which is not used in training the upsamplers. Their vertical error bars are the 1σ statistical uncertainties.
Haber 2020), and the parameter  takes the role of the flow index in the chain.

Table 1. Log-posteriors from the multi-model classifier test for EnBiD vs. CNF. Larger (i.e., less negative) log-posterior scores indicate better upsampling performance. We also show the 1σ standard deviations of the scores, obtained by repeating the upsampling and classifier training ten times. The bold values are the scores of the best model determined by the test.

Table A1. Log-posteriors from the classifier tests for EnBiD vs. MAF vs. CNF upsampling Auriga 6 star particles. The first two columns are the two-model classifier test results comparing CNF to each of the other upsamplers. The last column gives the results of the multi-model classifier test comparing EnBiD, MAF, and CNF all at once. Larger (i.e., less negative) log-posterior scores indicate better upsampling performance. We also show the 1σ standard deviations of the scores, obtained by repeating the upsampling and classifier training ten times. The bold values are the scores of the best model determined by the test. First of all, the EnBiD distribution is clearly different from those of MAF, CNF, and the reference dataset, which indicates that EnBiD is less accurate than the other upsamplers.

Table B1. Log-posteriors from the multi-model classifier test for small-bandwidth EnBiD vs. CNF. Larger (i.e., less negative) log-posterior scores indicate better upsampling performance. We also show the 1σ standard deviations of the scores, obtained by repeating the upsampling and classifier training ten times. The bold values are the scores of the best model determined by the test.

Table C1. Log-posteriors from the classifier tests for EnBiD vs. CNF upsampling h277 star particles. Larger (i.e., less negative) log-posterior scores indicate better upsampling performance. We also show the 1σ standard deviations of the scores, obtained by repeating the upsampling and classifier training ten times. The bold values are the scores of the best model determined by the test.