Simple lessons from complex learning: what a neural network model learns about cosmic structure formation

Abstract We train a neural network model to predict the full phase space evolution of cosmological N-body simulations. Its success implies that the neural network model is accurately approximating the Green’s function expansion that relates the initial conditions of the simulations to its outcome at later times in the deeply nonlinear regime. We test the accuracy of this approximation by assessing its performance on well-understood simple cases that have either known exact solutions or well-understood expansions. These scenarios include spherical configurations, isolated plane waves, and two interacting plane waves: initial conditions that are very different from the Gaussian random fields used for training. We find our model generalizes well to these well-understood scenarios, demonstrating that the networks have inferred general physical principles and learned the nonlinear mode couplings from the complex, random Gaussian training data. These tests also provide a useful diagnostic for finding the model’s strengths and weaknesses, and identifying strategies for model improvement. We also test the model on initial conditions that contain only transverse modes, a family of modes that differ not only in their phases but also in their evolution from the longitudinal growing modes used in the training set. When the network encounters these initial conditions that are orthogonal to the training set, the model fails completely. In addition to these simple configurations, we evaluate the model’s predictions for the density, displacement, and momentum power spectra with standard initial conditions for N-body simulations. We compare these summary statistics against N-body results and an approximate, fast simulation method called COLA (COmoving Lagrangian Acceleration). Our model achieves percent level accuracy at nonlinear scales of k∼1Mpc−1h, representing a significant improvement over COLA.


Introduction
Physicists tend to gain understanding of various systems by building toy models based on simplifying assumptions, which make them analytically tractable or numerically feasible.The simplicity of such models comes with the added benefit of making them easy to interpret, since they typically contain only a small number of meaningful parameters.As alleged by Philip Anderson, "a simplified model throws more light on the real workings of nature than any number of 'ab initio' calculations of individual situations, which even where correct often contain so much detail as to conceal rather than reveal reality" [2].However, the cost of these simplified models is their limited range of applicability, and the difficulty of generalizing them to describe a wider range of phenomena.The advent of high performance computing has made it possible to model physical systems without such simplifying assumptions and conservative parameterizations.The deep learning [15] approach is to dramatically overparameterize a system and force the model to learn complex patterns and features of a rich data set in an efficient, automated way.The great advantage of this new modeling paradigm is the wide range of applicability and the diversity of phenomena that can be described by such models.The cost is the loss of interpretability of the model and its parameters.In this way, the principles underlying deep learning and artificial neural networks (NNs) are both antithetical and complementary to those underlying the traditional approaches of physicists.Instead of building simplified and abstract models based on physical insight, NNs learn by finding patterns in experimental, observational, and simulated data, and do so most efficiently when the patterns in the data are rich and complex rather than simple.
Recently NNs have also proved useful for accelerating computationally expensive tasks by directly learning to approximate the mapping between an algorithm's input and output [18,11,10,8,6,20,1,14].These applications require robust methods of testing a model's accuracy, to be confident in its performance on new data.In this work we use a set of idealized physical scenarios where exact or perturbative calculations can be performed explicitly in order to validate a NN model trained on general, complex data sets.
The application we present here is in the context of structure formation in the Universe, which is a nonlinear problem involving the complex pattern formation of what is known as the cosmic web.Cosmological N-body simulations provide the most accurate model of this nonlinear process.Deriving robust and optimal parameter constraints from observations of cosmic structure requires running large numbers of these expensive simulations.To alleviate this computational cost, and to gain insight into the complicated mapping between the initial conditions and final outcomes of structure formation, simplified models such as spherical collapse [3], excursion set [4,23,27], and halo models [26,5] are often employed.These models provide intuition and qualitative predictions for cosmic structure formation that lack accuracy, whereas optimal extraction of cosmological information from observations calls for precise modeling.In recent years, NN techniques have been developed to approximate these simulations both accurately and efficiently [10,1].
In this work we develop the first NN model to predict the full N-body phase space of dark matter structure formation, including displacements and velocities in the deeply nonlinear regime.We demonstrate the accuracy of our NN model by comparing its predictions against both the full N-body evolution that it was trained to reproduce, and a fast, approximate algorithm called COLA.We then valid our NN model within the context of three well understood systems: spherically symmetric configurations, isolated plane waves, and two coupled plane waves.The training data for the NN model is a set of Gaussian random fields, so these much simpler testing scenarios, which are far from Gaussian, represent a dramatic departure from the training data.We demonstrate that our NN model is capable of generalizing beyond its training data to reproduce the essential features of these much simpler scenarios.This enables us to identify strengths and weaknesses of the NN model's performance, providing us with a powerful diagnostic tool that can be used for future model optimization.These results also confirm that the NN has inferred general physical principles from its complex training data.

Cosmic structure formation
The evolution of cold dark matter under the influence of cosmic expansion and Newtonian gravity is typically modeled with N-body simulations.Such simulations evolve a generic, random Gaussian configuration of the initial matter field into complex structures of collapsed objects and voids that form the cosmic web at late times.Such simulations involve integrating the equations of motion of ∼ 10 6 -10 12 N-body particles for thousands of time steps.Simpler models and idealized scenarios are often employed to interpret the outcomes of these numerical simulations.Here we will review both the generic nonlinear dynamics and some of these idealized scenarios in the context of Lagrangian displacement field theory (see [3] for further details).Later we will assess the performance of our CNN model and compare with calculations done in the context of these simple scenarios.
At early times, when density perturbations are small, we can describe the matter as a set of equal-mass fluid cells with time dependent positions, v(a, q) = Ψ(a, q) . ( Here we use the scale factor of the Universe a as the time variable. The coordinates q are known as Lagrangian coordinates, and Ψ(a, q) is the Lagrangian displacement field.Under a Helmholtz decomposition, the displacement field is expressed as the sum of the gradient of a scalar potential Φ(a, q) and the curl of a vector potential B(a, q), The vector potential's divergence vanishes, ∇ • B(a, q) = 0.The spatial derivatives are with respect to q.Consider some early time, a i , when the linearized theory accurately captures the full nonlinear dynamics.The linearized equations, known as the Ze'ldovich approximation (ZA), for each component have two solutions.The scalar potential has a growing solution Φ + (a i , q), and a decaying solution Φ − (a i , q).The vector potential has a constant solution B 0 (q) and a decaying solution B − (a i , q).The linearized phase space then has the general form, Under the ZA there is no coupling between different modes.If Φ − (a i , q), B 0 (q), and B − (a i , q) vanish, then the Lagrangian coordinates q represent the initial positions of the fluid elements in the asymptotic past.Otherwise, q are reference coordinates that impose some labelling scheme on the fluid elements.This description contains some redundancy, or gauge invariance.We are free to define new labels q = q + ∇ × B 0 (q), absorbing the constant curl solution.Thus B 0 (q) represents a gauge mode, or coordinate choice, rather than a dynamical degree of freedom, and we will drop it.This is consistent with the underlying fluid description of the system, which has four degrees of freedom: density, velocity divergence, and the two independent components of the velocity curl, or vorticity.The time derivative of the density is constrained by mass conservation to be related to the velocity divergence.
In cosmology, the decaying solutions are neglected since they quickly become negligible.Our model's input training data is thus defined with Φ − (a i , q) = B − (a i , q) = 0. We will restrict to this class of initial conditions for now.It is convenient to consider the Fourier modes of the displacement potentials, Here we use the same symbol of the potentials and their Fourier transforms.The Fourier modes will always appear with arguments k rather than q to indicate which one is meant.Nonlinear gravitational clustering couples the modes together, so that the late time solutions to the field equations have the form, The integration kernels G n and G n can be computed as Green's functions under perturbation theory.These Green's functions also depend on cosmological parameters, but since we restrict to a single cosmology here we neglect this dependence.In the fully nonlinear regime all terms in the sums become important and the perturbative approach breaks down.Simulations are then required to model the solution.
Note that the nonlinearities in the above equations are convolutions, similar to the convolutional operations carried out by the CNN, so we expect the model to be able to encode the form of these Green's functions.The potential modes for N-body simulations are initialized as a Gaussian random field, where N1 and N2 are unit Gaussian random variables, and the standard deviation for the mode amplitudes is given in terms of the matter power spectrum P mm (a i , k) and the box volume V , The Gaussian random fields used as training data contain a large sample of the mode couplings, allowing the model to thoroughly learn the forms of the nonlinear integration kernels.We demonstrate this by testing our model on initial data containing only one or two modes.More generally, we could include decaying modes in the initial conditions, which would lead to additional terms in the evolution.For example, the scalar potential would contain mode couplings of the form, Here, i j are the component indices of the vector potential, and these are implicitly summed over along with the corresponding indices of the integration kernel.Similar couplings between all initial modes contribute to the vector potential as well.Since the decaying modes are excluded from our model's input data, the neural networks never have the opportunity to infer the forms of these kernels during training.We demonstrate this by testing our model on initial data containing only transversal modes.

Spherical evolution
For spherically symmetric density distributions centered around q = 0, we can use mass conservation to derive the evolution.The initial radius Lagrangian R and later Eulerian radius r are related, where the density as a function on radius is ρ m (a, r) = ρm (a) 1+ δ m (a, r) , and the mean overdensity within Eulerian radius r is, δm (a, r) = 3 In this case, Newtonian gravity determines the evolution of the mean density, Here primes denote derivatives with respect to log(a), H(a) = d log(a)/dt is the Hubble expansion rate of the universe, and Ω m (a) gives the fraction of the energy density of the Universe in matter.This equation can be numerically integrated to solve for the time evolution of spherically symmetric density profiles.
In the case of Einstein-de Sitter (EdS) space, where Ω m = 1, this equation has a well known analytic, parametric solution [3].These dynamics are fully nonperturbative, and we will demonstrate that our neural network reproduces them well when tested on initial conditions containing spherical density configurations.

Isolated plane wave
Another simple scenario that admits an analytic solution is an isolated plane wave, For this configuration, the time dependence of the wave's amplitude satisfies the linearized displacement field equation of motion.Decomposing the amplitude into the longitudinal component A = k • A and transverse components A ⊥ = k × A, these satisfy These equations can be solved analytically in ΛCDM in terms of hypergeometric functions.As mentioned earlier, the longitudinal component has a growing and a decaying solution, while the transverse components have a constant and a decaying solution.
After dropping the constant solutions and decaying solutions, this evolution corresponds to the G 1 term from Eq. (8).The above linearized displacement field equations have the explicit form of the ZA, so for isolated plane waves the ZA is exact.Since there is no mode coupling under the ZA, isolated modes do not couple nonlinearly with themselves.However, the discretized N-body dynamics does lead to the generation of higher modes when the initial conditions contain only a single plane wave.We will see the the neural network has learned this discretized dynamics from its training data.

Plane wave pairs
The last simple scenario that we consider is a system of two plane waves interacting.In this case there is no analytic solution except in the trivial situation where the plane waves are coplanar, in which case the modes do not couple, so they behave as isolated plane wave solutions.Due to the nonlinearity of the equation of motion, two modes that are not coplanar couple together and dynamically generate additional modes.If the displacement mode amplitudes are initially small, these nonlinearities can be computed under Lagrangian perturbation theory (LPT).As the system of modes continues to grow, eventually perturbation theory will fail to accurately capture the full nonlinear dynamics, and we must rely on N-body simulations to model the evolution.
For two initial plane waves, the lowest order modes generated by their coupling have wave vectors For simplicity, we will set both initial phases to zero and neglect any initially decaying linear solutions.
Then the two waves are purely longitudinal, and the waves they generate under second order LPT (2LPT) are, This lowest order mode coupling is equivalent to the G 2 kernel from Eq. ( 8).The 2LPT growth equation for the amplitudes of these generated modes is Given the linear solution for the evolution of A 1 (a) and A 2 (a), this equation can be numerically integrated to determine the amplitudes of the 2LPT modes in the quasilinear regime.
The nonlinear displacement field can be further expanded to higher order in LPT, where modes with wave number k mn = mk 1 + nk 2 are generated for integer values of m and n.At higher order both longitudinal and transversal modes are produced with amplitudes that satisfy equations similar in form to Eq. ( 21), sourced by products of lower order amplitudes.The Lagrangian description of the system breaks down at shell crossing, when the Jacobian becomes singular, so the mapping between Eulerian and Lagrangian coordinates is no longer well defined.We will evaluate our model with input modes that have not yet undergone shell crossing, so the perturbative description is still valid and the mode coupling is well understood.

Power spectra
One of the most important cosmological observables is the matter density power spectrum.For a density field δ m (a, x), the Fourier modes of this field are given by Then the density power spectrum, P mm (a, k) is defined in terms of the equal-time, two-point correlation function of these Fourier modes, On large scales (small k), where the evolution is well described by the ZA, the shape of the power spectrum encodes information about the physics of the early universe and the statistics of the primordial density perturbations.On small scales, nonlinear evolution, along with baryonic effects, make it difficult to predict the shape of the power spectrum and extract cosmological information.For this reason it is important to have robust and accurate modeling of the full nonlinear evolution of the matter fluctuations on small scales.Here, we use the small-scale power spectrum as a basis of comparison against both the full N-body evolution, and the COLA approximation.
In addition to the matter power spectrum, we also consider the Lagrangian displacement power spectrum.This is more directly related to the output of the NN, although less relevant for deriving constraints from observational surveys.The Fourier modes of the displacement field are given by and their power spectrum is defined To test the accuracy of the velocity model we construct the Eulerian momentum field p(a, x).From the Fourier modes of this field, the momentum power spectrum is given by In practice, since the N-body particles represent a discrete point distribution of mass and momentum, the Eulerian density and momentum fields must be constructed by smoothing the particles onto a mesh.We used the cloud-in-cell (CIC) mesh assignment scheme for this purpose on a mesh with 1024 3 grid sites.The mesh is then Fourier transformed using FFTW3 [7], and the resulting density modes are binned according to wave number in bins the width of the fundamental mode in the simulation box.The Eulerian power spectra are computed as the average square amplitude in each wave number bin.The Lagrangian displacement power spectrum is Fourier transformed directly, since the displacement field is already defined with respect to a regular grid.In this case, the mesh size is 512 3 , which is the same as the number of simulation particles.The displacement power spectrum is then computed using the same wave number binning as the Eulerian power spectra.
For each quantity δ m , Ψ, and p, we compute their modes from the CNN model output, the COLA simulations, and the Lagrangian displacement field (middle), and the momentum field (right).The momentum field depends on both particle displacements and velocities.The displacements and momenta have been rescaled to coincide with the matter power spectrum on large scales.The middle row shows the stochasticities with respect to the N-body simulations, and the bottom row shows the relative error in the transfer functions.
N-body simulations.We characterize the errors in our CNN predictions by computing the stochasticity, where P NN×SIM is the cross power spectrum between the model prediction and the simulation modes, P NN is the auto power spectrum from the CNN prediction, and P SIM is the auto power spectrum from the simulation.The stochasticity quantifies the excess correlation in the predicted data that does not match the correlations in the simulation data.We also compute the error in the transfer functions of the CNN output relative to the simulations.For the CNN, the relative error in the transfer functions are, We similarly compute the stochasticity and transfer function errors from the COLA simulations relative to the N-body results.In Fig. 1, we plot the power spectra along with their stochasticities and transfer function errors.We have rescaled the displacement and momentum power spectra to coincide with the density power spectrum on large scales.We computed these results for ten realizations of the linear density field that were not included in the CNN training or validation data sets.The shaded regions in Fig. 1 show the 2σ intervals estimated by bootstrap averaging over these ten realizations.
Our CNN model achieves an accuracy of a few percent at deeply nonlinear scales of k ∼ 1 Mpc −1 h.Additionally, the CNN model significantly outperforms COLA on these nonlinear scales.In terms of computational speed, our model is comparable to COLA, although for optimal performance it must be run on GPU.In terms of accuracy, our models are comparable to the full N-body evolution on Mpc/h scales.
The momentum CNN errors are larger than the displacement errors.This is because the distribution of the particles onto the Eulerian mesh depends on the particle positions.We use the output of the displacement CNN for this purpose, so the displacement errors propagate to the momentum Fiels.The CNN transfer function errors have small oscillations on scales where k ∼ 0.1 Mpc −1 h, which can most easily be seen in the momentum power spectrum.These correspond to slight errors in reproducing the baryonic acoustic oscillations (BAO) of the power spectrum.

Spherical evolution tests
Analytically solvable models offer rich physical insights without handling the full complexity of a problem.They typically impose symmetries on the system configuration, as in the spherical evolution model in cosmology.In a matter dominated universe (EdS), the collapse of overdense (as compared to the mean density of the universe) spheres have analytic solutions before shell crossing, as does the expansion of spherical underdense regions.Having trained our neural network models with complex systems from simulations, we can use the exact solutions as test cases to inspect their performance and understand how they work.
For this purpose, we set up spherically symmetric ZA inputs of radius 50 Mpc h −1 in 250 Mpc h −1 boxes.The boxes are divided on 128 3 grids so that their cell size is the same as in the N-body simulations.Note that by design our models can apply to different volumes of the same grid resolution.We choose a list of spherical top hats with uniform density contrasts  hats expand and overdense ones collapse.Depending on their linear overdensities, the top hats evolve into different nonlinear densities at redshift 0, which can be solved exactly before shell crossing.We compare the neural network predictions to the exact solutions and find good agreement, which is further improved if we avoid the fuzzy edges by only using the inner part (within 0.8 times the radius) of the sphere to estimate the nonlinear relative density.The fuzzy edges can be seen in the right panel of Fig. 2.
For each test case we apply the displacement CNN model to the ZA inputs.Fig. 2 shows the y displacements of particles in a 2D slice through the center of the sphere with δ sphere = 1.55 as an example.The linear input displacements, shown on the left of Fig. 2, vary linearly in the y direction, showing that the particles move towards the center.On the right, the CNN predicted displacements are significantly larger, resulting in a nonlinear overdensity of ≈ 20.Displacements with other values of δ sphere are qualitatively similar.
Quantitatively, we compare the nonlinear densities inside the spherical top hats to the numerically integrated solution of Eq. ( 15) for the same ΛCDM cosmology as of the simulations.We measure the inner densities of the model outputs using the initial linear positions and final nonlinear positions of the particles.Let rlin be the average radial distance of the particles from the center of the grid in the linear input, and let rnl be the average radius of their final positions.The relative nonlinear density relative to the background density is given by (r lin /r nl ) 3 .
The nonlinear density is a monotonic function of the linear one, and should stay uniform if the linear input is uniform.Fig. 3 shows that the CNN model accurately predicts this relation for linear overdensities between −1 and 1, while having a small deviation beyond that range.This is due to the fact that the network is not fully resolving the sharp edges of the spherical top hats, as can be seen in Fig. 2.This causes some leakage to outer regions in the density estimation.To verify this, we also measure the nonlinear density using only particles within 0.8 of the sphere radius.The inner radius results match the exact solutions more accurately, as shown in Fig. 3.Note that the edge of the top hat is both where the distribution of matter has the most rapid spatial variation and where the neural network errors dominate.We will return to this observation in the next subsection.

Isolated longitudinal modes
Another instructive example of a solvable system is an isolated plane wave.As described in subsection 3.2, for purely longitudinal (∇ × Ψ = 0) and purely transversal (∇ • Ψ = 0) waves of particle displacement, the ZA (Eqs.(17)(18)) is the exact solution for their time evolution.
In our simulation boxes, wave vectors have the form k = k F (n x , n y , n z ), where k F = 2π/L Box is the fundamental wave number of the box and the n i are integers.We ran N-body simulations with ZA initial conditions containing single isolated longitudinal displacement modes with initial wave vectors k = k F (n, 0, 0) where n ∈ {1, 2, 4, 8, 16, 32, 64, 12, 256}.The first mode is the fundamental mode, and the last is the Nyquist  4. Displacement divergence amplitudes from isolated longitudinal modes.Top panels show the N-body simulation results, middle panels show the neural network predictions, and bottom panels show the ratio of neural network to N-body mode amplitudes.The left two columns are for longitudinal modes with displacements only along the x direction.The right two columns are for modes with equal displacement amplitudes along the x and y directions and double these displacements in the z direction.The dashed lines in the top two rows show the expected amplitudes from theory.The circled data points correspond to the original input modes in the initial conditions.The first and third columns show the input mode and even harmonic modes, while the second and fourth columns show the odd harmonics.The higher harmonics are absent in the continuum limit and arise from the discreteness of the N-body systems, which the neural network learns from the N-body training data.mode for the 3D cubic mesh.We also ran a set of simulations with initial longitudinal modes k = k F (n, n, 2n).For this set of diagonal waves, we omit the n = 256 case, since it is beyond the Nyquist mode in the last dimension.Each mode was given a linear amplitude corresponding to A = 1/2, which ensures that shell crossing does not occur.
The amplitudes of displacement divergence modes from the N-body simulation and the CNN are plotted in Fig. 4. The circled data points correspond to the input, ZA modes.The panels in the first and third columns show the input modes along with even order harmonics, and panels in the second and fourth columns show the odd order harmonics.These higher harmonics are present in both the N-body simulation and in the CNN output due the discreteness of the N-body evolution, which the CNN learns.
In the let column of Fig. 4, the final amplitude of the input modes are predicted to better that 1% accuracy for n < 16 and to within 3% accuracy including n = 16, corresponding to k = 5.0 × 10 −2 Mpc −1 h).This demonstrates that the CNN preserves the ZA evolution on the largest scales, as it should.The CNN overpredicts the amplitudes of smaller scale modes, approaching the deeply nonlinear regime.In the continuum limit, the growth of these displacement modes should be completely independent of scale.However, we see that the N-body simulations exhibit a decreasing amplitude on small scales where the waves are poorly sampled by a small number of discrete particles per wavelength.In this sense, the neural network predictions are actually closer to the continuum evolution.Although, for the x-directed wave with the smallest wavelength, the CNN predicts an enhanced amplitude.Such a large error indicates that this mode is past the limit of the CNN's learned resolution.For the diagonal waves (the two rightmost columns in Fig. 4), the CNN amplitudes decrease with increasing k, better reproducing the N-body results.
The CNN reproduces the higher harmonics best on scales between 0.2 Mpc −1 h < k < 0.8 Mpc −1 h.Outside of this range of scales, the CNN overpredicts the amplitudes.The odd harmonics have larger amplitudes than their nearest neighbor even harmonics.However, the neural network predicts them to have similar amplitudes, and so the CNN generally reproduces the N-body odd harmonics better than the even harmonics.To understand the pattern of CNN errors for higher harmonics, we must consider the waves in configuration space.As a function of initial Lagrangian position, the waves have nodes at every half wavelength where the displacement field vanishes.The distribution of matter has the highest spatial variation at these nodes.Just as in the previous subsection, the CNN errors are largest in regions of rapid density variation, so these nodes are where the CNN errors dominate.It is the even harmonics that have nodes that overlap with the input wave nodes, so these have worse errors than the odd harmonics.
We also ran a set of simulations in which the initial mode was fixed as k lin = k F (16, 0, 0) and the amplitude of the mode was varied from A = 0.1 to A = 8 (see Eq. ( 17)).Note that A = 1 is the minimum amplitude for which an isolated mode has undergone shell crossing at redshift zero, since the determinant of the Jacobian between the Eularian and Lagrangian coordinates vanishes wherever q • k = π/2 for these modes.
The results for the final amplitudes are shown in Fig. 5.The dashed line is the expected final amplitude from theory.Both the CNN and the N-body evolution have enhanced final amplitudes when the initial amplitudes are A = 0.1 and A = 0.2, which are the smallest amplitudes we considered.The neural network is about 6% higher than the N-body evolution for these cases.For input amplitudes from A = 0.5 to A = 2.0, both the CNN and N-body results agree well with theory, meaning the neural network can accurately predict the amplitudes of isolated modes even after shell crossing.This indicates that the CNN has learned nonperturbative aspects of the dynamics from the training data, which is well past shell crossing.For higher initial amplitudes, the N-body results are lower and diverging away from the linear growth calculation and the CNN overpredicts these N-body amplitudes.
Input data containing a single plane wave are a significant departure from the training data that the neural network has encountered.Errors in the CNN predicted amplitudes are thus expected for these extremely simple configurations, especially on small scales where nonlinear mode coupling is nonperturbative so the behavior of isolated modes may not be easily inferred in this regime.However, the CNN predicts a set of mode amplitudes that matches the full N-body evolution in most cases, even after shell crossing.This demonstrates that the CNN has inferred some general physical principles from its training data, enabling it to accurately generalize to the evolution of isolated plane waves across a wide range of scales and amplitudes.

Isolated transversal modes
The two previous examples focused on simple physical systems that can be solved exactly, and for which the CNN should be able to generalize its training and make reasonable predictions.This is because these systems evolve according to the same underlying physics that the CNN, to some degree, is inferring Fig. 6.Mode amplitudes from N-body simulations and neural network predictions with initial data corresponding to isolated transversal modes.The top panel shows the displacement curl component that is consistent with the initial data.The dashed line is the expected amplitudes from linear theory, and the dash-dotted line is the expected amplitude for a longitudinal mode.The second panel shows spurious curl modes predicted by the neural network that are perpendicular to the initial wave vector but parallel to the initial displacements.The bottom panel shows spurious longitudinal modes predicted by the neural network.
as it trains.An example of a simple physical system that we expect the neural network to perform poorly on is a transversal displacement mode, which is totally absent from the training data.
The transversal ZA, given by Eq. ( 18), is the exact solution for isolated transversal displacement modes.In this case, time dependent amplitudes evolve according to a constant and a decaying solution, but no growing solution.We include both a constant and a decaying component in the transversal waves we consider, in order to give them analogous initial conditions to the longitudinal modes.Specifically, our choice of initial condition for these modes is such that the displacement and velocity have the same coefficients at the initial time as the longitudinal modes discussed earlier, with initial velocities in the same direction as their displacements.This is achieved by giving the decaying component a coefficient with the opposite sign to the time independent component.The mode decelerates as it grows in amplitude, asymptotically approaching the amplitude of the purely the time independent part.However, this is really just a physical mode decaying away and leaving behind a residual, unphysical gauge mode.
Typically N-body simulations omit transversal modes, since they represent only decaying or pure gauge solutions.Thus, the CNN has never seen a transversal mode in its training data, and we do not expect it to accurately predict the evolution.We test this by running several N-body simulations for isolated waves with initial displacement amplitudes A ⊥ = (0, 0, 1) and wave vectors k = k F (n, 0, 0), and compare these with CNN predictions.The results are plotted in Fig. 6.The top panel shows amplitudes of the displacement field curl modes along the direction corresponding to the curl of the input mode.The N-body simulation produces mildly growing modes that have already converged to a nearly time independent behavior.The dashed line in the top panel of Fig. 6 shows the expected final amplitudes from linear theory.The neural network predicts that this mode will grow exactly as a longitudinal mode (the dash-dotted line), since this is the only behavior it has encountered.
In the middle panel of the same figure we show modes that are perpendicular to the expected curl of the displacement field.These are perpendicular to the initial wave vector but parallel to the initial displacement vectors.On the bottom panel we show the divergence modes.These modes vanish in the N-body results, as they should, but the CNN predicts nonvanishing amplitudes for them.Interestingly, we find that the CNN correctly predicts vanishing amplitudes for the curl component perpendicular to the initial displacements but parallel to the initial wave vector.Clearly, the CNN is unable to predict the correct evolution for transversal modes because it has never been exposed to them.In general, the model has not inferred any of the forms of integration kernels, such as those in Eq. ( 12), containing initially decaying modes.

Coupled mode pairs
As a final example, we consider a system initially containing two noncoplanar, longitudinal displacement modes.As discussed in subsection 3.3, the evolution of this system cannot be solved analytically, but can be accurately described by LPT as long as the mode amplitudes remain sufficiently small.
We ran sets of N-body simulations for systems containing only two longitudinal waves in their initial conditions.For both sets, one mode was fixed as k 1 = k F (8, 0, 0), while the second mode differed between simulations.One set had k 2 perpendicular to k 1 , with wave vectors k 2 = k F (0, n, 0) for n ranging from 1 to 256 in powers of 2. The other set had wave vectors k 2 = k F (n, n, 0), with both a parallel and perpendicular component to the first wave vector.
In Fig. 7, we show the amplitudes of the displacement divergence modes with wave vectors given by k mn = mk 1 + nk 2 up to fifth order in m and n.Qualitatively we see that the largest amplitude modes of the neural network match the largest amplitude modes from the simulation, and the overall pattern of the generated modes is well reproduced by the CNN.The CNN overpredicts the amplitudes of higher order modes, although these are two or more orders of magnitude smaller than the amplitude of the input mode.The CNN only has significant error for the smallest modes, which again demonstrates the resolution limit of the model.8. Displacement divergence amplitudes for input containing two longitudinal modes.The blue circles correspond to input modes with the wave vector k1, which is held fixed for all the initial conditions.The red squares correspond to the input modes with wave vectors k2.The black and purple triangles show the amplitudes of modes that are generated in 2LPT.The green horizontal dash-dotted line shows the predicted 2LPT amplitude for these modes.The vertical dashed line shows the scale approximately corresponding to the CNNs field of view.The top row shows the CNN predictions, and the bottom row shows their ratio to the N-body simulation results.
In Fig. 8, we show the amplitudes of the input mode along with the modes generated in 2LPT as a function of k 2 .The green dash-dotted line shows the expected 2LPT amplitudes.The vertical dashed line shows the scale corresponding to the physical size of the CNN field of view.On scales larger than the field of view, the CNN underpredicts the N-body simulation amplitudes, which is not surprising since the CNN cannot fully model the interaction between modes on these scales.For scales where k 2 > 0.2 Mpc −1 h the CNN overpredicts the amplitudes of the 2LPT modes.This is similar to the results for isolated waves, where the discreteness of the N-body evolution causes the smallscale modes to have lower amplitudes.Here again we see that the CNN predicts amplitudes closer to the continuum theory, since it assigns amplitudes close to the expected 2LPT values down to scales where k 2 ∼ 1 Mpc −1 h.Notice that the presence of the second mode does not affect the predicted amplitude of the first mode, which is the same for all simulations.
We find that the CNN is able to generalize well to these examples of coupled mode pairs, which are extremely different from the generic configurations of model's training data.The CNN model is able to accurately predict the 2LPT amplitudes over the range of scales where we expect it to perform well.This demonstrates that the network parameters encode information about the shapes of the mode coupling kernels in Eq. (8)(9).The errors of the CNN model reveal its limitations in terms of field of view and resolution, which can be used as diagnostics for future model improvement.

Conclusion
We have trained CNNs to model the full phase space evolution of N-body simulations.The model has been designed to preserve the correct evolution on large, linear scales, and effectively nonlinearize the evolution on smaller scales.Our models are accurate to a few percent deep into the nonlinear regime, at small scales of ∼ 1 Mpc h −1 .This is a significant improvement in accuracy over other fast approximate N-body solvers such as COLA.We tested the robustness of our model in three idealized scenarios where either exact or perturbative theoretical calculations can be carried out.These include spherically symmetric configurations, isolated plane waves, and pairs of coupled plane waves.Each of these represents a dramatic departure from the random Gaussian initial conditions of the full N-body simulations, as in Eq. ( 10), which were used in model training.In all of these scenarios the CNN model was able to accurately reproduce the correct evolution across a wide range of scales from the quasilinear down to the deeply nonlinear regime.Therefore, the model parameters have successfully encoded the dynamics of mode coupling represented in Eqs.(8)(9).The regimes where the CNN model made significant errors can be understood in terms of its limited field of view on large scales and its finite resolution on small scales.In particular, the model errors were most significant in regions where the matter distribution has strong gradients, such as at the edge of a spherical overdensity or at the nodes of an isolated plane wave.This demonstrates how these idealized scenarios can reveal both the robustness and the limitations of such complex CNN models, and indicates which features of the data should be targeted for more precise modeling.These tests help build confidence in our model, paving the way to applications in precision cosmological inference, where accelerating predictions of the nonlinear clustering is crucial for obtaining optimal parameter constraints.

Modeling Structure Formation by Simulation or Perturbation Theory
Cosmological structure formation can only be accurately modeled by perturbation theories in the large-scale, perturbative regime, but requires N-body simulations in the deeply nonlinear, small-scale regime.Since our Universe begins with a nearly uniform, Gaussian initial condition, N-body simulations start with particles slightly perturbed from a uniform configuration, often a grid.The initial displacement of each particle Ψ from its grid location q can be accurately computed with LPT, as described in the previous section.The ZA predicts linear particle motion Ψ ZA (a, q) = D(a)/D(1)Ψ ZA (1, q), where D(a) is the linear growth factor.Since only the growing mode is kept, the ZA velocity field is linearly related to the displacement field through v ZA (a, q) = aH(a)f (a)Ψ ZA (a, q), where f (a) = d log D(a)/d log a is the linear growth rate.
Nonlinear structure formation occurs at late times on small scales, where the ZA and higher order LPT fails to predict the particle motion.Modeling these trajectories accurately requires expensive N-body simulations, which typically integrate thousands of time steps.On the other hand, the Universe remains relatively uniform on the large scales where perturbation theories still apply.This motivates us to build our model to reproduce the small-scale N-body structure formation, while preserving the ZA evolution on large scales.Convolutional neural networks are well suited for this purpose, since they excel at local textures without a global view beyond their finite receptive fields.In other words, our goal is to train CNN models to nonlinearize the linear ZA inputs to reproduce the N-body simulation outputs.
For N-body data, we use 210 simulations from the publicly available Quijote suite [25]: 180 for training, 20 for validation, and 10 for testing.Each simulation has 512 3 particles in a 1 (Gpc h −1 ) 3 box, and has cosmological parameters: Ω m = 0.3175, Ω b = 0.049, Ω Λ = 0.6825, h = 0.6711, σ 8 = 0.834, and n s = 0.9624.We compare our model to the popular method COLA [COmoving Lagrangian Acceleration, 24], as implemented in L-PICOLA [12].COLA evolves particles relative to their second order LPT (2LPT) trajectories, using only tens of time steps, making it much faster than a full N-body simulation.We run L-PICOLA with the same configurations as the full N-body simulations.We train and test our models at redshift z = 0, corresponding to today.

Designing Networks with Physical Considerations
As explained above, we want the CNN to maintain the ZA predictions (Ψ ZA and v ZA at the output time of the simulation snapshot) on large scales, which are still perturbative, to account for the nonlocal gravitational interactions beyond the local views of the CNN.The CNN takes the linear inputs and forms accurate small-scale patterns.This is easily achieved by a global residual operation that adds the linear input to the network output, i.e., the CNN effectively predicts the difference between the N-body and ZA displacements.
The global residual connection has other advantages too.One is that the Galilean symmetry can be approximately preserved by effective data augmentations.This is because, even though ZA and N-body evolution may be affected by large-scale shifts in displacements or velocities, i.e., by adding a global constant vector, the residuals are not.The other advantage is that the large-scale dependence on cosmological parameters is automatically accounted for by the ZA inputs, so the CNNs only need to model their small-scale modulations, as explained below.Therefore we separately train two networks, one for displacements and one for velocities, with linearly related inputs.
By construction, CNNs preserve the translational symmetry if they are fully convolutional.However, most deep learning applications break this by the nonperiodic paddings commonly adopted in computer vision tasks.We fully preserve the translational symmetry using the N-body periodic boundary condition.In addition, our models approximately preserves the rotational symmetry of the cubic geometry by data augmentations that rotate and reflect the input and output fields according to all 48 variants as in [10].

Network Architecture and Training
Before feeding the data through the CNN models it must be prepared in an image-like format.This is straightforward due to the nearly uniform initial condition of the Universe.Both the ZA particle displacements and nonlinear N-body displacements are functions of the Lagrangian or initial positions q, which form a uniform grid.Therefore both of them are a 3D image, with 3 channels being the Cartesian components of the displacement fields.We format the velocity fields likewise.
We adopt a simple U-Net / V-Net [21,17] architecture similar to that in [1].It uses 3 levels of resolution connected in a "V" shape, comprising first 2 downsampling layers and then 2 upsampling layers by stride-2 2 3 convolutions and stride-1 /2 2 3 transposed convolutions, respectively.The input, resampling layers, and output are connected by 2 3 3 convolutional blocks.A residual connection [ResNet,9], here a 1 3 convolution instead of identity pass, is added over each block, as in V-Net.A batch normalization layer follows every convolution layer except the first one and the last two, and a leaky ReLU activation (with negative slope 0.01) follows each batch normalization, as well as the first and the second to last convolutions.As the original ResNet, the last activation in each residual block applies after the addition.The inputs to each downsampling layer (on the left side of "V") are concatenated to the outputs of the upsampling layers (on the right side of "V") of the same resolution, at the top 3 resolution levels.All layers have 64 channels, except those after concatenations (128), and the input and the output (3).Unlike the original U-Net architecture, we add the input displacement or velocity fields directly to the output, so that the network only needs to learn the nonlinear residuals, as we have mentioned above.
In the Lagrangian description, the simplest loss functions for training our networks minimizes the residuals in the displacements.[1] and [16] have shown that combining that with a loss on density can greatly improve the model performance in the Eulerian space.We can compute the particle positions using their displacements, and then the Eulerian density distribution described by the overdensity δ(x) ≡ n(x)/n − 1. n(x) is the particle number in grid cell at x, and n is its mean value.We use the CIC scheme to interpolate the particles to the grid.For our displacement model, the total loss function is log L δ + λ log L Ψ , where L δ and L Ψ are the mean squared error (MSE) losses on n(x) and Ψ(q), respectively.Adding the two losses in logarithm allows the optimizer to ignore their absolute magnitudes and trade between their relative changes.The hyperparameter λ controls this trade-off, and we set λ = 3 in the displacement network for faster training.
For the velocity network, we similarly use the particle momenta, Ψ along with a CIC mesh construction of the Eulerian momentum field, p(a, x).We use the loss log L Ψ + log L p .We refer to the former term as the momentum because all particles carry the same amount of mass, so their velocities represent mass weighted velocities, rather than the volume weighted velocities of the Eulerian fluid description.We do not multiply by the particle mass in practice though, since this only adds a constant to the loss.
Training neural networks on 3D data can easily be GPUmemory-bound tasks.The entire Quijote simulation field (512 3 ) cannot be fed at once to the network, so smaller chunks of size 128 3 are used.To preserve the translational equivariance, we periodically pad the input cubes by 20 voxels per side, but disable paddings in the convolutions.We train with a batch size of 16 distributed on 16 Nvidia V100 GPUs, and use the Adam optimizer [13] with learning rate 0.0004, and hyperparameters β 1 = 0.9, β 2 = 0.999.We reduce the learning by half when the loss does not improve for 3 epochs.

Fig. 1 .
Fig. 1.The top row shows the matter density power spectra (left) from the CNN displacement model (blue, solid) and COLA (red, dashed), the

Fig. 2 .Fig. 3 .
Fig.2.Spherical evolution test displacement fields.Our neural network models take linear displacement fields as input and predict nonlinear outputs to emulate N-body simulations.We consider test cases with spherical symmetry that admit exact solutions.The left panel shows the linearly varying y displacement of particles in a 2D slice through the center of a 50 Mpc h −1 sphere in the Lagrangian space.The inner particles are set up in a linear overdensity of 1.55 so they collapse into a smaller region at redshift z = 0 without shell crossing.The right panel shows the nonlinear network prediction of the same field.The nonlinear displacements also vary linearly, and are much larger in magnitude to create a nonlinear density of ≈ 20.

Fig. 5 .
Fig. 5. Displacement divergence mode amplitudes from N-body simulations and neural network predictions for a fixed wave vector and varied initial amplitude.The dashed line shows the expected amplitude from linear theory.

Fig. 7 .
Fig.7.Amplitudes of displacement divergence for initial data with two longitudinal waves.The left set of plots show cases where k2 is perpendicular to k1, and the right set shows cases where k2 has both perpendicular and parallel components to k1.Each set of plots shows the amplitudes from the simulations on the left, the neural network in the middle, and their difference relative to the input amplitude on the right.Each color dot corresponds to the amplitude of a mode with wave vector kmn = mk1 + nk2, where values of m and n are along the horizontal and vertical axes respectively.