Deep learning dark matter map reconstructions from DES SV weak lensing data

We present the first reconstruction of dark matter maps from weak lensing observational data using deep learning. We train a convolution neural network with a U-Net-based architecture on over 3.6 × 105 simulated data realizations with non-Gaussian shape noise and with cosmological parameters varying over a broad prior distribution. We interpret our newly created dark energy survey science verification (DES SV) map as an approximation of the posterior mean P(κ|γ ) of the convergence given observed shear. Our DeepMass1 method is substantially more accurate than existing mass-mapping methods. With a validation set of 8000 simulated DES SV data realizations, compared to Wiener filtering with a fixed power spectrum, the DeepMass method improved the mean square error (MSE) by 11 per cent. With N-body simulated MICE mock data, we show that Wiener filtering, with the optimal known power spectrum, still gives a worse MSE than our generalized method with no input cosmological parameters; we show that the improvement is driven by the non-linear structures in the convergence. With higher galaxy density in future weak lensing data unveiling more non-linear scales, it is likely that deep learning will be a leading approach for mass mapping with Euclid and LSST.


I N T RO D U C T I O N
The evolving cosmological density field is rich in information about the cosmological model of the Universe, its unknown parameters, and cosmic web-dependent astrophysics. Though the largest fraction of the density is invisible dark matter, the gravitational lensing effect of galaxies can be used to infer fluctuations in the total foreground matter distribution. Accurate mass maps will be essential for the science goals of the upcoming LSST survey and the ESA Euclid mission.
The maps considered in this paper are of the two-dimensional convergence, κ, a weighted projection of the matter density field in the foreground of the observed galaxies. Recovering the convergence from the measured galaxy shapes, known as observed shear γ obs in the weak lensing regime, is an ill-posed inverse problem, troubled by survey masks (missing data) and galaxy 'shape noise'.
A typical principled approach to reconstructing more accurate mass maps in the presence of noisy, masked shear data is to use E-mail: niall.jeffrey.15@ucl.ac.uk 1 github.com/NiallJeffrey/DeepMass physically motivated priors. In Jeffrey et al. (2018b), it was shown that using either Gaussian priors or 'halo model' sparsity priors for κ improved the accuracy of the reconstructions with the dark energy survey science verification (DES SV) data. Implemented methods include using lognormal (Böhm et al. 2017) priors or E-mode priors (Mawdsley et al. 2019).
However, all of these priors take functional forms that only approximate the true object of interest, the prior on the convergence field P (κ|M) (with model assumptions M). These approximations are necessary because we cannot represent the probability distribution of the non-linear density field in a closed form. For example, we cannot characterize it uniquely in terms of its moments (Carron & Szapudi 2017). Even if the true, unapproximated prior were available, evaluation via direct calculation would likely be intractable.
Fortunately, we can still draw realizations of convergence maps from the prior distribution P(κ) in the form of simulations, which provides an opportunity to a new generation of methods based on deep learning. Such an approach has been simultaneously proposed by Shirasaki, Yoshida & Ikeda (2018), where a conditional adversarial network was used to learn a mapping from noisy convergence maps to an estimate of the noise-free convergence.
In this work, we propose a deep learning method to estimate the posterior mean of the convergence map from observed weak lensing shear measurements. In Section 4, we demonstrate our method on simulations and the DES SV data.

Shear and convergence
Given a distribution of source galaxies n in radial comoving distance ω, the convergence at position φ → on the sky is given by a weighted integral of the density where H 0 is the present value of the Hubble parameter, a is the cosmological scale factor, m is the matter density parameter, and δ is the overdensity. We express the linear data model in matrix notation, where n is a vector of noise per pixel. The matrix operator acting on the convergence Aκ is the shear contribution due to lensing (Bartelmann & Schneider 2001). In this formulation, the elements γ are the complex shear measurements binned into angular pixels in a two-dimensional image format. We do not take into account the second-order effects of reduced shear (Schneider & Seitz 1995), flexion (Bacon et al. 2006), or intrinsic alignments (Kirk et al. 2015). However, the deep learning approach taken in this paper is extremely flexible; as long as an effect can be modelled and included in the training data, it will be taken into account in the mass map reconstruction. This is generally not true of other methods. For example, flexion requires reformulations of methods (e.g. Lanusse et al. 2016). Additionally, noise per pixel is invariably approximated as Gaussian, which we do not assume in our deep learning approach.

Previous mapping approaches
The original mass-mapping approach by Kaiser & Squires (1993) was a direct deconvolution. In practice, Kaiser-Squires (KS) inverts the matrix A in Fourier space, where the matrix is diagonal. As this deconvolution is across a finite space, the edges of the data and internal masks introduce artefacts. KS is further troubled by the noise term in equation (2), which it does not take into account.
In a Bayesian framework, we may wish to consider the posterior distribution of the convergence κ conditional on the observed shear γ The denominator P (γ ) is a Bayesian evidence term conditional on model M. The first factor of the numerator is the likelihood P (γ |κ, M), which encodes our noise model. The second term is the prior P (κ|M), a possible selection of which was discussed in Section 1. If we believe that a realization of the convergence κ is a realization of Gaussian random field, then the form of P (κ) would be Gaussian. If the noise per pixel is Gaussian, then the likelihood is also Gaussian, which results in a posterior distribution with both the mean and maximum given by the Wiener filter: where S κ = κκ † and N = nn † are the signal and noise covariance matrices, respectively (Wiener 1949, Zaroubi et al. 1995, Jeffrey, Heavens & Fortio 2018a. The signal covariance in harmonic space is diagonal for isotropic fields. On the sphere, its elements are given by the κ power spectrum, C κ ( ). This Gaussian distribution is only approximately true for large scales where Gaussianity persists from the early Universe. On smaller scales, non-Gaussianity grows due to non-linear structure formation, which results in the cosmic web of the late Universe.

Convolution neural networks
We take a standard deep learning approach. We seek an approximation F to the function that maps the pixelized shear to the convergence map where the parameters of the function are to be learned (Goodfellow, Bengio & Courville 2016). We learn these parameters by minimizing a mean square error (MSE) cost function evaluated on a set of training data, which consists of pairs of realistic shear and 'truth' (noise-free) convergence maps. If the training data 'truth' maps are drawn from a prior distribution P(κ), and the corresponding noisy shear map is drawn from the likelihood P(γ |κ), this MSE cost function corresponds to F (γ ) being a mean 2 posterior estimate (Jaynes 2003), such thatκ is approximatinĝ We use a deep convolution neural network (CNN) to approximate the function F , where the parameters are primarily elements of learned filters in convolutional layers. CNNs are particularly suited for two-dimensional image or one-dimensional time-series data with translation invariant features in the underlying signal.
The CNN is a series of iteratively computed layers. At a given layer j, the signal x j is computed from the previous layer with linear operator (e.g. convolution) M j and non-linear activation function ρ (LeCun et al. 1990, Mallat 2016. The output of a layer is sometimes called a feature map. Due to their additional layers, deep architectures are often able to learn features with greater complexity than shallow architectures and therefore can better approximate the target function. For a general overview of deep learning and neural networks, we recommend Goodfellow et al. (2016).

DeepMass architecture
Our DeepMass architecture is based on the U-Net (Ronneberger, Fischer & Brox 2015), which has a so-called expanding path and contracting path. The DeepMass contracting path differs from the original U-Net: Usually convolutions and activation are followed by a max pooling operation to downsample the images, whereas we use average pooling (Géron 2017). With each downsampling operation, the images decrease in resolution, but the 3 × 3 filters cover more angular size of the image. The convolution after a pooling operation therefore has a receptive field that covers larger physical features in the convergence κ map.
There are similarities between U-Net architectures and sparse recovery methods. These consider representations where the solution is sparse and employ transforms that are fixed (e.g. Fourier, wavelets) or learned from data, and optimization is solved using proximal theory (Starck, Murtagh & Fadili 2015). The U-Net expanding and contracting paths are very similar to synthesis and analysis concepts in sparse representations. This has motivated the use of wavelets to implement the U-Net average pooling and the expanding path Ye, Han & Cha 2018). There are nevertheless significant differences: U-Nets can learn rich sets of features (corresponding to sparse dictionaries) from large training data sets, and the CNN implementation of non-linearity.
We differ from the original U-Net by not using padding in the convolutional layers, as the edge of our data mask is already many pixels away from the edge of the square image. This choice means that output of a convolution has the same image dimensions as the input.
The full architecture and code can be seen online: DeepMass † . We have added batch normalization layers (Ioffe & Szegedy 2015) after each convolutional layer; without this, training often became stuck in local minima of the cost function with respect to the parameters . For all layers, except for the final, we use the rectified linear unit activation. In the final layer, we use a sigmoid function, which forces the output to be between 0 and 1 (inputs and outputs are correspondingly rescaled).
For simplicity and memory efficiency, we aimed to work with real (32-bit) numbers, thus necessitating an initial operation acting on the complex shear γ . The best results came from using a fixed Wiener filter operation before the first convolution (rather than KS, as might be expected). This is equivalent to the first layer having M j =0 = W and ρ = 1, with no free parameters. We could also interpret the U-Net after the initial Wiener operation as G , where F (γ ) = G (W (γ )). The Wiener filter used a power spectrum with cosmological parameters σ 8 and m , fixed at the mean of the marginal posterior distributions from DES Y1 analysis (Abbott et al. 2018). The flat-sky power spectrum was an average of 102 power spectra of projected patches.

L-PICOLA simulations
The training data are derived from 74 independent dark matter simulations, with each simulation covering an octant of the sky. The simulations used a standard flat lambda cold dark matter cosmological model with H 0 = 70 km Mpc −1 s −1 . The scalar spectral index and baryon density were fixed at n s = 0.95 and b = 0.044, respectively. The values of m and the amplitude parameter σ 8 are distributed on a non-Euclidean grid with distances between points giving a density according to our prior P(σ 8 , m ) as shown in Fig. 1. Weak lensing constraints are most sensitive to combinations of this pair of parameters, so we avoid overfitting to a single cosmology by varying them in the training data.
To generate a convergence map from a simulation, the matter particles were binned using the HEALPIX (Górski et al. 2005) pixelization of the sphere with NSIDE = 2048 in comoving radial shells of 50 Mpc h −1 . The density ρ map in a given redshift was converted into an overdensity δ = ρ/ρ − 1 using the average density in the shellρ. The convergence was calculated per pixel using equation (1). We wish to have the n(z) in the lensing kernel match the DES SV data (Section 4.1), which we approximate by summing the individual posterior redshift distributions per galaxy from the BPZ photometric redshift code (Coe et al. 2006). The convergence maps were downgraded to NSIDE = 1024. The dark matter simulations are generated using the L-PICOLA code (Howlett, Manera & Percival 2015), which is based on the COLA (Tassev, Zaldarriaga & Eisenstein 2013) algorithm. This uses a combination of second-order Lagrangian perturbation theory and a particle mesh that requires fewer time-steps than 'full' N-body (e.g. Gadget Springel 2005) and therefore can generate simulations more quickly. This allows more training data to be generated in a given amount of compute time.
We used a 1250 Mpc h −1 comoving simulation box, 768 3 particles, and a 1536 3 grid. A z < 1.6 light-cone was generated with up to four box replicates, using 30 time-steps from z = 20. The initial conditions used Eisenstein & Hu (1999) for the linear matter power spectrum.
The drawback of this approach is the accuracy of the dark matter distribution. The finite spatial resolution and fewer time-steps used by the COLA method particularly affects small distance scales. Our experiments have shown a suppression of the L-PICOLA power spectrum at scales of > 700 of the order of 10 per cent [relative to NICAEA 3 (Kilbinger et al. 2009) theory], as is expected with COLA methods. We correct the power of the L-PICOLA convergence by estimating the smooth part of the C κ ( ) using a polynomial order 1 Savgol filter with window size 91 for each convergence map and reweighting spherical harmonics. Using the ratio of NICAEA and only the smooth part of the measured simulation, power spectrum ensures that the natural fluctuations inherent in C κ ( ) for a given realization are preserved.

Training images
From the 74 independent HEALPIX convergence maps over an octant of the sky, we generate 376 684 DES SV mock data realizations. A given realization is generated from the HEALPIX convergence map by randomly choosing a position on the sphere, applying a uniform random rotation between 0 and 360 deg, and extracting a square patch using a gnomonic projection with 256 2 pixels of size 4.5 2 arcmin 2 . If the generated image has pixels outside the octant, it is rejected. The rotation step is not to make the reconstruction rotation invariant, which happens naturally as P (κ) is isotropic by the cosmological principle, but it is to augment the training data and learn F better.
From the projected square κ convergence map, the complex noise-free shear map is generated using the A matrix from equation (2). The mask is applied and a random shape noise map is added. The noise map is generated by randomly shuffling the positions of galaxies in the original catalogue; this keeps the density of galaxies the same, but destroys the coherent lensing signal. This way, we forward model the non-Gaussian noise inherent in the data (something that other methods do not do).

Dark energy survey SV data
DES is a ground-based photometric galaxy survey, observing in the southern sky from the 4 m Blanco telescope in Chile with five photometric filters (Flaugher et al. 2015). The SV (A1) data 4 come from an initial run of 139 deg 2 , but with depth approximately that of the full 6 yr survey (Chang et al. 2015). We make a redshift cut of 0.6 < z mean < 1.2, where z mean is the mean of the z posterior for each galaxy. Data selection choices match Jeffrey et al. (2018b), although some maps appear different due to changes in pixel size and flat-sky projection.
In Fig. 2, we apply KS, Wiener filtering, and the trained DeepMass CNN. KS uses a 10 arcmin Gaussian smoothing as in Jeffrey et al. (2018b). The Wiener filtering uses a power spectrum with m and σ 8 at the mean of their respective marginal posterior distributions from the year 1 DES cosmology result (Abbott et al. 2018). The DeepMass CNN was trained using the Adam optimizer (Kingma & Ba 2015) with a learning rate = 1 × 10 −5 for 4 http://des.ncsa.illinois.edu 20 epochs (retraining over the full training set). The final Wiener and DeepMass maps were smoothed with a Gaussian kernel of σ = 2.25 arcmin (half pixel size) to remove very small-scale artefacts arising from the HEALPIX projection.
The DeepMass reconstruction clearly shows more non-linear structure than the Wiener filter. Individual peaks, which are suppressed by Wiener filtering, are resolved by DeepMass. The accurate recovery of non-linear and peak structures using DeepMass is studied quantitatively in Section 4.2.

Validation on simulations
Of the originally generated training images (Section 3.3), 8000 were reserved for validation and not used for training. One such example can be seen in Fig. 3, with the corresponding Wiener filter and DeepMass reconstructions. As with the reconstruction from observational data, DeepMass can be seen to recover the non-linear (cosmic-web) structure better than Wiener filtering. Compared to Wiener filtering, the MSE over all 8000 maps is improved using DeepMass by 11 per cent.
As DeepMass is estimating the mean of a posterior probability distribution (equation 7), there are inevitably structures that appear in the map (Fig. 3) but do not appear in the truth, and structures that appear in the truth but not in the reconstruction. Exploring the full posterior distribution, and thereby quantifying uncertainty, is a rich topic for future work.
To understand how DeepMass performs over different regimes, we can measure the MSE for a subset of pixels in the map with true value greater than some threshold, κ > X. Fig. 4 shows this for three different examples (labelled A, B, and C), chosen from the validation sample for their different power spectra.
The bottom row of Fig. 4 shows the MSE difference between DeepMass and Wiener filtering (MSE DeepMass − MSE Wiener ) for these three examples. In each case, the DeepMass improvement over Wiener filtering is shown to be driven by pixels with large true convergence κ truth . That is, DeepMass improves over Wiener filtering more for pixels with larger κ values (with increased improvement for those with κ truth > 0). Compared to Wiener  filtering, DeepMass is able to better reconstruct high κ value, nonlinear structures in the convergence κ fields.
The error bars in the bottom row of Fig. 4 are given by σ 2 MSE, DeepMass + σ 2 MSE, Wiener )/N κ>X , where N κ > X is the number of pixels in the sample above the threshold, and σ 2 MSE is the variance of the measured MSE over the sample for a given method.
The three map examples in Fig. 4 were chosen to demonstrate performance when the underlying true maps have different power spectra (top row). The map with the smallest power (example C) has the least improvement of DeepMass over Wiener filtering (though the improvement is still larger for high-valued pixels). Example A, with a larger power spectrum, shows a much more significant improvement in MSE. Fig. 4 demonstrates that (1) DeepMass improves over Wiener more for pixels with larger κ value, and (2) DeepMass improves over Wiener more when the underlying map has more structure (higher power).
In Jeffrey et al. (2018b), use of a 'halo-model' sparsity prior did not outperform Wiener filtering in terms of MSE, and so we expect DeepMass MSE to outperform GLIMPSE. However, MSE . Power spectrum of the residuals (κ =κ − κ truth ) normalized by the power spectrum of the truth P truth . The ratio P /P truth is averaged over 8000 maps in the validation set with cosmological parameters drawn from the prior (Fig. 1).
minimization relates just to the posterior mean, so alternative metrics (e.g. constraints from peak statistics) remain to be explored. Using 18 non-overlapping mock DES SV data from the MICE (Fosalba et al. 2015) simulations, we apply a Wiener filter with an optimal power spectrum calculated using the known cosmological parameters (not available in real data applications). Nevertheless, without using the known cosmological parameters as input, DeepMass still recovers maps with an average of 2 per cent better MSE.
The smaller improvement of DeepMass over Wiener with the MICE simulated data can be explained both by the fact that we have used the known cosmological parameters in the Wiener filter (whereas DeepMass uses no specified input cosmological parameters) and by the intrinsically low power of the MICE simulations. The MICE cosmological parameters (inc. m = 0.25, σ 8 = 0.8) lead to a relatively low-amplitude power spectrum, so DeepMass is in the regime demonstrated by example C in Fig. 4. The low power effectively means that there are fewer non-Gaussian structures above a detectable signal-to-noise level. The largest improvement over Wiener filtering comes when there are more non-linear (non-Gaussian) structures.
With the same MICE simulations, and restricting ourselves to pixels where the truth is greater than two standard deviations from the mean κ > 2σ (where σ is the standard deviation of pixels in the true map), compared to Wiener filtering, DeepMass improves the MSE by 8 per cent. As is to be expected, therefore, DeepMass improves over Wiener filtering due to its ability to reconstruct the non-linear structures in the cosmological signal. Fig. 5 shows the power spectrum P of the residual mapŝ κ − κ truth , normalized by dividing by the true power P truth , averaged across all the maps in the validation sample. The residual power shows no particular scale at which DeepMass performs better or worse than the Wiener filter; DeepMass outperforms Wiener filtering at all length-scales.
At the smallest scales, the average P /P truth for both methods tend towards one. This is evidence that both reconstruction methods are suppressing structure on the smallest scales, and the residual power is tending towards the power of the true map. On these smallest scales the signal to noise is so low that the minimum variance reconstructions damp fluctuations. The Wiener filter uses the same fiducial power spectrum for each map included in the averaged result shown in Fig. 5. Although DeepMass has no input cosmology, it outperforms Wiener filtering on large scales, which we may expect to be approximately more Gaussian. We can interpret this as DeepMass inferring power spectrum or cosmological parameter information from the data, information which is being used in the reconstruction.
In this work, we drew the cosmological parameters of the training data from broad priors (see Fig. 1) to generalize the DeepMass method for the realistic situation where the true underlying parameters are unknown. However, in future work it would be interesting to compare Wiener filtering with DeepMass trained at a single fiducial cosmology, to find whether DeepMass would outperform Wiener filtering only at small (presumably more non-Gaussian) scales.

C O N C L U S I O N
With DeepMass, we have presented a deep learning method to reconstruct convergence κ maps from shear measurements. With DES SV, we have shown the mass map reconstruction with deep learning from observational data.
By training with simulations over a broad prior distribution of cosmological parameters, we have a generalized method that needs no input cosmological parameters. This method has shown substantial improvement over Wiener filtering both qualitatively (by eye) and quantitatively (11 per cent MSE reduction on the validation data). The flexible approach also takes into account non-Gaussian noise in the weak lensing data. As our simulated training data are samples drawn from the prior P(κ), the approach has a principled Bayesian interpretation, without the need for evaluation of closedform priors.
The quality of the reconstruction with these initial experiments, and its flexibility, makes the deep learning approach a pre-eminent candidate for mass mapping with future weak lensing surveys.