The Three Hundred Project: Mapping The Matter Distribution in Galaxy Clusters Via Deep Learning from Multiview Simulated Observations

A galaxy cluster as the most massive gravitationally-bound object in the Universe, is dominated by Dark Matter, which unfortunately can only be investigated through its interaction with the luminous baryons with some simplified assumptions that introduce an un-preferred bias. In this work, we, {\it for the first time}, propose a deep learning method based on the U-Net architecture, to directly infer the projected total mass density map from idealised observations of simulated galaxy clusters at multi-wavelengths. The model is trained with a large dataset of simulated images from clusters of {\sc The Three Hundred Project}. Although Machine Learning (ML) models do not depend on the assumptions of the dynamics of the intra-cluster medium, our whole method relies on the choice of the physics implemented in the hydrodynamic simulations, which is a limitation of the method. Through different metrics to assess the fidelity of the inferred density map, we show that the predicted total mass distribution is in very good agreement with the true simulated cluster. Therefore, it is not surprising to see the integrated halo mass is almost unbiased, around 1 per cent for the best result from multiview, and the scatter is also very small, basically within 3 per cent. This result suggests that this ML method provides an alternative and more accessible approach to reconstructing the overall matter distribution in galaxy clusters, which can complement the lensing method.


INTRODUCTION
Estimating the matter content in galaxy clusters (see Kravtsov & Borgani 2012, for a review) is crucial for cosmological studies due to the fact that they are the biggest gravitationally bound objects originating from small density fluctuations in the early universe.Thus cosmological parameters can be constrained by studying the abundance of galaxy clusters as a function of the mass and redshift (e.g.Allen et al. 2011;Planck Collaboration et al. 2016b;Pratt et al. 2019;Salvati et al. 2022).Galaxy clusters are mainly composed of Dark Matter (DM), which is about 80% of their total mass, diffused hot gas (about 12%), i.e., intra-cluster medium (ICM), and stars, mainly in galaxies (the remaining 8%).
Galaxies within clusters are normally observed at different optical bands.Both ground-based and space-based instruments have been used to measure these galaxy properties through their spectrum energy distributions.For example, the Sloan Digital Sky telescope ★ daniel.deandres@uam.es† Talento-CM fellow (SDSS 1 ) and the Hubble Space Telescope (HST2 ) have been crucial for the studies of galaxy clusters as well as cosmology.The recent photometric surveys, e.g., the Dark Energy Survey (DES3 ) using the Dark Energy Camera and the Javalambre Physics of the Accelerating Universe Astrophysical Survey J-PAS 4 from the Javalambre Survey Telescope, provide an unprecedented amount of data for understanding our Universe.Not to mention the recently launched space telescopes James Webb Space Telescope (JWST 5 ) and Euclid 6 .While optical surveys explore galaxy members, the ICM is targeted through X-ray emission and the inverse-Compton scattering of the cosmic microwave background (CMB) photons, i.e. the Sunyaev-Zel'dovich (SZ) effect (Sunyaev & Zeldovich 1972).The SZ effect can be observed at millimetre wavelengths and different instruments have managed to detect more than a thousand clusters through it, such as the Planck satellite (PSZ2; Planck Collaboration et al. 2016c), the Atacama Cosmology Telescope (ACT; Hilton et al. 2018) and the South Pole Telescope (SPT; Bleem et al. 2020).Nevertheless, Planck is the only all-sky survey for SZ clusters.On the contrary, clusters X-ray observations are more recently explored by eROSITA (Liu et al. 2022) and targeted XMM-Newton observations, such as the CHEX-MATE project (CHEX-MATE Collaboration et al. 2021).
Although the total mass of a cluster of galaxies is not observable, it can be inferred by observing its baryonic components.It can be estimated from the dynamics (e.g.Biviano et al. 2006) or the abundance/richness (e.g.Rozo et al. 2009) of the member galaxies, or from the ICM radial profiles in X-ray and SZ observations with the assumption of the diffused gas is distributed following the hydrostatic equilibrium (HE) hypothesis (e.g.Gianfagna et al. 2021Gianfagna et al. , 2023)).Alternatively, their physical quantities, which are integrated inside a fixed aperture in the sky, strictly related to the mass of the object can be selected under the self-similarity assumption (Bryan & Norman 1998) as suitable observational proxies.However, these estimated masses rely upon theoretical assumptions, such as HE, and are therefore possibly affected by bias.Nevertheless, only weak gravitational lensing (WL) is sensitive to all matter along the line of sight and measures the total projected matter (Becker & Kravtsov 2011;Herbonnet et al. 2022).However, gravitational lensing observations are not numerous, with tens of images, in comparison to the hundreds or few thousands of observations of galaxy clusters available for SZ, Xray and stars.In addition, mass inference from WL also suffers from systematics due to the theoretical assumptions when reconstructing mass (kappa) maps from shear (Kaiser & Squires 1993), similar to the inference of the mass from SZ or X-ray.Recent data-driven approaches based on simulations and convolutional neural networks have proven to outperform conventional methods (Hong et al. 2021b) also for the task of WL Mass Reconstruction.
Machine Learning (ML) has been applied in astronomy since quite a long ago (e.g Odewahn et al. 1992), but only in past years we have witnessed an unprecedented increase of Deep Learning (DL) methods (for a review, see e.g.Huertas-Company & Lanusse 2023;Smith & Geach 2023), which implies a change in the paradigm from applying specific algorithms to fully data-driven science.We stress here that applying deep learning in real observations of galaxy clusters is inherently more challenging than in simulations and accordingly most of the literature is limited to only utilising simulated data.For example, DL has also been used for identifying galaxy cluster members from HST images (Angora et al. 2020), increasing the resolution (super-resolution) and de-noising XMM-Newton images (Sweere et al. 2022), and deprojecting and deconvolving galaxy cluster X-ray temperature profiles within the CHEX-MATE collaboration (Iqbal et al. 2023).In our interests, recent studies have shown that galaxy cluster masses can be estimated using deep learning methods from catalogues of simulated galaxy clusters (Ntampaka et al. 2019;Ho et al. 2019;Gupta & Reichardt 2020;Kodi Ramanah et al. 2020;Yan et al. 2020;Gupta & Reichardt 2021;Ho et al. 2021;de Andres et al. 2022;Ferragamo et al. 2023;Ho et al. 2023), which outperform these classical methods.Fortunately, only recently these methods have also been applied to infer the mass of galaxy clusters of real surveys at different wavelengths, Kodi Ramanah et al. (2021) applied deep learning to infer galaxy clusters masses from the SDSS Legacy Survey (Strauss et al. 2002), Ho et al. (2022) estimated the dynamical mass of the Coma cluster, de Andres et al. ( 2022) inferred the masses of the full-sky Planck satellite PSZ2 catalogue, Krippendorf et al.
(2023) followed a machine learning approach to infer galaxy cluster masses from eROSITA X-ray images.
Notwithstanding, previous works on mass inference focus on the mass inside a sphere of a certain radius (i.e. 200 7 or  500 ).Recently, Ferragamo et al. (2023) pushed these ML applications even further by showing that ML can predict the mass radial profile using SZ idealised simulated images and thus, being able to theoretically obtain valuable information on internal properties of galaxy clusters as the concentration.As a general remark, all these studies suggest that the estimated masses are not affected by observational biases and assumptions due to the fact that they do not rely on hypotheses about the dynamics nor the hydrostatic equilibrium with spherical symmetry of the ICM, but rather on the quality of dataset provided by cosmological simulations.
In this work, for the first time, we explore Deep Learning models to infer the projected matter density fields from simulated baryonic observations.To this end, we utilise the The Three Hundred (The300; Cui et al. 2018): a set of "zoom-in" hydrodynamical and cosmological simulations to generate the input idealised observations: SZ, X-ray and star maps, and the output maps, mass maps.The dataset created for this purpose corresponds to a set of idealised observations free from the observational impacts typical of real instruments, such as noise and point sources.Therefore, this work represents the needed theoretical proof-of-concept study where the limitations of deep learning are tested.For further details regarding our dataset, we refer the reader to section 2.
The manuscript is organised as follows: in section 2 the creation of the dataset is discussed, The Three Hundred set of hydrodynamic simulations and the particular selection of halo-sized objects that are optimal for training ML models and the characteristics of our input maps, Compton- parameters maps (SZ), X-ray surface brightness (X-ray) and star density maps, and output mass density maps.In section 3 we discuss the Deep learning model and architectures considered for this project as well as how the models are trained.In section 4, the results of the model applied to the test set are shown.To this end, we design a set of test metrics to assess the fidelity of the predicted mass density maps: pixel-wise statistics, cylindrical radial profiles, power spectra and Maximum Mean Discrepancy.Finally, in section 5 the main conclusions of the work are drawn.

The Three Hundred Simulations
In order to create the dataset used for training our DL model, we use The300 8 simulations (Cui et al. 2018).The300 simulation project focuses on a set of 324 "zoom-in" hydrodynamic simulations centred on the most massive halos of the MultiDark simulation (MDPL2; Klypin et al. 2016) utilising the cosmological parameters inferred by Planck (Planck Collaboration et al. 2016a).The full MDPL2 simulation contains 3840 3 particles whose mass is 1.5 × 10 9 ℎ −1 M ⊙ .Moreover, each cluster region covers 15ℎ −1 Mpc and they have been simulated with different baryonic models: Gadget-MUSIC (Sembolini et al. 2013), Gadget-X (Murante et al. 2010;Rasia et al. 2015), Gizmo-SIMBA, Davé et al. (2019); Cui et al. (2022)).However, in this work, we only make use of Gadget-X (Cui et al. 2018) and Gizmo-SIMBA Cui et al. (2022)  In The300 simulations halos are identified by the Amiga's Halo Finder package (AHF; Knollmann & Knebe 2009) and only halos with  200 > 10 13.5 ℎ −1  ⊙ , at redshift  ≃ 0 and free from contamination are selected.Note that  200 stands for the mass inside a sphere of density 200 times the critical density of the Universe at the corresponding redshift.Bear in mind that for The300 simulations, the scaling relations do not depend on redshift at least up to  ≃ 1 (de Andres et al. 2023).Free from contamination means that heavy (or low resolution) dark-matter particles entering from outside the re-simulated volume are not present inside our selected halos.The selection is first done for Gadget-X, the counterpart halos from Gizmo-SIMBA are selected to have the same number of objects and mass distribution as in Gadget-X.
Furthermore, to increase the statistics of underrepresented clusters we added objects from other snapshots ( = 0.022, 0.045, 0.069, 0.093, 0.117) to account for a flat mass distribution as in Ferragamo et al. (2023), which is presented in Figure 1.In that figure, the number of galaxy clusters as a function of mass is shown, and the whole sample is equally split into 3 bins with two vertical lines.Moreover, the The300 simulations contain a rich variety of massive galaxy clusters with different dynamical states (De Luca et al. 2021).Note that at the end of the selection procedure, 2518 halos are selected from Gadget-X and 2523 from Gizmo-SIMBA, with the same flat distribution in mass.We refer the reader to the Appendix A for more information.

Simulated multiview images
From the selected halos, we further compute the Compton- parameter -a dimensionless measure of the SZ effect, X-ray surface brightness, star mass density and total mass density maps.Per each halo in our dataset, 29 different line-of-sight projections are considered to create 146,189 maps in total, a large enough dataset for training our ML model.To test the effectiveness of machine learning models and avoid observational biases, the generated images correspond to a set of idealised maps in which the angular resolution is sufficiently high for all of them regardless of their masses.As such, we have created a dataset in which the size of each map is 2 ×  200 , all sampled with the same number of pixels  pix = 640 and therefore all maps are equally resolved.
Although all maps are initially created with a resolution of 640 × 640 pixels, they are smoothed with a Gaussian kernel of    ≃ 0.015 200 and they are re-gridded to a lower resolution of 80 × 80 pixels, which is needed for computational efficiency.Nevertheless, these "low-resolution" final maps are still very well resolved, and the angular resolution for all maps is below 1 arcmin.The idea that we desire to convey is that this dataset corresponds to an ideal theoretical dataset from hydro-simulations which can be easily applied to the ML models for testing their performance.In detail, the SZ maps, Xray surface brightness maps, star mass density maps and total mass density maps are computed as follows: Compton-y parameter maps () correspond to the integrated pressure along the observer's line of sight (l.o.s.) .
where  T is the Thomson cross section,  B is the Boltzmann constant,  the speed of light,  e the electron rest-mass,  e the electron number density and  e the electron temperature.These maps are computed using the publicly available package PYMSZ9 (Cui et al. 2018).X-ray surface brightness maps (X-ray) are estimated by computing the X-ray emission by thermal bremsstrahlung in the hot intracluster medium using a wrapper of PyAtomDB10 package to compute X-ray luminosity11 .First, using the atomic database, we estimated the bolometric X-ray luminosity of gas particles in the simulation.Then we projected the sum of the luminosity values along the observer's l.o.s.The projected luminosity    , is then divided by the surface area Σ of the pixel , : which is Stellar density maps are generated by projecting the sum of the masses of the star particles p  * p along the observer's line of sight, that represent the pixel (, ).This value is then divided by the surface area of the pixel Σ: We note that opting for stellar density as the stars simulated maps requires some clarifications: 1) Firstly, stellar observations in surveys are all presented as multi-band observations, e.g., SDSS12 .As a first proof-of-concept approximation, the density field of the stars reflects their spatial distribution and thus, for this purpose, as shown in Yan et al. ( 2020)  500 could be inferred from star density maps.2) We can, for sure, compute luminosity maps from stars in the simulation by using Stellar Population Synthesis models, such as SPSM (Devriendt et al. 1999;Bruzual & Charlot 2003).This approach is closer to the real observation images.However, we argue that (a) the mass density map from observations can be derived by fitting the different colour bands to a Stellar Energy Distribution, using the same SPSM, albeit a little bit higher uncertainties and complexities due to several hypotheses on the metallicity, stellar mass function, etc... (b) there

Short name Units
Compton-y parameter SZ Bolometric X-ray surface brightness X-ray erg s −1 kpc −2 Star density maps star should be not much difference in practice with either luminosity or stellar mass maps.This is because the galaxies' luminosity-to-mass ratio is approximately constant for The300 simulation, see figure 8 of Cui et al. (2018).(c) to apply our ML models to real observation images, there are still multiple steps to take care of, such as instruments, and background noise.Therefore, we adopt the stellar mass maps for simplicity in this concept-proofing paper and leave the proper reproduction of mock observation images in the following work.
Mass density maps are generated by projecting the sum of the masses of all particles, i.e. gas, star, dark matter and black hole particles in the observer's line of sight.Similarly, this value is divided by the surface area of a pixel.
In Figure 2, examples of our simulated maps are displayed for different cluster masses.With Compton-y maps, X-ray surface brightness maps and star density maps, we aim to infer total mass density maps shown at the bottom by using a deep learning model that is able to capture the non-linear relations between input (SZ, X-ray, star) and output (total mass) maps.This model can end-to-end translate one input map into mass density.The choice of the particular model used and the training of that model are described in the next section.

The Deep Learning Model
The chosen model and its architecture used for the generator of mass density maps is based on convolutional neural networks following a U-Net architecture.This model was originally developed for image segmentation in the field of biological imaging (Ronneberger et al. 2015).In astronomy, some applications followed the original purpose of segmentation such as removing radio frequency interference (Akeret et al. 2017), but also the U-Net has been successfully considered for various and different tasks (Milletari et al. 2016;Aragon-Calvo 2019;Berger & Stein 2019;Hausen & Robertson 2020;Lauritsen et al. 2021;Hong et al. 2021a).The U-Net is characterised by a set of two paths: the down-sampling path (or encoder) and the up-sampling path (or decoder).For our application, the considered U-Net is described below and represented in Figure 3.
• The down-sampling path consists of a succession of convolutional blocks, each of these applies two K × K convolutions with  filters (  K × K ), being K the size of the receptive field or the kernel size.After the convolutions, a rectified linear unit operation is applied (ReLU; Nair & Hinton 2010) to ensure nonlinearity, and a batch normalisation operation (BatchNorm; Ioffe & Szegedy 2015) is applied after each of the convolutions to improve the stability and performance of the network.Later, a 2x2 max-pooling (e.g., Scherer et al. 2010) operation down-samples the tensor shape so that this is reduced by a factor of 1/2.The subsequent convolutional blocks are built with the same architecture, maintaining the kernel size of the convolutions but increasing the number of filters by a factor of 2, i.e., for the  layer, the filters are  × 2 −1 , where  = 1, ...,  is the layer and  is the maximum number of layers.After down-sampling  Model name input maps number of encoders star star 1 SZ SZ 1 X-ray X-ray 1 multi-1 star, SZ and X-ray 1 multi-3 star, SZ and X-ray 3 Table 2. Models considered in this work.Each is a variation of the U-Net architecture presented in Figure 3 to account for different inputs.
times, a final convolution   ×2  −1 K ×K is applied.Therefore, the downsampling path can be written as a series of convolutions as follows. (5) • The up-sampling path consists of a succession of similar convolutional blocks to infer the output mass density map from the latent (central layer) representation given by the down-sampling path.From this encoded reduced representation, which is a tensor whose shape has been decreased by a factor of 1/2  times and has  × 2  −1 channels, up-sampling operations are applied until the shape of the output mass density map is recovered.This up-sampling operation consists of repeating the nearest points to increase the shape of the data by a factor of 2.Then, skip connections are used for the same row layers, and thus, information is not bottlenecked in the latent representation.At the final decoder step, 1 × 1 convolutions are applied to recover the filter dimensions of the output mass density F = 1.For preventing overfitting, random "dropout" (Hinton et al. 2012) is considered only in the convolutional blocks of the decoder.

Multiview approaches
This model has the advantage that it can be easily generalised to extract information simultaneously from multiple input views.To do that two approaches are studied: (i) The three different views are combined in a single input tensor whose shape is increased as if it were an RGB image.This approach is labelled as "multi-1"; and (ii) Three different encoders are used, one encoder for each input view.Subsequently, the three latent vectors are concatenated in the internal latent space.One decoder, the generator, is devoted to creating mass density maps from the information of this concatenated latent space.This approach is labelled as "multi-3" (see Figure 4).
A summary of all the models used in this work can be found in Table 2.

Training and validation
According to the description given in the previous section, we consider the following hyperparameters for our model: the number of channels or filters (), the size of the kernel in the convolutions (K), the number of layers in both the encoder and decoder architectures (), and the fraction of neurons that are randomly omitted, i.e. the "dropout".The considered possible values of these hyperparameters are shown in Table 3 and the optimal set of hyperparameters is found by Tree of Parzen Estimators algorithm (TPE; Bergstra et al. 2011).This algorithm is implemented and freely available at Hyperopt 13 : Distributed Asynchronous Hyper-parameter Optimisation (Bergstra et al. 2013).The targeted loss function is the mean absolute error  1 between the predicted pixel values and the ground-truth pixels.Furthermore, the data is randomly split into 3 subsets: the training dataset composed of 80% of the sample, the validation dataset with 10% of data and the remaining 10% belongs to the test dataset.The training dataset is used for tuning the U-Net parameters, the Tree of Parzen Estimators algorithm is fed with validation data, and the test set is only used for displaying the final results.
We have taken 100 evaluations when performing the Hyperopt optimisation, which means that the U-Net needs to be fitted 100 times for each of the 5 different U-Nets.The models are trained using the Adam optimiser (Kingma & Ba 2014) with a learning rate of 10 −4 , which is reduced to 10 −5 after 50 epochs.The training stops if the validation loss does not improve after 5 epochs, a process

RESULTS
In this section, our aim is to assess the quality of predicted density maps by our model.To do that, we compare the ground-truth density maps with their corresponding predicted maps in the test set with simple visualisation as the first metric for assessing the quality of  The input image (star, SZ or X-ray) is down-sampled 4 times, once per layer through a set of convolution of kernel size K × K.At the very bottom, the down-sampled representation is up-sampled using a similar convolutional block to generate the output mass map.Skip connections are used to ensure that the information is not totally lost during the down-sampling operations.We remind that dropout is applied to all convolutional layers only in the decoder.the predictions.Firstly, we show one example of our predicted maps in Figure 5 for our different U-Nets accounting for different input views.The first column on the left shows the three input views corresponding to the same galaxy cluster and the ground-truth density map is located at the top left.The second column shows the predictions for several input bands: SZ, X-ray, star and multiview.The last column corresponds to the residuals, i.e., the difference between the prediction and the ground truth.

CNN
As a general result, we observe that the predicted mass density maps from SZ and X-ray inputs are smoother and do not contain most of the substructures that can be appreciated in the ground-truth map.This is clear in the last column where the residuals mostly contain all the missing substructures and they are underestimated as shown in the figure in blue colour.Conversely, predictions from stars and from the multiview models contain most of the substructures.Their residuals are generally closer to zero than the others.We only show here the multi-3 approach, given the similarity among multi-1 and multi-3 predictions in a human eye test.
This visualisation of density-map residuals in Figure 5 apparently is not sufficient to numerical quantify the similarity between predict and true maps.Therefore, we are compelled to utilise a set of additional metrics to measure the discrepancies between the two.The considered metrics are the pixel-wise statistics, cylindrical radial mass profiles, power spectrum and maximum mean discrepancy which are studied in the following subsections.

Pixel-wise statistics
One interesting metric is considering the pixel-value differences between ground truth and predicted maps.We calculate the relative difference    between predicted maps Î  and true maps    as Therefore, the tensor   , computes the pixel-wise similarity between true images and predicted images.The index  here runs over maps, i.e., all the maps in the test set.Subsequently, the tensor   , is flattening to a vector of dimensions  ×  ×  and its histogram as a probability density is represented in Figure 6.We drop the values of the tensor   , where the ground-truth signal   , equals 0 unless Î , −   , = 0.
The particular values of the median, 16 th and 84 th percentiles of the distributions are also displayed in the legend of Figure 6  In this part, we also study the performance of each input map to infer the predicted mass map at different apertures around the cluster.In the left panel, pixel values inside the whole map are considered, in the middle only the values   , inside  500 are displayed and in the right panel only values inside  500 /2.Firstly, for the three panels in Figure 6, the same trend is observed: the multiview-3 is the most accurate, followed very closely by the multiview-1, using star density maps as input corresponds to the third most accurate model while considering X-ray and SZ as inputs results in a higher relative error, which is consistent to the expectation from Figure 5. Furthermore, this can be appreciated in the legend of Figure 6 that all the models are slightly biased mostly towards negative values, besides 1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00 D ij, k inside the whole map 10 5 10 4 10 3 10 2 10 1 10 0 10 1 probability density star, 0.01 +0.38 0.21 SZ, 0.01 +0.44 0.29 X-ray, 0.03 +0.36 0.28 multi-1, 0.01 +0.30 0.17 multi-3, 0.00 +0.23 0.18 1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00 D ij, k inside R 500 star, 0.02 +0.19 0.15 SZ, 0.00 +0.24 0.20 X-ray, 0.03 +0.22 0.19 multi-1, 0.00 +0.16 0.11 multi-3, 0.00 +0.14 0.11 1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00 D ij, k inside R 500 /2 star, 0.00 +0.17 0.12 SZ, 0.01 +0.16 0.16 X-ray, 0.03 +0.16 0.15 multi-1, 0.00 +0.11 0.09 multi-3, 0.00 +0.10 0.08 Figure 6.Pixel-wise relative difference   , (see Eq.( 7)) between the predicted mass density maps Î , and the ground-truth mass density maps   , .Different lines represent the use of different input views to predict the mass density map: star, SZ, X-ray, multi-1 and multi-3.The legend also shows the median value of the distributions with the 16 th and 84 th percentiles as median . From left to right,   , is computed for pixels inside various circular apertures: the whole map, inside  500 and inside  500 /2. the multiview-3 with median ≃ 0.00 in all three panels.While the model trained with only X-ray images yields the worst with median = −0.03for all three panels.Moreover, the scatter is smaller for the two multiview approaches, i.e., +0.23, −0.18 with only slightly larger in multiview-1 than in multiview-3.In contrast, that scatter from the model trained with only SZ images is the largest, +0.44, −0.29, followed by the X-ray model, +0.36, −0.28, which is also closer to the values from the star model, +0.38, −0.21.As it is depicted in Figure 5, models whose input is either SZ or X-ray fail at predicting small-scale substructures and the signal vanishes in the outskirts of the clusters.Therefore, all those not predicted substructures contribute to negative values in the   , distribution.Subsequently, these substructures in the density mass maps are, allegedly, wrongly estimated.This will be quantified in the subsequent sections.
By comparing with the middle and right panels of Figure 6, we aim to test whether the asymmetry in the distributions of the relative difference tensor   , is mainly affected by pixels outside the central region of the maps and whether the model behaves worse in the cluster centre or not.As depicted in that figure, the distributions progressively become less spread and more symmetrical when considering only pixels inside a smaller aperture,  500 and  500 /2.This means that the central regions, which we care the most, actually have the best result.Nevertheless, the multiview models provide the most accurate result in all cases, albeit a slight improvement with multiview-3 than multiview-1.

Cylindrical radial profiles
One can assess the quality of the predictions by comparing the mass radial profiles of the predicted and the ground-truth mass density maps.The mass profiles   are defined by integrating the mass density, or pixel values, of the maps in different circular regions Ω  with radius : Note that all circular regions include pixels from the very centre of the maps, i.e., radius ∈ [0, ].Here,    are the pixel values and Σ is the pixel size -Σ = ( 200 /40) 2 -.Note that Eq.( 9) provides the total 3D mass of a galaxy cluster integrated over a cylindrical volume of 2 ×  200 × Ω  .We then have computed   for ten equally-spaced circular regions of radii  =  200 /10,  200 /9, ...,  200 .Nevertheless, the difference between the ground-truth profile   and the predicted profile M can be estimated as the mass bias   as it is computed also in de Andres et al. ( 2022) and Ferragamo et al. (2023): In Figure 7, we show the bias   as a function of the normalised radius / 200 .From left to right, we show the bias that corresponds to three different mass intervals, which divide the dataset in a manner that each interval contains roughly the same number of galaxy clusters, see Figure 1.Therefore, Interval 1 corresponds to 13.5 ≤ log  200 / ℎ −1 M ⊙ < 14.02, the interval 2 range is 14.02 ≤ log  200 / ℎ −1 M ⊙ < 14.63 and the interval 3 is 14.63 ≤ log  200 / ℎ −1 M ⊙ ≤ 15.38.Moreover, in the top panels, we show the 3 single-view approaches (star, SZ, X-ray) and the 2 multiview models are shown in the bottom panels.From Figure 7, we see that all models have negative biases, apart from the low mass interval in which the biases for SZ and Xray models generally are negative, but positive for star and multi-1 models.For the single view result in the top panels of Figure 7, negative biases for the results from SZ and X-ray inputs are possibly caused by the fact that they can not provide any information of substructures, especially relatively low mass ones.The biases for the stellar inputs decrease from positive at the low halo mass bin to negative at the high mass bin.We suspect that is because the contribution of substructures is more significant in halos with higher mass.Since we train the model regardless of the sample's halo mass, there is also a possibility that ML has to balance the predictions due to different contributions of the sample's halo mass.Nevertheless, it is surprising to see the improvements from multiview inputs shown at the bottom panels of Figure 7: the median bias is reduced by roughly a factor of ∼ 1/3, while the scatter significantly shirked by almost ∼ 1/2 in all three mass bins.For the reader's interest, the particular values for the bias at  200 for different models and mass intervals are shown in Table 4.
The data in Figure 7 and Table 4 show a general trend that the scatter of the bias decreases with mass regardless of the input images.As can be seen in the Table, for interval 1 considering the multiview-3 model, the results are  200 = 0.001 +0.078  0.066 and for interval 3 is 0.024 +0.019 −0.017 , that is, the width is reduced by a factor of 4. This  10).From left to right, we show the bias corresponding to different mass intervals as indicated in Figure 1: Interval 1 corresponds to 13.5 ≤ log  200 / ℎ −1 M ⊙ < 14.02, the interval 2 range is 14.02 ≤ log  200 / ℎ −1 M ⊙ < 14.63 and the interval 3 is 14.63 ≤ log  200 / ℎ −1 M ⊙ ≤ 15.38.The top panels show the single-input view models (star, SZ, X-ray) and the bottom panels the multiview-1 and -3 models.
The lines represent the median values per bin and the shaded regions cover the 16 th and 84 th percentiles.Furthermore, the particular enclosure mass bias at  200 is presented in Table 4.
. translates into the effect that the predictions for massive clusters are more precise than those predictions for less massive clusters with smaller scatter.This problem could be caused by different factors: • The loss function utilised during training, which is the mean absolute error  1 , could be inducing the weights of the neural network to be updated to fit only massive clusters.This can be understood by noticing that the gradients of the loss function could be higher for massive clusters.According to this hypothesis, the presence of highmass clusters in our dataset hinders the predictions at low masses.We claim here that this is false as demonstrated in the Appendix section A, in which the deep learning model is trained with a different configuration where there is only data corresponding to low masses in the training set.
• ICM and stars follow gravity at higher masses, but at lower masses astrophysical effects are relevant and ICM and star components differ between Gadget-X and Gizmo-SIMBA simulations, increasing the irreducible scatter in the observable-mass relation.Since the training uses both simulation halos together, this larger difference at lower halo mass introduced such a large scatter.Furthermore, even with a single simulation, the scatter in halo baryon fractions increases as halo mass decreases which is caused by their intrinsic evolution history (see Cui et al. 2021, for example).This is the most relevant source that induces a mass-dependent scatter, as it is discussed in further detail in the Appendix A.

Power spectrum
Quantifying the information contained in our maps at different spatial frequencies is of great interest because the empirical evidence suggests that substructures tend to be underestimated as observed in Table 4. Values for the bias  200 defined in of Eq.( 10) at  200 for our different models.These values are also displayed in Figure 7 and the intervals 1, 2 and 3 are defined to separate the dataset in roughly equal number of clusters, as shown in Figure 1.Consequently, Interval 1 corresponds to 13.5 ≤ log  200 / ℎ −1 M ⊙ < 14.02, the interval 2 range is 14.02 ≤ log  200 / ℎ −1 M ⊙ < 14.63 and the interval 3 is 14.63 ≤ log  200 / ℎ −1 M ⊙ ≤ 15.38.
Figure 5, especially for the X-ray and SZ input images.This can be clearly viewed by investigating the power spectrum of the density mass map at small scales.For 2D images (  ) of pixel size  × , the Fourier Transform F     can be defined as: where  stands here for the imaginary unit.Note that for our images of 80 × 80 pixels  =  = 80.Moreover, the power spectrum is computed as: where  = √︃  2  +  2  .Note that the possible values of  in pixels go from 1 to the one corresponding to the Nyquist frequency, i.e., √ 2 × /2 ≃ 56 pixels, or in physical units with a factor of / 200 .From here we defined the spatial length  as Therefore, we further calculate the power spectra of the groundtruth maps and of the predicted maps for comparison, using the Pylians library (Python libraries for the analysis of numerical simulations, Villaescusa-Navarro 2018), and highlight the difference in Figure 8.In this figure, the power spectra of the predicted maps and the ground-truth maps are shown in the top panels as a function of the spatial length which is normalised to  200 .In the lower panels, we show the relative difference of the spectrum defined as ( pred −  true )/ true .In this figure, several things can be noted.Firstly, the power spectrum of the predicted mass density maps from star maps is consistent within the errors to the ground truth, where the median power spectrum starts deviating significantly at scales 0.1 ×  200 .For ICM tracers, the mass density maps predicted from SZ and X-ray observables indicate that their power spectrum differs on average from the ground-truth one (lower by about 25 per cent at  ∼ 0.5 200 ), being only similar at high spatial wavelengths.In addition, in Figure 8 the advantage of considering multiview models is appreciated, as shown in the fourth and fifth columns in the figure, where the multiview models effectively combine the information available from the different inputs.This means that the median power spectrum is similar to the power spectrum of the predicted mass density maps from stars, but with the clear advantage that the scatter is much smaller.To this end, SZ and X-ray views do not, as expected, improve the predictions of small structures but rather allow the model, when combined with data from stars, to better calibrate the overall signal, reducing the scatter.

Maximum Mean Discrepancy
Given two distributions, maximum mean discrepancy (MMD) is a test that assesses whether the two images are the same (Gretton et al. 2008).Although MMD can be used for training generative adversarial neural networks (Karolina Dziugaite et al. 2015), it is used here for simply assessing the quality of the predicted mass maps.
The MMD can be defined by choosing a kernel function  and a pair of random variables of inputs  and outputs  , so that one can compute the MMD as Here   's are the ground-truth data points and   's are the predictions of our U-Net model.For our case, the kernel is Gaussian, and therefore the estimation of K (, ) can be written as where  corresponds to the band-with range.Typically, one chooses a range of values of  to evaluate the MMD.In our case, we have computed the MMD as the maximum value estimated by Eq. ( 14) taking into consideration the following values of  = [0.01,0.1, 0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0], defined in Eq. ( 15).Though the MMD might be very useful, it has little physical meaning and needs to be calibrated.To overcome this issue, we have computed the MMD between the ground-truth mass density map    and a perturbed map    that is created by adding a random Gaussian noise such that    =    + N (0, max(   ) • ) .( 16) Here  corresponds to the parameter that quantifies the standard deviation of the noise map.In Figure 9, we have used  = 0.05, 0.1, 0.2, 0.3 as shown in horizontal grey dashed lines, which represent the calibration of the MMD.This means that the standard deviation of the noise map is at most 30% of the peak value.Subsequently, we have calculated the MMD between the groundtruth maps  and the predicted maps Î.In Figure 9, we show the median values of the MMDs as a function of the mass of the cluster that corresponds to our different models.As a general result, the MMD metric is in agreement with the other previous investigations in the sense that the multi-3 model is the closest to the ground truth.We also find a mass dependence of the MMDs values with respect to the total mass of the galaxy clusters, which is coherent with the results in subsection 4.2 and it is explained in detail in the Appendix A. Moreover, the noise calibrations, which are represented in dashed grey lines, show that for small groups of clusters, the discrepancy is equivalent to having less than 30% of noise in the maps.However, in the massive end of our dataset around 10 15 ℎ −1 M ⊙ , MMDs are below 10% for all the models (SZ, X-ray, star, multi-1 and multi-3) and very close to 5% in case of the multiview models.This result suggests the use of multiview data to boost the statistical agreement of the pixel's distribution in the generated total mass density maps.

CONCLUSIONS
Deep Learning models have been used for generating images (see Rothschild et al. 2022, for example for the application in astronomy).Building on the success in the development of ML in this field, we propose, for the first time, to expand its application to predicting the total matter density distributions using multiview simulated observational data: star, Sunyaev-Zeldovich and bolometric X-ray.The deep learning architecture used on the project is based upon the U-Net, which was introduced in the context of biomedical imaging and it has been modified for accounting for multiview input as described in subsection 3.1.As the first step along this research line, we validated the applicability of this ML model using a dataset of simplified maps, described in section 2, from The Three Hundred set of hydrodynamic simulations.This dataset is free from instrumental and observational effects so that noise, point sources and telescopes' impact are not considered.In this work, we tested different convolutional U-Net architectures for both single-input and multi-input models and applied different matrices to quantify the fidelity of the predicted matter density maps from this DL model.From the results, we can conclude that, although the ML output is much more complex -2D vs 1D or a single data point ( 200 ) -compared to previous ML models, its outcomes, for example the 1D mass bias profile, are much better and very promising, especially for the multiview inputs.In detail, we can summarise our results in the following: • The scatter in the estimation of the mass maps from multiview is reduced by a factor of ≃ 1/2.As depicted in Figure 6, the pixel  14), between predicted mass density maps and the ground-truth mass density maps as a function of the cluster mass  200 .Different colours represent the predictions when the model is trained with only SZ data (blue), X-ray data (orange), stellar data (green), multi-1 (red) and multi-2 (purple).Horizontal grey dashed lines represent the calibration of the MMD values, Eq.( 16), and numbers written in black colour on the left of these lines correspond to the noise intensity  used for the calibration of the MMD values.The inset highlights the results at higher halo mass where the multiview results are compatible with  ∼ 0.05.
statistics show that the scatter can be reduced from, e.g. ± ∼ 16% when inferring the mass from SZ and ± ∼ 10% when using the combination of inputs maps using the multiview-3 model.Note that the scatter is defined as the deviation from the median value using the 16 th and 84 th percentiles as it is written in Eq.( 8).This fact is also acknowledged when examining the mass profiles in Figure 7 in which the scatter can be reduced from ± ∼ 4% to ± ∼ 2%, see Table 4 for further details.
• Each input view (see Figure 5) distinctly correlates with the output mass map and therefore, the capability of predicting different spatial frequencies is supreme for the multi-input models, as it is manifested in Figure 8.
• By computing the MMD values defined in Eq.( 14), we have examined the statistical similarity of the pixel distributions between the predicted maps and the true maps.The results suggest again the multiview models are foremost and they can be below 5% noise level as shown in Figure 9.
• Overall, the multiview models provide the best predictions in all the matrices: very low bias with a small scatter, especially in the massive regime.Multi-3 shows a little improvement compared with multi-1 because using 3 different encoders tends to better capture the important features of each input image.
It is interesting to also asses what is the improvement in the mass reconstruction when combining only gas inputs, i.e., SZ and X-ray.For that purpose we have trained a model with two encoders and the results are in accordance with single-input models, but with no observed improvement in the quality of the predicted mass.Quantitatively, this can be analysed by computing the mass bias in the mass profiles which has roughly the same values as considering SZ alone.The performance of the double-input gas model seems to be not comparable with the benefits of including information from both tracers, e.g., star and gas together.The star model trained with star density maps is more precise than the corresponding double-input one with SZ and X-ray.
The results of this work hint that observations of galaxy clusters at different spectral bands will translate into a better estimation of the underlying mass distribution.We have limited ourselves, as a first work on the topic, to 3 input tracers, but our deep learning approach can be generalised to account for many available inputs at different wavelengths, e.g., the several luminosity bands of the SDSS survey.The accessible computational resources are the only limitation, such as GPU memory.
Different Deep Learning models could also be considered as an additional job.Though not shown in this paper, we have considered modifying the loss function in the framework of generative adversarial networks (Goodfellow et al. 2014), and training the Wasserstein GAN (Arjovsky et al. 2017) with no improvement over using only the MAE loss function.Therefore, at the moment, the U-Net model presented in this work is, the most suitable for addressing the problem of the inference of the mass map from multiview simulated observations.Conversely, vision transformers (Dosovitskiy et al. 2020) might yield competitive results in comparison with convolutional networks.
These techniques that are based on image-to-image translation algorithms (Isola et al. 2016) can also be applied to painting baryons on to N-body simulations (Chadayammuri et al. 2023), which we will also consider them for generating baryon maps from DM-only simulations, increasing the number of maps already available in simulations.
The results of this project represent the required proof-of-concept step towards the estimation of mass density maps from real observational data.This constitutes a challenge in the sense that our data considered here for training the models could not completely match the characteristics of real data due to several regards.Firstly, our models can be used in real data training on mock observational data where the instrumental impacts are considered in the framework of Simulation-Based Inference.As achieved in our previous work, observational mock data was created by introducing directly the instrumental effects on our simulated clean maps (de Andres et al. 2022).Secondly, the physics implemented in our simulations could lead to biased results.To address this problem, deep learning models could be trained with data provided from simulations with various and plausible physical models (Villaescusa-Navarro et al. 2021).gas fraction for Gadget-X is more constant, while for Gizmo-SIMBA decays faster in small groups.Nevertheless, the star fractions follow the opposite trend, as shown in Figure A3.In summary, these two figures show that different astrophysical implementations converge in the massive end and therefore, one might expect that the ICM and star distributions follow gravity, and are more related to the total mass density of the galaxy clusters for massive galaxy clusters.
To show that the scatter does not depend on the machine learning model, but rather on the diversity of the data, we have run different experiments: (i) Low mass: We have retrained the multi-3 model with galaxy clusters belonging only to the first mass interval, i.e., 13.5 ≤ log  200 / ℎ −1 M ⊙ < 14.02.
(ii) High mass: We have also retrained the model with data belonging only to the most massive interval (14.63 ≤ log  200 / ℎ −1 M ⊙ ≤ 15.38).
(iii) Low mass Gizmo-SIMBA: We have retrained the model multi-3 with data belonging only to Gizmo-SIMBA only in the first mass interval.
(iv) Low mass Gadget-X: We have also retrained the multi-3 model with data belonging to Gadget-X only in the first mass interval.
The results of the experiments are shown in the Figure A4.From the first and second experiments, we conclude that the scatter does not depend on the ML model and training procedure.We acknowledge that although these could bias the prediction to be more accurate at higher masses, that is not the case as the experiments suggest.From the third and fourth experiments, the results on Gizmo-SIMBA and GADGET-X simulations are similar and therefore, we conclude that the performance should not vary significantly for different hydrodynamic simulations.10) for the mass profiles as a function of the radius / 200 for the experiments previously mentioned in this appendix.From left to right, we show the bias corresponding to the different experiments: Train with low-mass objects with data from both simulations, train only with high-mass objects with data from both simulations, train low-mass objects with data only from Gizmo-SIMBA and train with low-mass objects with data only from Gadget-X.The lines represent the medium values per bin and the shaded regions cover the 16 th and 84 th percentiles. .

Figure 2 .
Figure 2. From top to bottom, we show some examples of maps used for training the U-Net model: SZ, X-ray, star and mass.The examples' mass  200 is shown at the top of each column.The size of all maps is 2 ×  200 regardless of the mass.

Figure 3 .
Figure3.U-Net architecture for input dimensions = (80 pixels, 80 pixels, 1 channel) and F = 32.The input image (star, SZ or X-ray) is down-sampled 4 times, once per layer through a set of convolution of kernel size K × K.At the very bottom, the down-sampled representation is up-sampled using a similar convolutional block to generate the output mass map.Skip connections are used to ensure that the information is not totally lost during the down-sampling operations.We remind that dropout is applied to all convolutional layers only in the decoder.

Figure 4 .
Figure 4. Multiview 3 approach.3 different encoders are considered, with one down-sampling path per each input view.Then this information is concatenated in the internal layer.From this concatenated latent vector, one decoder is in charge of the inference of the output mass density map.

Figure 5 .
Figure5.The first column on the left corresponds to the input maps (SZ, X-ray and stars) and the ground-truth mass density map.In the second column from top to bottom, we show the mass map predictions of our U-Net when training with SZ, X-ray, star or multiview (multi-3).The residuals are defined as the difference between the prediction and the ground-truth maps.The size of all maps is 2 ×  200 .

Figure 7 .
Figure 7. Radial profiles of the mass bias, see Eq.(10).From left to right, we show the bias corresponding to different mass intervals as indicated in Figure1: Interval 1 corresponds to 13.5 ≤ log  200 / ℎ −1 M ⊙ < 14.02, the interval 2 range is 14.02 ≤ log  200 / ℎ −1 M ⊙ < 14.63 and the interval 3 is 14.63 ≤ log  200 / ℎ −1 M ⊙ ≤ 15.38.The top panels show the single-input view models (star, SZ, X-ray) and the bottom panels the multiview-1 and -3 models.The lines represent the median values per bin and the shaded regions cover the 16 th and 84 th percentiles.Furthermore, the particular enclosure mass bias at  200 is presented in Table4.

Figure 8 .Figure 9 .
Figure8.Top: Power spectrum corresponding to our ground-truth (red) and the predicted (blue) mass density maps as a function of the spatial length  = 2 / for our different inputs: star, SZ, X-ray, multi-1 and multi-3, are displayed in different columns.Bottom: We show the relative difference (  pred −  true )/ true of the predicted power spectrum  pred and the ground-truth power spectrum  true of our mass density maps.The dashed black line depicts the perfect prediction where the difference is zero.The solid lines correspond to the median values while the shaded regions represent the 16 th and 84 th percentiles.

Figure A3 .
Figure A3.Median star fraction as a function of the mass for the Gizmo-SIMBA and GADGET-X simulations.The shaded regions correspond to the 16 th and 84 th percentiles.

Figure A4 .
Figure A4.Mass bias see Eq.(10) for the mass profiles as a function of the radius / 200 for the experiments previously mentioned in this appendix.From left to right, we show the bias corresponding to the different experiments: Train with low-mass objects with data from both simulations, train only with high-mass objects with data from both simulations, train low-mass objects with data only from Gizmo-SIMBA and train with low-mass objects with data only from Gadget-X.The lines represent the medium values per bin and the shaded regions cover the 16 th and 84 th percentiles.

Table 1 .
Dataset of simulated maps from The Three Hundred simulations.

Table 3 .
The considered possible values of hyperparameters of our U-Net model.