Constraining primordial non-Gaussianity using Neural Networks

We present a novel approach to estimating the primordial non-Gaussianity parameter ($f_{\rm NL}$) directly from Cosmic Microwave Background (CMB) maps using a convolutional neural network (CNN). While traditional methods rely on complex statistical techniques, this study proposes a simpler approach that employs a neural network to estimate $f_{\rm NL}$. The network is trained on simulated CMB maps with known $f_{\rm NL}$ in the range $[-50,50]$, and its performance is evaluated using various metrics. The results indicate that the proposed approach can accurately estimate $f_{\rm NL}$ values from CMB maps with a significant reduction in complexity compared to traditional methods. With $500$ test maps, the relation between $f^{\rm output}_{\rm NL}$ and $f^{\rm input}_{\rm NL}$ can be fitted as $y=ax+b$, with $a=0.980^{+0.098}_{-0.102}$ and $b=0.277^{+0.098}_{-0.101}$, indicating that the estimate of primordial non-Gaussianity is nearly unbiased. These results suggest that the CNN technique can be widely applied to estimate other cosmological parameters directly from CMB images.


INTRODUCTION
The Cosmic Microwave Background (CMB) radiation is a snapshot of the photon temperature variation across the whole sky at the time of recombination (around 380,000 years after the Big Bang), and it encodes crucial information for understanding the early Universe (Scott & Smoot 2010; Dodelson & Schmidt 2020). The CMB temperature and polarization anisotropies provide measurable quantities that can be compared with theories of the early Universe and used to constrain cosmological parameters (Komatsu et al. 2009; Hinshaw et al. 2013; Planck Collaboration et al. 2020a). One of the key parameters of interest is the level of primordial non-Gaussianity in the primordial fluctuations, which is captured by the non-linear coupling parameter $f_{\rm NL}$ (Komatsu & Spergel 2001; Planck Collaboration et al. 2016, 2020b) through the deviation of the Bardeen curvature perturbation $\Phi$ ($\Phi_{\rm H}$ in Bardeen 1980) from Gaussianity, i.e.
$$\Phi = \Phi_{\rm L} + f_{\rm NL}\left(\Phi_{\rm L}^{2} - \langle \Phi_{\rm L}^{2} \rangle\right),$$
where $\Phi_{\rm L}$ are the Gaussian linear perturbations with zero mean.
There have been several statistical methods developed to estimate the non-Gaussianity parameter $f_{\rm NL}$ from CMB maps since the era of the Wilkinson Microwave Anisotropy Probe (WMAP). One of the most commonly used is bispectrum analysis, which measures the correlations between three different Fourier modes $(k_1, k_2, k_3)$ in the CMB temperature and polarization maps and compares them with theoretical models. WMAP used its five-year data to provide an estimate of the local shape of non-Gaussianity, $-9 < f^{\rm local}_{\rm NL} < 111$ at the 95% confidence level (C.L.) (Komatsu et al. 2009). The ESA's Planck satellite measured the full-sky CMB fluctuations with higher resolution; its nominal mission gives $f^{\rm local}_{\rm NL} = 2.7 \pm 5.8$ (Planck Collaboration et al. 2014) and its full mission gives $-0.9 \pm 5.1$ (1$\sigma$ C.L., Planck Collaboration et al. 2020a,b). In addition, the South Pole Telescope (a ground-based CMB experiment) has also measured $f_{\rm NL}$ using the bispectrum method, giving $f^{\rm local}_{\rm NL} = 420 \pm 350$ (Fergusson et al. 2012). Future surveys such as the Simons Observatory (Ade et al. 2019) and CMB Stage-4 (Abazajian et al. 2022) will constrain $f^{\rm local}_{\rm NL}$ to higher precision. In addition to bispectrum analysis of the CMB, there are several complementary methods to constrain primordial non-Gaussianity. For example, the Minkowski-functionals analysis characterises the topology of the CMB maps using topological descriptors known as Minkowski functionals and applies these topological estimators to the observed CMB maps (Hikage et al. 2006; Planck Collaboration et al. 2016, 2020b). The peak-counting method counts the number of peaks in the CMB maps and uses this information to constrain $f_{\rm NL}$ (Marian et al. 2011); and the wavelet bispectrum method decomposes the CMB maps into different wavelet scales and uses the bispectrum of each scale to estimate $f_{\rm NL}$ (Curto et al. 2011; Planck Collaboration et al. 2014). Apart from CMB maps, large-scale structure data have also been used to constrain primordial non-Gaussianity in different ways, including the Baryon Oscillation Spectroscopic Survey's (BOSS) DR16 data (Barreira 2022; Cabass et al. 2022a,b), the scale-dependent bias of the peculiar velocity field (Ma et al. 2013), and the large-scale correlated bias in 21-cm intensity mapping (Li & Ma 2017).
⋆ Corresponding author: Y.-Z. Ma, mayinzhe@sun.ac.za
1 The local non-Gaussianity corresponds to the squeezed $k$-space configuration $k_1 \ll k_2 \approx k_3$.
While the aforementioned methods have been successful in estimating $f_{\rm NL}$ from CMB maps, they can be computationally expensive and time-consuming, and may require careful modelling of instrumental effects and other sources of uncertainty in the data. In particular, we want to explore the possibility of using the neural-network method of machine learning to directly recognise the non-Gaussianity pattern in the CMB map. In recent years, machine-learning techniques, particularly neural networks, have emerged as a promising tool for dealing with large cosmological datasets (see, e.g. George & Huerta 2018; Hezaveh et al. 2017; Wang et al. 2022b; Wang et al. 2023). Neural networks are able to learn complex patterns and correlations in large datasets, and can provide accurate predictions with relatively low computational cost and time; for example, Casaponsa et al. (2011) used neural networks to constrain $f_{\rm NL}$ from WMAP data. In this work, we adopt a different approach: we explore the scenario of using neural networks to estimate the level of primordial non-Gaussianity from CMB images directly. We develop a customized neural-network architecture and training process that can handle CMB maps with different levels of primordial non-Gaussianity, and we test its robustness to instrumental effects and high-dimensional data representations.
This paper is organised as follows. In Sec. 2, we review the method we use to generate CMB maps with different levels of the local shape of primordial non-Gaussianity ($f_{\rm NL}$). In Sec. 3, we describe the neural network, loss function and training process. In Sec. 4, we present the results of training and model evaluation. The conclusion is presented in the last section.

Training and Evaluation procedure
This paper follows the structured flowchart shown in Fig. 1. We first simulate pure Gaussian CMB maps and add non-Gaussian CMB maps with different $f^{\rm local}_{\rm NL}$ values to the Gaussian ones to synthesize the simulated maps. We then convolve them with the Planck beam function and add Planck pixel noise; we call the results "Synthesized Maps". These synthesized maps form the training datasets used to train the Convolutional Neural Network (CNN) model. Once the network is fully trained, the CNN model is tested on synthetic data to assess its performance. Subsequently, the trained model is employed to estimate $f_{\rm NL}$ values on another set of simulated Planck maps with different $f^{\rm local}_{\rm NL}$ values. This comprehensive approach ensures the model's effectiveness in handling real CMB data and contributes to the understanding of CMB analysis and the estimation of $f_{\rm NL}$.

Simulating CMB Gaussian and non-Gaussian Maps
We use CAMB to calculate the temperature angular power spectrum of the CMB, adopting the Wilkinson Microwave Anisotropy Probe (WMAP) five-year data with the BAO and Type-Ia supernovae best-fitting parameter values: $\Omega_\Lambda = 0.721$, $\Omega_c h^2 = 0.1143$, $\Omega_b h^2 = 0.02256$, $h = 0.701$, $n_s = 0.96$, $\tau = 0.084$ and $\Delta^2_{\mathcal{R}}(k_* = 0.002\,{\rm Mpc}^{-1}) = 2.457 \times 10^{-9}$ (Komatsu et al. 2009; Dunkley et al. 2009). The reason we use this set of parameters is that we will utilise the non-Gaussian maps generated with the "Optimized Quadrature Scheme" of Elsner & Wandelt (2009), which adopts the same values. We then simulate 1000 pure CMB maps from the output theoretical angular power spectrum of the CMB ($C_\ell$) out to $\ell_{\rm max} = 1024$. For the non-Gaussian component of the map ($a^{\rm nG}_{\ell m}$), we utilise the Elsner & Wandelt "Optimized Quadrature Scheme" non-Gaussian maps (Elsner & Wandelt 2009), and download 1000 of them from their website 2.
We then combine them to compute the synthesized maps with different values of $f^{\rm local}_{\rm NL}$:
$$a_{\ell m} = a^{\rm G}_{\ell m} + f^{\rm local}_{\rm NL}\, a^{\rm nG}_{\ell m}.$$
In Fig. 2, we show the Gaussian and non-Gaussian maps with different values of $f^{\rm local}_{\rm NL}$. But so far, $a_{\ell m}$ in Eq. (2) is a theoretical simulation that does not include observational effects. We therefore convolve the $a_{\ell m}$ in Eq. (2) with a Gaussian beam function and add the Planck instrumental noise to mimic the observational map.
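As a minimal illustration (not the paper's actual pipeline code), the synthesis in Eq. (2) amounts to a linear combination of harmonic coefficients; the array names below are hypothetical stand-ins for the simulated $a_{\ell m}$ arrays:

```python
import numpy as np

def synthesize_alm(alm_gaussian, alm_nongaussian, fnl):
    """Combine Gaussian and non-Gaussian harmonic coefficients:
    a_lm = a_lm^G + f_NL * a_lm^nG  (Eq. 2)."""
    return alm_gaussian + fnl * alm_nongaussian

# Toy example with random complex coefficients in place of real a_lm arrays.
rng = np.random.default_rng(0)
a_g = rng.normal(size=100) + 1j * rng.normal(size=100)
a_ng = 1e-3 * (rng.normal(size=100) + 1j * rng.normal(size=100))

a_total = synthesize_alm(a_g, a_ng, fnl=30.0)
```

Because the operation is linear, the same Gaussian realisation can be reused with many $f^{\rm local}_{\rm NL}$ values at negligible cost.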

Noise
We now simulate the noise of the CMB map. To mimic the true Planck mission data, we download the Planck maps and use the "Half-Ring Half-Difference (HRHD)" method to estimate the noise power spectrum. Basically, we calculate
$$n = \frac{1}{2}\left(\delta_1 - \delta_2\right),$$
where $\delta_1$ and $\delta_2$ are the two half-ring splits of the data from the Planck detectors. The difference in Eq. (4) therefore cancels the signal but keeps the noise fluctuations. We perform this procedure for both the Planck SMICA and SEVEM maps, and the resultant HD maps are shown in the upper panels of Fig. 3. One can see that the noise across the whole sky is almost homogeneous except for the ecliptic pole regions, where more scans were accumulated.
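The HRHD estimate can be sketched in a few lines of numpy; the toy "signal" and noise arrays below are illustrative stand-ins for the actual half-ring Planck maps:

```python
import numpy as np

def half_ring_half_difference(map_half1, map_half2):
    """Noise estimate from two half-ring maps: the sky signal is common
    to both halves and cancels, leaving only (uncorrelated) noise."""
    return 0.5 * (map_half1 - map_half2)

# Toy demonstration: a common "signal" plus independent noise in each half.
rng = np.random.default_rng(1)
signal = rng.normal(size=10000)
noise1 = 0.1 * rng.normal(size=10000)
noise2 = 0.1 * rng.normal(size=10000)

hrhd = half_ring_half_difference(signal + noise1, signal + noise2)
# hrhd contains no trace of the signal, only (noise1 - noise2) / 2.
```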
We then calculate the noise power spectrum ($N_\ell$) from the HD maps; the results are shown in the lower panel of Fig. 3. One can see that the noise power spectrum of SMICA is slightly higher than that of the SEVEM map, which might be due to the difference in the component-separation methods. We use the resultant noise power spectrum $N_\ell$ to simulate the noise map ($n_{\ell m}$) and add it to the convolved CMB map to form the total (simulated) CMB map, which ensures that the training set is as close to the true Planck map as possible. In Fig. 4, we show the comparison between the theoretical CMB power spectrum $C^{TT}_\ell$ (blue solid line), the convolved spectrum $C^{TT}_\ell B^2_\ell$ (black solid line) and the power spectrum estimated from the simulated maps (yellow dots). One can see that the power spectrum from the simulated map is very close to the convolved CMB power spectrum (black line), because on scales of $\ell \leq 1000$ the noise level is not very high compared to the intrinsic CMB level.
We nonetheless include the noise simulation for the accuracy of the study.

Single Neuron
A typical single neuron used in machine learning is a fundamental building block of artificial neural networks. It represents the basic computational unit that processes inputs and produces an output, as shown in Fig. 5. The neuron receives inputs from various sources, each multiplied by a corresponding weight. These inputs are then summed together, including a bias term. The weighted sum is passed through an activation function to introduce non-linearity and determine the neuron's output.
The mathematical representation of a single neuron in a neural network is
$$z = \sum_{i=1}^{n} w_i x_i + b.$$
The neuron takes input signals $(x_1, x_2, \ldots, x_n)$ and computes a weighted sum of these inputs, denoted by the symbol $z$. The weights used in the summation are denoted by $(w_1, w_2, \ldots, w_n)$, and the bias term is denoted by $b$.
The input signals $(x_1, x_2, \ldots, x_n)$ can represent any number of features or attributes that describe the input data, such as pixel values in an image. The weights $(w_1, w_2, \ldots, w_n)$ determine the strength of the connection between each input signal and the neuron. Alternatively, one can think of the weights as representing the importance of each input feature for the task at hand.
The bias term, $b$, is an additional parameter that is added to the weighted sum. It can be regarded as representing the neuron's willingness to fire even when all input signals are zero. In this sense, the bias term provides a measure of the neuron's overall activity level.
Once the weighted sum $z$ is computed, it is typically passed through an activation function, such as the sigmoid or ReLU function, to introduce non-linearity into the model and allow the neuron to model complex mappings between the input and output.
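A minimal numpy sketch of such a neuron, with hypothetical input, weight, and bias values chosen purely for illustration:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, activation=relu):
    """Single neuron: weighted sum z = w . x + b, passed through an activation."""
    z = np.dot(w, x) + b
    return activation(z)

x = np.array([0.5, -1.0, 2.0])   # input features
w = np.array([0.2, 0.4, 0.1])    # weights
b = 0.3                          # bias

out = neuron(x, w, b)            # z = 0.1 - 0.4 + 0.2 + 0.3 = 0.2; ReLU keeps it
```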

Neural Network
A typical neural network is composed of multiple layers of interconnected neurons of the kind depicted in Fig. 5, where each neuron performs a specific computation. These neurons are organized into different layers: an input layer, hidden layers, and an output layer. A typical neural network is shown in Fig. 6.
The input layer of a neural network receives the raw input data (deep green dots on the left side of Fig. 6). Each neuron in the input layer represents a specific feature or attribute of the input data, so the number of input neurons is directly related to the dimensionality of the input data, i.e. the number of variables or attributes describing each data point. In the case of our data, with a shape of (256, 192, 3), each data point comprises multiple values due to the presence of three channels. Consequently, the dimensionality is the total number of points in the 2D array ($256 \times 192$) multiplied by the number of channels (3). This is a threefold increase in dimensionality compared to approaches that use single-channel convolutional neural networks (CNNs). The increase in the number of channels results in a higher-dimensional input, which means the model needs to process more data at once; this can lead to increased computational requirements and potentially longer training times. However, the benefit of having more information in the form of additional channels outweighs these costs in our case, as the model can learn more complex patterns and representations from the additional data.
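The dimensionality arithmetic above can be checked directly:

```python
# Input dimensionality for a (256, 192, 3) data cube.
height, width, channels = 256, 192, 3
pixels_per_channel = height * width                    # 49,152 values per channel
total_dimensionality = pixels_per_channel * channels   # 147,456 values per map
```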
Hidden layers are intermediate layers between the input and output layers (shallow green dots in the middle of Fig. 6). They are responsible for learning and representing complex patterns and relationships within the data. Each neuron in the hidden layers receives inputs from the previous layer, applies a weighted sum to these inputs, and passes the result through an activation function. The number of hidden layers and the number of neurons in each hidden layer can vary depending on the complexity of the problem and the desired model capacity.
The output layer represents the final prediction or output of the neural network (yellow dots on the right side of Fig. 6). It consists of neurons that produce the desired output based on the learned representations from the hidden layers. The number of neurons in the output layer depends on the specific task; in our work, the output layer contains a single neuron, because this is a regression task.

Convolutional Neural Network (CNN)
For our regression work, we use a CNN model to carry out the regression task. Our input data is a 2D array of CMB temperature maps. The model uses the Keras deep learning library's Sequential API, which is designed to make it easy to create neural networks consisting of multiple layers. This specific neural-network model includes convolutional layers, pooling layers, and fully connected layers, which are typical for image-related problems.
A CNN is a deep-learning architecture composed of multiple layers of interconnected neurons, including the typical single neurons described earlier. A CNN is a specialized type of neural network designed specifically for processing structured grid-like data, such as images or sequences, and has become highly effective in various computer-vision tasks.
CNNs utilize specialized layers, namely convolutional layers and pooling layers, which are not typically found in traditional neural networks. Convolutional layers apply convolution operations to the input data, using learnable filters (or kernels) to extract local features. These filters slide over the input, computing element-wise multiplications and summations. Pooling layers then downsample the feature maps obtained from the convolutional layers, reducing their spatial dimensions while preserving important features. These layers enable CNNs to capture spatial hierarchies and translation-invariant representations. The earlier layers of a CNN learn low-level features, such as edges or corners, while deeper layers learn higher-level features, such as textures or object shapes. This feature-learning capability allows CNNs to effectively extract and represent complex patterns in the CMB data, making them well suited for our regression task. Fig. 7 shows a general CNN structure.
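A minimal numpy sketch of these two operations: a "valid" convolution (implemented as the cross-correlation that CNN layers actually compute), followed by non-overlapping max pooling. The small image and kernel are illustrative, not taken from the actual model:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D convolution (cross-correlation, as in CNN layers):
    slide the kernel over the image and sum element-wise products."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: keep the maximum of each size x size patch."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size
    trimmed = feature_map[:h, :w]
    return trimmed.reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)
edge_kernel = np.array([[1.0, -1.0], [1.0, -1.0]])  # crude vertical-edge filter

features = conv2d(image, edge_kernel)   # shape (5, 5)
pooled = max_pool(features)             # shape (2, 2) after trimming
```

In a real CNN the kernel entries are learned during training rather than fixed by hand; deep-learning libraries also vectorize these loops.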
Our CNN model used for regression consists of 10 layers, including the input layer, convolutional layers, dense layers, and the output layer. The breakdown of the layers is: • Input Layer: The model expects input images with a shape of (256, 192, 3), i.e. a height of 256, a width of 192, and three channels. The input layer with the appropriate shape is represented by the red rectangular box in Fig. 7.
• Convolutional Layers: (i) The first convolutional layer has 32 filters of size (3, 3) and uses the ReLU activation function. It takes the input image shape. The convolutional layers are part of the hidden layers and are represented in Fig. 7 by yellow rectangular boxes.
(ii) A max-pooling layer with a pool size of (2, 2) follows each convolutional layer. It is also part of a hidden layer. This layer downsamples the feature maps obtained from the previous layer. A max-pooling layer is represented by the pink rectangular box immediately after the convolutional layer in Fig. 7. It is possible to have a convolutional neural network (CNN) without max-pooling layers: while max pooling is a common technique used in CNNs to reduce the spatial dimensions of feature maps and prevent overfitting, it is not an essential component of a CNN.
• Fully Connected Layers: (i) After the convolutional layers, a flatten layer is added to convert the 2D feature maps into a 1D vector.
(ii) Then a series of dense (also known as fully connected) layers follows (green shaded region in Fig. 7), with 256, 128, and 64 neurons, respectively. Each dense layer uses the ReLU activation function to introduce non-linearity and capture complex relationships in the data. (iii) The last dense layer has a single neuron with a linear activation function, which makes it suitable for regression tasks.
• Model Output: The final layer of the model (represented as a red dot in Fig. 7) consists of a single neuron with a linear activation function. This indicates that the model is designed for regression tasks, where it aims to predict a continuous numerical value.
The given CNN model has approximately 1,479,721 trainable parameters for input of shape (256, 192, 3). The number 3 in (256, 192, 3) refers to the channels, a feature specific to CNNs. During the convolution operation, the filters (also known as kernels) slide over the input data, computing dot products between the filter and the corresponding input values. The output of this operation is called a feature map, which represents a set of learned features or patterns in the input data. Each channel applies different filters. Usually the input data has a single channel, i.e. shape (256, 192, 1) (also written as (256, 192)), but we stack the array to make it suitable for a three-channel CNN, so that more filters are applied to extract features along the multiple channels.
When the model is compiled with Keras' compile() method, we specify the loss function, optimizer, and evaluation metrics to be used for training. The optimization algorithm sends error signals, usually in the form of gradients, back through the network to fine-tune the weights of the nodes, which is the essence of backpropagation. The compile step thus sets up the backpropagation algorithm to be used during the training process.
Let $L$ be the loss function, $p$ the output of a particular layer, $x$ the input to that layer, $w$ the weights of that layer, $b$ the bias of that layer, and $\delta$ the error signal at that layer. The backpropagation algorithm can be summarized as follows: (i) First, compute the error signal $\delta$ for the output layer, $\delta = \frac{\partial L}{\partial p}$. (ii) Then, compute the error signal $\delta^{\rm prev}$ for the previous layer using the current layer's error signal, $\delta^{\rm prev} = \left(w^{T} \delta\right) \odot g'(x)$, where $T$ is the transpose operator and $g'(x)$ is the derivative of the activation function used in the previous layer.
(iii) The gradients of the loss function with respect to the weights and biases of the current layer are $\frac{\partial L}{\partial w} = \delta\, x^{T}$ and $\frac{\partial L}{\partial b} = \delta$. (iv) Then, update the weights and biases of the current layer using an optimization algorithm, such as stochastic gradient descent: $w \rightarrow w - \eta\, \frac{\partial L}{\partial w}$ and $b \rightarrow b - \eta\, \frac{\partial L}{\partial b}$, where $\eta$ is the learning rate, a hyperparameter that determines the step size of the optimizer when updating the weights of the neural network during training. Therefore, $\eta$ is a critical parameter that needs to be tuned to achieve the best performance of the model; it can be tuned by trial and error, by grid search, or by adopting a decaying-learning-rate technique. In our case, good performance is achieved with a constant learning rate of 0.001; we arrived at this value by trial and error, and further tuning during the runs was not necessary once the model was initialized.
The above steps are repeated for all previous layers in the neural network, and the equations are applied iteratively over multiple epochs until the loss function converges and the neural network is trained. The backpropagation algorithm is used in our CNN model to adjust the weights of the neural network to minimize the specified loss function and improve the model's performance on the evaluation metrics.
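The four steps can be sketched for a toy two-layer network in numpy; this is an illustration of the update rules under a squared-error loss, not the actual Keras training loop used in this work:

```python
import numpy as np

rng = np.random.default_rng(2)

# Tiny two-layer network: 3 inputs -> 4 hidden (ReLU) -> 1 linear output.
w1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
w2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
eta = 0.01  # learning rate

x = rng.normal(size=3)
y_true = 1.5

def forward(x):
    z1 = w1 @ x + b1
    h = np.maximum(0.0, z1)       # ReLU activation
    y = w2 @ h + b2               # linear output neuron
    return z1, h, y

losses = []
for _ in range(200):
    z1, h, y = forward(x)
    losses.append(0.5 * float((y - y_true) ** 2))

    # (i) error signal at the output layer (linear activation, so g'(z) = 1)
    delta2 = y - y_true
    # (ii) propagate the error back through w2, applying the ReLU derivative
    delta1 = (w2.T @ delta2) * (z1 > 0)
    # (iii) gradients of the loss w.r.t. weights and biases
    grad_w2, grad_b2 = np.outer(delta2, h), delta2
    grad_w1, grad_b1 = np.outer(delta1, x), delta1
    # (iv) gradient-descent update
    w2 -= eta * grad_w2; b2 -= eta * grad_b2
    w1 -= eta * grad_w1; b1 -= eta * grad_b1
```

The recorded losses shrink toward zero, which is exactly the convergence behaviour described above.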

Loss function
In machine learning, a loss function is used to measure the difference between the predicted outputs of a model and the actual values. The goal of the model is to minimize this loss function during training, which improves the accuracy of predictions.
Figure 6. The image depicts a neural network with an input layer, multiple hidden layers, and an output layer. Each circle in a hidden layer represents a neuron, typically displayed in light green. These neurons are interconnected by lines, which signify the connections between them: the output of one neuron can be passed to the inputs of neighbouring neurons, forming the network structure. The yellow circle denotes the output layer, which may consist of a single neuron or multiple neurons depending on the specific task the network is designed to perform. Finally, the output layer produces the final result, which could be a classification or prediction based on the input data. The image demonstrates the complexity of neural networks and how they can be used for tasks such as image recognition, natural language processing, and more.

The choice of loss function depends on the type of problem being solved. For regression problems, where the output is a continuous value, the mean squared error (MSE), the mean absolute error (MAE), or the Huber loss is commonly used.
Below are two commonly used loss functions.
• Mean Absolute Error (MAE):
$${\rm MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,$$
where $y_i$ is the true value and $\hat{y}_i$ the model prediction for the $i$th sample.
• Huber Loss:
$$L_\delta(y_i, \hat{y}_i) = \begin{cases} \frac{1}{2}\left(y_i - \hat{y}_i\right)^2 & {\rm if}\ |y_i - \hat{y}_i| \leq \delta, \\ \delta\left(|y_i - \hat{y}_i| - \frac{1}{2}\delta\right) & {\rm otherwise}. \end{cases}$$
The Huber loss effectively combines two components for handling errors differently, a quadratic penalty for small errors and a linear penalty for large errors, with the transition between them determined by the threshold $\delta$ on $|y_i - \hat{y}_i|$.
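Both losses are straightforward to implement; the sketch below uses illustrative true/predicted arrays and a threshold $\delta = 1$:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average of |y_i - y_pred_i|."""
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic for |error| <= delta, linear beyond it."""
    err = np.abs(y_true - y_pred)
    quadratic = 0.5 * err ** 2
    linear = delta * (err - 0.5 * delta)
    return np.mean(np.where(err <= delta, quadratic, linear))

y_true = np.array([0.0, 1.0, 2.0, 10.0])
y_pred = np.array([0.5, 1.0, 2.5, 4.0])

# The last sample is an outlier (error 6): MAE penalizes it linearly,
# while the Huber loss caps the quadratic growth beyond delta.
```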
These loss functions are used during training to optimize the parameters of the neural network in order to minimize the difference between the predicted and actual outputs.

z-scaling
After obtaining the simulated $a_{\ell m}$ maps through Eq. (5), we convert the $a_{\ell m}$s into a full-sky map with $n_{\rm side} = 64$, resulting in a float array of size 49,152. The histogram of the array values is very close to a Gaussian distribution, with values of order $\mathcal{O}(10^{-5})$. Because such small values might be insensitive to CNN recognition, we employ z-scaling, which preserves the Gaussian shape of the array while amplifying the value of each data point. This ensures that the CNN can effectively recognize and learn from the changes in the values, even though they are initially very small. As a result, the z-scaling technique allows for better sensitivity and interpretation of the data by the CNN.
We apply the z-scaling as
$$z_i = \frac{x_i - \mu}{\sigma},$$
where $z_i$ is the standardized value (z-score) of the $i$th data point, $x_i$ is the original value of the data point, and $\mu$ and $\sigma$ are the mean and standard deviation of the data distribution in the given array. For one of the sample maps, the mean value of the array before scaling was $4.822 \times 10^{-8}$ and the standard deviation was $3.91 \times 10^{-5}$. After applying z-scaling, the mean of the array shown in Fig. 8 has been shifted very close to zero ($\mu = -2.02 \times 10^{-17}$) and the standard deviation has been scaled to $\sigma = 1$. After z-scaling, the values lie approximately between $-4$ and $4$ for the illustrated map in Fig. 8, making it convenient for the CNN to apply filters to extract features. This transformation preserves the shape of the distribution: the relative positions and spread of the data points remain unchanged, but they are now measured in units of standard deviation instead of their original units. The z-scaling is applied to each map in the training dataset separately. By applying the scaling independently to each map, we prevent any data leakage between different maps, which could potentially compromise the performance of the model on unseen data, and we ensure that the CNN learns the patterns and variations specific to each map. This helps the model generalize better to new and unseen datasets, as it has not been exposed to any information from other maps.
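A minimal sketch of the per-map standardization, using a random array of order $10^{-5}$ as a stand-in for one flattened $n_{\rm side} = 64$ map:

```python
import numpy as np

def z_scale(x):
    """Standardize an array: z_i = (x_i - mu) / sigma.
    Preserves the shape of the distribution; only shifts and rescales it."""
    return (x - np.mean(x)) / np.std(x)

# Toy stand-in for one flattened map: small, roughly Gaussian values of
# order 1e-5, as described in the text. n_side = 64 -> 49,152 pixels.
rng = np.random.default_rng(3)
raw_map = 3.9e-5 * rng.normal(size=49152)

scaled = z_scale(raw_map)   # mean ~ 0, standard deviation ~ 1
```

Applied map-by-map, this keeps each training sample self-normalized, matching the per-map scaling described above.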
The histogram in Fig. 8 illustrates the distribution of values in the array. This scaling process ensures that the CNN can effectively capture the subtle variations and patterns within the data, enabling better interpretation and learning.

2D Conversion
The power of a CNN lies in its ability to capture shapes, features, and depth in images. To leverage this capability, we utilize the HEALPIX module to project the one-dimensional array into a two-dimensional shape.
The array after z-scaling is still a 1D array, which is then converted into a 2D map. For a map of $n_{\rm side} = 64$, the array has 49,152 elements. Following Wang et al. (2022a), we transform the effective map into a two-dimensional array of size (256, 192) using the code CMBNNCS3, where the values from the z-scaled array are preserved without any clipping or truncation of data points. This approach ensures that no information is lost, in contrast to conventional methods that convert the map into a Mollweide projection or a PNG image with a white background; such methods can introduce systematic errors and distortions to the data, compromising its accuracy.
By directly converting the map into a two-dimensional array, as shown in Fig. 9, while preserving the z-scaled values, we maintain the integrity of the data, allowing the CNN to effectively learn and extract meaningful features for improved analysis and prediction. This process gives us a map of size (256, 192).

Array stacking
The 2D conversion of the map results in an array of size (256, 192), which is suitable for a CNN to learn features and shapes within the data. However, it does not capture depth information. In Fig. 2, we can observe that different values of $f_{\rm NL}$ uniformly change the values of the array while the underlying features remain consistent, whereas the features of the map are altered when different realizations of the simulated maps are combined linearly. A single channel would have sufficed for estimating $f_{\rm NL}$ within the same realization, because the features do not change for the same realization of the CMB map, but this does not suit our application, since we have different realisations of the CMB.
Through array stacking, we transform each (256, 192) map from a single-channel format, suitable only for a single-channel CNN, into a (256, 192, 3) format, enabling compatibility with a three-channel CNN. This multi-channel transformation empowers the CNN to extract intricate features from the CMB map, enhancing its ability to capture nuanced details and varying intensities across the data depth. By leveraging information from all three channels, the CNN's predictive capabilities are strengthened, resulting in improved prediction accuracy.
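Assuming the single-channel map is simply replicated across the three channels (the text does not spell out how each channel is filled, so this is an illustrative reading), the stacking is a single numpy call:

```python
import numpy as np

rng = np.random.default_rng(4)
single_channel = rng.normal(size=(256, 192))   # one 2D-converted map

# Stack the map along a new last axis to obtain the (256, 192, 3)
# shape expected by a three-channel CNN input.
three_channel = np.stack([single_channel] * 3, axis=-1)
```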
During training, a dataset comprising 9,500 maps is utilized, shaping the input data into a data cube of shape (9500, 256, 192, 3) to adhere to the conventions of numpy arrays and meet the requirements of CNNs. The final '3' signifies compatibility with CNNs operating on three-channel inputs.

Training
The dataset of 10,000 maps was generated by combining the Gaussian and non-Gaussian maps provided by Elsner & Wandelt (2009), which represent different realisations. We multiply the $a^{\rm nG}_{\ell m}$ maps by $f^{\rm local}_{\rm NL}$ values ranging between $-50$ and $50$ and add them to the Gaussian CMB maps (Eq. (2)).
During the training process, the model's parameters are adjusted to minimize the difference between the predicted and actual $f^{\rm local}_{\rm NL}$ values of the maps in the training dataset. This is done using backpropagation, a type of gradient-descent algorithm. The validation dataset is used to monitor the model's performance during training and prevent overfitting, which occurs when

Figure 8. Histogram of the array values of one sample map after z-scaling (Eq. (14)). The x-axis represents the bin range, while the y-axis shows the frequency of values in that bin. The plot reveals the spread and shape of the data distribution for each map array. After z-scaling, the distribution is roughly symmetric with a peak around zero, indicating that the mean has been shifted to zero. This type of preprocessing is often applied to input data in machine-learning models to improve training performance and accuracy.
Figure 9. One sample map after 2D conversion from the Mollweide projection to a rectangular array of shape (256, 192). In HEALPIX, the celestial sphere is divided into a hierarchical, non-overlapping arrangement of equal-area pixels. These pixels are organized in multiples of 12, ensuring consistency across the dataset (Górski et al. 2005). To convert a Mollweide projection to a 2D rectangular array, such as the shape (256, 192) shown in this figure, the dimensions are calculated as $3 \times n_{\rm side}$ in one direction and $4 \times n_{\rm side}$ in the other (i.e. $3 n_{\rm side} \times 4 n_{\rm side}$ in total). This 2D conversion is suitable for machine-learning algorithms as long as consistency in the map division is maintained throughout the dataset.
the model fits too closely to the training data and performs poorly on new, unseen data. Thus, we split the initial 10,000 maps into training, validation and test datasets.
By splitting the data into training, validation, and test sets, we can ensure that the model is trained on a diverse set of inputs and can make accurate predictions for unseen data. This is an important step in developing a reliable and accurate model for predicting $f^{\rm local}_{\rm NL}$ values from CMB maps. The dataset of 10,000 maps is split into 9,500 for training and 500 for testing. The maps set aside for testing are not "seen" by the algorithm during the training and validation phases, and can therefore be safely used to evaluate the model's true performance. During training, 10% of the training maps (950 realizations) were assigned to the validation dataset, and the remaining 8,550 maps were used to train the algorithm.
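The split described above can be sketched as follows; the random seed and index bookkeeping are illustrative, not the actual code used:

```python
import numpy as np

n_maps = 10_000
rng = np.random.default_rng(5)
indices = rng.permutation(n_maps)

# Hold out 500 maps for testing; these are never seen during training.
test_idx = indices[:500]
train_val_idx = indices[500:]           # 9,500 maps

# 10% of the training maps are reserved for validation.
n_val = int(0.10 * len(train_val_idx))  # 950 maps
val_idx = train_val_idx[:n_val]
train_idx = train_val_idx[n_val:]       # 8,550 maps
```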
During training, two loss functions were considered: MSE and MAE. After assessing the model's performance, mean absolute error was selected as the preferred loss function, yielding slightly better results. Extensive hyperparameter tuning was performed to optimize the model, including batch size (256) and learning rate (0.001). Early stopping and model checkpoints were employed to prevent overfitting and save the best-performing weights. The training process was initially set to run for 1,000 epochs, but concluded at 909 epochs when no significant improvement in validation loss was observed. The weights from the epoch with the best performance were saved for subsequent evaluation on the test dataset.
A batch size denotes the number of samples processed simultaneously during a single training iteration in a machine-learning model. Distinct from an epoch, which encompasses a complete pass through the entire dataset, the training dataset is partitioned into smaller batches, with each batch fed into the neural network in turn. For instance, if we have 1,000 training samples and a batch size of 100, then 100 images are presented to the network in each iteration until all 1,000 images are processed, completing one epoch. Batch size is a critical hyperparameter influencing both performance and efficiency. Smaller batch sizes may lead to slower convergence and less stable training, while larger sizes can accelerate training but may demand more memory. The choice of batch size is often determined by the available computing resources.
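With the settings used here, the number of gradient updates per epoch follows directly: 8,550 training maps with a batch size of 256 give $\lceil 8550/256 \rceil = 34$ iterations per epoch, the last batch being smaller than the rest. A quick check:

```python
import math

n_train = 8550
batch_size = 256

# Iterations (gradient updates) per epoch, rounding up for the ragged last batch
iters_per_epoch = math.ceil(n_train / batch_size)
print(iters_per_epoch)  # 34

# Size of the final, partially filled batch
last_batch = n_train - (iters_per_epoch - 1) * batch_size
print(last_batch)       # 102
```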
The learning rate is a hyperparameter that determines the step size of the parameter updates during the optimization process, i.e. how much the weights of the model are adjusted in each iteration of the optimization algorithm. It influences both the speed of convergence and the risk of overshooting: a high learning rate can accelerate convergence but may cause instability, while a low rate may slow convergence but improve stability. Selecting the optimal learning rate depends on factors such as the model complexity, the dataset size, the shape of the loss function, and the optimization algorithm.
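The effect of the step size can be seen on a toy quadratic loss $L(w) = w^2$, whose gradient-descent update is $w \leftarrow w - \eta\,{\rm d}L/{\rm d}w = w(1 - 2\eta)$: for $\eta < 1/2$ the iterates shrink toward the minimum, while for $\eta > 1$ they overshoot and diverge. A sketch (a toy problem, not our actual optimizer, which uses $\eta = 0.001$):

```python
def descend(lr, w=1.0, steps=20):
    """Plain gradient descent on L(w) = w**2 (gradient is 2*w)."""
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w

print(abs(descend(lr=0.1)))   # converges toward the minimum at w = 0
print(abs(descend(lr=1.1)))   # overshoots: |w| grows at every step
```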
This meticulous training approach ensures that the model is optimized and capable of accurately estimating $f_{\rm NL}$ on previously unseen data. Here, unseen data refers to data that was not used in training the CNN, such as the test dataset.

RESULTS
In this section, we present the results obtained from our trained model.

Training and Validation loss curve
One important indicator of a well-trained model is the behavior of the training and validation loss curves throughout the learning process. These curves provide valuable insights into the model's performance, in particular revealing any underfitting or overfitting behaviors that would require further adjustments.
In our study, we utilized the MAE as the loss function. Initially, both the training and validation loss started at values above 25. As the model trained, the loss gradually decreased, following a smooth curve as illustrated in Fig. 10. Over the course of 909 epochs, the validation and training losses exhibited a steady decrease until the early-stopping mechanism was triggered. The maximum number of epochs was set to 1,000, and the best validation loss of 3.2 was achieved at epoch 895. Importantly, the loss curves demonstrate a desirable pattern, showing that our model suffers from neither overfitting nor underfitting. In machine learning, an epoch signifies a complete iteration through the entire dataset during the training phase; put simply, it involves presenting all the training examples to the model once. For instance, if a training dataset consists of 1,000 maps and a convolutional neural network (CNN) is utilized, one epoch entails feeding all 1,000 images into the model for training. Following each epoch, the model's parameters are adjusted based on its performance on the dataset, and this iterative process continues until the model converges on a set of parameters that yields satisfactory results on the validation set. Repeating this process one thousand times thus constitutes 1,000 epochs.
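The early-stopping behavior (best validation loss at epoch 895, training halted at epoch 909) is consistent with a patience-based rule: stop once the validation loss has not improved for a fixed number of epochs, and keep the weights from the best epoch. A minimal, framework-agnostic sketch (the patience value and loss curve below are illustrative, not those of our pipeline):

```python
def early_stop(val_losses, patience=14):
    """Return (stop_epoch, best_epoch) under a simple patience rule.

    Epochs are 1-indexed; training stops after `patience` consecutive
    epochs with no improvement over the best validation loss so far.
    """
    best_loss, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch

# Toy loss curve: improves steadily, then plateaus after epoch 5
stop, best = early_stop([10, 8, 6, 5, 4.5, 4.6, 4.7, 4.8, 4.9], patience=3)
print(stop, best)  # 8 5
```

In combination with a model checkpoint, only the weights from `best_epoch` are retained for evaluation.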
The observed behavior of the loss curves provides evidence of the effectiveness of our training process. The gradual decrease in loss signifies that the model is learning and adapting to the data, while the absence of significant fluctuations suggests stability in the training procedure. The achieved validation loss of 3.2 demonstrates the model's ability to generalize well to unseen data. These results highlight the successful training of our model, indicating its potential for accurate predictions and reliable performance on unseen data.

Estimation on synthetic data
In this analysis, the performance of the trained model is evaluated by estimating the input $f_{\rm NL}$ values on both the training dataset and the unseen test dataset. The model checkpoint saves the weights of the epoch with the lowest validation loss for further analysis.
To assess the model's performance, we plot the output $f^{\rm local}_{\rm NL}$ values against the input values in Fig. 11, where the outputs-versus-inputs graph exhibits a strong correlation, demonstrating a reasonable level of accuracy.

Tests of unbiasedness
We further test the unbiasedness on the test data, for the $n = 500$ output $f^{\rm local}_{\rm NL}$ values against the input values. We utilise the minimal-$\chi^2$ fitting method, which is a straightforward way to find the best-fitting line through a series of scattered points. Considering a straight line with slope $a$ and intercept $b$, we model the linear regression as $y = ax + b$, and formulate the $\chi^2$ as $\chi^2(a, b) = \sum_{i=1}^{n} \left[ y_i - (a x_i + b) \right]^2$, where $x_i$ and $y_i$ are the input and output $f^{\rm local}_{\rm NL}$ values, respectively. We then evaluate this on a regular 2-D grid of the parameter space $(a, b)$, with flat prior ranges $a \in (-10, 10)$, $b \in (-10, 10)$ and 10,000 samples on each side. We show the results of the constraint in Fig. 12.
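The grid-based fit can be sketched as follows (with synthetic points standing in for the actual $f^{\rm local}_{\rm NL}$ pairs, and a coarser grid than the 10,000-per-side one used here, purely to keep the example fast):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-50, 50, size=500)        # stand-in input f_NL values
y = x + rng.normal(0.0, 3.0, size=500)    # stand-in outputs: unbiased plus noise

# Regular 2-D grid over the flat priors a, b in (-10, 10)
a_grid = np.linspace(-10, 10, 1001)
b_grid = np.linspace(-10, 10, 1001)

# chi^2(a, b) = sum_i [y_i - (a x_i + b)]^2, expanded into sufficient
# statistics so the whole grid is evaluated without a Python loop
n = x.size
Sx, Sy = x.sum(), y.sum()
Sxx, Sxy, Syy = (x * x).sum(), (x * y).sum(), (y * y).sum()
A = a_grid[:, None]
B = b_grid[None, :]
chi2 = Syy - 2 * A * Sxy - 2 * B * Sy + A**2 * Sxx + 2 * A * B * Sx + n * B**2

# The best-fitting (a, b) is the grid point minimising chi^2
i, j = np.unravel_index(chi2.argmin(), chi2.shape)
print(a_grid[i], b_grid[j])  # close to the true (1, 0)
```

Confidence intervals on $(a, b)$ then follow from the $\Delta\chi^2$ contours on the same grid.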

CONCLUSION
This paper developed a deep-learning model that can accurately estimate the non-Gaussianity parameter ($f_{\rm NL}$) directly from cosmic microwave background (CMB) maps. To achieve this, we trained a CNN on simulated CMB maps with known $f_{\rm NL}$ values. During training, we set the batch size to 256 and the learning rate to 0.001, values selected based on their impact on the model's performance. We also employed early stopping to prevent overfitting to the training data.
The plot of the loss function versus epoch for both the training and validation data is a smooth curve, which tells us that the model is converging and not overfitting to the training data. In our case, the increase in validation loss is minimal and does not affect the overall performance of the model, suggesting that the model has generalized well to new data. The loss for both the training and validation data decreases gradually during the initial epochs, indicating that the model is learning quickly and efficiently, and the fact that the losses converge to similar values suggests that the model has learned to distinguish between maps with different values of $f_{\rm NL}$ on both sets. The smooth loss curves provide further evidence that our CNN model is well designed and well trained. To evaluate the performance of our CNN model, we used two metrics: the prediction accuracy on the maps used during training and on a separate test set that was not seen during training. We found that the model achieves high accuracy on both the training and test datasets, indicating that it has learned to distinguish between maps with different values of $f_{\rm NL}$.
Overall, our results suggest that a well-designed and well-trained CNN model can be effective in identifying $f_{\rm NL}$ directly from synthetic CMB map images. However, we acknowledge that further testing with more diverse and realistic data, with varying cosmological parameters, is necessary to fully assess the model's performance and generalizability. Nonetheless, our results are promising and demonstrate the potential of deep-learning techniques for cosmological parameter estimation directly from CMB maps.
decomposed the WMAP seven-year temperature maps into HEALPIX wavelets and a spherical Mexican-hat wavelet, computed the third-order moments of the wavelet coefficients and constrained $f^{\rm local}_{\rm NL}$ to be $|f^{\rm local}_{\rm NL}| \lesssim 50$. Novaes et al. (2015) used a combined estimator of Minkowski Functionals (MF) and neural networks to constrain $f^{\rm local}_{\rm NL} = 33 \pm 23$ (1σ C.L.) using the Planck SMICA map. These techniques require pre-processing the CMB maps into some mathematical basis (e.g. wavelets or MF) to feed to the neural networks as input.

Figure 1. Flowchart depicting the process of generating maps and feeding them into a neural network model, as well as training the model and estimating on test data. The flowchart shows the different stages of the process, including preprocessing, data augmentation, training, and testing, and the different inputs and outputs at each stage. The synthesized maps are the combination of simulated $a^{\rm G}_{\ell m}$ and their corresponding $a^{\rm nG}_{\ell m}$ with varying $f_{\rm NL}$ values, as described by Eq. (2).

Figure 2. Comparison of CMB maps with different levels of primordial non-Gaussianity (calculated via Eq. (5)). The different panels show the CMB maps with increasing levels of non-Gaussianity. It becomes obvious that above the level of $f^{\rm local}_{\rm NL} \gtrsim \mathcal{O}(1000)$ the non-Gaussianity becomes visible in the CMB maps.

Figure 3. Top-left and top-right panels show the noise maps of the Planck SMICA and SEVEM CMB maps, obtained by using the "Half-Ring Half-Difference" (HRHD) method. The bottom panel compares the noise power spectra of the SMICA and SEVEM maps.

Figure 4. Comparison between the theoretical CMB power spectrum $C^{TT}_\ell$, the convolved spectrum $C^{TT}_\ell B^2_\ell$ and the power spectrum estimated from the simulated maps.

Figure 5. A single neuron with an input and its corresponding output. The input is multiplied by a weight, which is then added to a bias term. The resulting value is passed through an activation function, which produces the output of the neuron.

Figure 7. The diagram of a convolutional neural network (CNN) structure. It includes input data, convolutional layers, activation functions, pooling layers, fully connected layers, and an output layer. The input data is passed through the convolutional layers, which extract features from it. Activation functions introduce non-linearity to the network, and pooling layers downsample the output of the convolutional layers. The fully connected layers take the output of the pooling layers and produce a final output. The output layer is responsible for producing the final prediction based on the input data.

Figure 8. The histograms of the array values before and after z-scaling, shown in blue and light pink, respectively. The histogram in light pink shows the distribution of array values after applying z-scaling with the mean and standard deviation of each map (Eq. (14)). The x-axis represents the bin range, while the y-axis shows the frequency of values in each bin. The sample plot reveals the spread and shape of the data distribution for each map array. After z-scaling, the distribution is roughly symmetric with a peak around zero, indicating that the mean has been shifted to zero. This type of preprocessing is often applied to input data in machine-learning models to improve training performance and accuracy.
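The per-map z-scaling referenced in the caption (Eq. (14)) is the standard transform $z = (x - \mu)/\sigma$, applied with each map's own mean $\mu$ and standard deviation $\sigma$. A minimal sketch (with a random stand-in array in place of a real map):

```python
import numpy as np

rng = np.random.default_rng(0)
cmb_map = 100.0 + 20.0 * rng.normal(size=(256, 192))  # stand-in map array

# z-scale with this map's own mean and standard deviation: z = (x - mu) / sigma
z = (cmb_map - cmb_map.mean()) / cmb_map.std()

print(z.mean(), z.std())  # approximately 0 and 1 after scaling
```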

Figure 10. Training and validation loss curves for the 9,500 maps used in the training phase.

Figure 11. Scatter plot comparing the $f_{\rm NL}$ values estimated by the CNN model on the training and validation datasets, with the solid red line indicating the ideal case where the estimated values match the true values. The blue dots are the training data (n = 8,550) and the yellow dots (n = 950) are the validation data; the deviation from the ideal line indicates the level of error in the model's predictions.
In Fig. 11, the blue dots represent the estimation of $f_{\rm NL}$ on the training data consisting of 8,550 $f^{\rm local}_{\rm NL}$ values, while the yellow dots represent the estimation on the validation data containing the 950 $f^{\rm local}_{\rm NL}$ values split off during training. The y-axis corresponds to the output $f_{\rm NL}$ values predicted by the model, while the red line represents the ground truth $y = x$.

Figure 12. The relationship between the input $f_{\rm NL}$ and the predicted $f_{\rm NL}$ values from the trained model. The blue dots (n = 500) represent the predicted versus input $f_{\rm NL}$ values for the test data, while the red line represents the least-squares fit of the predicted against input $f_{\rm NL}$. The intercept of the line is 0.274 and the slope is 0.975, indicating a strong positive correlation between the two variables.

One can see in Fig. 12 that the best-fitting slope parameter $a$ and intercept $b$ are $a = 0.980^{+0.098}_{-0.102}$, $b = 0.277^{+0.098}_{-0.101}$ (16), which recovers complete unbiasedness, $(a, b) = (1, 0)$, within the 2σ C.L. One can also see the 2-D joint constraint in the corner plot in Fig. 13, which shows that the unbiased linear regression $(a, b) = (1, 0)$ lies within the 95% C.L. contour. The $R^2$ score for the test data is 0.96, indicating a high level of accuracy in the estimation provided by the CNN model.

Figure 13 .
Figure 13. Constraints on the slope parameter $a$ and intercept $b$ using the test data. The 2-D contours are the 68% and 95% confidence levels of the parameters $(a, b)$, and the diagonal panels show the marginalised distributions of the parameters.