Data Compression and Inference in Cosmology with Self-Supervised Machine Learning

The influx of massive amounts of data from current and upcoming cosmological surveys necessitates compression schemes that can efficiently summarize the data with minimal loss of information. We introduce a method that leverages the paradigm of self-supervised machine learning in a novel manner to construct representative summaries of massive datasets using simulation-based augmentations. Deploying the method on hydrodynamical cosmological simulations, we show that it can deliver highly informative summaries, which can be used for a variety of downstream tasks, including precise and accurate parameter inference. We demonstrate how this paradigm can be used to construct summary representations that are insensitive to prescribed systematic effects, such as the influence of baryonic physics. Our results indicate that self-supervised machine learning techniques offer a promising new approach for the compression of cosmological data as well as its analysis.


INTRODUCTION
Over the last few decades, cosmology has undergone a phenomenal transformation from a 'data-starved' field of research to a precision science. Current and upcoming cosmological surveys such as those conducted by the Dark Energy Spectroscopic Instrument (DESI) (Aghamousa et al. 2016), Euclid (Laureĳs et al. 2011), the Vera C. Rubin Observatory (LSST Dark Energy Science Collaboration 2012), and the Square Kilometer Array (SKA) (Weltman et al. 2020), among others, will provide massive amounts of data, and making full use of these complex datasets to probe cosmology is a challenging task. The necessity to create and manipulate simulations corresponding to these observations further exacerbates this challenge.
Analyzing the raw datasets directly is a computationally expensive procedure, so they are typically first described in terms of a set of informative lower-dimensional data vectors or summary statistics, which are then used for parameter inference and other downstream tasks. These summary statistics are often motivated by inductive biases drawn from the physics of the problem at hand. Some widely used classes of summary statistics include power spectra and higher-order correlation functions (Chen et al. 2021b; Gualdi et al. 2021; Philcox & Ivanov 2022; Chen et al. 2021a), wavelet scattering transforms (Cheng et al. 2020; Valogiannis & Dvorkin 2022a; Valogiannis & Dvorkin 2022b), overdensity probability distribution functions (Uhlemann et al. 2023), void statistics (Pisani et al. 2019; Hawken et al. 2020), and many others. While these statistics have been successful in placing tight constraints on cosmological models, they are in general lossy compressions of the data. An alternative is score compression, which yields optimal compressed statistics for linearized log-likelihoods that are not limited to be Gaussian. Zablocki & Dodelson (2016) and Alsing & Wandelt (2019) extended this approach further to design a compression scheme that obtains 'nuisance-hardened' summaries of the original dataset from the score statistics. This further reduces the size of the summaries from one number per model parameter to one number per parameter of interest. These summaries, which can also be defined as the score function of the nuisance-marginalized log-likelihood, preserve the Fisher information corresponding to the parameters of interest and are optimal for Gaussian likelihoods.
There have also been considerable efforts to develop and apply novel machine learning methods in order to find optimal summary statistics. Charnock et al. (2018) introduced a method called Information Maximising Neural Networks (IMNNs), which trains a neural network to learn informative summaries from the data by using the (regularized) Fisher information as an objective function. This approach also requires a choice of fiducial parameter values, although later works (Makinen et al. 2021; Makinen et al. 2022) have found that IMNNs can be robust to shifts from the fiducial parameter points. Another direction explored compression schemes that optimize an estimate of the mutual information between the summaries and the parameters of interest (Jeffrey et al. 2021). These methods can produce globally-sufficient summaries valid for a range of parameters within the support of the available simulations, and not only for a set of fiducial values.
In this work, we extend and explore self-supervised machine learning techniques as a complementary approach to derive compressed summary statistics. Self-supervised learning leverages the structure and internal symmetries of the data to learn informative summaries without an explicit need for labels. In astrophysics, self-supervised learning has been applied to a variety of downstream tasks, including galaxy morphology classification and photometric redshift estimation (e.g. Hayat et al. (2021); Cheng et al. (2020); Slĳepcevic et al. (2023)), with extensions to domain adaptation (Ćiprĳanović et al. 2023; Vega-Ferrero et al. 2023) and neural posterior estimation (Shen et al. 2021). Another set of recent works has focused on using compressed summary statistics more broadly for astrophysical data exploration, anomaly detection, and self-supervised similarity search (Stein et al. 2021; Sarmiento et al. 2021; Vega-Ferrero et al. 2023; Desmons et al. 2023). We refer the reader to Huertas-Company et al. (2023) for a more extensive review of the application of contrastive self-supervised learning methods in astrophysics. We extend the self-supervision paradigm in a novel manner, using physically-motivated simulation-based augmentations to inform the construction of summary representations. We investigate the potential of our method for compressing cosmological datasets, such as density maps, into informative low-dimensional summaries, and their downstream use for parameter inference.
This paper is organized as follows. In Sec. 2, we review the framework of self-supervised learning, contrasting this paradigm with traditional supervised learning. We then describe the particular self-supervised method used in this study, VICReg (Bardes et al. 2021). In Sec. 3, we showcase the performance of our method for data compression and downstream parameter inference through case studies using mock lognormal density fields, as well as more complicated simulations: total matter density maps from the CAMELS suite (Villaescusa-Navarro et al. 2021c). We compare the method's performance to an equivalent supervised baseline and, where applicable, to the theoretical expectation based on the Fisher information. In Sec. 4, we explore our method's potential to construct summaries that are insensitive to nuisance parameters and systematic effects, e.g. the influence of baryonic physics, while preserving information pertaining to relevant aspects of the model, e.g. cosmological parameters. Section 5 presents an application of our compression scheme for sequential simulation-based inference via generative emulation. Finally, we conclude in Sec. 6 with a summary discussion and an outlook for future work and improvements.

Self-supervised learning
Self-supervised learning has recently emerged as a powerful framework for learning meaningful representations across a wide variety of data modalities without the need for explicit labels (Shwartz-Ziv & LeCun 2023). Self-supervised learning methods have also been shown to achieve performance comparable to fully-supervised baselines on a variety of downstream tasks such as, in the context of computer vision, image classification and object recognition (e.g. He et al. (2020); Chen et al. (2020b)).
The pipeline of self-supervised learning usually involves two steps. In the first step, often referred to as pre-training, an encoder network is trained to learn representations, or summaries, of the data which are invariant under various transformations or augmentations. This training step does not require the input dataset to come with labels, which is particularly advantageous in cases when obtaining or scaling up labelled datasets is expensive. Another important aspect of self-supervised learning is that, since the training of the encoder network does not rely on labels, it can leverage the structure of the data itself to learn useful and informative summaries or representations of the data.
After pre-training, one can use the summaries obtained from the encoder network directly for downstream tasks, such as image classification, object detection, and, as we will show in the following sections, parameter inference. The network used for the downstream task tends to have a simpler architecture than the encoder network, such as a multi-layer perceptron with a few dense layers. This offers another potential advantage of self-supervised methods: once the encoder model has been trained, training the network on the summaries for downstream tasks is usually faster and more efficient than training a supervised model directly on the input data. Furthermore, self-supervised learning methods have been empirically and theoretically shown to generalize better to out-of-distribution data, which could be partially attributed to the simpler structure of the neural network specialized for the downstream task (Bansal et al. 2020).
A key difficulty of implementing self-supervised methods is a phenomenon called collapse, in which the encoder network learns a trivial solution and produces the same summaries for different input vectors (Shwartz-Ziv & LeCun 2023; Balestriero et al. 2023). A variety of approaches have been introduced in order to deal with this problem and enable learning meaningful representations. In this work, we focus on a particular approach called VICReg (Variance-Invariance-Covariance Regularization) (Bardes et al. 2021), described in the next subsection. VICReg is designed to explicitly avoid the collapse problem through an easily interpretable triple objective function, which maximizes the similarity of the summaries corresponding to the same image, while minimizing the redundancy between different features of the summary vectors and maintaining variance between summaries within a training batch.
A complementary approach to the collapse problem involves contrastive learning methods, which separate the training samples into 'positive' and 'negative' pairs (e.g. Chopra et al. (2005)). These pairs then contribute adversarially to the overall objective function in the pre-training step: the loss encourages the encoder to create similar summaries for the 'positive' pairs, while pushing the summaries for the 'negative' pairs apart in representation space. For completeness, we also test the self-supervised learning approach with a canonical contrastive learning method called SimCLR (Chen et al. 2020b) and find comparable performance to the VICReg baseline. In Appendix C, we provide a brief summary of SimCLR and a more detailed overview of our implementation and results. We emphasize that our method is not dependent on a specific choice of self-supervision method, since it relies more generally on the paradigm of self-supervision.

Variance-Invariance-Covariance Regularization (VICReg)
The VICReg framework was introduced and described in Bardes et al. (2021).In this section, we briefly review its key aspects as well as extensions introduced to make it applicable to specific use cases in cosmology.
Similarly to other self-supervised methods, VICReg can be divided into a pre-training step and a downstream task. During the pre-training step, the encoder network is first provided with two different views $x$ and $x'$ of an input $i$. In the image domain, so-called views are random transformations of the image $i$ obtained by, for instance, cropping it at different locations, applying color jitters, or blurring the image. In the context of cosmological data, different views might represent different realizations of an observable that corresponds to the same fundamental cosmological parameters, but, for instance, different initial conditions or evolution histories.
The encoder uses the views $x$ and $x'$ to produce corresponding low-dimensional summaries $y$ and $y'$. The summaries are then used as an input to a small expander network that maps them onto vectors $z$ and $z'$, called embeddings.
Empirically, it has been found that computing self-supervised losses on the embeddings $z$, $z'$ results in more informative summaries than computing the loss directly on the summaries $y$, $y'$ (e.g. Chen et al. (2020a,b); Zbontar et al. (2021); Bardes et al. (2021)). Although the expander network is discarded after the pre-training step, using it generally results in a substantial improvement of the performance of the summaries $y$, $y'$ on the downstream tasks. This behaviour is most likely due to the fact that, by applying a non-linear transformation to the summaries $y$, $y'$, the expander network can act as a 'filter' (Chen et al. 2020b): features that are not particularly useful for the self-supervised loss are removed from the embeddings $z$, $z'$ rather than from the summaries themselves, so they remain available to be leveraged later on in the downstream tasks (Appalaraju et al. 2020; Gupta et al. 2022). In this manner, the expander network allows for more flexible and informative summaries of the input images. Therefore, the VICReg loss is computed on the level of the embeddings $z$, $z'$, but the summaries $y$, $y'$ are used for the downstream tasks in the subsequent steps of the method. We show a schematic overview of the method in Fig. 1.
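As an illustration of this encoder/expander split, consider the following minimal PyTorch sketch. The layer sizes here are hypothetical placeholders (the actual architectures are described later and in Appendix A); the point is that the summaries $y$ feed the downstream tasks while the embeddings $z$ exist only for the self-supervised loss.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration: the encoder produces the summaries y,
# the expander maps them to embeddings z used only by the self-supervised loss.
encoder = nn.Sequential(
    nn.Flatten(),
    nn.Linear(100 * 100, 256), nn.ReLU(),
    nn.Linear(256, 16),                      # y: 16-dimensional summary
)
expander = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(),
    nn.Linear(64, 64),                       # z: 64-dimensional embedding
)

x = torch.randn(4, 1, 100, 100)   # a batch of input maps (one view)
y = encoder(x)                    # summaries: kept for downstream tasks
z = expander(y)                   # embeddings: fed to the VICReg loss only
```

After pre-training, the expander is discarded and only `encoder` is retained to produce summaries.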
The invariance component $s$ measures the similarity between the outputs of the encoder and is computed as the mean-squared Euclidean distance between all pairs of embeddings in $Z$, $Z'$:
$$ s(Z, Z') = \frac{1}{n} \sum_{i=1}^{n} \left\| z_i - z'_i \right\|_2^2 . $$
The variance loss $v$ is intended to prevent norm collapse, which occurs when the encoder maps every input to the same (trivial) output. It measures the overall variance in a given batch across the different dimensions of the embedding space and encourages the variance along each dimension $j$ to be close to some constant $\gamma$. Let $z^j$ be the vector that consists of the values of the embeddings $Z$ at the $j$-th dimension. Then the variance loss can be computed as:
$$ v(Z) = \frac{1}{d} \sum_{j=1}^{d} \max\left( 0, \gamma - S(z^j, \epsilon) \right), $$
where $S(x, \epsilon) = \sqrt{\mathrm{Var}(x) + \epsilon}$ is defined as the standard deviation regularized by a small scalar $\epsilon$. Following Bardes et al. (2021), in our implementation $\gamma$ and $\epsilon$ are fixed to $1$ and $10^{-4}$, respectively.
The covariance loss $c(Z)$ is used to address informational collapse, whereby different dimensions of the summaries encode the same information and are therefore redundant. It drives the covariance matrix $C(Z)$ to be close to a diagonal matrix by minimizing the sum of the squares of the off-diagonal entries of the covariance matrix:
$$ c(Z) = \frac{1}{d} \sum_{i \neq j} \left[ C(Z) \right]_{i,j}^2 . $$
The covariance matrix $C(Z)$ is defined as:
$$ C(Z) = \frac{1}{n-1} \sum_{i=1}^{n} \left( z_i - \bar{z} \right) \left( z_i - \bar{z} \right)^{T}, \qquad \bar{z} = \frac{1}{n} \sum_{i=1}^{n} z_i . $$
The final loss function is a weighted sum of the invariance $s$, variance $v$, and covariance $c$ terms:
$$ \ell(Z, Z') = \lambda \, s(Z, Z') + \mu \left[ v(Z) + v(Z') \right] + \nu \left[ c(Z) + c(Z') \right], $$
where $\lambda$, $\mu$, $\nu$ are hyperparameters controlling the weights assigned to each term in the loss function.
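The three terms above can be sketched in PyTorch as follows. This is an illustrative re-implementation rather than the exact training code; the default weights follow the values suggested in Bardes et al. (2021), and the invariance term differs from the equation by a constant normalization factor.

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z, z_prime, lam=25.0, mu=25.0, nu=1.0, gamma=1.0, eps=1e-4):
    """Illustrative VICReg objective for paired batches of embeddings (n x d)."""
    n, d = z.shape

    # Invariance: mean-squared distance between paired embeddings
    # (averaged over batch and dimensions, i.e. s(Z, Z') up to a 1/d factor).
    sim_loss = F.mse_loss(z, z_prime)

    # Variance: hinge loss keeping the regularized std of each dimension near gamma.
    std_z = torch.sqrt(z.var(dim=0) + eps)
    std_zp = torch.sqrt(z_prime.var(dim=0) + eps)
    var_loss = torch.mean(F.relu(gamma - std_z)) + torch.mean(F.relu(gamma - std_zp))

    # Covariance: penalize the squared off-diagonal entries of C(Z).
    def cov_off_diag(x):
        xc = x - x.mean(dim=0)
        cov = (xc.T @ xc) / (n - 1)
        off = cov - torch.diag(torch.diag(cov))
        return (off ** 2).sum() / d

    cov_loss = cov_off_diag(z) + cov_off_diag(z_prime)
    return lam * sim_loss + mu * var_loss + nu * cov_loss
```

Note how the variance term penalizes a collapsed batch: if every input maps to the same embedding, the invariance and covariance terms vanish but the variance hinge stays large.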

Downstream Task: Parameter Inference
After pre-training, the summaries can be used for downstream tasks by training a simple neural network, such as a multi-layer perceptron with a few layers, on the task. We use the summaries to infer the cosmological parameters of interest and refer to the neural network used in this step as the inference network. Assuming a Gaussian likelihood, we use the inference network to predict the parameters' means $\mu_i$ and covariances $\Sigma_i$ by minimizing the negative log-likelihood function:
$$ -\ln \mathcal{L} = \frac{1}{2} \sum_{i} \left[ \left( \theta_i - \mu_i \right)^{T} \Sigma_i^{-1} \left( \theta_i - \mu_i \right) + \ln \det \Sigma_i \right], $$
where $\theta_i$ are the true values of the parameters, $\mu_i$ and $\Sigma_i$ are the predicted means and covariance matrix of the parameters, and the sum runs over all input images in the training set. We emphasize that, even though we showcase the specific downstream task of parameter inference, representative summaries can be constructed for a wide variety of downstream tasks common in cosmology and astrophysics. For instance, sensitivity analyses of simulation-based inference typically focus on a particular summary statistic in order to examine the robustness of the inference to different components of cosmological forward models. Summary statistics

constructed in this study could be considered as an alternative to the traditional summary statistics (such as the power spectrum or bispectrum) used for such sensitivity analyses (Modi et al. 2023). Beyond parameter inference, summaries can be constructed for e.g. source identification and deblending (Liu et al. 2021; Hansen et al. 2022), allowing for massive compression of survey data while retaining the desired information.
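The Gaussian negative log-likelihood used to train the inference network can be sketched as follows; this is an illustrative PyTorch implementation, and `gaussian_nll` is a hypothetical helper name (the predicted covariances are assumed positive-definite).

```python
import torch

def gaussian_nll(theta_true, mu_pred, cov_pred):
    """Negative log-likelihood for predicted parameter means and covariances.

    theta_true: (n, p) true parameters; mu_pred: (n, p) predicted means;
    cov_pred: (n, p, p) predicted covariance matrices (positive-definite).
    """
    diff = (theta_true - mu_pred).unsqueeze(-1)                    # (n, p, 1)
    # Mahalanobis term (theta - mu)^T Sigma^{-1} (theta - mu) per sample.
    mahal = diff.transpose(-1, -2) @ torch.linalg.solve(cov_pred, diff)
    logdet = torch.logdet(cov_pred)                                # ln det Sigma
    return 0.5 * (mahal.squeeze() + logdet).sum()
```

In practice the network would output an unconstrained parameterization of $\Sigma_i$ (e.g. a Cholesky factor) to guarantee positive-definiteness.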

Lognormal Fields
We first test our methodology on mock cosmological data: lognormal random fields generated from linear matter power spectra. Lognormal fields are commonly used as a rough approximate description of matter density fields (Percival et al. 2004; Beutler et al. 2011; Cole et al. 2005). While they cannot accurately capture small-scale features of the cosmological density fields (e.g. Kitaura et al. (2010)), they nevertheless serve as a useful model for large-scale structure due to a number of properties, including the ability to compute summaries and information content analytically (Coles & Jones 1991). Unlike Gaussian density fields, lognormal density fields take positive values by construction, and they have been shown to agree with the results of $N$-body simulations even in mildly non-linear regimes (Kayo et al. 2001).

Data
We generate lognormal fields $\delta_{\mathrm{LN}}(\mathbf{x})$ from 2D Gaussian overdensity fields $\delta_{\mathrm{G}}(\mathbf{x})$ with a specified power spectrum $P_{\mathrm{G}}(k)$. We convert the Gaussian fields to obtain the corresponding lognormal overdensity fields:
$$ \delta_{\mathrm{LN}}(\mathbf{x}) = \exp\left( \delta_{\mathrm{G}}(\mathbf{x}) - \frac{\sigma_{\mathrm{G}}^2}{2} \right) - 1, $$
where $\sigma_{\mathrm{G}}^2$ is the variance of the field $\delta_{\mathrm{G}}(\mathbf{x})$. The Gaussian fields are produced with the powerbox package (Murray 2018).
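The Gaussian-to-lognormal transformation can be sketched as follows. For brevity, an uncorrelated Gaussian field stands in for the powerbox realizations with a specified power spectrum.

```python
import numpy as np

def lognormal_from_gaussian(delta_g):
    """Map a Gaussian overdensity field to a lognormal one.

    The resulting field has (approximately) zero mean and is bounded
    below by -1, as required of an overdensity field.
    """
    sigma2 = np.var(delta_g)
    return np.exp(delta_g - sigma2 / 2.0) - 1.0

rng = np.random.default_rng(42)
# Stand-in for a correlated Gaussian field generated by powerbox.
delta_g = rng.normal(0.0, 0.5, size=(100, 100))
delta_ln = lognormal_from_gaussian(delta_g)
```

The subtraction of $\sigma_{\mathrm{G}}^2/2$ in the exponent is what keeps the mean of the transformed overdensity near zero, since $\langle e^{\delta_{\mathrm{G}}} \rangle = e^{\sigma_{\mathrm{G}}^2/2}$ for a zero-mean Gaussian field.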
We take $P(k)$ to be the linear matter power spectrum computed with the Eisenstein-Hu transfer function (Eisenstein & Hu 1999) and generate the power spectra using the pyccl package (Chisari et al. 2019b). For each $P(k)$, we vary two cosmological parameters: the total matter density, $\Omega_m$, and the r.m.s. of the present-day ($z = 0$) density perturbations at scales of $8\,h^{-1}$ Mpc, $\sigma_8$. We fix the remaining cosmological parameters to the following values: $\Omega_b = 0.05$, $h = 0.7$, $n_s = 0.96$, $N_{\mathrm{eff}} = 3.046$, $\sum m_\nu = 0$ eV. We use a grid of $N^2 = 100 \times 100$ points and set the area of the slice to be $A = L^2 = (1000~\mathrm{Mpc})^2$. Figure 2 shows an example of a power spectrum $P(k)$ with $\Omega_m = 0.3$ and $\sigma_8 = 0.8$, as well as the corresponding realizations of Gaussian and lognormal overdensity fields.
We generate a set of 10,000 different combinations of cosmological parameters uniformly distributed in the range $\Omega_m \in [0.15, 0.45]$ and $\sigma_8 \in [0.65, 0.95]$. For each combination of $\Omega_m$ and $\sigma_8$, we simulate 10 different realizations of lognormal overdensity fields, constructed from different initial random seeds. These realizations, rotated and flipped at random, serve as augmentations ('different views') to train the VICReg encoder network.

VICReg Setup
We compress the $100 \times 100$ dimensional maps down to 16-dimensional summaries using an encoder network with 9 convolutional layers and 2 fully-connected layers. The inference network used to infer parameters from the compressed summaries is a simple fully-connected neural network with 2 hidden layers. We provide the details of the architectures of the two networks in Appendix A.
We use 80% of the data for training, 10% for validation, and the remaining 10% for testing.When splitting the dataset into the training, validation, and testing sets, augmentations corresponding to the same set of parameters are not shared across the different data splits.
We train the encoder network on the training set for 200 epochs in the PyTorch (Paszke et al. 2019) framework using the AdamW (Kingma & Ba 2014; Loshchilov & Hutter 2019) optimizer, which is used as the default optimizer in this work, with an initial learning rate of $2 \times 10^{-4}$ and a cosine annealing schedule. Throughout this work, we also perform a manual hyperparameter search to find the optimal invariance $\lambda$, variance $\mu$, and covariance $\nu$ weights in the loss function. We work with $\lambda = 5$, $\mu = 5$, and $\nu = 1$. The downstream network is trained for 200 epochs with an initial learning rate of $10^{-3}$, reduced by a factor of 5 when the validation loss plateaus for 10 epochs.
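A minimal sketch of the optimizer and scheduler setup described above, with a stand-in model, a dummy loss, and a shortened loop for illustration (the paper trains for 200 epochs):

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(16, 4)            # stand-in for the encoder network
opt = AdamW(model.parameters(), lr=2e-4)  # initial learning rate 2e-4
sched = CosineAnnealingLR(opt, T_max=200) # cosine annealing over 200 epochs

for epoch in range(3):                    # 3 epochs here, for illustration only
    opt.zero_grad()
    loss = model(torch.randn(8, 16)).pow(2).mean()  # dummy training loss
    loss.backward()
    opt.step()
    sched.step()                          # decay the learning rate each epoch
```

For the downstream network, `ReduceLROnPlateau` from the same module implements the plateau-based reduction described above.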
In this work, we use the same training, validation, and test datasets when training the encoder network and the downstream inference network. We evaluate the performance of the neural networks on the validation set at the end of each epoch and save the models with the best validation loss. Once both networks are trained, the overall performance of the algorithm is evaluated on the test dataset.
As a baseline, we also construct a convolutional neural network with an architecture that is equivalent to a combination of the encoder and inference networks, corresponding to the fully-supervised case. We train this network to perform field-level inference without intermediate steps: given an overdensity map, it learns to infer the means and covariance matrix of the cosmological parameters $\Omega_m$ and $\sigma_8$. The network is trained for 200 epochs, with an initial learning rate of $2 \times 10^{-3}$ and a cosine annealing schedule. This model is used to evaluate the performance of the self-supervised method against a fully-supervised benchmark.

Results
We now present the results of the parameter inference with the VICReg method on a test dataset of 10,000 mock overdensity maps. In Fig. 3, we plot the predicted values of $\Omega_m$ (left panel) and $\sigma_8$ (right panel) compared to the true values of these parameters for 100 test maps, with the error bars showing the $1\sigma$ uncertainties predicted by the model.
We compare the performance of the VICReg method to the performance of the baseline supervised learning method on the test dataset in Table 1. We find that the inference network trained on VICReg summaries is able to recover the true values of the cosmological parameters with both accuracy and precision, with relative errors on $\Omega_m$ and $\sigma_8$ equal to 5.2% and 1.3%, respectively. For comparison, a neural network with an equivalent architecture, trained on the maps directly in a fully-supervised manner, predicts the cosmological parameters with similar accuracy (relative errors on $\Omega_m$ and $\sigma_8$ equal to 5.1% and 1.3%), which suggests that the encoder network has learned an effective compression scheme that reduces the maps to summaries without substantial loss of information.

Comparison to Theoretical Expectation
As a bijective transformation of a Gaussian random field, the lognormal field preserves the information content of the equivalent Gaussian one and is therefore conducive to an analytic treatment of its information content. We use this fact to compare the expected constraints given the Fisher information content of the underlying lognormal field and the VICReg-extracted summaries. The Fisher information matrix of the Gaussian fields can be conveniently computed from their power spectra $P(k)$. The elements of the Fisher matrix for parameters $\theta_\alpha$, $\theta_\beta$ are given by:
$$ F_{\alpha\beta} = \sum_{k} \frac{N_k}{2} \, \frac{\partial \ln P(k)}{\partial \theta_\alpha} \, \frac{\partial \ln P(k)}{\partial \theta_\beta}, $$
where the sum is over all independent $k$-modes and $N_k$ is the number of modes in the corresponding bin. The Fisher matrix elements for the overdensity maps are computed by evaluating the linear matter power spectrum $P(k)$ at the relevant $k$-values and numerically computing the derivatives of the power spectrum with a four-point finite-differences formula.
Assuming the summaries $y$ we obtain with VICReg can be described by a Gaussian likelihood, we can compute the Fisher information matrix for $y$ as follows:
$$ F_{\alpha\beta} = \frac{\partial \boldsymbol{\mu}^{T}}{\partial \theta_\alpha} \, C^{-1} \, \frac{\partial \boldsymbol{\mu}}{\partial \theta_\beta}, $$
where $\boldsymbol{\mu}$ is the mean of the summaries and $C$ is their covariance matrix. We evaluate the Fisher matrix for the summaries numerically: the derivatives of the summaries are computed using a four-point finite-differences formula, and the covariance matrix $C$ is estimated with the Ledoit-Wolf method implemented in the sklearn package. We use 10,000 realizations of lognormal maps at the fiducial cosmology to estimate the covariance matrix. In Fig. 4, we show the Fisher-informed constraints on $\Omega_m$ and $\sigma_8$. The fiducial values of $\Omega_m$ and $\sigma_8$ in this case are set to 0.3 and 0.8, respectively. The Fisher-informed contours from the lognormal fields and the VICReg summaries are in excellent agreement, demonstrating that the summaries preserve the Fisher information content of the maps almost entirely. We show the lower bounds on the errors on the two parameters in Table 2. Since the Fisher-informed constraints from the VICReg summaries of the lognormal overdensity maps were computed under a set of assumptions about Gaussianity, we further examine and validate our conclusions in Appendix D. We compare the Fisher-informed constraints to posterior distributions inferred by a normalizing flow trained on the VICReg summaries and find that similar conclusions hold.
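The numerical Fisher-matrix estimate for the summaries can be sketched as follows. For brevity, this uses central (two-point) finite differences rather than the four-point formula, and `mean_fn` is a hypothetical stand-in for the mean summary as a function of the parameters (in practice, the average of the encoder outputs over realizations at each parameter point).

```python
import numpy as np
from sklearn.covariance import LedoitWolf

def fisher_from_summaries(mean_fn, theta_fid, samples, h=1e-3):
    """F_ab = (d mu / d theta_a)^T C^{-1} (d mu / d theta_b).

    mean_fn: maps a parameter vector to the mean summary vector;
    samples: (n_realizations, dim_summary) summaries at the fiducial point,
             used to estimate C with the Ledoit-Wolf shrinkage estimator.
    """
    cov = LedoitWolf().fit(samples).covariance_
    cinv = np.linalg.inv(cov)
    p = len(theta_fid)
    derivs = []
    for a in range(p):
        step = np.zeros(p)
        step[a] = h
        # Central finite difference of the mean summary w.r.t. theta_a.
        derivs.append((mean_fn(theta_fid + step) - mean_fn(theta_fid - step)) / (2 * h))
    d = np.stack(derivs)                 # shape (p, dim_summary)
    return d @ cinv @ d.T                # shape (p, p)
```

With a toy `mean_fn` and mock noisy summaries, the resulting matrix is symmetric and positive-definite, as a Fisher matrix should be.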

CAMELS Total Matter Density Maps
Having demonstrated the potential of self-supervised learning for data compression and parameter inference on a simple lognormal model of overdensity fields, we now consider an application of our method to more realistic data: total matter density maps from the CAMELS project (Villaescusa-Navarro et al. 2021b,c). CAMELS is a collection of hydrodynamic and $N$-body simulations, which span a wide range of cosmological ($\Omega_m$, $\sigma_8$) and astrophysical parameters (stellar feedback parameters $A_{\mathrm{SN1}}$, $A_{\mathrm{SN2}}$ and AGN feedback parameters $A_{\mathrm{AGN1}}$, $A_{\mathrm{AGN2}}$). The stellar feedback parameters $A_{\mathrm{SN1}}$, $A_{\mathrm{SN2}}$ parametrize the galactic stellar-driven winds or outflows which eject the gas from the interstellar medium to large distances away from the star-forming galaxy. The AGN parameters $A_{\mathrm{AGN1}}$, $A_{\mathrm{AGN2}}$ describe the feedback from the massive black holes, which affects the large-scale matter distribution by heating up and expelling the gas from the galaxy (Somerville & Davé 2015). We refer the readers to Villaescusa-Navarro et al. (2021c) for further details on the CAMELS dataset.
In this work, we use two publicly available suites of hydrodynamic CAMELS simulations, which implement two distinct galaxy formation models: IllustrisTNG (Weinberger et al. 2017; Pillepich et al. 2018) and SIMBA (Davé et al. 2019). We use the Latin hypercube (LH) sets of the two suites, which contain realizations from uniformly-drawn cosmological parameters ($\Omega_m \in [0.1, 0.5]$, $\sigma_8 \in [0.6, 1.0]$) and astrophysical parameters. Each simulation in the LH sets has a different value of the random seed which defines the initial conditions for the simulation. For our study, we use the total matter density maps from the CAMELS multifield dataset, which represent the spatial distribution of baryonic as well as dark matter at $z = 0$ within a $25 \times 25 \times 5~(h^{-1}\,\mathrm{Mpc})^3$ slice of a given simulation (Villaescusa-Navarro et al. 2021b). The IllustrisTNG and SIMBA datasets contain 15,000 different maps each (1,000 hydrodynamic simulations with 15 maps per simulation). We construct self-supervised summaries using these maps and demonstrate their efficacy for downstream parameter inference.

VICReg Setup
We modify the notion of the two different views/augmentations to represent total mass density maps from two different slices of the same simulation, rotated or flipped at random during training. This should enable the encoder network to learn the relevant cosmological information from the maps and become insensitive to spatial variations across the slices.
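This view construction can be sketched as follows; `random_view` and `make_pair` are hypothetical helper names for illustration.

```python
import numpy as np

def random_view(sim_maps, rng):
    """Draw one 'view': a random slice of a simulation, randomly rotated/flipped.

    sim_maps: array of shape (n_slices, H, W) holding the maps
              from a single simulation (a single set of parameters).
    """
    m = sim_maps[rng.integers(len(sim_maps))]   # pick one of the slices
    m = np.rot90(m, k=rng.integers(4))          # random 90-degree rotation
    if rng.random() < 0.5:
        m = np.flipud(m)                        # random flip
    return m

def make_pair(sim_maps, rng):
    # Two independently augmented views of the same underlying cosmology.
    return random_view(sim_maps, rng), random_view(sim_maps, rng)
```

Because both views come from the same simulation, the invariance term of the loss pushes the encoder to ignore the slice-to-slice variations while retaining what the slices share: the underlying parameters.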
We also find it helpful to modify the VICReg loss such that each batch includes 5 pairs of different augmentations from each simulation, as opposed to a single pair per simulation (or per set of cosmological parameters).Since the CAMELS maps have more complexity than the lognormal maps, this allows the encoder network to learn from more variations.
Due to the high computational cost of running hydrodynamic simulations, IllustrisTNG and SIMBA have fewer data samples than the lognormal maps dataset used in Sec. 3.1, so we reserve more data for validation and testing purposes: 70% of the simulations for training, 20% for validation, and the remaining 10% for testing.
We use ResNet-18 (He et al. 2016;Paszke et al. 2019) as the encoder, which compresses the 256 × 256 maps to summaries of length 128.The inference network used for parameter inference is a simple fully-connected 2-layer neural network, with 512 units in each layer.
We train the encoder for 150 epochs with an initial learning rate of $10^{-3}$, which is multiplied by a factor of 0.3 when the validation loss plateaus for 10 epochs. The weights $\lambda$, $\mu$, $\nu$ in the loss function are set to 25, 25, and 1, respectively. The inference network is trained with an initial learning rate of $7 \times 10^{-4}$.
As a baseline model to compare against, we train a ResNet-18 model in a fully supervised manner to infer $\Omega_m$, $\sigma_8$ directly from the total mass density maps. We train the network for 200 epochs with an initial learning rate of $2 \times 10^{-4}$ and a cosine annealing schedule.

Results
We next present our results for the two simulation suites. We plot the predicted values of $\Omega_m$ (left panel) and $\sigma_8$ (right panel) against the true values for a subset of maps from the test set for the SIMBA (Fig. 5a) and IllustrisTNG (Fig. 5b) simulation suites. The error bars on the plots correspond to the predicted $1\sigma$ uncertainties.
We summarize the errors on the predicted parameters and compare the performance of the VICReg algorithm on the two simulation suites in Tables 3a and 3b. It can be seen that the inferred parameters provide a fairly accurate and unbiased estimate of the true parameters. Trained directly on the VICReg summaries, the inference model is able to infer the cosmological parameters with percent-level accuracy: the relative errors on $\Omega_m$ and $\sigma_8$ are 3.8% and 2.5%, respectively, for the SIMBA suite, and 3.7% and 1.9% for the IllustrisTNG suite.
We find that performing field-level inference on the matter density maps with an equivalent (ResNet-18) supervised model results in similar constraints on the cosmological parameters: the relative errors on $\Omega_m$ and $\sigma_8$ are 3.3% and 2.3%, respectively, for the SIMBA suite, and 3.3% and 1.8% for the IllustrisTNG suite. These results suggest that, despite the massive reduction in the size and dimensionality of the data, the VICReg encoder network learns a near-optimal compression scheme with only a slight reduction in the sensitivity of the downstream inference.

Performance on an Out-of-Distribution Dataset
When testing the models on out-of-distribution data, we find that similar results hold. We summarize the performance of the VICReg method and compare it to the supervised baseline model in Tables 4a and 4b for two out-of-distribution scenarios: applying models trained on data from the IllustrisTNG simulations to total matter density maps from the SIMBA suite, and applying models trained on data from the SIMBA simulations to total matter density maps from the IllustrisTNG suite. We find that, similarly to the case with both training and test data coming from the same distribution, the supervised baseline model shows slightly better performance than the inference network trained on the summaries of the maps. We plot the predicted values of $\Omega_m$ and $\sigma_8$ for the two out-of-distribution scenarios for the two methods (VICReg and supervised) in Figures 6 and 7. We notice that while the supervised model predicts the cosmological parameters more accurately overall, the predictions from the inference network show some qualitative similarities with the predictions from the supervised model. For instance, in Fig. 6 we find that both models under-predict the value of $\Omega_m$ when $\Omega_m$ is low, and over-predict it when $\Omega_m$ is high. Similarly, both models show the reverse trend when predicting $\Omega_m$ in Fig. 7. This suggests that, while preserving the relevant cosmological information from the maps, the summaries also encode some of the biases present in the simulations they were derived from, since the training of the encoder was not set up in such a way as to create summaries insensitive to the systematic effects present in these simulations.

MARGINALIZATION OVER SYSTEMATICS AND NUISANCE PARAMETERS
A general task across many fields involves summarizing complex data in a manner that isolates the effects of interesting variations in a model (e.g., parameters of interest) while accounting for the space of variations of uninteresting effects (e.g., nuisance or latent parameters). In the context of parameter inference, this is often accomplished through marginalization (in the Bayesian setting) or profiling (in the frequentist setting). This can be especially challenging if the space of systematic variations is high-dimensional and/or not well understood.
In cosmological inference, a specific issue involves summarization and inference while accounting for systematic effects, both physical and observational, e.g. the effect of baryonic physics. In recent years, a number of works have investigated the robustness of supervised machine learning methods to uncertainties associated with modelling baryonic processes in numerical simulations. These processes, which include feedback from Active Galactic Nuclei (AGN) and stellar feedback, affect observables such as the matter power spectrum at nonlinear scales (Chisari et al. 2019a). Different suites of hydrodynamical simulations take different approaches to modelling baryonic physics effects, which vary in the numerical methods used and their implementation of baryonic effects (or 'sub-grid' physics) (Weinberger et al. 2017; Pillepich et al. 2018; Davé et al. 2019; Chisari et al. 2018). As a result, theoretical predictions of cosmological observables from these simulations do not necessarily agree well on small scales, which are most affected by baryonic physics. Some studies have found machine learning models that are robust to variations in 'sub-grid' physics across different simulations (e.g., … 2023). In order to further address the robustness question, new numerical simulations with distinct 'sub-grid' physics models are now being incorporated into the CAMELS simulation suite (Ni et al. 2023).
The self-supervised method we have introduced offers an avenue to build machine learning models that are insensitive to uncertainties, such as those due to baryonic effects. These methods are designed to compress data into a set of statistics which preserve relevant information and are insensitive to a given set of variations. If we are interested in isolating information due to cosmological parameters of interest, then the different augmentations used to train an encoder network could be the simulation products from different sets of simulations (such as SIMBA and IllustrisTNG), which share the same cosmological parameters and initial conditions, but follow different 'sub-grid' physics prescriptions, or span a range of variations in sub-grid modeling. In such a setup, the encoder would learn to produce representations that are insensitive to variations in the implementation of baryonic physics. Since a large-scale cosmological dataset with the necessary augmentation structure is unavailable at the present time, we motivate this use case with a simple proof-of-principle example in the following subsection.

Data
Following Villaescusa-Navarro et al. (2020), for simplicity and ease of interpretation, we study a toy model of a power spectrum observable represented by a broken power law:

P(k) = A k^B for k ≤ k_pivot,   P(k) = F k^C for k > k_pivot,

where A and B are proxies for 'cosmological parameters' of interest that describe the amplitude and scale-dependence of the power spectrum on scales that are unaffected by the 'baryonic effects'. Parameter C, on the other hand, is a proxy for the effects of baryons on small scales, which, in this toy model, change the slope of the power spectrum. Finally, F is a constant that describes the amplitude of the power spectrum beyond the pivot scale k_pivot. It is calculated by requiring that the power spectrum is continuous at the pivot scale, A k_pivot^B = F k_pivot^C, so that F = A k_pivot^(B−C) and the small scales (k > k_pivot) carry 'cosmological' information about parameters A and B via F.
For this simple idealized case, we do not include noise in the model, but we do add cosmic variance effects. Cosmic variance accounts for the fluctuations away from the true power spectrum that one obtains when measuring the power spectrum P_obs(k) from a realization of a corresponding Gaussian field. The variance on the power spectrum is given by

σ_P²(k) = 2 P²(k) / N_k,

where N_k is the number of k-modes in a given bin and k_F is the fundamental frequency for a simulation box or a survey. For our dataset, we set k_F to 7 × 10⁻³ h Mpc⁻¹ and k_pivot to 0.5 h Mpc⁻¹, and compute the power spectrum for modes in the range k ∈ [3, 142] k_F. In Fig. 8, we show one example of such a power spectrum, produced with A = 0.6, B = −0.4, C = 0.25. On the same figure, for comparison, we also show an example of a power spectrum described by a simple power law with the same A, B parameters (C = B).
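As a concrete illustration, the toy observable above can be sketched in a few lines of Python. The parameter names A, B, C follow the reconstruction above, and the per-bin mode count is an assumption of this sketch, not a detail taken from the paper:

```python
import numpy as np

K_F = 7e-3       # fundamental frequency [h/Mpc]
K_PIVOT = 0.5    # pivot scale [h/Mpc]
k = np.arange(3, 143) * K_F  # modes in the range [3, 142] k_F

def broken_power_law(A, B, C, k=k, k_pivot=K_PIVOT):
    """Toy P(k): A k^B below the pivot scale, F k^C above it,
    with F = A k_pivot^(B - C) fixed by continuity at k_pivot."""
    F = A * k_pivot ** (B - C)
    return np.where(k <= k_pivot, A * k ** B, F * k ** C)

def add_cosmic_variance(pk, n_modes, rng):
    """Gaussian cosmic-variance scatter with sigma_P(k) = sqrt(2/N_k) P(k)."""
    return pk + rng.normal(scale=np.sqrt(2.0 / n_modes) * pk)

rng = np.random.default_rng(0)
n_modes = 4.0 * np.pi * (k / K_F) ** 2  # rough 3D mode count per bin (an assumption)
p_obs = add_cosmic_variance(broken_power_law(0.6, -0.4, 0.25), n_modes, rng)
```

Each call to `add_cosmic_variance` with shared (A, B) and different C values would then yield one 'view' of the same underlying cosmology.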
For our dataset, we sample 1,000 different values of the parameters A and B from uniform distributions: A ∈ [0.1, 1.0], B ∈ [−1.0, 0.0]. For each combination of A and B, we sample 10 different values of parameter C uniformly at random (C ∈ [−0.5, 0.5]) and generate realizations of the observed power spectra. These power spectra, which share A and B but vary C, are used as different views when training the VICReg encoder. This is the dataset we are most interested in, and we later refer to this case as 'broken power law with varying C'. For comparison, we create a second dataset that uses the same combinations of parameters A and B. However, for each combination of A, B, we sample only one value of parameter C and generate 10 realizations of the corresponding power spectra. These realizations, used as different views for training the encoder, share both 'cosmological' and 'baryonic' parameters, and differ from one another only due to cosmic variance. We refer to this case as 'broken power law with constant C'. The VICReg encoder is trained with a learning rate of 10⁻³ and a cosine annealing schedule. The invariance (λ), variance (μ), and covariance (ν) weights are set to 15, 15, and 1 respectively in the loss function. The inference network is trained on the summaries for 300 epochs with an initial learning rate of 5 × 10⁻⁴, which is multiplied by a factor of 0.3 if the validation loss plateaus for 10 epochs.

Results
We evaluate the overall performance of the VICReg method on the test dataset corresponding to the 'broken power law with varying C' case.
We begin by analyzing the results of 'cosmological' parameter inference from the summaries. We plot the predicted values of parameters A, B, and C against the true values of these parameters in Fig. 9.

Figure 8. An example of toy power spectra represented by a simple power law (no 'baryonic effects') and a broken power law ('baryonic effects' affect scales past the pivot scale k_pivot). The power spectra were generated for the following set of parameters: A = 0.6, B = −0.4, C = 0.25 (see Eq. 11).

We find that the regression network trained on the summaries is able to predict the 'cosmological' parameters, with relative errors on A and B of 2.1% and 3.9% respectively. However, the network is not able to infer any information regarding the 'baryonic effects' parameter C from the summaries. This is promising, since the augmentations were specifically designed so that the representations produced by the VICReg encoder network would be insensitive to variations due to 'baryonic effects' while retaining information about the relevant 'cosmological' parameters. Since the small scales (k > k_pivot) still contain information about the 'cosmological' parameters, a potential worry about our prescription is that it blindly ignores the small scales in the data and the information they contain, learning a trivial prescription to ignore 'baryonic' effects. Instead, we would like the method to learn to disentangle the 'cosmological' information from the 'baryonic' information on these scales. We study how much, if at all, the representations produced via self-supervision depend on the power spectrum at different scales k using two different metrics of statistical dependence: distance correlation and mutual information. We will see that the learned summaries retain correlations with the input power spectra at scales beyond the pivot scale, showing the retention of 'cosmological' information despite having no sensitivity to the 'baryonic' parameter.

Distance Correlation
Distance correlation is a statistical measure of dependence between two random variables that captures both linear and non-linear associations between the variables (Székely et al. 2007). For a pair of random variables X and Y, the distance correlation R(X, Y) can be computed as

R²(X, Y) = V²(X, Y) / √( V²(X) V²(Y) ),

where V²(X, Y) is the (squared) distance covariance between X and Y, and V²(X), V²(Y) are the distance variances of X, Y. Empirically, the distance covariance for a statistical sample of N pairs of random vectors (X_j, Y_j), j = 1, 2, ..., N, can be computed as an average of the so-called doubly-centered distances A_jk, B_jk:

V²(X, Y) = (1/N²) Σ_{j,k} A_jk B_jk,

where the doubly-centered distances A_jk, B_jk are defined as

A_jk = a_jk − ā_j· − ā_·k + ā_·· ,   B_jk = b_jk − b̄_j· − b̄_·k + b̄_·· ,

and a_jk, b_jk are matrices containing pairwise Euclidean distances between the vectors: a_jk = ||X_j − X_k||, b_jk = ||Y_j − Y_k||. We refer the reader to Székely et al. (2007) for the full definitions and properties of these quantities. In general, the distance correlation R(X, Y) can vary between 0 and 1, with R(X, Y) = 0 corresponding to the case of two independent variables and R(X, Y) = 1 indicating a linear relationship between the variables.
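To make the estimator concrete, the empirical distance correlation can be computed directly from the pairwise-distance matrices. The following sketch implements the double-centering recipe above (the function name is ours):

```python
import numpy as np

def distance_correlation(X, Y):
    """Empirical distance correlation of Szekely et al. (2007).

    Double-center the pairwise-distance matrices of the two samples,
    then combine the distance covariance and distance variances."""
    X = np.asarray(X, dtype=float).reshape(len(X), -1)
    Y = np.asarray(Y, dtype=float).reshape(len(Y), -1)
    a = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    b = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    # doubly-centered distances: subtract row/column means, add the grand mean
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return float(np.sqrt(max(dcov2, 0.0) / denom)) if denom > 0 else 0.0
```

A perfectly linear relation gives a value near 1, while independent samples give a value near 0.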
Figure 10 shows the distance correlation between the VICReg summaries and the power spectra values in a given k-bin. Here we consider summaries from the two cases outlined in Section 4.1: 'broken power law with varying C' and 'broken power law with constant C'. The distance correlation between the summaries and different k-modes follows similar behavior for the two datasets prior to the pivot scale. Past k_pivot, the two curves start to differ: the distance correlation for the summaries trained on the power spectra with varying C declines rapidly, whereas, for the summaries trained on the power spectra with constant C, the distance correlation starts to increase again.
This behaviour is consistent with the expected results, given the two different VICReg setups used to obtain the summaries. On large scales k < k_pivot, both types of power spectra carry relevant information about the 'cosmological' parameters, since at these scales the power spectra depend only on A and B. The small scales, however, contain information about both 'cosmological' and 'baryonic' parameters. In the first case, the 'baryonic' information is considered irrelevant, so the summaries should depend less on the power spectra values past k_pivot. In the second case, the same 'baryonic' information is considered relevant, since it is present across different augmentations of the same object, hence the increase in the value of the distance correlation between the summaries and the power spectra values on small scales.

Mutual Information
Mutual information (MI) is another measure of non-linear dependence between random variables. It measures the amount of information (in natural units, or nats) one gains about one random variable by 'observing' the other random variable. Similar to the distance correlation, mutual information extends beyond linear correlations, captures the full dependence between the two variables, and is zero only if the two variables are independent. For a pair of variables X and Y, the mutual information I(X; Y) is defined as

I(X; Y) = D_KL( P_XY || P_X P_Y ) = ∫ p(x, y) log [ p(x, y) / ( p(x) p(y) ) ] dx dy,

i.e., the Kullback-Leibler divergence D_KL between the joint distribution P_XY and the product of the marginals P_X and P_Y. The integral is taken over the joint support of X and Y. For a comprehensive review of mutual information and its properties, we refer the reader to, for instance, Vergara & Estévez (2014).
We use the estimator MINE (Mutual Information Neural Estimation) (Ishmael Belghazi et al. 2018) to compute a mutual information estimate between the VICReg summaries and the power spectra in different k-bins. MINE trains a neural surrogate to distinguish between samples from the joint distribution and samples from the product of the marginal distributions, maximizing a lower bound on the mutual information; see Ishmael Belghazi et al. (2018) for more information.
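Training a MINE network is beyond the scope of a short example, but the quantity being estimated can be illustrated with a simple plug-in (histogram) MI estimator. This is a crude stand-in for intuition only, not the method used in the paper:

```python
import numpy as np

def mutual_information_hist(x, y, bins=32):
    """Plug-in MI estimate (in nats) from a 2D histogram of samples:
    sum over bins of p(x, y) * log[ p(x, y) / (p(x) p(y)) ]."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)  # marginal over y
    py = pxy.sum(axis=0, keepdims=True)  # marginal over x
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())
```

For strongly correlated variables this returns a clearly positive value, while for independent samples it returns a value near zero (up to a small positive binning bias).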
In Fig. 10, we plot the estimated mutual information between the VICReg summaries and the values of the power spectra in different k-bins. Similarly to Sec. 4.3.1, we consider two cases: 'broken power law with varying C' and 'broken power law with constant C'. We notice that the mutual information and the distance correlation follow a similar pattern. Past the pivot scale k_pivot, the mutual information estimate decreases in magnitude in the first case, while in the second case there is also a slight enhancement of MI on small scales, which mirrors the enhancement in the distance correlation. The suppression of mutual information on small scales in the first case aligns with what one might expect: the VICReg training was set up to generate summaries insensitive to 'baryonic physics', which only affects k > k_pivot, so the summaries should show less dependence (lower MI) on small scales. We note that in the case with constant C (i.e., when summaries are expected to carry information about all parameters of interest), for both MI and the distance correlation, the correlation measure peaks at an intermediate scale in the k-range where it is expected to be sensitive to specific parameters (k > k_pivot for the 'baryonic' parameter, and k < k_pivot for the 'cosmological' parameters).
In summary, Figs. 9 and 10 show that the summaries learned through self-supervision preserve the relevant 'cosmological' information while being insensitive to the variations in 'baryonic effects', disentangling the signal coming from the 'cosmological' parameters from the information about the 'baryonic' parameter instead of entirely ignoring the small scales. While in this section, for ease of interpretation, we considered a simple toy model as our data vector, it would be interesting to investigate further whether the same holds for more complex and realistic data vectors, such as cosmological power spectra with various non-linear corrections due to baryonic effects, or matter density maps from hydrodynamical simulations with different baryonic physics prescriptions.

Comparison to the Supervised Baseline Model
For completeness, we also compare the performance of the VICReg method and of supervised learning, and find that, similarly to the results from Section 3, an equivalent supervised model slightly outperforms the self-supervised model. In particular, when marginalizing over the 'baryonic physics' parameter C and inferring only the 'cosmological' parameters, the relative errors on A and B from the supervised model are 1.30% and 2.98% respectively, while the relative errors from the self-supervised model are at the level of 1.90% and 3.39% respectively. Villaescusa-Navarro et al. (2020) found that, for the toy P(k) model studied in this section, neural networks trained in a supervised manner are able to marginalize over the scales which are impacted by baryonic effects and to obtain relevant cosmological information even from the scales strongly affected by these effects. While we cannot directly compare our results to Villaescusa-Navarro et al. (2020) due to differences in the datasets and the neural network architectures used, we find that supervised baseline models place tight and accurate constraints on the 'cosmological' parameters when marginalizing over 'baryonic' effects, which is consistent with their findings.

COSMOLOGICAL PARAMETER INFERENCE WITH SEQUENTIAL SIMULATION-BASED INFERENCE
We have so far focused on cases where compressed summaries are used to build amortized estimators for the parameter posterior distribution, i.e. ones that can be deployed for inference on any new data point. Many cosmological applications can benefit instead from targeted inference: learning an estimator specific to a given observation. This can be challenging when one only has access to a fixed set of simulations. We show here how our compression scheme can be used to build a generative model for use as an emulator of summaries, to be then used for targeted inference with sequential simulation-based inference. Simulation-based inference (SBI) refers to a broad set of methods designed to infer parameters of interest θ when the likelihood describing the observed data x_obs is unknown or intractable. SBI techniques rely on forward models or simulators which probabilistically model the data-generation process and thus implicitly define the likelihood p(x_obs | θ) of the data given a set of input parameters. By aggregating and analyzing the samples generated from the simulator, SBI techniques approximate the posterior distribution of the input parameters θ. In recent years, advances in machine learning have led to the development of new neural SBI methods that address the shortcomings associated with traditional SBI methods, such as Approximate Bayesian Computation (ABC), and enable efficient and accurate posterior inference, even for complex high-dimensional distributions. We refer the reader to Cranmer et al. (2020) for a recent review of neural SBI and associated developments.
Here, we use the Sequential Neural Posterior Estimation (SNPE) method (Papamakarios & Murray 2016; Lueckmann et al. 2017; Greenberg et al. 2019). SNPE refines the posterior estimate over several rounds, drawing new simulation parameters from the current posterior approximation. This form of active learning allows the neural network to learn from the more informative samples, which are more likely to produce data vectors similar to the observed data x_obs, reducing the total number of simulations required to accurately approximate the true posterior. We use the default implementation of the SNPE algorithm provided by the sbi package (Tejero-Cantero et al. 2020).
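SNPE itself relies on neural density estimators provided by the sbi package; to illustrate the basic logic of simulation-based inference that SNPE refines, here is a deliberately simple rejection-ABC sketch on a toy simulator (the simulator and all names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta):
    """Toy forward model: a noisy observation of theta."""
    return theta + rng.normal(scale=0.1, size=np.shape(theta))

x_obs = 0.5  # pretend observed summary

# draw parameters from the prior, simulate, and keep only those whose
# simulated data land close to the observation
theta_prior = rng.uniform(0.0, 1.0, size=20000)
accepted = theta_prior[np.abs(simulator(theta_prior) - x_obs) < 0.02]
posterior_mean = accepted.mean()
```

The accepted parameters approximate samples from the posterior p(θ | x_obs); SNPE replaces the wasteful rejection step with a learned conditional density estimator and adaptive proposals.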
While strategies such as 'active learning' can significantly speed up the inference process, one potential bottleneck for an SBI-based pipeline is the computational complexity of the simulator. Although mock data, such as lognormal overdensity fields, can be generated fairly quickly, if one were to perform SBI on more realistic datasets, such as density fields from hydrodynamical or N-body simulations, obtaining a single sample from the simulator would take a non-trivial amount of time and computational resources. We address this problem by training an emulator using a normalizing flow (Papamakarios et al. 2021; Rezende & Mohamed 2015) to model the distribution of the summaries s given a set of parameters θ. The advantage of training the emulator on the summaries s, as opposed to the input maps themselves, is that the summaries contain most (or, ideally, all) of the information in the maps, while, due to their lower dimensionality, their distribution should be easier to estimate accurately than that of the full maps. Once trained, the emulator can produce summaries on the fly, which addresses the computational bottleneck.
We construct the emulator from a stack of 8 masked autoregressive flows (Papamakarios et al. 2017). The emulator is trained on the VICReg summaries of the lognormal maps from the dataset described in Sec. 3.1.1. We train the emulator for 300 epochs using the AdamW optimizer, with a learning rate of 2 × 10⁻³ and a cosine annealing schedule. We compare samples of summaries from the trained emulator to the summaries computed on lognormal maps in Fig. 11 for input parameters θ_fid = {Ω_m = 0.3, σ_8 = 0.8}. By inspection, we find that the emulator successfully reproduces the key features of the distribution of the actual summaries.
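A full masked-autoregressive-flow emulator is beyond a short sketch, but the same conditional-density idea can be illustrated with a minimal stand-in: a Gaussian model of p(s | θ) with a linear mean. The class name and model are ours, purely illustrative:

```python
import numpy as np

class GaussianSummaryEmulator:
    """Crude stand-in for a flow-based emulator of summaries: model
    p(s | theta) as a Gaussian with a linear mean and fixed residual covariance."""

    def fit(self, theta, s):
        # least-squares linear map theta -> mean summary (bias column appended)
        T = np.hstack([theta, np.ones((len(theta), 1))])
        self.W, *_ = np.linalg.lstsq(T, s, rcond=None)
        resid = s - T @ self.W
        self.cov = np.cov(resid.T) + 1e-8 * np.eye(s.shape[1])
        return self

    def sample(self, theta, rng, n=1):
        # draw n emulated summaries at the given parameter point
        T = np.hstack([np.atleast_2d(theta), np.ones((1, 1))])
        mean = (T @ self.W)[0]
        return rng.multivariate_normal(mean, self.cov, size=n)
```

A normalizing flow plays the same role but can capture non-Gaussian, parameter-dependent structure in the summary distribution.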
We use the trained emulator as the simulator for the SBI pipeline. Our observed data vector x_obs is a summary of a randomly generated lognormal overdensity map with θ_fid = {Ω_m = 0.3, σ_8 = 0.8}. We run the SNPE algorithm for 10 rounds, drawing 1,000 parameter-data pairs in each round and using these simulated pairs to estimate the posterior distribution of the input parameters θ at the end of each round of inference. Figure 12 shows the approximate posterior inferred with the SNPE algorithm (teal) compared to that obtained using the inference network from Sec. 3.1.3 (orange). We check that both posteriors are well-calibrated using the simulation-based coverage calibration (SBCC) procedure (Deistler et al. 2022) in Appendix E. The true input parameters θ_fid are seen to lie well within the posterior contours, and the constraints obtained using the inference network are consistent with the SNPE-informed constraints. While this is a strictly proof-of-concept application of the VICReg-constructed summaries within the SBI framework, it demonstrates the potential utility of the summaries for the downstream task of targeted posterior density estimation.

DISCUSSION AND CONCLUSION
We have introduced a self-supervised machine learning approach for general-purpose representation learning from simulated data, with a focus on data compression as an application. The method uses simulation-based augmentations in order to learn representations based on inherent symmetries of the data (i.e., data variations that correspond to the same underlying physics of interest). Applied in the context of cosmology, we have shown the potential of the method for massive compression of cosmological simulations into a set of sufficient summary representations, as well as their use for downstream parameter inference. In addition, we have shown how the method can be used to produce representations that are insensitive to a set of systematic variations or nuisance parameters. The learned summaries can be used for a variety of downstream tasks in cosmology and astrophysics, such as classification, source identification and characterization, anomaly or outlier detection, and parameter inference, as well as in conjunction with other machine learning methods such as simulation-based inference. In this work, however, we focused on using the compressed summaries specifically for downstream parameter inference.
Using VICReg as the backbone method, we showed that summaries learned via self-supervision can be used to faithfully reproduce the parameter constraints expected from the Fisher information (in the case of the lognormal fields data). Considering the total matter density maps from hydrodynamic simulations in the CAMELS project (Villaescusa-Navarro et al. 2021b,c), we found that, even when applied to this more complex dataset with fewer data samples, our method is able to construct informative summaries that achieve parameter inference performance on par with a fully-supervised baseline.
Recent works (Villaescusa-Navarro et al. 2021a; Wadekar et al. 2022; Shao et al. 2023; Delgado et al. 2023) have explored the robustness of supervised methods to variations in sub-grid physics models by applying neural networks trained on observables from one simulation suite to a dataset from a different simulation suite containing separate models of sub-grid physics. Here, we do not make such robustness comparisons, since the self-supervision setup we used was designed with the idea of learning summaries of maps that are invariant to random spatial fluctuations and projections, rather than to variations in sub-grid physics models. However, it would be interesting to apply this method to create summaries that are robust to differences in sub-grid physics. In this case, the different augmentations used to train the encoder could correspond to maps from simulations that share the same cosmological parameters but are run using different codes which differ, for instance, in their implementation of baryonic effects due to stellar and AGN feedback.
Through an illustrative proof-of-principle example, we finally explored the potential of our method to produce summaries that are robust to certain nuisance parameters or systematic uncertainties, finding that the method is able to produce robust representations that can effectively disentangle the effects of nuisance parameters from those corresponding to parameters of interest. These results suggest that self-supervision can be used to reduce the sensitivity of machine learning models to physical and observational systematic uncertainties, such as those due to baryonic physics. A more comprehensive study of this potential is warranted and is contingent on the availability of cosmological simulations that include the necessary diversity of variations.
While additional follow-up studies are necessary before deploying self-supervised learning methods on real cosmological data, with the influx of new survey data and simulation products, as well as rapid advances in novel machine learning techniques, self-supervised learning methods have the potential to accelerate a wide variety of tasks in cosmology via learning useful representations.

This research made extensive use of the Jupyter (Kluyver et al. 2016), Matplotlib (Hunter 2007), Numpy (Harris et al. 2020), and scikit-learn (Pedregosa et al. 2011) packages.

We choose the dimensionality of the summaries to be equal to 32 (dim(s) = 32).

B2 Inference Network
The inference network is a simple fully-connected network with 2 hidden layers of width 4 × dim(s). Similarly to the encoder network, we apply batch normalization and a ReLU activation function after each layer, excluding the output layer. The final output has dimensionality N_inference, which is given by Eq. A1.
The SimCLR loss for a positive pair of examples (i, j) takes the form

ℓ_{i,j} = −log [ exp( sim(z_i, z_j)/τ ) / Σ_{k=1}^{2N} 1[k ≠ i] exp( sim(z_i, z_k)/τ ) ],

where ℓ_{i,j} is the loss computed for each positive pair of examples, sim(·,·) denotes cosine similarity, 1[k ≠ i] is an indicator function which is non-zero only if k ≠ i, and τ is the so-called temperature parameter. In the following sections, we fix τ to 0.1.
The final loss function is computed by summing over all positive pairs in the batch.

While we outlined the motivation for using the VICReg method in Sec. 2, it is interesting to check whether a contrastive learning method such as SimCLR would provide comparable results when applied to the mock cosmological datasets considered in this work. In this appendix, we test the method on lognormal overdensity maps and CAMELS total matter density maps. We find that the results of parameter inference with SimCLR are comparable to the VICReg results, which suggests that SimCLR could also offer a promising approach for data compression and inference in cosmology, although further, more detailed experimentation with the method should be conducted.
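For reference, the batch form of this loss can be sketched in a few lines of numpy (a minimal version with cosine similarity; the convention that rows i and i + N hold the two views of object i is an assumption of this sketch):

```python
import numpy as np

def nt_xent_loss(z, tau=0.1):
    """NT-Xent (SimCLR) loss for a batch of 2N embeddings, where rows
    i and i + N hold the two views of the same object."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit vectors -> cosine similarity
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)  # implements the indicator 1[k != i]
    n = len(z) // 2
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # positive index per row
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return float(-log_prob.mean())
```

Well-aligned positive pairs drive the loss toward zero, while unrelated pairs yield a larger loss.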

C1.1 SimCLR Data and Setup
We use the dataset described in Sec. 3.1.1. The SimCLR encoder is trained with an initial learning rate that is multiplied by a factor of 0.3 when the validation loss plateaus for 10 epochs. The downstream inference network is also trained for 200 epochs with the AdamW optimizer, with an initial learning rate of 5 × 10⁻⁴, reduced by a factor of 5 when the validation loss plateaus for 10 epochs.

C1.2 Results
We present the results of the parameter inference on the test dataset in Fig. C1 and compare the performance of the SimCLR and VICReg methods in Table C1. We find that both methods are able to effectively compress the maps and accurately infer the cosmological parameters from the summaries, with similar errors. Once both the encoder and the inference networks are trained, we save the models with the lowest validation loss.
We also compare the Fisher information content of the lognormal fields and the SimCLR summaries, following the calculations in Sec. 3.1.4. We show the Fisher-informed constraints on the cosmological parameters in Fig. C2, with fiducial values Ω_m = 0.3 and σ_8 = 0.8. We find the Fisher-informed contours from the lognormal fields and the SimCLR summaries to agree very well.

C2.1 SimCLR Data and Setup
We next apply the SimCLR method to total matter density maps from the CAMELS project. The neural network architecture and datasets are the same as in Sec. 3.2.1.
We compare the results of the parameter inference with SimCLR to the VICReg results. The VICReg setup is the same as in Sec. 3.2.1, except that we use only one pair of views. We choose to use one pair of views for a one-to-one comparison, since the SimCLR loss is defined for a single positive pair of examples.
We train the SimCLR encoder network for 150 epochs with the AdamW (Kingma & Ba 2014; Loshchilov & Hutter 2019) optimizer with an initial learning rate of 1 × 10⁻³, multiplied by a factor of 0.3 when the validation loss plateaus for 10 epochs. The downstream inference network is trained for 200 epochs with the AdamW optimizer, with an initial learning rate of 1 × 10⁻³ and the same learning rate schedule as the encoder network.

C2.2 Results
We present our results for parameter inference with the two methods on the SIMBA and IllustrisTNG simulation suites in Fig. C3 (SimCLR) and Fig. C4 (VICReg with 2 views). We summarize the losses and errors of the SimCLR and VICReg methods on the test datasets in Tables C2a and C2b. The results show that the two methods are comparable in terms of their performance and constraints on the cosmological parameters.

APPENDIX D: COMPARISON TO THEORETICAL EXPECTATION (FISHER-INFORMED CONSTRAINTS) WITH NORMALIZING FLOWS
The Fisher-informed constraints from the VICReg summaries of lognormal overdensity fields are computed under a set of assumptions about the Gaussianity of the likelihood of the summaries and of the posterior distribution of the parameters. In this Appendix, we confirm that our conclusions from Section 3.1.4 hold when we estimate the posterior distribution of the parameters from the VICReg summaries without making such assumptions.
We train a normalizing flow to estimate the distribution of the parameters given a VICReg summary of a corresponding lognormal field. The normalizing flow consists of a stack of 8 masked autoregressive flows. We train the flow on VICReg summaries of the lognormal maps from the dataset described in Section 3.1.1. We train the flow for 300 epochs using the AdamW optimizer with a learning rate of 5 × 10⁻⁴ and a cosine annealing schedule.
We then compare the posteriors estimated by the normalizing flow to the constraints from the Fisher forecast on the cosmological parameters.

where H(θ) = 1 if q(θ* | x*) ≥ q(θ | x*), and 0 otherwise. The expected coverage is then computed as an average coverage across multiple parameter-data pairs {θ*, x*}. If the posterior is well-calibrated, the expected coverage should match the confidence level for all confidence levels (1 − α) ∈ [0, 1]. The algorithm to compute SBCC is described in detail in the Appendix of Deistler et al. (2022).
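The expected-coverage computation described above can be sketched directly from posterior log-densities (the function and argument names are ours):

```python
import numpy as np

def expected_coverage(logq_true, logq_samples, alphas):
    """SBCC-style expected coverage from posterior log-densities.

    logq_true: (n_pairs,) log q at the true parameters of each pair
    logq_samples: (n_pairs, n_samples) log q at posterior samples
    For each pair, the fraction of posterior samples at least as probable
    as the true point estimates the smallest HPD credible level containing it."""
    f = (logq_samples >= logq_true[:, None]).mean(axis=1)
    # the true point lies inside the (1 - alpha) HPD region iff f <= 1 - alpha
    return np.array([(f <= 1.0 - a).mean() for a in alphas])
```

For a well-calibrated posterior, the fraction f is uniformly distributed, so the expected coverage tracks the confidence level 1 − α.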
Since the posterior we obtain with simulation-based inference is not amortized, but targeted to a specific data point or observation (namely, a VICReg summary of a specific lognormal map), we do not expect the posterior to be accurate for all parameter-data pairs drawn from the prior. Rather, we would like the posterior to be well-calibrated for other observations that share the same fiducial cosmology as the target observation. In this case, the fiducial cosmology corresponds to θ_fid = {Ω_m = 0.3, σ_8 = 0.8}. Therefore, in order to compute the calibration plots, we use the emulator to sample 1,000 different VICReg summaries at the fiducial parameter values and evaluate the coverage diagnostic on these samples.
We plot the expected coverage for a range of confidence levels (1 − α) for the two posteriors in Figure E1. As can be seen from the plot, both posteriors lie close to the dashed line, which corresponds to a well-calibrated posterior.

Figure 1. A schematic overview of the self-supervised learning pipeline implemented in this work. The t, t′ are different transformations used to produce two views (e.g., lognormal density maps) x, x′ of the same underlying cosmological parameters of interest (e.g., Ω_m and σ_8). The inference network is trained on the summaries s, s′ obtained from the pre-training step.

Figure 2. An example of a Gaussian overdensity field δ_G and a corresponding lognormal overdensity field δ_LN generated from a linear matter power spectrum P(k) with the Eisenstein-Hu transfer function and Ω_m = 0.3, σ_8 = 0.8.

Figure 4. Constraints from the Fisher forecast on the cosmological parameters Ω_m and σ_8 obtained from lognormal overdensity maps (black dash-dotted line) and from summaries constructed with VICReg (orange solid line). The results shown on the plot were obtained for a fiducial cosmology with Ω_m = 0.3 and σ_8 = 0.8.

Figure 5. Predicted means and 1-σ uncertainties of cosmological parameters Ω_m and σ_8 compared to the true values of the parameters for total matter density maps from the SIMBA and IllustrisTNG simulations. Predictions for the means and variances of the parameters were obtained by training a simple inference neural network on VICReg summaries.

Figure 6. Predicted means and 1-σ uncertainties of cosmological parameters Ω_m and σ_8 compared to the true values of the parameters for total matter density maps from the SIMBA simulations, using models trained on maps from the IllustrisTNG simulations. Predictions for the means and variances of the parameters were obtained by training a convolutional neural network in a fully supervised way (top) and by training a simple inference neural network on VICReg summaries (bottom).

Figure 7. Predicted means and 1-σ uncertainties of cosmological parameters Ω_m and σ_8 compared to the true values of the parameters for total matter density maps from the IllustrisTNG simulations, using models trained on maps from the SIMBA simulations. Predictions for the means and variances of the parameters were obtained by training a convolutional neural network in a fully supervised way (top) and by training a simple inference neural network on VICReg summaries (bottom).
where \bar{a}_{j\cdot} and \bar{b}_{j\cdot} are the row means, \bar{a}_{\cdot k} and \bar{b}_{\cdot k} are the column means, and \bar{a}_{\cdot\cdot} and \bar{b}_{\cdot\cdot} are the overall means of the distance matrices a_{j,k} and b_{j,k}.
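The double-centering of the distance matrices described above is the core of the empirical distance correlation; a compact NumPy sketch (our own illustrative implementation, not the paper's code) is:

```python
import numpy as np

def distance_correlation(x, y):
    """Empirical distance correlation between samples x (n, p) and y (n, q)."""
    def centered(d):
        # Double-center: subtract row and column means, add the overall mean.
        return d - d.mean(axis=0) - d.mean(axis=1, keepdims=True) + d.mean()
    # Pairwise Euclidean distance matrices a_{j,k} and b_{j,k}.
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)
    A, B = centered(a), centered(b)
    dcov2 = (A * B).mean()                      # squared distance covariance
    dvarx, dvary = (A * A).mean(), (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvarx * dvary))
```

Unlike the Pearson correlation, this statistic vanishes (in the large-sample limit) only when the two variables are independent, which is what makes it useful for measuring residual sensitivity of the summaries.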

Figure 9. Predicted means and 1-σ uncertainties of the 'cosmological' parameters and the 'baryonic physics' parameter as a function of the true values of the parameters. Predictions for the means and variances of the parameters are plotted for the dataset referred to in this Section as the broken power law with a varying baryonic physics parameter.

Figure 12. Posterior probability distributions of the input parameters, inferred with the Sequential Neural Posterior Estimation algorithm and a trained emulator (teal) and with the inference network from Sec. 3.1.3 (orange). The results shown on the plot were obtained for a lognormal map generated with Ω_m = 0.3 and σ_8 = 0.8.

APPENDIX C: ALTERNATIVE SELF-SUPERVISED LEARNING APPROACHES: SIMCLR

In this appendix we show results of data compression and parameter inference for the SimCLR method. Proposed by Chen et al. (2020b), SimCLR is a contrastive learning method whose name stands for 'A Simple Framework for Contrastive Learning of Visual Representations'. Similarly to VICReg and other self-supervised learning methods, SimCLR consists of a pre-training step and a downstream task. In the pre-training step, the encoder network samples a batch of N inputs and uses two different views x_i and x_j corresponding to each input image x in the batch. It treats the two views of the same image as a positive pair and considers all other examples in the batch as negative examples. The encoder network then learns representations of the inputs by maximizing the similarity between the positive pair via a contrastive objective computed on the embeddings z_i, z_j:

\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)},

where sim(·,·) denotes cosine similarity and τ is a temperature hyperparameter.
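A minimal NumPy version of this contrastive (NT-Xent) loss reads as follows; this is our own sketch, in which z1[i] and z2[i] are the embeddings of the two views of input i:

```python
import numpy as np

def nt_xent_loss(z1, z2, tau=0.5):
    """NT-Xent contrastive loss for two batches of embeddings of shape (N, d)."""
    z = np.concatenate([z1, z2])                       # stack views: (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # so dot = cosine sim
    sim = z @ z.T / tau                                # (2N, 2N) similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude k == i terms
    n = len(z1)
    # Index of each row's positive partner: i <-> i + N.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Log-softmax over each row, evaluated at the positive pair.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

In contrast to VICReg, the in-batch negatives are what prevent collapse here, which is why SimCLR's performance is known to be sensitive to batch size.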

Figure C3. Predicted means and 1-σ uncertainties of cosmological parameters Ω_m and σ_8 compared to the true values of the parameters for total matter density maps from the SIMBA and IllustrisTNG simulations. Predictions for the means and variances of the parameters were obtained by training a simple inference neural network on SimCLR summaries.

Figure D1. Constraints from the Fisher forecast on the cosmological parameters Ω_m and σ_8 obtained from lognormal overdensity maps (black dash-dotted line) and the inferred posterior distribution of the parameters Ω_m, σ_8 from a normalizing flow trained on the VICReg summaries (orange solid line). The results shown on the plot were obtained for four different random realizations of the lognormal fields with fiducial parameter values of Ω_m = 0.3 and σ_8 = 0.8 (marked with a dashed line).

Table 1. Summary of the performance of the VICReg method and of an equivalent supervised baseline model for inferring cosmological parameters Ω_m and σ_8 from lognormal overdensity maps. The performance of the two methods is evaluated on the test dataset described in Sec. 3.1.1.

Table 2. Summary of the Fisher-informed constraints on Ω_m and σ_8 obtained from lognormal overdensity maps and from VICReg summaries for the fiducial cosmology with Ω_m = 0.3 and σ_8 = 0.8.

Scatter plot of predicted means and 1-σ uncertainties of cosmological parameters Ω_m (left) and σ_8 (right) plotted against the true values of the parameters for a subset of 100 maps from the test set. Predictions for the means and variances of the parameters were obtained by training a simple inference neural network on VICReg summaries.

…to evaluate the covariance matrix C, and 1,000 realizations of lognormal maps around the fiducial parameters to compute the derivatives \partial\mu/\partial\theta_\alpha. One can then use the Cramér-Rao bound to estimate the minimum variance of the parameter of interest \theta_\alpha from the inverse of the Fisher matrix: \sigma^2(\theta_\alpha) \ge (F^{-1})_{\alpha\alpha}.
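The Fisher forecast described in the surrounding text amounts to a few lines of linear algebra once the covariance and the numerical derivatives of the mean data vector are in hand; a schematic NumPy version (with illustrative names, not the paper's code) is:

```python
import numpy as np

def fisher_matrix(dmu_dtheta, cov):
    """Gaussian Fisher matrix F_ab = (dmu/dtheta_a)^T C^-1 (dmu/dtheta_b).

    dmu_dtheta : (n_params, n_data) derivatives of the mean data vector,
                 estimated e.g. by finite differences over simulations
    cov        : (n_data, n_data) covariance of the data vector
    """
    cinv = np.linalg.inv(cov)
    return dmu_dtheta @ cinv @ dmu_dtheta.T

def cramer_rao_sigmas(fisher):
    """Marginalized 1-sigma forecasts: sqrt of the diagonal of F^-1."""
    return np.sqrt(np.diag(np.linalg.inv(fisher)))
```

The Cramér-Rao bound then gives sigma^2(theta_a) >= (F^-1)_aa, so these values are the tightest constraints any unbiased estimator built on the same data vector can achieve.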

Table 3. Summary of the performance of the VICReg method and of an equivalent supervised model for inferring cosmological parameters Ω_m and σ_8 from SIMBA and IllustrisTNG total matter density maps, evaluated on the respective test datasets.

Table 4. Summary of the performance of the VICReg method and of an equivalent supervised model for inferring cosmological parameters Ω_m and σ_8 trained on SIMBA and evaluated on IllustrisTNG, and vice versa.
While the first batch of parameter-data pairs is drawn from the prior, in subsequent inference rounds the SNPE algorithm draws new samples from a modified proposal distribution, obtained by conditioning the approximate posterior estimator on the observation of interest.

Comparison of the summaries from the trained emulator (teal) and the actual summaries (orange) of lognormal maps (LN) with Ω_m = 0.3 and σ_8 = 0.8.
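The round structure of this sequential scheme can be caricatured with a toy one-parameter example; here a simple rejection step stands in for training the neural posterior estimator, and the simulator, thresholds, and round count are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulator(theta):
    """Toy simulator: the data are a noisy reading of the parameter."""
    return theta + rng.normal(scale=0.1, size=theta.shape)

x_obs = np.array([0.3])                    # the targeted observation
proposal_mean, proposal_std = 0.5, 0.5     # round 1 proposal: the prior
for _ in range(3):
    thetas = rng.normal(proposal_mean, proposal_std, size=(500, 1))
    xs = simulator(thetas)
    # Stand-in for training the posterior estimator: keep the simulations
    # whose outputs land closest to the observation of interest.
    keep = np.argsort(np.abs(xs - x_obs).ravel())[:100]
    post = thetas[keep]
    # The next round's proposal is the current posterior estimate at x_obs.
    proposal_mean, proposal_std = post.mean(), post.std()
```

Each round concentrates the simulation budget near the observation, which is the purpose of the modified proposal distribution in SNPE, at the cost of the resulting posterior no longer being amortized over all possible observations.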

Table C1. Summary of the performance of the VICReg and SimCLR methods for inferring cosmological parameters Ω_m and σ_8 from lognormal overdensity maps. The performance of the two methods is evaluated on the test dataset described in Sec. 3.1.1.

Table C2. Summary of the performance of the VICReg and SimCLR methods for inferring cosmological parameters Ω_m and σ_8 from SIMBA and IllustrisTNG total matter density maps, evaluated on the respective test datasets, as described in Sec. 3.2.1.

Scatter plot of predicted means and 1-σ uncertainties of cosmological parameters Ω_m (left) and σ_8 (right) plotted against the true values of the parameters for a subset of 100 maps from the test set. Predictions for the means and variances of the parameters were obtained by training a simple inference neural network on SimCLR summaries.

We use the same setup to train the VICReg encoder and inference network, but decrease the initial learning rate to 5 × 10^-4 when training the inference network.