Robust PCA and MIC statistics of baryons in early mini-haloes

We present a novel approach, based on robust principal components analysis (RPCA) and maximal information coefficient (MIC), to study the redshift dependence of halo baryonic properties. Our data are composed of a set of different physical quantities for primordial minihaloes: dark-matter mass ($M_{\mathrm{dm}}$), gas mass ($M_{\mathrm{gas}}$), stellar mass ($M_{\mathrm{star}}$), molecular fraction (${\mathrm{x_{mol}}}$), metallicity ($Z$), star formation rate (SFR) and temperature. We find that $M_{\mathrm{dm}}$ and $M_{\mathrm{gas}}$ are dominant factors for variance, particularly at high redshift. Nonetheless, with the emergence of the first stars and subsequent feedback mechanisms, ${\mathrm{x_{mol}}}$, SFR and $Z$ start to have a more dominant role. Standard PCA gives three principal components (PCs) capable of explaining more than 97 per cent of the data variance at any redshift (two PCs usually accounting for no less than 92 per cent), whilst the first PC from the RPCA analysis explains no less than 84 per cent of the total variance in the entire redshift range (with two PCs explaining $\gtrsim 95$ per cent at any time). Our analysis also suggests that all the gaseous properties have a stronger correlation with $M_{\mathrm{gas}}$ than with $M_{\mathrm{dm}}$, while $M_{\mathrm{gas}}$ has a deeper correlation with ${\mathrm{x_{mol}}}$ than with $Z$ or SFR. This indicates the crucial role of gas molecular content to initiate star formation and consequent metal pollution from Population III and Population II/I regimes in primordial galaxies. Finally, a comparison between MIC and Spearman correlation coefficient shows that the former is a more reliable indicator when halo properties are weakly correlated.


INTRODUCTION
The standard model of cosmology predicts a structure formation scenario driven by cold dark matter (e.g., Benson 2010), where galaxies form from molecular gas cooling within growing dark matter haloes. Hence, understanding the correlation between different properties of the dark matter haloes is imperative to build up a comprehensive picture of galaxy evolution. Many authors have explored the correlation between dark-halo properties, such as mass, spin and shape, both in low- (e.g., Bett et al. 2007; Hahn et al. 2007; Macciò et al. 2007; Wang et al. 2011) and high-redshift (e.g., Jang-Condell & Hernquist 2001; de Souza et al. 2013a) regimes. Estimating the strength of these correlations is critical to support semi-analytical and halo occupation models, which assume the mass as the determinant factor of the halo properties (e.g., Mo & White 1996; Cooray & Sheth 2002; Berlind et al. 2003; Somerville et al. 2008). Nevertheless, alternative approaches, based on principal components analysis (PCA), found that concentration is a key parameter, contrary to what was expected before (Jeeson-Daniel et al. 2011; Skibba & Macciò 2011), and stressed the need for further investigations.
PCA belongs to a family of techniques ideal to explore high-dimensional data. The method consists in projecting the data into a low-dimensional form, retaining as much information as possible (e.g., Jolliffe 2002). Hence, PCA emerges as a natural technique to investigate correlation and temporal evolution of halo properties. Because of its versatility, PCA has been applied to a broad range of astronomical studies, such as stellar, galaxy and quasar spectra (e.g., Chen et al. 2009; McGurk et al. 2010), galaxy properties (Conselice 2006; Scarlata et al. 2007), Hubble parameter and cosmic star formation (SF) reconstruction (e.g., Ishida et al. 2011; Ishida & de Souza 2011), and supernova (SN) photometric classification (Ishida & de Souza 2013).
⋆ E-mail: rafael.2706@gmail.com
Despite its generality, PCA is not the only way to handle huge data sets, and the growth in complexity of scientific experimental data makes the ability to extract newsworthy and meaningful information an endeavor per se. The yearning for novel methodologies of data-intensive science gave rise to the so-called fourth research paradigm (e.g., Bell et al. 2009). Data mining methods have been used in many areas of knowledge such as genetics (e.g., Venter et al. 2004) and financial marketing decisions (e.g., Shaw et al. 2001), and their importance for astronomy has been recently highlighted as well (e.g., Ball & Brunner 2010; Graham et al. 2013; Krone-Martins et al. 2013; Martínez-Gómez et al. 2014). Like observations, cosmological simulations are continuously increasing in complexity, lessening the distance between observed and synthetic data (e.g., Overzier et al. 2013; de Souza et al. 2013b, 2014). Nonetheless, the application of data mining to cosmological simulations remains a terra incognita.
In this work, we investigate the statistical properties of baryons inside high-redshift haloes, including detailed chemistry, gas physics and stellar feedback. We make use of Robust PCA (RPCA) and maximal information coefficient (MIC) to study a set of various halo parameters. RPCA represents a generalization of the standard PCA, whose advantage is its resilience to outliers and skewed data, while MIC is expected to be the correlation analysis of the 21st century (Speed 2011), in particular due to its ability to quantify general associations between variables. This project represents the first application of MIC to N-body/hydro simulations, and the first use of PCA to explore the low-mass end of the halo mass function and the birth of the first galaxies.
The outline of this paper is as follows. In Section 2, we describe the cosmological simulations and their outcomes. In Section 3, we describe the statistical methods. In Section 4, we present our analysis and main results. Finally, in Section 5, we present our conclusions.

SIMULATIONS
We analyzed the results of a cosmological N-body, hydrodynamical, chemistry simulation based on Biffi & Maio (2013; see also Maio et al. 2010), which was run by means of a modified version of the smoothed-particle hydrodynamics code GADGET2 (Springel 2005). The modifications include a relevant chemical network to self-consistently follow the evolution of e$^-$, H, H$^+$, H$^-$, He, He$^+$, He$^{++}$, H$_2$, H$_2^+$, D, D$^+$, HD, HeH$^+$ (e.g., Yoshida et al. 2003; Maio et al. 2006, 2007, 2009), ultraviolet background radiation, metal pollution according to proper stellar yields (He, C, O, Si, Fe, Mg, S, etc.), lifetimes and stellar population for Population III (Pop III) and Population II/I (Pop II/I) regimes, radiative gas cooling from molecular, resonant and fine-structure transitions (e.g., Maio et al. 2007, and references therein) and stellar feedback (Springel & Hernquist 2003). The transition from the Pop III to the Pop II/I regime is determined by the value of the gas metallicity ($Z$) compared to the critical value $Z_{\mathrm{crit}}$ (e.g., Omukai 2000; Bromm et al. 2001), assumed to be $10^{-4}\,Z_\odot$.
The cosmic field is sampled at redshift $z = 100$, adopting standard cosmological parameters: $\Omega_\Lambda = 0.7$, $\Omega_m = 0.3$, $\Omega_b = 0.04$, $H_0 = 70$ km/s/Mpc and $\sigma_8 = 0.9$. We considered snapshots in the range $9 \lesssim z \lesssim 19$, within a cubic volume of comoving side 0.7 Mpc, and $2 \times 320^3$ particles for the gas and dark-matter species, corresponding to particle masses of 42 and 275 $M_\odot\,h^{-1}$, respectively. The identification of the simulated objects is done by applying a friends-of-friends (FoF) technique with linking length equal to 20 per cent of the mean interparticle separation, and substructures are identified by using a SUBFIND algorithm, which discriminates between bound and non-bound particles. The halo characteristics, such as position, velocity, dark matter and baryonic properties, are computed and stored at each redshift.
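The FoF step can be illustrated with a minimal sketch. This is not the halo finder used for the simulation: it is a brute-force, non-periodic toy in pure Python, and the particle positions, the unit box and the helper `cube_corners` are hypothetical. Only the linking length of 20 per cent of the mean interparticle separation is taken from the text.

```python
import math

def friends_of_friends(points, linking_length):
    """Return groups (lists of point indices) linked by chains of close neighbours."""
    parent = list(range(len(points)))

    def find(i):
        # Walk up to the root of i's group, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Brute-force O(n^2) pair search; real halo finders use trees or grids.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) < linking_length:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values(), key=len, reverse=True)

def cube_corners(origin, side):
    """Eight corners of a small cube: a stand-in for a compact clump (toy helper)."""
    ox, oy, oz = origin
    return [(ox + dx, oy + dy, oz + dz)
            for dx in (0.0, side) for dy in (0.0, side) for dz in (0.0, side)]

# Hypothetical toy box of unit side: two tight clumps plus 11 isolated particles.
points = cube_corners((0.10, 0.10, 0.10), 0.01) + cube_corners((0.80, 0.80, 0.80), 0.01)
points += [(0.5, 0.05 + 0.08 * k, 0.5) for k in range(11)]

box_side = 1.0
mean_separation = box_side / len(points) ** (1.0 / 3.0)
linking_length = 0.2 * mean_separation  # 20 per cent of the mean separation

groups = friends_of_friends(points, linking_length)
group_sizes = [len(g) for g in groups]
```

In this toy configuration the two clumps are recovered as two groups of eight particles, while the isolated particles remain single-member groups. The sketch ignores periodic boundaries and the unbinding step performed by SUBFIND.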

Data set
The total dataset is composed of a few thousand haloes at very high redshift, $z \approx 19$, and reaches about 25000 primordial objects at $z \approx 9$. In order to avoid numerical artifacts created by a poor number of gas particles (Bate & Burkert 1997), we selected only those structures in which the gas content is resolved with at least 300 gas particles. This usually corresponds to selecting only objects with a total number of particles of at least $\sim 10^3$. The remaining data are therefore composed of $\approx 1680$ haloes in the whole redshift range, of which $\approx 200$ are at $z = 9$.
Fig. 1 shows the probability distribution function (PDF) for the seven halo parameters, $M_{\mathrm{dm}}$, $M_{\mathrm{gas}}$, $M_{\mathrm{star}}$, SFR, $T$, ${\mathrm{x_{mol}}}$ and $Z$, at each redshift, portrayed as a violin plot. The centre of each violin represents the median of the distribution, while its shape represents the mirrored PDF. A visual inspection of Fig. 1 indicates the first stages of significant SF activity around $z = 17$, giving rise to a subsequent boost in metal enrichment at $z \lesssim 15$, and a similar growth of $M_{\mathrm{star}}$ in the same redshift range. Just after this episode, we can see the rapid spread in the ${\mathrm{x_{mol}}}$ variance, peaking a few orders of magnitude above average. The masses of the haloes range between $10^5\,M_\odot \lesssim M_{\mathrm{dm}} \lesssim 10^8\,M_\odot$ and $10^4\,M_\odot \lesssim M_{\mathrm{gas}} \lesssim 10^7\,M_\odot$. Typical temperatures range from 500 to $10^4$ K, where H$_2$ shapes the thermal conditions of early objects. Hotter temperatures are due to the thermal effects of SN explosions that heat and enrich the gas in nearby smaller haloes.
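As a quick consistency check, the resolution cut can be written as a simple filter. The sketch below uses the threshold of 300 gas particles and the gas particle mass of 42 $M_\odot\,h^{-1}$ quoted in the text; the halo catalogue and its field names are hypothetical.

```python
GAS_PARTICLE_MASS = 42.0   # M_sun / h, gas particle mass quoted in the text
MIN_GAS_PARTICLES = 300    # resolution threshold quoted in the text

def is_resolved(n_gas_particles, threshold=MIN_GAS_PARTICLES):
    """True if the gas content of a halo is sampled by enough particles."""
    return n_gas_particles >= threshold

# Hypothetical catalogue entries (identifiers and particle counts are made up).
haloes = [
    {"id": 0, "n_gas": 120},
    {"id": 1, "n_gas": 300},
    {"id": 2, "n_gas": 4500},
]

resolved_ids = [h["id"] for h in haloes if is_resolved(h["n_gas"])]

# Smallest gas mass that survives the cut, in M_sun / h.
min_gas_mass = MIN_GAS_PARTICLES * GAS_PARTICLE_MASS
```

The implied minimum gas mass, $300 \times 42 \simeq 1.3 \times 10^4\,M_\odot\,h^{-1}$, is consistent with the lower end of the gas mass range quoted above.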

Robust Principal Components Analysis.
The ultimate goal of PCA is to reduce the dimensionality of a multivariate data set, while explaining the data variance with as few principal components (PCs) as possible. PCA belongs to a class of Projection-Pursuit (PP; e.g., Croux et al. 2007) methods, whose aim is to detect structures in multidimensional data by projecting them into a lower-dimensional subspace (LDS). The LDS is selected by maximizing a projection index (PI), where PI represents an interesting feature in the data (trends, clusters, hyper-surfaces, anomalies, etc.). The particular case where variance ($S^2$) is taken as a PI leads to the classical version of PCA$^{3}$. Given $n$ measurements $x_1, \cdots, x_n$, all of them column vectors of dimension $\Gamma$, the first PC is obtained by finding a unit vector $a$ which maximizes the variance of the data projected on it:

$$a_1 = \underset{\|a\|=1}{\arg\max}\; S^2\left(a^t x_1, \cdots, a^t x_n\right), \qquad (1)$$

where $t$ is the transpose operation and $a_1$ is the direction of the first PC$^{4}$. Once we have computed the $(k-1)$th PC, the direction of the $k$th component, for $1 < k \leq \Gamma$, is given by

$$a_k = \underset{\|a\|=1,\; a \perp a_1, \cdots, a \perp a_{k-1}}{\arg\max}\; S^2\left(a^t x_1, \cdots, a^t x_n\right), \qquad (2)$$

where the condition that each PC be orthogonal to all previous ones ensures a new uncorrelated basis. In spite of these attractive properties, PCA has some critical drawbacks, such as sensitivity to outliers (e.g., Hampel et al. 2005) and the inability to deal with missing data (e.g., Xu et al. 2010). To overcome the sensitivity to outliers, several robust versions were created based on the PP principle. Instead of taking the variance as a PI in equation (1), a robust$^{5}$ measure of variance is taken. Hereafter, we will refer to the standard variance as $S^2_{\mathrm{sd}}$ and to the robust variance as $S^2_{\mathrm{MAD}}$. Two common measures of robust variance (Hoaglin et al. 2000) are the median absolute deviation (MAD; e.g., Howell 2005), and the first quartile of the pairwise differences between all data points (Q; e.g., Rousseeuw & Croux 1993),

$$\mathrm{MAD}(\kappa_1, \cdots, \kappa_n) = 1.48\; \underset{j}{\mathrm{med}}\left|\kappa_j - \underset{i}{\mathrm{med}}\,\kappa_i\right|, \qquad (3)$$

$$Q(\kappa_1, \cdots, \kappa_n) = 2.22\; \left\{|\kappa_i - \kappa_j|;\; 1 \leq i < j \leq n\right\}_{\left(\binom{n}{2}/4\right)}, \qquad (4)$$

where $\{\kappa_1, \cdots, \kappa_n\}$ is a given univariate dataset, the subscript in equation (4) denotes the corresponding order statistic of the sorted pairwise differences, and the square of MAD or Q gives a robust variance$^{6}$. Hereafter all calculations of the PCs are performed using the grid search base algorithm (Croux et al. 2007) with MAD, but using Q has no influence on our results. Also note that before applying the PCA, we standardize the halo properties by subtracting the mean and dividing by the standard deviation. Therefore we are formally using the correlation matrix, which can be seen as the covariance matrix of standardized variables.

$^{3}$ The PCs are computed by diagonalization of the data covariance matrix ($\Sigma^2$), with the resulting eigenvectors corresponding to PCs and the resulting eigenvalues to the variance explained by the PCs. The eigenvector corresponding to the largest eigenvalue gives the direction of greatest variance (PC1), the second largest eigenvalue gives the direction of the next highest variance (PC2), and so on. Since covariance matrices are symmetric positive semidefinite, the eigenbasis is orthonormal (spectral theorem).
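The effect of swapping the projection index can be illustrated with a minimal two-dimensional sketch. It is not the grid search algorithm of Croux et al. (2007): it simply scans unit vectors on a one-degree angular grid and keeps the direction that maximizes either the squared standard deviation or the squared MAD. The function names and the toy data (a bulk elongated along the horizontal axis plus two extreme points along the vertical axis) are assumptions made for illustration only.

```python
import math
from statistics import median

def standard_deviation(values):
    """Classical (non-robust) sample standard deviation."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))

def mad(values):
    """Median absolute deviation, scaled to match sigma for Gaussian data."""
    centre = median(values)
    return 1.4826 * median(abs(v - centre) for v in values)

def first_direction(points, scale, n_angles=180):
    """Scan unit vectors in 2D and return the angle (degrees) maximising scale**2."""
    best_angle, best_index = 0.0, -1.0
    for k in range(n_angles):
        angle = math.pi * k / n_angles
        a = (math.cos(angle), math.sin(angle))
        projected = [a[0] * x + a[1] * y for x, y in points]
        index = scale(projected) ** 2          # projection index
        if index > best_index:
            best_angle, best_index = angle, index
    return math.degrees(best_angle)

# Toy data (hypothetical): a bulk stretched along x, plus two extreme points along y.
bulk = [(float(j), 0.5 if j % 2 == 0 else -0.5) for j in range(-10, 11)]
outliers = [(0.0, 60.0), (0.0, -60.0)]
points = bulk + outliers

classical_angle = first_direction(points, standard_deviation)  # pulled towards the outliers
robust_angle = first_direction(points, mad)                    # stays close to the bulk axis
```

With the standard deviation as index, the two extreme points drag the first direction to the vertical axis; with MAD the direction stays within a few degrees of the long axis of the bulk. This is the behaviour that motivates the robust variant, and also the reason why rare but physically real objects receive less weight.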

Maximal information coefficient.
The maximal information-based non-parametric exploration (MINE) statistics represent a novel family of techniques to identify and characterize general relationships in data sets (Reshef et al. 2011). MINE introduces MIC as a new measure of dependence between two variables, which possesses two desired properties for data exploration: (i) generality, the ability to capture a broad range of associations and functional relationships$^{7}$; (ii) equitability, the ability to give similar scores to equally noisy relationships of different types$^{8}$.
MIC measures the strength of general associations, based on the mutual information$^{9}$ (MI) between two random variables $A$ and $B$:$^{10}$

$$\mathrm{MI}(A, B) = \sum_{a \in A} \sum_{b \in B} p(a, b) \log\left(\frac{p(a, b)}{p(a)\,p(b)}\right), \qquad (5)$$

where $p(a)$ and $p(b)$ are the marginal PDFs of $A$ and $B$, and $p(a, b)$ is the joint PDF. Consider $D$ a finite set of ordered pairs, $\{(a_i, b_i), i = 1, \ldots, n\}$, partitioned into an $x$-by-$y$ grid of variable size, $G$, such that there are $x$ bins spanning $a$ and $y$ bins covering $b$. The PDF of a particular grid cell is proportional to the number of data points inside that cell. We can define a characteristic matrix $M(D)$ of a set $D$ as

$$M(D)_{x,y} = \frac{\max_{G} \mathrm{MI}(D|_G)}{\log \min(x, y)}, \qquad (6)$$

representing the highest normalized mutual information of $D$ over all $x$-by-$y$ grids. The MIC of a set $D$ is then defined as

$$\mathrm{MIC}(D) = \max_{0 < xy < B(n)} M(D)_{x,y}, \qquad (7)$$

representing the maximum value of $M$ subject to $0 < xy < B(n)$, where the function $B(n) \equiv n^{0.6}$ was empirically determined by Reshef et al. (2011)$^{11}$.

$^{4}$ The $\arg\max$ operator returns the argument at which a function attains its largest value.
$^{5}$ Robust statistics commonly use median and median absolute deviation, instead of mean and standard deviation, in order to be resistant against outliers.
$^{6}$ When the PI is the standard variance, the first PC is the eigenvector of the data covariance matrix corresponding to the largest eigenvalue. But this does not hold for general choices of variance, and approximate algorithms are necessary.
$^{7}$ For comparison, the Pearson coefficient measures the linear correlation between two variables, while the Spearman coefficient ($R_s$) measures the strength of monotonicity between paired data.
$^{8}$ In benchmark tests, MIC equitability behaves better than other methods such as mutual information estimation, distance correlation and $R_s$. A lack of equitability introduces a strong bias and entire classes of relationships may be missed (Reshef et al. 2013).
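To make the definition concrete, the sketch below computes a simplified MIC-like score. It is not the MINE algorithm of Reshef et al. (2011): instead of optimizing the grid at each resolution, it only tries equal-frequency bins on both axes, so the result is a lower bound on the maximum over grids. The bound $xy < n^{0.6}$ and the normalization by $\log \min(x, y)$ follow the text; the function names and the toy data are assumptions.

```python
import math
import random

def equal_frequency_bins(values, n_bins):
    """Label each value with one of n_bins bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * n_bins // len(values)
    return labels

def mutual_information(labels_a, labels_b):
    """Plug-in mutual information (in nats) of two discretised variables."""
    n = len(labels_a)
    joint, count_a, count_b = {}, {}, {}
    for a, b in zip(labels_a, labels_b):
        joint[(a, b)] = joint.get((a, b), 0) + 1
        count_a[a] = count_a.get(a, 0) + 1
        count_b[b] = count_b.get(b, 0) + 1
    return sum(c / n * math.log(c * n / (count_a[a] * count_b[b]))
               for (a, b), c in joint.items())

def mic_lower_bound(xs, ys, exponent=0.6):
    """Largest normalised MI over equal-frequency grids with nx * ny < n**exponent."""
    budget = len(xs) ** exponent
    best = 0.0
    nx = 2
    while nx * 2 < budget:
        bins_x = equal_frequency_bins(xs, nx)
        ny = 2
        while nx * ny < budget:
            mi = mutual_information(bins_x, equal_frequency_bins(ys, ny))
            best = max(best, mi / math.log(min(nx, ny)))
            ny += 1
        nx += 1
    return best

# Toy relationships (hypothetical data, n = 200 points each).
xs = [-1.0 + 2.0 * i / 199 for i in range(200)]
linear_score = mic_lower_bound(xs, [2.0 * x + 1.0 for x in xs])
parabola_score = mic_lower_bound(xs, [x * x for x in xs])   # non-monotonic relation

rng = random.Random(0)
noise_score = mic_lower_bound([rng.random() for _ in range(200)],
                              [rng.random() for _ in range(200)])
```

The noiseless linear relation reaches a score of 1, the parabola still receives a clearly non-zero score although it is not monotonic, and independent noise scores low. Because only equal-frequency grids are tried, the parabola score underestimates what the full grid optimization would return.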

RESULTS
Hereafter we discuss the relations between halo properties and their relative importance. Our matrix is composed of 1680 haloes, spanning the redshift range $9 \lesssim z \lesssim 19$, with $\approx 200$ (30) haloes at $z = 9$ (19), each halo containing at least $\sim 10^3$ particles. Each row of the matrix represents a halo and each column represents one of the halo properties. PCA probes the entire matrix at once. On the other hand, MIC is a pair-variable comparison, therefore requiring $N(N-1)/2$ operations, with $N$ being the number of halo properties. It is worth highlighting that each approach has its own advantages and disadvantages. PCA is suitable for high-dimensional data, when a pair comparison becomes unfeasible; however, the method only searches for linear relationships. MIC, instead, finds general associations in data structures, but may be impractical to deal with a large number of parameters.
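For the seven halo properties considered here, the number of pairwise comparisons is easy to enumerate. The short sketch below only illustrates the $N(N-1)/2$ scaling; the property labels are informal identifiers chosen for this example.

```python
from itertools import combinations

# Informal labels for the seven halo properties listed in the text.
properties = ["M_dm", "M_gas", "M_star", "SFR", "T", "x_mol", "Z"]

pairs = list(combinations(properties, 2))   # every unordered pair, once
n = len(properties)
n_pairs = n * (n - 1) // 2                  # N(N-1)/2

print(len(pairs))  # 21
```

The quadratic growth is harmless for seven properties, but it is the reason why pairwise measures become impractical for data sets with many more parameters.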

PCA
In order to better understand the pros and cons of using RPCA, we start the analysis with the standard PCA. Fig. 2 shows the contribution of the first three PCs to $S^2_{\mathrm{sd}}$, as a function of redshift. Three PCs account for more than 97 per cent of $S^2_{\mathrm{sd}}$ at any redshift, while two PCs explain more than 92 per cent except at $z \simeq 14$, when the contribution drops to 85 per cent.
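The fraction of variance carried by each PC follows from the eigenvalues of the correlation matrix of the standardized variables. The sketch below is a minimal pure-Python illustration under stated assumptions: the eigenvalues are obtained with cyclic Jacobi rotations (a choice made here for self-containment, not necessarily the method used for the paper), and the three columns of toy values are hypothetical.

```python
import math

def standardize(column):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (len(column) - 1))
    return [(v - mean) / sd for v in column]

def correlation_matrix(columns):
    """Covariance matrix of the standardised columns."""
    z = [standardize(c) for c in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]

def jacobi_eigenvalues(matrix, sweeps=50, tol=1e-24):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations, largest first."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0.0 else 0.5 * math.atan(2.0 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):      # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):      # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

def explained_variance(columns):
    """Fraction of the total variance carried by each PC, largest first."""
    eigenvalues = jacobi_eigenvalues(correlation_matrix(columns))
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]

# Hypothetical toy table: three logarithmic halo properties for six haloes.
log_m_dm = [5.2, 5.9, 6.4, 6.8, 7.3, 7.9]
log_m_gas = [4.3, 5.1, 5.4, 6.0, 6.3, 7.1]
log_x_mol = [-5.0, -3.2, -4.6, -3.9, -3.0, -4.1]

fractions = explained_variance([log_m_dm, log_m_gas, log_x_mol])
```

For two standardized variables with correlation $r$, the same procedure gives the closed-form fractions $(1 + |r|)/2$ and $(1 - |r|)/2$, which provides a simple sanity check: tightly correlated properties, such as the two masses, are compressed almost entirely into the first PC.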
The sharp variation of the PCs around $z \simeq 14-16$ acts as a smoking gun for a global cosmological event. Indeed, this is a direct consequence of the first SF episodes and the interplay between chemical and mechanical feedback from the first stars, which takes place around $z \simeq 15-20$ (e.g., Maio et al. 2010). As molecules are produced over time, they lead to gas collapse, stellar formation and metal pollution, with consequent back reaction on the thermal behavior of the surrounding gas (see e.g., Biffi & Maio 2013). This redshift range represents an epoch of fast and turbulent growth of the metal filling factor, from $\sim 10^{-18}$ at $z \simeq 15$ to $\approx 10^{-12}$ at $z \simeq 14$ (see Fig. 1 from ). At the beginning, only the gas at high densities is affected by metal enrichment, due to SF concentration in these regions. As SF and metal spreading proceed, the surrounding lower density environments are affected as well. SNe heat high-density gas within star-forming sites and, consequently, hot low-density gas is ejected from star-forming regions by SN winds.
The contribution of each PC dramatically changes if we use RPCA instead. The clearest advantage is the amount of variance explained by each component (hereafter, when necessary to avoid ambiguity, the PCs from the RPCA analysis will be referred to as RPCs). RPC1 accounts for no less than $\approx 84$ per cent of $S^2_{\mathrm{MAD}}$ at any time, whilst two RPCs account for more than $\approx 95$ per cent. Moreover, the RPC2 contribution mostly stands out at $13 < z < 17$ and $z \lesssim 10$. Albeit contributing differently to the total variance, the general behavior of PC1 and PC2 is similar to that of RPC1 and RPC2, as is the physical interpretation. But RPCA assigns less weight to the baryonic properties, suggesting the halo mass as the most significant factor. This difference occurs because even a small fraction of large errors can cause arbitrary corruption in PCA's estimate. For instance, PCA is more sensitive to rapid variations of the halo chemical properties, showing a steeper reaction in its first PCs.

$^{9}$ Mutual information measures the general interdependence between two variables, while the correlation function measures the linear dependence between them (e.g., Li 1990).
$^{10}$ MIC tends to 1 for all never-constant noiseless functional relationships and to 0 for statistically independent variables.
$^{11}$ The 0.6 exponent value represents a compromise, since high values of $B(n)$ lead to non-zero scores even for random data, as each point gets its own cell, while low values only probe simple patterns.
Thus, as expected, RPCA surpasses PCA in its ultimate goal: reducing the system dimensionality. Nevertheless, this greater power to synthesize information carries the assumption that outliers are caused by corrupted data, which is not always the case. This potential drawback will be better understood by looking at the contribution of each variable to the $k$th PC, as discussed in the following.
Fig. 3 shows the relative contribution of each parameter to the first three PCs (RPCs) on the left (right) side. For the PCA case, $M_{\mathrm{dm}}$ and $M_{\mathrm{gas}}$ dominate PC1 at $z > 14$ (no less than $\sim 62$ per cent), followed by a smaller contribution of SFR and ${\mathrm{x_{mol}}}$. Nevertheless, as gas collapses into potential wells, the relative contribution from $M_{\mathrm{gas}}$ increases, surpassing $M_{\mathrm{dm}}$ at $z \approx 15$. The dominant contribution of $Z$ and ${\mathrm{x_{mol}}}$ to PC1 at $z \approx 14$ indicates a critical epoch for the cosmic chemical enrichment (see also discussion above), triggered by a rapid variation of ${\mathrm{x_{mol}}}$, followed by widespread metal pollution at $z \approx 13$. After a decline in the chemical enrichment rate, a second peak in $Z$ occurs at $z \approx 10$. This self-regulated, oscillatory behavior is caused by the simultaneous coexistence of cold pristine-gas inflows and hot metal-enriched outflows that create hydro instabilities and turbulent patterns with Reynolds numbers $\sim 10^8 - 10^{10}$ (see e.g. Fig. 2 from ). Finally, at $z = 9$, $M_{\mathrm{dm}}$ and $M_{\mathrm{gas}}$ have become almost subdominant, since PC1 is mainly led by $T$ and $Z$, as a result of the ongoing cosmic heating from SF and thermal feedback. The dominance of $T$ in PC1 at this redshift is due to the presence of some small (see Fig. 1), high-temperature objects, whose properties are contaminated by hot enriched material at $T \gtrsim 10^5$ K.
An inspection of PC2 reveals the supporting roles during the galaxy formation process. The PC1 peak in $Z$ at redshift 13 is preceded by a strong contribution of SFR and halo masses to PC2, while the second PC1 peak in $Z$, around $z \simeq 10$, is anticipated by an increasing contribution to PC2 from the formed stars, which later explode as SNe and start the metal enrichment of the Universe. The first rise of PC2 at $z \gtrsim 14$, dominated by SFR, occurs because the protogalaxies at this epoch are experiencing the first bursts of SF. Nevertheless, not all of them have necessarily formed stars already. The second peak, instead, is composed of a more balanced contribution from SFR and $M_{\mathrm{star}}$. The oscillatory behavior might be caused by the competitive effects of different feedback mechanisms: the gas undergoing SF is heated by SN explosions and is inhibited from continuously forming stars (mostly in smaller structures that suffer significant gas evaporation processes), while shock compressions and spreading of metals in the medium enhance gas cooling capabilities and consequently induce more SF. The latter preferentially occurs in bigger objects that can keep and re-process their metals because of the deeper potential wells.
PC3 is nearly negligible in the whole redshift range aside from $z = 14$, where ${\mathrm{x_{mol}}}$ dominates the general behavior. This epoch is preceded by a significant contribution from $M_{\mathrm{star}}$ at $z = 15$. A comparison with Fig. 1 reveals that this behavior coincides with a growth in the ${\mathrm{x_{mol}}}$ variance at the same redshift. This indicates a transition in the regular trend of increasing ${\mathrm{x_{mol}}}$ with increasing mass at $z \sim 15 - 16$, when initial collapse phases boost ${\mathrm{x_{mol}}}$ up to $10^{-3}$. This rapid growth of ${\mathrm{x_{mol}}}$ preferentially occurs in galaxies of $\sim 10^5 - 10^6\,M_\odot$ that are forming their first stars and have not been previously affected by feedback mechanisms. At $z \lesssim 15$, feedback effects from Pop III forming galaxies become responsible for increasing the variance of ${\mathrm{x_{mol}}}$ by several orders of magnitude, either by dissociating molecules, or by partially enhancing their formation by shocks and gas compression (e.g., Ricotti et al. 2001; Whalen et al. 2008; Petkova & Maio 2012).
Looking at the RPCA, RPC1 is dominated by halo masses during all cosmic evolution (no less than 68 per cent), with other baryonic properties relegated to RPCs of higher orders. Some caution is needed to interpret these results. The higher level of compressibility presented by RPCA is a direct consequence of attributing a smaller weight to rare events. Therefore, if one intends to describe all halo properties using the fewest parameters possible, RPCA appears to succeed, since it states that, as a first approximation, the total halo mass is the main factor to describe all other properties. The mass determines the potential well and consequently the ability of the halo to form stars, retain the metals, etc., therefore roughly dictating the baryonic dynamics at first sight. Since RPCA ascribes a lower weight to the tails of each parameter distribution, the physical interpretation may become less evident for the highest RPCs. However, we can still see the importance of $Z$, ${\mathrm{x_{mol}}}$ and SFR, with the difference that now they are considered second-order effects, hence starting to be dominant from RPC2 onward. To better understand these differences between RPCA and PCA, we discuss the strength with which each variable is related to one another as follows.
Fig. 4 shows how the seven halo properties correlate with each other. The main diagonal of Fig. 4 shows the density distribution of each variable at different redshifts (a zoomed version of the half-violin presented in Fig. 1). The majority of the parameters have a well-behaved distribution, with small variations in shape during the cosmic evolution, while quantities related to the stellar feedback ($M_{\mathrm{star}}$, SFR, $Z$) have their distribution shaped during the transition from a regime without SF activity at $z \gtrsim 16$ to the burst of SFR around $z \lesssim 15$. The lower triangular part of the panel shows scatter plots for each variable combination, colored according to their redshift.
Fig. 5 shows MIC and $R_s$ for each combination of parameters as a function of redshift. At high redshift, due to the poor statistics (fewer than 30 haloes at $z = 19$, with a considerable number of null parameters), most variables are uncorrelated, receiving a low score by both $R_s$ and MIC. As expected, $M_{\mathrm{gas}}$, $M_{\mathrm{dm}}$ and $T$ are strongly correlated, receiving the highest values. This is consistent with the fact that PC1 dominates at $z > 16$ and is basically dictated by $M_{\mathrm{dm}}$ and $M_{\mathrm{gas}}$. The result suggests that at higher redshifts, haloes are much simpler objects and their properties are basically controlled by their masses. Comparing with Fig. 3, it seems that the correlation between halo mass and $T$ shows a better agreement with RPCA, which makes $T$ a factor almost as important as $M_{\mathrm{gas}}$ and $M_{\mathrm{dm}}$ in the determination of RPC1. The molecular content, which is directly dependent on the local gas density and $T$, shows a correlation with $Z$ that increases at lower redshifts until $z \approx 12$. This trend is in agreement with the dominance of ${\mathrm{x_{mol}}}$ and $Z$ on PC1 and RPC2 at $z \approx 13 - 14$, caused by the increase in the contribution of the SFR to PC2 and RPC2 at earlier redshifts. At $z \gtrsim 13 - 14$, ${\mathrm{x_{mol}}}$ keeps a regular trend of increasing with halo mass. Nevertheless, the SF activity at $z \lesssim 13$ leads to a dispersion of ${\mathrm{x_{mol}}}$ followed by a metal enrichment process, as discussed in Section 4. Also, $M_{\mathrm{gas}}$ shows a stronger correlation with ${\mathrm{x_{mol}}}$ than with other quantities like SFR and $Z$, which indicates the crucial role of ${\mathrm{x_{mol}}}$ to initiate SF and consequent metal pollution from Pop III and Pop II/I regimes in primordial galaxies. Comparing with Fig. 3, we see that RPCA better captures this effect. At high redshift, with the exception of $z = 16$, where the peak in RPC2 is caused by the first stages of metal enrichment (Fig. 1), ${\mathrm{x_{mol}}}$ maintains a dominant contribution to RPC2, together with halo mass. The correlation of SFR with $M_{\mathrm{gas}}$ and $M_{\mathrm{dm}}$ is roughly linear and strengthens at later times. This may be explained by the wider spread of SFR in low-mass haloes at $z \gtrsim 14$, which is caused by gas evaporation processes due to SN explosions, in contrast with later structures that have a more sustained SF activity. Although both PCA and RPCA are sensitive to this effect, RPCA ascribes a lower weight to the SFR than to ${\mathrm{x_{mol}}}$, in accordance with the correlation analysis.

MIC
A surprising disagreement between MIC and $R_s$ appears when comparing $Z$, $M_{\mathrm{star}}$ and SFR. $R_s$ suggests a nearly perfect correlation between $Z$ and $M_{\mathrm{star}}$, while MIC finds no significant association at the highest redshifts. This highlights the robustness of MIC with skewed and sparse data. In this redshift range, $z \gtrsim 14$, there are very few haloes with non-null $Z$ and $M_{\mathrm{star}}$ values (Fig. 1). Therefore, the high $R_s$ score for these two quantities is misleading, as confirmed by a visual inspection of their corresponding distributions (Figs. 1 and 4). The same argument holds for the comparisons $Z$-SFR and $M_{\mathrm{star}}$-SFR. During the course of cosmic evolution, though, the correlations between the properties of the haloes tighten and both $R_s$ and MIC converge for most of them at $z = 10$ (with $R_s$ slightly overestimating the strength of correlation compared to MIC), as shown in Fig. 5.
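A toy calculation shows how a rank correlation can be inflated when most values are null. The sketch below computes the Spearman coefficient as the Pearson correlation of tie-averaged ranks for a hypothetical sample of 30 haloes, of which only two have non-null $Z$ and $M_{\mathrm{star}}$; the numbers are invented for illustration and are not taken from the simulation.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1        # mean of the ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def spearman(xs, ys):
    """Spearman coefficient: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical sample: 28 haloes without stars or metals, 2 enriched haloes.
metallicity = [0.0] * 28 + [1e-5, 1e-3]
stellar_mass = [0.0] * 28 + [2e2, 5e3]
rs_sparse = spearman(metallicity, stellar_mass)

# Reversing the order of the only two informative haloes barely changes the score.
rs_swapped = spearman(metallicity, [0.0] * 28 + [5e3, 2e2])
```

In this example the coefficient is essentially 1, and it stays above 0.99 even when the two enriched haloes are anti-ordered, because the score is driven by the block of tied null values rather than by any trend among the enriched objects.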

CONCLUSIONS
We investigate the redshift evolution of the gas properties of primordial galaxies using RPCA and MIC statistics, making a comprehensive comparison with standard approaches. This is the first attempt to probe the baryon properties of early mini-haloes and the effects of feedback processes by means of a highly solid statistical approach. We explore the correlation of different baryonic properties as expected from numerical N-body, hydrodynamical, chemistry simulations including gas molecular and atomic cooling, SF, stellar evolution, metal spreading and feedback effects.
The wide range of redshifts analyzed here ($9 \lesssim z \lesssim 19$) allowed us to perform an unprecedented study of the temporal evolution of the PC contribution to the total variance of the halo properties. The standard PCA needs two PCs to explain more than 92 per cent of the data variance (in the greater part of redshifts studied here), with PC1 dropping below 50 per cent at lower redshifts. The first RPC from the RPCA analysis explains no less than 84 per cent of all data variance at any time, with the first two RPCs explaining more than 95 per cent of the total robust variance. First SF episodes and feedback mechanisms cause a drop of PC1 at $z \sim 14$, when a sharp variation in the PCs behavior marks the onset of cosmic metal enrichment. At $z > 14$ the halo properties are basically dictated by the halo mass. Among the advantages of RPCA is its greater capability to reduce the dimensionality of the original dataset, although at the cost of being less sensitive to rare events that may be physically relevant.
RPCA ranks the contribution of variables to the RPCs in better agreement with their levels of correlation, and is therefore more consistent with our independent MIC and $R_s$ correlation analysis.
An inspection of the first and second PCs reveals some interesting facts. The PC1 peak in $Z$ at redshift 13 is preceded by a strong contribution of SFR and halo masses to PC2, while the second PC1 peak in $Z$, around $z \simeq 10$, is anticipated by an increasing contribution to PC2 by the formed stars, which later explode as SNe and enrich the Universe. This indicates the importance of stellar evolution in shaping baryon properties in primordial haloes. A similar trend holds for RPCA, although attenuated by the smoothing effect created by the use of robust statistics. It is important to note, however, that the relatively small number of haloes studied here might lessen the robustness of our results at very high redshifts. Therefore, future investigations applying similar techniques to larger simulation boxes are highly recommended.
Overall, $R_s$ agrees reasonably with MIC, but MIC seems to be more robust for studying highly sparse data regimes (like at early epochs). All gas properties, aside from $M_{\mathrm{gas}}$, $M_{\mathrm{dm}}$ and $T$, are weakly correlated at high redshift. Nevertheless, due to the interplay between chemical and mechanical feedback from the ongoing stellar formation and the consequent back reaction on the thermal behavior of the surrounding medium, baryonic quantities start to present a moderate to high level of correlation as redshift decreases. In particular, ${\mathrm{x_{mol}}}$ shows the highest level of correlation with $M_{\mathrm{gas}}$, followed by $T$, SFR, $M_{\mathrm{star}}$ and $Z$, respectively. In general, structure formation processes depend not only on the dark matter halo properties, but also on the local thermodynamical state of the gas, which is, in turn, affected by cooling, SF and feedback. Our analysis suggests that all the gaseous properties have a stronger correlation with $M_{\mathrm{gas}}$ than with $M_{\mathrm{dm}}$, while $M_{\mathrm{gas}}$ has a deeper correlation with ${\mathrm{x_{mol}}}$ than with $Z$ or SFR. The relevance of the molecular content for the baryon properties represents the physical origin of gas collapse and concentration, crucial to initiate SF and consequent metal pollution from Pop III and Pop II/I regimes in primordial galaxies.
This work represents a leap forward in the statistical analysis of N-body/hydro simulations, performed by means of RPCA and MIC in a cosmological context. We therefore stress that the use of dimensionality reduction algorithms and mutual information based techniques in numerical simulations might be a precious instrument for future investigations, thanks to their potential to unveil non-trivial relationships, which may go undetected by standard methods.