Connection Between SDSS Galaxies and ELUCID Subhaloes in the Eye of Machine Learning

We explore the feasibility of learning the connection between SDSS galaxies and ELUCID subhaloes with random forest (RF). ELUCID is a constrained $N$-body simulation constructed using the matter density field of SDSS. Based on an SDSS-ELUCID matched catalogue, we build RF models that predict $M_r$ magnitude, colour, stellar mass $M_*$, and specific star formation rate (sSFR) from several subhalo properties. While the RF can predict $M_r$ and $M_*$ with reasonable accuracy, the prediction accuracy of colour and sSFR is low, which could be due to the mismatch between galaxies and subhaloes. To test this, we shuffle the galaxies in subhaloes of narrow mass bins in the local neighbourhood using galaxies of a semi-analytic model (SAM) and the TNG hydrodynamic simulation. We find that the shuffling only slightly reduces the colour prediction accuracy in SAM and TNG, which is still considerably higher than that of the SDSS. This suggests that the true connection between SDSS colour and subhalo properties could be weaker than that in the SAM and TNG without the mismatch effect. We also measure the Pearson correlation coefficient between galaxy properties and the subhalo properties in SDSS, SAM, and TNG. Similar to the RF results, we find that the colour-subhalo correlation in SDSS is lower than in both the SAM and TNG. We also show that the galaxy-subhalo correlations depend on subhalo mass in the galaxy formation models. Advanced surveys with more faint galaxies will provide new insights into the galaxy-subhalo relation in the real Universe.


INTRODUCTION
Understanding the formation and evolution of galaxies is a crucial aspect of modern cosmology. In recent years, large-volume galaxy surveys such as the Sloan Digital Sky Survey (SDSS, York et al. 2000), SDSS-III (Eisenstein et al. 2011) and SDSS-IV (Dawson et al. 2016), and the Dark Energy Spectroscopic Instrument (DESI, DESI Collaboration et al. 2016) have provided high-precision measurements of galaxy observables, leading to significant progress in this field. Since galaxies are believed to form within dark matter haloes, studying the connection between them can provide valuable insights into galaxy formation and evolution. However, unlike galaxy properties such as magnitude and colour, which can be observed directly, the inner structure and formation histories of dark matter haloes are challenging to measure through observations.
In contrast, the formation histories of dark matter haloes and subhaloes can be easily traced through $N$-body simulations, which evolve dark matter particles under gravity (Springel et al. 2005; Prada et al. 2012; Wang et al. 2020). To simulate galaxies, semi-analytic models (SAM) of galaxy formation processes can be implemented on the subhalo merger tree extracted from $N$-body simulations (Guo et al. 2011, 2013; Croton et al. 2016; Cora et al. 2018). Furthermore, hydrodynamic simulations are developed to produce galaxies in dark matter haloes by adding baryonic particles alongside the dark matter particles (Vogelsberger et al. 2014; Schaye et al. 2015; Nelson et al. 2015, 2019). Both SAM and hydrodynamic simulations can be tuned to reproduce statistical galaxy observables such as abundance and clustering. However, since the galaxy formation processes are not yet fully understood, simulated galaxies may still deviate from those in the real Universe. Additionally, it is difficult to compare simulated galaxies individually with the real ones, as the one-to-one correspondence between them is not guaranteed.
★ E-mail: xiaoju@sjtu.edu.cn
One approach to address these issues is to construct constrained simulations based on the observed distribution of galaxies in the local universe. Using the group catalogue built from SDSS galaxies (Yang et al. 2007, 2012), the matter density field at low redshift can be constructed and treated as the final output of the constrained simulations (Wang et al. 2009, 2012). To infer the initial condition of the final density field, Wang et al. (2014) proposed a method that utilises the Hamiltonian Markov Chain Monte Carlo algorithm to sample the posterior distribution of the initial condition, together with a Particle-Mesh model that evolves the initial condition to the final state. With the constrained initial condition, Wang et al. (2016) carry out the ELUCID $N$-body simulation, which accurately reproduces the observed large-scale structures in SDSS Data Release 7 (DR7, Abazajian et al. 2009). Based on this similarity, Yang et al. (2018) implement a neighbourhood abundance matching method that matches the observed galaxies in DR7 to the subhaloes in the ELUCID simulation.
The one-to-one matching between observed galaxies and simulated subhaloes provides a novel path for investigating the galaxy-halo relation. It has been shown that this approach can recover the massive haloes to a large extent (Tweed et al. 2017), and the haloes linked to the bright galaxies are likely to represent the actual haloes in the Universe. This provides an opportunity to compare, on an individual level, observed galaxies with those in the SAM implemented on the ELUCID simulation and in the upcoming ELUCID hydrodynamic simulation. Such a comparison is helpful for understanding the differences between galaxy formation models and the actual galaxy formation processes in the Universe. In addition, studying the galaxy-halo relation of the SDSS-ELUCID matching pairs statistically also provides insights into galaxy formation and evolution in the real Universe. In this work, we aim to capture this relation with machine learning and to predict galaxy properties based on subhalo properties.
Machine learning models are widely used in cosmological studies owing to their ability to efficiently learn non-linear multivariate dependencies between input and output variables. Efforts have been made to predict halo occupations or galaxy properties from dark matter halo or subhalo properties based on SAM or hydrodynamic simulations (Kamdar et al. 2016a,b; Agarwal et al. 2018; Lovell et al. 2022; Xu et al. 2021, 2022). Once trained, these machine learning models can be applied to large-volume $N$-body simulations to create mock galaxy catalogues that reproduce the galaxy-halo connection in the corresponding galaxy formation models. In this work, we focus on predicting galaxy properties from subhalo properties based on the SDSS-ELUCID matching catalogue in Yang et al. (2018), and we evaluate the feasibility of using machine learning to produce realistic mock catalogues with large-volume $N$-body simulations. However, the accuracy of the dark matter halo reconstruction in ELUCID, particularly for low-mass haloes, is not guaranteed, which may affect the robustness of our analysis. Therefore, we perform tests to estimate the impact of uncertainties in subhalo properties (or in other words, the mismatching between subhaloes and galaxies) on the prediction of galaxy properties. This study is helpful for revealing discrepancies between observed galaxies and modelled galaxies and for shedding light on the galaxy-subhalo relation in the real Universe.
The structure of this paper is as follows. We provide an overview of the ELUCID $N$-body simulation, the SDSS-ELUCID matching catalogue, galaxy formation models, and the machine learning method we implemented in Section 2. The main results of predicting SDSS galaxy properties are shown in Section 3. We then investigate the possible effect of mismatching between galaxies and subhaloes with a SAM implemented on the ELUCID simulation and a hydrodynamic simulation in Section 4. Finally, we summarise and discuss our results in Section 5.

ELUCID simulation and SDSS-ELUCID matching catalogue
In this study, we utilise the SDSS-ELUCID matching catalogue from Yang et al. (2018) (Match2 method), which links observed galaxies to subhaloes in the ELUCID $N$-body simulation. ELUCID is a constrained simulation designed to reproduce the large-scale distributions of galaxies observed in the Northern Galactic Cap (NGC) region of SDSS DR7 (Abazajian et al. 2009), in the range of $99^\circ < {\rm R.A.} < 283^\circ$, $-7^\circ < {\rm dec.} < 75^\circ$ and $0.01 < z < 0.12$. To achieve this, the matter density field reconstructed from the Yang et al. (2007) group catalogue, which is built based on the New York University Value-Added Galaxy Catalogue (NYU-VAGC, Blanton et al. 2005), is used as the final condition for inferring the corresponding initial condition. For this purpose, the Hamiltonian Markov Chain Monte Carlo method (HMCMC, Duane et al. 1987) and Particle-Mesh (PM) dynamics (White et al. 1983; Jing & Suto 2002) are used. The former samples the posterior distribution of linear initial conditions with a specific final condition, and the latter evolves the initial condition to the final density field by efficiently evaluating gravitational forces at each time step. With the inferred initial condition, the ELUCID simulation evolves $3072^3$ dark matter particles of mass $3.0875 \times 10^{8}\,h^{-1}\,{\rm M}_\odot$ in a box with a comoving length of $500\,h^{-1}\,{\rm Mpc}$ on a side using an updated version of the GADGET-2 code (Springel et al. 2005). The simulation adopts the WMAP5 cosmology with cosmological parameters $\Omega_{\rm m} = 0.258$, $\Omega_{\rm b} = 0.044$, $h = 0.72$, $n_{\rm s} = 0.963$, and $\sigma_8 = 0.796$ (Dunkley et al. 2009).
In each snapshot of the simulation, dark matter haloes and subhaloes are identified using the friends-of-friends (FOF) algorithm (Davis et al. 1985) and the SUBFIND method (Springel et al. 2001), respectively. The subhalo merger tree is then constructed by linking subhaloes from SUBFIND in each snapshot. Yang et al. (2018) match the SDSS DR7 galaxies in the above survey area to the ELUCID subhaloes at $z = 0$ with a novel neighbourhood abundance matching technique; we refer to the result as the SDSS-ELUCID matching catalogue in the following. This approach is similar to the traditional subhalo abundance matching (SHAM, Conroy et al. 2006; Behroozi et al. 2010; Moster et al. 2010; Reddick et al. 2013; Guo et al. 2016) that links galaxies and subhaloes through their luminosity (or stellar mass) and subhalo mass (or circular velocity). In addition, it takes into account the separation between galaxies and subhaloes and prefers to match a galaxy to a subhalo of appropriate mass in its neighbourhood. As a result, 296,488 galaxies out of 396,069 are assigned to central subhaloes as central galaxies, and 99,581 are assigned to satellite subhaloes as satellite galaxies. We refer the reader to Yang et al. (2018) for more details regarding the neighbourhood abundance matching method and the SDSS-ELUCID matching catalogue. The ELUCID simulation and SDSS-ELUCID matched catalogue are available on the ELUCID website$^{1}$.
We investigate the connection between galaxy properties and subhalo properties in the SDSS-ELUCID matching catalogue with machine learning. The galaxy properties we mainly focus on are the $r$-band absolute magnitude $M_r$ and the $g-r$ colour, with the magnitudes being K-corrected with evolution corrections to $z = 0.1$ according to Blanton et al. (2003) and Blanton & Roweis (2007). We also consider derived physical galaxy properties such as stellar mass and specific star formation rate (sSFR). The subhalo properties we focus on are: (1) $M_{\rm sub}$, the subhalo mass, in units of $h^{-1}\,{\rm M}_\odot$; (2) $M_{\rm peak}$, the peak value of $M_{\rm sub}$ over the formation history of the subhalo; (3) $M_{\rm acc}$, the value of $M_{\rm sub}$ when the subhalo accretes onto its host ($M_{\rm acc} = 0$ for central subhaloes); (4) $R_{\rm half}$, the half mass radius of the subhalo; (5) $V_{\rm max}$, the maximum circular velocity of the subhalo; (6) $V_{\rm peak}$, the peak value of $V_{\rm max}$ over the formation history of the subhalo; (7) $V_{\rm max,acc}$, the value of $V_{\rm max}$ when the subhalo accretes onto its host; (8) $V_{\rm disp}$, the velocity dispersion of the subhalo; (9) $z_{\rm vpeak}$, the redshift when $V_{\rm max}(z_{\rm vpeak}) = V_{\rm peak}$; (10) $z_{\rm mpeak}$, the redshift when $M_{\rm sub}(z_{\rm mpeak}) = M_{\rm peak}$; (11) $z_{\rm acc}$, the redshift when the subhalo accretes onto its host halo; (12) $z_{0.1/0.3/0.5/0.7/0.9}$, the formation redshift of the subhalo, defined by the redshift when the subhalo reaches 0.1/0.3/0.5/0.7/0.9 of its peak mass for the first time; (13) $N_{\rm merg}$, the total number of major mergers (defined by a mass ratio of 1/3 between the progenitors) on the main branch of the subhalo merger tree; (14) $z_{\rm first}$, the redshift of the first major merger of the subhalo; (15) $z_{\rm last}$, the redshift of the last major merger of the subhalo; (16) $T_{\rm sat}$, the total time during which the subhalo is a satellite around the central subhalo, in units of Gyr; (17) $\lambda$, the spin parameter of the subhalo. The environmental properties included are: (1) $\delta_{2.1}$, the matter density smoothed by a Gaussian filter with a smoothing scale of $2.1\,h^{-1}\,{\rm Mpc}$; (2) $T_{\rm web}$, the cosmic web type, classified as one of knot, filament, sheet, and void according to the eigenvalues of the Hessian matrix (Zhang et al. 2009; Paranjape et al. 2018) calculated with $\delta_{2.1}$.

SAM and hydrodynamic simulation
To examine the impact of the mismatch between SDSS galaxies and ELUCID subhaloes on our results, we make use of the Luo et al. (2016) SAM implemented on the subhalo merger tree of ELUCID. As an L-Galaxies model (Guo et al. 2011, 2013; Fu et al. 2013), it accounts for various galaxy formation processes such as gas cooling, star formation, gas stripping, and feedback from AGN and supernovae. In comparison with other SAMs, it introduces an analytic approach to trace the evolution of low-mass subhaloes that fall below the mass resolution of the simulation, improving the modelling of satellite quenching and galaxy clustering.
To further assess the impact of the mismatch, we also perform tests with the TNG-300 hydrodynamic simulation (Marinacci et al. 2018; Naiman et al. 2018; Nelson et al. 2018, 2019; Pillepich et al. 2018; Springel et al. 2018). This simulation evolves $2500^3$ dark matter particles of mass $5.9 \times 10^{7}\,h^{-1}\,{\rm M}_\odot$ and the same number of baryonic particles of mass $1.1 \times 10^{7}\,h^{-1}\,{\rm M}_\odot$ in a cubic box with a length of $205\,h^{-1}\,{\rm Mpc}$ on a side using the AREPO moving-mesh code (Springel 2010). The Planck cosmology (Planck Collaboration et al. 2016) is adopted, with cosmological parameters $\Omega_{\rm m} = 0.31$, $\Omega_{\rm b} = 0.0486$, $h = 0.677$, $n_{\rm s} = 0.97$, and $\sigma_8 = 0.816$. The TNG-300 simulation is an updated version of the original Illustris simulation (Vogelsberger et al. 2014; Nelson et al. 2015), with improvements on AGN feedback, galactic winds, and magnetic fields. Compared to the original Illustris, the galaxy colour distribution in TNG is found to be more consistent with observation.
We use the subhaloes from TNG-300-dark, which is a dark-matter-only (DMO) counterpart of the full-physics (FP) TNG-300 simulation. We adopt similar subhalo properties as in Section 2.1 calculated from the SUBLINK merger tree of TNG-300-dark, including $M_{\rm sub}$, $M_{\rm peak}$, $R_{\rm max}$, $V_{\rm max}$, $V_{\rm peak}$, $V_{\rm disp}$, $z_{\rm vpeak}$, $z_{\rm mpeak}$, $z_{\rm acc}$, $z_{0.1}$, $z_{0.3}$, $z_{0.5}$, $z_{0.7}$, $z_{0.9}$, $N_{\rm merg}$, $z_{\rm first}$, $z_{\rm last}$, $T_{\rm sat}$, and $\lambda$. To assign galaxies to DMO subhaloes, we apply the matching catalogue between the subhaloes of the DMO and FP runs in Rodriguez-Gomez et al. (2015). In the case that multiple galaxies are matched to one subhalo, we assign the most massive galaxy to the subhalo. To reduce matching noise, we exclude outliers with $|\log M_{\rm sub,DMO} - \log M_{\rm sub,FP}| > 1$ for the galaxies. The TNG snapshot data, group catalogue, and SUBFIND catalogue are all available on the TNG website$^{2}$.

Random forest
We focus on reproducing galaxy properties based on subhalo properties with machine learning techniques to better understand the connection between the two. To accomplish this, we utilise the random forest (RF) model (Breiman 2001), which is highly efficient in capturing complex multivariate dependencies between input and output variables. The RF model is widely used in galaxy formation studies and shows promising results in reproducing galaxy properties based on halo or subhalo properties (Kamdar et al. 2016a; Agarwal et al. 2018; Xu et al. 2021, 2022).
RF is an ensemble of decision trees (Breiman et al. 1984), which are constructed by splitting training data into hierarchical nodes. At each node, the training data, including the feature variables and the target variable, is split into lower-level nodes in a way that minimises the cost function (e.g. the Gini impurity for a classification tree and the mean squared error for a regression tree), until either the specified maximum tree depth or the minimum number of samples in a node is reached. The predicted output is then calculated from the bottom level of nodes, also known as leaves. For a classification tree, the output is the majority class of the target variable of the data in the leaf, and for a regression tree, the output is the mean of the target variable of the data in the leaf. Once trained, the RF can be tested using a test sample, and the prediction performance can be estimated by performance scores such as $F_1$ for classification and $R^2$ for regression. To predict galaxy properties, we employ the regression RF in the sklearn package of Python and the $R^2$ score. For all the RF analyses in this work, we use 60% of the original data as the training sample and the rest as the test sample.
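The training and scoring procedure described above can be sketched with sklearn as follows. This is a minimal illustration rather than the configuration used in this work: the catalogue is a synthetic stand-in, the two features (`log_m_peak`, `z_half`), the toy target relation, and the hyperparameters (`n_estimators`, `min_samples_leaf`) are all assumptions chosen for the example. Only the 60%/40% split and the $R^2$ score follow the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for a galaxy-subhalo catalogue (illustrative only).
n = 2000
log_m_peak = rng.uniform(11.0, 14.0, n)   # hypothetical subhalo feature
z_half = rng.uniform(0.2, 2.0, n)         # hypothetical formation redshift
X = np.column_stack([log_m_peak, z_half])

# Toy magnitude-like target that depends mostly on mass, plus scatter.
y = -20.0 - 1.5 * (log_m_peak - 12.0) + 0.3 * rng.normal(size=n)

# 60% of the data for training, the remaining 40% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0
)

# Hyperparameters are assumptions, not the values used in the paper.
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=0)
rf.fit(X_train, y_train)

# R^2: fraction of the target variance captured by the prediction.
r2_train = r2_score(y_train, rf.predict(X_train))
r2_test = r2_score(y_test, rf.predict(X_test))
```

Comparing `r2_train` with `r2_test` gives a quick check for overfitting: a training score far above the test score would suggest that the trees are too deep for the sample size.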

PREDICTING SDSS GALAXY PROPERTIES
We construct RF models for predicting galaxy $r$-band absolute magnitude $M_r$ and $g-r$ colour separately. These models are trained using galaxies selected from the SDSS-ELUCID matching catalogue, and the subhalo properties listed in Section 2.1 are used as input variables. With the predicted $M_r$, we compare the luminosity function and galaxy-matter cross-correlation in different $M_r$ bins to those in observation. We then compare the predicted colour distribution to that of the SDSS.

Subhalo mass completed sample
For training the RF, we first select an appropriate sample from the SDSS-ELUCID matching catalogue. Since only the galaxies brighter than a specific magnitude threshold can be observed at fixed redshift, the low-mass subhaloes with faint galaxies are likely underrepresented in the SDSS-ELUCID matching catalogue, an effect known as the Malmquist bias. With this bias, the number density of subhaloes of fixed mass decreases beyond a certain redshift, which we refer to as the limited redshift $z_{\rm lim}$. In other words, the subhalo population of this mass is incomplete above $z_{\rm lim}$. For a specific low subhalo mass, galaxies residing in early-formed subhaloes with luminosities higher than average are more likely to be observed, while those with luminosities lower than average may fall below the detection limit of the survey. This leads to a biased luminosity-subhalo mass relation for the low-mass subhaloes in the SDSS-ELUCID catalogue. If the RF captures this biased relation, the predicted magnitude would be brighter than expected at fixed subhalo mass. It will also introduce biases in the relationships between other galaxy properties and subhaloes, since the early-formed subhaloes are overrepresented in observation. To avoid this kind of bias, it is necessary to select the subhaloes with redshift smaller than their $z_{\rm lim}$.
In Figure 1, we compare the number densities of SDSS-ELUCID matched subhaloes (solid) to all the ELUCID subhaloes (dashed) in the SDSS region in $\log M_{\rm sub}$ bins of 0.2 dex as a function of redshift. Different colours represent $\log M_{\rm sub}$ bins in the range of [11, 12]. The total number densities of SDSS region subhaloes are approximately constant across the redshift range, except for a bump near $z = 0.08$, which may be caused by the well-known Sloan "great wall" structure. In contrast, the number densities of SDSS-matched subhaloes deviate from those of the SDSS region subhaloes and decline beyond specific redshifts, which increase with subhalo mass. This indicates again that the subhalo sample matched to SDSS galaxies may be incomplete due to the Malmquist bias. The impact of Malmquist bias vanishes for subhaloes of $\log M_{\rm sub} > 12$. The bottom panel shows the ratio between the two number densities, $n_{\rm matched}/n_{\rm total}$. For each subhalo mass bin, we define the limited redshift $z_{\rm lim}$ as the redshift at which the ratio drops to 0.9 (shown by the dashed line). By interpolating between the mass bins, a $z_{\rm lim}$ can be calculated for each galaxy (subhalo) in the SDSS-ELUCID matched sample according to the subhalo mass. We then select the galaxies with redshift below their $z_{\rm lim}$. As a result, 201,980 galaxies are selected from the original 396,069 galaxies for our RF analysis. We refer to this sample as the $z_{\rm lim}$-selected sample. We also perform a test calculating $z_{\rm lim}$ with $\log M_{\rm peak}$ instead of $\log M_{\rm sub}$, and the result is similar.
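The selection can be illustrated with a short sketch, under stated assumptions: the ratio $n_{\rm matched}/n_{\rm total}$ is taken to be tabulated at redshift bin centres for each mass bin, the crossing of 0.9 is located by linear interpolation between adjacent redshift bins, and the interpolation across mass bins is linear in $\log M_{\rm sub}$. The function names and the toy numbers are hypothetical and are not taken from the analysis itself.

```python
import numpy as np

def limiting_redshift(z_centres, ratio, threshold=0.9):
    """Redshift at which `ratio` first drops below `threshold`.

    Uses linear interpolation between the last bin at or above the
    threshold and the first bin below it (an assumption of this sketch).
    """
    below = np.nonzero(ratio < threshold)[0]
    if below.size == 0:
        return z_centres[-1]          # complete over the whole range
    i = below[0]
    if i == 0:
        return z_centres[0]           # already incomplete in the first bin
    z0, z1 = z_centres[i - 1], z_centres[i]
    r0, r1 = ratio[i - 1], ratio[i]
    return z0 + (threshold - r0) * (z1 - z0) / (r1 - r0)

def select_complete(log_m_sub, z, mass_centres, z_lim_bins):
    """Boolean mask of galaxies with redshift below their own z_lim.

    np.interp holds the end values fixed outside the tabulated mass range.
    """
    z_lim = np.interp(log_m_sub, mass_centres, z_lim_bins)
    return z < z_lim

# Toy ratios for two mass bins (made-up numbers for illustration).
z_centres = np.array([0.02, 0.04, 0.06, 0.08, 0.10, 0.12])
ratio_low = np.array([1.00, 1.00, 0.95, 0.85, 0.50, 0.20])
ratio_high = np.array([1.00, 1.00, 1.00, 1.00, 0.95, 0.85])

mass_centres = np.array([11.1, 11.3])
z_lim_bins = np.array([
    limiting_redshift(z_centres, ratio_low),
    limiting_redshift(z_centres, ratio_high),
])

# Two galaxies in subhaloes of log M_sub = 11.2 at different redshifts.
mask = select_complete(np.array([11.2, 11.2]), np.array([0.08, 0.10]),
                       mass_centres, z_lim_bins)
```

In this toy case the two mass bins give $z_{\rm lim} = 0.07$ and $0.11$, so a subhalo of $\log M_{\rm sub} = 11.2$ has $z_{\rm lim} = 0.09$: the galaxy at $z = 0.08$ is kept and the one at $z = 0.10$ is dropped.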

r-band magnitude
The results of the $M_r$ prediction are shown in Figure 2. In the top-left panel, we compare the luminosity function (LF) of the SDSS $M_r$ (blue solid) of the $z_{\rm lim}$-selected sample with the corresponding RF predictions (blue dashed for the training set and blue dotted for the test set). To measure the LF, we adopt the $V_{\rm max}$ method, which determines the maximum volume in which the galaxy can be observed above the flux limit of the survey (note that this is different from the subhalo maximum circular velocity $V_{\rm max}$). For each galaxy, a weight of $1/V_{\rm max}$ is assigned for number counting. The RF predictions demonstrate good agreement with the SDSS measurement within the magnitude range of $-22 < M_r < -18$. However, discrepancies arise at both the bright end ($M_r < -22$) and the faint end ($M_r > -18$), where the prediction is lower than the SDSS. This is not surprising, as machine learning methods are unable to reproduce 100% of the variance of the input data and tend to underpredict extreme values (Agarwal et al. 2018). The RF predictions on the training and test samples are in excellent agreement, indicating that the construction of the RF is appropriate and the prediction result is reliable.
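As an illustration of the $1/V_{\rm max}$ weighting, the sketch below computes a binned LF from absolute magnitudes. It rests on strong simplifying assumptions that are not those of the actual measurement: Euclidean geometry, no K-corrections or evolution corrections, and only a faint apparent-magnitude limit. The function names, the survey solid angle `area_sr`, and the distance limits are hypothetical.

```python
import numpy as np

def euclidean_vmax(abs_mag, m_faint, area_sr, d_min, d_max_survey):
    """Maximum observable volume in a Euclidean approximation (Mpc^3).

    Assumes no K-correction or evolution correction; distances in Mpc.
    A galaxy whose limiting distance is not larger than d_min gets
    V_max = 0 and should not be part of the sample.
    """
    # Distance at which the galaxy reaches the faint magnitude limit,
    # from the distance modulus m - M = 5 log10(d / Mpc) + 25.
    d_lim = 10.0 ** ((m_faint - abs_mag - 25.0) / 5.0)
    d_hi = np.clip(d_lim, d_min, d_max_survey)
    return area_sr / 3.0 * (d_hi**3 - d_min**3)

def luminosity_function(abs_mag, vmax, bin_edges):
    """Number density per magnitude: sum of 1/V_max per bin / bin width."""
    counts, _ = np.histogram(abs_mag, bins=bin_edges, weights=1.0 / vmax)
    return counts / np.diff(bin_edges)

# Toy example: two galaxies with hand-picked V_max values.
abs_mag = np.array([-20.5, -20.2])
vmax = np.array([1.0e6, 2.0e6])
phi = luminosity_function(abs_mag, vmax, np.array([-21.0, -20.0, -19.0]))
```

Intrinsically faint galaxies are visible only in a small volume, so their large $1/V_{\rm max}$ weights compensate for the fact that few of them enter a flux-limited sample.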
We then apply this trained RF to all subhaloes in the SDSS-ELUCID sample and show the predicted $M_r$ LF with the black solid curve. For comparison, the direct measurement of the SDSS LF of the same sample is shown by the red solid curve. Similar to the result of the $z_{\rm lim}$-selected sample, the prediction is consistent with the direct measurement within the range of $-22 < M_r < -18$. As the SDSS region only covers a fraction of the ELUCID volume, we also apply the trained RF model to all the subhaloes in the ELUCID simulation and show the predicted LF with the black dotted curve. It is again very similar to the SDSS measurement, with the exception of the bright and faint ends. A bump appears at $M_r > -18$, which is likely attributable to the low abundance of faint galaxies hosted by low-mass subhaloes in our training sample. In addition, cosmic variance may also contribute to the discrepancy at the faint end. As highlighted in Chen et al. (2019), the faint-end slope of the LF was significantly underestimated due to cosmic variance in the SDSS observation.
The top-right panel presents a direct comparison between the SDSS $M_r$ (x-axis) and the predicted $M_r$ (y-axis) for all galaxies in the $z_{\rm lim}$-selected sample. The blue contours enclose 20%, 40%, 60%, 80%, and 95% of the data distribution, and the black solid curve (shaded band) shows the median (16%-84% range) of the predicted $M_r$ at fixed SDSS $M_r$. The black dashed line along the diagonal indicates equality between the prediction and the SDSS values. Overall, the prediction follows the equality line except at the faint and bright ends. For galaxies fainter than $M_r \sim -20$, the RF tends to predict brighter magnitudes, while the trend is reversed for galaxies brighter than $M_r \sim -20$. Scatter exists in the prediction at fixed SDSS $M_r$, with smaller scatter for brighter galaxies than for fainter ones. To quantify the performance of the prediction, we provide at the bottom right of the panel the $R^2$ score, which describes the fraction of the variance in the target variable ($M_r$ in this case) captured by the prediction. As $R^2 = 1$ represents a perfect prediction that recovers the full variance in the target variable, an $R^2$ of 0.8 indicates that our prediction captures a significant fraction of the variance in SDSS $M_r$.
The bottom-left and bottom-right panels show the same comparison for the training sample and test sample, respectively. The $R^2$ of the training sample is slightly higher than that of the full sample, and the $R^2$ of the test sample is slightly lower. This is reasonable since the RF is data-driven, and the model is fitted to the training sample directly. We also perform the same analysis to predict the stellar mass, and the result (shown in Appendix A) is very similar to that of the $M_r$ prediction.
We then proceed to compare the $M_r$-dependent galaxy clustering in SDSS and the predictions. To measure the SDSS clustering, we construct four volume-limited $M_r$ bin samples in which the sample completeness is ensured. In other words, the apparent magnitudes of all galaxies in each bin fall within the detection limits of the survey, from $m_r = 14.5$ to $m_r = 17.72$ (Zehavi et al. 2005). To obtain a higher signal-to-noise ratio, we calculate the two-point galaxy-matter cross-correlation in the ELUCID coordinates, instead of the galaxy-galaxy auto-correlation, using the estimator $\xi_{\rm gm} = {\rm DD}/{\rm DR} - 1$, where DD is the number of galaxy-matter pairs, and DR is the number of galaxy-random pairs. The positions of subhaloes serve as the positions of their matched galaxies.
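The estimator can be sketched with brute-force pair counting as below. This is an illustrative sketch only: it assumes that DD and DR are normalised by the number of matter particles and random points, respectively, before taking the ratio (the text does not spell out the normalisation), it ignores periodic boundaries and survey geometry, and the all-pairs distance matrix is practical only for small samples. The function names are hypothetical.

```python
import numpy as np

def pair_counts(a, b, r_edges):
    """Number of (a, b) pairs per separation bin, by brute force."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    counts, _ = np.histogram(d.ravel(), bins=r_edges)
    return counts.astype(float)

def xi_gm(galaxies, matter, randoms, r_edges):
    """Galaxy-matter cross-correlation, xi_gm = DD/DR - 1.

    DD and DR are divided by the number of matter particles and random
    points so that the ratio does not depend on the sampling density
    (an assumption of this sketch).
    """
    dd = pair_counts(galaxies, matter, r_edges) / len(matter)
    dr = pair_counts(galaxies, randoms, r_edges) / len(randoms)
    return dd / dr - 1.0

rng = np.random.default_rng(0)
galaxies = rng.uniform(0.0, 1.0, (200, 3))
randoms = rng.uniform(0.0, 1.0, (500, 3))

# Matter particles placed close to the galaxies: strongly clustered.
matter = galaxies + rng.normal(scale=0.02, size=galaxies.shape)
xi_clustered = xi_gm(galaxies, matter, randoms, np.array([0.0, 0.05, 0.1]))
```

For production measurements a tree-based pair counter would be preferable, since the distance matrix above scales with the product of the two sample sizes.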
The SDSS clustering of each $M_r$ bin is illustrated by the red solid curve in each panel of Figure 3. The black solid curve shows the prediction of SDSS-matched subhaloes, and the black dotted curve indicates the prediction from all subhaloes in ELUCID. In the three bright bins where $-22 < M_r < -19$, the predictions for both the SDSS-matched subhaloes and all subhaloes are consistent with the SDSS measurement, except on very small scales. It is worth noting that for the clustering of the SDSS-matched prediction, we still utilise the positions of subhaloes, so any clustering discrepancy is solely due to the prediction of $M_r$.
In the faintest bin where $-19 < M_r < -18$, the prediction of the SDSS-matched sample still agrees with the SDSS measurement. However, the prediction from all subhaloes in ELUCID exhibits a lower clustering amplitude than SDSS on all scales. This discrepancy can be attributed to the bump of the black dotted curve in Figure 2, which could be a result of the scarcity of low-mass subhaloes in the training sample and therefore the low accuracy of the $M_r$ prediction in these subhaloes.

g-r colour
In addition to $M_r$, we also train the RF model to predict the $g-r$ colour with the subhalo properties, and the results are shown in Figure 4. The top-left panel displays the distribution of the SDSS colour of the $z_{\rm lim}$-selected sample (blue solid) and the RF prediction separated into training (blue dashed) and test sets (blue dotted, overlapping with the blue dashed). We then apply this RF to all subhaloes of the SDSS-ELUCID catalogue and show the prediction with the black solid curve, and also provide the SDSS colour distribution of the same subhaloes with the red solid curve for comparison. The SDSS colour distribution consists of a narrow red peak around $g-r = 1$ and a smooth blue component in the range of $0.4 < g-r < 0.7$, and only the red peak remains after the $z_{\rm lim}$ selection. However, the red peak of the RF prediction shifts towards lower values of $g-r$. Additionally, the width of the predicted distribution is narrower than that of the SDSS, indicating that extreme red and blue values are not fully recovered by the RF. Since the RF is trained solely on the red peak galaxies, it is not able to recover the blue component when applied to all subhaloes in the SDSS-ELUCID matched catalogue.
In the top-right panel, we show the comparison between the SDSS colour (x-axis) and the predicted colour (y-axis). The overall trend deviates more noticeably from the diagonal compared to that of the $M_r$ prediction, and the $R^2$ score ($\sim 0.3$) is significantly lower. The bottom-left and bottom-right panels display the prediction for the training sample and test sample, respectively. The $R^2$ of the training (test) sample is slightly higher (lower) than that of the full sample but still indicates a similar level of prediction accuracy. Instead of using the inferred assembly properties characterising subhalo formation history such as $z_{0.1/0.3/0.5/0.7/0.9}$, we also input the original merger tree information to the RF by using the subhalo masses of 21 snapshots from $z = 4.86$ to $z = 0$, with masses set to zero if the subhaloes are not identified at early redshifts. The result is very similar to that using the inferred assembly properties. We also build RF models for central and satellite galaxies separately, but no significant improvements in the prediction are found. This indicates that predicting the SDSS galaxy colour with subhalo properties is more challenging than predicting $M_r$. We also train the RF to predict the SFR and sSFR of SDSS galaxies based on subhalo properties. The results are shown in Appendix A. The $R^2$ of the sSFR prediction is similar to that of the colour, while the $R^2$ of the SFR is much lower.
The reasons for the low-accuracy colour prediction are complicated. Firstly, the correlation between galaxy colour and subhalo properties may be weak in SDSS, and baryonic processes such as AGN feedback could have more significant effects on galaxy colour. Secondly, noise in the galaxy-subhalo relation of the training sample may arise from possible mismatches between the SDSS galaxies and ELUCID subhaloes. It is difficult to test the first possibility directly since subhalo or halo properties such as formation redshift are difficult to measure in observation. Empirical models can be used to infer the correlation between galaxy colour and halo property. For example, Hearin & Watson (2013) propose an age-matching model that assumes a monotonic relation between galaxy colour and subhalo assembly property to reproduce colour-dependent galaxy clustering. On the other hand, Xu et al. (2018) propose a conditional colour-magnitude distribution model that assumes magnitude and colour depend purely on halo mass and find that it can also reproduce the observed galaxy clustering dependence on colour reasonably well. Of these two models, the former suggests a non-zero relation between colour and halo assembly history, while the latter suggests that colour is independent of assembly history at fixed halo mass. This indicates that the conclusion can be model-dependent, and further investigations are needed to resolve this debate. In this study, we focus on investigating the second possible reason mentioned above, which is the mismatch between SDSS galaxies and ELUCID subhaloes. It is important to note that the term "mismatch" here refers not only to errors in matching caused by the neighbourhood abundance matching method, but also to other sources of noise that could introduce biases in the galaxy-subhalo relation. All the RF studies above are based on the assumption that the matching is accurate, or in other words, that the true subhalo properties of a galaxy can be accurately recovered by those of the matched ELUCID subhalo. However, this is not guaranteed, especially for the low-mass subhaloes that are expected to host faint galaxies, as these may not be recovered by the constrained simulation. The reconstruction of the matter density from the group catalogue only uses groups of mass above $\log M_{\rm group} = 12$ and applies a Gaussian kernel with a smoothing scale of $2\,h^{-1}\,{\rm Mpc}$ (Wang et al. 2016). As a result, information on haloes and subhaloes below this mass scale and length scale is lost, and the reconstructed (sub)haloes could differ from the actual ones. Matching galaxies to these subhaloes could introduce noise into the galaxy-halo relations compared to the true ones. Therefore, it is necessary to consider the mismatch effect when analysing the galaxy-halo relations based on this galaxy-subhalo matching catalogue. In the following section, we perform tests to investigate the impact of the mismatch effect on our RF results using a SAM and a hydrodynamic simulation.

Mismatch effect using SAM
In this section, we test the potential impact of the mismatch between SDSS galaxies and ELUCID subhaloes by creating a similar mismatch among the galaxies of a SAM implemented on ELUCID (Luo et al. 2016). Since we regard the mismatch effect as noise in the galaxy-subhalo relation, we mimic it by randomly shuffling the SAM galaxies among subhaloes within narrow $M_{\rm peak}$ bins of 0.2 dex and within cubic cells of $5\,h^{-1}\,{\rm Mpc}$. The constraint of narrow $M_{\rm peak}$ bins maintains a reasonable stellar mass-$M_{\rm peak}$ relation, consistent with the principle of neighbourhood abundance matching used when assigning SDSS galaxies to ELUCID subhaloes. Shuffling within cells of $5\,h^{-1}\,{\rm Mpc}$ is in line with the advantage of the constrained simulation, which can recover the subhalo distribution at this scale. For example, Yang et al. (2018) investigate the separation between galaxy and subhalo pairs in the SDSS-ELUCID matched catalogue and find that most of the pairs are separated by less than $\sim 5\,h^{-1}\,{\rm Mpc}$ along both directions they examine. The shuffling breaks the original connection between galaxy properties and subhalo properties other than $M_{\rm peak}$, thus adding noise to the true galaxy-subhalo relation.
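The shuffling step described above can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the function name `shuffle_in_bins`, the tuple layout of the input, and the way bins and cells are aligned to zero are all assumptions made for the example. Only the 0.2 dex bin width and the 5 h^-1 Mpc cell size come from the text.

```python
import math
import random
from collections import defaultdict


def shuffle_in_bins(subhaloes, galaxy_ids, bin_dex=0.2, cell_size=5.0, seed=0):
    """Permute galaxies among subhaloes that share a mass bin and a cubic cell.

    subhaloes  : list of (log_mpeak, x, y, z), positions in h^-1 Mpc (assumed layout)
    galaxy_ids : galaxy_ids[i] is the galaxy originally hosted by subhaloes[i]
    Returns a new list holding the shuffled galaxy assignment.
    """
    rng = random.Random(seed)

    # Group subhalo indices by (mass bin, cell index along x, y, z).
    groups = defaultdict(list)
    for i, (log_mpeak, x, y, z) in enumerate(subhaloes):
        key = (
            math.floor(log_mpeak / bin_dex),
            math.floor(x / cell_size),
            math.floor(y / cell_size),
            math.floor(z / cell_size),
        )
        groups[key].append(i)

    # Randomly permute the galaxies inside each group; groups never mix.
    shuffled = list(galaxy_ids)
    for members in groups.values():
        permuted = members[:]
        rng.shuffle(permuted)
        for dst, src in zip(members, permuted):
            shuffled[dst] = galaxy_ids[src]
    return shuffled
```

Because galaxies only move between subhaloes of nearly equal peak mass in the same cell, mass-driven and large-scale environmental trends are preserved by construction, while the link to the other subhalo properties is randomised.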
We subsequently construct RF models to predict galaxy colour using the original SAM galaxy-subhalo pairs and the shuffled pairs, respectively. To obtain a reasonable estimate of the mismatch effect, we use the SAM galaxies hosted by the subhaloes of the $z_{\rm lim}$-selected sample before shuffling. The left panel of Figure 5 shows the SAM colour distribution of galaxies in $z_{\rm lim}$-selected subhaloes (red solid) and the RF prediction (black solid), which are highly similar. The middle panel displays the two-dimensional distribution. In general, the contours follow the line of equality with small deviations. The black solid line with shading indicates the median and the 16th-84th percentile range of the prediction in fixed SAM colour bins. The large deviation at SAM colours below 0.2 is possibly due to the low number of extremely blue galaxies in this range. The $R^2$ of the prediction is ∼ 0.8, significantly higher than that of the SDSS prediction. This indicates that the galaxy-subhalo connection in the SAM is much stronger, which is consistent with the construction of the SAM. Xu et al. (2022) find that adding galaxy properties such as black hole mass and cold gas mass can further improve the prediction of the SAM colour.
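For reference, the score quoted here can be computed as below. The sketch assumes that the score is the standard coefficient of determination, which the text does not define explicitly. The function name `r2_score` is chosen for the example.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    A value of 1 means a perfect prediction; a value of 0 means the prediction
    is no better than always returning the mean of the true values.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Under this definition, a score of ∼ 0.8 means that the prediction leaves about 20 per cent of the colour variance unexplained.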
Recently, Jespersen et al. (2022) propose a graph neural network method to predict several SAM galaxy properties based on halo merger trees. Unlike traditional machine learning methods, where the input features are halo properties extracted from the merger tree, their model uses the merger tree itself as input, maximizing the information obtained from the growth history of the halo. Their prediction performance for SFR ($R^2 = 0.876$) is impressively higher than that of previous studies in the literature. We also perform a test predicting the SFR with RF and find an $R^2$ score of 0.864, which is very similar to that of the graph neural network. This indicates that the RF is capable of capturing the connections between galaxy properties and halo or subhalo properties if they exist.
Returning to the left panel of Figure 5, the black dashed curve indicates the prediction based on the shuffled sample. The prediction still features the blue and red peaks, but the red peak is slightly lower than that in the original SAM, and the blue peak is slightly higher. The overall recovered colour range is narrower, and some of the extreme blue and red values are missing compared to the prediction before shuffling. The right panel shows the two-dimensional comparison between the shuffled prediction and the original SAM. The deviation from equality is larger than that in the middle panel, especially at colours below 0.4, and the scatter in the prediction at fixed SAM colour is also larger. The $R^2$ value of 0.655 is lower than that before shuffling.
The noise that the shuffling introduces into the galaxy-subhalo relation degrades the RF colour prediction. However, even with shuffling, the $R^2$ score of the prediction is still higher than that of the SDSS prediction. We find from the RF that the most important subhalo feature for predicting SAM colour is $V_{\rm peak}$, which is highly correlated with $M_{\rm peak}$ and therefore likely remains similar after shuffling. Other relatively important subhalo features for the prediction are subhalo assembly properties, such as the accretion time and the formation times defined at 0.1, 0.3, 0.5, 0.7, and 0.9 of the peak mass. Although the shuffling reassigns these subhalo properties for a given galaxy, the correlations between galaxy and subhalo properties may not be completely removed because of the constraints of the shuffling. This will be further demonstrated by the correlation coefficients before and after shuffling in Section 4.3. As a result, the galaxy colour can still be partially reproduced after the shuffling. If the colour-subhalo relation in the real Universe were similar to that in the SAM, the RF could capture this relation with an $R^2$ of approximately 0.6, accounting for possible mismatches. Thus, it is likely that the connection between colour and subhalo properties in the real Universe is not as strong as that in the SAM. As a further step, we perform a similar test using the TNG300 hydrodynamic simulation and compare it with the SDSS prediction in the following section.

Mismatch effect using TNG300
Since TNG300 has no counterpart of the SDSS region, comparisons between the TNG300 predictions and the SDSS or SAM predictions are indirect. Since the SDSS galaxies are matched to a fraction of ELUCID subhaloes in the corresponding SDSS region, and we select SDSS galaxies according to $z_{\rm lim}$ to ensure the completeness of subhaloes, some subhaloes in the SDSS region of ELUCID are empty (i.e. not occupied by $z_{\rm lim}$-selected SDSS galaxies). Since more massive subhaloes tend to host brighter galaxies that are more likely to be observed, the occupied fraction of subhaloes increases with $\log M_{\rm sub}$. To account for this effect in TNG300, we measure the occupied fraction as a function of $\log M_{\rm sub}$ in the SDSS $z_{\rm lim}$ catalogue and select a random sample of galaxies in TNG300 that reproduces this trend.
Figure 6 displays the occupied fraction in both the SDSS $z_{\rm lim}$ sample (red solid) and the selected TNG300 sample (black solid). The occupied fraction is ∼ 0 for $\log M_{\rm sub} < 11$ and rapidly increases to ∼ 1 at $\log M_{\rm sub} \sim 12$. This implies that the abundance of low-mass subhaloes is largely suppressed in observation, while the subhaloes of $\log M_{\rm sub} > 12$ are barely affected. The advantage of this selection in TNG300 is that it can create a training sample where the subhalo population is similar to that of the SDSS training sample. This is important because the machine learning performance of colour prediction might depend on $\log M_{\rm sub}$.
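One way to build such a selection is sketched below: measure the occupied fraction per subhalo mass bin in the reference catalogue, then keep each subhalo of the target simulation with that probability. The function names and the 0.2 dex bin width are assumptions made for illustration, as the text does not specify the binning used for this step.

```python
import math
import random
from collections import Counter


def occupied_fraction(log_msub_all, log_msub_occupied, bin_dex=0.2):
    """Fraction of subhaloes hosting a selected galaxy in each log-mass bin.

    log_msub_all      : log masses of all subhaloes in the reference region
    log_msub_occupied : log masses of the subhaloes that host a selected galaxy
    bin_dex           : bin width in dex (assumed value, not taken from the text)
    """
    total = Counter(math.floor(m / bin_dex) for m in log_msub_all)
    occupied = Counter(math.floor(m / bin_dex) for m in log_msub_occupied)
    return {b: occupied[b] / n for b, n in total.items()}


def select_matching_sample(log_msub_target, fraction, bin_dex=0.2, seed=0):
    """Randomly keep target subhaloes with the reference occupied fraction.

    Returns the indices of the kept subhaloes; bins absent from the
    reference catalogue are treated as unoccupied.
    """
    rng = random.Random(seed)
    selected = []
    for i, m in enumerate(log_msub_target):
        keep_probability = fraction.get(math.floor(m / bin_dex), 0.0)
        if rng.random() < keep_probability:
            selected.append(i)
    return selected
```

The selection is random within each mass bin, so it reproduces the mass dependence of the occupied fraction without imposing any additional dependence on galaxy properties.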
To investigate the effect of mismatch on RF colour prediction using the TNG300 simulation, we apply the shuffling strategy described in Section 4.1 to the selected sample, shuffling galaxies among subhaloes in fixed subhalo mass bins (0.2 dex) within cells of $5\,h^{-1}\,{\rm Mpc}$. Similar to the SAM, we construct RF models to predict galaxy colour with subhalo properties before and after shuffling and present the results in Figure 7. The left panel shows the colour distribution of the selected TNG sample (red solid) and the corresponding prediction (black solid). The TNG colour distribution shows a narrow red peak at a colour of 0.75 and a broad blue peak around 0.4. The prediction successfully captures the red peak, but the predicted blue peak is narrower and higher than that in TNG, and the number of extremely blue galaxies with colours below 0.3 is underestimated. In the middle panel, the deviation of the prediction is mainly seen at colours below 0.4, where the predicted values are higher, and the prediction at the red end aligns more closely with TNG. The $R^2$ score for the prediction is 0.726, which is comparable to that in the SAM. However, the TNG RF recovers the blue colours less well than the SAM RF does.
The black dashed curve in the left panel represents the RF prediction based on the shuffled sample. Compared to the prediction before shuffling, both the predicted red and blue peaks deviate more from those in the original TNG: the red peak is lower and the blue peak is higher. In the right panel, which shows the two-dimensional distribution of the shuffled prediction and the original TNG, the deviation from equality is also more pronounced, with an $R^2$ score of 0.588. Compared to the SAM results in Figure 5, the mismatch effect shows a similar impact on the RF prediction of the TNG sample.
It is worth noting that the $R^2$ scores of both the SAM and TNG predictions after shuffling are higher than that of the SDSS prediction. Assuming that the SDSS-ELUCID matched catalogue is subject to a similar mismatch effect, it is reasonable to infer that the true connection between galaxy colour and subhalo properties in SDSS is weaker than those in the SAM and TNG before shuffling. This suggests that the galaxy colour in the real Universe may also depend on baryonic processes such as AGN feedback. In the next subsection, we compare the galaxy-subhalo relation in SDSS, SAM, and TNG in more detail in terms of the correlation coefficient between galaxy properties and subhalo properties.

Comparison between SDSS, SAM, and TNG
To further investigate the differences in the galaxy-subhalo relations between the SDSS, SAM, and TNG samples, we calculate the Pearson correlation coefficient between each pair of galaxy properties and subhalo properties. The correlation coefficient is a statistical measure that quantifies the strength and direction of the linear correlation between two variables. It ranges from -1 to 1: values close to 1 (-1) indicate strong positive (negative) correlations, while values close to 0 indicate weak correlations. In Figure 8, we show the correlation coefficients between SDSS or SAM galaxy properties and ELUCID subhalo properties, which label the two axes of each panel. Subhaloes with non-physical formation times at the 0.1/0.3/0.5/0.7/0.9 peak-mass fractions (i.e. whose main branch starts with a fraction of peak mass larger than 0.1/0.3/0.5/0.7/0.9) are excluded when measuring the correlation coefficients related to these properties. The colour coding indicates the correlation coefficients, with reddish colours for positive correlations and bluish colours for negative correlations.
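As a concrete reference for the statistic used here, a minimal implementation of the Pearson coefficient for one galaxy property and one subhalo property is sketched below. The function name `pearson` and the plain-list inputs are assumptions for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences.

    Defined as the covariance of x and y divided by the product of their
    standard deviations; the normalisation factors cancel, so sums suffice.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)
```

Evaluating this function for every galaxy-subhalo property pair gives one coefficient per cell of a matrix like those shown in the figures. It measures only pairwise linear association, so it does not capture the multi-variate dependence that the RF can exploit.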
The second panel displays the correlations of the original SAM sample using the $z_{\rm lim}$-selected subhaloes. Compared to the findings in SDSS, $M_r$ in the SAM sample correlates more weakly with mass indicators, and the correlations with subhalo assembly properties are negligible. In contrast, galaxy colour in the SAM correlates more strongly with mass indicators than that in SDSS. This may be the reason that the RF provides a more accurate prediction of galaxy colour in the SAM. Both $M_r$ and colour in the SAM correlate very weakly with halo assembly properties.
In the corresponding shuffled sample shown in the third panel, all the correlation coefficients involving subhalo mass indicators and environmental properties are almost maintained from the original sample due to the shuffling constraints. Since the correlations relating to subhalo assembly properties are weak in the original SAM, the overall correlations between $M_r$ or colour and subhalo properties are essentially unchanged. However, it is important to note that the correlation coefficient captures the correlations between individual pairs of variables instead of the multi-variate dependence. With the shuffling, the multi-variate dependence between colour and subhalo properties experiences small variations, as indicated by the slightly lower $R^2$ after shuffling compared to that before the shuffling.
The fourth panel shows the correlations in the original SAM sample using all subhaloes above $\log M_{\rm sub} = 10$. This sample contains more low-mass subhaloes than the $z_{\rm lim}$-selected sample. Compared to the second panel, this sample shows tighter correlations between $M_r$ and mass indicators, as well as the merger-tree properties that characterise the merger history, including the first and last mergers. The correlations between colour and mass indicators are weaker in this sample, while the correlations between colour and subhalo assembly properties are stronger. Positive correlation coefficients suggest that red galaxies tend to reside in early-formed subhaloes. Interestingly, the colour correlates more strongly with properties characterising the late formation stage of subhaloes (e.g. the formation time at 0.7 of the peak mass) than with those characterising the early formation stage (e.g. the formation time at 0.1 of the peak mass).
Considering the differences between the second panel and the fourth panel, it is important to acknowledge that generalizing ML models based on the $z_{\rm lim}$-selected subhaloes to the entire ELUCID simulation may introduce biases if the galaxy-subhalo relation also depends on subhalo mass in the real Universe. Observations that include more faint galaxies (and thus low-mass subhaloes), such as DESI, and constrained $N$-body simulations that recover smaller mass and length scales could help in investigating galaxy-subhalo relations in the low-mass range.
We also conduct the same analysis with TNG galaxies. The top panel of Figure 9 presents the correlation coefficients of the selected subhaloes of TNG. $M_r$ is highly correlated with mass indicators and weakly correlated with assembly properties, and both correlations are stronger than those in the SAM. Notably, the $M_r$ correlations with the formation times gradually decrease from the 0.1 to the 0.9 peak-mass fraction, suggesting that $M_r$ depends more on the early formation stage than the late formation stage of the subhalo. This trend is absent in the $z_{\rm lim}$-selected SAM sample but is also present in the SDSS sample. Similar to the SAM, the TNG colour moderately correlates with mass indicators and is nearly independent of assembly properties. The second panel shows the results of the shuffled sample. We find again that the shuffling barely affects the correlations related to mass indicators. Additionally, the correlations between $M_r$ and the formation times at 0.1 to 0.9 of the peak mass remain partially intact, along with the gradually decreasing trend. This is possibly due to the shuffling constraint, which limits the shuffling to within $5\,h^{-1}\,{\rm Mpc}$ cells, where the assembly properties of subhaloes of similar mass may vary little.
In the third panel of Figure 9, which includes all subhaloes above $\log M_{\rm sub} = 10$, the $M_r$ correlations with mass indicators are slightly stronger, and the correlations with assembly properties can be both higher (e.g. for the merger-history properties) and lower (e.g. for the formation times at 0.1 to 0.9 of the peak mass) compared to the selected sample. With a large number of low-mass subhaloes, the colour correlations with mass indicators are much lower than those in the selected sample. However, the colour correlations with assembly properties are higher. Overall, for all subhaloes above $\log M_{\rm sub} = 10$, the colour-subhalo correlations are weaker in TNG than in the SAM.
Comparing the results of the SDSS sample in the top panel of Figure 8 with the corresponding SAM results (second and third panels of Figure 8) and TNG results (top two panels of Figure 9), we find that the $M_r$-subhalo relation in the SDSS is more similar to that in TNG, in terms of the dependence on mass indicators and some of the assembly properties. The SDSS colour-subhalo correlation is weaker than those in both the SAM and TNG, even after shuffling. It is therefore possible that the true underlying colour-subhalo connection in SDSS without the mismatch effect is weaker than those in the SAM and TNG before shuffling, and that baryonic processes such as AGN feedback and other stochastic processes have significant impacts on SDSS galaxies. Further comparison between the SDSS and TNG galaxies can be carried out with the upcoming ELUCID hydrodynamic simulation (HELUCID, Cui in prep), which can provide new insights into the galaxy-subhalo relation in the real Universe.

SUMMARY
Using a catalogue matching SDSS galaxies with ELUCID subhaloes, we employ random forest to predict galaxy magnitude and colour based on a few subhalo properties that characterise subhalo mass, assembly history, and environment. Before training the RF, we select a sample of galaxy-subhalo pairs from the SDSS-ELUCID matched catalogue according to the redshift limit that corresponds to subhalo mass completeness. This eliminates most galaxies with subhaloes of $\log M_{\rm sub} < 11$ and a fraction of galaxies with subhaloes of $11 < \log M_{\rm sub} < 12$. Trained on this selected sample, the RF model can predict $M_r$ reasonably accurately with an $R^2$ score of ∼ 0.8, with deviations mainly arising from extremely bright and faint galaxies. The prediction can recover the luminosity function and galaxy-matter cross-correlation in the range of $-22 < M_r < -18$. Extending the predictions to all ELUCID subhaloes results in slightly larger deviations, especially at the faint end. In contrast, the accuracy of colour prediction is significantly lower, with an $R^2$ score of ∼ 0.3. The RF model fails to reproduce the position of the red peak in the SDSS $z_{\rm lim}$-selected sample, leading to large deviations of the predicted colour values from the true colour. We also train RF models to predict physical galaxy properties such as $M_*$ and sSFR. The prediction performance for $M_*$ is similar to that for $M_r$, and the prediction performance for sSFR is similar to that for colour.
One possible explanation for the low accuracy of colour prediction is the difference between the matched subhaloes and the underlying true subhaloes of SDSS galaxies, or in other words, the mismatch between SDSS galaxies and subhaloes. To investigate this effect, we utilise galaxies from a SAM implemented on ELUCID. We shuffle the galaxies among subhaloes in $\log M_{\rm peak}$ bins of 0.2 dex and in cubic cells of $5\,h^{-1}\,{\rm Mpc}$. RF models are trained on the $z_{\rm lim}$-selected subhaloes both before and after the shuffling. Before the shuffling, the colour prediction is reasonable with an $R^2$ of 0.79, and the bimodal distribution of colour is reproduced. The shuffling lowers the $R^2$ score to 0.66, which is still higher than that of the SDSS sample.
We also perform the same test using galaxies in TNG300. Since the density field of TNG300 is not directly matched to the SDSS, we select random fractions of subhaloes as a function of $\log M_{\rm sub}$ to ensure that the selected subhalo sample reproduces the subhalo abundance in the $z_{\rm lim}$-selected subhaloes of SDSS. Before shuffling, the $R^2$ of colour prediction is 0.73, and it decreases to 0.59 after shuffling. The impact of shuffling in TNG is comparable to that in the SAM: in both cases it only slightly weakens the recovered colour-subhalo connection. This finding suggests that the colour-subhalo connection in SDSS may be weaker than those in both the SAM and TNG, even in the absence of the mismatch effect.
In the end, we measure the Pearson correlation coefficients between $M_r$ or colour and the subhalo properties for SDSS, SAM, and TNG samples. In the SDSS and selected TNG, $M_r$ shows a strong

Figure 1. Top: subhalo number density as a function of redshift at fixed $\log M_{\rm sub}$ for SDSS-ELUCID matched subhaloes (solid) and ELUCID subhaloes in the SDSS region (dotted). A few selected $\log M_{\rm sub}$ bins are shown with different colours. Bottom: the ratio between the number density of SDSS-ELUCID matched subhaloes and ELUCID subhaloes in the SDSS region. The completeness threshold of 0.9 is indicated by the black dashed line.

Figure 2. $M_r$ prediction trained on the $z_{\rm lim}$-selected SDSS-ELUCID catalogue. Top-left: luminosity function of the $z_{\rm lim}$-selected SDSS galaxies (solid blue) and RF prediction separated into training sample (blue dashed) and test sample (blue dotted). The solid black curve shows the predicted $M_r$ obtained by applying the trained RF to all subhaloes in the SDSS-ELUCID sample, and the solid red curve shows the measurement of galaxies in the same sample. The dotted black curve indicates the RF prediction on all ELUCID subhaloes. Top-right: comparison between SDSS $M_r$ (x-axis) and predicted $M_r$ (y-axis) of the $z_{\rm lim}$-selected sample, shown by the blue contours (20%, 40%, 60%, 80%, 95% of the sample). The black solid line and shading indicate the median and the 16th-84th percentile range of the prediction at fixed SDSS $M_r$. Equality is shown by the black dashed line along the diagonal direction. Bottom-left/right: comparison between SDSS and prediction in the training/test sample.

Figure 3. Galaxy-matter cross-correlation of SDSS $M_r$ samples and the predictions. Four $M_r$ bins are shown in four panels. In each panel, the original SDSS cross-correlation is shown by the red solid curve, and the error bars are measured from 16 jackknife samples. The cross-correlation of predicted $M_r$ of SDSS subhaloes (all subhaloes) is shown by the black dashed (dotted) curve.

Figure 4. Colour prediction trained on the $z_{\rm lim}$-selected SDSS-ELUCID catalogue. Top-left: SDSS colour distribution of the $z_{\rm lim}$ sample (blue solid) and RF predictions of the training (blue dashed) and test (blue dotted) sets from this sample. The application of this RF to all SDSS-ELUCID subhaloes is shown by the black solid curve, and the corresponding true SDSS colour distribution of these subhaloes is shown by the red solid curve. Top-right: comparison between SDSS colour (x-axis) and predicted colour (y-axis). Bottom-left/right: comparison between SDSS and prediction in the training/test sample.

Figure 5. Colour prediction based on original SAM and shuffled SAM. Left: SAM colour distribution of SDSS $z_{\rm lim}$-selected subhaloes (solid red), predicted colour of these subhaloes (black solid), and the prediction based on the shuffled sample (black dashed). Middle: comparison of SAM colour and predicted colour for SDSS-matched subhaloes. Right: comparison of SAM colour and prediction based on the shuffled sample for these subhaloes.

Figure 7. Colour prediction based on original TNG and shuffled TNG. Left: TNG colour distribution of selected subhaloes (red solid), prediction of these subhaloes (black solid), and prediction based on the shuffled sample (black dashed). Middle: comparison of the TNG colour and the prediction. Right: comparison of TNG colour and prediction based on the shuffled sample.

Figure 8. Pearson correlation coefficient between galaxy properties and subhalo properties, which label the two axes of each panel. From top to bottom, the samples are $z_{\rm lim}$-selected SDSS galaxies (subhaloes), original SAM galaxies of these subhaloes, shuffled SAM, and original SAM of all subhaloes above $\log M_{\rm sub} = 10$.