Reinterpreting Fundamental Plane Correlations with Machine Learning

This work explores the relationships between galaxy sizes and related observable galaxy properties in a large volume cosmological hydrodynamical simulation. The objectives of this work are to both develop a better understanding of the correlations between galaxy properties and the influence of environment on galaxy physics in order to build an improved model for the galaxy sizes, building off of the {\it fundamental plane}. With an accurate intrinsic galaxy size predictor, the residuals in the observed galaxy sizes can potentially be used for multiple cosmological applications, including making measurements of galaxy velocities in spectroscopic samples, estimating the rate of cosmic expansion, and constraining the uncertainties in the photometric redshifts of galaxies. Using projection pursuit regression, the model accurately predicts intrinsic galaxy sizes and have residuals which have limited correlation with galaxy properties. The model decreases the spatial correlation of galaxy size residuals by a factor of $\sim$ 5 at small scales compared to the baseline correlation when the mean size is used as a predictor.


INTRODUCTION
The difference between the intrinsic and observed (or inferred) size of a galaxy is influenced by several physical processes, including gravitational lensing (Bertin & Lombardi 2006), peculiar galaxy velocities (Strauss & Willick 1995), doppler magnification (Bonvin et al. 2017) and cosmic expansion (Blakeslee et al. 2002).With a sufficiently accurate predictor of intrinsic galaxy sizes, it is possible to construct estimators to study these effects using the size residuals, i.e., the difference between observed and predicted intrisic size.For example, the anisotropies (the dipole) in the galaxy size cross correlations will be sensitive to the galaxy velocities; the cross correlations of galaxy size with foreground galaxies are sensitive to weak gravitational lensing caused by the foreground galaxies (galaxy-galaxy lensing cross correlations); and the relation between the galaxy size and their redshifts can be used to test the redshift-distance relation and hence models of cosmological expansion.Note that these estimators use the size information differently and hence different measurements can be carried out independently with ‹ cschafer@cmu.eduonly weak correlations/contamination between different effects.
Such measurements also hold promise to constrain the uncertainties in the photometric redshifts of galaxies by exploiting the dependence of inferred galaxy size on the estimated distance to the galaxy.The ratio of galaxy-galaxy lensing cross correlations using the galaxy size residuals and galaxy shear is sensitive to the uncertainties in the galaxy redshift estimates, i.e.

Pgγ
´1 9 δ log Dpzsourceq (1) where P is the cross power spectra (or correlation function), g refers to the foreground lens galaxy, λ is the estimated size residual, γ is the galaxy shear, and δ log Dpzsourceq is the error in the estimated distance to the source galaxy (the galaxy for which we measure λ and γ) due to uncertainties in the photometric redshifts.This estimation is similar to the consistency tests that have been done between galaxy shear (spin-2) and CMB convergence maps (spin-0), e.g.Singh et al. (2017), and estimators developed for such studies can be directly applied for comparing lensing measurements using galaxy shear and galaxy size (spin-0).Further, unlike the case of CMB lensing, since the size and shear are measured on the same set of galaxies, the ratio is independent of the galaxy-matter power spectrum which is the primary observable in the galaxy-lensing cross correlations, i.e. constraints on redshift will be almost independent of the cosmological information.Independence from galaxy-matter power spectrum implies that the measurement is also independent of the cosmic variance and will only depend on the measurement noise and intrinsic scatter in the size and shear measurements.Since photometric redshift uncertainties are one of the limiting systematics when analyzing data from photometric galaxy surveys, including galaxy size estimates can potentially lead to significant improvements in cosmological inferences, beyond a simple improvement in statistical errors.
An accurate and precise predictor of intrinsic galaxy size minimizes the scatter in the size residuals, which is the primary source of noise in cosmological measurements.One such size predictor is the fundamental plane (FP) of galaxies (Dressler et al. 1987;Djorgovski & Davis 1987).FP is the relation between the size, R0, surface brightness, I0 and velocity dispersion, σ0, of elliptical galaxies given by log R0 " a log σ0 `b log I0 `c `Nz where the redshift, zi, dependent terms were introduced in (Joachimi et al. 2015) to account for the redshift evolution of the plane (see also discussion in Singh et al. 2021).While studied extensively in the literature in the context of galaxy physics, a careful study of the FP in context of cosmological measurements has only recently gained traction (e.g.Joachimi et al. 2015;Saulder et al. 2019;Singh et al. 2021) and the efficacy of the FP for cosmological analysis is not well established.Singh et al. (2021) performed a detailed study of the FP residuals and the galaxy properties involved in the FP definition.The FP residuals were found to be strongly correlated with the galaxy properties, e.g. the mean of the FP residuals increases with galaxy luminosity.These correlations suggest that the scatter over the FP is not strictly random.Furthermore, the FP residuals are correlated with the galaxy density field, an effect similar to the intrinsic alignments of galaxy shapes.This effect can also be explained by the dependence of the galaxy properties on their environment.Brighter and larger galaxies tend to reside in over-dense regions, though Singh et al. (2021) observed that these galaxies have lower surface brightness.Correlations of FP with these properties explains the correlations of FP residuals with the galaxy density field.For cosmological applications, it is important to understand these correlations of galaxy properties in order to improve the galaxy size predictors and avoid biases in the cosmological inferences.The physical origins of these correlations are still not well understood and better understanding of these effects is important to improving models of galaxy physics.
This work explores such correlations using state-of-theart, large cosmological volume hydrodynamical simulations and performs a more detailed study to understand the correlations between galaxy sizes and several other galaxy properties.The use of a simulation model (IllustrisTNG, described below in Section 2.1) for this purpose enables a more thor-ough exploration of correlations with a wider range of galaxy properties, measured with minimal error.Of course, the ultimate objective is to use these models with observed data, and a focus of this work is to develop novel methods of analysis which will enable this by using the high-resolution information available from the simulation model to guide the fitted model.
Meeting the above objectives motivates the development of novel analysis methodology for incorporating the rich structural information obtained from large simulation models, and this is also a focus of this work.A fundamental question is the following: Suppose that some feature of the galaxy or its environment (e.g., a measure of 3D density) is known to be useful in predicting intrinsic galaxy size, but that such information is only available in a simulation model.Is there a way to exploit the relationship between 3D density and other observable galaxy properties to better predict galaxy size?One may believe that sophisticated supervised learning methods should be capable of discovering the optimal model for the relationship between observable properties and galaxy size, but the complexity of this model may make it difficult to ascertain, and difficult to interpret.We place emphasis here on an approach that balances interpretability and predictive power.The simulation model provides a useful framework around which models can be built that are not of excessive complexity, but achieve strong prediction performance.
The remainder of this paper is organized as follows: Section 2 describes the simulation model utilized, and the galaxy and environment features derived from it.Section 3 presents the statistical tools behind the model and its assessment.Section 4 describes the primary model fit in this work.Section 5 discusses the results and its implications for future exploration.

The cosmological simulation
IllustrisTNG (Nelson et al. 2018;Pillepich et al. 2018b;Springel et al. 2018;Naiman et al. 2018;Marinacci et al. 2018;Nelson et al. 2019) comprises cosmological hydrodynamical simulations that were run with the moving-mesh code Arepo (Springel 2010).The TNG100 simulation at z " 0 was chosen for this study since the simulation exhibits color bimodality that agrees with SDSS data for intermediate mass galaxies (Nelson et al. 2017), as well as consistent correlations with other galaxy properties.Additionally, TNG100 provides a good balance between high resolution and a large cosmological volume.
The box of 75 Mpc/h " 100 Mpc has 2 ˆ1820 3 resolution elements with a gravitational softening length of 0.7 kpc/h for dark matter and star particles.The mass of dark matter and star particles are 7.46 ˆ10 6 Md and 1.39 ˆ10 6 Md, respectively.Additionally, the simulation incorporates various physical process for galactic evolution: radiative gas cooling and heating; star formation in the ISM; stellar evolution with metal enrichment from supernovae; stellar, AGN and blackhole feedback; formation and accretion of supermassive blackholes (Pillepich et al. 2018a;Weinberger et al. 2017).
The dark matter halos were identified using the friendsof-friends (FoF) algorithm (Davis et al. 1985), and then the subhalos were identified using the SUBFIND algorithm (Springel et al. 2001).We employ a minimum stellar mass cut of log 10 pM˚{Mdq " 9 , roughly corresponding to 10 3 star particles (Tenneti et al. 2016;Du et al. 2020).

Galaxy and Environment Properties
This section characterizes the source of galaxy properties used in the predictive models.Some standard quantities utilized, such as size (half-mass radius) and star formation rate arrive directly from the simulation catalog; for more information on these, we refer the interested reader to the simulation model website1 .The velocity dispersion of each individual galaxies was calculated using the velocities of all star particles in a galaxy.Density Measures.In the models below, both 2D and 3D galaxy density information is utilized.To calculate 2D density, galaxy counts are tabulated on a 1000 by 1000 grid, and then smoothed using a Gaussian kernel with a scale of 0.5 Mpc/h.This density, evaluated at the galaxy positions, is stored as delta smooth R. Similarly, for 3D density, galaxy counts are tabulated on a 750 ˆ750 ˆ750 grid, and then smoothed using a Gaussian kernel with scales of 0.5, 1.0, 2.0, and 5.0 Mpc/h.This generates measures of density, at varying scales, for the environment local to each galaxy.
Galaxy Morphological Classification.Galaxy morphology is characterized using the probabilistic dynamical model of Jagvaral et al. (2021).The model makes two physically motivated assumptions.First, it is assumed that the angular momentum of disc stars is approximately aligned with the total angular momentum of the galaxy, while the angular momentum of bulge stars angular is randomly aligned.Second, it is assumed that the orbits of disc stars are approximately circular, while the orbits of bulge stars orbits are elongated or circular.
In order to quantitatively model the aforementioned assumptions, define the following: ‚ jr " j star j circ prq , where jstar is the angular momentum of a single star particle and jstar is its magnitude; jcircprq " is the expected angular momentum for a circular orbit at the same position as that star, where M prq is the total mass (across all types of particles -stars, gas, dark matter) contained within that radius.
‚ cos α is the cosine of the angle between the angular momentum vector of the star particle and the total angular momentum of the galaxy.
Next consider the following model for the distribution of star particles: pstarpjr, cos αq " p1 ´f disc q p bulge pjr, cos αq `f disc p disc pjr, cos αq.(3) Here, p bulge and p disc are the densities (both normalized to integrate to 1) reflecting the probability that a star at a given point in this 2D space belongs to the bulge or to the disc.More details and further investigations of the model can be found in Jagvaral et al. (2021).Finally, mc disk, or the galaxy disk fraction, is calculated by adding up the mass of all of the star particles that were classified as disks and dividing by the total mass.

METHODS
As stated above, this work is focused not only on developing improved models for predicting galaxy size from measurable quantities, but also on providing better understanding of the relationships between these properties.Hence, a methodological focus of this work is to utilize approaches that balance modelling accuracy with scientific interpretability.This section will discuss the use of projection pursuit regression as an alternative to neural networks and other machine learning approaches.Ultimately, the residuals from these fits must be analyzed to determine if there are remaining correlations with intrinsic galaxy properties, hence this section will also discuss methods for such approaches.
The projection pursuit regression (PPR) model (Friedman et al. 1981) is characterized as follows.The response variable Y is modelled as a additive combination of m different nonlinearly-transformed projections of the predictor vector x: The ϵ are assumed to be mean zero, uncorrelated irreducible errors, i.e. scatter around the model fit.Here, the β, the αj, and the fj are estimated from the available training sample.
The αj represent the m different projections of the original predictors xi that are utilized by the model.This approach avoids the curse of dimensionality by only considering an additive combination of what could be viewed as designed features fj `αT j xi ˘for j " 1, 2, . . ., m.The model has the flexibility to learn the linear combinations of the predictors αj, in tandem with the nonlinear transformation fj, which are the most useful for predicting the response.The fj will typically be estimated via standard nonparametric regression approaches such as with a smoothing spline (Reinsch 1967).Such approaches are well-suited to one-dimensional regression problems such as this since they can flexibly fit to a wide range of relationships (here, between α T j xi and the response).Such fits are smooth, but allow the data to dictate the shape of the fit, i.e., no parametric form is assumed.
It is instructive to contrast the projection pursuit model with a fully-connected single layer neural network model wherein the user fixes an activation function ϕ, a simple nonlinear transformation which is applied to each (of typically many) linear combinations of the predictor vector.The parameters learned from the data are solely the values of the weights wj applied in these linear combinations.Projection pursuit is able to use smaller m by exploiting the flexibility in the tailored, nonlinear transformation fj that is applied to each.This leads to improvements in interpretability.
The model is fit using a two-level iterative approach.An outer loop consists of running over k " 1, 2, . . ., m.For fixed k, the residuals from the fit on the other m ´1 components is calculated: and then β k , f k , and α k are found such that This relationship between the ri and pβ k , f k , α k q uses an inner loop which alternates between estimating β k f k and α k .Heuristically, at this step the goal is to determine how to best fit to the portion of response that is unexplained by the other m ´1 terms in the model.With each update to a pβ k , f k , α k q, the other components are eventually reconsidered as the outer loop is repeated until convergence is reached.This procedure referred to as backfitting (Breiman & Friedman 1985).
Comment: Implementation.In this work, models are fit using the function ProjectionPursuitRegressor found in of Scikit-learn.(Pedregosa et al. 2011).Smoothing splines (Wahba 1990) are used in each one-dimensional nonparametric fit fj, of degree either two or three.(Initial models are fit using cubic splines, but in the final model degree will be chosen as part of the procedure described below.)The seemingly-redundant parameters βj are included in the model reflecting the custom of ProjectionPursuitRegressor and other software.This was not a part of the original formulation of Friedman et al. (1981) but allows for extra generality in simultaneous fitting of multiple response vectors using the same collection of m ridge functions.

Preliminary Models
As an initial demonstration, a model is fit with log radius as the response, and log velocity dispersion and i-band magnitude as predictors, to mimic the classic fundamental plane model.Here, m " 1, a special case of projection pursuit called single index regression.Figure 1 illustrates the results.The left panel shows the weight placed on each of the two predictors to form the first projection, i.e., p α11 " 0.69 and p α12 " 0.72.The horizontal axis in the right panel shows the value of this projection for all observations in the training set.The estimated form for β1f1 is shown as the solid curve on the panel.This is fit using a cubic spline.The quality of this simple fit is clearly poor (with RMSE on a test set of 0.411), with deficiencies partly due to the range of different galaxy types being fit.A primary motivation of this work is to build models for intrinsic galaxy size that can be used in cases where only photometric observations are available.Hence, for comparison, next is fit a model that includes the griz magnitudes as the features, along with mc disk and delta smooth R, described above in Section 2.2.Each feature is individually shifted and scaled to have mean zero and standard deviation one prior to the fit.When m " 1, the RMSE on a test set is 0.290, but this improves to 0.257 when m " 4. The results are shown in Figure 2. It is notable that the model appears to place little weight on mc disk and delta smooth R, but, in fact, excluding these two predictors increases the test set RMSE to 0.272.
Comment: Splitting the Data.Throughout this work, when data are divided into training, test, or other sets for the purposes of model fitting and validation, splits are done by pixels formed in a 5 ˆ5 ˆ5 grid that covers the full simulation box of 75 Mpc/h.This is done to mitigate issues that could result from galaxies in close proximity sharing important physical information, and hence inappropriately influencing the quality of the model fit.

Residual Analysis
The RMSE values reported for each model only partially reveal important information regarding the quality of the fit, because minimizing prediction errors is not the primary objective of this work.To serve the cosmological motivations, the ideal model would leave no remaining relationship between the residuals from the fit and any properties of the galaxy and its environment.In other words, the model would predict intrinsic size, and the difference between the measured size and the fit size would encode useful information regarding gravitational lensing, peculiar galaxy velocities, and so forth.To this end, study of the property of the model residuals is crucial.
One step in this direction is to plot residuals versus various galaxy properties, and look for patterns and/or trends.Figure 3 shows the result of comparing galaxy mass with the residuals from the model fit above to photometry-based properties.The right panel shows the clear evolution in residuals with galaxy mass, where the blue curve shows mean residuals in each of 20 bins.Error bars shown are calculated on each of these means using a jackknife procedure, described below.Similar comparisons can be made with other features, including those which are included in the model.This is an important step in revealing deficiencies in the model fit.
Figure 4 assesses the degree of spatial correlation in the residuals.Such spatial correlations have been analyzed in the previous cosmological studies using fundamental plane of galaxies (Joachimi et al. 2015;Singh et al. 2021).These correlations exist because the galaxy formation and evolution involves complex physical processes that depend not only on the galaxy itself but also its environment.These contaminate the estimators that are used in cosmological measurements of interest using size residuals and it is desirable to null them before cosmological analysis.Again, in an ideal model there would be no remaining spatial correlation in the residuals from the model fit.However, in figure 4 we see strong correlations between the size residuals and the surrounding galaxy density field.Such signal is not totally unexpected as the size residuals are a non-linear combination of galaxy properties which are correlated with the environment (see Singh et al. 2021, for detailed explanation and analysis).This correlation of galaxy sizes with the local density field is very similar to the intrinsic alignments effect for galaxy shear.Unfortunately, this implies that the current size estimators cannot be used to perform the cosmological measurements using auto-correlations, but they are suitable for cross correlations, similar to galaxy-shear cross correlations.
Comment: Jackknife Errors.The classic jackknife can be shown to be a reliable estimator of the true variance of p θ, in the case where the sample is drawn independent and identically distributed from some population.In this work, this assumption is clearly invalid due to dependencies present from nearby galaxies.Hence, the jackknife procedure is adapted to one in which the 3D simulation box is divided into a grid of 7 3 pixels, with each pixel left out in one iteration of the procedure.This reduces the bias that would result from taking the standard jackknife approach of leaving out one observation (galaxy) at a time.

Kernel PCA-Based Enhancement of PPR
As described above, the PPR model is built upon linear combinations of the supplied collection of features chosen to be optimal for predicting the targeted response variable.Hence, this model can be viewed as a supervised companion to principal components analysis (PCA), wherein a new representation of data vectors is constructed in an unsupervised manner, with a goal of finding linear combinations with maximal variance.This is motivated by the heuristic that directions in the original space along which there is the greatest variability are the projections that encode the most useful information.Thus, standard PCA used in combination with projection pursuit regression would be redundant, as nothing could be gained by considering a simple rotation of the features in Euclidean space.There exists, however, a nonlinear extension of PCA, called Kernel PCA (Scholkopf et al. 1998) which provides a potentially useful enhancement to the space of projections under consideration by the PPR model.
It is instructive to first consider the math behind standard (linear) PCA.For additional detail, see Hastie et al. (2009).Let X denote the n by p matrix whose rows are the individual feature vectors.Assume that the variables have been mean-centered so that each column of X has sample mean zero, hence X 1 X{n is the sample covariance matrix for these data.Then, the principal components are found as the eigenvectors of X 1 X{n, or, equivalently, of X 1 X.Denote these eigenvectors as v1, v2, . . ., vp and the corresponding eigenvalues with λi.PCA can be interpreted as creating a new coordinate system, or basis, within which a data vector can be represented so that si " Xvi provides the positions of all n observations along the i th axis in the new coordinate system.It follows that X 1 Xvi " λivi , XX 1 Xvi " λiXvi and XX 1 si " λisi.(8) Since }si} 2 " λi, the standardized versions si " si{ ?λi will be orthonormal and hence are the eigenvectors of the Gram matrix XX 1 .The conclusion is that the position in the new coordinate system can be found directly from the eigenvectors of the Gram matrix.
In Kernel PCA, this form of the Gram matrix is generalized such that the pi, jq entry is Kpxi, xjq, where K is a user-chosen kernel function which measures similarity between vectors.Common choices for the kernel function include the radial basis function kernel Kpx, yq " exp `´γ}x ´y} 2 ˘(9) and the sigmoid kernel Both of these examples illustrate the important role of tuning parameters in the choice of a kernel, e.g., through the specification of γ.
Kernel PCA also maps the observations into a new space, with the hope that a useful lower-dimensional representation will result.Let ϕpxq denote the position of x in the new space defined by Kernel PCA.The i th coordinate is found via the Nyström extension (Nyström 1930), For x k included in the training set, ϕipx k q " ?λi s ik .The connection with PPR is as follows: A linear combination of the predictors, now using the Kernel PCA representation, is where s¨j holds sij for i " 1, 2, . . ., n.(In this expression, λi are absorbed into the individual αi without loss of generality.)The heuristic behind this is that varying the tuning parameter γ that characterizes the kernel function leads to a wide range of nonlinear transformations of the predictor vector.The model can achieve a better fit if it has a larger class of intelligently-chosen directions to search over.The vector ϕpxq can be of dimension up to n, while the original predictor was limited to p dimensions.The choice of γ allows for great flexibility in the formation of this new representation.An analogous idea is often employed with a standard approach to classification, support vector machines (SVM) (Cortes & Vapnik 1995).The basic SVM approach searches for a linear separator in the feature space to distinguish the two classes under consideration.Of course, a linear separator in the original feature space is rarely an adequate classifier.But by projecting the features into a much higherdimensional space, the potential for finding a useful linear separator is greatly enhanced.This is often referred to as the kernel trick.
Figure 5 illustrates the potential.In this fit, Kernel PCA was used to create a nonlinear transformation of galaxy properties into a ten-dimensional space.Galaxy properties utilized were photomery-based, as in the previous fit, but now in the PPR model these features are supplemented with those derived from Kernel PCA.The sigmoid kernel was used.The figure shows how the PPR model is able to exploit this new representation to find a direction in the new space along which the response evolves.The dashed lines show contours which the model is fitting to have constant log size.It is important to keep in mind that this shows one such projection; in fact, m " 4 in this model, so there are four such projections through the ten-dimensional space created by Kernel PCA.The RMSE on a test set is reduced to 0.240.

Incorporation of Auxiliary Information
In the application of interest, x will be decomposed into the pair x " rx obs , xauxs.Here, x obs consists of the observable properties of the galaxy, i.e., quantities that can be measured or adequately-estimated using solely photometry.The variables in xaux will be additional properties of a galaxy that are not observable, but are believed to encode information useful for predicting its size.These include properties such as three-dimensional density information and the galaxy's location in the central or satellite region of its cluster.This auxiliary information is unobservable in photometric surveys, but will be available in a high-resolution simulation model such as the Illustris model used in this study.The objective is here is to exploit this additional information to improve predictions of intrinsic galaxy size.
The approach developed here will build off the Kernel PCA-enhanced PPR model described above.First, note that in Equation 12, the s¨j are not dependent on the particular x for which the prediction is sought.These s¨j represent the directions found in the new, Kernel-PCA derived representation of the features.Since they are not dependent on x, these can be learned from a training set that has full access to both x obs and xaux, e.g., the information generated from a simulation model.The dependence on x arises only in the kernel function Kpx, xjq.The additional complication of this approach comes from the need to approximate Kpx, xjq from Kpx obs , xjq.
Hence, in the first step the auxiliary training set is constructed from the simulation model, i.e., for these na galaxies both x obs and xaux are available.From this set, a lowdimensional representation is learned using the Kernel PCA approach described above.The tuning parameters of the chosen kernel function will become tuning parameters for the final prediction model.The result of this first step is a set of vectors (directions) s¨j for j " 1, 2, . . ., na.Again, these directions can exploit the rich information available in the features in both x obs and xaux.
For the next step, recall from above that the position of any x in this new space can be found as The challenge at this point is that on the actual, observed data, only x obs is available.To account for this, the kernel function K will be approximated via Here, this approximation will be achieved using a neural network learned from the auxiliary training set derived from the simulation model.
The natural question at this point is the following: What is gained by the incorporation of the auxiliary information?I.e., is it not possible to simply model the galaxy size as a function of x obs directly through a model?This is definitely possible, but this approach is exploiting the additional structure available in the auxiliary information.The auxiliary variables are demonstrably quite powerful source for making these predictions.This information is passed on through the vectors s¨j which are learned from the auxiliary training set.
A second evident question is as follows: Would it be better to fit one or more models that learn the relationship between x obs and xaux, use these to impute the unavailable xaux vectors, and then use these in a model trained on the auxiliary training set?The approach advocated for here avoids the fitting of several models, or one model with a vector-valued response, and instead focuses directly on approximating a single, real-valued quantity which encodes the important information, namely the kernel function evaluated at relevant pairs.
In Section 4 below, results are presented from fitting using this procedure.

MODELS FOR GALAXY SIZE
This section will present the results from the fitting of a more sophisticated model for intrinsic galaxy size.The approach will follow what is outlined in Section 3, with a mix of features from simulation model and photometric sources, all used in an effort to build an improved model for the size.The features based on photometry are as above: The griz magnitudes, mc disk, and delta smooth R. (These latter two quantities are described in Section 2.2.)The auxiliary features extracted from the simulation model are as follows: galaxy mass, velocity dispersion, star formation rate, 3D density measures, and central versus satellite classification of the galaxy's location within its cluster.

Model Pipeline Architecture
For the purposes of this modelling pipeline, the data are divided into three sets.(As mentioned above, groups are formed by pixel.)First, Set 0 consists of those galaxies used to create the Kernel PCA representation.Here, this is done using the sigmoid kernel (Equation 10) with c " 1 and with γ the first of the tuning parameters to be optimized.(The approach to setting the values of the tuning parameters is described below.)Figure 6 depicts the first two dimensions in this representation, showing the important relationship with galaxy size.Ultimately, the number of dimensions which are used in the model is another tuning parameter.
In the next step, the observations in Set 0 are further divided into a training and test set for the purposes of predicting the kernel function when evaluated at a px obs , xq pair.This model is fit using a fully-connected, four-layer neural network, with 1000 nodes per layer.Learning is allowed to run for 200 epochs, with learning rate fixed at 0.001.The dropout rate (applied after each layer) and the mini-batch size in the utilized ADAM optimizer are additional tuning parameters.Figure 7 shows the performance of this model on the test set in the final chosen model.
The role of the aforementioned model is to allow for the prediction of the value of Kpx obs , xq for pairs where x obs is not in Set 0, but x is from a galaxy which is in Set 0. To understand this step, it is useful to consider an updated  version of Equation 15 above, as follows: Here, p ϕipx obs q is the position in the i th dimension of the Kernel PCA of the galaxy with observed properties x obs , when the approximated Kernel is utilized.One can imagine that the Set 0 galaxies comprise a collection of simulation modelderived "reference points" to which the galaxies outside Set 0 are compared, albeit using an approximation to the Ker-nel function.The values of p ϕipx obs q is calculated relative to these reference points.
At this stage, the information is available for fitting the projection pursuit regression model that relates galaxy size (on the log scale) to the mix of observable and Kernel PCA-generated features.Set 1 is the training set used for this model, while Set 2 is held out as a test set.In this model, the number of Kernel PCA features, the degree of the spline functions, and the number of ridge functions are tuning parameters.

Selection of Tuning Parameters
A challenging aspect of fitting a model of this complexity is the number of tuning parameters that result.In this pipeline, some model components are fixed to values that are deemed to be reasonable, e.g., the use of 1000 units in each layer of the neural network, and the choice of the sigmoid kernel.Other tuning parameters are set via randomization at the outset of the pipeline: ‚ The value of γ in the Kernel PCA procedure, with log 10 pγq chosen uniformly on the interval p´5, ´2q.
‚ The dropout rate used in the neural network model, chosen uniformly between 0.1 and 0.5.(The same dropout rate is used for all four layers.) ‚ The batch size used in the neural network fitting algorithm, set to 16, 32, 64, or 128.
‚ The number of Kernel PCA dimensions used in the PPR fit, set to 10, 15, or 20.
‚ The degree of the spline functions in the PPR fit, set to either 2 or 3.
The final tuning parameter is the number of ridge functions used in the PPR model.With each of the above five parameters fixed, this is varied from 2 to 16, with a cross-validation approach used to choose its value.Ultimately, the figure of merit used in choosing the global set of tuning parameters is the minimal MSE within this cross-validation procedure.The values chosen by this procedure are as follows: γ equals 0.00535, dropout rate equals 0.22, batch size equals 64, there are ten retained KPCA dimensions, and quadratic splines are used in the PPR model.
Comments: As an additional hedge against overfitting, this cross-validation procedure uses the one-SE rule (James et al. 2013), wherein the value of the associated tuning parameter is set to the smallest such value which yields a figure of merit within one standard error of the best performing choice.The motivation behind this approach is that one should only choose a model if there is convincing evidence that the additional complexity is warranted.Also, note that this procedure avoids using the test set (Set 2) in the se-lection of the tuning parameters, which helps to preserve the role of the test set as an ultimate tool for assessing the performance of the model.

Model Performance
Figures 8 and 9 show the eight ridge functions fit in this model.The results show that the contrasts in the magnitudes (i.e., the colors) are clearly the most important in predicting the response values.Each of the kernel PCA-derived predictors receive small weight, but they still play a crucial role in improving the predictions, as evidenced by the reduction in the RMSE on the test set error to 0.231.Figure 10 compares the model predictions with the true galaxy size, for observations in test test set, i.e., Set 2.
Figure 10 depicts a fair amount of scatter around the fitted line, but a central question is the following: To what extent does this remaining scatter correlate with physical properties of the galaxies, i.e., to what extent can the remaining scatter be attributed to intrinsic properties of the galaxies?Ideally, by incorporating these physical properties into the model we have reduced any remaining such correlation, and hence the residuals largely originate from physical effects which occur between the galaxy and when it is observed.Figure 11 explores this by comparing the residuals with galaxy features.It is observed that correlation in the residuals with each of the physical properties is largely eliminated.
Figure 12 shows how the correlation of residuals from the fit vary across scales.While the reduction in the amount of spatial correlation is encouraging, there remains a clear, negative correlation on the smallest scales.The pattern of correlation in this fit is consistent with that seen in Figure 4, indicating that the additional complexity introduced in this final model did not help to further reduce this correlation.This correlation at small scales is expected, since the galaxy physics at this scale is very difficult to capture.While our relatively simple model managed to reduce the correlation by a factor " 5, more sophisticated ML architectures may be needed in order to probe these small scale galactic physics, as demonstrated in Jagvaral et al. (2022), where adding graph convolutional layers to a neural network removed remaining small scale correlations.Such approaches suffer, however, from reduced interpretability due to the convolutional abstraction of the inputs.

DISCUSSION
This work demonstrates the potential of supervised learning approaches that are designed to emphasize interpretability for yielding accurate predictions of intrinsic galaxy sizes.The residuals from such a fit, estimates of the difference between the intrinsic and observed size, hold a wealth of useful cosmological information regarding topics such as gravitational lensing and the peculiar velocity of galaxies.These techniques could also potentially lead to improvements in uncertainty quantification on photometrically-estimated redshifts.The present work focuses on the use of data generated from the simulation model Illustrus-TNG, in an effort to explore the limits of the potential for such models, while also demonstrating a novel prediction approach that incorporates the learned structure in the high-resolution information only available in the simulations.
The final model in this work serves as an illustration of the potential of the developed methods, but also of the directions for further improvements.The results demonstrate how photometry-only samples, in conjunction with highresolution simulation models, could be combined as part of a framework to improve intrinsic galaxy size predictions.The model fits yield a relatively interpretable picture of the way in photometrically-derived properties relate to galaxy size.The results make it clear that magnitudes are useful predictors of galaxy size, provided a sufficiently complex model form is allowed.The residuals in the fits for intrinsic size show minimal correlation with some key physical galaxy properties, indicating that the models are successfully capturing the key relationships.It is clear, however, that there remains correlation with local environment, and that photometric data are not sufficient for capturing this correlation.This could possibly be improved by using more complex supervised learning methods.The hope is that the gain in interpretability from the proposed approach outweighs the drawback of this remaining correlation.
A next step would be to explore the use of such approaches with real, photometric survey data.In such an analysis, the steps taken in this work would be repeated, with Set 0 still built from the simulation model.Set 1 should consist of a sample of real galaxies with available spectroscopic data, in order for reliable measures of galaxy size and other observable galaxy properties to be available in the training of the PPR model.This model could then be applied to a photometry-only sample to produce predictions of galaxy sizes.This approach would achieve the simultaneous goals of using a modelling approach that produces interpretable results, but also exploits the information available in the high-resolution simulation model.A comparison is made between the baseline correlation in the data (in red), and the correlation in the residuals after the model is fit (in blue).While there is still remaining spatial correlation, especially on smaller angular scales, the degree of correlation has been reduced dramatically from that present in the original data.

Figure 1 .
Figure 1.Illustration of the results from the first, simple fit.Here, m " 1, and there are only two predictors, log velocity dispersion and i-band magnitude.The left figure shows the weight placed on each of these two predictors, while the right shows the non-linear function applied to this linear combination.

Figure 2 .
Figure 2. Illustration of the results from the second fit.Here, there are six predictors (all based on photometry), and m " 4. The vertical axes of right plots is labelled "Residuals" because the figure shows the fit to what remains after the other m ´1 " 3 components is subtracted off.

Figure 3 .
Figure 3. Evolution of the residuals with galaxy mass.The mean residual in bins is shown in the blue curve.Error bars are constructed using a jackknife procedure.

Figure 4 .
Figure 4. Measurements of correlations between the galaxy size residuals and the galaxy density field.This effect is similar to the intrinsic alignments effect for galaxy shear.It arises because size residuals are a non-linear combination of galaxy properties which are correlated with the galaxy environment.

Figure 5 .
Figure 5.This figure illustrates how the inclusion of Kernel PCA coordinates enhances the fit.The position of each galaxy along the first and third Kernel PCA coordinates is shown, colored by the galaxy size.The dashed contours show lines of constant galaxy size, as fit by the PPR model.

Figure 6 .
Figure 6.The first two dimensions created by the Kernel PCA transformation.Each dot represents one galaxy, and color reflects the log size.The evident relationship between position in this space and galaxy size suggests that Kernel PCA is picking up important physical information.

Figure 7 .
Figure 7.Comparison of the actual and predicted kernel values when using a neural network model.

Figure 8 .
Figure 8. Illustration of the first four ridge functions fit in the final model, following the optimization of the tuning parameters.It is evident from the weights placed on the griz magnitudes that these are the most important in predicting the size of the galaxy.The kernel-PCA derived factors are also influencing the fit, however, and help to reduce the RMSE error on a test set to 0.231.

Figure 9 .
Figure 9. Illustration of the second four ridge functions fit in the final model, following the optimization of the tuning parameters.It is evident from the weights placed on the griz magnitudes that these are the most important in predicting the size of the galaxy.The kernel-PCA derived factors are also influencing the fit, however, and help to reduce the RMSE error on a test set to 0.231.

Figure 11 .
Figure 11.Residuals versus physical properties of the galaxies.The color scale reflects the density of residuals in that bin.The mean residual in a vertical slice is shown by the solid line, with error bars calculated by a jackknife procedure.The model has largely achieved its objective of removing any remaining correlation with these quantities.

Figure 12 .
Figure12.Spatial correlation in residuals from fit.A comparison is made between the baseline correlation in the data (in red), and the correlation in the residuals after the model is fit (in blue).While there is still remaining spatial correlation, especially on smaller angular scales, the degree of correlation has been reduced dramatically from that present in the original data.