Subhalo effective density slope measurements from HST strong lensing data with neural likelihood-ratio estimation

Examining the properties of subhalos with strong gravitational lensing images can shed light on the nature of dark matter. From upcoming large-scale surveys, we expect to discover orders of magnitude more strong lens systems that can be used for subhalo studies. To optimally extract information from a large number of strong lensing images, machine learning provides promising avenues for efficient analysis that is unachievable with traditional analysis meth-ods, but application of machine learning techniques to real observations is still limited. We build upon previous work, which uses a neural likelihood-ratio estimator, to constrain the effective density slopes of subhalos and demonstrate the feasibility of this method on real strong lensing observations. To do this, we implement significant improvements to the forward simulation pipeline and undertake careful model evaluation using simulated images. Ultimately, we use our trained model to predict the effective subhalo density slope from combining a set of strong lensing images taken by the Hubble Space Telescope . We found the subhalo slope measurement of this set of observations to be steeper than the slope predictions of cold dark matter subhalos. Our result adds to several previous works that also measured high subhalo slopes in observations. Although a possible explanation for this is that subhalos with steeper slopes are easier to detect due to selection effects and thus contribute to statistical bias, our result nevertheless points to the need for careful analysis of more strong lensing observations from future surveys.


INTRODUCTION
The standard ΛCDM cosmological model has been in remarkable agreement with large-scale observations, but there is scarce evidence for the nature of dark matter on small (subgalactic) scales.Because the nature of dark matter (DM) remains elusive, examining various dark matter models using small-scale cosmological observables becomes crucial.One of the promising observables used to study DM is subhalos, which are small dark matter clumps gravitationally bound to a larger halo.Probing the properties of these subhalos can potentially shine light on the nature of DM, as subhalos exhibit different properties under alternate DM models beyond cold dark matter (CDM).For instance, warm dark matter (WDM) models predict a smaller number of low-mass subhalos and more cored subhalo density profiles compared to CDM (Bode et al. 2001), while self-interacting dark matter (SIDM) models generally predict more cored subhalo profiles than that of the CDM model (Spergel & Steinhardt 2000).
Because low-mass subhalos are observed to lack luminous matter (Fitts et al. 2017;Read et al. 2017;Kim et al. 2018), they are typically probed through their gravitational effects.
⋆ yzhang7@g.harvard.edu Strong gravitational lensing, a predicted phenomenon from General Relativity, is a powerful way to constrain subhalo properties.In strong gravitational lensing, light emitted by a distant source gets deflected by the gravitational field of a massive structure (lens), and properties of the lens and its substructure can be inferred by analyzing the images of the source light.In this paper, we will focus on studying subhalos in the lens galaxy of strong lensing systems in which both the lens and background source are galaxies.
To date, there have been a few claimed detections of substructure in galaxy-galaxy strong lensing observations (Vegetti et al. 2010(Vegetti et al. , 2012;;Hezaveh et al. 2016b).Existing analyses that use observed strong lensing images to constrain DM models primarily rely on modeling individual (often the most massive) substructure in a lens system (Vegetti et al. 2014;Ritondale et al. 2019;Şengül et al. 2022).While useful, direct substructure modeling is computationally costly, and it is often limited to capturing the effect of relatively massive subhalos.Even though the CDM paradigm predicts a large number of subhalos with smaller masses, they are difficult to probe through traditional analysis methods because the inclusion of more subhalos makes sampling of the joint parameter space prohibitive.As a result, it is important to explore alternative analysis methods that can more optimally incorporate information from the large population of smaller subhalos.
To leverage the collective effect of subhalo populations on strong lensing images, there has been significant work done to obtain statistical constraints from subhalos (Dalal & Kochanek 2002;Hezaveh et al. 2016a;Cyr-Racine et al. 2016;Díaz Rivero et al. 2018a;Birrer et al. 2017;Díaz Rivero et al. 2018b;Daylan et al. 2018;Brennan et al. 2019;Gilman et al. 2018;Cyr-Racine et al. 2019;He et al. 2022).In particular, machine learning has emerged as a promising candidate to analyze subhalos in strong lensing images for its ability to efficiently and implicitly marginalize over a large parameter space.With the upcoming large-scale imaging surveys, the number of observed strong lensing systems is expected to increase significantly (Laureijs et al. 2011;Collett 2015;McKean et al. 2015;Bechtol et al. 2019;Jacobs et al. 2019;Huang et al. 2021;Storfer et al. 2022).Machine learning has a much-needed advantage that can make inference on these large datasets feasible.
Several deep learning techniques have been demonstrated to be effective at constraining the subhalo mass function using simulated strong lensing images (Brewer et al. 2016;Brehmer et al. 2019;Ostdiek et al. 2022b,a;Anau Montel et al. 2022;Wagner-Carena et al. 2023), but so far, there has been no successful attempt at applying them to real observations.The main challenge of applying deep learning methods to observations comes from the need for the training set to closely resemble the test set, as deep learning models are known to struggle in the presence of a distribution shift between training and test sets (Recht et al. 2018(Recht et al. , 2019)).Most of the previous works on machine learning applications to strong lensing made simplifying assumptions in the forward modeling pipeline of the training set in order to demonstrate the potential suitability of a method.However, for the machine learning model to be deployed on observations, its training set needs to incorporate all possible complexities that exist in the observed data.
In this work, for the first time, we analyze subhalo properties in real strong lensing observations with a machine learning technique.We build upon the method developed in Zhang et al. (2022) by adding multiple layers of complexities in the forward pipeline for the training set.Through training, our model learns to infer the effective subhalo density slope (directly related to the subhalo concentration), a promising observable proposed by Şengül & Dvorkin (2022) for distinguishing DM models.Several other works have also shown that the concentration of subhalos is an effective probe of DM (Minor et al. 2021a,b;Amorisco et al. 2022).Using our trained model on real observed strong lensing images, we found a subhalo density slope steeper than those of subhalos predicted by the CDM model.This measurement is consistent with previous works, which also found unexpectedly large subhalo concentrations (Minor et al. 2021b;Şengül & Dvorkin 2022).This paper is organized as follows.In Sec. 2, we discuss details of the forward model used to generate mock strong lensing images.In Sec. 3, we summarize the deep learning technique that we use for inference, discuss our inference procedure, and outline our neural network architecture.In Sec. 4, we evaluate our trained model and compare the model predictions on the observed data with those under the CDM model.We conclude with a summary of our results in Sec. 5, and discuss the implications of our work.

DATA
We generate simulated lensing images to train our neural network and compare our model predictions with ground truths on mock images post-training to ensure training quality.At inference time, we apply the trained model to a set of observed lensing images from the Hubble Space Telescope (HST).We discuss details of both the mock data and the real HST observations below.

Mock data generation
To generate our mock strong lensing images, we use the software package lenstronomy (Birrer et al. 2015;Birrer & Amara 2018).In order to match the HST post-drizzling image configuration, we generate (100×100) pixel 2 images, with a resolution of 0.04 ′′ per pixel.We build upon the simulation pipeline used in Zhang et al. (2022) and include significantly more complexities in the modeling process so as to make the images as similar to real observations as possible.Modeling a strong lens system requires several ingredients in the forward model: a source galaxy, a main (host) lens galaxy, a population of subhalos and light-of-sight (LoS) halos.In addition, we specify the instrumental configuration and image pre-processing of the mock images in the forward simulation.The distributions of parameters governing the lens models of our simulated images are summarized in Table 1.

Source and main lens
In a galaxy-galaxy lens system, light rays of a background source galaxy get gravitationally deflected by a foreground lens galaxy en route to the detector.Strong gravitational lensing specifically refers to the case where the projected surface mass density of the lens is greater than the critical surface density Σcrit.In this scenario, the bending of source light is significant enough to result in characteristic arcs of light in observed images.
To simulate the source light, we use galaxy images taken by the HST Cosmic Evolution Survey (COSMOS) (Scoville et al. 2007;Koekemoer et al. 2007) processed by paltas (Wagner-Carena et al. 2023).The paltas package takes a sub-sample of the HST COSMOS survey galaxy images (Mandelbaum et al. 2012(Mandelbaum et al. , 2014) ) and filters out suitable source candidates.To simulate the source for each mock image, we randomly draw a galaxy image from the COSMOS catalog and randomly vary the rotation angle and the source coordinates (xsource, ysource).From the 2,262 available source galaxies, we use 2,163 (96%) for the training set, 70 (3%) for the validation set, and the remainder (1%) for testing and evaluation.
We model the main-lens mass distribution using an elliptical power law (EPL) profile (Barkana 1998).The convergence of an EPL profile at position (x, y) on the lens plane is given as follows: where θE is the Einstein radius, q is the minor/major axis ratio, x ϕ , y ϕ are positions on the axes aligned with the major and minor axes, and γ is the power-law slope of the mass distribution.To model each main lens, we draw its γML (γ of the main lens) from N (2, 0.1) and truncate the tails of the normal distribution so that the range of possible values is bounded by 1.1 and 2.9.Slope values outside of the (1, 3) interval lead to nonphysical or divergent mass profiles and are thus not included in our modeling.Adding variations in γML simulates the natural stochasticity in lens density profiles that deviate from an isothermal profile (γ = 2).Note that ϕ indicates the angle between the major/minor axes and the fixed (x, y) axes of an image.The inputs into lenstronomy are the ellipticity moduli, which are directly related to q and ϕ: In addition, we add multipole moments m = 3, 4 to the EPL lens mass distribution in order to more realistically model the mass distribution of more complex lenses that may deviate from an elliptical profile.We also include an external shear parametrized by γ shear,1 and γ shear,2 (Keeton et al. 1997).
The shear parameters γ shear,1 and γ shear,2 are the diagonal and off-diagonal terms of the shear matrix, respectively.
In Zhang et al. (2022), it is assumed that the light produced by the lens galaxy has already been subtracted from the original observed image through a coarse modeling process.However, for real observed images, removing the lens light may involve imperfect modeling and high computational cost.To bypass this assumption, we include lens light in our mock image modeling.We assume that the center of the lens light coincides with the center of its mass density profile and that the lens light takes on an elliptical Sérsic profile (Sérsic 1963), with the brightness parametrized as: where nsersic is the Sérsic index, bn sersic ≈ 1.999nsersic −0.327.
Here, rsersic is the half-light radius, and I0 is determined by the apparent magnitude (Birrer & Amara 2018).We draw the apparent magnitude of the lens light from a uniform distribution between 17 and 19.We choose this range to be consistent with the apparent magnitude measurements of lens galaxies in the observed images used in our analysis (Auger et al. 2009), which are discussed in detail in Sec.2.2.In each mock image, we vary all parameters governing the lens model, including its center position, Einstein radius, shear parameters, apparent magnitude, and its ellipticity.The variation ranges of these parameters are summarized in Table 1.In simulating our training set images, we take into account the spectroscopic redshifts of the real HST observations used during inference.To simulate each image in our training set, the source galaxy redshift is drawn from a uniform distribution of zsource ∼ U(0.5, 0.7), while the lens galaxy redshift is drawn from a uniform distribution of z lens ∼ U(0.15, 0.25).These redshift ranges roughly match with those of the real observations that we use for inference.We deliberately chose to work with systems with relatively low source redshifts because they align better with the redshifts of the COSMOS galaxies that are used in our modeling pipeline, minimizing the difference between our simulated images and the real observations.

Subhalos and line-of-sight halos
Aside from the main lens, the observed strong lensing images of the source light are affected by additional structures: subhalos, which are small halos residing inside the main host halo, and line-of-sight (LoS) halos, which are located along the line-of-sight between the source galaxy and the observer.If these (sub)halos are found within the bright lensed arcs in the observed image, they can leave detectable perturbations on the observed images.Analyzing these perturbations provides us with information about the properties of these substructures.
In our pipeline for simulating the training and validation set images, we add subhalos and LoS halos that follow the EPL profile given in Eq. 1.The γ parameter in the EPL profile controls the steepness of the halo density profiles: a larger γ implies a denser halo density profile.We model our training and validation set images with EPL (sub)halos because it allows us to label each image with its underlying power-law slope, which is the ultimate parameter of interest during our inference.We include a uniform prior on γ in our training set so that we do not unnecessarily bias our model.To model the subhalos and LoS halos in each image, we first draw γ from a uniform distribution: γ ∼ U (1.1, 2.9); we then draw normally distributed slopes γi ∼ N (γ, 0.1γ) for the ith subhalo.This normal distribution is truncated so that γi is constrained between 1.01 and 2.99.The number of subhalos added to each image is drawn from a uniform distribution N sub ∼ U{0, 3000}.Note that the upper bound of 3000 subhalos is an overestimate of a realistic number of subhalos for our host halo mass, but we include a higher number of subhalos so that our neural network can successfully learn the signatures in the lensed images corresponding to the changes in the density slope γ.During model evaluation, we will use a smaller N sub range to simulate a more realistic substructure fraction, as will be discussed in Sec. 4.
Because only subhalos near the bright arcs of an image have observable effects, in our simulated images, we place subhalos solely in pixels whose brightness is more than a fifth of the maximum brightness in the smooth model image, which is the image modeled with only the lens and source galaxies and no substructure (and in this case no lens light).The Einstein radius of each subhalo is determined by its mass M200, which is drawn from a subhalo mass function dN sub /dM200 ∝ M −1.9 200 .The mass M200 is defined as the total mass enclosed by r200, which is the radius within which the average mass density is 200 times the critical density of the Universe.In our simulated images, we only add subhalos with masses between 10 7 M⊙ and 10 10 M⊙, because subhalos heavier than this range are scarce and can often be individually modeled.
To add the LoS halos in our modeling, we use the pipeline provided by paltas, with several added modifications.The properties of each LoS halo are determined by the following parameters: its mass M200, density slope γi, ellipticities e1, e2, redshift z los , and position coordinates x los , y los .paltas determines the mass M200 of each LoS halo using a modified Sheth-Tormen halo mass function, which includes two additional free parameters to the mass function proposed origi-nally in Sheth et al. (2001): 1) an overall scaling factor that accounts for uncertainties in the normalization of the mass function; 2) a parameter that accounts for the contribution from the two-point halo correlation function, due to the fact that dark matter halos are biased tracers of the overall matter distribution.The two-point halo correlation function correction is only added for halos along the line-of-sight that are sufficiently close to the main halo of the lens.For a more comprehensive discussion of the modified Sheth-Tormen mass function used in paltas, we refer readers to Wagner-Carena et al. (2023).The density slopes and ellipticities of LoS halos are drawn from the same uniform distributions as subhalos, as discussed above.To determine the position of LoS halos, we divide the space between the observer and the source galaxy into thin slices of redshift with uniform thickness.The position coordinates, (x los , y los ), are bounded by a double cone whose bases lie in the lens plane and whose apexes lie at the observer and the source.The position of each LoS halo is sampled uniformly in the volume of the double cone.
The addition of subhalos causes an enlarged effective Einstein radius, so to restore the effective Einstein radius to its smooth model counterpart, we add a negative mass sheet in the lens plane for the subhalos.In addition, to avoid making the region along our line-of-sight overdense compared to the rest of the Universe, we add a negative mass sheet in each redshift slice for the LoS halos.The negative mass sheet is a constant sheet of convergence such that the sum over all its pixels cancels out the total convergence added by the subhalos or LoS halos.

Instrumentation details and data pre-processing
To make the mock images as similar to real observations as possible, we incorporate HST instrumentation details in the production of our training set.We model our images using the HST ACS/WFC F814W filter configuration and apply an empirical point spread function (PSF), obtained from examining the exposure of point-like stars (Anderson & King 2000).We add noise using an exposure time of 2200 seconds, in approximate agreement with the noise level of the observed HST images discussed in Sec.2.2.
Moreover, in real observations, there are often bright structures close to the strong lens system of interest that can potentially distract our analysis.During training, we apply a circular mask to cover the region outside of the lensed arcs, so that our model learns to not get confused by potential confounders.To mask out the edges of the images, we set the area outside of a circular mask to zero after an image has been whitened.We vary the radius of the circular mask based on the Einstein radius of the image.

HST observations
We demonstrated, in Zhang et al. (2022), that a neural likelihood-ratio estimator is capable of extracting subhalo population density slope information from simulated strong lensing images.In this work, we apply the same method to real strong lensing data taken by the HST.Specifically, we use strong lens systems identified by the Sloan Lens ACS (SLACS) survey (Bolton et al. 2008) and followed up by HST observations.
In the SLACS strong lens systems, the redshifts of the foreground galaxies range from 0.05 to 0.5, while the redshifts of the background galaxies range from 0.2 to 1.2.For our analysis, we choose observed images that share the same set of properties, and then simulate a matching training set.The shared properties include 0.04 ′′ pixel resolution, F814W camera band, and exposure time of approximately 2200 seconds.We also select lens systems with source redshifts, lens redshifts, and Einstein radii that fall in a reasonably narrow range to limit the span of the overall parameter space.From the HST observations, we made (100 × 100) pixel 2 cutouts in which the lens systems of interest are located roughly at the center.Out of these cutouts, we then selected a subset of them that contain visible lensed arcs.The selected HST observations share the same instrumentation details with our training set so as to avoid having an unnecessary distribution shift between training and testing.Our selection process led to a subset of 13 images that we ultimately used for inference, as shown in Fig. 1.

MODEL AND INFERENCE
To infer the subhalo density slopes, we use a simulationbased inference (SBI) machine learning technique.SBI methods have gained increasing popularity in parameter inference problems in cosmology because of their ability to approximate intractable likelihoods due to complicated physical processes.In our application, we train a neural likelihood-ratio estimator as a parametrized classifier to learn the likelihood function (Cranmer et al. 2015;Baldi et al. 2016;Hermans et al. 2019).
In this section, we will give a high-level summary of our model and inference method.For a more detailed description of the theory underpinning the neural likelihood-ratio method, we refer readers to Cranmer et al. (2015), while for more details on its application to analyzing subhalo density slopes in strong lensing images, we refer readers to Zhang et al. (2022).

Inference method
Suppose θ denotes the parameters of our interest and x denotes the observed data.The core idea of likelihood-ratio estimation is training a classifier to distinguish between samples from two different probability distributions: the joint dataparameter distribution p(x, θ), which is the distribution of our interest, and the product of the marginal distributions of the data and the parameter p(x)p(θ).In our case, x corresponds to observed strong lensing images, while θ is the subhalo density slope γ underlying each image.We train a neural network as a classifier to learn the decision function s(x, θ) = p(x, θ)/ (p(x, θ) + p(x)p(θ)), which is in one-to-one correspondence with the likelihood ratio r(x | θ) = p(x, θ)/(p(x)p(θ)) as follows: This allows us to convert a likelihood inference task to a classification task (Cranmer et al. 2015;Baldi et  image, we obtain the classifier logits for a linearly-spaced array of input γ values.The likelihood-ratio estimation method is amortized: after spending an initial overhead for model training, minimal computational cost is needed during inference.
If we have an ensemble of strong lensing observations {x} that are independently and identically distributed when conditioned on γ, then we can obtain their combined likelihood ratio by computing the product of the individual likelihood ratios, This offers a way for us to efficiently combine results of multiple observations with little additional computational cost.

Uncertainty quantification
If our likelihood-ratio estimator is a perfect classifier, then the test-statistic 2 (ln rMLE − ln r) should be χ 2distributed (Wilks 1938), where ln r is the log-likelihood evaluated at the true γ and ln rMLE is the log-likelihood at the maximum likelihood estimate (MLE) γMLE.However, we found that with our imperfect classifier, the test-statistic distribution deviates slightly from a true χ 2 .Therefore, instead of quoting the 68% uncertainty interval using a χ 2 distribution, we empirically determine the threshold for the 68% confidence interval (CI) of the test-statistic.In practical terms, we do this by computing the test-statistics of many samples and then determining a threshold value under which approximately 68% of the test-statistics of these samples are included.Then, for a likelihood-ratio profile, the γ values whose likelihood-ratios evaluate to this threshold determine the upper and lower uncertainties on the MLE.Because we found that combining different numbers of likelihood-ratios leads to slightly different test-statistic distributions, this empirical threshold is determined separately for combining different numbers of images.This uncertainty quantification procedure ensures that approximately 68% of the ground truths fall within the uncertainties quoted, and is used to determine the error bars presented in Sec. 4. true γ for each input image to the latent vector after the convolutional layers in order to ensure that the model learns the true label.At test time, we instead append test γ values in order to obtain likelihood-ratio estimates over a range of γ.Our training objective is the canonical binary cross-entropy loss for classification.We provide more details of our customized ResNet-50 architecture in Appendix A.

Model and training details
To help with model convergence, we pre-process our training set images.We normalize image pixel values to having zero mean and unit standard deviation across the training set; in addition, we normalize the γ values to zero mean.To ensure consistency at test time, we use the training set mean and standard deviation to whiten our test data.
We use the AdamW optimizer (Kingma & Ba 2014;Loshchilov & Hutter 2017) with an initial learning rate of 10 −3 .We follow a learning rate schedule that decays by an order of magnitude when the validation loss stagnates for 3 epochs, followed by a 2-epoch cool-down period.We use a batch size of 1000 based on the maximum GPU memory available.There are 5,000,000 mock images in our training set and 1,000 in our validation set, all of which are generated using the forward model described in Sec. 2. Training terminates when the validation loss plateaus under a threshold of 10 −3 .We carried out our neural network training on NVIDIA V100 GPUs for ∼20 epochs, with each epoch taking ∼5 hours.We found that scaling up the size of the training set and the model complexity significantly improved the model performance during inference, and we expect there to be more improvement if the computing resource availability allows for more up-scaling.

RESULTS
After our model has been trained, we first need to evaluate its convergence.To do this, we compare model predictions of the subhalo density slope with their ground truths using individual images in our validation set.In Fig. 2, we show the maximum likelihood estimate (along with their 68% uncertainties) compared to the true γ for 93 images with 1.2 < γ < 2.8.
The validation images have parameters drawn from the same distributions as the training set, except for N sub , which is drawn from U{0, 1800} to simulate a more realistic substructure fraction.Note that because each image in our training and validation set is labeled with a true underlying γ value for EPL subhalos, we ideally would like the neural network to predict the ground truths and be agnostic to the number of subhalos.Therefore, having a more realistic number of subhalos in our validation set serves as a way for us to ensure that changing the number of subhalos does not  incur a bias in our neural network predictions.The source galaxy images used in validation were held out in training.These images contain EPL subhalos whose true underlying subhalo density slopes are known in the forward simulation pipeline, making this direct comparison possible.As shown in Fig. 2, our model predictions follow the ground truths in trend.This demonstrates that despite the addition of several layers of complexities into training images, our neural network remains sensitive to the signature imprinted on strong lensing images by changes in the subhalo density slope.However, the relatively large confidence intervals indicate that the constraining power of individual images is limited, which makes it imperative to combine multiple images for inference.Note that the uncertainties are larger at the lower end of the γ range because smaller γ values indicate less concentrated subhalos, which leave less detectable signatures in the lensed images.
In addition, we check the model predictions of combining likelihood ratios of multiple images.In Fig. 3, we show γMLE compared to the ground truth γ for combined likelihoods of sets of 13 images, with each set sharing the same underlying slope γ.Note that we still simulate the natural spread in γi, so the slope for each subhalo varies slightly.These images share the same parameter distributions as the images used in Fig. 2 except that the source galaxies come from the held-out set for validation.In Fig. 3, we see that the MLE predictions of our model closely follow the ground truths with relatively small error bars.Comparing the uncertainties between Fig. 2 and Fig. 3, we see that combining images significantly improves constraining power.

Simulated images with NFW subhalos
To obtain the expected subhalo density slopes under the cold dark matter model, we simulate images containing subhalos and LoS halos following the Navarro-Frenk-White (NFW) profile (Navarro et al. 1997).Its radial density profile given by: where r is the distance from the center of a subhalo, and the normalization ρ0 and the scale radius rs are free parameters.The NFW profile is the most common fit for the universal density profile of halos from CDM N-body simulations.In addition to a normalization and a scale radius, the NFW profile can also be parametrized by the (sub)halo mass M200 and concentration c200.The concentration c200 relates to the scale radius and virial radius r200 (as defined in Sec.2.1.2) following r200 = c200rs.In our simulated images with NFW subhalos, we relate M200 and c200 with a mass-concentration relation extrapolated from Dutton & Macciò (2014), which is an empirical relation determined using halos in CDM simulations.We add a dex scatter of N (0, 0.1) to the massconcentration relation for each subhalo in order to mimic the natural spread in the relationship.Note that a dex scatter of 0.1 corresponds to a ∼ 26% variation in concentration.If we denote the CDM mass-concentration as fCDM(M200), then we can modify the CDM mass-concentration relation by multiplying it by a constant factor (which will be referred to as concentration multiplicative factor) in order to simulate different density slopes of subhalo populations.In other words, this means that for a given concentration multiplicative factor a, we set the concentration of a subhalo with mass M200 to be a × fCDM(M200).To test the robustness of our neural network with as realistic images as possible, we add subhalos everywhere in the image in these test sets.
In addition, due to tidal stripping from the host halo, subhalos typically lose mass in their outer region (Hayashi et al. 2003;Diemand & Moore 2011).This means that, instead of an NFW profile, they can be more realistically modeled by a truncated NFW (tNFW) profile.The tNFW profile is parametrized by the NFW parameters as well as a truncation radius rt: Because the truncation steepens the subhalo density profile past the truncation radius, we expect tNFW subhalos to have steeper power-law density slopes than their NFW counterparts.To model the tNFW subhalos in our pipeline, we first determine their NFW parameters following the procedure described for NFW subhalos and subsequently set their truncation radii.The choice of truncation radii affects subhalo density profiles in the intermediate and outer region and thereby their measured slopes (Şengül & Dvorkin 2022).For our test images, we set the truncation radii following Wagner-Carena et al. ( 2023): with mtrunc,pivot = 10 7 M⊙ and rtrunc,pivot = 50 kpc.Here, M200 is the subhalo mass and r sub is the distance of the subhalo from the main halo center.Note that in the test sets where subhalos are modeled with tNFW, LoS halos are still modeled with the NFW profile as they experience less tidal stripping than subhalos.One question that might arise is why our trained likelihoodratio estimator can be applied to images with (t)NFW subhalos and LoS halos even though the training set only contains their EPL counterparts.The justification for this has been demonstrated in Şengül & Dvorkin (2022) and Zhang et al. (2022): given limited resolution and appropriate noise level, the observable changes in the surface brightness due to the presence of (t)NFW subhalos can be well approximated by that of a power-law profile subhalo.
Because the density slopes of (t)NFW subhalos vary with mass (i.e.larger masses have more extended density profiles and thereby smaller density slopes), the intrinsic stochasticity in the masses of a (t)NFW subhalo population introduces intrinsic aleatoric uncertainty into the slope measurement.To account for this uncertainty in each of our measurements, we generate 100 separate sets of images with shared properties and then obtain the MLE of each combined likelihood; using the set of 100 MLEs, we empirically determine the 68% confidence intervals.
In Fig. 4, we sample an array of varying concentration multiplicative factors and show our model MLE from the combined likelihood of 13 images that contain (t)NFW subhalos and LoS halos at each multiplicative factor.As expected, subhalos with higher concentrations lead to larger γ predictions.In particular, the data points for a concentration multiplicative factor of 1 correspond to the expected subhalo density slope measurements under the CDM model, and they will be compared with the predicted slope of the HST observations in Sec.4.2.From the figure, we find that tNFW subhalos in general produce higher density slope measurements than NFW subhalos, consistent with findings presented in Şengül & Dvorkin (2022).

Result with HST images
Having done model validation and obtained the expected density slope of CDM subhalos, we will now use our model to infer the subhalo slope of observed HST strong lensing images, which are described in Sec.2.2.These images are masked at the edges and whitened with the training mean and standard deviation before being fed into our neural network.In Fig. 5, we show the individual likelihood-ratio test statistic profiles for several of the HST observations.In comparison with the predictions of NFW and tNFW subhalos under the CDM model, as discussed in Sec.4.1, we see that the predicted slopes of these HST observations are larger than the predicted slopes of the CDM model.We subsequently combine these individual likelihood ratios using Eq.6 in order to obtain tighter constraints.In Fig. 6, we show the combined likelihood-ratio test statistic profile for all of the 13 images.From this profile, we get a measurement of the subhalo density slope of γMLE = 2.51 −0.04 +0.05 , with the quoted uncertainties indicating the 68% credible interval shown in dotted lines.
In the same figure, we also show the 68% confidence intervals for the combined likelihood-ratio test statistic profiles of 13 images containing NFW subhalos or tNFW subhalos.These correspond to data points shown in Fig. 4 for a concentration multiplicative factor of 1. Comparing the slope prediction of the HST images with that of the simulated images with NFW subhalos, we see that the measured density slope of the HST data is significantly steeper than the expected slope under the assumption that CDM subhalos follow an NFW profile.The predicted slope of the images containing tNFW subhalos is also less than the HST measurement, but their difference is less statistically significant than that with the NFW prediction.While surprising, this is in agreement with previous works that also measured a higher than expected concentration (Minor et al. 2021b;Şengül & Dvorkin 2022).In particular, our 13 HST images include the SDSSJ0946+1006 system analyzed by Minor et al. (2021b), which measured a much higher concentration than the CDM prediction.The individual likelihood-ratio test statistic profile for the same system in our analysis is shown in Fig. 5, and it is in broad agreement with the result in their work.It is also worth noting that our method provides a stronger constraint due to the neural network's ability to efficiently combine multiple observations.It would also be useful to compare our results with those obtained by Şengül & Dvorkin (2022) of the JVAS B1938+666 lens system, but to our knowledge, there is no suitable HST observation of this lens system that matches our training set configuration.Thus, we leave this for future work when more observations become available.One possible explanation of the difference between our result and the CDM predictions lies in the assumptions made in our subhalo modeling.Several assumptions about subhalo density profiles went into modeling the lens system in the image; in particular, the density profile parametrization and the choice of mass-concentration relation affect the predicted slope measurements of subhalos under the CDM model.Modeling these properties for subhalos is an onging area of research (Green et al. 2021), and an improved understanding of subhalo profiles may change the predicted CDM density slopes.Another possible reconciliation is accounting for the selection effects.Subhalos with steeper density slopes are more concentrated and, therefore, are easier to detect in observations.Within our current resolution constraint and noise level, the less concentrated smaller subhalos are not detectable, hence biasing our statistics.This effect of the selection function on slope measurements is important, and we leave a careful study of it for future work, when more observations become available from ongoing and upcoming surveys.

CONCLUSIONS AND OUTLOOK
Observations at sub-galactic scale are essential for examining alternate dark matter models and contrasting them against the standard CDM model.Among the small-scale observables, subhalos provide a promising avenue for dark matter studies.In addition to constraining the subhalo mass function, studying the subhalo density slope (concentration) can help to potentially differentiate various classes of dark matter models.Subhalo properties can be probed by analyzing strong gravitational lensing images.Traditional strong lensing image analyses model individual subhalos through a forward modeling pipeline, but this process can only provide limited statistics; to model more subhalos in a system or to combine statistics from many images, direct lens modeling becomes computationally infeasible.
The rapid progress in machine learning enables the development of techniques that have the power to leverage the collective effect of subhalo populations in strong lensing images, as well as to efficiently analyze a large ensemble of observations.Despite showcases of success on simulated images, many of these machine learning methods require further validation and improvements before they can be successfully applied to real strong lensing observations.
In this work, we built upon the likelihood-ratio estimation method developed in Zhang et al. (2022) and trained a neural network capable of making inference from observed strong lensing images.To make the leap from mock to real images, we added numerous layers of realism in the forward pipeline of the training set.This includes complexifying the lens model to account for the lens light, multipole moments as well as external shear, incorporating realistic noise levels, and adding line-of-sight halos.We demonstrated that the likelihood-ratio estimator retains its sensitivity to changes in the subhalo density slope in simulated strong lensing, even after adding these layers of realism.Furthermore, we obtained the expected subhalo density slope measurements in simula-tions under the CDM model.This measurement comes from using our trained neural network to predict the slope of simulated lensing images containing (t)NFW subhalos that follow a mass-concentration relation derived from CDM simulations.Finally, we measured the subhalo slopes of a set of 13 HST observations and statistically combined their constraints.By comparing the subhalo slope in the HST observations with the measurement from simulated CDM images, we found an unexpectedly high slope measurement in the HST observations, in tension with CDM predictions.
Several recent works in cluster lensing have also suggested that substructures in galaxy clusters are more compact than expected of the CDM model (Meneghetti et al. 2020(Meneghetti et al. , 2022(Meneghetti et al. , 2023)).Combined with several similar results in the literature, our measurement has important implication for dark matter studies as it may motivate more careful examination of alternate dark matter models.The most common alternatives to CDM, the warm dark matter model and many self-interacting dark matter models, predict a lower than CDM subhalo density slope and would exacerbate the tension that we observe (Lovell et al. 2012(Lovell et al. , 2014;;Vogelsberger et al. 2012;Rocha et al. 2013;Kahlhoefer et al. 2019).However, certain self-interacting dark matter models (e.g. with large self-interacting cross sections (Nishikawa et al. 2020)) also predict that SIDM subhalos can undergo core collapses that result in unusually concentrated inner profiles in a timescale relevant for observations today (Lynden-Bell & Wood 1968;Kochanek & White 2000;Colín et al. 2002;Elbert et al. 2015;Nadler et al. 2023).This gravitothermal core collapse due to dark matter self interactions has been suggested as a possible explanation of these high density central regions in cluster galaxies (Yang & Yu 2021).Resolving galactic subhalos in simulations is harder due to their lower masses.A hybrid approach in Zeng et al. (2022), which includes a combination of semi-analytical methods and N-body simulations has shown that some SIDM models can produce subhalos with collapsed cores at subgalactic mass scales (< 10 10 M⊙).This phenomenon provides a possible explanation for the high subhalo density slope that we measured.Based on our work, it is still not possible to pinpoint the mechanism that causes this outlier measurement from the CDM model, but there are several directions of future work that can take us closer to answering this question.For instance, one can study the subhalo slope predictions under different microphysical dark matter models and compare them with the predictions from observed lensing images.In addition, one can examine the effect of assumptions about CDM subhalo properties on the likelihood-ratio estimator's slope predictions.As more lensing systems are expected to be discovered with upcoming surveys (and followed up by observations), the likelihood-ratio estimator will be a valuable tool for obtaining more measurements to help elucidate the nature of dark matter.

DATA AVAILABILITY
The 13 HST images analyzed in this work can be downloaded from https://mast.stsci.edu/portal/Mashup/Clients/Mast/Portal.html.The code used to produce the results shown in this paper is available at https: //github.com/gemyxzhang/neural-subhalo-slope-data.

APPENDIX A: MODEL ARCHITECTURE
We describe in this appendix the customized ResNet-50 architecture used in this work.The original ResNet-50 model used in computer vision consists of a series of convolution blocks followed by pooling and dense layers.We made two modifications to this model for our inference task.Firstly, we append the truth label γ of each image during training to the flattened latent space vector after the convolution blocks, as indicated by the top arrow in Fig. A1.This ensures that the neural network incorporates information about γ into its prediction.In addition, we add a logistic activation function after the last layer of ResNet-50 to ensure that the final output is a valid classification score ŝ(γ, x) (i.e. between 0 and 1).As discussed in Sec.3.3, when we train the neural network as a classifier, the value given by the ResNet before the logistic activation gives us the log likelihood estimate ln r, as indicated in Fig. A1.

ForFigure 1 .
Figure 1.For our inference, we selected 13 image cutouts from observations that share similar instrumental configurations taken by the HST F814W filter.Each image cutout has 100 pixels of size 0.04 ′′ per side.The pixel values of the images are shown in log scale so that features of the lensed arcs are visible by eye.The title of each image corresponds to the name of the strong lens system.

Figure 6 .
Figure6.The 68% confidence interval of combined likelihoodratio test statistic profiles of 13 images containing M 200 ∈ [10 7 , 10 10 ] M ⊙ NFW subhalos (blue) and tNFW subhalos (green), both with a concentration multiplicative factor of 1, as well as the combined likelihood-ratio test statistic profile of the 13 HST images shown in Fig.1(magenta).The uncertainties corresponding to the 68% confidence interval are shown in dashed lines for the likelihood-ratio test statistic profile.

Figure A1 .
Figure A1.Graphical illustration of the neural network architecture used in this work.
al. 2016;Mohamed & Lakshminarayanan 2016).At test time, to compute the likelihood-ratio profile as a function of γ for a given lensed Representative examples of the individual likelihoodratio test statistic profiles for HST images.Each profile is labeled with the name of its corresponding lens system.