Strong-lensing source reconstruction with variationally optimised Gaussian processes

Strong-lensing images provide a wealth of information both about the magnified source and about the dark matter distribution in the lens. Precision analyses of these images can be used to constrain the nature of dark matter. However, this requires high-fidelity image reconstructions and careful treatment of the uncertainties of both lens mass distribution and source light, which are typically difficult to quantify. In anticipation of future high-resolution datasets, in this work we leverage a range of recent developments in machine learning to develop a new Bayesian strong-lensing image analysis pipeline. Its highlights are: (A) a fast, GPU-enabled, end-to-end differentiable strong-lensing image simulator; (B) a new, statistically principled source model based on a computationally highly efficient approximation to Gaussian processes that also takes into account pixellation; and (C) a scalable variational inference framework that enables simultaneously deriving posteriors for tens of thousands of lens and source parameters and optimising hyperparameters via stochastic gradient descent. Besides efficient and accurate parameter estimation and lens model uncertainty quantification, the main aim of the pipeline is the generation of training data for targeted simulation-based inference of dark matter substructure, which we will exploit in a companion paper.


INTRODUCTION
Gravitational lensing has found application in a wide range of cosmological problems. As can be inferred from the name, it is of great aid in observations of the most distant objects in the Universe, like the once most distant known galaxy EGSY8p7 (Zitrin et al. 2015). It is an important component in the analysis of the cosmic microwave background (CMB) (Planck Collaboration et al. 2020) and, perhaps, the only alternative to kinematics for measuring the mass of distant galaxies and clusters (first proposed by Zwicky (1937)). At the same time, the effect known as microlensing has been used to detect objects as small as planets (Tsapras 2018) and to look for mountain-mass primordial black holes (Niikura et al. 2019). Finally, the time delay induced by lensing is being used to obtain an estimate of the Hubble constant that is straightforward and independent of both local and CMB measurements (Wong et al. 2020).
Dark matter (DM) is a well-established hypothetical component of the Universe that interacts with particles of the Standard Model predominantly gravitationally and only very weakly (if at all) through other means (Bertone & Hooper 2018). Of particular interest for constraining models of warm dark matter is the distribution of DM structures on sub-galactic scales. In that context gravitational lensing is a powerful tool, since it directly encodes the gravitational field of DM (which is its defining characteristic) into signatures in electromagnetic images.
Naturally, the objects of interest in substructure studies are galaxy-scale lenses, preferentially in one of two lens-source configurations: multiply (quadruply) imaged point sources and (again multiply imaged) extended arcs or Einstein rings (occurring near critical lines).
★ E-mail: kkarchev@sissa.it, a.m.coogan@uva.nl, c.weniger@uva.nl
In the former case, the ratios between fluxes of the source projections are used, together with strong assumptions about a "macroscopic" (smooth) lens mass distribution, to rate the marginalised statistical likelihood of the presence of substructure with a given low-mass cutoff. Recent studies using this method (Gilman et al. 2020; Hsueh et al. 2020) derive an upper limit m_hm ≲ 10^7.5 M_⊙, roughly corresponding to constraints from other methods. On the other hand, techniques analysing extended sources usually aim at localising and characterising individual subhaloes, whose (non-)detection allows conclusions about the mass function to be drawn based on the statistical power of the technique used (Vegetti et al. 2010; Vegetti et al. 2012; Hezaveh et al. 2016; Ritondale et al. 2019). An alternative is to add a "correction" potential on top of a smooth extended lens and from it extract information about the DM model (Birrer et al. 2017; Bayer et al. 2018). These studies employ a "classical" approach to statistical testing by utilising Monte Carlo (MC) simulations and comparing their outcome (possibly through the use of a test statistic) to the observation. Although statistically straightforward and robust to correlations, this approach is extremely time-consuming, necessitating the exploration of a very high-dimensional parameter space, especially if any sort of "pixellated" (i.e. highly resolved) lens or source model is used.
Strong-lensing source reconstruction. The main endeavour in modelling strong gravitational lenses is disentangling the effects of the lens from surface brightness inhomogeneities of the source. The two are largely (and, without priors, completely) degenerate. The degeneracy can be broken if multiple projections of the same source are identified. One can then expect that a macroscopic mass distribution accounts for mapping the multiple images to the same location in the source plane, while smaller dark matter structures only introduce local variations in one of the projections.
Existing methods fall roughly into four categories: (i) low-dimensional parametric models, (ii) pixellation of the source plane and bi-linear regularisation, (iii) source modelling through basis functions attached to the source plane, and (iv) source regularisation through neural networks. The computationally fastest approach is to use simple parametric functions, including Gaussian and Sérsic profiles (e.g. Bussmann et al. 2013; Spilker et al. 2016). In that case, the small number of parameters can be analysed using standard sampling techniques. This approach is, however, not suitable for capturing the complex light distribution of galaxies observed with high angular resolution.
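As an illustration of category (i), a Sérsic profile takes only a few lines to evaluate. The sketch below (ours, not the paper's pipeline code) uses the common asymptotic approximation b_n ≈ 2n − 1/3; parameter names are illustrative:

```python
import numpy as np

def sersic_profile(r, amplitude=1.0, r_eff=1.0, n=4.0):
    """Sersic surface brightness I(r) = I_e * exp(-b_n * ((r / r_eff)**(1/n) - 1)).

    b_n is approximated as 2n - 1/3, accurate for moderate-to-large n.
    At r = r_eff the profile equals `amplitude` by construction.
    """
    b_n = 2.0 * n - 1.0 / 3.0
    return amplitude * np.exp(-b_n * ((r / r_eff) ** (1.0 / n) - 1.0))
```

With a handful of such parameters per source component, standard samplers remain tractable, which is the appeal of this category.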
In order to account for the complexity of observed galaxies, pixellation-based techniques treat the source as an unknown image to be reconstructed together with the lens parameters. Even though in the majority of cases the reconstruction of the source light from the observation, under fixed lens parameters, is a linear problem (Warren & Dye 2003), the source image still needs to be derived, e.g. via conjugate gradient methods, at each step of a non-linear (MC) exploration of the lens parameters. In this semi-linear approach the source reconstruction is typically ill-defined, and its intensity needs to be regularised. This is commonly done using bilinear penalty terms in the likelihood function that restrict the source intensity, its gradient, and/or its curvature. The strength of the regularisation affects the reconstructed source image and is determined by hyperparameters. Optimal hyperparameters that maximise the marginal likelihood of a given image can be determined using Bayesian methods, as proposed by Suyu et al. (2006).
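The semi-linear step amounts to a regularised linear solve. In the toy numpy sketch below, the random matrix A and identity regularisation R are illustrative stand-ins for a real lensing operator and a gradient/curvature penalty:

```python
import numpy as np

def semilinear_source(A, d, noise_var, R, lam):
    """MAP pixellated source for fixed lens parameters: minimise
    |A s - d|^2 / noise_var + lam * s^T R s, a quadratic problem
    solved by a single linear system."""
    lhs = A.T @ A / noise_var + lam * R
    rhs = A.T @ d / noise_var
    return np.linalg.solve(lhs, rhs)

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))   # toy "lensing operator": source -> image pixels
s_true = rng.normal(size=10)
d = A @ s_true                  # noise-free mock image
s_map = semilinear_source(A, d, noise_var=1.0, R=np.eye(10), lam=1e-8)
```

With negligible regularisation and noise-free data the solve recovers the input source; in practice lam trades fidelity against smoothness, which is exactly the hyperparameter the Bayesian evidence is used to set.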
It was quickly realised that a simple Cartesian pixellation of the source plane is not optimal, since it does not account for variations of the source magnification (Dye & Warren 2005). Pixellation of the source plane using Delaunay triangulation based on the projected image pixel positions was proposed by Vegetti & Koopmans (2009) and is now used in many works. This automatically increases the number of pixels in regions of the source that are magnified by the lens. However, there is no agreed-upon optimal choice for source-plane pixellation, and reconstruction results depend in general on the adopted regularisation and pixellation choices. The problem of selecting the right source pixellation and regularisation hyperparameters was recently studied and newly addressed in the automated approach of Nightingale et al. (2018), which employs 14 (!) hyperparameters that are all non-linearly optimised through nested sampling techniques.
Other approaches that address the problem of source regularisation make use of basis sets like shapelets, defined in the source plane (Birrer et al. 2015; Birrer & Amara 2018). The number of regularisation parameters is reduced in this case, and, instead, choices about the maximum order of shapelet coefficients have to be made. Recently, source regularisation using neural networks trained on observed galaxy images has been proposed, based on recurrent inference machines (Morningstar et al. 2019) and variational autoencoders (Chianese et al. 2020). Although those approaches bake prior choices that might be difficult to control into the regularisation scheme/source prior, they have the advantage of removing the necessity of hyperparameter tuning for source regularisation.
Computer science developments. The recent widespread success of deep learning is due in large part to the advent of automatic differentiation (AD) frameworks that enable efficient gradient-based optimisation of large numbers of network weights; see Günes Baydin et al. (2017) for an overview. Differentiable programming provides an abstraction layer for programming languages that treats programs, whether neural networks or not, as flexible building blocks for potential solutions, which are then optimised using gradient descent. Usually, AD is implemented through libraries and extensions of common programming languages like Python or Julia. Nevertheless, stand-alone implementations exist as well: for example, DiffTaichi (Hu et al. 2019) is an independent new differentiable programming language tailored for building physics simulators.
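The essence of AD can be illustrated with a minimal forward-mode dual-number class (a didactic sketch; production frameworks like PyTorch use far more elaborate reverse-mode machinery):

```python
class Dual:
    """Forward-mode AD: carry (value, derivative) through arithmetic,
    applying the sum and product rules at every operation."""

    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)

    __rmul__ = __mul__

def f(x):
    # a toy "simulator": f(x) = 3x^2 + 2x, so f'(x) = 6x + 2
    return 3 * x * x + 2 * x

x = Dual(2.0, 1.0)   # seed the derivative dx/dx = 1
y = f(x)             # y.val = f(2) = 16, y.der = f'(2) = 14
```

Any program built from such overloaded primitives is differentiable end to end, which is what allows gradients to flow through an entire physics simulator.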
Physics simulators that incorporate AD can be seamlessly coupled to neural networks and thus benefit directly from advances in deep learning (Cranmer et al. 2019). Still, the use of AD in the physical sciences is in its infancy: examples can be found in computational quantum physics (Torlai et al. 2020), while Baydin et al. (2020) discusses the advantages and challenges of using AD in the context of high-energy physics. In a previous work (Chianese et al. 2020) we presented the first auto-differentiable strong-lensing image analysis pipeline.
Probabilistic programming elevates probability densities and random sampling to first-class constructs of the programming language. In this paradigm, probability density estimation (required for inference) and sampling (required for simulation) are usually done with the same code. This enables practitioners to focus on writing simulator code, and the subsequent statistical analysis is then performed automatically. Various probabilistic programming languages (PPLs) exist, often as libraries and extensions, e.g. Stan (Stan Development Team 2021) and Pyro (Bingham et al. 2018). The latter integrates with AD and deep learning frameworks and thus combines the advantages of deep learning, AD, and variational inference.
Variational inference (VI) allows the approximation of extremely high-dimensional Bayesian posteriors with simple proposal distributions by solving an optimisation problem (for a review see Zhang et al. 2017). The idea of VI is not new, but an efficient implementation requires a differentiable simulator. Accordingly, it regained traction in the context of deep neural networks when Kingma & Welling (2013) introduced the well-known variational auto-encoders (VAEs), which are now increasingly used also in physics, both for parameter inference and as generative models (e.g. Green et al. 2020; Otten et al. 2019). With the advent of powerful general AD frameworks that can differentiate through physics simulators, VI has the potential to drastically improve the analysis of high-dimensional (in particular, image) data. It allows one, often with little overhead above simple maximum a posteriori estimation, to derive meaningful posteriors for thousands or millions of parameters in situations where sampling methods would utterly fail.
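A minimal sketch of the idea, on a toy conjugate model where everything can be checked analytically: for z ~ N(0, 1) and an observation d | z ~ N(z, 1) with d = 2, the exact posterior is N(1, 0.5). We fit q(z) = N(mu, s²) by gradient ascent on the ELBO; the gradients below are the closed-form expectations for this particular model (in general one would use the reparameterisation trick and AD instead):

```python
import numpy as np

d, sigma2 = 2.0, 1.0     # observation and likelihood variance
mu, log_s = 0.0, 0.0     # variational parameters of q(z) = N(mu, exp(2*log_s))
lr = 0.05

for _ in range(2000):
    s2 = np.exp(2.0 * log_s)
    # Closed-form ELBO gradients for this conjugate Gaussian toy model:
    grad_mu = (d - mu) / sigma2 - mu                # d ELBO / d mu
    grad_log_s = 1.0 - s2 * (1.0 / sigma2 + 1.0)    # d ELBO / d log(s)
    mu += lr * grad_mu
    log_s += lr * grad_log_s
# Converges to the exact posterior: mu -> 1.0, s^2 -> 0.5.
```

The same loop, with stochastic reparameterised gradients supplied by AD, scales to the tens of thousands of source and lens parameters considered in this work.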
In the case of strong-lensing image analysis with pixellated sources, the number of parameters can be extremely large, O(10^5) and more. Due to the non-linearities introduced by lensing, marginal likelihoods are intractable and require approximate inference. The commonly adopted sampling techniques do not scale well to extremely large numbers of parameters, and indeed existing pixellated source-reconstruction pipelines only derive MAP estimators for source parameters. VI enables us to go significantly beyond this limitation.
Gaussian processes (GPs) are excellent non-parametric models for uncertain functional dependencies (Murphy 2021). While the source regularisation schemes discussed above are difficult to interpret and control, since the various penalty terms and their hyperparameters define the correlation structure only implicitly, in GP models it is directly expressed through the covariance kernel, which effectively connects flux, gradient, curvature, and higher-mode regularisation. In the astronomy and physics communities GPs have been used, for instance, in the contexts of component separation for 21 cm lines (Mertens et al. 2018), Lyα absorber detection (Garnett et al. 2017), light-curve classification (Revsbech et al. 2018), particle-physics bump hunting (Frate et al. 2017), radio image analysis (Arras et al. 2021), and time-series analyses (Foreman-Mackey et al. 2017).
Numerous machine learning libraries for GP regression exist.However, they usually focus on stationary kernels and simple correlation structures, which makes them unsuitable for modelling the correlations in strong-lensing images.
In this paper we present a fully end-to-end differentiable, GPU-accelerated strong-lensing image analysis pipeline embedded in the probabilistic programming language Pyro. We propose a new method for modelling the sources in strong-lensing images inspired by GP regression. However, we formulate it as an equivalent mixture model, which allows us to replace time-consuming GP optimisation with fast approximate VI. Our approach accounts both for correlations induced by overlapping pixels in the source plane and for the intrinsic correlation structure of the source flux. The resulting pipeline enables us to quickly infer and constrain the parameters of the main lens, as well as the amount of intrinsic source inhomogeneity at various length scales, and to obtain estimates for the uncertainties of the reconstructed source. We believe that the proposed methods can be beneficial for a large variety of image analysis problems in astronomy. Yet, our main motivation is to use the approximate posterior predictive distribution of the fitted model to draw examples for the targeted training of inference networks for dark matter substructure. We briefly demonstrated this approach in Coogan et al. (2020) and will present a more thorough exposition in a companion paper (Coogan et al., in prep.).
The rest of this paper is structured as follows. The overall framework is introduced, and a brief description of strong gravitational lensing is given, in section 2. The source model is elaborated and related to Gaussian process regression in section 3. Variational inference and variational GP optimisation are discussed in sections 3.6 and 3.7. Finally, the application of the framework to mock observations is presented and discussed in section 4, with more examples shown in appendix A.

OVERVIEW OF THE MODEL AND GRAVITATIONAL LENSING
We model observational data (a single-channel telescope image) that represents surface brightness using a pipeline of three components: a source, which will be discussed in the next section, a lens, and an instrument (i.e. a telescope). In this work we model the instrumental effects as simply imbuing the observation with Gaussian noise that is constant across all pixels and independent for each of them, but the framework can seamlessly be extended to account for a point-spread function (PSF) and a varying observational uncertainty.

[Figure 1 caption: In this work we fix the kernel size hyperparameter, so it is depicted as an input. The source model, represented as a plate, is replicated (independently) a number of times, with the outputs of those "layers" then summed to produce the "true" modelled flux (f). See table 1 for descriptions of the variables and table 3 for the parameters of the proposal distribution.]

[Figure 1 notes: The solid lines indicate deterministic relations, except when they "flow" through a distribution (for ε and f). The direction of the arrows indicates the order of operations in a forward (generative) pass through the model.]

[Figure 2 caption: Angles are all assumed to be small enough that they can be used for Euclidean calculations. The dashed line is the optical axis perpendicular to the planes and connects the origins of the coordinate systems for each plane.]

The physics of strong lensing
The general theory of relativity (GR) allows for solving for the propagation of light through arbitrary spacetimes, but this can, accordingly, be arbitrarily difficult. Usually, therefore, when considering "ordinary" non-extreme systems like galaxies and clusters, one adopts a weak-gravity approximation, which assumes that mass densities are low enough and that relativistic effects can be ignored, so that the Newtonian approximation to GR can be used; i.e. one assumes the metric can be fully described by a scalar gravitational potential Φ. Furthermore, usually there is a clear separation of scales in the system, with the lensing mass being located much further from the observer and the source of light than the size of the gravitating object. This setup lends itself to the thin-lens approximation, in which all the mass is assumed to lie in a single plane (the image plane), while all the light is assumed to be coming from a source plane (see fig. 2).
We will be using β⃗ = (β₁, β₂) and θ⃗ = (θ₁, θ₂) as general angular coordinates in the source and image plane, respectively, labelling with z the coordinate perpendicular to the planes. Since the deflections involved in a weak-gravity setting are small, we can use the angular measures as a Cartesian coordinate system in the two planes.
The configuration, then, is determined by two fields: the distribution of surface brightness in the source plane, S(β⃗), and the distribution of mass in the image plane, described by the projected potential. Under these assumptions, the image-plane and source-plane coordinates of a light ray are related by the simple ray-tracing equation (Meneghetti 2016):

    β⃗ = θ⃗ − α⃗(θ⃗).

The (reduced) displacement field α⃗ is determined by the projected potential:

    α⃗(θ⃗) = (D_ls / D_s) (2 / c²) (1 / D_l) ∇_θ ∫ Φ(D_l θ⃗, z) dz,

where the gradient is taken in the image plane and should have dimensions of inverse length, whence the introduction of the observer-lens distance D_l. The additional factor D_ls / D_s is geometric (as illustrated in fig. 2) and motivates the "reduced" qualification.
From a computational standpoint it is more convenient to represent eq. (3) as an integral. This is accomplished through the use of the Poisson equation ∇²Φ = 4πGρ and setting appropriate boundary conditions (Meneghetti 2016):

    α⃗(θ⃗) = (D_ls D_l / D_s) (4G / c²) ∫ Σ(θ⃗′) (θ⃗ − θ⃗′) / |θ⃗ − θ⃗′|² d²θ′,

where the integral is performed over the image plane. Here Σ is the projected (surface) mass density, related to the 3D mass density by an integration along z, Σ(θ⃗) = ∫ ρ(D_l θ⃗, z) dz, similarly to the expression for the projected potential.
In fact, eq. (4) can be interpreted as a properly rescaled sum of the well-known contributions 4GM/(c²b) (which Dyson et al. (1920) confirmed) from infinitesimal point masses dM = Σ d²(D_l θ⃗′) to the deflection of a ray with impact parameter b⃗ = θ⃗ − θ⃗′. This is an expression of the linearity of thin weak-gravity lensing, which allows the lensing effects of arbitrary mass distributions to be trivially superposed by summing their respective contributions to the displacement field.
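This superposition is straightforward to code. In the numpy sketch below (ours, in angular units), each point mass with Einstein radius θ_E contributes θ_E² b⃗/|b⃗|² to the reduced deflection, and contributions are simply summed:

```python
import numpy as np

def point_mass_deflection(theta, lens_positions, einstein_radii):
    """Reduced deflection field of a set of point masses, in angular units.

    Each lens at position p with Einstein radius tE contributes
    tE^2 * b / |b|^2, where b = theta - p; by the linearity of thin
    weak-gravity lensing the contributions add.
    """
    alpha = np.zeros_like(theta)
    for pos, tE in zip(lens_positions, einstein_radii):
        b = theta - pos
        alpha += tE**2 * b / np.sum(b**2, axis=-1, keepdims=True)
    return alpha
```

A single point mass deflects a ray at its Einstein radius by exactly that radius, and two lenses produce the sum of the individual deflections, which the assertions below check.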
The expression in eq. (4) is simplified by introducing the critical surface density for lensing and the closely related convergence:

    Σ_cr = c² D_s / (4πG D_l D_ls),    κ = Σ / Σ_cr.

The former is a bound above which a lens can form multiple images. It only depends on the relative strength of gravity (G) to the speed of light (c) and dictates that further sources are more easily lensed (for a fixed distance to the lens), since the deflections required are smaller. The multiple-images condition is equivalently stated as κ > 1 (see Meneghetti (2016, section 2.6) for discussion and illustration).
The lens equation can be viewed as a coordinate transformation and then conveniently characterised through its Jacobian, A = dβ⃗/dθ⃗, which can be shown to take the form (1 − κ) I minus a traceless shear matrix with components γ₁, γ₂, conveniently split into an isotropic and a shear part. The (inverse) Jacobian defines the magnification, μ = 1 / det A, which can be interpreted as the overall scaling of the area of small patches of the source plane after lensing. The magnification can take any real value, with negative magnification signifying a flipped image. It can also become extremely large (positive or negative) where det A is close to zero. The regions where this occurs are called caustics and critical lines in the source and image planes, respectively. The magnitude of magnification differentiates strong from weak lensing: for the former, κ and γ are comparable to (and/or even exceed) unity, and so sources are strongly magnified.
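The magnification can be evaluated numerically by finite-differencing the lens equation. The sketch below (ours) does this for a singular isothermal sphere, for which det A = 1 − θ_E/|θ⃗| provides an analytic check, including the sign flip inside the Einstein radius:

```python
import numpy as np

def sis_beta(theta, theta_E=1.0):
    """Lens equation for a singular isothermal sphere:
    beta = theta - theta_E * theta / |theta|."""
    r = np.linalg.norm(theta)
    return theta - theta_E * theta / r

def magnification(theta, h=1e-6):
    """mu = 1 / det(d beta / d theta), Jacobian via central differences."""
    J = np.zeros((2, 2))
    for j in range(2):
        e = np.zeros(2)
        e[j] = h
        J[:, j] = (sis_beta(theta + e) - sis_beta(theta - e)) / (2.0 * h)
    return 1.0 / np.linalg.det(J)
```

At θ = (2, 0) with θ_E = 1 the analytic magnification is 1/(1 − 1/2) = 2; at θ = (0.5, 0) it is −1, i.e. a flipped image of unit brightness scaling.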
It is important to note, however, that lensing does not create or destroy photons (and thus conserves energy). Thus, the amount of light arriving at the observer is proportional to the area in the image plane that the source subtends, which effectively acts as a collecting area for light that is then simply rerouted, just as with conventional lenses. This means that lensing conserves surface brightness around infinitesimal points, and thus the surface brightness registered by the observer (in the absence of other sources) is

    I(θ⃗) = S(β⃗(θ⃗)).

It is obvious from this expression that the lens and the source are entirely degenerate: for any observation there is, for every deflection field (that does not form multiple images!), a surface brightness configuration that can reproduce the observation. The reverse (that one can find deflection fields that map distinct sources to the same image) is also sometimes true, giving rise to the mass-sheet (Falco et al. 1985) and source-position degeneracies (Schneider & Sluse 2014) (the latter being more general but approximate). As a consequence, the spatial scale of the source cannot be conclusively determined (in the absence of time-delay data), since a sheet of constant mass density could be magnifying it uniformly by an arbitrary amount (Kuijken 2003).
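The mass-sheet degeneracy can be verified in two lines of algebra or code: replacing the deflection α⃗ by λα⃗ + (1 − λ)θ⃗ (i.e. rescaling the lens and adding a uniform sheet of convergence 1 − λ) rescales every source-plane position by λ, so a source shrunk by λ reproduces the observation exactly. A numpy check (ours), using an SIS lens for concreteness:

```python
import numpy as np

def sis_alpha(theta, theta_E=1.0):
    """Deflection of a singular isothermal sphere (angular units)."""
    return theta_E * theta / np.linalg.norm(theta)

def sheet_transformed_alpha(theta, lam, theta_E=1.0):
    """Rescale the lens by lam and add a uniform mass sheet kappa = 1 - lam."""
    return lam * sis_alpha(theta, theta_E) + (1.0 - lam) * theta

theta = np.array([1.3, -0.7])
lam = 0.8
beta = theta - sis_alpha(theta)
beta_prime = theta - sheet_transformed_alpha(theta, lam)
# beta_prime == lam * beta: the uniformly rescaled source gives the same image.
```

Since the transformation holds for every image-plane position θ⃗, no imaging data alone can break it.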

Lens model
Our framework admits the use of an arbitrary combination of mass distributions as the lens model, due to the aforementioned linearity of thin weak-gravity lensing. Ultimately, we are interested in investigating the statistical properties of the overall distribution on different size (and mass) scales. As a first step, it is usual to split the lensing mass into a macroscopic smooth component, which is constrained by the need to converge the multiple observed images of the source onto the same location in the source plane, and a "substructure" component. For simplicity, at present we do not consider substructure and defer all investigation into and discussion about it to upcoming works.
Almost exclusively, the smooth component is realised using an analytical model giving the displacement as a (relatively) easily computable expression. While in computation-driven studies the singular isothermal sphere or ellipsoid (SIS/SIE) model seems to dominate due to its simplicity (Hezaveh et al. 2017; Brehmer et al. 2019), analyses of observational data (e.g. Bolton et al. 2008) require further flexibility and hence often resort to the singular power-law ellipsoid (SPLE) model, which is an extension of the SIS/SIE. Furthermore, the SPLE has been shown to be a more adequate representation of the combined DM and baryon mass distribution in the inner regions of galaxies, to which strong lensing is most sensitive (Suyu et al. 2009).
A (spherical) power-law lens is parametrised by its 3D density exponent γ (or slope) and an Einstein radius θ_E. In terms of a general (2D) radial coordinate r(θ⃗), its surface density is

    Σ(r) = Σ_cr (3 − γ)/2 (θ_E / r)^(γ−1).

An ellipticity can be induced by first transforming from Cartesian coordinates, (θ₁, θ₂), to elliptical coordinates, (ρ, φ). The transformation is controlled by a scale factor q and, for generality, a rotation through an angle φ₀ and an overall translation θ⃗₀. The SPLE displacement field has a closed-form expression in the complex notation, in which θ = θ₁ + iθ₂, involving the hypergeometric function ₂F₁ (O'Riordan et al. 2020; Tessore & Metcalf 2015). Here φ is the elliptical angular coordinate (and not the ellipse orientation φ₀!). In the circular (q = 1) isothermal (γ = 2) case the displacement reduces to α = θ_E e^(iφ), i.e. an isotropic constant, as is well known. The function ₂F₁, however, is not implemented in PyTorch, let alone in an auto-differentiable fashion, so we instead use the procedure described in full in Chianese et al. (2020, Appendix A), which relies on pre-tabulating the angular part of eq. (13) (through an alternative formulation) and interpolating at runtime. This results in very fast code with negligible output degradation.
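The pre-tabulate-and-interpolate pattern looks schematically as follows; here cos(2φ) is a hypothetical stand-in for the true angular profile, which in the actual pipeline comes from the alternative formulation of eq. (13):

```python
import numpy as np

# Done once, offline: tabulate the (expensive) angular part on a dense grid.
phi_grid = np.linspace(0.0, 2.0 * np.pi, 1000)
table = np.cos(2.0 * phi_grid)   # placeholder for the hypergeometric angular part

def angular_part(phi):
    """Runtime evaluation: cheap periodic linear interpolation of the table."""
    return np.interp(np.mod(phi, 2.0 * np.pi), phi_grid, table)
```

For a smooth periodic function, linear interpolation on a grid of ~10³ points already reproduces the exact values to a few parts in 10⁵, so the accuracy cost of the look-up is negligible compared to observational noise.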
Additionally, we include a displacement field component due to an externally caused shear, which in the same complex notation reads

    α_ext = (γ₁ + iγ₂) θ*,

with θ* the complex conjugate of θ, in analogy with eq. (7). It represents the combined weak-lensing effect of all unmodelled lenses along the light path (assumed constant across the image). As discussed in section 2.1, an overall magnification is degenerate with scaling the source plane, so it is not modelled explicitly.
Overall, the lens model has eight trainable parameters that we collect in a vector θ_lens: the translation components θ₀,₁ and θ₀,₂, the scale factor q, the orientation φ₀, the Einstein radius θ_E, and the slope γ from the SPLE, and {γ₁, γ₂} for the external shear.

MODELLING THE SOURCE LIGHT
The goal of this section is to describe a source model flexible enough to enable fits to the lensed image at the level of image noise, but also statistically rigorous enough to not overfit in areas where the observed details could be due to lensing substructure. Inspired by common non-parametric Gaussian process (GP) regression, whose usual application is precisely statistically justified interpolation of arbitrary data, and motivated by the need to avoid costly operations like matrix inversions, which would preclude application to future high-resolution datasets, we replace the usual regression strategy with optimisation via variational inference. Furthermore, because of the peculiarities of the lensing transformation and the possible overlap between non-adjacent pixels when projected into the source plane, we need to introduce a method for modelling the thus-induced correlations, which goes well beyond typical applications of GP. Still, in certain limits, our model does reduce to conventional GP regression. We expect that in the field of gravitational lensing it outperforms existing methods in terms of accuracy, precision, speed, and flexibility.

Overview of Gaussian processes
From a mathematical perspective, a (Gaussian or not) process is a distribution over functions. In the simplest case, the functions may be the possible surface brightness profiles of the source galaxy (but see section 3.2 for how this is modified by pixellation). The assumption of Gaussianity then directly provides the likelihood of observing some data d with observational covariance S, marginalised over the possible "true" values of the flux (Rasmussen & Williams 2006, Chapters 2 & 5):

    p(d | θ_GP) = N(d; 0, K + S).

The process is entirely defined by the prior covariance matrix of the true values, K, whose entries are assumed to be given by a covariance function applied to the positions p⃗ associated to the observations:

    K_ij = C(p⃗_i, p⃗_j; θ_GP).

With θ_GP we indicate the hyperparameters that the covariance function (and hence also the evidence in eq. (15)) depends on. Although a variety of covariance functions can be used, we will restrict ourselves to the most common one, the Gaussian radial basis function (RBF) with hyperparameters λ² for the overall variance and Σ as a covariance kernel. It is given by

    C(p⃗_i, p⃗_j) = λ² exp(−(p⃗_i − p⃗_j)ᵀ Σ⁻¹ (p⃗_i − p⃗_j) / 2).

Often one aims to infer a single kernel best fitting the data. However, in galaxy images there may be a number of different relevant scales, like the scale of the whole galaxy and that of dust lanes and individual stellar clumps, and so a single kernel is usually not optimal for modelling them. Furthermore, one can usually get an estimate of the appropriate kernel size by e.g. first fitting a smooth analytical source model. Hence, we model the covariance using a sum of N_layers Gaussian RBFs as in eq. (17), assigning to each an isotropic kernel Σ_(k) = ℓ_(k)² I, where I is the identity, and an overall variance λ_(k)², which together form the set of GP hyperparameters. Due to the linearity of Gaussian processes, this is equivalent to considering a number of linearly superimposed independent (a priori) GP sources, which we call layers.
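For concreteness, the evidence of eq. (15) with a sum-of-RBF-layers kernel can be computed via a Cholesky factorisation; the numpy sketch below (variable names are ours, isotropic kernels assumed) is the standard GP marginal-likelihood computation:

```python
import numpy as np

def rbf_kernel(p, variance, length_scale):
    """Isotropic Gaussian RBF: variance * exp(-|p_i - p_j|^2 / (2 l^2))."""
    d2 = np.sum((p[:, None, :] - p[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * d2 / length_scale**2)

def gp_log_evidence(d, p, layers, noise_var):
    """log N(d | 0, K + S) with K a sum of RBF layers and S = noise_var * I."""
    K = sum(rbf_kernel(p, v, l) for v, l in layers)
    C = K + noise_var * np.eye(len(d))
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, d))
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (d @ alpha + logdet + len(d) * np.log(2.0 * np.pi))
```

Maximising this quantity over the per-layer variances is exactly the hyperparameter optimisation discussed in section 3.6; the O(n³) cost of the factorisation is what the variational formulation later avoids.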
We do not vary the length scales ℓ_(k) but fix them, along with the total number of layers, based on an initial fit, so that all relevant correlation scales are covered (see section 4). On the other hand, the variance hyperparameters are optimised variationally by maximising the model evidence (see section 3.6). In a multilayer GP they act as weights for the layers and can suppress or enhance the modelled source variations on the relevant scales, as determined from the data. In the following we will usually omit the layer subscripts and instead imply that the descriptions apply to each layer individually and independently.

Approximate pixellated Gaussian processes
A particular complication arising in the modelling of multiply lensed systems is that it is not enough to assign the observed data to zero-sized points in the source plane, as this will disregard the correlations arising from the overlap of the light-collecting areas of different pixels (see fig. 3). To fully account for the (projected) pixel shapes and sizes, one must introduce a set of indicator functions n_i(β⃗), each being non-zero only within the source-plane light-collecting area for the respective pixel, so that

    f_i = ∫ n_i(β⃗) S(β⃗) d²β,

where f_i are the noiseless observations (GP true values), and S(β⃗) is the surface brightness, which we treat as the function being modelled by the Gaussian process. Hence, we assume that the covariance from eq. (18) in fact applies to the underlying surface brightness:

    cov(S(β⃗), S(β⃗′)) = C(β⃗, β⃗′),

and consequently modify the covariance matrix of the observed fluxes to take into account the correlations from all pairs of points within the light-collecting areas:

    K_ij = ∫∫ n_i(β⃗) n_j(β⃗′) C(β⃗, β⃗′) d²β d²β′.

Note that this covariance is not generally stationary, since it is not simply a function of the separation of the positions p⃗_i and p⃗_j of the observations. Calculating this integral exactly is infeasible even if the projected pixels are assumed to be polygonal (which is not necessarily true), since it would involve numerous intersection checks and code branching points, which are extremely inefficient on GPUs. Instead, we opt for a simplified treatment that approximates the indicator functions by Gaussians located at the pixel centres:

    n_i(β⃗) ≈ N(β⃗; p⃗_i, P_i).

The conservation of surface brightness (as described in appendix C) demands that

    ∫ n_i(β⃗) d²β = 1   and   ∫ n_i(β⃗)² d²β = 1 / a_i.

The first requirement simply means that the indicator function acts as a "contribution density" and is satisfied by any properly normalised Gaussian function. The second condition, where a_i is the projected pixel area, or an estimate thereof, demands that the indicator function is similar in size to the pixel and ensures that the variance of a pixel's brightness scales inversely with its light-collecting area. In the same appendix we show that this is satisfied when the "pixel shape covariance matrix" is

    P_i = (u⃗ u⃗ᵀ + v⃗ v⃗ᵀ) / (4π),

with u⃗ and v⃗ the axes of the pixel (depicted in fig. 3).
The indicator function from eq. (22) allows us to easily solve the integral in eq. (21). Assuming that the surface brightness S(β⃗) has the Gaussian RBF covariance from eq. (17), the prior covariance matrix of the data is

    K_ij = λ² √(det Σ / det(Σ + P_i + P_j)) exp(−(p⃗_i − p⃗_j)ᵀ (Σ + P_i + P_j)⁻¹ (p⃗_i − p⃗_j) / 2).

Reassuringly, in the limit P_i → 0, in which pixels are treated as points, this covariance reduces to the usual expression. A similar construction for nonstationary covariance functions has been considered in the Gaussian process literature (Paciorek & Schervish 2003).
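This closed form is easy to implement and to validate against a brute-force average of the kernel over points drawn from the two Gaussian indicator functions. The numpy sketch below (ours; the λ² normalisation convention is simplified relative to eq. (25)) does both:

```python
import numpy as np

def pixel_cov(dp, Sigma, Pi, Pj, lam2=1.0):
    """RBF covariance between two Gaussian 'pixels': the double integral of
    lam2 * exp(-0.5 * (x - y)^T Sigma^-1 (x - y)) against N(x; p_i, P_i)
    and N(y; p_j, P_j), with dp = p_i - p_j, has this closed form."""
    M = Sigma + Pi + Pj
    norm = np.sqrt(np.linalg.det(Sigma) / np.linalg.det(M))
    return lam2 * norm * np.exp(-0.5 * dp @ np.linalg.solve(M, dp))
```

Setting P_i = P_j = 0 recovers the plain RBF kernel, and for finite pixel covariances the result agrees with a Monte Carlo average over sampled pixel points, as the assertions below verify.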
To test the approximation, we apply it to a pixel grid deformed by a realistic lens configuration as in section 4 and compare it to the exact calculation according to eq. (21) with "hard" indicator functions, i.e. non-zero only within the polygon formed by the projected pixel corners. We directly compare the approximated covariance matrix elements in fig. 4 and verify that for the majority of pixel pairs that would be considered for fitting (see section 4 for the "masking" procedure) the approximation is adequate, with 95 % of the entries reproduced to within a factor of 2. As evident from the figure, where significant deviations occur, either one or both of the pixels in the pair have a very elongated shape, as quantified by a shape index related to the area-to-perimeter ratio of the pixel projection.

[Figure 4 caption: Comparison of the approximate (eqs. (22), (25) and (26)) and exact covariance matrix elements. The colour scale depicts a "shape index" = 4π × area/perimeter², which is an indicator of elongation, being ≪ 1 for very elongated polygons. The shaded area indicates the region a factor of 2 away from the exact result. Top: cumulative fraction of pixel pairs (subsets as indicated in the legend) for which the approximation is off (either smaller or bigger) from the exact calculation by more than a given factor. The setup is an example lensed pixel grid of size 100 × 100 for a typical lensing configuration from section 4, and a single layer of kernel size comparable to the typical (projected) pixel sizes. The "exact" calculation was performed by averaging eq. (17) over 1000 uniformly sampled points in each pixel. Only pixels that would be considered for fitting (masked as in section 4) are included in the bottom panel, and a further cut was enacted to remove abnormally elongated pixels (located e.g. near critical lines).]
Lastly, we note that the Gaussian approximation leads to a biased estimate of the pixel variances, since in eq. (24) we chose to normalise for the case of a vanishing kernel size. Thus, Gaussian indicator functions are more concentrated ("smaller") than the polygons they approximate and hence end up having a slightly bigger variance. This effect is below 10 % in typical cases, and we do not expect it to have a noticeable effect on the final results.
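The shape index quoted in fig. 4 is straightforward to compute for a projected pixel polygon. A minimal numpy sketch (the function name and the example polygons are ours, for illustration only):

```python
import numpy as np

def shape_index(vertices):
    """'Shape index' 4*pi*area/perimeter**2 of a simple polygon.

    Equals 1 for a circle, pi/4 for a square, and approaches 0 for
    very elongated polygons.  `vertices` is an (N, 2) array of corners
    in order (the polygon is closed automatically).
    """
    v = np.asarray(vertices, dtype=float)
    x, y = v[:, 0], v[:, 1]
    # Shoelace formula for the area.
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    # Perimeter: sum of edge lengths, including the closing edge.
    perimeter = np.sum(np.linalg.norm(v - np.roll(v, -1, axis=0), axis=1))
    return 4.0 * np.pi * area / perimeter**2

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
sliver = [(0, 0), (10, 0), (10, 0.1), (0, 0.1)]  # strongly sheared pixel shape
```

A square gives π/4 ≈ 0.785, while the 100:1 sliver gives ≈0.03, illustrating why elongated projected pixels are flagged by a low index.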

Formulation as a generative model
In general, the positions p⃗ at which observations are made are a function of other parameters in the overall model. In the case of lensing studies they depend on the lens configuration. Thus, a way to optimise the lens parameters is to perform gradient descent on the marginal likelihood of eq. (15), for which they act as hyperparameters. However, this would require performing expensive matrix inversion and determinant calculations, both of which have a complexity scaling as O(N³) with the number of observations N. Even though software like GPyTorch (Gardner et al. 2018) improves on naïve implementations by using the conjugate gradient method (Press et al. 1992, Section 10.6), the calculation still needs to be performed at each step of parameter optimisation. Furthermore, the method of calculating the determinant presented in Gardner et al. (2018) is stochastic and can sometimes introduce noise comparable with the effects of substructure (1-2 %). This renders typical GP regression suboptimal for lensing studies and particularly so for substructure searches.
Instead, we propose a GP-inspired source model that reduces to a GP in ideal situations but circumvents expensive matrix operations and allows other model parameters to be fit via gradient descent simultaneously with the optimisation of the source hyperparameters.Furthermore, the fitting procedure produces a simple analytic posterior for the true values which can be used to marginalise them out directly or to obtain training samples for subsequent analysis using inference networks (Coogan et al. 2020, prep).
The source model relies on the framework of variational inference (VI), which will be discussed shortly. A key component of it is a generative model that produces mock data according to the prior. A generative description of a GP as above is14

f ∼ N(0, K),    d | f ∼ N(f, S),    (27)

which upon marginalisation of f indeed gives eq. (15). Sampling f can further be expressed as

f = Tζ,    ζ ∼ N(0, I),    (28)

as long as T satisfies

TTᵀ = K.    (29)

Here we introduced the set of source parameters ζ, which we optimise by ELBO maximisation as presented in section 3.7 in order to approximate the exact GP solution. This procedure shifts the complexity of the GP into calculating T appropriately and so requires only sampling with diagonal (or, otherwise, constant) covariance and one matrix-vector multiplication. Crucially, all operations15 have complexity at worst O(N²) and efficient standard GPU implementations. A slight caveat is that, if there are as many source parameters as pixels in the observation, the matrix T becomes too large to store at once in memory, without necessarily being sparse. To circumvent this, we instead implement the product Tζ directly via the PyKeOps library (Charlier et al. 2021), which enables on-demand compilation of symbolically defined (e.g. as in eq. (30)) matrices to extremely efficient reduction routines optimised for graphical processing units. Critically, PyKeOps fully integrates into the PyTorch ecosystem by also supporting automatic differentiation of these operations.
Our source model can also be understood in terms of Bayesian linear regression onto a set of basis functions defined by the transmission matrix T and shifted to the positions p⃗ associated with the observations. What we term "source parameters" play the role of the weights, over which a normal prior is placed. Our expressions can be translated into those found in Rasmussen & Williams (2006, Section 2.1) by substituting our source parameters for the weights w, our transmission matrix T for the design matrix Φ, and our data d for the targets y, with the (unit) prior covariance of the source parameters playing the role of the weight prior covariance Σ_p.

14 Assuming a single GP layer. A multilayer model generates "true values" f = Σ_ℓ f_(ℓ) with covariances K_(ℓ).

15 Barring, in the general case, the final data sampling d | f, S, which, however, is trivial for uncorrelated noise or requires one matrix inversion per dataset for a general covariance matrix. In some cases (as for a PSF), this final step can even be split into a deterministic "postprocessing" of the fluxes (a convolution with the PSF) and an uncorrelated sampling step.

Factorisation of the covariance matrix
Given a covariance matrix K, linear algebraic methods like the Cholesky decomposition (Press et al. 1992, Section 2.9) can produce the required factorisation, generally in O(N³) time, which, as discussed, is to be avoided. Instead, motivated by the fact that a matrix product is analogous to a convolution, and the convolution of two Gaussian functions is again a Gaussian, we approximate the (Gaussian) covariance matrix from eq. (26) as the product of two other ("transmission") matrices of the same form, i.e. with Gaussian entries centred on a set of "inducing" points q⃗ (eq. (30)). The normalisation constants A, along with the kernel and {Z}, are determined by requiring that TTᵀ reproduces eq. (26), where ρ(x⃗) is the density16 of inducing points q⃗ around x⃗, and Ã² = ρ(x⃗)A²(x⃗) is set to be independent of x⃗. The approximation becomes an equality as the inducing points become infinitely dense and cover the whole space. Comparing with eq. (26), we can identify the parameters of the transmission matrices. This derivation is akin to the one presented by Rasmussen & Williams (2006, p. 84) and relates to the well-known fact that the squared-exponential covariance function corresponds to Bayesian linear regression onto an infinite linear combination of Gaussian basis functions.
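The factorisation can be checked numerically in one dimension. The following numpy sketch (grids, kernel size σ and root variance α are hypothetical values of ours) builds a transmission matrix of Gaussian basis functions with kernel size σ/√2 on a dense grid of inducing points, with the normalisation following from the Gaussian convolution identity, and verifies that TTᵀ reproduces the squared-exponential covariance:

```python
import numpy as np

# 1-D sketch of the transmission-matrix factorisation K ~ T T^T for a
# Gaussian (RBF) covariance.  Each basis function has kernel size
# sigma/sqrt(2), so that the "convolution" of two of them has size sigma.
sigma, alpha = 0.2, 1.3          # kernel size and root variance (made up)
x = np.linspace(0.0, 1.0, 50)    # "observation" positions
dq = 0.01
q = np.arange(-1.0, 2.0, dq)     # dense grid of inducing points

# Normalisation chosen so that the discrete sum over inducing points
# reproduces alpha^2 exp(-(x - x')^2 / (2 sigma^2)).
A2 = alpha**2 * dq / (sigma * np.sqrt(np.pi / 2.0))
# Basis width sigma/sqrt(2)  =>  exponent -(x - q)^2 / sigma^2.
T = np.sqrt(A2) * np.exp(-(x[:, None] - q[None, :])**2 / sigma**2)

K_exact = alpha**2 * np.exp(-(x[:, None] - x[None, :])**2 / (2 * sigma**2))
K_approx = T @ T.T
```

With ~10 inducing points per basis width, the discrete sum is an excellent approximation of the continuum limit in which the identity is exact.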

Aside: Source reconstruction and inducing points
As discussed, our source model can be interpreted as a regression onto a finite set of Gaussian basis functions, each with kernel Σ′ = Σ/2, located at the inducing points and weighted by the source parameters. Intuitively, this corresponds to treating the source as a collection of (a finite number of) Gaussian "blobs" attached to the inducing points, free to scale in intensity but with size fixed by the kernel. Thus, once the source parameters / weights (and hyperparameters) have been determined, the source surface brightness can be reconstructed at arbitrary locations x⃗* by evaluating all the basis functions at x⃗* and summing their contributions. This is succinctly expressed in matrix notation by simply replacing x⃗ → x⃗* in eq. (30) and then evaluating eq. (28). Note that here we do not include the "pixel shapes" Z since we are directly evaluating the underlying surface brightness. Alternatively, we can assume that, in contrast to observational data, the source reconstruction is evaluated on a high-resolution regular grid for which the point approximation is adequate. This equation in fact defines a posterior distribution over the reconstructed surface brightness functions, arising from the inferred posterior distribution of the source parameters17, as appropriate for a process model. Furthermore, for fixed inducing points and covariance hyperparameters, this distribution will be Gaussian.

Reconstructed uncertainty
A well-known problem of basis function regression, and one of the main advantages of the full GP treatment, is that it is not well-suited for making predictions away from the inducing points, especially in the case of decaying bases like the squared exponential. It is plain to see from eq. (35) that sufficiently far away from the data a source model as we describe, however trained, unfailingly predicts f* = 0, i.e. f* | ∅ ∼ δ(f*), where ∅ denotes the absence of data. Even though this is not unreasonable when it comes to the mean (indeed, the GP predictive mean also reverts to the prior, which is null in our case, away from data), it also paradoxically predicts vanishing uncertainty, in stark contrast to the expected regress to the prior covariance of a true GP. This is not a problem for our stated goal of training a model that is then able to produce further training samples for substructure inference, nor for plotting the reconstruction in the image plane, since these only require evaluations at the same points at which the model was trained.18 However, in plotting high-resolution source reconstructions (in the source plane), the uncertainty will be concentrated "unnaturally" around the inducing points, an artefact further concealed by the integration over pixel areas during training (and when generating image-plane reconstructions). One solution might be to only evaluate the posterior at the inducing point locations and then perform mesh interpolation, or kernel interpolation using the Gaussian indicators as kernels. However, this would defeat the philosophy of using processes for interpolation, and hence in figures like fig. 8 (and fig. 5, although there it is less prominent) we still present the mean of the source reconstruction as derived from eq. (35) while mesh interpolating only the predicted variance.

Choice of inducing points
Note also that the basis functions are entirely defined at the inducing points, without reference to the observational data: Z appears in eq. (30) only to account for the integration over the pixel indicator functions and should in fact be regarded as part of the instrumental component of the model. This means that, even layer by layer, the inducing points, and crucially their number, can be chosen freely. If, for example, only a small number of inducing points is used (fewer than the number of image pixels), the model may be solvable exactly because of the reduced complexity of the inversion and determinant calculations: this is the so-called low-rank approximation, which we will explore in further work.
We employ two strategies for choosing the inducing points, which are useful for modelling source variations at different scales. The first is to fix q⃗_(ip) = p⃗, i.e. attach the source-plane Gaussians to the image pixels and have them move with the change of lens parameters. The advantages of this prescription are that it automatically results in a higher density of inducing points in highly magnified regions of the source plane, akin to triangulation schemes, but also that it reduces the correlations between the lens and source parameters by "dragging" the observed flux along the source plane as the lens changes. We dub this GP configuration the image-plane GP (hence the subscript) and use it to model small-scale image variations.
However, this strategy is unsuitable for modelling variations on scales larger than the separation between the projected pixel locations, which can be very small in highly magnified source regions. Indeed, in a situation in which the inducing points are significantly closer than the kernel size σ of the GP layer, the Gaussians in the sum of eq. (35) overlap substantially, resulting in significant degeneracies between source parameters. While one option for mitigating this problem would be to use only a subset of the image pixels as inducing points, we instead introduce a large-σ layer with inducing points fixed to a predetermined grid in the source plane with separation comparable to σ: q⃗_(sp) = p⃗_grid. We call this source setup the source-plane GP. Since we only use it to model large-scale variations, for this layer we drop the pixel-shape contribution in eq. (30), which corresponds to treating each basis function as approximately constant over the projected size of a single pixel.

Variational inference by ELBO maximisation
The method of maximisation of the evidence lower bound (ELBO) (Saul et al. 1996;Jordan et al. 1999;Hoffman et al. 2013) is particularly suitable for fulfilling the two main goals of our source modelling: to provide an estimate of the posterior and optimise the source hyperparameters (which, in some sense, include the lens parameters).
Traditionally, the latter is performed by maximising the model evidence, and this is a natural outcome of ELBO maximisation, as the name suggests. The former is achieved by using a proposal distribution, the so-called variational posterior: the difference between the log evidence and the ELBO is the Kullback-Leibler (KL) divergence from the proposal to the true posterior, which is strictly non-negative for any two distributions and zero only if they coincide. This not only justifies the ELBO's name but also means that ELBO maximisation produces the optimal approximation of the posterior within the confines of the parametrisation of the proposal.

Stochastic gradient descent (SGD)
Note that, unlike typical marginalisation, where integration is performed over the model parameters and can often be intractable, the ELBO only requires samples from the proposal, which is usually much simpler and often represented by analytical distributions. Given a model p(d, z) and a proposal q_φ(z), however, it is often impossible to calculate the expectation value in the ELBO analytically, let alone its gradient with respect to φ. Even if a Monte Carlo estimate is available, taking its gradient naïvely does not capture the dependence of the sampling function on φ and hence does not give the correct result.
There are two strategies to circumvent this, both of which express the gradient of the ELBO as an expectation of a "surrogate function". The first approach, called black-box variational inference (Ranganath et al. 2013), uses

∇_φ ELBO = E_{z∼q_φ}[(ln p(d, z) − ln q_φ(z)) ∇_φ ln q_φ(z)],

which is an unbiased estimate for general proposal and model and can be sampled using Monte Carlo. However, such estimates often suffer from large variance, especially when the proposal does not cover the majority of the posterior mass (e.g. in the early stages of a fit).
An alternative is to re-parametrise the proposal distribution so that its stochasticity is independent of the parameters we are taking a gradient with respect to (Kingma & Welling 2013). This "reparametrisation trick" is applicable to the majority of common analytical distributions, for which sampling is implemented as a series of deterministic manipulations of a "unit" random variable ε that is usually uniform on [0, 1) or ε ∼ N(0, 1). Crucially, the distribution p(ε) is not parametrised, and the gradient needs to "flow" only through the deterministic transformations. Formally, writing z = g_φ(ε), this can be expressed as

∇_φ E_{z∼q_φ}[f(z)] = E_{ε∼p(ε)}[∇_φ f(g_φ(ε))],

whence we can again obtain a stochastic estimate of the required derivative via standard Monte Carlo methods. Of course, automatic differentiation is still practically indispensable in computing the gradient for any non-trivial model. Both methods of SGD are implemented in the package Pyro, with re-parametrised samples used where possible. Indeed, all the distributions that we use are re-parametrisable.
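The variance advantage of the re-parametrised estimator over the score-function one can be illustrated on a toy expectation with a known gradient. The sketch below (our own example, not pipeline code) estimates the gradient of E_{z∼N(μ, s²)}[z²] with respect to μ, whose exact value is 2μ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: d/dmu of E_{z ~ N(mu, s^2)}[z^2] = 2*mu
# (since E[z^2] = mu^2 + s^2).
mu, s, n = 1.5, 1.0, 200_000
eps = rng.standard_normal(n)      # "unit" random variable, independent of mu
z = mu + s * eps                  # deterministic re-parametrisation

# Re-parametrised (pathwise) estimator: differentiate through z itself,
# d f(z) / d mu = f'(z) = 2 z.
grad_rep_samples = 2.0 * z
grad_rep = grad_rep_samples.mean()

# Score-function ("black-box") estimator: f(z) * d ln q(z) / d mu.
grad_score_samples = z**2 * (z - mu) / s**2
grad_score = grad_score_samples.mean()
```

Both estimators are unbiased and converge to 2μ = 3, but the per-sample variance of the score-function estimator is substantially larger, mirroring the discussion above.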

Variational inference of a Gaussian process
According to the generative model eq. (27), the posterior for the true values f, given an observation d, has the form of a multivariate normal distribution (MVN) (Rasmussen & Williams 2006, Section 2.2):

f | d ∼ N(μ, Σ_post)  with  μ = K(K + S)⁻¹d  and  Σ_post = K − K(K + S)⁻¹K.    (40)

Since the source parameters ζ are related to f by a linear transformation, their posterior, too, will be multivariate normal. Hence, an MVN is the correct form to assume for the proposal distribution. However, an MVN proposal requires fitting N(N + 1)/2 parameters of the covariance matrix, which cannot be stored in reasonable memory, let alone optimised in reasonable time. Therefore, we are forced to use a mean-field approximation in which we treat the posterior covariance as diagonal and use a univariate normal proposal distribution ζ_i ∼ N(θ_i, σ_θ,i²) for each source parameter independently of the rest, where θ and σ_θ² are, respectively, the variational mean and variance vectors. A full analytical treatment of a variationally optimised GP, including a detailed account of the ramifications of this choice, is presented in appendix D.
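The exact MVN posterior that the variational fit approximates is cheap to evaluate for small problems. A 1-D numpy sketch of the standard result (kernel size, variance, noise level, and data are hypothetical stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)

# Exact MVN posterior of the true values f given data d:
# mean K (K + s^2 I)^{-1} d,  covariance K - K (K + s^2 I)^{-1} K.
x = np.linspace(0.0, 1.0, 30)
sigma, alpha, noise = 0.15, 1.0, 0.05   # kernel size, root variance, noise
K = alpha**2 * np.exp(-(x[:, None] - x[None, :])**2 / (2 * sigma**2))
d = np.sin(2 * np.pi * x) + noise * rng.standard_normal(x.size)

A = K + noise**2 * np.eye(x.size)       # data covariance (uncorrelated noise)
mean = K @ np.linalg.solve(A, d)        # posterior mean of f
cov = K - K @ np.linalg.solve(A, K)     # posterior covariance of f
```

The posterior mean tracks the data up to the noise level, and the marginal variances are reduced well below the prior variance α² wherever data constrain the function.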

Hyperparameter optimisation
Hyperparameter optimisation is a special case of Bayesian model selection, which is usually performed by maximising the model evidence, thus accounting for the posterior uncertainties of the latent variables (in regular GP regression these are the "true values", which we have reformulated in terms of the "source parameters").However, variational inference does not have access to the true evidence, but rather only to the ELBO, which contains the additional KL divergence term.As discussed, the latter is null if and only if the variational posterior matches the true one, and in our case this cannot be achieved since we are forcing the proposal covariance to be diagonal.Hence, the KL divergence may (and indeed does) bias the model selection procedure.
As described in appendix D, using a diagonal approximation to the covariance overestimates the posterior variance, and therefore the fit will attempt to diminish it, which it can achieve by decreasing the prior variance. Hence, we anticipate the recovered hyperparameter values α² to scale inversely with the average number of correlated observations N̄_corr with respect to the "true" hyperparameters α₀²:

α² ∼ α₀² / N̄_corr.

In appendix D.1 we show that this is indeed so.

Example
An illustration of the validity of the variational approach to solving a Gaussian process, and a comparison with the analytical solution, is presented in fig. 5 for the case of a (relatively) small 1-dimensional dataset designed to model the main aspects of lensing images in the source plane. Given the true hyperparameters, a variational optimisation with a full covariance matrix exactly reproduces the analytical solution within the domain of the data (i.e. not too far away from the observations), whereas a diagonal fit locates the mean but attains a constant uncertainty, as derived in appendix D. There we also justify why this usually leads to an overestimation of the posterior variance of the true values, as can also be observed in fig. 5; that is, our fitting strategy is usually conservative.

Figure 5. Comparison of the exact GP solution of a 1-dimensional problem to variational approximations. The observed data (black dots) were drawn from a GP with two layers of sizes σ_(1) = 0.2 and σ_(2) = 0.02 and root variances α_(1) = 0.5 and α_(2) = 0.2 with noise level 0.15 (shown as segments). The locations of the observations (every fourth one shown as a dash along the top of the top plot) imitate two overlapping magnified grids as in a lensing system. The central 95 % of the posteriors for the true values of each layer (marginalised over the other) are shown as shaded areas in gray (the exact solution, eq. (40)), green (a variational fit with full covariance matrix), and red (with diagonal covariance). Solid lines trace the posterior means of the respective fits. All but the diagonal fit have the hyperparameters fixed at the true values. The bottom panel depicts the 1-sigma intervals normalised to the location and size of the exact solution. It also includes a diagonal fit with fixed true hyperparameters (yellow) and depicts the expected standard deviations of the diagonal fits from eq. (45) as dashed lines. Finally, the red dotted line is the bias of the mean predicted by eq. (46).
The uncertainties are somewhat reduced if the hyperparameters are also optimised. As discussed, this leads to a reduction of the prior variance and hence also of the posterior. However, this also means that predictions away from zero are more strongly penalised, which biases the mean with respect to the analytical treatment (observe the offset between the red and green lines in fig. 5). As derived in appendix D.2, the variational mean is smaller (in absolute value) than the analytical result by a factor that depends inversely on the signal-to-noise ratio, and hence we do not expect it to significantly bias the parameters we are actually interested in while fitting lensing images. Note, furthermore, that variational inference still recovers the exact posterior mean at the optimised values of the hyperparameters, so the bias is not due to the fitting procedure directly, but rather is a side effect of the different hyperparameters, themselves influenced by the additional KL term in the ELBO.

MOCK DATA ANALYSIS
While there are currently on the order of a hundred available observations of galaxy-galaxy lenses suitable for substructure detection (mostly contained within the SLACS (Bolton et al. 2006; Shu et al. 2017) and BELLS (Brownstein et al. 2012; Shu et al. 2016) optical surveys19), instruments in the near future are expected to deliver hundreds of thousands more (Collett 2015). The main contributors are expected to be the Vera Rubin Observatory's Legacy Survey of Space and Time (LSST, Ivezić et al. 2019) and the Euclid space telescope (Laureijs et al. 2011). The latter will in fact provide imagery of the necessary resolution and, more importantly, seeing to be directly usable for substructure modelling. On the other hand, the expected point spread function (PSF) of the LSST will be on the order of the Einstein radius for the majority of the discovered lenses, and so follow-up will still be required. In that respect, the James Webb Space Telescope and the Extremely Large Telescope (ELT) will improve on the imaging capabilities of the Hubble Space Telescope (HST), with the latter expected to produce 10 × 10 diffraction-limited images (the resolution is expected to be at least 2 mas (ELT Instrumentation), compared to Hubble's 50 mas (ACS Handbook) and Euclid's 100 mas (Euclid VIS Instrument)).

19 These surveys identify candidates in large spectroscopic surveys (the SDSS and BOSS) by looking for objects that seem to combine light with distinct redshifts. Follow-up observations have been performed with the Hubble Space Telescope to obtain the final deep high-resolution images. JVAS/CLASS (Myers et al. 2003) is an older radio survey that also provides suitable images. Other collections have also been produced (More et al. 2012; Huang et al. 2020), for which high enough quality imagery is generally still not available.
Table 2. List of lens parameters, their priors, true values in the mock image of Hoag's object, and results of the final fit with 3 image-plane GP layers. The "fit" column quotes the means and standard deviations derived from the fitted multivariate proposal described in the text. The full variational posteriors from all fits (to this mock observation) are depicted in fig. 12.

In this section we demonstrate that our pipeline is capable of rapidly analysing mock observations representative of what upcoming high-resolution telescopes such as the ELT will produce. After describing our mock data setup, we study a benchmark system in detail, showing that we can accurately fit the observation, recover the source light distribution, and precisely infer lens parameters with reasonable variational uncertainty estimates. We also comment on several key technical points, most importantly the independence of the lens parameters' variational posteriors from the number of GP source layers. Fits for several other mock lensing systems are presented in appendix A.

Data generation and source model
For our mock data we assume a resolution of 12.5 mas, which is significantly larger than the expected size of the ELT's point spread function (Davies et al. 2016), allowing us to neglect it at this stage. We generate images of size 400 × 400 pixels, spanning an area of 5 × 5 arcsec on the sky. In order to simulate integration of the light across the pixel areas and to avoid pixellation artefacts, we initially generate the mock observation at 10-times higher resolution and then downsample to the target size by local averaging. We set the noise level so that (after downsampling) the brightest pixel has a signal-to-noise ratio (SNR) of ∼30, in line with HST observations of the SLACS lenses.20 The properties of the lensing system are representative of the expected distribution of strong galaxy-galaxy lenses detectable by next-generation instruments, as derived by Collett (2015). Specifically, we set an Einstein radius of ∼1.5 arcsec and the size of the unlensed source to 0.4 arcsec.21 The other parameters of the SPLE lens and the external shear are drawn at random from the priors listed in table 2. As a source we use high-resolution public-domain galaxy images. For the main example we interpolate over a post-processed 1185 × 1185 HST image (STScI 2002-21) of Hoag's object (Hoag 1950), converted to grayscale. We choose this object because it combines a largely featureless "early-type" core with a featureful "late-type" star-forming ring, which allows us to test the model simultaneously in both relevant regimes of image complexity. Lastly, we do not model the lens galaxy's light at this stage22, assuming it has been perfectly accounted for in a preprocessing step as in Bolton et al. (2006). Figure 6 displays the mock observation, source, critical curve, and caustic.
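The pixellation and noise steps can be sketched as follows (numpy, with sizes reduced from 400 × 400 to 40 × 40 for brevity and a stand-in Gaussian blob in place of the lensed source image):

```python
import numpy as np

rng = np.random.default_rng(2)

# Render at 10x the target resolution, downsample by local averaging,
# then add noise so that the brightest pixel has SNR ~ 30.
target, factor = 40, 10
hi = target * factor
yy, xx = np.mgrid[0:hi, 0:hi] / hi - 0.5
image_hi = np.exp(-(xx**2 + yy**2) / (2 * 0.1**2))   # stand-in "source"

# Local averaging: reshape into (target, factor, target, factor) blocks
# and average over the fine axes.
image = image_hi.reshape(target, factor, target, factor).mean(axis=(1, 3))

noise_level = image.max() / 30.0                     # SNR ~ 30 at the peak
observation = image + noise_level * rng.standard_normal(image.shape)
```

The reshape-and-mean idiom performs the block averaging without explicit loops and generalises directly to the 400 × 400 case.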
Finally, we create a mask23 for our observation by first applying a ∼35 mas Gaussian blur to the observation and keeping only the pixels with SNRs larger than 1.
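At 12.5 mas per pixel, the ∼35 mas blur corresponds to roughly 3 pixels. A sketch of the mask construction (stand-in image and our own variable names; `scipy.ndimage.gaussian_filter` does the blurring):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(3)

# Stand-in observation: a Gaussian blob plus noise at peak SNR ~ 30.
yy, xx = np.mgrid[0:100, 0:100] / 100 - 0.5
signal = np.exp(-(xx**2 + yy**2) / (2 * 0.1**2))
noise_level = signal.max() / 30.0
observation = signal + noise_level * rng.standard_normal(signal.shape)

# Blur by ~3 pixels (~35 mas at 12.5 mas resolution), then keep pixels
# whose blurred flux exceeds the noise level, i.e. SNR > 1.
blurred = gaussian_filter(observation, sigma=3.0)
mask = blurred / noise_level > 1.0
```

Blurring before thresholding suppresses isolated noise peaks, so the mask traces the smooth source-light footprint rather than individual noisy pixels.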
In the fits we model the source via a multilayer image-plane GP and a single source-plane GP (cf. section 3.5). We set the kernel size of the latter to σ_(sp) = 40 mas, while for the former we use a geometric progression of spatial scales ranging between 2 mas and 10 mas. The layer sizes span from ∼1/6 to ∼3 times the image pixel size and thus allow simultaneously modelling highly magnified areas in detail and the large-scale appearance of the source. We test from one (either at the small or the large end of the size range, indicated by subscripts "s" and "b", respectively) up to eight image-plane layers. The image-plane GP is set up to only model the masked pixels, i.e. the ones with a significant source-light contribution, which are also used as inducing points. In contrast, the source-plane source is evaluated across the whole image, and its grid of inducing points has a size of 40 × 40 with a separation of 30 mas, spanning an area about 20 % larger than the images used for mock data generation. The variance hyperparameters are initialised to 10 times the noise level (similar to the highest SNR) for the source-plane GP and to the level of the observational noise for the image-plane GP layers.

Variational posterior
The slope of the SPLE has a complex degeneracy with the source, related to the mass-sheet and source-position transformation degeneracies (Falco et al. 1985; Schneider & Sluse 2014). Breaking it requires additional information or assumptions on e.g. the dynamics of the lens galaxy. For simplicity, in the current work we do not optimise the slope but instead fix it to the true value from table 2. For the remaining seven lens parameters we use a multivariate normal (MVN) proposal distribution, while we optimise the GP source parameters via a diagonal normal proposal, as discussed in section 3. However, since the source-plane GP layer has relatively few source parameters compared with the number of image pixels, it is computationally feasible to at least partly model their off-diagonal covariance and their correlation with the lens parameters. We achieve this by using a "partial" MVN proposal whose scale matrix is the block lower-triangular matrix

( L_lens        0       )
( C             D_(sp)  ),

with L_lens an N_lens × N_lens lower-triangular matrix with positive diagonal entries ("lower Cholesky"), D_(sp) a diagonal matrix with the source-plane standard deviations σ_θ,(sp) on the diagonal, and C the N_grid × N_lens matrix correlating the lens and the source-plane source. Under this proposal the covariance of the lens parameters is L_lens L_lensᵀ, while that of the source-plane parameters is D_(sp)² + CCᵀ, so apart from the lens-source correlations, C also accounts for a part of the full posterior GP covariance. Exploring more flexible variational posteriors capable of better accounting for the most important correlations between source parameters is left for future work.
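The structure of the implied covariance is easy to verify numerically. A numpy sketch with hypothetical sizes (7 lens parameters, a 12-point source-plane grid standing in for the full 40 × 40 one):

```python
import numpy as np

rng = np.random.default_rng(4)

# "Partial" MVN scale matrix: a lower-triangular block for the lens
# parameters, a diagonal block for the source-plane source parameters,
# and a dense block C correlating the two.
n_lens, n_grid = 7, 12
L_lens = np.tril(rng.standard_normal((n_lens, n_lens)))
np.fill_diagonal(L_lens, np.abs(np.diag(L_lens)) + 0.1)  # positive diagonal
s_sp = 0.5 + rng.random(n_grid)                          # diagonal scales
C = 0.1 * rng.standard_normal((n_grid, n_lens))

scale = np.block([
    [L_lens, np.zeros((n_lens, n_grid))],
    [C,      np.diag(s_sp)],
])
cov = scale @ scale.T
# Lens block: L_lens L_lens^T; source-plane block: diag(s_sp^2) + C C^T.
```

Only N_lens(N_lens + 1)/2 + N_grid + N_grid·N_lens parameters are fitted, instead of the full (N_lens + N_grid)(N_lens + N_grid + 1)/2 of an unrestricted MVN.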
Thus, our full variational posterior factorises into the "partial" MVN proposal for the lens and source-plane source parameters, independent univariate normal proposals for the image-plane source parameters, and a formal factor for the hyperparameters, the last being a technical way to write the optimisation of hyperparameters, for which a posterior is not derived (eq. (49)). The complete set of variational parameters is listed in table 3. Finally, we note that the parameter-constraining functionality of PyTorch / Pyro was used to ensure that samples for the lens parameters always fall within the domains of the respective priors (listed in table 2). This is implemented by transforming the values after they have been sampled from the proposal, which means that in fact the samples do not follow the distribution as written in eq. (49). However, since the proposal is Gaussian, and the posterior uncertainties we obtain are very small in comparison with the prior ranges, the transformation amounts to no more than a linear rescaling of the uncertainty parameters (and a transformation of the means), which we take into account when plotting and citing the posteriors. Furthermore, we account for the transformation Jacobian while fitting by explicitly adding it to the loss.

Optimisation and initial fit problem
We optimise our model in stages using Adam (Kingma & Ba 2014) with the default PyTorch settings. Initially, we remove the image-plane GP layers and fit only the variational posterior parameters for the lens and the source-plane GP in order to obtain an initial estimate of the lens parameters and the appearance of the source. We find that generally the true parameters are discovered by fits initialised at random locations covering the whole prior ranges for the Einstein radius and position of the lens, as well as starting from a null external shear. However, it is necessary to have a prior estimate of the axis ratio within ∼0.1 and the position angle within ∼0.5 rad; otherwise the fit usually gets stuck in a local minimum. These could potentially be obtained with neural networks (using e.g. the network from Hezaveh et al. (2017)) or by using the lens galaxy's light. Instead, here we point out that the local minima have a substantially higher loss value, and whether the fit has entered one of them or has discovered the global (true) values can be determined with certainty within a few thousand steps (see fig. 7), which usually take about a minute; a brute-force initial search is therefore a simple and viable strategy. Moreover, we examine the robustness and convergence properties of initial fits when they indeed locate the global optimum. Starting at a learning rate of lr = 10⁻², we let PyTorch's ReduceLROnPlateau scheduler reduce it, whenever the loss does not change significantly, down to lr = 10⁻⁴ over the span of 20 000 fitting steps. These fits usually take around 10 min on a single Titan RTX GPU. As seen in the left panel of fig. 7, the final loss is consistent within a scatter of ∼100 despite the random initialisation.24 Furthermore, for our mock observation, the approximate posteriors derived from these initial fits (depicted as pale red to green ellipses in the corner plot of fig. 12) are consistent and unbiased, having uncertainties in the third decimal.25
In the next stage of fitting we reintroduce the image-plane source and optimise its variational parameters for 5000 steps at lr = 10⁻², while keeping the parameters of the proposal related to the rest of the model fixed. This stage is represented by the shaded area of fig. 7. After that we train all parameters, starting at lr = 10⁻³ and again reducing it automatically upon encountering a plateau in the loss. In this stage we use 4 samples (instead of one for the previous stages) to estimate the gradient of the ELBO in eq. (39) in order to better constrain the posterior (co)variances. We find that in general the loss stabilises after about 10 000 steps and thereafter decreases exponentially slowly. After ∼50 000 more steps all fits had reached a rate of loss decrease of O(10) per 10 000 steps, and so we stopped them arbitrarily at this point.
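The staged schedule can be mimicked in a self-contained way. Below is a numpy stand-in for Adam with a plateau-based learning-rate reduction (the quadratic loss, the patience, and the rate floor are illustrative choices of ours, not the pipeline's actual loss or settings):

```python
import numpy as np

def loss_and_grad(x):
    # Toy stand-in for the (negative) ELBO and its gradient.
    return (x - 3.0)**2, 2.0 * (x - 3.0)

x, lr = 0.0, 1e-2                    # start at lr = 1e-2, floor at 1e-4
m = v = 0.0                          # Adam moment estimates
beta1, beta2, eps = 0.9, 0.999, 1e-8
best, patience, bad = np.inf, 200, 0
for t in range(1, 20001):
    loss, g = loss_and_grad(x)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    mhat, vhat = m / (1 - beta1**t), v / (1 - beta2**t)
    x -= lr * mhat / (np.sqrt(vhat) + eps)
    # Plateau detection: cut the learning rate when the loss stops improving.
    if loss < best - 1e-12:
        best, bad = loss, 0
    else:
        bad += 1
        if bad >= patience and lr > 1e-4:
            lr, bad = max(lr / 10.0, 1e-4), 0
```

In the actual pipeline this logic is delegated to `torch.optim.Adam` and `torch.optim.lr_scheduler.ReduceLROnPlateau`; the sketch only illustrates the schedule's behaviour on a problem with a known optimum.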
As expected, introducing more GP layers leads to a lower overall loss and a better data reconstruction loss in particular (see fig. 19).However, if a few layers are already present, the improvements are minimal.

Final results
The final fit to the mock image of Hoag's object is depicted in fig. 8 (with more examples presented in appendix A). The mean and standard-deviation images were computed using 100 samples of the lens and source parameters from the approximate posterior. For the image-plane reconstruction (top row) these were processed by the whole model in order to calculate the corresponding noiseless lensed image. We find that the posterior mean reproduces the observation with high fidelity and with residuals dictated by the observational uncertainty. Furthermore, we find no evidence for an overall systematic bias of the reconstructed flux (see figs. 10 and 11).
Using the same samples we apply eq. (35) over a fine grid in the source plane in order to obtain high-resolution source reconstructions, which can be thought of as samples from the approximate posterior for the flux in the source plane. Its statistics are presented in the second row of fig. 8. As in the image plane, the mean is an unbiased estimator of the true source flux, and the standard deviation is below the observational noise level, further decreasing in areas of higher projected pixel density, especially near caustics. Still, we refer the reader to section 3.5.1 for the caveats relating to source-plane flux reconstruction.

²⁴ Even though the absolute scale of the loss is arbitrary (in plots it is presented relative to an arbitrary zero point), a useful basis of comparison is the number of pixels in the observation, since random Gaussian noise contributes a loss of O(1) per pixel; 100 is therefore indeed a small scatter for an image of O(10⁵) pixels.
²⁵ This is only a couple of orders of magnitude bigger than the machine precision of single-precision floats and certainly smaller than the accuracy of the SPLE interpolation tables used in our lens model, as well as than the deviations from ideal lenses found in nature.
We observe that our model does not reproduce the details of the source down to arbitrarily small spatial scales, as evidenced by the presence of abnormally negative residuals in fig. 10 (see also panel h. in fig. 8), even though we do include GP layers with sufficiently small kernel sizes. There are multiple reasons for this, the first being that the resolution of the observation, projected onto the source plane, imposes a resolution limit on the reconstruction. This limit, furthermore, varies across the source plane, depending on the multiplicity of the projection and the magnification. Secondly, the uncertainty in the lens parameters smears details in the source plane over the scale of the induced variations of the projected pixel locations. We attempt to account for these effects by comparing the source reconstruction not to the original high-resolution image but to a Voronoi tessellation²⁶ of the observed fluxes projected onto the source plane. Still, some detail is not recovered, and this is because the source model imposes a global variance, σ²₍ℓ₎, for each spatial scale λ₍ℓ₎, while the power of the relevant fluctuations in the actual image varies with position. Since the average power of even the smallest scales of fluctuations is low, the correspondingly low σ²₍ℓ₎ penalises the reconstruction of small isolated details, even when the data requires them. In essence, this reflects the fact that real sources are not, in fact, samples from a multilayer GP model. We reserve for future work the exploration of possible remedies, like allowing a spatially varying σ²₍ℓ₎(x⃗) with regularisation (e.g. by interpolating over a coarse grid) or replacing the prior distribution for the source parameters in the forward model with one with more probability mass in the tails.
Crucially for the purposes of subsequent substructure analysis, the fits also produce an estimate of the uncertainty of the reconstruction (panels c. and g. of fig. 8). We find that the standard deviation of the reconstructions is usually below the observational noise level and that, furthermore, it is a conservative estimate: as seen in the histograms of fig. 11, the true (noiseless) observation is more tightly distributed around the posterior mean than expected from the respective normal distribution.
We also examine the effect of having a different number of image-plane GP layers and find that the reconstructions are all almost equally good, except in the case of a single small-scale layer of λ = 2 mas (labelled 1λ), owing to the fact that the most relevant variations are captured by the source-plane and largest image-plane layers. We observe that in general all layers are assigned non-negligible power, with the total (corresponding to the total amount of detail in the image) roughly preserved (see fig. 18).
Similarly, all fits (except for 1λ) have comparable uncertainty in the reconstruction. Even though each image-plane GP layer contributes a number of free parameters comparable to the size of the data, the model does not overfit because it is optimised variationally; and even though the small-scale layers are uncorrelated and weakly constrained, optimising their variance hyperparameters prevents them from contributing excessive noise to the reconstruction (see fig. 9).
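The mechanism by which variance optimisation suppresses excess noise can be illustrated with a minimal analogue (all numbers hypothetical): an uncorrelated "layer" w with learnable variance s², fitted to pure-noise data by maximising the exact marginal likelihood. The optimised s² collapses towards zero, and with it the layer's contribution to the reconstruction.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_n = 1.0
y = rng.normal(0.0, sigma_n, size=10_000)   # data that is pure noise: no signal

# Exact log marginal likelihood of the model y_i = w_i + eps_i, with
# w_i ~ N(0, s2) i.i.d. (an uncorrelated "small-scale layer") and
# eps_i ~ N(0, sigma_n^2), so that marginally y_i ~ N(0, s2 + sigma_n^2):
def log_marginal(s2):
    v = s2 + sigma_n**2
    return -0.5 * np.sum(y**2) / v - 0.5 * len(y) * np.log(2 * np.pi * v)

# Optimise the layer variance over a grid (gradient descent in the pipeline).
grid = np.linspace(0.0, 2.0, 2001)
s2_hat = grid[np.argmax([log_marginal(s2) for s2 in grid])]

# With s2 ~ 0, the posterior mean E[w | y] = s2 / (s2 + sigma_n^2) * y
# collapses to zero: the layer adds no spurious detail.
shrinkage = s2_hat / (s2_hat + sigma_n**2)
```

When the data does contain power at the layer's scale, the same optimisation assigns it a commensurately larger s², which is the automatic-regularisation behaviour described in the text.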
Finally, we present the full approximate posterior for the lens parameters from all fits in fig. 12. In general, adding a small-scale GP source lowers the uncertainties with respect to the initial (source-plane source only) fit by a factor of ∼3-5, while remaining unbiased. The final posteriors are similar across the fits with different numbers of small-scale layers, with only a slight improvement noticeable as the layer count increases. Again the exception is the 1λ fit, which still improves upon the initial fit, but by a smaller factor, while remaining unbiased.

CONCLUSION
Our aim with this work was three-fold: (1) to present a fully differentiable pipeline for the analysis of strong lensing images, (2) to introduce Gaussian processes (GPs) as a statistical model for the surface brightness of the source, and (3) to lay the foundations for targeted generation of training data for neural inference models.
Based on the deep learning framework PyTorch, our simulator code allows gradients to be calculated automatically for arbitrary lensing and instrumental setups, while benefitting from trivial parallelisation on graphical processors. This enables powerful statistical methods like variational inference, which optimises a lower bound on the Bayesian evidence (ELBO) and thus enables posteriors to be derived for all lens and source parameters simultaneously with the optimisation of any relevant hyperparameters. The probabilistic programming approach of Pyro integrates tightly with the PyTorch ecosystem and enables the physics simulator and inference machinery to use the same code, resulting in fast analyses that scale to the high-resolution, large datasets that will be produced by upcoming telescopes.

Figure 11. Distributions of the residuals (corresponding to the bottom panel of fig. 10) for the fits with different numbers of small-scale layers, compared to a standard Gaussian. All histograms, except for 1λ, approximately coincide, indicating the approximate independence of the result from the number of layers used. Moreover, they are narrower than a standard Gaussian (black lines), indicating a conservative estimate of the uncertainty. Pixels have been split into "core" (top) and "ring" (bottom) as in fig. 10.
At the same time, Gaussian processes provide a principled nonparametric Bayesian framework for learning correlation structures directly from data while quantifying the reconstruction uncertainty. The hyperparameters we have used admit a straightforward interpretation as the spatial scales on which the source light varies and the amplitudes of these variations. Computing the exact GP covariance matrix, however, is computationally expensive since, beyond the intrinsic correlations of the source on different spatial scales, it must also account for the overlaps between pixel projections in the source plane. Furthermore, the full GP likelihood is accessible only through a matrix inversion and determinant calculation, both of which have prohibitive complexity. Hence, we have resorted to a number of approximation techniques, which we have introduced and validated:

• By rephrasing the exact GP as basis-function regression defined over a set of inducing points, we have constructed a source model whose (hyper-)parameters can be optimised via gradient descent, circumventing inversions and determinants.
• We have introduced an accurate approximation scheme for pixel overlaps to stand in for the GP covariance. A PyKeOps-based implementation produces optimised GPU routines that fit seamlessly into the overall auto-differentiable framework.

Figure 12. Variational posteriors for the lens parameters of the mock ELT observation of Hoag's object. The paler lines come from the (successful) "initial fits" and are colour-coded from red to green according to the average loss value of the last 100 steps (green is lower loss), while the thicker lines come from the final posteriors with the number of small-scale GP layers as indicated in the legend. Contours in off-diagonal panels encompass 95 % of the approximate posterior probability, while the diagonal plots depict the marginal PDFs (in different arbitrary units for each panel). Black lines are placed at the true values.
• We have refined the standard Gaussian covariance function into a set of layers to allow for a flexible effective GP kernel, modelling multiple relevant scales simultaneously and automatically adjusting their relative contributions.
• Lastly, for the different layers we have used two strategies for choosing inducing points: for fast reconstruction of the source's large-scale appearance we have defined a fixed grid, while for recovering fine details we have allowed the inducing points to move with the fitting of the lens. The latter has the effect of focusing the inducing points onto the relevant highly magnified regions while also reducing correlations between the source parameters and the lens. We have also used a proposal posterior for VI that partially captures these correlations.
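The inducing-point strategy can be sketched in one dimension: a hypothetical two-layer source model built from Gaussian basis functions centred on inducing points (all scales, counts, and learning rates chosen purely for illustration), fitted by plain gradient descent with no matrix inversion or determinant evaluation.

```python
import numpy as np

# 1D sketch: "source brightness" as a sum of Gaussian basis functions on
# inducing points, one set per spatial scale ("layer"). Illustrative only.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 200)
truth = np.sin(2 * np.pi * x) + 0.3 * np.sin(12 * np.pi * x)
y = truth + rng.normal(0.0, 0.05, size=x.size)        # noisy "observation"

scales = [0.2, 0.05]                                  # kernel sizes of the layers
centres = [np.linspace(0, 1, 8), np.linspace(0, 1, 40)]  # inducing points
B = np.hstack([np.exp(-0.5 * ((x[:, None] - c[None, :]) / s)**2)
               for c, s in zip(centres, scales)])     # basis matrix

z = np.zeros(B.shape[1])                              # source parameters (weights)
lam = 1e-3                                            # weak Gaussian prior on z
for _ in range(10_000):                               # plain gradient descent:
    r = B @ z - y                                     # no inversion, no determinant,
    z -= 0.1 * (B.T @ r / y.size + lam * z)           # auto-diff friendly

mse = np.mean((B @ z - truth)**2)                     # error against the truth
```

In the pipeline the fine-layer centres additionally move with the lens fit; here they are fixed for brevity, which does not change the basic point that the whole model is optimised by gradients alone.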
Combining GPs and VI, we have been able to reconstruct observations down to the noise level (fig. 8) while making unbiased inferences for the main lens parameters (fig. 12) from high-resolution mock observations. We have shown that the results depend only weakly on how the source layers are configured and that adding many small-scale layers has only a marginal impact on the residuals. Meanwhile, optimising their variance hyperparameters effectively prevents them from contributing excessive noise. Through analytic arguments and numerical one-dimensional tests, we have demonstrated that the uncertainties of the variational posteriors for the source parameters are conservative. While recognising that replacing the GP marginal likelihood with the ELBO biases the GP hyperparameters and flux reconstructions towards zero, we have shown that the offset scales inversely with the observation's signal-to-noise ratio, and so we do not expect it to significantly impact our analysis. We have found our modelling sufficient to obtain unbiased lens parameter estimates with resolution-limited precision. More complex lens models and proposal posteriors could be employed in future work if necessary.
The computational efficiency of the pipeline allows obtaining reasonable lens parameter posteriors in ∼10 min, with full convergence in a few hours. Importantly, the runtime of the pipeline scales linearly with model complexity, making it eminently applicable to more realistic analysis configurations. We will extend fits to multi-band observations with non-trivial instrumental effects (a realistic point-spread function, for example), explore more complex main lens models, and include the lens galaxy's light in follow-up work. Additionally, we plan to improve the modelling of the smallest-scale source details, which are difficult to fully resolve at present. Lastly, our main scientific goal is to make inferences about dark matter substructure and ultimately the fundamental features of dark matter. A simultaneously accurate and precise reconstruction of the source and main lens in all their complexity is crucial for detecting the percent-level fluctuations introduced by substructure. Furthermore, due to the statistical challenges of the substructure inference problem itself, the approaches best suited for tackling it require a wealth of well-targeted but also realistic training data, which our variational source analysis is well positioned to deliver. In upcoming work, therefore, we will use the pipeline presented here as a targeted simulator (Coogan et al. 2020) to generate training data to teach neural networks to make substructure inferences, enabling heretofore intractable analyses.

[…] posteriors, starting from reasonable guesses for the axis ratio and position angle and random guesses for the other parameters. We are able to fit the observation at the noise level and derive uncertainties. In the case of NGC 4414, the "raw" uncertainties approach the original resolution at which the mock image was created, and so we put a lower bound on them by artificially introducing "sub-pixel"-localisation noise on the sampled lens parameters. In all cases we can reproduce the small-scale details of the source galaxies (figs. 14 to 17), with the same caveats as discussed in section 4.

B ON THE NUMBER OF IMAGE-PLANE GP LAYERS
See figs. 18 and 19 for further details on the reconstruction of the image of Hoag's object with different numbers of image-plane GP layers. In general we find that the improvement in reconstruction fidelity is minimal beyond ∼3-5 layers. Meanwhile, owing to the automatic regularisation provided by the optimisation of the variance hyperparameters, the overall "power" accounted for by the source model (related to the area under the curve in fig. 18) is kept roughly constant and is simply redistributed across the layers. Furthermore, we reiterate that the many additional parameters, which for the smallest layers are independent of one another and largely unconstrained by the data, neither lead to overfitting nor contribute excessive predictive variance, and so the reconstructions and the associated losses remain similar.

C WINDOWING NORMALISATION DERIVATION
Due to conservation of surface brightness, the brightness received in pixel i is the integral, over the pixel's area in the image plane, of the source surface brightness at the projected locations in the source plane; i.e.

F_i = ∫ d²x 𝟙_i(x⃗) S(β⃗(x⃗)),

where x⃗ is an image-plane and β⃗ a source-plane coordinate, S is the source surface brightness, and 𝟙_i is the pixel's image-plane indicator function, which is simply 1 inside the pixel and zero outside. The coordinate transformation²⁷ x⃗ → β⃗ introduces a factor equal to the magnification from eq. (8), which can be approximated by the ratio A_i/a_i of image-plane to source-plane pixel areas:

F_i ≈ (A_i/a_i) ∫ d²β 𝟙_i(x⃗(β⃗)) S(β⃗).  (51)

From here we can identify W_i(β⃗) ∝ 𝟙_i(x⃗(β⃗))/a_i, with a_i the projected (source-plane) pixel area. This is a function that is still zero outside the (projected) pixel but has the value a_i⁻¹ inside, which leads to the normalisation ∫ dβ⃗ W_i(β⃗) ∝ 1, and to its counterpart for the approximate window, with the same coefficient of proportionality, for which it is only important not to depend on the projected pixel properties, i.e. on the kernel. The first condition is trivially satisfied by a properly normalised Gaussian. Plugging eq. (22) and eq. (25) into the second condition, we can verify it (in the last line of the calculation it is presumed that the pixel is a parallelogram with sides given by the two projected edge vectors).
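The area-ratio approximation to the magnification can be checked numerically for a locally linear lens mapping β⃗ = A x⃗ (with an arbitrary illustrative Jacobian A): the magnification is 1/|det A|, and the ratio of image-plane to projected pixel areas recovers it exactly in the linear case.

```python
import numpy as np

# Illustrative lensing Jacobian d(beta)/d(x); in the linear case the
# magnification is mu = 1/|det A|, and the projected (source-plane) area
# of a pixel is |det A| times its image-plane area.
A = np.array([[0.2, 0.1],
              [0.05, 0.4]])

corners = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)  # unit pixel
proj = corners @ A.T                                               # projected corners

def shoelace(p):
    """Polygon area from its (ordered) corners."""
    x, y = p[:, 0], p[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

area_ratio = shoelace(corners) / shoelace(proj)   # A_pix / a_pix
mu = 1.0 / abs(np.linalg.det(A))
```

For real lensed pixels the mapping is only approximately linear across the pixel, which is the level of approximation made in the text.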

D ANALYTICAL TREATMENT OF A VARIATIONALLY OPTIMISED GAUSSIAN PROCESS
An exact multilayer GP has a (prior) covariance matrix that is the sum of the covariances of the different layers: K = Σ₍ℓ₎ σ²₍ℓ₎ K₍ℓ₎. We assume that each covariance layer can be factorised into […] this is smaller than the marginal variance of the true posterior, and so, as expected of a mean-field approximation, the per-parameter variance is underestimated.
However, we are interested instead in the reproduced posterior covariance of the true values, i.e.

Var_q(f) = T V Tᵀ,  (61)

in which the transfer matrix T combines the variances of a number of a priori independent source parameters that are, in the exact case, a posteriori anti-correlated (due to the second term in eq. (57)): intuitively, the information that nearby source parameters produce similar effects in the observed data is lost in a diagonal approximation. This leads to an overestimation of the posterior variance of f that scales roughly with the number of a priori correlated observations³¹: see fig. 5.
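The two effects can be seen side by side in a minimal two-parameter Gaussian example (with a hypothetical precision matrix): a mean-field Gaussian fitted by minimising KL(q ‖ p) matches the diagonal of the precision matrix, which underestimates each marginal variance, while the variance of the sum of the parameters, which discards the lost anti-correlations, is overestimated.

```python
import numpy as np

# Hypothetical anti-correlated 2-parameter Gaussian posterior: nearby source
# parameters that affect the data similarly end up a posteriori anti-correlated.
Lam = np.array([[1.0, 0.9],
                [0.9, 1.0]])          # precision matrix
V = np.linalg.inv(Lam)                # true posterior covariance

# A mean-field (diagonal) Gaussian fitted by minimising KL(q || p) matches
# the precision diagonal, giving per-parameter variances 1/Lambda_ii:
v_mf = 1.0 / np.diag(Lam)

# For the combination f = theta_1 + theta_2 (transfer vector t = [1, 1]):
t = np.ones(2)
var_true = t @ V @ t                  # anti-correlation reduces this
var_mf = t @ np.diag(v_mf) @ t        # the diagonal q cannot represent it
```

Per parameter the mean-field variance sits below the true marginal; for the combined quantity the inequality flips, which is the overestimation described in the text.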

D.1 Hyperparameter optimisation
To verify eq. (44), we will directly compute the ELBO already minimised with respect to the variational mean and variance at a fixed set of hyperparameters σ²₍ℓ₎ (from which we can construct the matrix A as above). Substituting θ = μ and eq. (59) into eq. (58) […]. Using this and the evidence from eq. (15), we can write the loss function, which is the negative of the ELBO, up to constant terms and a proportionality. The log-determinant of V can be written via eq. (57) […]. To proceed, we will look into the case of fitting a single-layer GP to data that consists of clumps of observations, where the points within each clump are perfectly correlated with each other but perfectly independent of the points in any other clump. We can assume that each clump consists of N_c observations, given, for a stationary covariance function, by […], with ρ the average point density. Then the covariance matrix K is block diagonal, and each N_c × N_c block is filled with the value σ², i.e. we can write a block as K = σ²uuᵀ, where u is a vector of N_c ones.
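For a single such clump, the relevant inner product has a closed form by the Sherman-Morrison identity, uᵀ(K + σ_n²I)⁻¹u = N_c/(σ_n² + N_c σ²); a quick numerical check with illustrative values (N_c, σ², and the noise variance σ_n² are arbitrary here):

```python
import numpy as np

# One clump of N_c perfectly correlated observations: K = s2 * u u^T, with
# u a vector of ones. By Sherman-Morrison, the inner product
# u^T (K + sn2*I)^{-1} u equals N_c / (sn2 + N_c * s2).
N_c, s2, sn2 = 7, 0.5, 0.1            # illustrative values
u = np.ones(N_c)
K = s2 * np.outer(u, u)

lhs = u @ np.linalg.solve(K + sn2 * np.eye(N_c), u)   # direct evaluation
rhs = N_c / (sn2 + N_c * s2)                          # closed form
```

The closed form makes explicit how the effective noise on a clump shrinks as N_c grows, which drives the scalings discussed in this appendix.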
Crucially, now we can compute the inner product in eq. (65) (for

Figure 3. The projection of two pixels onto the source plane (a). A full treatment would consider their exact shapes and use an indicator function that is non-zero only inside the shaded areas. Instead, we approximate the pixels as ellipses defined by the (projections of) the pixel centres and the pixel edge vectors, as shown in (b). The indicator functions are Gaussians with covariances derived from those ellipses (see the text for the exact definitions) and are depicted in (c).

Figure 4. Bottom: comparison of the pixel-pixel covariance calculated exactly using the projected pixel geometry (horizontal axis) to the approximation used in this work (eqs. (22), (25) and (26)). The colour scale depicts a "shape index" = 4π × area/perimeter², an indicator of elongation, which is 1 for a circle and approaches 0 for very elongated polygons. The shaded area indicates the region within a factor of 2 of the exact result. Top: cumulative fraction of pixel pairs (subsets as indicated in the legend) for which the approximation deviates (either above or below) from the exact calculation by more than a given factor. The setup is an example lensed pixel grid of size 100 × 100 for a typical lensing configuration from section 4, and a single layer of kernel size comparable to the typical (projected) pixel sizes. The "exact" calculation was performed by averaging eq. (17) over 1000 uniformly sampled points in each pixel. Only pixels that would be considered for fitting (masked²³ as in section 4) are included in the bottom panel, and a further cut was applied to remove abnormally elongated pixels (located e.g. near critical lines).
q_φ(θ), to approximate the true posterior p(θ | d) of the model parameters θ, and optimising its parameters φ using gradient ascent on the function

ELBO[p(θ, d), q_φ(θ)] = E_{q_φ(θ)}[ln p(d, θ) − ln q_φ(θ)],  (36)

where p(d, θ) = p(d | θ) p(θ) = p(θ | d) p(d) is the joint likelihood of the generative model. The ELBO is thus composed of two terms: the overlap between the proposal and the posterior (rescaled by the evidence) and the entropy of the proposal, both of which we naturally would like to maximise. More formally, it can be shown that the difference between the (logarithm of the) model evidence and the ELBO is the Kullback-Leibler (KL) divergence between the variational and true posteriors (Bishop 2006):

ln p(d) − ELBO[p(θ, d), q_φ(θ)] = E_{q_φ(θ)}[ln q_φ(θ) − ln p(θ | d)] = KL[q_φ(θ) ‖ p(θ | d)],
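The identity between the evidence-ELBO gap and the KL divergence can be verified numerically in a conjugate toy model (all numbers illustrative); since the integrand is quadratic in θ, Gauss-Hermite quadrature evaluates the ELBO expectation exactly.

```python
import numpy as np

# Conjugate toy model: theta ~ N(0, 1), d | theta ~ N(theta, 1), d observed.
# Checks that ln p(d) - ELBO = KL(q || posterior) for a Gaussian proposal q.
d = 1.3
m, s = 0.3, 0.9                       # proposal q(theta) = N(m, s^2)

def log_norm(x, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mu)**2 / var

# ELBO = E_q[ln p(d, theta) - ln q(theta)] by Gauss-Hermite quadrature
# (exact here, since the integrand is quadratic in theta).
nodes, weights = np.polynomial.hermite.hermgauss(20)
theta = m + np.sqrt(2.0) * s * nodes
integrand = (log_norm(theta, 0.0, 1.0) + log_norm(d, theta, 1.0)
             - log_norm(theta, m, s**2))
elbo = np.sum(weights * integrand) / np.sqrt(np.pi)

# Analytic evidence p(d) = N(d; 0, 2) and KL to the posterior N(d/2, 1/2):
log_evidence = log_norm(d, 0.0, 2.0)
mu_p, v_p = d / 2.0, 0.5
kl = 0.5 * (s**2 / v_p + (m - mu_p)**2 / v_p - 1.0 + np.log(v_p / s**2))
```

Because the KL divergence is non-negative, the ELBO is indeed a lower bound on the log evidence, tight exactly when the proposal matches the posterior.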

Figure 6. Mock ELT observation (left panel) and the original HST image used as a source (right panel). The red curves depict the tangential critical curve and the corresponding caustic, in the image and source planes respectively, along which the magnification is (formally) infinite. While we show the images in colour, the analysis was performed on a grayscale version.

Figure 7. Loss history of our fits to the image of Hoag's object. The left panel shows the initial fit stage: 100 trials with lenses initialised at random from the prior (but see the text). The lines are colour-coded according to the current learning rate. At step 2000 we discard any fit that has not achieved the minimum loss indicated by the dashed red line. The right panel shows the subsequent fitting of the full variational posterior, including the image-plane source parameters (in the shaded area these were trained without varying the rest of the model), for different numbers of layers. We reset the training to lr = 10⁻³ every 10 000 steps, hence the spikes. All losses have been smoothed over 100 iterations and shifted by an arbitrary constant for presentation.

Figure 8. Figure 9. Figure 10.
Figure 8. Mock ELT image reconstruction in the image plane (top row) and the source plane (bottom row). The middle two columns depict the mean and standard deviation of images derived from 100 samples from the trained variational posterior with three small-scale GP layers. All fluxes, as well as the residuals in the rightmost column, have been normalised by dividing by the observational noise. See section 3.5.1 for a note and caveat about the source-plane reconstruction uncertainty.

Figure 18. Final inferred overall variances of the image-plane GP layers in fits with different numbers of layers. For comparison, the root overall variance of the source-plane layer, σ₍sp₎, is between 1.8 and 2 for the different fits. We observe a "redistribution" of the "power" of the modelled features.

Figure 19. Final image reconstruction losses, defined as the mean squared normalised errors between the observation and the mean of the PPD, for the fits with different numbers of image-plane GP layers. For the circles and solid line only the masked pixels are considered, while the squares and dashed line include all pixels. The data MSEs for the 1λ fit (not shown) are 1.46 (masked) and 1.08 (all).

Table 1 .
Summary of the variables involved in the model.We use small bold letters to denote "arrays" (ordered sets) and an arrow to indicate spatial vectors.A bold variable with an arrow indicates an array of vectors, and indexing it returns, naturally, a single vector.Bold capital letters are used for matrices operating on arrays, and a double underscore denotes a matrix that operates on spatial vectors.The variable Z is an array of spatial matrices.

Table 3 .
Parameters of the variational posterior.