PyTorchDIA: A flexible, GPU-accelerated numerical approach to Difference Image Analysis

We present a GPU-accelerated numerical approach for fast kernel and differential background solutions. The model image proposed in the Bramich (2008) difference image analysis algorithm is analogous to a very simple Convolutional Neural Network (CNN), with a single convolutional filter (i.e. the kernel) and an added scalar bias (i.e. the differential background). Here, we do not solve for the discrete pixel array in the classical, analytical linear least-squares sense. Instead, by making use of PyTorch tensors (GPU-compatible multi-dimensional matrices) and associated deep learning tools, we solve for the kernel via an inherently massively parallel optimisation. By casting the Difference Image Analysis (DIA) problem as a GPU-accelerated optimisation which utilises automatic differentiation tools, our algorithm is both flexible to the choice of scalar objective function, and can perform DIA on astronomical data sets at least an order of magnitude faster than its classical analogue. More generally, we demonstrate that tools developed for machine learning can be used to address generic data analysis and modelling problems.


INTRODUCTION
Difference Image Analysis (DIA) describes several astronomical image processing algorithms with the shared goal of delivering precise photometric measurements of variable astronomical sources. Given two images of the same scene acquired at different times, in a DIA framework one aims to subtract one image from the other and recover a difference image from which differential fluxes can be directly measured. This makes DIA a particularly effective approach to measuring the photometric variability of objects in crowded stellar fields, such as in microlensing campaigns (e.g. Wozniak 2000; Bond et al. 2001) and studies of globular clusters (e.g. Bramich et al. 2011; Kains et al. 2012; Figuera Jaimes et al. 2013), where the blending of light from neighbouring sources cannot otherwise be easily disentangled.
Even for two perfectly aligned images from the same instrument, in order to produce a clean subtraction, the DIA algorithm of choice must model the inevitable changes in 1) the point spread function (PSF), due to variations in seeing, telescope focus or tracking errors; 2) the photometric scaling, from differences in atmospheric transparency, exposure time or instrument throughput; and 3) the sky background, caused by e.g. changes in the position and phase of the Moon. The differential change in PSF and photometric scaling can be found by inferring the convolution kernel, which 'blurs' the sharper of the two images to match the PSF in the other. Alard & Lupton (1998) (hereafter AL98) demonstrated that the kernel can be found using linear least-squares by decomposing it as a set of user-specified basis functions. Specifically, AL98 opted for Gaussian basis functions modified by low-order polynomials. A decade later, the linear least-squares solution was advanced by Bramich (2008) (hereafter B08), who modelled the kernel as a highly flexible (albeit computationally more expensive) discrete pixel array; this is analogous to the AL98 algorithm, but with a choice of delta basis functions for the kernel model. Later advances included extending the DIA solution to a spatially varying kernel and differential background to model position-dependent variations in seeing, transparency and airmass across the field-of-view (FoV) in wide-field imaging data (Alard 2000; Bramich et al. 2013).

★ E-mail: jah36@st-andrews.ac.uk
In this analytical linear least-squares framework, a normal matrix for the linear system of equations is required, and it is the construction of this normal matrix which results in a computational bottleneck. For example, for a (square) pair of images each of size N × N, and a (square) kernel of size K × K, the construction of the normal matrix for the B08 approach (which implements the same model for the kernel as our algorithm) scales as O(N² K⁴), i.e. the run-time increases with the square of the input image size and with the kernel size to the power of four. These problems are exacerbated by the need to iteratively fit for the model parameters in the linear least-squares sense (as the problem is in fact non-linear; see Section 2.1), and, typically, the normal matrix will need to be constructed about 3-4 times until convergence is reached. Attempts have been made to exploit symmetries in the problem (see Section 5.2 of Bramich et al. 2013) or to bin pixels around the kernel edges to speed up this construction (Albrow et al. 2009). In the era of wide-FoV sky surveys, such as the upcoming Legacy Survey of Space and Time (LSST; Ivezić et al. 2019) or the ongoing Zwicky Transient Facility (ZTF; Bellm et al. 2018), which aim to deliver prompt alerts of transient astronomical phenomena detected through image subtraction, the computational cost of many current popular approaches makes real-time event discovery impractical. This has driven some recent advances in the DIA literature, most notably the ZOGY algorithm (Zackay et al. 2016), and even a machine-learning approach (Sedaghat & Mahabal 2018). Image processing tasks, where some common operation is performed on very many pixels, are inherently massively parallel. Mild to substantial computational speed-ups from parallelising the construction of the normal matrix in classical DIA algorithms with GPUs have been demonstrated in the literature (for example, Hartung et al. 2012; Li et al. 2013; Zhao et al. 2013).
Adoption by the larger astronomical community, however, has been slow.
The main barrier to adoption for most astronomers is likely the lack of a working knowledge of CUDA, NVIDIA's parallel computing platform, which is required to develop GPU-accelerated applications on supported devices. In astronomy, the Python programming language has firmly established itself as a favourite of the majority of the community, and most importantly, will be the primary user-interface language for the next generation of astronomical data management, processing and analysis systems (Perkel 2018). The appropriateness of DIA image models can be highly data set dependent, and tuning by individual researchers to meet their science goals within a Pythonic framework is to be expected. The work presented here is one of a few recent attempts 2 to address the paired issues of performance and accessibility.
In this paper, we present a novel Pythonic implementation of an alternative route to the B08 solution, without the need for constructing the computationally expensive normal matrix. Conceptually, our approach is unique in that we model the kernel as if it were the convolutional filter of a very simple convolutional neural network (CNN), which can be solved for efficiently with GPU-accelerated optimisation tools originally developed for deep learning applications. As we will describe, the machine-learning framework which we use to approach this problem also equips us with powerful modelling tools, and frees us from a number of restrictive assumptions inherent to classical approaches. Our implementation is also unique amongst GPU-accelerated DIA algorithms for being written entirely with standard Python packages.
We begin Section 2 with an introduction of the basic image model used in our DIA implementation. We then outline how PyTorch could be well suited to addressing existing astronomical image modelling and data processing challenges in general, before describing the details of our own DIA implementation, 'PyTorchDIA'. We quantify the model fit quality and photometric accuracy of PyTorchDIA with tests on both synthetic and real images in Sections 3 and 4, comparing it directly to the performance of its classical DIA analogue, the B08 algorithm. In Section 5, we compare the speed of our GPU-accelerated numerical solution against a fast Cython implementation of the B08 algorithm 3 used in the ROME/REA project (Tsapras et al. 2019), and explore how our algorithm scales with image and kernel size. We summarise our conclusions in Section 6.

2 Including approaches attempting to parallelise the construction of the B08 normal matrix (Albrow 2017)

PROBLEM FORMULATION
In this section we outline the DIA problem, and provide a motivating overview for using PyTorch as a tool to address image modelling challenges -of which DIA is just one example -and describe the details of our DIA implementation.

Difference Image Analysis
Given a reference image with pixels R_ij, ideally of excellent spatial resolution and high signal-to-noise, and a target image with pixels I_ij, of the same scene taken at some other epoch and aligned on the same pixel grid, the model image M_ij which represents I_ij is given by

M_ij = (R ⊗ K)_ij + B_0 ,    (1)

and so the challenge is to fit for an accurate kernel, K, and differential background, B_0. Following B08, by representing the kernel as a discrete array containing a total of N_K pixels, and including an additive scalar differential background, B_0, we can re-write Equation 1 as

M_ij = Σ_lm K_lm R_(i+l)(j+m) + B_0 .    (2)

The photometric scale factor, P, which encodes any differences in atmospheric transparency and/or exposure time between the images, is simply the sum of the kernel pixels,

P = Σ_lm K_lm .    (3)

Assuming that the pixel values of the target image, I_ij, are independently drawn from normal distributions N(M_ij, σ_ij²), where M_ij is parameterised by the vector θ = [K, B_0] (see Section 2.3), the negative log-likelihood function for the target image takes the form

−ln L(θ) = (N_data/2) ln 2π + Σ_ij ln σ_ij + χ²/2 ,    (4)

where σ_ij are the pixel uncertainties, N_data is the number of pixels in the target image, and the χ² is equal to

χ² = Σ_ij (I_ij − M_ij)² / σ_ij² .    (5)

3 The kernel solution method heavily borrows from the relevant section of the pyDANDIA microlensing reduction pipeline, https://github.com/pyDANDIA.
It is important to note that N_data is not equal to the total number of pixels in the target image, as the convolution operation is undefined for pixels within half a kernel's width of the target image edges; i.e. for a kernel of size (2w + 1) × (2w + 1) and (square) reference and target images of size N × N, N_data = (N − 2w)².
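The effect of this 'valid' convolution region on N_data can be checked directly with PyTorch's convolution routine; the image and kernel sizes below are illustrative.

```python
import torch
import torch.nn.functional as F

N, w = 142, 7                # image size and kernel half-width (illustrative)
kernel_size = 2 * w + 1      # a (2w + 1) x (2w + 1) kernel

ref = torch.randn(1, 1, N, N)                        # reference image R
kernel = torch.randn(1, 1, kernel_size, kernel_size)

# A 'valid' convolution (padding=0) is undefined within w pixels of the edge,
# so the model image, and hence N_data, shrinks to (N - 2w)^2 pixels.
model = F.conv2d(ref, kernel)
n_data = model.shape[-1] * model.shape[-2]

print(model.shape[-1])  # 128 (= N - 2w)
print(n_data)           # 16384 (= (N - 2w)^2)
```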
The pixel uncertainties σ_ij are dependent on the image model M_ij, and so fitting this model to the target image is a non-linear optimisation task. Linear least-squares approaches like AL98 and B08 must therefore proceed iteratively, minimising the χ² with fixed estimates for the σ_ij. After the first estimate of M_ij is acquired, the σ_ij are computed, and then plugged into the χ² for the next iteration. This process continues for at least 3 iterations, until some convergence condition is met.
In this work, we use both simulated CCD images and real Electron Multiplying CCD (EMCCD) images, acquired with the Danish 1.54m (DK154) Lucky Imaging camera (Skottfelt et al. 2015), to verify our implementation; each requires a different noise model. In what follows, we ignore the noise contributions from the reference image, since these are negligible in the experiments performed in this work 4.
For CCD images, we adopt a noise model for the pixel uncertainties, σ_ij, of the form

σ_ij² = σ_0²/F_ij² + M_ij/(G F_ij) ,    (6)

where σ_0 is the read noise (ADU), G is the detector gain (e−/ADU), and F_ij is the master flat field. The EMCCD images are constructed from typically thousands of shift-and-added sub-second exposures. The electron-multiplying gain reduces the read-out noise to negligible levels, but the cascade amplification process effectively doubles the variance of the photon noise. For the real EMCCD images used in this work, we adopt a noise model of the form

σ_ij² = γ M_ij / (G_Total F_ij) ,    (7)

where G_Total represents the combined gain (e−_phot/ADU) of the DK154 Lucky Imaging camera, which is calculated as the ratio of the CCD gain of 25.8 (e−_EM/ADU) over the electron-multiplying (EM) gain of 300 (e−_EM/e−_phot). Following Harpsøe et al. (2012), we differentiate between electrons before and after the cascade amplification with the notation e−_phot and e−_EM respectively. γ represents the 'excess noise factor', which accounts for the probabilistic nature of the cascade amplification as the EMCCD is read out, and is set to γ = 2. EMCCD images are flat-corrected in the same way as conventional CCD images and, as in Equation 6, F_ij is the master flat field.
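As a sketch of the two noise models described above (a read-noise plus Poisson term for the CCD, and an excess-noise-scaled Poisson term for the EMCCD), the per-pixel variances can be computed as follows. The function names and default values are our own illustrative choices, and the exact algebraic forms follow our reading of the text.

```python
import torch

def ccd_pixel_variance(model, flat, read_noise=5.0, gain=2.0):
    """Per-pixel variance for a conventional CCD: a read-noise term plus a
    Poisson term, sigma^2 = sigma_0^2 / F^2 + M / (G * F).
    model: model image M (ADU); flat: master flat field F;
    read_noise: sigma_0 (ADU); gain: G (e-/ADU). Defaults are illustrative."""
    return (read_noise / flat) ** 2 + model / (gain * flat)

def emccd_pixel_variance(model, flat, total_gain=25.8 / 300.0, excess=2.0):
    """EMCCD variance with excess noise factor gamma = 2 and negligible
    read noise: sigma^2 = gamma * M / (G_total * F)."""
    return excess * model / (total_gain * flat)

M = torch.full((4, 4), 100.0)   # a flat 100 ADU model image
F_flat = torch.ones(4, 4)       # an ideal flat field
print(float(ccd_pixel_variance(M, F_flat)[0, 0]))   # 25 + 50 = 75.0
```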

Astronomical Image Processing with PyTorch
The model image of Equation 1 is a convolution with some added scalar constant. In the computer science literature, this is exactly analogous to an extremely simple convolutional neural network (CNN), with a single input and output related by a single convolutional filter, and some additional scalar bias added to the output. Efficient solutions for the 'weights' (the kernel) and 'bias' (background term) of this convolution operation can be implemented within the Python package PyTorch (Paszke et al. 2019).

4 Our interest in testing the performance of our DIA algorithm on DK154 EMCCD images relates to our ongoing microlensing follow-up campaign, which is the main science project of the MiNDSTEp consortium, http://www.mindstep-science.org/
PyTorch is a popular open-source machine learning framework. Its constant development is motivated by the impressive advances in deep learning (for an overview see e.g. LeCun et al. 2015; Goodfellow et al. 2016), in which CNNs have played an important role in processing images. Specifically, this package supports CUDA-enabled GPU acceleration and automatic differentiation to perform efficient optimisations. One key motivation behind these developments is the training of complex CNNs, consisting of thousands to hundreds of thousands of parameters. The 2-dimensional convolution (as in Equation 1) is the core processing operation in these networks, and is straightforward to implement in PyTorch's modelling architecture. Indeed, many useful image models in astronomy can be written as a convolution, and the tools outlined in this work are therefore broadly applicable to many astronomical image processing problems.
We stress that although PyTorch's powerful tools were designed for machine learning, they can be turned to generic data analysis and modelling problems, as shown in this work, of which image models are just a subset. In particular, efficient computation of gradients via automatic differentiation frees the user from having to manually recompute gradients if the parameterisation of the model is changed. This flexibility allows models written in PyTorch to be easily tuned to meet the variety of science goals arising from a diversity of data sets. In astronomical image processing in particular, this flexibility combined with GPU acceleration could make PyTorch a valuable tool for addressing challenges associated with both model complexity and data volume. As PyTorch is Pythonic, these implementations could be easily integrated into existing Python stacks.
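As an illustration of this analogy, the Equation 1 image model can be written as a one-filter 'network' in a few lines of PyTorch. The class name, kernel size and initialisation here are our own choices for the sketch.

```python
import torch
import torch.nn.functional as F

class DIAModel(torch.nn.Module):
    """Minimal sketch of the Equation-1 image model: a single convolutional
    'filter' (the kernel) plus a scalar 'bias' (the differential background)."""
    def __init__(self, kernel_size=15):
        super().__init__()
        # Kernel pixels initialised flat and summing to 1
        init = torch.full((1, 1, kernel_size, kernel_size),
                          1.0 / kernel_size ** 2)
        self.kernel = torch.nn.Parameter(init)
        self.b0 = torch.nn.Parameter(torch.zeros(1))  # differential background

    def forward(self, ref):
        # 'valid' convolution of the reference image with the kernel, plus B0
        return F.conv2d(ref, self.kernel) + self.b0

model = DIAModel(kernel_size=15)
ref = torch.randn(1, 1, 142, 142)
out = model(ref)
print(out.shape)   # torch.Size([1, 1, 128, 128])
```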

Difference Image Analysis as an Optimisation
The crucial advance in AL98 was to formulate DIA as a linear least-squares problem, and several efficient algorithms for solving these problems exist (Golub & Van Loan 1996). The computational bottleneck associated with this approach is the construction of the normal matrix (see Section 1). Models in PyTorch however are generally fit with a numerical optimisation procedure -making use of automatic differentiation -which we outline here.
For a given target image with Gaussian noise contributions, σ_ij, and current estimates for the kernel, K, and differential background, B_0, which transform the reference image, R, we define our vector of weights as θ = [K, B_0]. Ignoring the irrelevant normalisation constant, the maximum-likelihood estimate (MLE) for θ can be found by minimising the overall loss (i.e. the negative log-likelihood; cf. Equation 4),

L(θ) = Σ_ij [ ln σ_ij(θ) + (I_ij − M_ij(θ))² / (2 σ_ij(θ)²) ] .    (8)

We note here that the image model M_ij and the noise model σ_ij are both functions of θ; we drop the notation explicitly showing this dependence for the rest of the manuscript. Equation 8 can be solved by a process of (steepest) gradient descent. At each iteration t in the optimisation, we use the update rule

θ^(t+1) = θ^(t) − η ∇_θ L(θ^(t)) ,    (9)

where η is the learning rate (or step size) at each iteration. PyTorch includes implementations of several more sophisticated gradient descent algorithms, which can compute adaptive learning rates for each parameter (see Ruder 2016 for a good overview). In our implementation, we use the Adaptive Moment Estimation (Adam) algorithm, which computes learning rates for the kernel pixels and the background parameters from estimates of the first and second moments of their gradients at each step (Kingma & Ba 2014). Adam has been empirically demonstrated to work well on non-convex optimisation problems, is computationally efficient, and its hyper-parameters are both intuitive and require little tuning from the astronomer (see Section 2.9 for an overview of the 'engineering' aspects of the optimisation).
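The steepest descent update rule (Equation 9) and its Adam counterpart can be sketched in PyTorch on a toy one-parameter quadratic loss; the parameter values and learning rates here are illustrative.

```python
import torch

# One explicit steepest-descent update (Equation 9): theta <- theta - eta*grad,
# demonstrated on the toy loss L(theta) = theta^2.
theta = torch.tensor([4.0], requires_grad=True)
loss = (theta ** 2).sum()
loss.backward()                       # autograd computes dL/dtheta = 2*theta
with torch.no_grad():
    theta -= 0.1 * theta.grad         # eta = 0.1: 4.0 - 0.1 * 8.0
print(round(float(theta), 4))         # 3.2

# In practice, PyTorch's Adam optimiser applies the same kind of update with
# per-parameter adaptive learning rates.
param = torch.tensor([4.0], requires_grad=True)
optimiser = torch.optim.Adam([param], lr=0.1)
for _ in range(200):
    optimiser.zero_grad()
    loss = (param ** 2).sum()
    loss.backward()
    optimiser.step()                  # param is driven towards the minimum at 0
```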
With an excellent choice of learning rate, solutions via steepest descent are lightning fast when accelerated on the GPU (see Section 5), but in general this approach can be slow. Solving this problem on the CPU, thereby forgoing the massive inherent parallelism otherwise exploited in Equation 9, would result in a substantial performance hit. The slow convergence of steepest descent is particularly severe close to the minimum of the loss function, where the gradients become increasingly shallow and the gradient steps become correspondingly smaller. An effective solution to this problem is to make use of quasi-Newton optimisation methods, which approximate the curvature of the loss surface.
These approaches converge extremely quickly where the loss surface can be modelled quadratically (for this convex optimisation, a very good assumption when close to the minimum), and use an approximation of the inverse of the Hessian at each iteration, (H^(t))^−1, to condition the search direction, giving the update rule

θ^(t+1) = θ^(t) − η (H^(t))^−1 ∇_θ L(θ^(t)) .    (10)

Newton methods in general are sensitive to bad parameter initialisation (i.e. when we are initially far from the minimum), and so we advocate optimising via steepest descent steps, as in Equation 9, to first get close to the minimum, before converging with the more memory-intensive quasi-Newton procedure once the relative change in the loss between any two optimisation steps falls below a user-specified threshold. Specifically, we use an L-BFGS method (see Chapter 7.2 of Nocedal & Wright 2006), which is available as an in-built algorithm in PyTorch 5.

Robust loss function
In general, the assumption that all the pixel values in the target image are drawn from N(M_ij, σ_ij²) will be violated for real images. Instrumental defects, cosmic ray hits and variable or transient sources etc. will all appear as outlying pixel values. Our Gaussian loss function (Equation 4) will be badly affected by these outliers, as it minimises the squared difference between the model and the data. The standard approach to outlier rejection in astronomy is to iteratively remove these 'bad' pixels by sigma-clipping, and B08 use this approach to mitigate the impact of outliers on the least-squares solution. In addition to being very sensitive to the accuracy of the adopted noise model, this procedure does not explicitly penalise the rejection of data, nor is it necessarily clear how many iterations should be performed. In the context of classical DIA algorithms, these iterations incur an extra computational expense, as a new normal matrix must be constructed each time.
As we are not restricted to standard least-squares minimisation, we are free to adopt a robust loss function derived from a heavier-tailed error distribution (Equation 11). Substituting z_ij = (I_ij − M_ij)/σ_ij, the Huber loss function, ρ_Huber,c(z), is defined as

ρ_Huber,c(z) = z²/2, for |z| ≤ c; c(|z| − c/2), for |z| > c .    (12)

We see that ρ_Huber,c(z) computes the squared error between the model and data for small residuals, and the absolute deviation for outliers above some user-specified threshold, c. This threshold defines the switch between quadratic and linear treatment of the error (see Figure 1), and Huber & Ronchetti (2009) recommend c = 1.345 as a suitable value for enforcing robustness while retaining reasonable efficiency for normally distributed data.

5 https://pytorch.org/docs/stable/optim.html
Using the same notation as in Section 2.1, if we assume the pixel values in the target image are drawn from the distribution in Equation 11, the negative log-likelihood function takes the general form

L(θ) = Σ_ij [ ln σ_ij + ρ_Huber,c(z_ij) ] + Z ,    (13)

where Z is the normalising constant. Note that as c becomes large, Z → (N_data/2) × ln 2π, which is equivalent to the normalisation constant of a Gaussian log-likelihood (cf. Equation 4).
As our implementation makes use of automatic differentiation, it is straightforward for the user to experiment with loss functions, as they are freed from having to manually recompute gradients. We highlight the Huber loss in particular as this enforces robustness without sacrificing performance (see Section 5.1), but any number of wide-tailed distributions (e.g. Student's t distribution) could be used as drop-in replacements.
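For instance, a Huber-based objective can be written directly and dropped in place of the Gaussian loss; the helper names below are our own, and PyTorch's built-in torch.nn.HuberLoss offers comparable behaviour via its delta threshold.

```python
import torch

def huber_rho(z, c=1.345):
    """Huber rho on the whitened residual z = (I - M) / sigma:
    quadratic for |z| <= c, linear beyond the threshold c."""
    abs_z = torch.abs(z)
    return torch.where(abs_z <= c, 0.5 * z ** 2, c * (abs_z - 0.5 * c))

def robust_loss(target, model, sigma, c=1.345):
    """Robust negative log-likelihood sketch: per-pixel log(sigma) term plus
    the Huber penalty, with the normalising constant dropped."""
    z = (target - model) / sigma
    return (torch.log(sigma) + huber_rho(z, c)).sum()

z = torch.tensor([0.5, 3.0])
# Quadratic branch gives 0.5 * 0.5^2 = 0.125; the outlier at z = 3 is
# penalised linearly as c * (3 - c/2).
print(huber_rho(z).tolist())
```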

Uncertainty estimation -Observed Fisher Information
The central limit theorem implies that any well-behaved likelihood function approaches a Gaussian near its maximum. The Fisher Information Matrix (FIM) is a measure of the curvature of the likelihood function with respect to the model parameters (intuitively, this can be thought of as a 'sensitivity'), and its inverse provides a lower bound on the asymptotic variance of the Maximum Likelihood Estimate (MLE).
The observed FIM, I(θ), for our N_K + 1 parameters is the (N_K + 1) × (N_K + 1) matrix containing the entries

I(θ)_kl = − ∂² ln p(I|θ) / ∂θ_k ∂θ_l ,    (14)

where ln p(I|θ) is the log-likelihood (Equation 4). The inverse of I(θ) evaluated at the MLE for θ (i.e. θ̂_MLE) can then be used as an estimate for the covariance matrix,

Cov(θ̂_MLE) ≈ I(θ̂_MLE)^−1 ,    (15)

which provides an estimate of the uncertainties on the model parameters by taking the square roots of the diagonal elements of Cov(θ̂_MLE). Fisher information provides us with the limiting precision with which model parameters can be estimated for any given data set, i.e. the subsequent error bars cannot be smaller (Heavens 2009). Formally, the Cramér-Rao inequality states that the uncertainty on some parameter θ_k satisfies

σ_θk ≥ sqrt( [I(θ)^−1]_kk ) .    (16)

This method is subject to two assumptions: 1) the likelihood function correctly describes the data-generating process (i.e. the error distribution of the measurements is correctly described by the likelihood), and 2) the likelihood really is approximately Gaussian at the MLE. In practice, both of these assumptions will likely be violated, and in these situations the uncertainty estimates from this approach can be severely underestimated. Given this, Andrae (2010) strongly recommends testing this assumption by checking the validity of the estimated covariance matrix, Ĉ. In general, a valid covariance matrix Ĉ must be positive definite (i.e. for any non-zero vector x, xᵀ Ĉ x > 0). If either the determinant of Ĉ is negative, or (after diagonalising the matrix) any eigenvalue is found to be negative or zero, then Ĉ is not valid. We include both these tests in our code release, and a warning flag is raised if either is failed.
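A minimal sketch of this procedure, using PyTorch's automatic differentiation to form the observed FIM for a toy one-parameter Gaussian likelihood (not our full DIA model):

```python
import torch
from torch.autograd.functional import hessian

# Observed Fisher information for a toy Gaussian negative log-likelihood
# with known sigma: -ln L = sum((x - mu)^2) / (2 sigma^2) + const.
x = torch.tensor([1.0, 2.0, 3.0, 4.0])
sigma = 0.5

def nll(mu):
    return ((x - mu) ** 2).sum() / (2 * sigma ** 2)

mu_hat = x.mean()                 # MLE of mu for a Gaussian is the sample mean
fim = hessian(nll, mu_hat)        # observed FIM (a 1x1 'matrix' here)
cov = 1.0 / fim                   # inverse FIM -> variance estimate on mu_hat
print(float(cov))                 # sigma^2 / n = 0.25 / 4 = 0.0625

# Validity check: the covariance estimate must be positive (definite)
assert cov > 0
```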

Extension to a Spatially Varying Background
Within the PyTorch architecture, a spatially varying background can easily be modelled by replacing B_0 in Equations 1 and 8 with a linear combination of functions of x and y. We adopt a polynomial model of some user-specified degree d,

B(x, y) = Σ_(p+q ≤ d) a_pq u^p v^q ,    (17)

where a_pq are the polynomial coefficients to be inferred, and u(x) and v(y) are the normalised spatial coordinates,

u(x) = (x − x_c)/N ,  v(y) = (y − y_c)/N ,    (18)

which result from a Taylor expansion of the coordinates (x, y) about the image centre (x_c, y_c), for an image of N × N pixels. This choice of coordinates is recommended by Bramich et al. (2013), and has the effect of (1) improving the orthogonality of the spatial polynomial terms, and (2) preventing the coefficients of the higher-order polynomial terms from being pushed towards zero.
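A sketch of this polynomial background on the normalised coordinates; the coefficient ordering and function name are our own illustrative choices.

```python
import torch

def poly_background(coeffs, N, degree):
    """Spatially varying background B(x, y) = sum_{p+q <= degree} a_pq u^p v^q
    on normalised coordinates u = (x - x_c)/N, v = (y - y_c)/N.
    coeffs are ordered by increasing p, then q (an illustrative convention)."""
    y, x = torch.meshgrid(torch.arange(N, dtype=torch.float32),
                          torch.arange(N, dtype=torch.float32), indexing="ij")
    u = (x - (N - 1) / 2) / N
    v = (y - (N - 1) / 2) / N
    background = torch.zeros(N, N)
    k = 0
    for p in range(degree + 1):
        for q in range(degree + 1 - p):
            background = background + coeffs[k] * u ** p * v ** q
            k += 1
    return background

# Degree 1: three coefficients [a_00, a_01, a_10]; only the constant term here
B = poly_background(torch.tensor([100.0, 0.0, 0.0]), N=16, degree=1)
print(float(B.mean()))   # 100.0
```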

Regularising the kernel pixels
As noted by Becker et al. (2012), while kernels modelled as a discrete array of pixels are very flexible, the consequent fidelity to the data can result in significant overfitting. To guard against excessively noisy kernels, we provide the option to regularise the loss with the addition of a penalty term. Following the notation in Bramich et al. (2016) (hereafter B16), where the strength of the regularisation is controlled by the parameter λ, which must be tuned empirically, the scalar objective function to minimise now takes the form

L(θ) = L_0(θ) + R(K) ,    (19)

where θ is the vector of parameters to optimise, L_0(θ) is the loss function, and the penalty term is

R(K) = (λ / N_data) Kᵀ H K ,    (20)

where N_data is the number of target image pixels and H is the symmetric, positive-semidefinite (N_K + 1) × (N_K + 1) Laplacian matrix, which represents the connectivity graph of the set of kernel pixels. Its elements are H_ii = n_adj,i, the number of kernel pixels adjacent to kernel pixel i; H_ij = −1, for 1 ≤ i, j ≤ N_K, i ≠ j, where i and j are adjacent kernel pixels; and H_ij = 0, otherwise.
This penalty term is derived from an approximation of the second derivative of the kernel surface (Becker et al. 2012). Intuitively, it favours compact kernels, in which adjacent kernel pixels should not vary too sharply. The optimal value of λ should, ideally, be tuned for each image, and B16 found that optimal values could lie anywhere in the range λ = 0.1 to 100.
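The connectivity-graph penalty can be sketched as follows; for simplicity, 4-connectivity (up/down/left/right) between kernel pixels is assumed, and the Laplacian is built only over the N_K kernel pixels.

```python
import torch

def kernel_laplacian(k):
    """Laplacian (connectivity) matrix H for a k x k grid of kernel pixels:
    H_ii = number of adjacent pixels, H_ij = -1 for adjacent i != j, else 0.
    4-connectivity is assumed here."""
    n = k * k
    H = torch.zeros(n, n)
    for i in range(k):
        for j in range(k):
            idx = i * k + j
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < k and 0 <= nj < k:
                    H[idx, idx] += 1
                    H[idx, ni * k + nj] = -1
    return H

def penalty(kernel_flat, H, lam, n_data):
    # Regularisation term added to the loss: (lambda / N_data) * K^T H K
    return (lam / n_data) * kernel_flat @ H @ kernel_flat

H = kernel_laplacian(3)
print(float(H[4, 4]))   # 4.0: the centre pixel of a 3x3 kernel has 4 neighbours
flat = torch.ones(9)
print(float(penalty(flat, H, lam=1.0, n_data=1)))   # 0.0: a constant kernel is not penalised
```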
By regularising the kernel weights, we are in effect introducing a Bayesian prior, which would then transform our solution into a maximum a posteriori (MAP) optimisation. In our experiments in the following sections, we restrict ourselves to the MLE solution by simply minimising either Equation 8 or 13.

General purpose computing on GPUs
For this work, we ran our PyTorchDIA implementation on two different NVIDIA GPUs. The computations associated with the results presented in Sections 3 and 4 were performed on a GeForce GTX 1050 on a local machine, and the speed tests in Section 5 were run on a Tesla K80 on Google's Colab 6 service, a free-to-use cloud-based computation environment. There are a couple of important details worth describing when using these devices for scientific computing.
Firstly, while modern CPUs are able to handle computations on 64-bit floating point numbers efficiently, such operations are either not supported on GPUs, or are associated with a significant reduction in performance (Göddeke et al. 2005). It is for this reason that 32-bit precision (or, increasingly commonly, 16-bit precision) is used in deep learning. For this problem, we have found that computations typically take at least 2-5 times longer for fiducial image sizes using 64-bit precision with a Tesla K80 7. For memory-intensive optimisers such as L-BFGS, the hit to performance will be even worse. For this reason, the results in this work use 32-bit floating point precision for our PyTorchDIA implementation. This speed-up comes at the expense of some accuracy, but we show this difference to be small in Sections 3 and 4.
Convolution computations on NVIDIA GPUs can be accelerated by highly-tuned algorithms available in specialised libraries. One such NVIDIA library, cuDNN, specialises in efficient computations with small kernels (i.e. those in the regime of DIA). These algorithms are selected by cuDNN heuristically, and not all are deterministic. As such, for an identical pair of images, the number of computations used to infer the convolution kernel may be different on different runs. Consequently, the optimiser may converge at a slightly different point in the parameter space. We explore the speed-up associated with enabling cuDNN tools in Section 5.
Finally, while not explored in this work, we highlight mixed precision 'training' as a technique from which our approach could benefit. PyTorch now supports Automatic Mixed Precision (AMP) training, in which 32-bit data can be automatically cast to half precision for some types of computationally expensive operations on the GPU 8. By appropriately scaling the loss during training/optimisation, AMP prevents the underflow that would otherwise cause gradients to drop to 0 at half precision, and can achieve the same accuracy as pure 32-bit training with a significant speed-up, depending on the GPU architecture and model design (Micikevicius et al. 2017). Some recent NVIDIA GPUs include Tensor Cores, which are specifically designed to perform highly optimised 16-bit matrix multiplications. Consequently, cuDNN convolution algorithms such as 'GEMM' are particularly well suited to benefit from this technique (Jordà et al. 2019).

Optimisation as an engineering problem
As we fit for the convolution kernel via an optimisation, hardware alone will not determine the solution time. There is an 'engineering' aspect to optimisation that the user should be aware of. Here, we summarise three key choices that the user must make: 1) the parameter initialisation; 2) the learning rate (or step size); and 3) the convergence condition. We also provide an overview of (and justification for) the specific choices made when generating the results presented in this manuscript.

6 https://colab.research.google.com
7 Tesla cards are designed to perform fast computations at higher precision than most GPU models, and a slowdown of ∼ 2−5 at 64-bit would still outperform some state-of-the-art classical DIA methods (see Section 5). Most NVIDIA models however (including GeForce cards) excel at 32-bit and lower precisions, but would suffer a more severe slowdown at 64-bit.
8 https://pytorch.org/docs/stable/notes/amp_examples.html

Throughout this work, we always background-subtract the reference image, and so we initially set the differential background parameter, B_0, to be equal to the median pixel value of the given target image. For the fit quality and photometric accuracy tests in Sections 3 and 4, we make no assumptions about the shape of the kernel at initialisation. We set all kernel pixels to the same value, with the only condition being that they sum to 1 (i.e. each pixel is initialised as 1/N_K). In Section 5, we experiment with initialising the kernel as a symmetric Gaussian, with a shape parameter judiciously set with knowledge of either the known or measured differences in the PSFs of the reference and target image pair.
The Adam algorithm used at the start of the optimisation (with which we perform steepest gradient descent) allows us the freedom to set individual learning rates for our model parameters, which will be adaptively tuned (see Section 2.3). For all tests in this work, we set the learning rate of the Adam optimiser for the pixels in the kernel to 0.001, as recommended in Kingma & Ba (2014). We have found that it is advantageous to use a fairly high learning rate for B_0, and we set this to either 1, 10 or even 100, depending on the data set. There is a strong anticorrelation between P and B_0, and quickly finding the approximate photometric scaling between the two images allows us to disentangle this offset from the inference of the spatial differences in the images (associated with the different PSFs), which is encoded in the shape (and not the 'scale') of the convolution kernel. The learning rate of the L-BFGS optimiser is always set to 1.
We use the same convergence condition for all tests in this work. This decision was made on the basis of the model performance and photometric accuracy metrics in Sections 3 and 4, and should balance satisfactory model accuracy against the time taken to converge. If between any two steepest descent updates the relative change in the loss is less than 10^−6 (i.e. we are getting close to the minimum), an L-BFGS optimiser takes over. For each subsequent quasi-Newton update, the L-BFGS function evaluation routine terminates when either the change in loss between evaluations reaches the limit of our numerical precision (which corresponds to a relative change in loss in the range 10^−8 − 10^−7 between the last two optimisation steps), or a first-order optimality condition is met, such that the gradient of the scalar objective function with respect to each model parameter is less than 10^−7.

The Algorithm
Our DIA algorithm is as follows. Decisions to be made by the user/human are in italics. 'tol' is the user-specified relative change in the loss between successive steepest descent (SD) updates, at which point the optimisation routine switches from SD to L-BFGS.
• Choose the square kernel dimensions (must be odd), and whether to use a scalar, or a polynomial fit of degree d, for the differential background
• Set Adam's per-parameter learning rates and 'tol'
• Set the convergence condition
• Initialise θ
• Begin minimising the chosen scalar objective function (Equation 4 or 13, with or without the optional penalty term, Equation 20) with steepest gradient descent (Equation 9)
• At each iteration, if the relative change in the loss is less than 'tol', switch to quasi-Newton updates (Equation 10) and continue until the convergence condition is satisfied.
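The schedule above can be sketched end-to-end on a small synthetic image pair; all sizes, learning rates and iteration caps here are illustrative, and a simple least-squares loss stands in for the full objective.

```python
import torch
import torch.nn.functional as F

# Synthetic test pair: target = (flat 5x5 kernel) * ref + B0, with B0 = 10
torch.manual_seed(0)
ref = torch.randn(1, 1, 64, 64)
true_kernel = torch.ones(1, 1, 5, 5) / 25.0
target = F.conv2d(ref, true_kernel) + 10.0

# Initialise: flat kernel summing to 1, zero differential background
kernel = torch.full((1, 1, 5, 5), 1.0 / 25.0, requires_grad=True)
b0 = torch.zeros(1, requires_grad=True)

def loss_fn():
    model = F.conv2d(ref, kernel) + b0
    return ((target - model) ** 2).mean()

# Phase 1: Adam with per-parameter learning rates, until the relative
# change in the loss falls below 'tol' (or an iteration cap is hit)
adam = torch.optim.Adam([{"params": [kernel], "lr": 0.001},
                         {"params": [b0], "lr": 1.0}])
prev, tol = float("inf"), 1e-6
for _ in range(2000):
    adam.zero_grad()
    loss = loss_fn()
    loss.backward()
    adam.step()
    if abs(prev - loss.item()) < tol:
        break
    prev = loss.item()

# Phase 2: quasi-Newton (L-BFGS) updates to converge near the minimum
lbfgs = torch.optim.LBFGS([kernel, b0], lr=1.0)
def closure():
    lbfgs.zero_grad()
    loss = loss_fn()
    loss.backward()
    return loss
for _ in range(10):
    lbfgs.step(closure)

print(round(float(b0), 2))   # recovered differential background
```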

SIMULATED IMAGE TESTS
Artificially generated images can be used to assess the accuracy of our algorithm. In this section, we compare the fit quality and photometric accuracy of our numerical, GPU-accelerated algorithm against an implementation of the analytic B08 algorithm with a data set of 71569 synthetic images.

Generating Artificial Images
We base our image generation procedure on Section 5.1 of B16. The results from the simulated image tests in that work were shown to closely agree with similar tests using real CCD data. Specifically, we generate images similar to those in the 'S10' set of that work, wherein the reference image of each image pair has 10 times less variance than the target.
We first generate a noiseless reference image of size 142 × 142 pixels as follows.
(i) We generate a normalised two-dimensional symmetric Gaussian PSF for this image, parameterised by a standard deviation, σ_r, drawn from the continuous uniform distribution U(0.5, 2.5).
(ii) We populate the 142 × 142 pixel array template of the noiseless reference with N_stars stars, whose log density, lg N_stars per 100 × 100 pixel region, is drawn from the uniform distribution U(0, 3). The fractional pixel coordinates for each star are drawn uniformly between the image axis lengths.
(iii) For each of the N_stars stars, we draw a value of F^(-3/2) from the uniform distribution U(10^-9, 10^-9/2), where F is a given star's flux (ADU). This flux distribution is a good approximation when imaging to a fixed depth for certain regions in the Galaxy. So that PSF photometry can later be performed at the position of the brightest star in the difference images, we move the pixel coordinates of the star associated with the largest F value to the centre of the image.
(iv) For each star, the normalised Gaussian profile generated in (i) is placed at the appropriate pixel coordinates, and scaled by the star flux from (iii).
(v) Finally, a constant background level, drawn from the continuous uniform distribution U(10, 1000), is added to the image.
(vi) It is common practice to create a high signal-to-noise reference frame by either stacking images or increasing the exposure time. To simulate this, we generate a 'noise map' to apply to the noiseless reference with 10 times less pixel variance than is applied to the target image. This is achieved by scaling the uncertainties by a factor of 10^-1/2 ≈ 0.316. Adopting the usual CCD noise model (Equation 6) with read noise σ_0 = 5 (ADU), gain G = 1 (e−/ADU) and flat-field value 1, we compute the reference frame pixel uncertainties accordingly.
(vii) A 142 × 142 pixel image, with values drawn from a standard normal distribution, N(0, 1), was also generated, and the noisy reference is formed by scaling this image by the noise map and adding it to the noiseless reference. For each noisy reference we then generate a corresponding target image.
(viii) As with the reference images, we choose a symmetric Gaussian PSF for the target images, parameterised by σ_t. As the convolution of a Gaussian with a Gaussian is another Gaussian, the kernel is itself a Gaussian, and we draw the corresponding kernel width, σ_k, from U(0.5, 2.5). We can then compute the width of the PSF in the target image, σ_t = (σ_r² + σ_k²)^(1/2), and repeat steps (iv)-(v), using σ_t in place of σ_r, to generate the noiseless target image. Additionally, we also apply sub-pixel shifts to the stellar fractional pixel positions along both the x- and y-axes, with the Δx and Δy shifts each drawn separately for each simulation from U(−0.5, 0.5).
(ix) Similar to (vi), we then generate the noise map for the target image,
(x) and then generate a 142 × 142 pixel image with values drawn from N(0, 1), and create the noisy target image. The signal-to-noise ratio of the target image is computed as in Equation 27.
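The recipe above can be sketched as follows. This NumPy sketch (illustrative names and simplified parameters; the flux distribution follows the stated F^(-3/2) ~ U(10^-9, 10^-9/2) prescription, and the noise model assumes the stated σ_0 = 5 ADU and G = 1) builds a noiseless star field and applies the CCD noise map, with the reference variance scaled down by a factor of 10:

```python
import numpy as np

def make_noiseless_image(size=142, sigma_psf=1.5, n_stars=50, sky=100.0, seed=0):
    """Steps (i)-(v): Gaussian stars with fluxes drawn so that F**(-3/2)
    is uniform on (1e-9, 10**-4.5), plus a constant sky background."""
    rng = np.random.default_rng(seed)
    yy, xx = np.mgrid[0:size, 0:size]
    img = np.full((size, size), sky, dtype=float)
    fluxes = rng.uniform(1e-9, 10 ** -4.5, n_stars) ** (-2.0 / 3.0)
    xs = rng.uniform(0, size, n_stars)
    ys = rng.uniform(0, size, n_stars)
    for f, x0, y0 in zip(fluxes, xs, ys):
        # normalised symmetric Gaussian profile scaled by the star flux
        psf = np.exp(-((xx - x0) ** 2 + (yy - y0) ** 2) / (2 * sigma_psf ** 2))
        img += f * psf / (2 * np.pi * sigma_psf ** 2)
    return img

def add_noise(img, read_noise=5.0, gain=1.0, ref=False, seed=1):
    """Steps (vi)-(x): CCD noise model sigma**2 = read_noise**2 + img/gain,
    with the reference uncertainties scaled by 10**-0.5 (10x less variance)."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(read_noise ** 2 + img / gain)
    if ref:
        sigma *= 10 ** -0.5
    return img + sigma * rng.standard_normal(img.shape), sigma
```

In the full procedure the target image is generated from a broadened PSF (σ_t in place of σ_r) with sub-pixel shifts; here both calls share one noiseless scene for brevity.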

Performance Metrics
We assess the fit quality and photometric accuracy of the kernel and background solutions with the following performance metrics. The derivations of these metrics are based on Section 5.2 of B16. For easy reference, the definitions and equation numbers for all metrics are listed in Table 1.
(i) Model error. The mean squared error (MSE) assesses how well the inferred model image matches the true noiseless target image. The smallest values of MSE indicate the best fit in terms of model error.
As noted in Bramich et al. (2015), systematic errors in the photometric scale factor can badly influence photometric accuracy. Given this, we also assess the inferred photometric scale factor and B_0 parameters, whose true values are equal to 1 and 0 respectively in these simulations.
(ii) Fit quality. The mean fit bias (MFB) and mean fit variance (MFV) are measures of the bias and excess variance in the fit residuals. Kernel and background solutions with an MFB close to 0, and an MFV close to unity, are measured to have a good fit quality.
(iii) Photometric accuracy. To assess the photometric accuracy of the solution, we perform PSF-fitting photometry at the position of the brightest star in the difference image. We generate a normalised PSF object parameterised by σ_r, centred at the position of the brightest star, and convolve this with the kernel solution. This is renormalised, giving us a normalised PSF object to fit to the difference image. The true target image PSF width is known from Equation 24, and we set the size of the renormalised PSF object 'stamp' to be 9 × 9 pixels. We then fit this PSF stamp to an equally sized cutout from the difference image (centred at the position of the brightest star) with a scaling factor F_diff and an additive constant, weighted with the known pixel variances in the target image. The fitted F_diff is then scaled to the photometric scale of the reference image, giving F_measured = F_diff / P, where P is the photometric scale factor.
The theoretical minimum variance of F_measured for a PSF-fitting procedure which scales a PSF model to a stamp depends on the true photometric scale factor (equal to 1 in our simulations) and on the true PSF of the brightest star in the target image, i.e. a normalised Gaussian with standard deviation σ_t. As there are no variable sources, over N_sim simulations of accurate kernel and background solutions we should expect a distribution of F_measured/σ_min values with mean 0 and a variance of unity. The mean photometric bias (MPB) and mean photometric variance (MPV) over N_sim simulations are the corresponding sample statistics of these normalised residuals. As noted in B16, although the MPV should be close to unity, it may have values less than this when the target image is overfitted, and/or when the model PSF fitted to the difference image (i.e. formed from the convolution of the reference image PSF with the inferred kernel) differs from the true PSF of the target image. Figure 2 shows an example 142 × 142 pixel reference and target image pair generated for these tests. The difference images and kernels corresponding to the B08 and PyTorchDIA solutions, annotated with the fit quality and photometric accuracy metrics, are shown below.
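The fit-quality metrics can be sketched directly. This NumPy sketch uses plausible forms of the definitions (the exact B16 normalisations may differ, and the names are illustrative): MSE against the noiseless truth, and the mean and mean-square of the noise-normalised residuals for MFB and MFV:

```python
import numpy as np

def fit_metrics(model, target_noiseless, target_noisy, sigma):
    """Sketch of the Section 3.2 performance metrics (assumed forms).
    model:             inferred model image
    target_noiseless:  true noiseless target (known only in simulation)
    target_noisy:      observed noisy target
    sigma:             per-pixel noise standard deviations"""
    mse = np.mean((model - target_noiseless) ** 2)
    z = (target_noisy - model) / sigma   # noise-normalised residuals
    mfb = np.mean(z)                     # mean fit bias, ideally 0
    mfv = np.mean(z ** 2)                # mean fit variance, ideally 1
    return mse, mfb, mfv
```

A perfect model should give MSE = 0, MFB ≈ 0 and MFV ≈ 1, since the residuals then reduce to pure unit-variance noise.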

Simulated Image Test results
In these tests, both B08 and PyTorchDIA attempt to fit the image model in Equation 1, with an unregularised (square) 19 × 19 pixel kernel. The generated (square) reference and target images are each 142 × 142 pixels, and so the number of target image pixels which enter the fit is N_data = (142 − 2 × 9)² = 15376. B16 used 141 × 141 pixel reference images and included 10201 target image pixels in their solution, and so the results presented here can be meaningfully compared against that prior work.
Following B16, we first divide the results of our 71569 simulation tests into 3 regimes by the signal-to-noise ratio (SNR) of the target image: (1) 8 < SNR < 40; (2) 40 < SNR < 200; and (3) 200 < SNR < 1000. Each of these 3 categories is then divided into 4 further categories by the sampling regime of the images: (i) σ_r > 1 and σ_k > 1; (ii) σ_r > 1 and σ_k < 1; (iii) σ_r < 1 and σ_k > 1; (iv) σ_r < 1 and σ_k < 1. The distributions of the fit metrics are in general skewed, and so we report the median values as a robust estimate of the central value in Figure 3 for both the B08 and PyTorchDIA implementations. Each sub-plot pair in Figure 3 pertains to a single metric, with the B08 results in blue in the left column, and the PyTorchDIA results in red in the right column. On the x-axes, we plot the SNR regime of the target image, categorised as above. There are 4 points in each SNR regime in each sub-plot, each of which corresponds to a different sampling regime, and we offset these from each other for clarity. We use circular markers to denote the sampling regime of the reference image, and crosses to indicate the sampling regime of the kernel. A big circle or cross corresponds to an over-sampled reference image or kernel respectively, and a small circle or cross corresponds to an under-sampled reference image or kernel. The green dashed lines for each sub-plot pair represent the 'best' value for each metric. We also tabulate these results in Table 2, and include the 16th and 84th percentiles about the median of the distributions. In the bottom section of this table, we note the number of simulations which fall into each SNR and sampling regime category.
Across all the metrics, one of the biggest differences between the B08 and PyTorchDIA solutions is seen in the values for the photometric scale factor. B08 is more accurate in general, but the differences between the median values in each SNR and sampling regime are typically small (∼ 0.001), and become negligible when the SNR is high. B16 showed that the accuracy of the photometric scale factor is strongly determined by the SNR of the target image, and in Table 2 we also see that its distribution about the median becomes tightly concentrated about unity as the SNR increases. Interestingly, for results in any given sampling regime, there is no clear similarity in trends between the two approaches with the target image SNR.
There are no large differences between B08 and PyTorchDIA in terms of MSE, with B08 better only at the level of the first decimal place. Looking at Table 2, B08 does however begin to slightly outperform PyTorchDIA in the highest SNR regimes when both the kernel and the reference frame are under-sampled.
As observed in B16, both approaches show a gradually decreasing MFB as the SNR increases, although B08 shows a negative bias while PyTorchDIA is typically positively biased. PyTorchDIA again only appears to do noticeably worse than B08 in the highest SNR category when both the reference image and the kernel are undersampled.
The MFV values returned by B08 and PyTorchDIA are also very similar. The B08 MFV values are usually lower than those for PyTorchDIA, although all median values are less than unity. These results too are consistent with B16, who also found the unregularised kernel was prone to overfitting when the SNR of the target image is low.
The differences between B08 and PyTorchDIA in terms of MPB are small, with both showing the same trends with SNR and sampling regimes. As found by B16, with increasing SNR, the MPB greatly improves. That this is typically best when the kernel is oversampled, and worst when both the kernel and reference image are undersampled, is likely in part due to sampling issues associated with the PSF model which is scaled to the flux in the difference image. Encouragingly, with reference to the values in Table 2, we can see that in general, PyTorchDIA performs even better than B08. That the MPB metrics are similar despite the differences in MFB is likely due to our choice to simultaneously fit for a constant scalar offset in our PSF fitting procedure, which would correct for any net over-/underestimation of the background parameter of the image model.
[Figure 2 caption: The subsequent difference images and fit metrics recovered by the B08 approach (left) and our PyTorch implementation (right). (Bottom row) The corresponding convolution kernels and fit parameters for the B08 (left) and PyTorch (right) solutions. The median pixel value is subtracted from the reference image before fitting the kernel and differential background term, and so B_0 will not be equal to zero. The B08 and PyTorchDIA solutions are very similar.]
The same trends in the MPV values with both SNR and sampling regimes can again be seen for the two approaches. With increasing SNR, the MPV increases. This behaviour was also seen by B16, and it can be explained as reduced overfitting of the bright central stars from which this metric is computed. For both approaches, apart from some cases when both the kernel and reference image are undersampled, the MPV is always less than unity, and similar results were obtained by B16 (see Figure 6 therein). We experimented by performing an additional 21271 tests in which the photometry of the faintest star in the image was used to compute the MPV. In this instance, no MPV values were less than 1.1.
While being overall very similar, the PyTorchDIA MPV values are slightly higher than their B08 counterparts. Because B08 overfits more strongly than PyTorchDIA (see the MFV values), we should expect the noise in the B08 difference images to be slightly more suppressed. Consequently, the photometry will typically have a lower variance. The difference in MPV is greatest when both the kernel and reference image are undersampled. This may be associated with the worse MFB values for PyTorchDIA in this category, as while the additive constant in our PSF fitting procedure can correct for inaccuracies in the differential background, it does so at the cost of slightly increased variance in the photometry.
In conclusion, PyTorchDIA performs very similarly to the B08 algorithm in these tests on synthetic images, and it is encouraging to see that the major conclusions drawn from these tests are consistent with those in B16. These images have Gaussian noise contributions and no outliers, and it is therefore not surprising that the B08 algorithm does slightly better overall, since the linear least-squares formulation (within an iterative scheme) correctly describes the observation model, and this approach operates at twice the floating-point precision of PyTorchDIA.

No 'algorithmic' bias
As PyTorchDIA is a numerical algorithm, it is essential to know to what precision it can recover the correct solution if random noise has been removed from the data. By default, PyTorchDIA operates at F32, giving us 6-8 decimal digits of precision.
Following Section 3.1 of B08, we perform the following tests (i)-(iv), in which known, noiseless transformations are applied to a reference image to generate a target image. The 142 × 142 pixel reference image used in what follows is generated randomly, as outlined in Section 3.1. We compute the fractional errors of the photometric scale factor, P, and of B_0, as ε_P = |(P − P_True)/P_True| and ε_B0 = |(B_0 − B_0,True)/B_0,True| (footnote 9).
(i) The simplest possible test we can perform is to difference an image against a copy of itself. The kernel should be exactly 1 at the centre and 0 everywhere else. Encouragingly, PyTorchDIA is able to return the correct answer to within the theoretical convergence tolerance, with fractional errors of ε_P = 4 × 10^-7 and ε_B0 = 1 × 10^-8. The central kernel pixel was correct to a precision of 7 decimal digits.
Footnote 9: As always, we background subtract the reference frame, so B_0 > 0.
[Figure 3 caption: The signal-to-noise regime of the target image is shown on the x-axis, increasing from left to right. Metrics in each SNR regime are offset from each other for clarity. We use circular markers to denote the sampling regime of the reference image, and crosses to indicate the sampling regime of the kernel. A big circle or cross corresponds to an over-sampled reference image or kernel respectively, and a small circle or cross corresponds to an under-sampled reference image or kernel; there are therefore 4 possible combinations of marker for each SNR regime. The green dashed lines for each sub-plot pair represent the correct, 'ideal' value for each metric.]
[Table 2 caption: Fit quality and photometric accuracy metrics for the 71569 simulated image tests for both an implementation of the B08 algorithm and PyTorchDIA. We divide the results of these tests by the SNR regime of the target image and the sampling regimes of the reference image and convolution kernel. The number of simulations used for the computation of the metrics in each of the 12 possible SNR and sampling regime categories is shown in the bottom section.]
(iii) The target image is created by shifting the reference image by one pixel in each of the positive x and y spatial directions, without resampling. The corresponding kernel should be the identity kernel (central pixel value of 1 and 0 elsewhere) shifted by one pixel in each of the negative x and y kernel coordinates. PyTorchDIA returns a kernel with a peak pixel value of unity, precise to F32 machine precision, and ε_P = 4 × 10^-7 and ε_B0 = 1 × 10^-8.
(iv) The reference image is shifted by half a pixel in each of the positive x and y spatial directions to create the target image, an operation that requires the resampling of the reference image. We use a bicubic spline resampling method (footnote 10). PyTorchDIA performs very well, successfully returning the complicated kernel, with fractional errors on the fit parameters of ε_P = 4 × 10^-7 and ε_B0 = 1 × 10^-8. As PyTorchDIA is able to recover the true fit parameters correct to the data type's decimal digit precision, we conclude that there is no bias associated with our numerical algorithm.
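The logic of test (i) can be reproduced in a few lines. With a delta-function kernel the image model reduces to target ≈ P × reference + B_0, so a simple linear least-squares fit of an image against a copy of itself (a stand-in for the full kernel optimisation; names illustrative) should recover P = 1 and B_0 = 0 to numerical precision:

```python
import numpy as np

def scale_and_offset(ref, target):
    """Fit target = P * ref + B0 by linear least squares.
    Differencing an image against itself should give P = 1, B0 = 0."""
    A = np.stack([ref.ravel(), np.ones(ref.size)], axis=1)
    p, b0 = np.linalg.lstsq(A, target.ravel(), rcond=None)[0]
    return p, b0
```

Any residual error in this noiseless setting reflects only the floating-point precision of the solver, which is the sense in which the paper reports 'no algorithmic bias'.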

REAL IMAGE TESTS
DIA is a particularly effective tool for measuring the fluxes and positions of variable sources in crowded fields (see the start of Section 1 for some examples). The MiNDSTEp consortium performs follow-up observations of microlensed targets towards the Galactic bulge with the high frame-rate EMCCD cameras on the Danish 1.54m telescope (DK154) at ESO La Silla (Skottfelt et al. 2015). Short 0.1 second exposures are shift-and-stacked to eliminate tip-tilt distortions, and recover scenes with ∼ 2-3 times better resolution than conventional long integrations. The combined effects of the mount (the DK154 rests on three points) and the shift-and-add procedure produce irregular, triangular PSFs, with a sharp peak and diffuse halo. With a scale of 0.09 arcseconds per pixel, this high-resolution EMCCD instrument explores a subset of the sampling regimes covered in the CCD simulations in Section 3.

Data and reductions
We use a sample of 251 images, each 512 × 512 pixels, with each image consisting of up to 3000 shift-and-stacked 0.1 s short exposures (footnote 11) (i.e. each stack is equivalent to at most a 5 minute exposure), obtained with the 'red' camera (approximately equivalent to a broad SDSS 'z' filter; Fukugita et al. 1996). To create a high signal-to-noise reference, 13 sharp images acquired on 2019-07-20 were registered to the same pixel grid using bicubic spline resampling. The stacked reference frame was constructed by summing the registered images and dividing by 13. To measure the PSF of this stacked image, a (symmetric) Gaussian PSF model was independently fit to 5 bright, isolated stars, from which we found a median width of σ = 2.89 pixels. The median PSF width of the 238 remaining target images was 3.83 pixels, with a maximum of 6.67 pixels.
In preparation for assessing the photometric accuracy of the algorithms, we measured the fluxes and positions of the stars in the reference image. A total of 236 candidate sources were detected in the image. We avoided all stars near the reference image edges and those with a peak pixel flux less than 3 × 10 4 ADU. This gave us a reduced sample size of the 37 brightest stars. Due to processing time constraints (see below), this sample was further reduced to 30, approximately uniformly spaced stars. The light curves of these 30 stars can be used to compute the MPB and MPV metrics (see Section 4.2).
For both the reference and each of the 238 target images, a 142 × 142 pixel region was cut out around the position of each of the 30 stars. This approach avoids resampling the target images to align them with the reference, and prevents the introduction of correlated noise into the target images through interpolation between pixels (footnote 12). Cursory image model fits to these target image stamps identified 7 × 7 kernels as giving good results. For the tests in Section 3, a 9 × 9 kernel corresponded to a 19 × 19 pixel array. For these tests on real images, even though the kernel is set to be just 7 × 7, due to the fine sampling of the LI images, the median size of the kernel in pixel-space for this data set is 27 × 27 pixels, with a maximum size of 47 × 47 pixels. For the B08 algorithm in particular - which scales with the square of the number of kernel pixels - processing times are much more expensive than in Section 3, and this is why we further reduced our sample of bright stars from 37 to 30 to keep this investigation tractable.
This should give us a total of 7140 reference-target image pairs (i.e. ∼ 10% the size of the simulated image data set), but in some instances, for stars close to the borders of the target image, large shifts between this image and the reference meant that part of the 142 × 142 pixel region fell outside the image edge. Additionally, the measured signal-to-noise ratio of the target image in some instances exceeded 1000, and so these were rejected to provide a more consistent comparison with the results from Section 3. After these cuts, we had a final sample of 6989 reference-target image pair stamps.

Model performance metrics for real data
For this test, we compare the PyTorchDIA code implementing our robust loss function with c = 1.345 (Equation 13), against the B08 solution making use of sigma-clipping. We run the B08 approach for a total of 3 iterations, and clip pixels whose normalised residuals exceed 5 in absolute value on the third iteration only.
Although we are both limited to a smaller sample size and do not know the true noise properties of the actual images or the true model parameters of the kernel solutions, we can still use almost all of the metrics outlined in Section 3.2 to assess the accuracy of the implementations. As the true model image is unknown, we are unable to use the MSE metric to measure the model error. The photometric scale factor, while also unknown, can however be compared on a relative scale, which requires an estimate of its true value.
It was found that the inferred photometric scale factor was correlated with the distance from the image centre along the x-axis. This is due to a change in the sky level along this axis associated with instrumental effects, which influences the inferred scale factor through its anti-correlation with B_0. Consequently, the true photometric scale factor for the entire image is not well represented by a single number (e.g. the median value from all 30 stamps). Given this, we divided the x-axis into 4 equally spaced regions, and computed 4 separate 'true' values using the subset of stamps in each of these regions. The scale factor measured from any stamp was then normalised by the 'true' value corresponding to the region it belonged to along the chip's x-axis. As B08 outperformed PyTorchDIA in terms of the photometric scale factor in Section 3, we use the values inferred by that approach to determine the 'true' values.
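The region-wise normalisation described above can be sketched as follows (illustrative names; the bin count and chip width follow the text's 4 equal x-axis regions on a 512-pixel chip): the 'true' scale factor in each bin is the median of the reference-implementation values falling in that bin, and every measured value is divided by its bin's median.

```python
import numpy as np

def normalise_by_region(x_positions, p_values, p_ref_values,
                        n_regions=4, width=512):
    """Normalise measured scale factors by a region-dependent 'true' value.
    x_positions:  stamp centre x-coordinates on the chip
    p_values:     scale factors to normalise
    p_ref_values: reference-implementation scale factors used to define
                  the per-region medians (every bin must be populated)"""
    edges = np.linspace(0, width, n_regions + 1)
    bins = np.clip(np.digitize(x_positions, edges) - 1, 0, n_regions - 1)
    p_true = np.array([np.median(p_ref_values[bins == k])
                       for k in range(n_regions)])
    return p_values / p_true[bins]
```

A stamp normalised against its own region's median lands at exactly 1, which is why the text cautions that the reference implementation's normalised values are expected to sit closer to unity.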
The remaining metrics require an accurate noise model to be meaningful. For these EMCCD images, we use Equation 7, and substitute the model image obtained on the third and final iteration of the B08 solution of each reference-target image pair to estimate the pixel uncertainties. These pixel uncertainties can then be used in place of the true (unknown) pixel uncertainties to calculate the MFB and MFV metrics. Similarly, by substituting the target image for its noiseless counterpart in Equation 27, we can use these pixel uncertainties to calculate the signal-to-noise ratio of the target image for each reference-target image pair. For the third iteration of the B08 approach, we also sigma-clip pixels whose normalised residuals exceed 5 in absolute value, to guard against variable sources or spurious pixels affecting the kernel solution. We use this associated bad pixel mask to omit these pixels from the calculation of the corresponding MFB and MFV metrics for both the B08 and PyTorchDIA solutions.
We perform PSF fitting photometry at the centre of the difference images obtained from each reference-target image pair (i.e. at the position of one of the 30 possible stars) to assess the photometric accuracy of the approach. We infer an empirical model PSF for the target image in the following way. For each reference-target image pair, we identify the peak pixel values of bright sources in the reference, and set all other pixel values to 0. We then use our PyTorchDIA implementation to infer the 'PSF' which convolves this scene of (approximate) delta-functions to the target image. That is, we are inferring an estimate of the PSF of the target image with the 'kernel' solution of the PyTorchDIA code. When inferring this empirical PSF, we weight all pixels equally, which gives the pixel values of the very brightest stars more weight relative to the other pixels in the image.
As in Section 3.2, we use a small square stamp for the PSF fitting. We set the radius of this stamp (i.e. its half-width) to 1.5 FWHMs of the target image (i.e. ∼ 1.5 × 2.355 × σ_t pixels), rounded up to the nearest odd integer pixel. The PSF object is normalised, and scaled to the flux of the difference image. To guard against the influence of (possibly variable) neighbouring sources or spurious pixels in the stamp not captured by the PSF model, we scale the model to a star's differential flux under our robust loss function (Equation 12), with fixed pixel uncertainties and c = 1.345.
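The stamp-sizing rule above is a one-liner worth pinning down. This sketch assumes the odd rounding applies to the half-width (the text is ambiguous on this point), and uses the Gaussian relation FWHM = 2.355σ:

```python
import math

def stamp_half_width(sigma_t):
    """Half-width of the PSF-fitting stamp: 1.5 FWHM of the target image
    (FWHM = 2.355 * sigma for a Gaussian), rounded up to the nearest odd
    integer number of pixels. Assumes the rounding applies per half-width."""
    r = math.ceil(1.5 * 2.355 * sigma_t)
    return r if r % 2 == 1 else r + 1
```

For the stacked reference's σ = 2.89 pixels this gives a half-width of 11 pixels, i.e. a 23 × 23 pixel stamp.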
As with the MFB and MFV metrics, we adopt Equation 7 as the noise model, substituting the model image values obtained on the final iteration of the B08 solution. In order to calculate σ²_min (Equation 31), we substitute the normalised empirical PSF inferred by the PyTorchDIA code for the true PSF, and substitute the region-dependent 'true' photometric scale factor inferred by the B08 implementation. An example reference-target image pair and associated fit quality metrics for both the B08 and PyTorchDIA implementations are shown in Figure 4.

Real Image Test results
We again start our analysis by splitting our results into three signal-to-noise ratio (SNR) regimes by the signal-to-noise of the target image: (1) 8 < SNR < 40; (2) 40 < SNR < 200; and (3) 200 < SNR < 1000. We found that no reference-target image pairs correspond to the very lowest SNR regime, and so we are forced to restrict our analysis to regimes (2) and (3). Also, unlike the simulated image tests, all 251 real images used here are well sampled. A majority (195) of the target images correspond to the case where the kernel is oversampled (i.e. σ_k > 1), and for the other images the kernel is undersampled (i.e. σ_k < 1). We therefore further divide each signal-to-noise regime into two sampling regimes, dependent on whether the kernel is approximately over- or undersampled. We plot the median metric values and the MPB and MPV for both B08 (blue) and PyTorchDIA (red), each split into the 4 possible subcategories, in Figure 5. As all images are oversampled, the circular markers are all large. The small and large crosses correspond to under- and oversampled kernels respectively. The theoretical best value for each metric is plotted as a dashed green line. The median metrics and the 16th and 84th percentiles of their distributions are tabulated in Table 3.
It was noted that some stars' differential light curves could have a noticeable non-zero offset from their reference frame flux level. In order to correct for this before grouping the normalised photometric residuals from across all light curves into each of the SNR and sampling regime categories, each star's light curve was shifted such that its median photometric residual value was 0. Additionally, there was a small subset of outlying photometric residuals that may badly affect the MPB and MPV metrics. To remove the influence of these outliers in each category, we first calculate the median absolute deviation of the F_measured/σ_min values, scaled to a standard deviation, MAD_std, as a robust estimate of the spread of the underlying distribution. This is defined as MAD_std = k × median(|x − median(x)|), where k = 1.4826 for approximately normally distributed data. Before calculating the MPB and MPV metrics shown in Figure 5 and Table 3, we have removed all photometric residuals with absolute values more than 4.5 MAD_std from median(F_measured/σ_min) for each SNR and sampling regime category. Note that the number of outlying points may differ for the B08 and PyTorchDIA solutions.
We tabulate the number of reference-target image pairs in each category in Table 3. Due to the sigma-clipping used before computing the MPB and MPV, we write the number of image pairs in each category used to compute these particular metrics as a fraction of the total number of image pairs. All image pairs in each category were used to compute all the other metrics, as these are robust to outliers.
We can again see that the PyTorchDIA and B08 approaches are broadly consistent, with both DIA implementations returning very similar metric values. Across all SNR and sampling regimes, only the median MFB values show consistently large differences. PyTorchDIA exhibited notably larger MFB values in the tests on simulated images, and here too, the B08 approach performs better with respect to fit bias. The photometric scale factor values for B08 appear to be superior, but recall that these are normalised to the median values obtained from each x-axis sub-region by the B08 algorithm, so we should expect the B08 values to be closer to unity. For both implementations, as seen in the tests in Section 3 and in similar experiments in B16, the accuracy with which the photometric scale factor can be determined clearly improves with increasing SNR.
[Figure 4 caption: Example reference-target image pair from the real image performance tests, in the same style as Figure 2. Note the unusual PSFs in the reference and target images, and the correspondingly irregular kernels. The target star with which the photometric accuracy was assessed is at the centre of the reference-target image pair.]
Similar trends with both MFV and MPB are seen from both approaches. There is a relative paucity of results corresponding to an undersampled kernel -particularly for SNR category 2 -but for the richly populated categories corresponding to oversampled kernels, both of these metrics improve with increasing SNR, as also observed in the simulated tests.
The MPV metrics for both approaches are noticeably larger here than in the simulations. However, we again see that the photometric variance goes up with the SNR of the target image, and that both approaches are overall very similar. Unlike the simulations, there is no guarantee that the 30 selected stars are the brightest in their respective stamps, and each stamp is well populated with stars in this crowded field (see the top panel of Figure 4). We should not then be overfitting the target stars, and so MPVs greater than unity are expected, since the fractional contribution of the target's flux to the entire stamp will, in general, be much less than unity (see e.g. Figure 10 of B16). Also, of the 43 images corresponding to an undersampled kernel, 23 have a PSF sharper than the stacked reference. In these instances, the model image will inevitably fail to match the PSFs - as it would in fact have to de-convolve, rather than convolve, the reference - which results in bad artefacts at the positions of sources in the difference image. Indeed, in each SNR regime, the MPV corresponding to the undersampled kernel is much worse than its oversampled counterpart. However, these effects are common to both approaches, and here we simply want to highlight the similarity of the MPV metrics obtained by the B08 and PyTorchDIA implementations. Indeed, the PyTorchDIA metrics can only be compared in a relative sense to the B08 results, as the B08 solutions for the noise model and the 'true' photometric scale factor are used to compute these metrics for both approaches. As we do not know the ground truth for these values, we cannot say whether B08 or PyTorchDIA is closest to the correct answer, only how similar they are to each other.
In addition to the differences in floating-point precision that would have influenced the small differences in results in Section 3, the two approaches now assume two different noise distributions for the target image pixels. For this real, outlier populated data, PyTorchDIA optimises a robust scalar objective function, while the B08 approach uses an iterative sigma-clipping procedure, so we should expect this to contribute to differences between the metric values also. And of course, the sample sizes in each category are smaller than their simulated data counterparts (particularly so for the instance of < 1 in SNR category 2) and therefore noisier, which could partly explain some of the small differences between the metric values for the two approaches.
In summary, similar trends in the metrics with SNR and sampling-regime category are shown by both the B08 and PyTorchDIA implementations, and these trends resemble those found in our experiments on simulated images. This validates both that our simulated images are reasonable approximations of real data and that our robust PyTorchDIA solution provides photometric performance competitive with the B08 algorithm when applied to real astronomical data sets.

SPEED TESTS
In the speed tests in this Section, we compare the performance of our GPU-accelerated numerical implementation against a fast, analytical least-squares solution using compiled Cython code.

[Misplaced figure caption: the signal-to-noise regime of the target image increases along the x-axis. Circular markers denote the sampling regime of the reference image (all large, as the reference is oversampled) and crosses that of the kernel (large: oversampled; small: undersampled), giving 2 possible marker combinations per SNR regime. No reference-target image pairs fell into the lowest SNR regime, which is left blank. The green dashed lines in each sub-plot pair mark the correct, 'ideal' value of each metric.]

Having established the accuracy of our algorithm on both synthetic and real images in Sections 3 and 4, we first provide a motivating test exploring the computational speed-up over the state-of-the-art classical approach on a set of real DK154 microlensing observations in Section 5.1. We then use a pair of large synthetic images to formally explore the scaling of our algorithm with image and kernel size in Sections 5.2 and 5.3.

Real EMCCD images
For these tests, we use a typical microlensing data set obtained with the DK154's 'red' camera during the 2019 MiNDSTEp observing season. The data set consists of 159 shift-and-added exposures, each the sum of as many as 1200 short 0.1 s exposures (i.e. ∼2-minute integrations).
The sharpest image was chosen as the reference, with a PSF width of 1.82 pixels; the median PSF width of the target images was 3.18 pixels. All images were bias-subtracted and flat-field corrected. The target images were then re-sampled using bicubic spline interpolation to align them with the reference. We use the flat field from the data reduction in our noise model (Equation 7). For the B08 implementation, we perform 3 model iterations and employ sigma-clipping, masking pixels whose normalised residuals exceed 5σ, to guard against variable sources (e.g. the gravitationally lensed target). For our PyTorchDIA implementation, we minimise the negative log-likelihood corresponding to our robust loss function (Equation 13).
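For reference, a per-pixel variance of the form commonly used in DIA noise models that incorporate the flat field (cf. Bramich 2008) can be sketched as follows. This is a sketch only: the gain and read-noise values are illustrative placeholders, not the DK154 camera's measured characteristics, and the exact form of our Equation 7 may differ.

```python
import numpy as np

def pixel_variance(model, flat, gain=2.0, read_noise=5.0):
    """Per-pixel variance (ADU^2) for a flat-fielded CCD image.

    A common DIA noise-model form: a read-noise term and a Poisson term,
    both corrected by the flat field. `gain` (e-/ADU) and `read_noise`
    (ADU) are placeholder values.
    """
    return (read_noise / flat) ** 2 + model / (gain * flat)

# Toy usage: a constant model image and a perfect (unity) flat field.
model = np.full((4, 4), 1000.0)  # model pixel values in ADU
flat = np.ones((4, 4))           # flat field (unity = no correction)
var = pixel_variance(model, flat)
```

Pixels with small flat-field values (e.g. vignetted corners) are correctly assigned larger variances, and hence down-weighted in the fit.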
We test how the speed of our implementation scales with both image and kernel size by symmetrically cropping the borders of the images to different extents, and solving for both a 19×19 and a 25×25 kernel. We measure the solution times for each of the 158 target images for our GPU-accelerated numerical algorithm, and plot the median solution time for a single image against the side length of the cropped square images (as a proxy for image size) in Figure 6, for two different choices of parameter initialisation. The 'error' bars on the median solution times are the 16th and 84th percentiles of the distribution of the 158 target-image solution times. For comparison, we plot the time taken for 3 model iterations of the B08 analytical least-squares approach. We do not plot uncertainties for the B08 solution times: the total time of this analytic approach is dominated by the normal-matrix construction, which depends only on the sizes of the kernel and images, and is therefore effectively constant across all 158 target images for each image and kernel size combination.
Unsurprisingly, the massive parallelisation inherent to the convolution computations in our implementation means it performs very well for even quite large image and kernel size combinations (see Section 5.2 for a formal discussion of how the algorithm scales). The median solution times for both the 19×19 and 25×25 kernels are at least an order of magnitude faster than for the B08 approach. When scaled to data sets consisting of hundreds or thousands of images, this clearly provides substantial savings in processing times.

13 https://colab.research.google.com

The two plots in Figure 6 differ only in the choice of initialisation of the kernel pixels for PyTorchDIA. The plot on the left shows the solution times for a kernel initialised as a 'flat' box-car which sums to 1. The right-hand plot shows the results for kernels initialised as symmetric Gaussians, normalised to sum to 1, with shape parameter estimated as σ = √(σ_T² − σ_R²). For these stacks of short EMCCD exposures, neither the reference nor the target PSF is particularly well approximated by a Gaussian, and the median solution times are at most only ∼1.1 times faster. However, the spread of the distribution of solution times is clearly tighter when a Gaussian is used for the initialisation compared to the flat box-car.
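The two kernel initialisations can be sketched as follows. This is a sketch of the idea rather than the actual PyTorchDIA initialisation code; the widths passed to the Gaussian initialiser are the DK154 reference and median target PSF widths quoted above.

```python
import numpy as np

def boxcar_kernel(k):
    """'Flat' k x k kernel, normalised to sum to 1."""
    return np.full((k, k), 1.0 / k ** 2)

def gaussian_kernel(k, sigma_t, sigma_r):
    """Symmetric Gaussian kernel of width sigma = sqrt(sigma_t^2 - sigma_r^2),
    normalised to sum to 1. Assumes sigma_t > sigma_r (target broader than
    reference)."""
    sigma = np.sqrt(sigma_t ** 2 - sigma_r ** 2)
    r = (k - 1) / 2.0
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    g = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()

k_flat = boxcar_kernel(19)
k_gauss = gaussian_kernel(19, sigma_t=3.18, sigma_r=1.82)
```

Either array would then be loaded into a PyTorch tensor (with `requires_grad=True`) as the starting point for the optimisation.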
To gain some insight into the cause of the differences between the spread of values for the two kernel initialisations (and therefore inform how to get the best performance out of PyTorchDIA), we plotted the solution times against the PSF width, σ_T, and signal-to-noise ratio (SNR) of the target image for the set of 482×482 images fit with a 19×19 kernel, as a pairplot in Figure 7. We quantified the correlation between the solution times and these two image properties with Spearman's rank correlation coefficient, ρ. Both kernel initialisations result in solution times that are anti-correlated with σ_T (ρ_box-car = −0.69 and ρ_Gaussian = −0.29) and correlated with the signal-to-noise ratio (ρ_box-car = 0.63 and ρ_Gaussian = 0.26). We expect, however, that these trends are driven only by σ_T and, specifically, by the difference between the widths of the PSFs in the target and reference images. That the solution time is correlated with the signal-to-noise of the target image follows from the fact that the signal-to-noise is strongly anti-correlated with σ_T.
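Spearman's ρ measures monotonic (rank) association rather than linear correlation, which is appropriate here since there is no reason to expect the solution time to vary linearly with either property. A minimal sketch of the computation, on hypothetical data with the qualitative trend described above (times falling with PSF width):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical measurements: solution times that decrease with PSF width.
psf_width = rng.uniform(2.0, 5.0, size=100)        # sigma_T, pixels
solve_time = 10.0 / psf_width + rng.normal(0.0, 0.1, size=100)  # seconds

# Rank correlation: rho close to -1 indicates a strong monotonic decrease.
rho, p_value = spearmanr(psf_width, solve_time)
```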
The very strong anti-correlation between the solution time for the box-car kernel and σ_T is very likely due to this 'flat' kernel surface being a much better initialisation for wide kernels. When the PSFs of the reference and target images are very similar, the kernel is sharply peaked at the centre, and a flat kernel is far from the solution. Initialising with a Gaussian, although imperfect for the irregular PSFs associated with these images, appears to mitigate this problem to some degree, although the solution time still blows up when the kernel is approximately undersampled.

Synthetic CCD images
Here, we test the performance of our implementation on synthetic images. We use our synthetic image generation procedure (see Section 3.1) to first generate a large 4000×4000 pixel noiseless reference image. We set the logarithm of the stellar density per 100×100 pixels to 1.5, the reference PSF width to σ_R = 1 pixel, and the sky level to 100 ADU. Fluxes are assigned randomly as described in Section 3.1. We set the kernel width to σ_K = 1.5 pixels, and generate a target image with σ_T = √(σ_R² + σ_K²) in the same manner as above. We then randomly add noise to the images in the way described in Section 3.1, such that the target image has 10 times more pixel variance than the reference. As there are no outlying pixels in this simulated pair of images, for the PyTorchDIA implementation we minimise the Gaussian negative log-likelihood (Equation 4).

The pair of 4000×4000 images are symmetrically cropped to assess how the solution times scale with image size. For each kernel and image size combination (and choice of kernel initialisation), we plot the solution times in Figure 8, for kernels initialised with either a flat box-car or a symmetric Gaussian with width equal to σ_K = 1.5. As with the speed results in the prior Section, for larger images, the PyTorchDIA 19×19 and 25×25 kernel solution times are at least an order of magnitude faster than their B08 counterparts. In this case, since the correct convolution kernel really is a Gaussian, initialising the kernel as a Gaussian substantially reduces the number of optimisation steps required before convergence, with these solution times typically being ∼4-5 times faster than with a box-car initialisation. Formally, the B08 normal-matrix construction time scales as O(N²K⁴), where N and K are the side lengths of the (square) images and kernel respectively. The two black curves in both Figures 6 and 8 follow this scaling (i.e. ∼(25/19)⁴). Our optimisation procedure computes a direct convolution, which is predicted to scale as O(N²K²).
The data points corresponding to the PyTorchDIA solution times in Figure 8 approximately obey this rule (in both the separation of the two curves and the separation between data points on each curve), although we should expect some variation due to differences in the number of optimisation steps required before convergence and, potentially, cuDNN's choice of algorithm for the convolution computations, which can depend on both kernel and image size (see Section 5.3). In general, we found the number of optimisation steps required for convergence to decrease slightly with increasing image size. This makes sense, in that the effective information content of the image grows with its size, and so the gradient steps are more accurate. This explains why, for the smallest image size in Figure 8, the time to solve for some of the kernels is longer than expected. Our solution time is, then, not completely determined by the formal O(N²K²) scaling.
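The quoted complexities imply simple, checkable cost ratios: moving from a 19×19 to a 25×25 kernel at fixed image size should cost a factor of (25/19)⁴ ≈ 3.0 for the B08 normal-matrix construction, but only (25/19)² ≈ 1.7 for a direct convolution. A quick sketch of these per-iteration operation counts (the image side length cancels in the ratios):

```python
# Formal scalings: N = image side length, K = kernel side length.
def b08_cost(n, k):
    return n ** 2 * k ** 4      # normal-matrix construction, O(N^2 K^4)

def conv_cost(n, k):
    return n ** 2 * k ** 2      # direct convolution, O(N^2 K^2)

# Cost ratio of a 25x25 kernel relative to a 19x19 kernel, at fixed N.
ratio_b08 = b08_cost(1000, 25) / b08_cost(1000, 19)    # ~3.0
ratio_conv = conv_cost(1000, 25) / conv_cost(1000, 19) # ~1.7
```

Note that for PyTorchDIA this is a per-iteration cost only; the total solution time also depends on the number of optimisation steps, which is why the measured points only approximately follow the scaling.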

cuDNN: Accelerating convolution computations on NVIDIA GPUs
One major use-case of PyTorch is deep learning, in which CNNs, consisting of many small kernels, are used for feature detection in image recognition tasks (for example, Lawrence et al. 1997; Krizhevsky et al. 2012). The current research interest in deep learning applications has motivated device manufacturers such as NVIDIA to develop highly tuned GPU-accelerated libraries, specifically designed to improve the performance of common processing operations (e.g. forward and backward convolution) with these small kernels. By default, PyTorch makes use of NVIDIA's cuDNN library to accelerate convolutions on NVIDIA GPUs. This library has access to both deterministic and non-deterministic algorithms to compute convolutions, which are selected heuristically to accelerate computations (see footnote 15). As the sizes of useful kernels in DIA are themselves fairly small, it is expected that PyTorchDIA will also benefit from these tools. Indeed, we have enabled the use of non-deterministic cuDNN algorithms to accelerate the convolution operations in all the results presented so far in this paper.

In this subsection, we explore the benefit of this library in the context of DIA. For this test, we measure how, for an image pair of fixed size, the time to infer the kernel scales with kernel size. We use the pair of synthetic 4000×4000 pixel images from the prior section, symmetrically cropped to smaller 1000×1000 images. We infer the associated convolution kernel (initialised with a box-car) by minimising the Gaussian negative log-likelihood for this pair of synthetic images, which contain no outlying pixels. The time to infer the kernel on the GPU for different cuDNN settings is shown in Figure 9.
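In PyTorch, this cuDNN behaviour is controlled through module-level flags on `torch.backends.cudnn`. A minimal sketch of the relevant settings (the `benchmark` flag enables cuDNN's autotuning, which times the candidate convolution algorithms on the first call for a given shape and caches the fastest; it is shown here for completeness and was not used for the timings in this paper):

```python
import torch

# Allow cuDNN to select non-deterministic convolution algorithms
# (PyTorch's default behaviour, used for the results in this paper).
torch.backends.cudnn.enabled = True
torch.backends.cudnn.deterministic = False

# Autotune: benchmark available convolution algorithms on the first call
# for each image/kernel shape, then reuse the cached fastest choice.
torch.backends.cudnn.benchmark = True
```

For exactly reproducible convolutions one would instead set `deterministic = True` and `benchmark = False`, at some cost in speed.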
For all the small kernels tested here, allowing cuDNN to use non-deterministic algorithms for the convolution computations results in slightly faster inference times. There is a clear change in behaviour for kernels larger than 23×23 pixels: while the non-deterministic computations again seem to win overall, the scaling with kernel size is now far less predictable. This makes sense, as cuDNN is optimised for working with small kernels. Clearly, however, for the larger kernels, small changes in kernel size can lead to sporadic changes in performance. This is most likely due to changes in the convolution algorithm chosen by the software for any given image-kernel size combination, although a full exploration of the different CUDA convolution algorithms, how they are chosen, and how they scale, is outside the scope of this work.

CONCLUSIONS
We have presented a new algorithm for difference image analysis, in which we model the target image as the output of a simple convolutional neural network. The kernel is treated as a discrete pixel convolutional filter, with weights that can be fitted efficiently within the PyTorch architecture. Specifically, we make use of automatic differentiation and GPU support to perform a lightning-fast optimisation of the image model. We validate the fit quality and photometric accuracy of our implementation against its closest classical DIA analogue, on both simulated and real images. Our algorithm is simple to understand, and written entirely in standard Python packages, with an emphasis on accessibility to the wider astronomical community.
15 While not explored here, cuDNN also contains tools to benchmark convolution computations for a given kernel and image size combination. When processing many images, it runs tests to assess which cuDNN algorithm performs best on the first example, caches this information, and uses this best-performing algorithm on all further images in the data set. Turning this feature towards astronomical data set reduction with PyTorchDIA will be explored in future work.

The benefits of our approach over some traditional ones can be summarised as follows:

• Speed: We exploit the massive parallelism inherent to the convolution operation by making use of highly-tuned, GPU-accelerated deep learning libraries to perform efficient convolution computations. With a good choice of learning rate, the optimisation procedure converges rapidly, and our algorithm can perform DIA on astronomical data sets at least an order of magnitude faster than its classical analogue.
• Scalability: For a pair of (square) images, each of side length N, convolved with a (square) kernel of side length K, our implementation scales approximately as O(N²K²), while the classical approach scales as O(N²K⁴).
• Flexibility: In our optimisation framework, we can maximise the (correct) Gaussian likelihood suitable for conventional CCD/EMCCD astronomical exposures (where the Gaussian approximation of Poissonian photon noise is valid), without resorting to an iterative procedure of χ² minimisation. This framework also allows us to relax the Gaussian noise assumption and optimise robust scalar objective functions for images affected by outlying pixels, providing a justifiable alternative to procedural sigma-clipping approaches. Further, we make use of automatic differentiation tools that free the user from having to manually recompute gradients if the parameterisation of the model is changed, making experimentation straightforward.
The current main disadvantage of this approach is the use of 32-bit numerical precision to ensure the GPU-accelerated convolution computations are performed rapidly. Also, some engineering (e.g. the choice of learning rates) is required by the researcher to get the best performance. And while the O(N²K²) scaling typically holds for a given image-pair and kernel combination, our approach is a non-linear (convex) optimisation, and so its solution times are not entirely deterministic. We also note that access to mid- to high-end GPUs can be an issue, as their use in both the gaming market and cryptocurrency mining has caused a surge in prices. Given this, Tensor Processing Units (TPUs), which now support floating-point computations, could be a viable alternative for some use-cases.
We highlight automatic mixed precision training, available since PyTorch 1.6.0, as an area to explore in future work to further accelerate the convolution computations. Given the impressive recent advances in general purpose GPU computing, in large part driven by the deep learning community, we are well positioned to take advantage of improvements to these tools. Lastly, we again stress that the application to DIA is just one possible example of astronomical image processing that can benefit from deep learning tools, as very many useful image models include a convolution operation.
All code used in this work can be found at https://github.com/jah1994/PyTorchDIA.