Learned Interferometric Imaging for the SPIDER Instrument

The Segmented Planar Imaging Detector for Electro-Optical Reconnaissance (SPIDER) is an optical interferometric imaging device that aims to offer an alternative to today's large space telescope designs, with reduced size, weight and power consumption. State-of-the-art methods for reconstructing images from interferometric measurements adopt proximal optimization techniques, which are computationally expensive and require handcrafted priors. In this work we present two data-driven approaches for reconstructing images from measurements made by the SPIDER instrument. These approaches use deep learning to learn prior information from training data, increasing reconstruction quality and reducing the computation time required to recover images by orders of magnitude. Reconstruction time is reduced to ${\sim} 10$ milliseconds, opening up the possibility of real-time imaging with SPIDER for the first time. Furthermore, we show that these methods can also be applied in domains where training data are scarce, such as astronomical imaging, by leveraging transfer learning from domains where plenty of training data are available.


INTRODUCTION
The Segmented Planar Imaging Detector for Electro-Optical Reconnaissance (SPIDER; Kendrick et al. 2013; Duncan et al. 2015) instrument is an alternative electro-optical imaging device to current space telescopes like the Hubble Space Telescope or the James Webb Space Telescope. While traditional electro-optical telescopes require large optics, housings, and thermal controls for the optics to attain precise measurements, SPIDER aims to decrease the volume, mass, and cost of electro-optical imagers by replacing the traditional mirrors with arrays of lenslets that gather interferometric measurements. The light captured by these lenslets is processed into interferometric measurements using photonic integrated circuit (PIC) chips. Typical interferometers still require large components to make measurements, yet by processing the light from the lenslets on an integrated chip the size of the instrument can be reduced significantly. The SPIDER instrument is designed to be a cheaper and lighter alternative for instruments used for both Earth observation and astronomical research.
The concept design for the SPIDER instrument proposed by Kendrick et al. (2013) and Duncan et al. (2015) uses 37 PICs mounted at different angles to form a planar disk. Interferometric measurements are acquired on each of the PICs, and by combining the measurements taken at the different angles, the telescope creates a large synthetic aperture whose diameter is equal to the largest spacing of lenslets on the PIC. A diagram of their design can be found in Figure 1.
Beyond the initial concept, more efficient designs that make better use of the baselines (Liu et al. 2018, 2019) and improved lenslet arrangements that increase reconstruction quality (Lv et al. 2020; Hu et al. 2021) have been investigated. Furthermore, the PIC reconstruction capabilities were found to match the theoretical predictions of the concept design well (Chu et al. 2017). Tests of the PICs show that it is possible to measure both the amplitude and phase of the incoming light as predicted (Su et al. 2017) and that these measurements can be used for image reconstruction (Badham et al. 2017; Su et al. 2018).
Since SPIDER measures both the phase and the amplitude of the interferometric signals, the measurement process is analogous to that of radio interferometers. Interferometric imaging techniques developed for radio interferometry can thus be adapted and repurposed to recover images from the raw data acquired by the SPIDER instrument (Pratley & McEwen 2021). Radio interferometry and aperture synthesis have played an important role in pushing the boundaries of astronomical research in the radio frequency regime, where acquiring high resolution images is otherwise difficult because of the relatively large wavelengths considered. By combining measurements from pairs of radio telescopes, spatial frequency information can be retrieved from the observed sky. Each pair of telescopes forms a baseline and measures a so-called visibility corresponding to a Fourier coefficient of the sky brightness distribution. By sampling the Fourier plane with these measurements, an observation with a uniform aperture is approximated using aperture synthesis. Since the telescope acquires only a finite number of measurements and since the Fourier plane is not sampled uniformly, the problem is ill-posed and an accurate image cannot be reconstructed simply by inverting the Fourier transform. Instead, when inverting the measurement process directly as described, a so-called dirty image is obtained: a reconstruction of the sky brightness convolved with the point spread function (PSF) of the telescope configuration. In order to recover an accurate image of the sky brightness, techniques to regularize the inverse problem are typically considered.
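To make the dirty-image picture concrete, the following toy sketch (illustrative only: the random sampling mask, image size, and source positions are our own choices and do not represent the SPIDER configuration) simulates incomplete Fourier sampling and shows that direct inversion yields the sky convolved with the PSF of the sampling pattern:

```python
import numpy as np

# Two point sources on a small grid (illustrative, not a SPIDER simulation).
rng = np.random.default_rng(0)
x = np.zeros((64, 64))
x[20, 20] = 1.0
x[40, 45] = 0.5

# Random binary mask standing in for the instrument's incomplete uv-coverage.
mask = rng.random((64, 64)) < 0.2

vis = np.fft.fft2(x) * mask                  # sampled Fourier coefficients
dirty = np.fft.ifft2(vis).real               # direct inversion -> dirty image
psf = np.fft.ifft2(mask.astype(float)).real  # PSF of the sampling pattern
```

The dirty image equals the true sky circularly convolved with `psf`, so the sources appear surrounded by sidelobe structure rather than as clean points.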
One method for solving this inverse problem is variational regularization. Variational regularization techniques combine a data fidelity term relative to a forward model with a regularizer in order to constrain and stabilize the problem. Regularizers for these algorithms encode prior information about the images. An example of modern model-based iterative methods applied to radio interferometry can be found in Pratley et al. (2018), which adopts convex optimization techniques to solve the variational regularization optimization problem. An application of these techniques to SPIDER imaging can be found in Pratley & McEwen (2021). While these techniques can be distributed to reduce the reconstruction time (Pratley et al. 2019), they are computationally expensive as they evaluate the measurement operator that models the instrument at each iteration of the optimization process. Beyond their computational cost, these algorithms also adopt handcrafted priors (e.g. $\ell_1$ sparsity in a wavelet representation) that, while general, are not tailored to the images of interest.
An alternative to model-based algorithms, such as variational regularization, are data-driven algorithms that learn the prior information from training data. Since the prior is implicitly specified by the training data, data-driven methods are typically able to achieve higher reconstruction quality than handcrafted priors, provided the training distribution matches the target distribution well. Learned methods that are completely independent of the measurement operator and attempt to learn a direct mapping from measurements to target image, however, are generally not effective. Thus, learned methods typically combine some model-based information, like the measurement operator, with prior information learned from training data. Learned data-driven approaches broadly fall into three categories, differentiated by the degree to which model-based information such as the measurement operator is encoded or leveraged. Learned regularization methods (e.g. Lunz et al. 2018; Li et al. 2020) are iterative in nature and make full use of the measurement operator in each iteration; consequently, they achieve excellent reconstruction quality by exploiting full knowledge of the measurement operator, along with learned prior information, but are generally highly computationally demanding. Learned sequential methods (e.g. Jin et al. 2017), on the other hand, simply pre- and/or post-process data with learned models in observation and/or image space; consequently, while they provide good quality reconstructions, they are limited by the fact that they evaluate the measurement operator only a few times (sometimes just once). This, however, makes these methods substantially more efficient computationally. Learned iterative methods (e.g. Adler & Öktem 2017), also called unrolled methods, provide a balance by designing a learned model that unrolls a small number of iterations of an iterative approach, encoding the measurement operator into the model; consequently, they typically achieve superior reconstruction quality to learned sequential methods by exploiting greater knowledge of the measurement operator, while remaining more computationally efficient than learned regularization methods.
Many such methods were originally proposed in the medical imaging context (CT, MRI, PAT) and are based on deep learning approaches (e.g. Adler & Öktem 2017, 2018; Arridge et al. 2019). In the field of radio astronomy, learned image reconstruction techniques were first considered by McEwen & Allam Jr (Allam Jr 2016), where super resolution convolutional neural networks (SRCNN; Dong et al. 2016) were applied to post-process dirty radio interferometric images. These networks were considered for cases where the exact telescope PSF is known, as well as for the PSF-unaware case, highlighting the potential of learning a generalized network that works with unseen telescope configurations, although performance was relatively poor for this rudimentary approach. More recent applications of learned post-processing methods in radio astronomy have been considered using learned denoisers (Terris et al. 2019), convolutional autoencoders (Gheller & Vazza 2021), and super resolution networks (Connor et al. 2022). The use of plug-and-play (PnP; Venkatakrishnan et al. 2013) denoisers within iterative reconstruction methods has also been considered for radio interferometric imaging (Terris et al. 2022), although such approaches still require many evaluations of the measurement operator during imaging.
While we draw inspiration from these prior works on learned imaging for interferometry, in this paper we develop new learned imaging techniques specifically targeting the SPIDER instrument. Our primary goal is reducing computational cost, while still ensuring high quality reconstructions. Reducing the computational cost of recovering images from raw SPIDER measurements would open up the possibility of real-time imaging with SPIDER, which would afford numerous new applications. Consequently, we develop a learned sequential method that requires only a single evaluation of the measurement operator. While reconstruction quality for this method is similar to current state-of-the-art variational regularization techniques, computational time is orders of magnitude faster. We also develop a learned iterative method, trading off a small increase in computational time compared to the learned sequential method, but that nevertheless remains orders of magnitude faster than traditional approaches, while achieving a further improvement in reconstruction quality.
The remainder of the paper is structured as follows. Section 2 introduces the measurement process of interferometric imagers and the particular imaging configuration of the SPIDER instrument, as well as discussing the inverse imaging problem. Variational and learned approaches for the inverse imaging problem are reviewed in Section 3. In Section 4 we present two approaches to model the SPIDER instrument: alongside the standard non-uniform Fourier transform approach considered previously (Pratley & McEwen 2021), we also introduce a new modelling approach for SPIDER based on the Radon transform. In Section 5 we propose our learned methods for reconstruction from interferometric measurements. Section 6 evaluates the performance of the reconstruction methods when applied to natural images, their robustness to additional noise, and the generalization potential of these methods to smaller datasets of galaxy or satellite images. Finally, concluding remarks are made in Section 7.

SPIDER INSTRUMENT
The SPIDER instrument measures incoming light through pairs of separated lenslets and combines it in waveguides on a PIC chip to form interferometric baselines. All the operations required to make the interferometric measurements are performed on the chip, resulting in a small form factor for the system. The concept design introduced in Kendrick et al. (2013) uses a linear array of lenslets attached to one PIC chip to measure several baselines using a 1D interferometer. Several of these so-called spokes are then mounted radially, resulting in a two-dimensional sampling pattern. While in typical interferometers the information of all receivers can be combined into interferometric measurements, resulting in a total of $N(N-1)/2$ measurements for $N$ lenslets, in the SPIDER design lenslets can only be combined within one PIC chip, resulting in $N/2$ baselines per PIC module. To increase the number of baselines gathered from one PIC, different wavelengths of light are measured. Because the spatial frequency measured depends on the separation of the lenslets and the wavelength of the light, the number of baselines is multiplied by the number of wavelength bins observed. The technique of incorporating spectral information from measurements at different wavelengths in a single reconstruction is called multi-frequency synthesis and is common in radio interferometry (Sault & Conway 1999). By mounting multiple PIC chips as radial spokes on a disc, a 2D interferometer is created, producing a radial sampling profile in the $uv$-plane. In the concept of Kendrick et al. (2013), 37 spokes of PICs with 24 lenslets each are used. The measured light spectrum spans 500 nm to 900 nm with 10 spectral bins. This results in 120 baselines per spoke and a total of 4440 measured Fourier components. The parameters for the configuration used by Kendrick et al. (2013) can be found in Table 1, and the lenslet layout and the $uv$-sampling are shown in Figure 2.
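As a quick consistency check of the configuration numbers quoted above, the counting argument can be written out directly (variable names are ours):

```python
# Baseline counting for the Kendrick et al. (2013) concept configuration,
# using the parameters quoted in the text.
n_spokes = 37          # PIC spokes mounted radially
n_lenslets = 24        # lenslets per spoke
n_spectral_bins = 10   # wavelength bins between 500 nm and 900 nm

# Within one PIC, lenslets are paired along the spoke, giving N/2 spatial
# baselines, rather than the N(N-1)/2 of a general interferometer in which
# any pair of receivers can be combined.
baselines_per_spoke = (n_lenslets // 2) * n_spectral_bins  # 12 * 10 = 120
total_measurements = baselines_per_spoke * n_spokes        # 120 * 37 = 4440
```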
The interferometric measurements at the baseline (spatial) frequency $\mathbf{u} = (u, v)$ represent samples of the 2D Fourier transform of the image $x(\mathbf{l})$, with spatial coordinate $\mathbf{l}$, as given by the van Cittert-Zernike theorem (Zernike 1938):
$$ y(\mathbf{u}) = \hat{x}(\mathbf{u}) = \int x(\mathbf{l}) \, \mathrm{e}^{-2\pi \mathrm{i} \, \mathbf{u} \cdot \mathbf{l}} \, \mathrm{d}^2\mathbf{l} , \quad (1) $$
which is a continuous unitary Fourier transform of the signal. The baselines are determined by the spatial distance between the lenslets on the PIC with respect to the observed wavelength of the light.
To recover an image from the Fourier measurements one needs the inverse of Equation 1,
$$ x(\mathbf{l}) = \int \hat{x}(\mathbf{u}) \, \mathrm{e}^{2\pi \mathrm{i} \, \mathbf{u} \cdot \mathbf{l}} \, \mathrm{d}^2\mathbf{u} , \quad (2) $$
which is only possible in theory when one has complete knowledge of the continuous Fourier representation $\hat{x}(\mathbf{u})$. In practice this is not the case.
The approaches for modelling the SPIDER instrument introduced in Section 4 require an (upsampled) discrete Fourier transform (DFT). The unitary DFT for an $N_1 \times N_2$ image is an operator mapping $\mathbb{R}^{N_1 \times N_2} \to \mathbb{C}^{N_1 \times N_2}$; we henceforth use the same $\hat{x}$ notation for the continuous and discrete Fourier transforms as the meaning will be clear from the context. The unitary DFT is defined by
$$ \hat{x}_{kl} = \frac{1}{\sqrt{N_1 N_2}} \sum_{i=0}^{N_1-1} \sum_{j=0}^{N_2-1} x_{ij} \, \mathrm{e}^{-2\pi \mathrm{i} (ki/N_1 + lj/N_2)} \quad (3) $$
and the unitary inverse DFT by
$$ x_{ij} = \frac{1}{\sqrt{N_1 N_2}} \sum_{k=0}^{N_1-1} \sum_{l=0}^{N_2-1} \hat{x}_{kl} \, \mathrm{e}^{2\pi \mathrm{i} (ki/N_1 + lj/N_2)} . \quad (4) $$
The interferometric measurement process (Equation 1) can be written in the compact, discretized form
$$ y = \Phi x + n , \quad (5) $$
where the linear measurement operator $\Phi: X \to Y$ maps the unknown image $x \in X \subset \mathbb{R}^N$ to the set of noisy measurements $y \in Y \subset \mathbb{C}^M$ (here Fourier components of the weighted sky brightness), with $n \in Y \subset \mathbb{C}^M$ some type of measurement noise.
The interferometric measurement operator corresponds to a non-uniformly sampled Fourier transform. The limited number of lenslets of the telescope results in a limited sampling of the $uv$-plane. Since the Fourier domain is sampled incompletely, the inverse problem is ill-posed and the measurement operator cannot simply be inverted to find a solution. Because of this ill-posedness, prior information on the solution is needed to regularize the inversion. Therefore, approaches such as sparse regularization are needed to stably recover a solution for the reconstruction. Underlying sparse regularization is the idea that natural signals (e.g. astronomical images) are sparse or compressible in a suitable basis or frame (e.g. wavelet bases).

INVERSE IMAGING APPROACHES
We briefly recall the state-of-the-art variational techniques applied to solve the interferometric imaging problem, and discuss how these methods can be enhanced or replaced by learned approaches that make use of deep learning.

Variational regularization
To recover a solution to the inverse problem stated in Equation 5, we find a solution for $x$ that trades off matching the data against the prior information encoded in the regularizer. Variational regularization approaches do this by posing an appropriate optimization problem, which is then often solved by proximal optimization algorithms.

Optimization problems
Trading off data fidelity and prior information can be achieved by obtaining a solution to the following minimization problem:
$$ x^\star = \operatorname*{argmin}_{x} \, \mathcal{L}(\Phi x, y) + \lambda \mathcal{S}(x) , \quad (6) $$
where $\lambda$ is the regularization parameter, and $\mathcal{L}(\Phi x, y)$ and $\mathcal{S}(x)$ are a data fidelity and regularization functional respectively.
A common optimization strategy is to use a regularizer that promotes sparsity in a particular frame. The signal $x \in \mathbb{R}^N$ can be represented in a dictionary $\Psi \in \mathbb{R}^{N \times L}$ by $x = \Psi\alpha$, where $\alpha \in \mathbb{R}^L$. Natural signals typically exhibit stronger sparsity if the dictionary is redundant, i.e. $L > N$ (Gribonval & Nielsen 2003; Bobin et al. 2007; Starck et al. 2010). It is therefore often beneficial to use dictionaries that are a concatenation of different bases or frames, e.g. Dirac and Daubechies wavelets (Carrillo et al. 2012, 2013a).
The minimization problem can be expressed in the synthesis formulation, where the image $x^\star$ is synthesized from dictionary elements by $x^\star = \Psi \alpha^\star$. Alternatively, the problem is expressed in the so-called analysis setting, where the image $x^\star$ is recovered directly while promoting the sparsity of $\Psi^* x^\star$. If $\Psi$ is a tight frame, the adjoint $\Psi^*$ is equal to the pseudo-inverse $\Psi^\dagger$ and the two formulations are equivalent. However, for a general overcomplete dictionary (such as a concatenation of two bases), the analysis formulation is often found to yield superior reconstruction quality (e.g. Carrillo et al. 2012, 2013a).
The data fidelity term is typically the $\ell_2$-norm of the residuals (which can also be interpreted as the negative log-likelihood for the case of Gaussian noise). The optimization functional in the unconstrained, analysis setting then reads
$$ x^\star = \operatorname*{argmin}_{x} \, \tfrac{1}{2} \| \Phi x - y \|_2^2 + \lambda \| \Psi^* x \|_1 , \quad (7) $$
assuming identity covariance (typically measurements are weighted to have unit variance, as discussed in Section 4.1).
In the unconstrained setting the optimization depends on a good choice of the hyperparameter $\lambda$ to balance the data fidelity and the sparsity of the signal. The problem can also be formulated as a constrained optimization problem, where we seek the sparsest $x^\dagger$ that satisfies a constraint on the data misfit:
$$ x^\dagger = \operatorname*{argmin}_{x} \, \| \Psi^* x \|_1 \ \text{ subject to } \ \| \Phi x - y \|_2 \leq \epsilon , \quad (8) $$
with some $\epsilon > 0$. The hyperparameter $\epsilon$ can be estimated from an estimate of the noise level (Pratley et al. 2018).

Proximal optimization algorithms
The optimization problems described in the previous subsection can often be solved by proximal optimization algorithms that leverage a proximity (or proximal) operator. The proximity operator of a (proper, lower semi-continuous) convex function $h$ (with $\tau > 0$) maps $z \in \mathbb{R}^N$ to the unique solution of the (strongly convex) minimization problem
$$ \operatorname{prox}_{\tau h}(z) = \operatorname*{argmin}_{x} \, \tfrac{1}{2} \| x - z \|_2^2 + \tau h(x) . \quad (9) $$
The parameter $\tau$ sets the balance between the squared $\ell_2$-distance to $z$ and the value of $h$. Many common proximal operators admit an analytical solution or at least a linear-time iterative solution. The fixed points of a proximal operator are the global minimizers of $h$ (Boyd & Vandenberghe 2004; Combettes & Pesquet 2011).
Proximal splitting methods use proximal operators to estimate the solution to the inverse problem by splitting the objective function into separate steps for the different optimization functionals. A review of different proximal splitting methods can be found in Combettes & Pesquet (2011). The simplest proximal splitting algorithm is the proximal gradient method, which consists of a gradient update step on the data fidelity, followed by a proximal update step:
$$ x_{k+1} = \operatorname{prox}_{\tau \lambda \mathcal{S}} \bigl( x_k - \tau \Phi^* (\Phi x_k - y) \bigr) . \quad (10) $$
In this manner sparsity-promoting priors $\mathcal{S}(x)$ that are not differentiable can be supported (e.g. $\ell_1$ sparsity in a wavelet basis).
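As a minimal sketch of the proximal gradient method described above, the following implements ISTA for an $\ell_1$ prior in the pixel basis, a simplification of the wavelet-dictionary setting used in the paper; the operator callables and parameter choices are our own illustrative stand-ins:

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1 : elementwise soft-thresholding.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(y, phi, phi_adj, lam, tau, n_iter=100):
    # Proximal gradient method (ISTA) for
    #   min_x 0.5 * ||phi(x) - y||_2^2 + lam * ||x||_1 .
    # phi / phi_adj are callables implementing the measurement operator
    # and its adjoint; the real part implements Re(Phi^*(.)) for real images.
    x = phi_adj(y).real
    for _ in range(n_iter):
        grad = phi_adj(phi(x) - y).real          # gradient of data fidelity
        x = soft_threshold(x - tau * grad, tau * lam)  # proximal update
    return x
```

For a unitary (masked) Fourier operator the Lipschitz constant of the gradient is at most one, so a step size `tau = 1.0` is safe.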
In this paper we compare our new learned methods with a primal-dual hybrid gradient (PDHG; Chambolle & Pock 2011) method for finding a solution to the constrained analysis problem (Equation 8), as described in Onose et al. (2016). For the dictionary representation of our signal we choose the SARA representation, i.e. a combination of the Dirac basis and the first eight Daubechies wavelets, Db1-Db8 (Daubechies 1992), which was shown to work well for astronomical reconstruction (Carrillo et al. 2012, 2013a).

Learned methods
Traditional approaches handle the ill-posedness of inverse problems by incorporating prior information through handcrafted regularizers that, while general, fail to capture detailed prior information of real data. Learned methods attempt to overcome this by instead enforcing the prior information implicitly specified by training data, ensuring that the prior promotes images which are in some sense similar to the training data.
Since learned methods rely on learning the prior information for the imaging problem from the data they are provided, reconstructive power depends on the quality and quantity of the training data. Furthermore, learned methods have to be trained before evaluation can take place, a process that may take some time, yet only has to be performed once. Once training is done, imaging can be performed rapidly. If training data in the form of input-output pairs are available, the network can be trained in a supervised approach, with the network receiving the input measurements and their respective targets. When input-output pairs are not available, the networks can be learned adversarially, where the network is trained using a distribution of measurements to fit an independent distribution of target outputs, which is sometimes interpreted as semi-supervised learning. If no training outputs are available, the methods need to be learned in a self-supervised way.
Learned methods that are completely independent of the measurement operator and learn a direct mapping from measurements to target are generally not effective, since they are difficult to train and require huge volumes of training data (Adler & Öktem 2017). Therefore, most learned approaches incorporate the measurement operator in some capacity to encode or leverage the mapping from the measurement space to the reconstruction space, and the learned network is then constructed around it. As briefly overviewed in Section 1, learned data-driven imaging approaches broadly fall into three categories, differentiated by the degree to which model-based information such as the measurement operator is encoded or leveraged. By making greater use of the measurement operator, superior performance can often be achieved, but at the cost of increased computational time. We subsequently review each of these three classes of approach.

Learned regularization
One method for learning the prior information from training data is to use a learned regularizer in conjunction with traditional optimization schemes. These algorithms aim to replace the regularizer $\mathcal{S}$ in Equation 6 with a learned regularizer, encoding the prior information in the training data.
A common approach is to replace the application of the proximal operators with a learned map to compute the update. These methods typically adopt plug-and-play (PnP) denoisers that are used instead of a proximal update step (Venkatakrishnan et al. 2013; Ryu et al. 2019). A variant of a PnP denoiser was applied to radio interferometry by Terris et al. (2022).
Various other pioneering approaches to learned regularization have also been considered. The regularizer can be formed as the norm of a learned dictionary that translates to a sparse representation of the data (Xu et al. 2012), or by using a constraint based on a learned scattering transform (Dokmanić et al. 2016) in place of a traditional regularizer. Alternatively, methods have been considered that implement deep neural networks to act as regularizers (e.g. Li et al. 2020; Kobler et al. 2020), as well as methods that train neural networks adversarially (e.g. Lunz et al. 2018; Mukherjee et al. 2020).
While learned regularization approaches often achieve excellent reconstruction quality, since they make full use of the measurement operator along with learned prior information, they are computationally demanding since the full measurement operator must be evaluated at each iteration.

Learned sequential methods
An alternative approach is to consider a model that is split into a sequence of operations. Sequential models are a composition of a learned operator acting in the data space, $D_\theta: Y \to Y$, an adjoint or pseudo-inverse mapping from the data space to the reconstruction space, $\Phi^\dagger: Y \to X$, and a learned operator acting in the image space, $\Lambda_\theta: X \to X$, where the learned operators depend on the parameters $\theta$ (e.g. Zheng et al. 2020):
$$ x^\star = \Lambda_\theta \bigl( \Phi^\dagger ( D_\theta (y) ) \bigr) . \quad (11) $$
When either the operator in the reconstruction domain, $\Lambda_\theta$, or in the data domain, $D_\theta$, is set to be the identity operator, we recover learned pre-processing or learned post-processing methods respectively. Most sequential methods are of the post-processing type, since they are relatively easy to train as learning of the network is decoupled from the mapping from measurement to reconstruction space; the post-processing network can thus be trained independently, reducing training time substantially (e.g. Jin et al. 2017; Chen et al. 2017; Yi & Babyn 2018). Post-processing methods have been used in astronomy using super resolution networks (Allam Jr 2016; Connor et al. 2022), denoisers (Terris et al. 2019), and convolutional autoencoders (Gheller & Vazza 2021).
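The sequential composition described above can be sketched schematically as follows; the "networks" here are placeholder callables standing in for trained CNNs, and all names are our own:

```python
def sequential_reconstruct(y, d_theta, phi_pinv, lambda_theta):
    # Sequential model: learned data-space operator, then a pseudo-inverse
    # mapping to image space, then a learned image-space operator.
    return lambda_theta(phi_pinv(d_theta(y)))

# Setting the data-space network to the identity recovers the common
# post-processing variant, where only the image-space network is learned.
identity = lambda v: v
```

With `d_theta = identity`, the pseudo-inverse is applied exactly once, which is what makes post-processing methods so computationally cheap at reconstruction time.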
Post-processing methods only apply the adjoint or pseudo-inverse of the measurement operator once and so are highly computationally efficient. However, since the measurement operator is not continually leveraged, reconstruction quality suffers as a consequence. To compensate, sequential models typically employ more elaborate network architectures (such as U-Nets; Ronneberger et al. 2015) than those adopted in learned iterative methods. Nevertheless, learned post-processing approaches can sometimes struggle to achieve the reconstruction quality of learned iterative approaches.

Learned iterative methods
Learned iterative methods unroll a small, fixed number of iterations of an iterative solver, e.g. the proximal gradient method, and replace the proximal operators with learned convolutional neural networks (CNNs) (Gregor & LeCun 2010). The CNNs used are typically small feed-forward networks with just a few convolutional layers (e.g. Yang et al. 2017; Putzky & Welling 2017; Adler & Öktem 2017), in contrast to the architectures used in sequential models. To turn the proximal gradient method of Equation 10 into a learned method, we replace the proximal operator with a learned network $\Gamma_{\theta,k}$:
$$ x_{k+1} = \Gamma_{\theta,k} \bigl( x_k - \tau \Phi^* (\Phi x_k - y) \bigr) , \quad (12) $$
where $\Gamma_{\theta,k}$ can be a different network for each iteration $k$ or one shared network, $\Gamma_{\theta,k} = \Gamma_\theta$. The latter scenario results in learning an approximation to the proximal operator. Hauptmann et al. (2020) propose evaluating the operator at different, increasingly finer resolutions/scales such that the full resolution operator only needs to be evaluated once. Combining this with a multiscale neural network like the U-Net (Ronneberger et al. 2015) cuts down evaluation time (and thus training time) significantly, while also reducing memory requirements. Other advancements are obtained by learning adversarially trained versions of existing models (Mukherjee et al. 2021b). Furthermore, combining learned regularizers with learned iterative methods provides algorithms that are closer to traditional optimization methods and allows one to prove rigorous convergence results (Mukherjee et al. 2021a). Finally, the inclusion of the explicit measurement operator in learned iterative models improves robustness and generalizability (Boink et al. 2020), while at the same time reducing the number of trainable parameters and hence the requirement for large training data volumes.
Since learned iterative methods only unroll a small number of iterations, the measurement operator is only evaluated sparingly, making the learned iterative approaches typically faster than the iterative solvers on which they are based.While the integration of the measurement operator in the network does require its evaluation during training, resulting in considerably longer training times than for the sequential models, encoding the measurement operator in the model typically results in superior performance to learned sequential methods, for only a modest increase in computational time.

MODELLING THE SPIDER INSTRUMENT
In order to reconstruct an image from interferometric measurements we need a model of the measurement process in the form of a measurement operator. The accuracy of the measurement operator is a critical factor for the accuracy of the reconstruction, and its computational complexity is reflected in both training and reconstruction time.
For interferometric imaging, the measurement process is given by Equation 1. In a discretized version of Equation 1, following the notation of Equation 5, the measurement operator $\Phi: \mathbb{R}^N \to \mathbb{C}^M$ is a non-uniform discrete Fourier transform (NUDFT) mapping an image in $\mathbb{R}^N$ to $M$ non-uniformly distributed Fourier measurements.
For an image $x_{ij} = x(\mathbf{l}_{ij})$, with spatial pixel coordinates $\mathbf{l}_{ij} = (l_i, m_j)$ at the $i$-th and $j$-th pixels of the image, and Fourier measurements $y_k = \hat{x}(\mathbf{u}_k)$ at non-uniformly distributed Fourier coordinates $\mathbf{u}_k = (u_k, v_k)$ (in contrast to Equations 3 and 4), the unitary NUDFT can be expressed as
$$ y_k = \frac{1}{\sqrt{N}} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} x_{ij} \, \mathrm{e}^{-2\pi \mathrm{i} (u_k l_i + v_k m_j)} , \quad (13) $$
which is a finite sum over all pixels $(i, j)$ of an $N_1 \times N_2$ image with a total number of $N = N_1 \times N_2$ pixels.
To construct the dirty image $x^{\mathrm{D}}(\mathbf{l})$, the adjoint of the measurement operator is applied, which maps from the non-uniformly distributed Fourier measurements to a uniformly sampled image:
$$ x^{\mathrm{D}}_{ij} = \frac{1}{\sqrt{N}} \sum_{k=1}^{M} y_k \, \mathrm{e}^{2\pi \mathrm{i} (u_k l_i + v_k m_j)} . \quad (14) $$
We can approximate the inverse Fourier transform (Equation 2) by evaluating a weighted, finite Fourier series at the non-uniformly distributed Fourier coordinates. The accuracy of the resulting pseudo-inverse DFT is determined by the choice of sampling $\mathbf{u}_k$ and the corresponding measurement weights $w(\mathbf{u}_k)$, and is given by
$$ \tilde{x}_{ij} = \frac{1}{\sqrt{N}} \sum_{k=1}^{M} w(\mathbf{u}_k) \, y_k \, \mathrm{e}^{2\pi \mathrm{i} (u_k l_i + v_k m_j)} . \quad (15) $$
The complexity of a naive realization of Equations 13, 14 and 15 would be $\mathcal{O}(MN)$; however, efficient numerical NUDFT schemes make use of interpolation to map the non-uniformly distributed measurements to a uniformly sampled grid and take advantage of the fast Fourier transform (FFT) to reduce the complexity.
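The naive $\mathcal{O}(MN)$ sums above can be written down directly, which is useful as a reference implementation for testing faster schemes; the coordinate conventions and function names below are our own illustrative choices:

```python
import numpy as np

def nudft(x, l, m, uv):
    # Naive unitary NUDFT: x is an (N1, N2) image, l and m the 1D pixel
    # coordinate axes, uv an (M, 2) array of Fourier coordinates.
    N = x.size
    L, Mg = np.meshgrid(l, m, indexing="ij")
    phase = np.exp(-2j * np.pi * (np.outer(uv[:, 0], L.ravel())
                                  + np.outer(uv[:, 1], Mg.ravel())))
    return phase @ x.ravel() / np.sqrt(N)

def nudft_adjoint(y, l, m, uv, shape):
    # Adjoint map (the dirty image, up to weighting): conjugate-transposed
    # phase matrix applied to the M measurements.
    N = shape[0] * shape[1]
    L, Mg = np.meshgrid(l, m, indexing="ij")
    phase = np.exp(2j * np.pi * (np.outer(L.ravel(), uv[:, 0])
                                 + np.outer(Mg.ravel(), uv[:, 1])))
    return (phase @ y).reshape(shape) / np.sqrt(N)
```

A useful sanity check is the adjoint identity $\langle \Phi x, y \rangle = \langle x, \Phi^* y \rangle$, which these two functions satisfy by construction.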
In this section we introduce two efficient methods for modelling the measurement operator of the SPIDER instrument. First, we consider the non-uniform fast Fourier transform (NUFFT; Duijndam & Schonewille 1997), which can be used for arbitrary sampling distributions. Second, we present a new approach that exploits the specific sampling distribution of the SPIDER instrument, making use of the similarity of the problem to the parallel-beam Radon transform of X-ray tomography.

NUFFT
Using FFTs with a (de)gridding interpolation operator to approximate the non-uniformly distributed Fourier measurements lowers the computational cost significantly. The standard operator used in radio interferometry can be expressed as a composition of operators (e.g. Pratley et al. 2018):
$$ \Phi = G F Z S , \quad (16) $$
where $S: \mathbb{R}^N \to \mathbb{R}^N$ is an operator that corrects for the gridding operation (discussed further below), $Z: \mathbb{R}^N \to \mathbb{R}^{\alpha^2 N}$ is a zero-padding operator that zero pads the image in the spatial dimension to provide oversampling by a factor $\alpha$ in each direction in the Fourier domain, $F: \mathbb{C}^{\alpha^2 N} \to \mathbb{C}^{\alpha^2 N}$ is a 2-dimensional unitary FFT operator, and $G: \mathbb{C}^{\alpha^2 N} \to \mathbb{C}^M$ is the degridding operator that interpolates the measurements off of the uniform Fourier grid via convolution with a spreading kernel centred at the irregularly spaced measurement points in the Fourier domain. The operator $S$ corrects for the convolution with this spreading kernel by a pointwise division of the image with the Fourier transform of the spreading kernel. Equation 16 describes the measurement process, going from the observable to the measurement space. The adjoint of this operation, $\Phi^*: \mathbb{C}^M \to \mathbb{R}^N$, is readily obtained as
$$ \Phi^* = S Z^* F^\dagger G^* , \quad (17) $$
where we made use of the self-adjointness of the convolution correction operator ($S^* = S$). The gridding operation $G^*: \mathbb{C}^M \to \mathbb{C}^{\alpha^2 N}$ convolves the non-uniformly distributed measurements with the spreading kernel to grid them on a uniformly sampled mesh, on which the inverse FFT, $F^\dagger: \mathbb{C}^{\alpha^2 N} \to \mathbb{C}^{\alpha^2 N}$, is applied to obtain an image, which is restricted to $\mathbb{R}^N$ via symmetric cropping through the operator $Z^*: \mathbb{R}^{\alpha^2 N} \to \mathbb{R}^N$, and deconvolved via pointwise division with the inverse Fourier transform of the spreading kernel, using the operator $S$, to correct for the gridding operation.
The efficiency of NUFFT schemes hinges upon a suitable choice of the spreading kernel, in particular its localization in the spatial and frequency domains and the ease of computing the Fourier transform of the kernel for deconvolution. The kernel needs to have a small support in frequency space to be computationally inexpensive, while also having a small support in the image domain so as to minimize the effects of aliasing. Here we adopt the Kaiser-Bessel (KB) kernel (Jackson et al. 1991; Fessler & Sutton 2003). The kernel is truncated in the Fourier domain to limit its support and therefore its computational cost. Since these kernels are linearly separable, i.e. the kernel can be written ĉ(u, v) = ĉ(u) ĉ(v), we consider the kernel in only one dimension:

ĉ(u) = (1/J) I₀(β √(1 − (2u/J)²)), for |u| ≤ J/2,

where I₀ is the zeroth-order modified Bessel function of the first kind, β determines the shape of the kernel, and J the support of the kernel. The correction for applying this convolutional kernel can be calculated analytically by taking the inverse Fourier transform of the kernel to get (Jackson et al. 1991; Fessler & Sutton 2003)

c(x) = sin(√((π J x)² − β²)) / √((π J x)² − β²).

The correction operator S can then be found by calculating the pointwise division of the image by c(x) c(y). Using β = 2.34 J for the spread of this Kaiser-Bessel kernel gives similar performance to the optimal min-max interpolation kernel proposed by Fessler & Sutton (2003).
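The truncated KB kernel and its image-domain correction can be sketched as follows (up to normalization constants, which are absorbed into the correction); `kb_kernel` and `kb_correction` are our illustrative names, not LeIA functions.

```python
import numpy as np

def kb_kernel(u, J=6, beta=2.34 * 6):
    """Kaiser-Bessel spreading kernel, truncated to |u| <= J/2 (Fourier-grid units)."""
    out = np.zeros_like(u, dtype=float)
    inside = np.abs(u) <= J / 2
    # I0 is the zeroth-order modified Bessel function of the first kind
    out[inside] = np.i0(beta * np.sqrt(1 - (2 * u[inside] / J) ** 2)) / J
    return out

def kb_correction(x, J=6, beta=2.34 * 6):
    """Image-domain transform of the KB kernel (up to normalization);
    dividing the image pointwise by this corrects for the gridding convolution."""
    g = np.sqrt((np.pi * J * x) ** 2 - beta ** 2 + 0j)  # complex sqrt handles sin/sinh branches
    return (np.sin(g) / g).real
```

The complex square root transparently handles the region (πJx)² < β², where the expression becomes sinh-like and the correction stays real and positive near the image centre.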
The pseudo-inverse of the NUFFT is obtained by including the measurement weights in the Fourier sum

Φ† = S Z* F† G* W,

where W is a diagonal matrix with elements W_ii = w(u_i) given by the weight for each of the measurements.
In radio astronomy, interferometric measurements are typically weighted according to one of three weighting schemes: natural, uniform, or robust weighting (Taylor et al. 1999). Natural weighting maximizes the sensitivity of the observation and weights each visibility by the uncertainty on the measurement, w_Natural,i = σ_i^(-2), with σ_i the uncertainty on the measured visibility y_i. Natural weighting is akin to statistical whitening of the data: it ensures unit variance of the uncertainty on the measurements and gives the highest signal-to-noise ratio for detecting weak sources.
Since there is typically a higher density of samples near the (u, v) origin in radio interferometry, natural weighting emphasizes low-frequency measurements, which can be undesirable when imaging sources with both large- and small-scale structure. Uniform weighting is an alternative weighting scheme that accounts for this by also weighting by the density of the sampling distribution, w_Uniform,i = σ_i^(-2) ρ_i^(-1)(Δu), where ρ_i(Δu) is typically the number of measurements in a symmetric region (either square or circular) of width Δu around the measurement u_i. If the measurement uncertainties vary, this count can be replaced by the sum of the reliability weights in this region, Σ_{i'} σ_{i'}^(-2), where i' ranges over the indices of the measurements in the local region around measurement i. The width Δu of the region is typically chosen to be the size of the elements of the Fourier grid.
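The uniform weighting rule can be sketched by binning measurements into Δu-wide cells, a common approximation to counting measurements in a region centred on each point; the function name is ours.

```python
import numpy as np

def uniform_weights(uv, sigma, delta):
    """Uniform weights w_i = sigma_i^-2 / rho_i, where rho_i is the number of
    measurements falling in the (square, width-delta) cell containing point i."""
    cells = np.floor(uv / delta).astype(int)
    # count measurements per occupied cell, then look the count up per measurement
    _, inverse, counts = np.unique(cells, axis=0, return_inverse=True, return_counts=True)
    rho = counts[inverse.reshape(-1)]
    return sigma ** -2 / rho
```

With unit uncertainties, two measurements sharing a cell each receive weight 1/2, while an isolated measurement keeps weight 1, de-emphasizing densely sampled low-frequency regions as described above.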
Lastly, robust weighting considers a trade-off between natural and uniform weighting, controlled by a robustness parameter. In this paper we use uniform weighting, weighting the measurements according to the density of the sampling distribution as well as their uncertainties.

Sub-scale operators
An approach to reduce the computational cost of the measurement operator is to evaluate it at a lower resolution on a restricted Fourier space. The forward and adjoint of the measurement operator can be evaluated at a series of scales by applying a filter bank of low-pass and high-pass partition of unity filters; see Pan & Betcke (2022) for the details of such filter banks. Instead of using smooth partition of unity filters as in Candès et al. (2006) and Pan & Betcke (2022), we use binary 0-1 partition of unity filters. Since the NUFFT evaluates the FFT on an upsampled grid by zero-padding the image, the periodicity in the image domain and the corresponding decay of the Fourier coefficients are reinstated. We refer to these sub-scale measurement operators as Φ_j : R^{N_j} → C^{M_j}, with adjoint Φ_j* : C^{M_j} → R^{N_j} and pseudo-inverse Φ_j† : C^{M_j} → R^{N_j}, operating on the reduced image scale N_j and the number of measurements restricted to the reduced Fourier space, M_j. Since these sub-scale operators are evaluated on a smaller image scale as well as a restricted Fourier space, they are considerably more computationally efficient than the full-scale measurement operators.
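A minimal sketch of this sub-scale restriction, assuming block averaging for the image downsampling and the binary 0-1 low-pass region described above; this is illustrative, not the LeIA implementation.

```python
import numpy as np

def restrict_scale(x, uv, y, j, u_max, v_max):
    """Restrict image and measurements to scale j: the image is downsampled
    by 2^j per axis (block averaging), and only measurements inside the binary
    low-pass region |u| <= u_max/2^j, |v| <= v_max/2^j are kept."""
    f = 2 ** j
    n = x.shape[0] // f
    # block-average the image down to the reduced scale N_j = N / 4^j
    x_j = x[:n * f, :n * f].reshape(n, f, n, f).mean(axis=(1, 3))
    # binary 0-1 low-pass filter on the measurement coordinates
    keep = (np.abs(uv[:, 0]) <= u_max / f) & (np.abs(uv[:, 1]) <= v_max / f)
    return x_j, uv[keep], y[keep]
```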

NU-Radon method
For the NUFFT the measurement weights are determined by considering the 2D sampling density of the instrument. In this section we propose an alternative, principled approach to derive these weights based on the similarities of the SPIDER sampling to the sampling induced by the 2D Radon transform.
The Radon transform models the linear attenuation of photons passing through an object. It maps the attenuation to a family of integrals along parallel lines, parametrized in polar coordinates by θ, the angle of the projection, and τ, the radial distance from the origin along that line:

R[x](θ, τ) = ∫ x(r) δ(τ − r · n_θ) dr, with n_θ = (cos θ, sin θ).

The Fourier slice theorem expresses the equivalence of the 2D Fourier transform in polar coordinates and the 1D Fourier transform along the detector coordinate of the Radon transform,

x̃(k n_θ) = F_1D{R[x](θ, ·)}(k),

where k is the distance in the Fourier domain along the radial spoke at angle θ.
The adjoint operation of the Radon transform is the back-projection, which can be used to recover the dirty reconstruction through

x_dirty(r) = ∫₀^π R[x](θ, r · n_θ) dθ.

The inverse of the Radon transform is the filtered back-projection, which includes a ramp filter |k|, corresponding to the Jacobian of the change from Cartesian to polar coordinates and boosting the power of high-frequency measurements:

x(r) = ∫₀^π F_1D^(-1){ |k| F_1D{R[x](θ, ·)} }(r · n_θ) dθ.

Using the Fourier slice theorem we can express the NUDFT (Equation 13) in terms of the Radon transform, in polar coordinates: the measurements are gathered by taking a 1D NUDFT of the Radon transform of the signal along a spoke at angle θ_s, where √N corresponds to the diameter of the circle inscribed in the square image. Similarly, we discretize the back-projection (Equation 24) to obtain the dirty reconstruction, where we Fourier transform the non-uniformly distributed Fourier measurements at radial distances k_i for each of the spokes, followed by back-projecting the signals for each of the S projection angles θ_s.
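The Fourier slice theorem can be verified numerically in the simplest discrete case, θ = 0, where the Radon projection reduces to a sum over image rows and its 1D FFT equals the corresponding slice of the 2D FFT:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))

# Radon projection at theta = 0: integrate (discretely, sum) along rows
projection = x.sum(axis=0)

# Fourier slice theorem: the 1D FFT of the projection equals the slice of the
# 2D FFT along the corresponding spoke (here, the k1 = 0 row)
assert np.allclose(np.fft.fft(projection), np.fft.fft2(x)[0, :])
```

For general angles the same identity holds in the continuum; discretely it requires interpolation of the rotated projection, which is exactly the role of the 1D (de)gridding in the NU-Radon method.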
Both the forward and adjoint operations of the measurement operator use a 1D NUDFT, which can be accelerated using (de)gridding in the form of a 1D NUFFT. The combined method of using the Radon transform and the NUFFT can be described as

Φ = F_1D R,

where R : R^N → R^{S√N} is the Radon transform and F_1D : R^{S√N} → C^M is the 1D NUFFT applied along each spoke. Similarly, the adjoint operation is given by

Φ* = R* F_1D*,

where R* : R^{S√N} → R^N is the back-projection operation, and F_1D* : C^M → R^{S√N} is the 1D adjoint NUFFT.
To obtain the pseudo-inverse of the measurement operator, part of the measurement weights are now informed by the ramp filter in the inverse of the Radon transform (Equation 25). The ramp filter for our non-uniformly distributed measurements is calculated as w_Radon(k) = ‖k‖_ℓ2. The pseudo-inverse of the measurement operator is then defined as

Φ† = R† F_1D†,

where R† : R^{S√N} → R^N is the filtered back-projection (with weights w_Radon(k)), and F_1D† : C^M → R^{S√N} is the 1D inverse NUFFT with measurement weights based on the sampling density of the lenslets along one spoke.

Comparison
We implement both the NUFFT and NU-Radon methods in our LeIA code. The NUFFT approach is implemented on the CPU using NumPy as a backend, as well as with GPU acceleration implemented in TensorFlow. The NU-Radon method is implemented using a modified version of the scikit-image Radon transform.
An overview of the computational complexity of the algorithms can be found in Table 2, where the complexity is split into three operations: calculating the Fourier transform, (de)gridding the non-uniformly distributed measurements, and computing the Radon transform (when relevant). The NUDFT is computationally the most expensive approach, since the evaluation of the discrete Fourier transform takes O(MN) operations. Both the NUFFT and NU-Radon reduce the computational complexity by utilizing the FFT for calculating the Fourier transform, though these gains are offset by the computational cost of gridding the measurements and/or calculating the (inverse) Radon transform.
For the particular SPIDER configuration considered (N = 256 × 256, M = 4440, and S = 37), the NUFFT is the best approach in terms of computational complexity. It might be beneficial to use the NU-Radon method when the number of measurements is large (and therefore the computation time is dominated by the (de)gridding operations) or when the number of image pixels is large and the number of spokes is modest (S ≤ log N).
Figure 3 shows the effect of reconstructing a dirty image using the adjoint of the measurement operator applied to a set of simulated Fourier measurements taken with the SPIDER instrument. For the configuration of the SPIDER instrument, it appears that the NUFFT is the more accurate approximation to the NUDFT.
Figure 3 indicates that the dirty reconstructions are mostly dominated by low-frequency structures. The measurement weights used in the pseudo-inverse operations boost the power of high-frequency measurements by weighting the measurements based on the 2D sampling density for the NUFFT, and on the 1D sampling density along each spoke for the NU-Radon approach (which also includes the weights of the filtered back-projection).
Figure 4 compares the different pseudo-inverse implementations. The figure indicates that the pseudo-inverse NUFFT provides a good trade-off between introducing high-frequency information and avoiding strong high-frequency artefacts. The pseudo-inverse NU-Radon implementation provides sharper features, yet also includes more high-frequency artefacts. In the remainder of the paper we use the NUFFT approach for our experiments, due to the lower computational cost of its implementation and its generalizability to any sampling distribution for future applications.

LEARNED INTERFEROMETRIC IMAGING FOR SPIDER
In this section we present our two learned approaches for the SPIDER imaging problem. Both approaches reconstruct images from interferometric measurements and use the adjoint or pseudo-inverse of the measurement operator to create an initial reconstruction. Our approaches fit within the existing learned post-processing framework (e.g. Jin et al. 2017), with the second method introducing a novel post-processing architecture.
The first method is a learned sequential post-processing method that improves the initial reconstruction estimated by the pseudo-inverse and removes artefacts. This approach is highly computationally efficient, requiring only one application of the pseudo-inverse of the measurement operator to compute the initial reconstruction. We expect the final reconstruction quality to be good but limited, since no further knowledge of the measurement operator is exploited beyond the initial reconstruction.
The second approach is a learned iterative method and differs from the former by including updates to the reconstruction that utilize knowledge of the measurement operator. Since additional applications of the measurement operator are required, this results in a modest increase in computational time compared to our post-processing method. However, we expect this to be offset by an increase in reconstruction quality due to further exploitation of the measurement operator.
We do not consider learned regularization approaches further since the iterative nature of such approaches would result in a significant increase in computational time, and our primary objective is a very low computational cost during imaging in order to facilitate real-time imaging with SPIDER.

Learned post-processing (U-Net)
Learned post-processing methods are a special case of the sequential methods discussed in Section 3.2.2, where the learned operator applied in the data domain is replaced with an identity operator. This makes the networks easier to train, as the training is performed in the image domain without the need to invoke the measurement operator, speeding up the training process. Our method only post-processes an initial reconstruction and is in this sense similar to a denoiser. The solution can be written as

x̂ = Λ_θ(Φ† y),

with Λ_θ the learned network and Φ† the pseudo-inverse of the measurement operator. Figure 5 shows a schematic representation of this approach. The architecture of our network is based on the U-Net (Ronneberger et al. 2015) network architecture, which was originally designed for biomedical image segmentation. U-Nets can also be used for denoising problems by replacing the two segmentation-specific output layers with a single output layer returning the denoised image. In this form, U-Net is a convolutional autoencoder with skip connections at each of the scales. Our network includes blocks of 2D convolutions with batch normalization and ReLU activation functions at each scale, followed by MaxPool layers to sub-sample on a downward branch. In the decoder branch, the upsampling is achieved by transposed convolution layers. A schematic representation of the architecture can be found in Figure 7.

Table 2. The computational complexity of the different methods for modelling the measurement operator of the SPIDER instrument in terms of the number of pixels N, the number of measurements M, the number of projection angles S, and the support of the gridding kernels in pixels J.
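A minimal Keras sketch of such a U-Net denoiser follows; the depths, filter counts, and names are ours for illustration and do not reproduce the paper's exact architecture (which is given in Figure 7).

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    """Two 2D convolutions, each with batch normalization and ReLU."""
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return x

def build_unet(shape=(256, 256, 1), filters=(32, 64, 128)):
    """U-Net denoiser: conv blocks + max-pooling down, transposed convolutions
    up, skip connections at each scale, single-channel output image."""
    inp = layers.Input(shape)
    x, skips = inp, []
    for f in filters[:-1]:                        # encoder branch
        x = conv_block(x, f)
        skips.append(x)
        x = layers.MaxPool2D()(x)
    x = conv_block(x, filters[-1])                # bottleneck
    for f, s in zip(reversed(filters[:-1]), reversed(skips)):  # decoder branch
        x = layers.Conv2DTranspose(f, 2, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, s])          # skip connection
        x = conv_block(x, f)
    out = layers.Conv2D(1, 1)(x)                  # single denoised output image
    return tf.keras.Model(inp, out)
```

In use, the learned post-processing is then simply `x_hat = net(pseudo_inverse(y))`, with the pseudo-inverse applied once per image.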
Our learned post-processing is computationally efficient, as it only evaluates the measurement operator once per image for both evaluation and training. However, this also limits the extent to which the network can utilize the measurement model.

Learned iterative method (GU-Net)
An alternative approach is to apply learned iterative methods that evaluate the measurement operator several times, improving the reconstruction quality by including measurement information at several stages of the reconstruction process. While some methods in this category closely mimic optimization methods, the method we propose takes inspiration from multiscale methods. Our approach, the Gradient U-Net (GU-Net), expands on the U-Net architecture by calculating the gradient of the data fidelity after every downsampling and after every upsampling operation, as well as after the input layer, to incorporate measurement information at every scale of the network. Figure 6 shows a diagram summarising the learned iterative reconstruction process. Our implementation of the network is described below, and a schematic representation of the network can be found in Figure 7.
In order to add this measurement information at different scales of the network we use sub-scale operators, Φ_j : R^{N_j} → C^{M_j} (see Section 4.2), where the size of the image space N_j decreases by powers of four (the images are halved in each direction; N_j = N/4^j), and the size of the measurement space M_j is determined by applying a low-pass filter that restricts the measurements to {(u, v) : |u| ≤ u_max/2^j ∧ |v| ≤ v_max/2^j}, where u_max and v_max are the longest baselines observed in each direction.
The first method to incorporate the measurement operator into the reconstruction process is to calculate the gradient of the data fidelity at each scale of the network:

∇_{x_j} (1/2) ‖Φ_j x_j − y_j‖²_2 = Φ_j*(Φ_j x_j − y_j),

where x_j ∈ R^{N_j} is the first channel of the current iterate at scale j of the network and y_j ∈ C^{M_j} is the low-pass filtered measurement vector of visibilities. By analogy to the Radon transform, the gradient does not contain the high-frequency features, which are reinstated via the ramp filter in the filtered back-projection. Therefore, we also include the filtered gradient, where the high-frequency features are boosted using a sampling-density-based filter. This is analogous to taking the gradient of a weighted least-squares norm with the measurement weights the same as the uniform weights described in Section 4.1:

Φ_j* W_j^Uniform (Φ_j x_j − y_j),

where W_j^Uniform ∈ R^{M_j × M_j} is a diagonal matrix with the measurement weights for the sub-selected measurements as the diagonal elements.
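The gradient and filtered gradient can be sketched for a generic real linear operator standing in for Φ_j (our toy stand-in, using real matrices for simplicity); a finite-difference check confirms that Φ^T(Φx − y) is indeed the gradient of the data fidelity.

```python
import numpy as np

def data_fidelity(phi, x, y):
    """0.5 * || phi x - y ||_2^2 for a real linear operator phi."""
    r = phi @ x - y
    return 0.5 * np.sum(r ** 2)

def gradient(phi, x, y, weights=None):
    """Gradient phi^T W (phi x - y): weights=None gives the plain gradient,
    a vector of measurement weights gives the filtered gradient."""
    r = phi @ x - y
    if weights is not None:
        r = weights * r
    return phi.T @ r

rng = np.random.default_rng(1)
phi = rng.standard_normal((12, 8))
x, y = rng.standard_normal(8), rng.standard_normal(12)

# finite-difference check of the first component of the unweighted gradient
eps = 1e-6
e0 = np.zeros(8)
e0[0] = 1.0
fd = (data_fidelity(phi, x + eps * e0, y) - data_fidelity(phi, x - eps * e0, y)) / (2 * eps)
assert np.isclose(gradient(phi, x, y)[0], fd, atol=1e-4)
```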
In addition to the gradient information, we also add the scale-restricted dirty reconstruction (the adjoint of the measurement operator applied to the noisy measurements), Φ_j* y_j. In doing so, the network can learn an update between the dirty reconstruction, the current iterate passed down by the network, and the gradient and filtered gradient information based on the current iterate.
The measurement information encoded in the (sub-scale) gradient, the filtered gradient, and the dirty reconstruction is added at each scale of the U-Net architecture, after downsampling and after upsampling. The added measurement information at each scale of the network is given by Equation 35, where Λ_{θ,j} is the learned convolutional operator that combines the measurement operator information. At each scale of the network, the gradient information is calculated for the first channel of the current iterate. The learned convolutional operator takes the three channels (gradient, filtered gradient, and dirty reconstruction) representing the measurement information and convolves them with a number of filters matching the number of channels of the current iterate in the network, so as not to dilute the measurement operator information when combining it with the original channels.
Evaluating the added measurement information in Equation 35 results in two evaluations of the measurement operator (one each for the gradient and filtered gradient) and three evaluations of the adjoint operator (one each for the gradient and filtered gradient, and one for the scale-restricted dirty reconstruction). Since the measurement information is added after down- and upsampling, this results in four full-scale evaluations of the forward operator and six full-scale evaluations of the adjoint of the measurement operator. An additional evaluation of a single full-scale pseudo-inverse of the measurement operator is also required for the initial reconstruction.
This method is computationally more expensive than the U-Net post-processing method because the measurement operator has to be evaluated several times. However, as most of the evaluations of the measurement operator are at reduced scales, the impact on reconstruction time is kept to a minimum. Furthermore, the inclusion of multiple iterations with the (albeit downsampled) measurement operator enhances the impact of the model on the reconstruction, improving the robustness and generalizability of the model. Similar approaches that leverage multiscale evaluation of the forward/adjoint operators to reduce computational cost, with application to X-ray CT imaging, were proposed in Hauptmann et al. (2020) and Trent (2020).

EXPERIMENTS AND RESULTS
Figure 7. At each scale the network computes the gradient (Equation 32) and filtered gradient (Equation 33) of the data fidelity, as well as a scale-restricted dirty image (Equation 34). These are combined using a CNN layer (Equation 35), concatenated to the original channels, and passed on through the rest of the network.

In order to evaluate the reconstruction performance and computational cost of the proposed learned imaging methods, we perform reconstructions on simulated measurements and compare the reconstructions to the ground truths. To assess reconstruction quality we compute the peak signal-to-noise ratio (PSNR) to measure how well the reconstruction matches the ground truth, and the structural similarity index measure (SSIM; Wang et al. 2004) to assess the structural similarity of the reconstructions. We compare the two learned approaches described in Section 5 to a reconstruction obtained with the pseudo-inverse (Equation 21) and to a primal-dual method as described in Section 3.1.2 (the implementation uses OptimusPrimal and runs for 300 iterations), which produces results on par with the current state-of-the-art in astronomical interferometric reconstruction. Specifically, we discuss the different datasets used, the measurement simulation process, network training, and reconstruction results. Besides assessing the computational cost and the reconstruction quality of the different approaches, we also evaluate the robustness of the proposed methods to additional noise and their generalizability to other, smaller datasets through transfer learning.
Implementations of the measurement operators, the learned imaging techniques, and the routines used to simulate interferometric measurements and train the neural networks can be found in our LeIA codebase.

Datasets
We consider three datasets with different characteristics to evaluate the reconstruction performance of the imaging methods. The three datasets cover different potential use cases of the SPIDER instrument, including standard imaging, astronomical imaging, and Earth observation. All images are converted to greyscale and cropped to a size of 256 × 256. For each epoch of training the images are rotated and flipped randomly, and measurements are simulated and contaminated with random Gaussian noise as described in Section 6.2. Since these methods are fully convolutional networks, they can be adapted to reconstruct at larger image sizes, provided the measurement operator is adapted accordingly. Note, however, that the SPIDER instrument only sparsely samples the Fourier domain (SPIDER acquires at a sparsity of 4440/(256 × 256) ≈ 7%); because of this sparsity, reconstructing at a larger image size will not necessarily yield reconstructions at higher resolution.
The Common Objects in COntext (COCO; Lin et al. 2014) dataset is a large and diverse set of natural images. We selected a subset of 3000 images, split into 2000 training and 1000 test images. Instruments similar to SPIDER could in the future be considered as standard imaging devices for natural images. Furthermore, large datasets of readily available natural images such as this can be used for transfer learning, as considered subsequently. This dataset is initially used to demonstrate the general reconstruction performance of the reconstruction methods when trained on a large set of diverse images. All the images in this dataset have dimensions larger than our network input size, and during training a random region of 256 × 256 is cropped out of the images at every epoch.
Next, we consider a small domain-specific dataset of 450 simulated galaxy images, split into 300 training and 150 test images. The galaxy images are obtained from the IllustrisTNG simulations (Nelson et al. 2019b). The H-alpha column densities from the TNG50 simulation of the IllustrisTNG project (Nelson et al. 2019a; Pillepich et al. 2019) are binned to a 256 × 256 grid to obtain images of simulated galaxy structures. Pixels with no simulation data are inpainted using a primal-dual method in an unconstrained setting, using as the regularization functional the ℓ1-norm of a dictionary of wavelets containing the Dirac basis and the first eight Daubechies wavelets (Db1-Db8). We add a small amount of noise to the simulation data (ISNR = 30dB) and use Bayesian inference to estimate the regularization parameter for this inpainting problem from the simulation data, as described in Pereyra et al. (2015). These steps are followed merely to process the IllustrisTNG particle simulations in order to provide a dataset of ground-truth galaxy images suitable for our imaging experiments.
The final dataset we consider is another small domain-specific dataset of 450 satellite Earth observations taken from the Deep Globe satellite challenge (Demir et al. 2018), split into 300 training and 150 test images.

Simulating measurements
For a SPIDER instrument with parameters as defined in Table 1, we simulate measurements using the NUFFT operator (Section 4.1) with an upsampling factor of 2 per dimension, resulting in upsampling to 512 × 512 images, and a 6 × 6 KB kernel, which in this configuration results in y ∈ C^4440 measurements.

Training and transfer learning
We train the networks on pairs (y_i, x_i) of synthetic noisy SPIDER Fourier measurements and images with the mean squared error (MSE) cost

L(θ) = (1/T) Σ_{i=1}^{T} ‖Φ_θ†(y_i) − x_i‖²_2,

where T is the number of training pairs and Φ_θ† : C^M → R^N is the learned pseudo-inverse operator (i.e. the networks defined in Section 5), such that Φ_θ†(y) gives the learned reconstruction of the noisy measurements y.
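The MSE cost can be sketched as a per-image squared error averaged over the batch (our minimal NumPy version, with the reconstructions Φ_θ†(y_i) precomputed):

```python
import numpy as np

def mse_loss(recon, target):
    """Per-image squared error || recon_i - target_i ||_2^2,
    averaged over the batch (first axis)."""
    recon = np.asarray(recon, dtype=float)
    target = np.asarray(target, dtype=float)
    per_image = np.sum((recon - target) ** 2, axis=tuple(range(1, recon.ndim)))
    return per_image.mean()
```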
The networks are trained using the ADAM optimizer (Kingma & Ba 2014) with a learning rate of 0.001 and a batch size of 5. The training data for each epoch are precomputed to save time. Networks are trained for 200 epochs.

Computation time
The average evaluation time required to compute reconstructed images with the different methods is shown in Table 3 for images from the COCO dataset. The evaluation times include the initial application of the NUFFT pseudo-inverse. Training times are also listed; they do not include the time required to precompute the augmented training data.
As expected, the U-Net model is highly computationally efficient due to the small number of operator evaluations. The GU-Net model is moderately slower than the U-Net model, but not significantly so, again as expected. We note that the learned methods are trained and evaluated on the GPU while the primal-dual method is run on a CPU, precluding a direct comparison of the reconstruction times. Nevertheless, the evaluation times in Table 3 are roughly proportional to the number of fine-scale operator evaluations, which corroborates that these evaluations dominate the computational cost, and so a GPU implementation of the primal-dual algorithm would remain ∼600× slower than our U-Net model and ∼55× slower than our GU-Net model. The computational times required by our learned models to recover images, of the order of tens of milliseconds, are sufficiently low that our proposed methods can indeed open up real-time imaging for the SPIDER instrument.

Reconstruction quality
We first compare the pseudo-inverse, primal-dual, and two learned reconstruction methods on the COCO dataset. The distributions of the quantitative metrics (PSNR and SSIM) over the training and test sets are depicted in Figure 8. The reconstructions for a subset of training and test images from the COCO dataset are illustrated in Figure 9.
Our U-Net model performs similarly to the primal-dual algorithm, which represents the state-of-the-art variational regularization approach, with marginally lower PSNR but marginally greater SSIM. However, recall that imaging with the U-Net model is orders of magnitude faster than with the primal-dual algorithm. Our GU-Net model outperforms both the U-Net and primal-dual approaches in both PSNR and SSIM, as expected since it makes greater use of knowledge of the measurement operator, while introducing only a moderate increase in computation time compared to the U-Net approach. All methods outperform the pseudo-inverse, which is to be expected.
We note the near mirror symmetry of the plots between the training and test sets, which indicates only a small difference in the distribution of the metrics between the train and test sets (this is also the case for the primal-dual method, which is dataset agnostic since it does not involve any training). This suggests that the trained models generalize well to unseen data in the same domain.
From the examples in Figure 9 it is apparent that the pseudo-inverse yields noisy reconstructions with aliasing artefacts, and the primal-dual algorithm provides an improvement but still yields reconstructions with considerable artefacts. The U-Net model generates good reconstructions that are sometimes over-smoothed, losing details, while the GU-Net model does best at recovering details and suppressing artefacts, resulting in the overall best reconstructions.

Robustness to noise
Next, we evaluate the robustness of our trained models with respect to increased levels of additive Gaussian noise in the input. All models are trained with ISNR = 30dB; however, we vary the ISNR from 30dB to 12.5dB in the test images. The averages of PSNR and SSIM over a set of COCO images are shown in Figure 10.
The plots in Figure 10 both show a relative plateau followed by a transition to linear decay at around ISNR = 20dB, indicating robustness to a moderate increase in noise level, with a linear degradation for higher noise levels. The PSNR curve of the GU-Net remains above that of the U-Net as the noise increases, while the SSIM curves essentially lie on top of one another for higher noise levels. This suggests that the advantage of the GU-Net in terms of SSIM is marginal for higher noise levels, while some PSNR advantage is maintained even at higher noise levels. Overall, Figure 10 suggests sufficient robustness of the trained models with respect to varying input noise levels, such that a moderate underestimation of the noise level will not result in a significant loss of reconstruction quality.

Figure 8. Distributions of the quantitative metrics (PSNR and SSIM) for the different reconstruction methods on both the train and the test sets from the COCO dataset of natural images, for measurements with an ISNR of 30dB. The reconstructions are made using the pseudo-inverse of the measurement operator, a primal-dual optimization scheme representing the state-of-the-art, our learned post-processing approach (U-Net), and our learned unrolled iterative approach (GU-Net). The dashed and dotted lines inside the distributions indicate the mean and quartiles of the distributions.

Generalization to other datasets
Besides assessing the performance of the models on natural images, where we have a reasonably large number of training instances, we also consider other datasets where data availability is limited, as is the case for the galaxy and satellite datasets described in Section 6.1.
We consider three approaches for training networks for the two smaller, domain-specific datasets: (i) training on the small dataset from scratch; (ii) reconstruction of test images using the network trained on the COCO dataset without retraining; and (iii) transfer learning via initializing with the COCO pretrained networks and fine-tuning on images from the small dataset for 100 epochs.
As a first test, we compare these three approaches on the smaller, domain-specific galaxy dataset. Figure 11 shows GU-Net reconstructions for a test galaxy image in these three scenarios. The network trained on just the galaxy images produces artefacts in its reconstruction, due to the small volume of training data, while the reconstruction from the model trained solely on images from the COCO dataset appears blurred, as the galaxy images do not have sharp edges. The transfer learning approach, which leverages both the COCO dataset and the smaller dataset of galaxy images, recovers the image with the highest PSNR. Therefore, we restrict further comparisons to the transfer learning scenario.
The distributions of the quantitative metrics over the training and test sets of the galaxy dataset for the transfer learning scenario can be seen in Figure 12, with example reconstructions in Figure 13. From Figure 12 it is evident that the U-Net has over-fitted the training data, as its performance metrics on the test data are substantially worse. Nevertheless, the example reconstructions of both learned models in Figure 13 are visually very similar. We suppose this is due to the nature of the galaxy dataset, for which neither PSNR nor SSIM is a particularly well-suited metric. Indeed, visually the galaxy images reconstructed by the learned models are arguably slightly more appealing than those recovered by the primal-dual algorithm, while the quantitative metrics are similar or arguably marginally favour the primal-dual case. The close match between the train and test distributions for the GU-Net indicates that incorporating the physical model into the learned reconstruction yields models with superior generalization performance on unseen data.
The same experiments are performed for the satellite image dataset, with distributions of metrics shown in Figure 14 and example reconstructions shown in Figure 15. Both our learned approaches outperform the primal-dual approach, in terms of both metrics and visual inspection (the primal-dual algorithm yields grainier, noisier reconstructions). Moreover, the GU-Net model exhibits a slight improvement over the U-Net model, particularly for images with clear edges.
In summary, by adopting a transfer learning approach our learned imaging methods achieve similar or superior reconstruction quality, while dramatically reducing the computational time required to recover images.
Our first learned method adopts a learned post-processing approach to clearly separate the physical measurement model from the learned imaging via a U-Net model. A consequence of this configuration is that the pseudo-inverse of the physical measurement model need only be applied once, resulting in an orders-of-magnitude reduction in imaging time compared to the benchmark traditional optimization algorithm. Reconstruction quality is similar to the benchmark traditional algorithm.
Our second learned method, the GU-Net, uses a similar architecture but enhanced with multiscale evaluations of the gradient of the data fidelity term, interweaving the measurement model into the U-Net architecture. While this results in a moderate increase in the time required to recover images compared to the U-Net model, reconstruction quality is improved by making greater use of knowledge of the physical measurement model.
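For a linear measurement operator, the gradient of the data-fidelity term that the GU-Net evaluates at multiple scales is straightforward to write down. The following is a minimal NumPy sketch with a generic dense matrix `A` standing in for the SPIDER measurement operator (an illustrative assumption, not the LeIA implementation):

```python
import numpy as np

def data_fidelity_grad(x, A, y):
    """Gradient of f(x) = 0.5 * ||A x - y||^2, i.e. Re{A^H (A x - y)}."""
    return (A.conj().T @ (A @ x - y)).real

# At a point x0 with exact (noiseless) measurements y = A x0,
# the gradient vanishes, as expected for a minimizer of f.
rng = np.random.default_rng(0)
A = rng.standard_normal((32, 64)) + 1j * rng.standard_normal((32, 64))
x0 = rng.standard_normal(64)
y = A @ x0
grad_at_solution = data_fidelity_grad(x0, A, y)
```

In the GU-Net, such gradients are computed with scale-restricted versions of the measurement operator and then fed back into the network as additional channels.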
Overall, our learned methods achieve similar or superior reconstruction quality compared to traditional approaches, while realizing a dramatic reduction in the time required to recover images, to the extent that real-time imaging with the SPIDER instrument becomes possible for the first time, opening up many new use cases.
While our learned methods are trained using data with a specific noise level, their performance on measurements with increased noise remains similar for a moderate increase in noise level. Furthermore, in scenarios where only a limited volume of training data is available, as is common in domain-specific problems, the performance of the models can be increased by transfer learning from a domain with sufficient training data. While for actual observations one should construct as representative a training set as possible, our learned methods show that it is possible to achieve similar or superior reconstruction quality to the traditional primal-dual approach in settings where limited training data are available, and at a fraction of the computational cost.
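For concreteness, the robustness experiments vary the input signal-to-noise ratio (ISNR) of the measurements. The following is a minimal sketch of adding complex Gaussian noise at a target ISNR, assuming the common definition ISNR = 20 log10(‖y‖/‖n‖); the paper's exact noise model is given in its Section 6.2:

```python
import numpy as np

def add_noise(y, isnr_db, rng):
    """Add complex Gaussian noise scaled so 20*log10(||y||/||n||) ~= isnr_db."""
    sigma = np.linalg.norm(y) / np.sqrt(y.size) * 10 ** (-isnr_db / 20)
    noise = sigma / np.sqrt(2) * (
        rng.standard_normal(y.shape) + 1j * rng.standard_normal(y.shape)
    )
    return y + noise
```

Testing a model trained at ISNR = 30 dB on noisier measurements then simply amounts to calling `add_noise` with a smaller `isnr_db`.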
In addition, we have presented two approaches to modelling the measurement process of the SPIDER instrument. One method is based on the NUFFT, which is widely used in radio interferometry and is applicable to any arbitrary sampling distribution. We also present a new modelling approach based on the Radon transform, an NU-Radon method, that applies specifically when the sampling distribution has radial spokes of measurements, as is the case for SPIDER, and show its similarities to works in the medical imaging domain. This latter approach is computationally more efficient when the number of measurements is large and the calculation of the measurement operator is dominated by the (de)gridding of measurements, or when the number of pixels is large for a modest number of spokes. Both modelling methods and all imaging techniques have been implemented in Python and are available for use in the LeIA code.
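To make the measurement-operator discussion concrete, the exact (but slow) NUDFT that the NUFFT and NU-Radon methods approximate can be written as a dense matrix of complex exponentials, with the dirty image obtained by applying its adjoint. The following is a toy NumPy sketch, a stand-in for illustration rather than the LeIA implementation:

```python
import numpy as np

def nudft_matrix(uv, grid):
    """Dense NUDFT: one row of complex exponentials per (u, v) sample point."""
    return np.exp(-2j * np.pi * (uv @ grid.T))

# Toy example: measure a 16x16 image at random uv points, then form the
# dirty image by applying the adjoint of the measurement operator.
n = 16
xs = (np.arange(n) - n // 2) / n                       # pixel coordinates
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)
rng = np.random.default_rng(0)
uv = rng.uniform(-n / 2, n / 2, size=(128, 2))         # non-uniform samples
A = nudft_matrix(uv, grid)                             # measurement operator
y = A @ rng.random(n * n)                              # simulated measurements
dirty = (A.conj().T @ y).real.reshape(n, n)            # dirty image
```

The cost of this dense matrix-vector product scales with the number of measurements times the number of pixels, which is precisely what the accelerated NUFFT and NU-Radon approaches avoid.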
While the methods presented in this work can be used with any arbitrary sampling distribution (when using the NUFFT), the sampling distribution we have considered is fixed. For radio interferometric imaging the sampling distribution typically varies for each observation, as it depends on the location of the object in the sky as well as the length of the observation. Looking to the future, in order to extend the learned methods presented in this paper to radio interferometry, the networks need to be trained in such a way as to support varying sampling patterns. This is the focus of ongoing research, for which preliminary results are encouraging. Variants of the methods presented in this paper are therefore likely to be of use not only for SPIDER imaging but for radio interferometric imaging more generally.

Figure 1. A diagram of a proposed SPIDER instrument design with 37 PICs with lenslets attached to them. Image credit: Kendrick et al. (2013).

Figure 2. (Left) The physical locations of the lenslets of the 2D interferometer as detailed in the design proposed in Kendrick et al. (2013). The lenslets on each of the radial spokes are mounted to their respective PIC. (Right) The measured baselines of the interferometric measurements between pairs of lenslets on the PICs of the SPIDER instrument. The number of baselines is increased by measuring at different spectral frequencies. Note that the measurements all lie along the directions of the spokes, since measurements are only made using pairs of lenslets on a single PIC.

Figure 3. Generating a dirty reconstruction by applying the adjoint of the measurement operator, using the NUDFT, to interferometric measurements of the SPIDER instrument (left), compared to accelerated approximations of the adjoint operation using the NUFFT (centre) and NU-Radon (right) approaches, as detailed in Sections 4.1 and 4.3 respectively. The MSE of the two accelerated approaches relative to the NUDFT is shown, where the error is calculated only inside the circular aperture that limits the NU-Radon method.

Figure 4. Measurements are simulated from the original image from the COCO dataset of natural images, shown in the left-most panel, using a NUFFT with the SPIDER sampling pattern. Images generated by applying the adjoint NUFFT operation, the pseudo-inverse NUFFT operation, and the pseudo-inverse NU-Radon operation are shown from left to right.

Figure 5. The learned post-processing approach takes the interferometric measurements and uses the telescope model to create an initial reconstruction, in our case the dirty image. This is then passed through the learned post-processing network to create the final reconstruction.

Figure 7. Left: the neural network architecture for the U-Net model used for learned post-processing of the initial reconstruction estimated by the pseudo-inverse of the measurement operator applied to the noisy measurements. Right: the neural network architecture for the GU-Net model for the learned iterative approach used to reconstruct an image from interferometric measurements. The input image of the network is the pseudo-inverse of the measurement operator applied to the noisy measurements. The model-based operation in this network takes the first channel at each scale and calculates the gradient (Equation 32) and filtered gradient (Equation 33) of the data fidelity, as well as a scale-restricted dirty image (Equation 34). These are combined using a CNN layer (Equation 35), concatenated to the original channels, and passed on through the rest of the network.
Figure 10. Robustness of learned models (trained for ISNR = 30 dB) to increased input noise level: mean PSNR (left) and SSIM (right), averaged over a set of 100 images from the COCO test set, as a function of the ISNR in the noise model (as described in Section 6.2). Our learned models exhibit sufficient robustness to variations in noise, such that a moderate underestimation of the noise level will not result in a significant loss of reconstruction quality.

Figure 11. Comparison of the three different approaches for training a model for smaller, domain-specific datasets (as described in Section 6.3). From left to right: the original image, a reconstruction from a network trained only on the small galaxy dataset, a reconstruction using a network trained only on natural images from the COCO dataset, and a reconstruction made with a network trained first on natural images and then fine-tuned using transfer learning to the galaxy dataset. The transfer learning approach recovers images with the highest reconstruction quality and the fewest artefacts.

Figure 12. Distribution of quantitative imaging metrics (PSNR and SSIM) over the train and test sets of the galaxy dataset. The dashed and dotted lines indicate the mean and quartiles of the distributions. The learned approaches are trained using images of the COCO dataset first, and are then adapted to the galaxy dataset by transfer learning.

Figure 14. Distribution of quantitative imaging metrics (PSNR and SSIM) over the train and test sets of the Deep Globe satellite image dataset. The dashed and dotted lines indicate the mean and quartiles of the distributions. The learned approaches are trained using images of the COCO dataset first, and are then adapted to the satellite image dataset by transfer learning.

Table 3. Number of full-scale measurement operator calls, reconstruction time averaged over 1000 images of the COCO dataset, and training time, for the pseudo-inverse, our learned methods, and a variational regularization approach. The reconstruction times for our learned methods are sufficiently low to enable real-time imaging with the SPIDER instrument. *Refers to operator evaluation at the finest scale, which dominates the computational time of the GU-Net.