deep PACO: Combining statistical models with deep learning for exoplanet detection and characterization in direct imaging at high contrast

Direct imaging is an active research topic in astronomy for the detection and the characterization of young sub-stellar objects. The very high contrast between the host star and its companions makes the observations particularly challenging. In this context, post-processing methods combining several images recorded with the pupil tracking mode of telescope are needed. In previous works, we have presented a data-driven algorithm, PACO, capturing locally the spatial correlations of the data with a multi-variate Gaussian model. PACO delivers better detection sensitivity and confidence than the standard post-processing methods of the field. However, there is room for improvement due to the approximate fidelity of the PACO statistical model to the time evolving observations. In this paper, we propose to combine the statistical model of PACO with supervised deep learning. The data are first pre-processed with the PACO framework to improve the stationarity and the contrast. A convolutional neural network (CNN) is then trained in a supervised fashion to detect the residual signature of synthetic sources. Finally, the trained network delivers a detection map. The photometry of detected sources is estimated by a second CNN. We apply the proposed approach to several datasets from the VLT/SPHERE instrument. Our results show that its detection stage performs significantly better than baseline methods (cADI, PCA), and leads to a contrast improvement up to half a magnitude compared to PACO. The characterization stage of the proposed method performs on average on par with or better than the comparative algorithms (PCA, PACO) for angular separation above 0.5".


INTRODUCTION
High-contrast imaging is an observational method used to study the close environment of stars (Traub & Oppenheimer 2010;Bowler 2016;Pueyo 2018).It is particularly adapted to detect young, massive and hot exoplanets (see e.g.Chauvin et al. (2004Chauvin et al. ( , 2005)); Schneider et al. (2011); Nielsen et al. (2019)), thus complementing well indirect exoplanet detection methods such as transit photometry or Doppler spectroscopy (Santos 2008).Direct imaging offers other appealing characteristics like the detection of candidate companions from a few hours of observations and the ability to characterize them in terms of age, surface gravity, effective temperature and composition (Allard et al. 2003(Allard et al. , 2007)), or to predict their evolution (Burrows et al. 1997;Chabrier et al. 2000).Despite these promises, only a few dozens exoplanets have been unveiled and characterized since the ★ O. Flasseur is currently with CRAL (5), and was previously with LESIA (1) and Inria (2) where most of this work was performed.E-mail: olivier.flasseur@univ-lyon1.fr emergence of direct imaging in the early 2000s (Marois et al. 2008;Lagrange et al. 2009;Nielsen et al. 2012;Macintosh et al. 2015;Chauvin et al. 2017;Keppler et al. 2018).This is mainly due to (i) the relatively low occurrence of giant exoplanets, (ii) the very high contrast between the host star and the exoplanets (typically, higher than 10 5 in the infrared), and (iii) the required high angular resolution (typically, better than a tenth of arcsec).
In this context, cutting-edge ground-based facilities like VLT/SPHERE (Beuzit et al. 2019), Gemini/GPI (Macintosh et al. 2014), Keck/NIRC2 (Castellá et al. 2016), Magellan/MagAO (Morzinski et al. 2014) or SUBARU/SCExAO (Jovanovic et al. 2015) are equipped with an (extreme) adaptive optics system and a coronagraph to attenuate as much as possible the glare of the star.Currently, the non-blocked residual starlight contamination and its temporal fluctuations remain the main limitation.It takes the form of spatially-correlated speckles that resemble the expected signature of a point-like source (e.g., an exoplanet, a brown dwarf, a background star).The observations are also impacted by additional sources of noise (i.e., thermal background flux, detector readout, photon noise).
Together, speckles and noise form a spatially and non-stationary nuisance component that corrupts the signals of the sought objects.Off-axis objects can either take the form of point-like sources or that of spatially extended features like circumstellar disks.In this paper, we focus on the detection of point-like sources, leaving the problem of disk reconstruction for future work.
In order to unmix the sought objects from the nuisance component, high-contrast observations are performed with dedicated strategies like angular differential imaging (ADI; Marois et al. (2006)), that we consider in this paper.ADI consists in tracking the observed target over time, with the telescope derotator tuned to keep the telescope pupil stable while the field of view rotates around the host star.Consequently, in the resulting 3-D datasets (2-D + time), the objects of interest follow an apparent motion along a deterministic circular trajectory centered on the star while the telescope pupil remains static.With ADI, speckles due to residual starlight aberrations are quasi-static, i.e., they are strongly correlated across exposures.ADI also allows to extract the astrometry and the photometry of detected sources.These estimates can be used to characterize the physical properties of the detected sources by fitting orbital and atmospheric models (Vigan et al. 2010;Cheetham et al. 2019;Mesa et al. 2019).
In this paper, we address two tasks: (i) the detection of point-like sources, and (ii) the estimation of their photometry.
The last cornerstone of high-contrast imaging is data reduction, i.e., the combination of the recorded images by dedicated postprocessing algorithms.This critical step brings the additional gain in contrast (between one and three orders of magnitudes for existing methods) needed to detect the faint signals coming from thermal selfemission of giant exoplanets.The classical principle is to estimate a reference image (so-called on-axis PSF) of the nuisance component, that can be subtracted from the data in order to unveil the sought objects.A simple practical implementation of this general strategy consists in subtracting the temporal mean or median of the dataset from each frame of the ADI stack.The residual images are then co-aligned to the true-North so that the signals of the sought objects are superimposed and can be combined by temporal stacking.This is the principle of the cADI method designed to process the first direct imaging observations (Marois et al. 2006;Lagrange et al. 2009).In the last decade, several more advanced methods have been developed, see, e.g., Pueyo (2018) for a review.In particular, TLOCI (Marois et al. 2013(Marois et al. , 2014) ) (or its variants such as LOCI (Lafrenière et al. 2007), ALOCI (Currie et al. 2012a,b), MLOCI (Wahhaj et al. 2015)), and KLIP/PCA (Soummer et al. 2012;Amara & Quanz 2012) are currently implemented in most reduction pipelines (Amara & Quanz 2012; Gonzalez et al. 2017;Galicher et al. 2018).LOCI-based and PCA-based algorithms are considered as standards to process highcontrast observations.With the (A, M, T)-LOCI algorithm, the onaxis PSF is estimated by combining images selected in a library.The combined images are selected and weighted to minimize the residual noise and maximize the throughput of point-like sources simultaneously.The PCA algorithm performs a principal component analysis of the data, and a low-rank estimate of the on-axis PSF is formed by keeping the first principal components of the decomposition.In the same vein, the LLSG algorithm (Gonzalez et al. 2016) decomposes the dataset into low-rank, sparse and Gaussian components.Because few sources are expected in the field of view, their signatures are mainly recovered in the sparse component.The RSM algorithm (Dahlqvist et al. 2020(Dahlqvist et al. , 2021a,b) ,b) combines residual images obtained with different post-processing algorithms (e.g., cADI and PCA) to leverage their specific benefits and mitigate their respective drawbacks.Since all of these algorithms are based on the estimation and subtraction of an off-axis PSF, they are facing a common pitfall: they fail in deriving a statistically grounded detection map, especially at short angular separations.As a consequence, the identification of candidate sources partly relies on visual inspection of the detection map.
To circumvent this issue, several works have considered alternative ways to reduce the data in order to produce more quantitative outputs.The derivation of a custom signal-to-noise ratio (S/N) through a -test empirically corrected for the varying number of samples as a function of the angular separation (Mawet et al. 2014) is a pioneering work in this direction.Jensen-Clem et al. (2017) recommend the adoption of metrics combining the achievable contrast with the fraction of detected sources.Instead of normalizing the combined residual images by the empirical standard-deviation of the noise (roughly) approximated on annuli, Pairet et al. (2019) propose to build a detection map directly from the set of residual images by comparing the variance of samples located on the expected trajectory of putative sources.Other methods reformulate the detection task as an inverse problem.Among them, ANDROMEDA (Mugnier et al. 2009;Cantalloube et al. 2015) and FMMF (Ruffio et al. 2017) build a model of the residual off-axis signal after subtraction of the estimated on-axis PSF.The PACO algorithm (Flasseur et al. 2018a(Flasseur et al. ,b,c, 2020a,b) ,b) builds a more consistent statistical model, self-calibrated on the data, that accounts for the spatial correlations of the nuisance component at the scale of small image patches of a few tens of pixels, see Sect.2.1.1.
Given the success of data-driven approaches in solving various high-level imaging tasks (e.g., detection, segmentation, classification) in very diverse fields (e.g., photography, microscopy, biomedical imaging, remote sensing), machine learning and deep learning approaches have also been investigated by the direct imaging community.Fergus et al. (2014) report one of the first steps in this direction with a discriminative approach exploiting the specific structure of high-contrast data.Based on support vector machines, the underlying model is trained from two-classes samples generated by resorting to massive injections of fake companions.Gonzalez et al. (2018) formalize the detection problem as a binary classification task and propose a fully supervised deep learning approach also trained in a supervised fashion.They use collections of patches pre-processed by PCA for different numbers of principal components as input of a random forest or of a convolutional neural network that decides in favor of the presence or on the absence of a point-like source in each patch.While demonstrating powerful detection capabilities, this algorithm showed to be prone to a high level of false alarms in some cases (Cantalloube et al. 2020).Besides, the tuning of hyperparameters remains a critical point making the operating point difficult to reach.Generative adversarial networks (GANs, Goodfellow et al. (2014)) have been used to produce multiple realizations of pure nuisance component as an alternative way to generate a large basis of labeled samples used to train a deep discriminative model (Yip et al. 2019).Recent works (Samland et al. 2021;Gebhard et al. 2022) recast the unmixing problem as a regularized regression task.The underlying linear model is in charge of explaining the evolution of the nuisance component (and possibly, the evolution of the source signals) in a time series extracted at a given pixel location from temporal series of reference selected to be signal free and causally independent from the putative source signals.
Intensive testing of PACO, both on public (Cantalloube et al. 2020) as well as on consortia data challenges, and more recently on a subsample of about 75 observations from the SPHERE's SHINE survey (Chomez et al. 2023) shows that PACO is one of the algorithms of choice to process high-contrast observations.In particular, the latter work (Chomez et al. 2023) demonstrates the expected gain in terms of achievable contrast (up to 10 −7 at a few arcsec), and in terms deep PACO: deep learning meets PAtch COvariances 3 of the underlying exoplanet population (with a mass up to 5 M Jup at 5 AU for stars at about 60 parsec away).Thanks to its unique data-driven modeling of the nuisance component, accounting for its non-stationary spatial correlations, PACO is especially well suited to process observations in which the typical spatial extent of speckles lies in a patch of a few tens of pixels.However, the statistical model of PACO is approximate in case of spatial correlations spread over a patch of a few tens of pixels (e.g., for background-limited observations and/or in case of unstable observing conditions).This motivates the path we follow in this work: we propose a new detection algorithm, named deep PACO, that combines the statistical model of PACO with a supervised deep learning framework.The statistical model of PACO is used to improve the stationarity and the contrast of the data in a pre-processing step, while deep learning is in charge of correcting for the (putative) approximate fidelity of the statistical model of PACO to the reality of the observations.To do so, the data are centered and whitened locally using the PACO framework, and a CNN is trained in a supervised fashion to detect the residual signature of synthetic sources from pre-processed science data.The network is trained from scratch using full frame samples generated with a custom data augmentation strategy allowing to build a large training set from a single ADI dataset.Finally, the underlying discriminative model is applied to the pre-processed observations and delivers a detection map.Detected sources are then photometrically characterized by a second deep neural network, also trained from scratch using patch samples generated with a dedicated data augmentation step.On this latter part, while the recent astronomy literature reports several works for photometry estimation through deep learning models (see e.g., Boucaud et al. (2020); Cabayol et al. (2021); Huertas-Company & Lanusse (2022) for galaxie's photometry or red-shift estimation), to the best of your knowledge this is the first time that deep learning is employed to estimate the flux of detected sources in direct imaging at high-contrast.
This paper1 is organized as follow.Section 2 presents the main ingredients of the detection part of the proposed algorithm.Section 3 focuses on the characterization stage of the proposed method.Section 4 evaluates the detection and characterization performance on several high-contrast observations from the IRDIS imager (Dohlen et al. 2008) of the VLT/SPHERE instrument (Beuzit et al. 2019).Finally, Sect. 5 draws the paper conclusions and gives future research prospects.
Throughout the text, the reader can refer to Table 1, Figs. 1-2, and Table B1, summarizing respectively the main notations, the processing pipeline of the proposed approach, and the main setting of the different parameters.

Stastical model of the non-stationary patch covariances
Section 2.1.1 recalls for completeness the main ingredients of the statistical model embedded in the PACO algorithm for ADI observations (Flasseur et al. 2018b(Flasseur et al. , 2020a)).Section 2.1.2describes how to use this model to attenuate the strong and spatially non-stationary correlations of the data as well as to improve the contrast.Resulting residuals from this pre-processing step are centered and whitened  Ablation studies have shown that this pre-processing step is of primary importance: the training step fails to converge in its absence due to the high non-stationarity of the nuisance component and the large fluctuations it displays near the star, see Fig. 1.This effect is due to the fact that standard deep learning architectures assume some degree of invariance, in particular that the data are normalized in a specific range and are corrupted by a stationary noise, see also Sect.4.2.1.

Statistical model of the nuisance component
A dataset  in R  × recorded with the ADI technique is formed by -pixel images captured at different times  in ⟦1; ⟧.The direct model for the observed intensity is: where  in R  × is the nuisance component, and    in R  × stands for the contribution of a point source  ∈ ⟦1; ⟧ with a contrast   in R + that is assumed constant during the few hours of the total observations.The contribution of a source  takes the form of the off-axis PSF centered at location F  (  ) in the -th image, where   is its initial location on an image at a reference time  ref (e.g.,  ref =  1 ).The function F  is a geometric transform (typically in ADI, a circular translation with respect to the star located at the center of the images) modeling the apparent motion of the field of view between the observation configurations at time  ref and time .The mapping F  is deterministic since it depends solely on the measured parallactic angles.Given that very few sources are expected in the field of view, we assume that the measured intensity is the superimposition of the nuisance component and at most one unresolved point-like source  at each pixel location , i.e., multiple sources do not overlap.
In previous works on the PACO algorithm (Flasseur et al. 2018a,b,c) maximum likelihood are the following: (2) Since the number of samples available, i.e. the number  of temporal frames, is typically equivalent or smaller than the number  of pixels in a patch 3 , the sample covariance S  is very noisy and may even be rank deficient.A form of regularization must be enforced to stabilize the estimate and allow the inversion of the covariance matrix involved in the data whitening step (see Sect. 2.1.2).We use a shrinkage estimator (Ledoit & Wolf 2004;Chen et al. 2010) formed by a convex combination between the low bias/high variance estimator S  and a high bias/low variance estimator F  : 3 The number of pixels in a patch is determined in a data-driven fashion, as described in Flasseur et al. (2018b) for the PACO algorithm.It corresponds to twice the full width at half maximum (FWHM) of the measured off-axis PSF at the given wavelength.In practice, this empirical rule typically leads to  in ⟦7 2 ; 12 2 ⟧ pixels for the VLT/SPHERE instrument operating at a wavelength  ∈ [0.9; 2.2] µm.
where F  is a diagonal matrix encoding the sample variances: The hyper-parameter   plays a central role since it governs a biasvariance trade-off.In our previous works (Flasseur et al. 2018b(Flasseur et al. , 2021)), we have derived its closed-form expression, which is an extension of the results of Chen et al. (2010) in the case of a non-constant shrinkage matrix F  : where  is the number of non-null patches involved in the computation of S  .Here,  is equal to  everywhere.
In Appendix A1, we discuss a refinement of this statistical model to account for the temporal fluctuations of the observations.It leads to a slight improvement in terms of detection performance at the cost of an increase of the computational burden by a factor 10 to 30.For these reasons, it is not applied by default in the following.We recommend to use it, in a second step, to refine the analysis of ambiguous candidate detections found by the proposed method embedding a multi-variate Gaussian model.

Centering and local whitening of the observations
We consider a set of locations P where the statistics of the nuisance component should be computed.The cardinal of P depends solely on the patch shape and on the patch stride4 used to cover the whole field of view.For a given patch stride, we first define the number of (centered and whitened) patches averaged at each location  ′ of the field of view: where  is the indicator function (i.e.,  = is either equal to 1 if the condition  =  is met, and equal to 0 otherwise), and 1  ∈ R  (respectively, 1  ′ ∈ R  ) is the vectorization of a 1-valued patch centered at location  (respectively,  ′ ). Figure 3 gives a view of the quantity  ∈ N  in the general case, i.e. for either square and circular patches as well as non-overlapping and overlapping patches.By default, we consider square patches of  pixels area.The preprocessed images  in R  × , after centering and whitening, are obtained by: where W  is an operator performing centering and whitening of the collection of patches   R  × at location , such that L  is the Cholesky's factorization of C −1  (i.e., L  L ⊤  = C −1  ).In the specific case (considered by default in this paper) of non-overlapping square patches of  pixels, card(P) = ⌊/⌉, and Eqs. ( 6)-( 7) simplify as: In Appendix A2, we discuss a refinement of the intermediate quantity  produced by the pre-processing step under the same statistical model as described in Sect.2.1.1 or in Appendix A1.It can be noted that overlapping patches should be used with this refinement, and that the patch shape can be either squared or circular.This alternative approach leads to a better stability and robustness of the method for datasets recorded under bad observing conditions5 .However, the computational burden of this variant is increased by a factor 2 × , typically lying in ⟦100; 250⟧ for the VLT/SPHERE instrument.For these reasons, this variant is not applied by default in the following.We recommend to apply it, in a second step, when the training, validation, or inference results are clearly impacted by a significant number of false alarms, much higher than expected, which is an unambiguous sign of the limited fidelity to the observations of the pre-processing procedure used by default.

Exoplanet detection by supervised deep learning
We formalize the detection problem as a supervised pixel-wise classification task: starting from a temporal series of pre-processed images including synthetic sources, the goal is to infer a detection map  in [0; 1]  , where each pixel value represents a score between 0 and 1 such that a high (respectively, a low) score values the presence (respectively, the absence) of a source centered at that location.Interpreting this score as a true probability of presence of a source requires a control of the uncertainties with dedicated methods (see e.g., Gawlikowski et al. (2021); Hüllermeier & Waegeman (2021) for recent review papers) that is left for future work.For this reason, in the following, we refer to this score as a pseudo-probability.Besides, the -pixel detection map is larger than a -pixel single temporal frame of the pre-processed data.Due to the apparent rotation induced by ADI, it is theoretically possible to detect a source lying in the sensor field of view at a single date, see Figs. 2(c) and 9.By derotating each individual temporal frame with the corresponding parallactic angle and combining the resulting transformed measurements, an area larger than  pixels can be scrutinized.Its resulting spatial extent depends solely on the total amount of parallactic rotation between the first and the last frame.
Section 2.2.1 details the construction process of the training set, Sect.2.2.2 describes the selected model architecture, and Sect.2.2.3 discusses the metrics we consider to evaluate the performance of the proposed method.

Construction of the training basis
In high-contrast imaging, obtaining real ground-truth data is a twofold challenge.
First, the overall number of positive samples is limited as relatively few point-like sources have been confirmed to date.Second, negative samples are hard to define since some undiscovered sources might be present in the observed data.To overcome these difficulties, we adopt deep PACO: deep learning meets PAtch COvariances 7 the following training strategy: the training set consists of  pairs { (  [] ;  [] } =1: of samples resulting from the massive injection of synthetic point-like sources.In this framework, (  [] ∈ R  × represents observations, with injected synthetic sources, that have been pre-processed.The quantity  [] ∈ ⟦0, 1⟧  is the groundtruth map pointing the injection locations of any synthetic training source falling within the field of view at least in one temporal frame.The ground truth map is built for a given (and arbitrary) orientation of the field of view, e.g.aligned with the true North.The implemented simulation process is quite realistic since the injected source signature corresponds to the off-axis PSF of the target star usually measured just before or just after the main sequence of observations by decentering the coronagraph.
Second, the nuisance component varies drastically from one observation to the other, as it is highly dependent on the observing conditions, the magnitude of the star, and the instrument settings.As a consequence, we follow an observation-dependent approach, and train a different model on each observation.It means that the model parameters (except algorithmic and optimization hyper-parameters, see Sect.2.2.4) are optimized from scratch for each dataset.
This setup implies the design of a custom data-augmentation strategy (i) to prevent over-fitting of the model that is trained from a unique temporal series of images, and (ii) to account for our lack of knowledge about real sources -unknown at training time but that we aim to detect at test time-.To circumvent these issues, we apply a random permutation of the  images forming the observations  for each new training sample  ∈ ⟦1; ⟧.This operation allows us (i) to create artificially different training sets, and (ii) to break the temporal consistency of (known and unknown) real sources.Synthetic sources are then injected in the temporally permuted data using the parallactic angles and the best fit of the off-axis PSF by a Gaussian and an Airy pattern.Besides, the off-axis PSF is assumed to be time-invariant.We have checked numerically that this assumption is reasonable for our classification task.Similarly, assuming a slightly different pattern for the off-axis PSF (e.g., measured versus fitted model) between the data generation process and the training step does not lead to a significant drop in the detection performance.At this intermediate stage, each training sample  [] is obtained by:  [] = P []  + where P is an operator performing the random temporal permutation of the images of  and    in R  × represents the spatio-temporal contribution of a synthetic source centered at location   on a reference image at time  ref , see Sect. 1.The number of sources  [] , their contrasts { [] } =1: [𝑠] and their initial locations { [] } =1: [] are free parameters.In practice, the number  [] of injected sources in each training sample is drawn uniformly in ⟦1; 10⟧.This setting represents a realistic scenario since we expect a few faint point-like sources in the field of view.We denote by  the total number of injected sources over the  training sets, i.e.  =  =1  [] .The initial locations {  } =1: of the injected sources are drawn uniformly in polar coordinates (i.e., the center of the field of view is more sampled than farther away).Note that we have compared this setting with a uniform sampling in cartesian coordinates (i.e., uniform density over the field of view), and we found very similar detection performance for both settings.The selected one (i.e., uniform in polar coordinates) slightly speeds up the training procedure, likely because the pre-processed observations fluctuate slightly more near the star than farther away, thus requiring more training samples to discriminate a source from the nuisance.The range of injected flux is also a criti-cal choice.For instance, if the lower bound is too low, class overlap can occur and the model is not able to discriminate between sources and the nuisance component leading to a high level of false alarms.In the opposite case, if the upper bound is too low, evident bright sources will not be detected since there are no similar examples in the training set.In practice, we set the injection range in an unsupervised fashion.We train our model on sources which are challenging to detect with other methods: the contrast {  } =1: of the injected sources is drawn uniformly in 3  PACO is the 1-sigma contrast reached by PACO at location   .This setting covers both sources that are detectable above the standard 5 detection confidence and sources whose detection confidence remains below the 5 detection limit reached by PACO.In practice, we found that this setting is suitable to detect both faint sources and bright sources without generating large number of false alarms.
As the pre-processing is an expensive procedure and becomes the bottleneck during online data generation, we adopt a local update strategy to reduce its computational cost.Prior to the injection of synthetic sources, the whole dataset is pre-processed, i.e. centered and spatially whitened, see Fig. 1 (b).We denote by  the pre-computed cube.After each batch  of injections, the set S [] of locations impacted by the signal of the  [] sources is determined.Outside S [] , the pre-processed images are obtained from the temporal permutation of .Inside S [] , the statistics of the nuisance component are updated given the contamination of the  [] injected sources, and the pre-processed images are updated with these refined statistics.At this intermediate stage, each training sample q  [] is obtained by: , for  ∈ S [] ∩ P , P []   , for  ∈ P − S [] ∩ P . (10) This dual strategy, illustrated by Fig. 4, is applied to prevent any detection bias (i.e., an over-estimation of the actual detection performance of the proposed algorithm) since we have shown in previous work on the PACO algorithm (Flasseur et al. 2018b) that the statistics of the nuisance component can suffer from a (slight) bias, in particular at short angular separations and/or when the total amount of parallactic rotation is low.This slight potential bias is due to the contamination of a source whose signal is partly encoded both in the mean and in the spatial covariances of the nuisance component.Finally, the intermediate images of each training sample are derotated with the opposite of the parallactic angles so that signal of the synthetic sources are spatially co-aligned along the temporal axis: ( where D is a derotation operator.This derotation step is mandatory to perform a semantic segmentation with the CNN we consider (see Sect. 2.2.2) given the limited spatial extent of its receptive field.
The binary ground-truth segmentation map  [] is obtained by setting to 1 circular areas centered at the locations {  } =1: [] of the  [] injected sources.Other regions of  [] are set to 0. The radius of the circles is set to the full width at half maximum of the off-axis PSF, which corresponds to the expected spatial extent of an exoplanetary signature in the data.
A schematic summary of the construction of the training set is given in Fig. 2.

Model and architecture
Deep convolutional neural networks reach state-of-art performance in solving pixel-wise classification tasks for various imaging fields  in the presence of injected training sources from pre-computed (i.e., before injections) centered and whitened observations .The spatial scale is not respected for the purpose of illustration.
including microscopy, astronomy, medical imaging or remote sensing.A large variety of model architecture has been studied in the literature (e.g., auto-encoder (Badrinarayanan et al. 2017), VGG (Simonyan & Zisserman 2015), ResNet (He et al. 2016)) and their performance often rely on an intricate trade-off between model complexity, the amount of data available for training, and their fidelity with the data used at test time.A common feature of some classical deep architectures is to encode the diversity of the training samples in a low dimensional subspace by transforming the network input with a cascade of convolution and downsampling operations.Starting from this latent representation, the initial image size is retrieved by a decoder performing reverse transformations with a cascade of deconvolution and upsampling operations.
We also based our model on the above-mentioned category of architectures.We chose a U-Net (Ronneberger et al. 2015) with a ResNet18 (He et al. 2016) as encoder backbone (≃11 millions of free parameters), which is an architecture widely used for image segmentation.Its residual connections preserve of the input's spatial information along the cascade of convolution and downsampling operations thanks to a direct mapping of the output of each layer of the compression arm into the corresponding layer of the decompression arm.We use the architecture implemented in the SMP package 6 .The encoder and decoder parts are formed by four blocks, each one is composed by a series of convolution layers, batch normalization layers, rectified linear unit (ReLU) activation, and max pooling layers. 6The SMP package containing the network architecture used in this paper is available at https://github.com/qubvel/segmentation_models.pytorch.
The final layer of the network has a sigmoid activation function to produce a detection map  ∈ [0; 1]  .The detailed description of the architecture, the number of parameters per layer, and the input / output shapes of each layer can be found at the above mentioned link.
The network weights are trained from scratch with the samples generated with the procedure described in Sect.2.2.1.Initial weights are drawn uniformly through a He-Kaiming distribution (He et al. 2015).The SMP package also provides pre-trained weights.Pre-training is performed with the ImageNet dataset (RGB conventional images) either in a supervised, semi-supervised, or weakly-supervised learning fashion (Yalniz et al. 2019).In case of pre-training, the weights of the first convolutional layer are replicated in order to match the -depth of our inputs.We compared all of these strategies against a supervised learning from scratch with our custom training set (see Sect. 2.2.1).We found similar performance with all approaches, and opted for an end-to-end learning.

Loss function and accuracy metrics
The design of loss function used for optimizing the network weights at training time is driven by three criteria: (i) handling with the strong class imbalance (the number of background pixels being much larger than the number of pixels from the sources), (ii) being computationally efficient, (iii) matching the astrophysical goals (i.e., having a measure close to a detection accuracy score).We compare losses classically used for semantic segmentation, such as the binary crossentropy (BCE), ℓ 1 and ℓ 2 norms, mean square error and Hinge loss.We have also compared losses based on a similarity measure such as the Dice score (Milletari et al. 2016) and hybrid losses combining at least two individual loss measurements (e.g., BCE with Dice score).Our experiments have consistently shown better performance with Dice-based scores.We selected the Dice2 loss (the 2 means for two classes), first introduced for biomedical imaging segmentation with very unbalanced classes (Sudre et al. 2017;Wang et al. 2020).Given a training set of ground-truth and predicted detection maps { [] ;  [] }, the Dice2 score is defined by: L detect. [] ,  [] where  is a minimum-value smoothing and stability parameter added to avoid division by zero.Targeted loss property (i) is satisfied since errors in the source and background areas are penalized equally regardless the relative occurrence of these two classes in  [] .Property (ii) is also satisfied, and we illustrate numerically in the two following paragraphs that property (iii) is also reached.At validation time, we evaluate the ability of the model to detect point-like sources while simultaneously avoiding false alarms at best as possible.In other words, we aim to obtain a model obeying a precision-recall trade-off.For a predicted detection map  [] in [0; 1]  thresholded at  in [0; 1], we count the number of true positives (TP, i.e., true detections), false positives (FP, i.e., false alarms) and false negatives (FN, i.e., missed detections).Following standard practice in direct imaging (see e.g., Flasseur et al. (2018b); Gonzalez et al. (2018); Cantalloube et al. (2020)), detections are treated as blobs of one resolution element radius which corresponds to the expected spatial extent of an exoplanetary signature in the data.From TP, FP, and FN, we derive the true positive rate (TPR, i.e., the recall), the false discovery rate (FDR, i.e., the precision), and the F1R score, which is the harmonic mean between TPR and FDR, that we use as an overall measure of the precision-recall trade-off: From TPR, FDR, and F1R, receiver operating curves (ROCs; Kay (1993)) are built.ROCs are obtained by evaluating the figures of merit defined in Eq. ( 13) as a function of the detection threshold .Finally, the area under the curve (AUC) for the F1R score is computed as an aggregate score of the model performance (best when close to 1).Gonzalez et al. (2016Gonzalez et al. ( , 2018)) Figure 5(a) displays some examples of detection maps obtained at validation time for the best validation epoch.These maps illustrate qualitatively the ability of our model to detect synthetic sources while simultaneously avoiding false alarms.Figure 5(b) shows the evolution of the empirical risk (see Eq. ( 12)) at training and validation time as well as the evolution of the F1R accuracy metric (see Eq. ( 13)) at validation time.The loss function does not exhibit significant discrepancy between training and validation steps and the convergence is reached in a few epochs 7 .Besides, the accuracy score is high and well anti-correlated with the loss.This latter observation illustrates that the loss function is a satisfactory estimate of the overall accuracy metric (see targeted property (iii) at the beginning of Sect.2.2.3).Finally, Fig. 5(c) gives an illustration of validation ROC obtained for the best epoch (symbolized by a gray vertical line in Fig. 5(b)).The validation ROC confirms the good ability of the trained model to discriminate (synthetic) point-like sources from the nuisance component.

Implementation details
Pairs of samples { (  [] ;  [] } are generated on the fly at training and evaluation time following the procedure described in Sect.2.2.1.To avoid over-fitting, each realization  is unique with no repetition for the different epochs.The notion of epoch is used only as a way to evaluate regularly the performance of the model with the validation procedure described in Sect.2.2.3, and also to adapt the learning rate through a pre-defined scheduling.The optimization of the network weights is performed with an iterative stochastic gradient-descent strategy on mini-batches of samples formed (possibly) by the concatenation of multiple training sets.It means that, at each iteration, the model weights are updated in the opposite direction to the gradient of the loss.The loss is evaluated from the current mini-batch of samples with respect to the model weights.Since we work on series of  images, with  typically lying between 50 and 100 images, the batch-size (i.e., the number of training sets comprised within a mini-batch) is fixed at 1 given memory constraints.This setting does not affect the overall performance of the method and only requires to perform more iterations to reach convergence since the cost function is quite noisy, see Fig. 5(a).Even under this setting, the convergence is typically reached in a few epochs, see Sect. (  [] ;  [] } are generated and fed sequentially as input of the network.For each validation epoch,  is fixed at 10 given the computational burden required to build ROCs representing the F1R score as a function of the detection threshold .The training process stops when the accuracy metric AUC F1R (i.e., AUC under ROC representing the F1R score as a function of the detection threshold  at validation time) evolves in less than 2% during the ten previous epochs.The model optimization is performed with the adaptive gradient algorithm AMSGrad (Reddi et al. 2019) which is a variant of the Adam (Kingma & Ba 2014) optimizer with a longer-term memory of past gradients.The parameters of the optimizer and of the scheduler have been fine tuned on two datasets and are kept constant for all our experiments.In practice, we observed that the optimized values are quite robust with respect to the dataset diversity.The weight decay8 is fixed at 10 −5 and the initial learning rate is set to 10 −3 with a regular decrease by 10% every 10 epochs.The optimization of the network weights is performed with the high-performance deep learning library PyTorch (Paszke et al. 2019) on GPUs server with NVIDIA system equipped with either Tesla V100 or GTX 1080 Ti cards.The pre-processing step is highly parallelized and has a double implementation so that it can performed either on CPUs or on GPUs depending on the number of available CPUs cores and on the server specifications.

SOURCE CHARACTERIZATION ALGORITHM
Once sources have been detected, they can be characterized in terms of astrometry and of photometry.In this section, we present a new method based on supervised deep learning to estimate the photometry of detected point-like sources.The sub-pixel estimation of the astrometry is not addressed in this paper because it requires a sub-pixel estimation of the detection criterion as well as statistical guarantees on its significance.These specific developments are left for future work, and the proposed characterization algorithm can be use to estimate the photometry at the pixel barycenter of (candidate) sources revealed by the detection stage presented in Sect. 2. As in Sect.2, we successively discuss the pre-processing stage, the formalization of the problem as a regression task, the model and the underlying architecture, the metric we use for training and evaluation, and some implementation details.

Pre-processing aspects
We adopt a simple patch-based approach, in which we predict the flux of a putative source from a unique (reduced) patch centered on the approximate location of the source.We propose to parameterize the mapping between the input patch and the flux with a CNN trained in a supervised fashion, from the dataset of interest (i.e., the underlying model is data-dependent, as for the detection stage of this paper).
The dataset is first reduced to a single frame, from which the input patches are extracted.This pre-processing consists in three steps.First, the temporal mean is computed and subtracted pixel-wise in order to remove most of the quasi-static speckles.Second, the dataset is derotated by the opposite of the parallactic angles to co-align the signal of the sources.Third, the dataset is averaged temporally, resulting in a single averaged frame.This last step allows to obtain an efficient training procedure as it reduces the size of the input data by a factor . Besides, we observed empirically that this step is beneficial to improve the estimation accuracy as it acts as a simple denoiser: the source signal is constant along time, while residual speckles are not quasi-static after cube derotation, thus canceling out.Keeping [D q ]  , (steps 2 and 3) . ( In this framework, and unlike the detection stage, we do not apply a whitening of the spatial correlations at step 1.Indeed, we observed empirically that whitening the dataset between steps 1 and 2-3 degrades the performance of our model, as illustrated by Fig. 6.More quantitatively, the absolute error of estimation is increased, whatever the angular separation, by a factor between three and five.Besides, keeping a whitening step for photometry estimation does not allow to obtain better results than PACO for most of the field of view.This is expected as the whitening distorts the shape and the norm of the source signal, thus hampering the recovery of its flux.The detection algorithm is not subject to this constraint as its task is to determine whether a source is present or not, regardless of its flux.This fact illustrates that deriving a quantitative result (as a flux estimate) is a more complex task than providing a qualitative result (as related to the presence or to the absence of a source) with our algorithmic setting.We can expect that building the model from several datasets of observations (instead of a single one in this work) would relax these constraints.These specific developments are left for future work, see also Sect. 5 for a discussion.After pre-processing, square patches (  [ ] ∈ R  are finally extracted around the location of each injected synthetic source  during the training and validation steps, or around each (candidate) real point-like source  at inference time.The patch size  is an hyper-parameter whose setting is discussed in more details in Sect.3.2.4.

Construction of the training basis
As with the detection procedure of Sect.5, we resort to massive injections of synthetic sources with various fluxes to build our training basis.
Prior to the injections, the first step consists in masking any real and/or synthetic detected sources that we aim to estimate the photometry at inference time.This operation prevents, as best as possible, data leakages between training and test sets so that training patches do not contain any pixel from patches considered at inference time.In practice, source masking is performed by replacing, for each temporal frame, the local area impacted by the signal of the sources of interest by their pixel-wise temporal mean.Like for the generation of the training basis of the detection stage, we also apply, as a data-augmentation strategy, a random permutation of the temporal frames prior to the construction of a training set  from which training samples with injected sources are extracted.
We then build each training set  by injecting a dozen of synthetic sources, with a relative flux (i.e., exoplanet to star contrast) ranging from 1×10 −6 to 3×10 −5 with respect to the flux of the host star.Given the current instrumental and processing performance in direct imaging, this setting corresponds to sources with relatively low flux, for which the estimation is usually the most flawed.This operation range can be easily modified in the algorithm to characterize (expected) sources with a lower or higher contrast level, if needed.The contrast of synthetic sources is drawn uniformly in the above-mentioned range, regardless of the angular separation.After injection, each training set is pre-processed and the input patches are extracted following the method described in Sect.3.1.This procedure is repeated to get a prescribed number  of training patches { q  [ ] ∈ R  } =1: .The number of training patches, hence the total number  of injected synthetic sources, is an additional hyper-parameter whose setting is discussed in more details in Sect.3.2.4.

Model and architecture
We built a custom network based on VGG, an architecture initially proposed for image classification (Simonyan & Zisserman 2015).The underlying model has 1.2 million of free parameters, and its detailed architecture is described in Table 2.We use a stride9 of one for all convolutional layers.We have also tested alternative models, both with a deeper and shallower architecture, all leading to worst estimation performance than the selected one.In particular, we experienced a significant degradation of the performance at short angular separations with deeper architectures.The later are the more subject 3 for the definition of the metric) on the estimated photometry is reported as a function of the angular separation, for  ∈ {400,000; 40,000; 20,000; 4,000} sources.The results are averaged azimuthally for sources of flux drawn uniformly between 1 × 10 −6 and 3 × 10 −5 .The mean ARE (denoted by ARE) averaged over the angular separation is also reported as an overall measure of the performance.The known real sources were masked out and were not considered.Datasets: the eleven SPHERE-IRDIS datasets considered in this work, see Sect. 4 for the recording logs.
to over-fitting (due to the increase in terms of model complexity), especially in the absence of a whitening procedure preventing memorization of the nuisance structures by the network.
From an input patch (  [ ] ∈ R  , the network produces a single scalar  [ ] ∈ R + , representing the estimated source's photometry.

Loss function and accuracy metric
Our choice of the loss function L carac. is driven by the two following criteria: (i) being computationally efficient, and (ii) matching the astrophysical goals.We chose the absolute relative error (ARE) between the ground-truth and the predicted photometry, which is a classical loss for regression problems: [ ]   . (15) The ARE has the advantage of giving the same contribution to each individual source  regardless of its flux, when it is averaged over multiples ones.Since this metric is computationally very efficient, we also use it to evaluate the overall performance of the method at validation time.

Implementation details
In this section, we successively discuss the setting of the patch size, the number of training sources, the sampling strategy of injected synthetic sources, and some optimization aspects.
Due to the pre-processing described in Sect.3.1, part of the signal of the sources can be encoded in the temporal mean that is subtracted to the full frames in order to attenuate the quasi-static speckles.
As a result, a (negative) contribution, taking the form of an arc, can spread out along the trajectory of the sources in the reduced frame.This well-known phenomenon in direct imaging is usually referred as self-subtraction.As a consequence, we can expect that the performance of the predictor would increase with the patch size , to be able to capture the (extended) signature of sources induced by self-subtraction.Besides, increasing the patch size increases the context (i.e., local realizations of the nuisance component) that could be beneficial to unmix the different contributions.As shown by Fig. 7, increasing the patch size is indeed beneficial as it reduced the mean relative error of estimation.The gain is more pronounced as the angular separation increases, which could be due to the larger extent of the self-subtraction signature (as the apparent motion of sources induced by ADI increases with the angular separation).However, large patches are not convenient in the case of adjacent sources, as both signals will be contained in both input patches.As a trade-off, we chose a patch size of  = 31 2 pixels, as it encompasses the core of the signal of the source without being impractical when multiple sources are present in the field of view.
The number  of synthetic sources (which also corresponds to the number of patches) used at training time is an additional hyperparameter obeying a trade-off.On the one hand, it should be large enough to be representative of the variety of real sources in terms of flux and locations.On the other hand, it should be sufficiently small to avoid over-fitting on the training set.These expected behaviors are confirmed by numerical experiments presented in Fig. 8.The error at small angular separations increases with  since this is the area of the field of view the more subject to data leakages when generating multiple samples from a few tens of pixels only.The overall performance also degrades when  is too small.Based on this study, we include  = 40, 000 patches in our training set since it leads to the smallest ARE averaged over the whole of view.
Concerning the source's sampling strategy, we are interested at evaluation time in assessing the performance of the model per angular separation.As such, it is natural to sample test sources uniformly in the polar coordinates system.However, polar sampling is detrimental during the training phase, as pixels at short angular separations would be over-represented in the training set, leading to an over-fitting of the model in this area.It can be noted that this effect does not occur in the detection stage of the proposed approach given that the training sets consist of full frames of pre-processed observations; each pixel of the field of view being equally represented in the training base.We experimentally observed that sampling training sources uniformly in the Cartesian coordinates system reduces significantly over-fitting at short angular separations, without degrading performance in the rest of the field of view.As a result, we opt during training for a uniform sampling of the coordinates of synthetic sources in the Cartesian system.
Concerning the optimization process, pairs of samples { (  [ ] ;  [ ] } are pre-generated before training and evaluation.As for the detection stage, each realization  is unique to limit over-fitting as best as possible.Unlike the detection stage, the characterization stage operates on a single patch (instead of the  temporal images), thus allowing to chose a batch size higher than one to improve the stability and to reduce the computation time of the optimization process.In practice, the batch size is fixed at 1024.The number of epochs is fixed at 300, which shown to be sufficient to reach convergence of the network weights for all the considered observations.3. Observing parameters of ADI sequences from the VLT/SPHERE-IRDIS instrument considered in this paper.Columns are: target name, ESO survey ID, observation date, spectral filter , number  of available temporal frames, total apparent rotation Δ par of the field of view, number NDIT of sub-integration exposures, individual exposure time DIT, average coherence time  0 , average seeing, and the first paper reporting analysis of the same data.All the observations are performed with the apodized Lyot coronagraph (Carbillet et al. 2011) of the VLT/SPHERE instrument. (a) Since the EIDC aimed to perform a blind benchmark, information that would allow to identify the datasets are not known. (b) This dataset was recorded with the star-hopping technique recently available for the VLT/SPHERE instrument (Wahhaj et al. 2021) and its analysis is not reported yet.We do not exploit the dataset associated to the observation of the reference star.The dataset associated to the target star (HD 95086) is processed like all other datasets considered in this paper. (c) This dataset is only used for additional experiments conducted in Appendix A2.Ti cards.The pre-processing step is highly parallelized, and it is run on CPUs.

Datasets description and reduction strategies
For our comparative analysis, we have selected 15 datasets recorded with the SPHERE-IRDIS instrument.Three of the 15 SPHERE-IRDIS datasets were extracted from the exoplanet imaging data challenge (EIDC) initially designed to ground the detection performance of existing post-processing algorithms for high-contrast imaging (Cantalloube et al. 2020).These datasets are used as a sanity check to assess the ability of the proposed algorithm to detect injected sources at moderate levels of contrast.
To study in more details the performance of the proposed method, we selected 12 additional datasets, mostly part from the SHINE survey of the SPHERE-IRDIS instrument (Desidera et al. 2021;Langlois et al. 2021;Vigan et al. 2021).They were obtained by the observation of the following stars: -HD 95086, a A8III type star of the Carina constellation, hosting an exoplanet discovered by direct imaging with the SPHERE instrument (Rameau et al. 2013a,b).Ten known background point sources are also within the SPHERE-IRDIS field of view.In addition, based on the analysis of PACO and deep PACO detection maps, we have identified two additional (unbounded) candidate point-like sources.Given their unknown status, we exclude these sources from our general analysis (i.e., there are not considered as true detections or false alarms).We briefly discuss there status in Sect.4.2.3.
-HIP 88399, a F6V type star of the Vela constellation, without known bounded exoplanet.However, six faint background sources fall within the SPHERE-IRDIS field of view.
-HD 131399 A, a A1V type star of a triple system located in the Cen-taurus constellation, with a known faint point source (HD 131399 Ab) discovered by direct imaging (Wagner et al. 2016).While first supposed to be an exoplanet, follow-up analysis of its astro-photometry show that HD 131399 Ab is more likely a background brown dwarf (Nielsen et al. 2017).Besides, a bright background star falls within the SPHERE-IRDIS field of view.
-HIP 65426, a A8III type star of the Carina constellation, hosting an exoplanet discovered by direct imaging with the SPHERE instrument (Chauvin et al. 2017).
-HIP 72192, a A0V type star of the Lupus constellation, without known bounded exoplanet.However, two faint background sources fall within the SPHERE-IRDIS field of view.
Considered SPHERE-IRDIS observations were obtained in H2-H3 (i.e.,  H2 = 1.593 µm,  H3 = 1.667 µm) or in K1-K2 (i.e.,  K1 = 2.110 µm,  K2 = 2.251 µm) dual spectral bands.The main parameters of each observation are summarized in Table 3.The diversity in the experienced observing conditions is quite representative of the SPHERE observations.The raw observations were pre-reduced10 with the DRH pipeline (Pavlov et al. 2008) of the SPHERE instrument, which performs thermal background subtraction, flat-field correction, anamorphism correction, compensation for spectral transmission, flux normalization, bad pixels identification and interpolation, frame centering, true-North alignment, wavelength calibration, astrometric calibration, and frame selection.These operations are complemented by custom routines implemented in the SPHERE data center (Delorme et al. 2017), in particular to improve bad pixels correction.Finally, the SPHERE data center combines the pre-reduced observations and delivers the calibrated ADI datasets we consider in this work.ADI reduction is performed by considering the first spectral channel (i.e., either at  H2 = 1.593 µm or at  K1 = 2.110 µm) given that the contrast is significantly more favorable in this channel than in the second one.
To ground the performance of the proposed algorithm, we resort to massive injections of synthetic sources, as done to train the deep model of the proposed algorithm, see Sect.2.2.1.Given the simplicity of the signature of the sought objects (i.e., taking the form of a blob spatially correlated over only a few pixel width), we did not find significant bias in using the same injection procedure for the training and evaluation steps.Injections of synthetic sources in the EIDC benchmark were performed by there authors with the VIP pipeline.
The detection performance of the proposed method are compared in Sect.4.2 with the cADI, PCA, and PACO algorithms, (see Sect. 1 for their respective principle).For cADI, we have re-implemented the original method (Marois et al. 2006) based on a full-frame estimation of the off-axis PSF and of the S/N map, i.e., without angular-specific processing.We have also used the refined implementation of cADI available in the VIP package (Gonzalez et al. 2017), which includes a protection angle strategy accounting for a minimal field rotation between successive images when building the off-axis PSF in order to limit the self-subtraction effect.After computation of the offaxis PSF, a S/N map is derived by accounting for an annular-based estimation of the noise in the residual images.We also applied the VIP implementation of the PCA-based algorithm combined with the same protection angle strategy and the annular-based computation of the S/N.For PCA reductions, the number of modes has been optimized in annuli by maximizing the S/N of synthetic sources with similar ranges of contrast than the ones we consider for our comparisons.The other parameters of the VIP implementation of cADI and PCA are less critical and are fixed at pre-set values (Gonzalez et al. 2017).For PACO, we performed the data reduction with our fully unsupervised processing pipeline (Flasseur et al. 2018a).
Concerning the photometry estimation, we compare in Sect.4.3 the performance of the proposed method with PACO and the VIP implementation of the PCA.PACO parameters are estimated automatically in a data-driven fashion.In a nutshell, the characterization of a point-like source  is performed by a joint estimation of (i) the statistics (i.e., mean   and covariances C  ) of the nuisance component  , and (ii) of the photometry (i.e., flux   ) of the given source, see Eq. ( 1).For the PCA, we resort to a similar procedure than for the detection step to set the parameters.In particular, the number of modes is optimized for each injected source to be characterized by maximizing its S/N.Once the setting fixed, the flux of a given source is estimated by minimizing the residuals through the injection of negative fake companions (Wertz et al. 2017).This is performed with a two steps procedure, as recommended in the VIP package: (i) a first guess estimate is obtained by performing a grid-search, (ii) a local optimization is performed with a Nelder-Mead simplex algorithm (Nocedal & Wright 1999).Given computational constraints (largely dominated by the PCA), the exact (known) sub-pixel location   of each injected source  is provided to the different algorithms (i.e., it is not optimized), and the photometry is estimated at this groundtruth position.When estimating the photometry of real sources, both the astrometry and the photometry are optimized by the different algorithms.
Table 4. Mean results of AUC for ROCs giving the TPR as a function of the FDR.The scores are averaged over the eleven SPHERE-IRDIS datasets considered in this paper, see Sect. 4 for the observation logs.Only the 59 known real sources present in these datasets were considered.Figure 7 of (Flasseur et al. 2022) display the corresponding ROC from which these mean results were aggregated.

Detection of known real sources
A first classical sanity test to evaluate the detection performance of a post-processing algorithm is to study qualitatively its ability to redetect real known sources initially detected with different algorithms, and possibly from different datasets.We present detailed results for one dataset (HD 95086, 2015-05-05, see Table 3) selected among the eleven SPHERE-IRDIS observations we consider because HD 95086 is the star having the larger number of known real sources in the SPHERE-IRDIS field of view.Results obtained for the ten other datasets are given in supplementary material.Figures 9 and 10 give detection maps produced with the five tested algorithms.The detection threshold is set to  = 5 for the algorithms producing a S/N map (i.e., cADI, cADI (VIP), PCA (VIP), PACO), and to  = 0.5 for the proposed method producing a pseudo-probability map.Due to the binary pixel-wise classification task we consider for the training step of the proposed method (see Sect. 2.2), its detection map is almost binary (i.e., each pixel value is close either to 0 or 1) so that the setting of the threshold  is quite flexible.Based on the analysis of the detection maps, PACO and the proposed method lead to the best qualitative results since there are the only algorithms able to detect all real known sources without any false alarm in most of the field of view.With the proposed method, only two false alarms occur very near the borders of the field of view due to the limited number (much lower than ) of temporal samples available in this area to build a consistent model of the nuisance.This claim is also supported by the PACO detection map that displays a few false alarms in the same area.
Figure 11 gives a more quantitative analysis of the previous results through ROCs representing the TPR as a function of the FDR (see Sect. 2.2.3 and Eq. ( 13)) for the same dataset of HD 95086 (2015-05-05).This type of representation gives a comparison of the precisionrecall trade-off reached by each method, regardless the detection quantity (S/N or pseudo-probability) they produce.These curves are obtained by counting the number of true positives (TPs), and false alarms (FAs) for the full range of possible detection thresholds, i.e.  ∈ [0; 1] for the proposed method, and  ∈ [min( ); max( )] for cADI, PCA, and PACO.Table 4 presents averaged results over the eleven SPHERE-IRDIS datasets we consider in this study, and the detailed scores for each dataset are given in Table 1 of the supplementary material.These results illustrate the benefits of the proposed method in terms of precision-recall trade-off: the AUC under ROC is improved by at least 7% with respect to the comparative algorithms.
As a final study based on the detection of known real sources, we evaluate the importance of our pre-processing step by resorting to model ablation.Removing the whitening procedure and keeping only the temporal centering in the pre-processing step does not allow to reach convergence of the network weights at training time.This is due to the high dynamics and to the high spatial non-stationarity of the residual images.We also test to account only for the pixel variances in the whitening procedure, i.e. we neglect the spatial co-   variances so that matrices S  in Eq. ( 2) are considered diagonal.Figure 12 compares the precision-recall trade-off of this downgraded model to the model of the proposed approach (accounting locally for the spatial covariances).When neglecting covariances, the overall precision-recall trade-off of the detector is decreased by 8% in average and the sensitivity of detection is especially lowered for low false discovery rates (which is, in practice, the most useful regime in high-contrast imaging).While several whitening methods could be used in conjunction with our method, this study confirms the importance of our custom whitening procedure accounting for the spatial covariances of the nuisance in order to reach the best performance of a detector built by supervised deep learning.This observation is also in agreement with studies performed in other works through alternative whitening procedures.For instance, the SODINN algorithm (see Sect. 1) that also trains a CNN to perform a detection task by supervised deep learning is sometimes prone to a large false alarm rate (Cantalloube et al. 2020).Likely, this side effect is (at least in part) due to the embedded pre-processing step that builds an empirical model of the nuisance component through PCA; an approach that does not explicitly model the spatial covariances of the nuisance, thus leading to spatially non-stationary residual images used at training time.

Detection of synthetic sources
In this section, we evaluate quantitatively the detection performance of the proposed method with ROCs and contrast curves built from massively injected synthetic sources.As a first step, we apply the proposed detection algorithm on the three SPHERE-IRDIS datasets from the public EIDC data challenge (Cantalloube et al. 2020).The resulting detection maps are given in Fig. 13.Using the same procedure than Cantalloube et al. (2020) to benchmark 22 post-processing algorithms (including cADI, PCA, and PACO), we compute the F1R score at the set detection threshold ( = 0.5), and ROCs for the TPR and the FDR scores, as defined in Eqs. ( 13), by varying the detection threshold.The AUC is then computed from ROCs.Table 5 summarizes the results we obtained with the proposed algorithm.As a purpose of comparison, we also report the PACO results that have been published in Cantalloube et al. (2020).It emphasizes the interesting precision-recall of the proposed approach that performs, on these datasets, on par with or better than the 22 post-processing algorithms considered in the EIDC data challenge.However, these results should be taken with caution since they are based on a few datasets, with only six injected sources at relatively bright levels of contrast.Besides, several algorithms (including PACO) are also able to detect the injected sources without any false alarm in the field of view for a sufficiently large detection threshold.At this stage, the better performance of deep PACO in terms of false alarms rejection (quantified by the AUC FDR metric) are mostly explained by the fact that the proposed algorithm produces detection maps with almost binary values.It can be noted that this effect is also encountered for all the algorithms of the EIDC data challenge producing the same type of outputs like RSM (Dahlqvist et al. 2020) or SODINN (Gonzalez et al. 2018).
To ground in details the precision-recall trade-off of the proposed detection algorithm, it is necessary to build consistent ROCs and contrast curves by resorting to massive injections of synthetic sources -unknown at training time but that we aim to detect at inference time-for various levels of contrast.For that purpose, the most realistic procedure, hereafter called reference procedure, consists in (i) splitting the whole set of synthetic sources in small subsets, (ii) injecting synthetic sources of one subset in the dataset of interest so that injected fake sources mimic the behaviour of real (possibly unknown) sources, (iii) training the detection model with the procedure described in Sect.2, and (iv) applying the trained model to the dataset containing the synthetic sources injected in step (ii).Steps (ii) to (iv) are repeated for all subsets of synthetic sources.This procedure simulates the real situation when we face a new dataset with real unknown exoplanets that we aim to detect at inference time.Due to the Table 5.Detection scores: F1R at detection threshold  = 0.5 (the higher, the better), AUC under the ROC representing the TPR as a function of  (the higher, the better), and AUC under the ROC representing the FDR as a function of  (the lower, the better).Scores reported for PACO are extracted from the EIDC data challenge (Cantalloube et al. 2020), and deep PACO scores are computed with a similar procedure for the three SPHERE-IRDIS datasets from the EIDC.
- (a) Metrics can not be computed since there is no injected source in this dataset.computational burden of step (iii), repeating this full procedure for all subsets of synthetic sources that we aim to detect at inference time is not realistic to build ROCs and contrast curves.To circumvent this issue, we resort to a proxy procedure: instead of training a different model for each subset of injected sources, we train a unique model without synthetic sources mimicking the behaviour of real (possibly unknown) sources.Synthetic sources are injected a posteriori, i.e., after completing the training step (iii), and the trained model is applied on the resulting dataset.Figure 14 illustrates the principle of the reference and proxy procedures.The proxy procedure leads to an improvement in terms of algorithmic complexity by a factor equal to the number of subsets (typically 2,000, with in average 5 synthetic sources in each subset), that is determinant to be used in practice.At this stage, we still need to show that the proxy procedure we defined leads to reliable results so that our comparisons with state-of-the-art algorithms are fair.In particular, we aim to show that the model resulting from the proxy procedure is not prone to over-fitting induced by an imperfect separation between training and testing data.In the proxy procedure, over-fitting could possibly occur if the model partially memorizes the nuisance component that is seen by the network without the injected sources we aim to detect at inference time.Figure 15 shows examples of detection maps obtained with the procedure of reference described previously and its proxy version on the three datasets of HD 95086 (2015-05-05, 2018-02-23, 2021-03-11) considered in this work.In these experiments, we considered more than 700 synthetic sources spread over the whole field of view.Table 6 compares quantitatively the two approaches in terms of detection performance.These results show that the proxy procedure leads to reliable and conservative estimations of the overall performance of deep PACO.In addition, since synthetic sources mimicking the behaviour of real unknown sources are recovered with comparable rates between the two procedures, it emphasizes that the custom data-augmentation strategy of the training step, including a random permutation of the images of the data series, is efficient to circumvent the absence of ground-truth about real sources.In the following, we safely use the proxy procedure to compare the detection capability of the proposed deep PACO algorithm with other post-processing methods.

IRDIS 1 IRDIS 2 IRDIS 3 mean
Following the previously defined proxy procedure, we present detailed results obtained on HIP 88399 (2018-04-11), which is a SPHERE-IRDIS dataset representative of the mean results we obtained over the eleven ones we consider in this work.Results for the ten other datasets are reported in the supplementary material.Figure 16 shows detection results on a sample of 10,000 synthetic sources in a diagram contrast versus angular separation for PACO and the proposed method.Each synthetic source is classified as missed, true, or false detection using the detection thresholds defined in Sect.4.2.1.For PACO, setting the detection threshold at  = 5 corresponds, in average, to a realistic control (Flasseur et al. 2018a(Flasseur et al. ,b,c, 2020a) ) of the Table 6.Comparison between the reference procedure and its proxy version for evaluation of the detection performance of the proposed algorithm.Synthetic sources are classified as true, missed, or false detections using a detection threshold at  = 0.5.Mean detection results are averaged for the three datasets of HD 95086 (2015-05-05, 2018-02-23, 2021-03-11) probability of false alarms (PFA) at 5 (i.e., PFA ≃ 3 × 10 −7 ).While the PFA should theoretically be controlled by the other algorithms producing a S/N map (cADI and PCA), we have shown in previous works (Flasseur et al. 2018a(Flasseur et al. ,b,c, 2020a) ) that the contrast curves are over-optimistic for these algorithms (i.e., there are significantly more false alarms than expected) due to a miss-modeling of the nuisance component.This claim is also supported by the detection maps given in Figs. 1 to 20 of the supplementary material for which the number of experienced false alarms is significantly higher than expected at S/N = 5.For the proposed method, converting pseudo-probabilities into S/N scores is not feasible given that the pseudo-probabilities are very close either to 0 or 1 due to the underlying binary pixel-wise classification task considered at training time.For this reason, we can only check empirically that the targeted false alarm rate at 5 is satisfied.To do so, we capitalize on our experiments with synthetic sources by counting the number of false alarms, i.e. detection blobs above the threshold  = 0.5, that do not correspond to the location of an injected synthetic source.The number of false alarms is then converted into an empirical PFA by dividing the number of counts by the total number of possible detection blobs (each with a radius of one resolution element, i.e. four pixels radius for SPHERE-IRDIS observations) within the detection maps.By applying this procedure on several dozens of detection maps obtained with the proposed algorithm, we experienced in average a lower or equal empirical PFA than statistically expected at 5.As a conclusion of this study, the contrast estimates we will derive for the proposed approach can be fairly compared with the PACO results since they correspond to similar probabilities of false alarms and of detections.Figure 16 illustrates the capability of the proposed method to detect fainter sources than PACO.From the large number of synthetic sources classified as true, missed and false detections in Fig. 16, we now derive the contrast curve of each algorithm.Concerning the proposed approach, given that we showed empirically that the probability of false alarms is controlled at 5, it simply remains to compute the contrast level for which and equal amount of true and of missed detections is experienced.This procedure is repeated for the full range of angular separations with a sliding window of 0.05 arcsec wide.The same procedure is applied for the other algorithms.Figure 17 summarizes the resulting contrast curves obtained with the five considered algorithms.The proposed method achieves the best detection sensitivity with an improvement in contrast up to a factor four with respect to the PACO algorithm.We also compare the detection sensitivity of deep PACO with the fundamental detection limit driven by the photon noise.The procedure to compute the photon noise limit is based on a careful evaluation of the contribution of the different sources of noise (i.e., photon, thermal background, and detector readout noise) combined with a statistical evaluation of the underlying S/N.This procedure will be described in details in a paper currently in preparation.We observe that deep PACO can reach for some datasets (see e.g.HIP 88399 (2018-04-11) in Fig. 17, and HIP 88399 (2015-05-10) as well as HIP 88399 (2016-04-16) in the supplementary material) at large separations the best achievable detection limit driven by the photon noise, which corresponds to an optimal unmixing between the signal of the sources of interest and the nuisance component.Near the star, an important gap remains (by a factor 5 to 10) between the actual performance and the theoretical lower limit, which a sign of a lack of angular diversity in this area.In practice, the gap between the actual contrast and the lower bound limit depends on several characteristics such as the quality and the stability of the observing conditions, the total amount of parallactic rotation, the number  of temporal frames, etc., as illustrated in Figs.21 and 22 of the supplementary materials.Reducing this gap at close separation could be addressed by investigating ways to perform the training step of our model from various datasets for which the presence of sources at similar locations is very unlikely.This adaptation requires specific developments that are left for future work.
Figure 18 focuses on the comparison between PACO and the proposed deep PACO algorithm.It represents the TPR of deep PACO for synthetic sources missed by PACO (i.e., for S/N ≤ 5) as a function of the angular separation and of the PACO's S/N of detection.For instance, on this dataset about 65% of sources below 0.5 arcsec with a S/N of detection between 3.5 and 4.0 with PACO are detected with deep PACO (i.e., above the detection threshold  = 0.5).Similarly, more than 86% of sources between 2.0 and 4.0 arcsec with a S/N of detection between 3.5 and 4.0 with PACO are detected with deep PACO (above the detection threshold  = 0.5).For these two exam- since the targeted PFA is not reached.For that reason, the angular separations leading to a PFA locally higher than ten times the targeted PFA at 5 are not reported.The black dashed line represents the ultimate detection limit driven by the photon noise.Dataset: HIP 88399 (2018-04-11), see Sect. 4 for the observation logs.ples, achieving the same TPR with PACO would require to decrease the detection threshold at the price to an increase of the FPR up to a mean factor of 800.Figures 23 and 24 of the supplementary material give similar type of representation than Fig. 18 for the ten other SPHERE-IRDIS datasets analyzed in this work.
Figure 19 gives ROCs representing the TPR as a function of the FDR for the HIP 88399 (2018-04-11) dataset.There results are obtained with the 10,000 synthetic sources considered for results presented in Fig. 16.The results are split in four different angular separation ranges: [0; 2], [2; 4], [4; 6], and [6; 7] arcsec.Table 7 complements this study by presenting averaged results over the eleven SPHERE-IRDIS datasets we consider in this study, and the detailed scores for each dataset are given in Table 2 of the supplementary material.These results illustrate again the benefits of the proposed method in terms of precision-recall trade-off: the AUC under ROC is improved by 9 to 17% with respect to the best comparative algorithm for the four angular separation ranges we consider.

Identification of candidate background point-like sources
In this section, we take the example of a joint analysis of datasets of HD 95086 considered in this work to illustrate the ability of the proposed approach to detect candidate faint point-like sources.Figure 20 shows the detection maps obtained with PACO and the proposed deep PACO approach on the 2015 and 2018 observations of HD 95086.It emphasizes that two candidate point-like sources (respectively denoted CC-11 and CC-12) are detected with PACO at a S/N above 5 only for one of the two epochs (in 2018, which is also one of the best SPHERE-IRDIS observations of this star) while they remain just below the classical detection threshold at 5 on the 2015 epoch.These two point-like sources are detected with deep PACO on the same two epochs.Based on their astrometry, these two faint point sources are co-moving with the other background stars seen in the projected field of view of the instrument, so that they could be background stars too11 .These candidate point-like sources are not reported in the last detailed study of the HD 95086 architecture based on the analysis of ten SPHERE-IRDIS datasets (including the 2015 epoch but not the 2018 epoch) with classical post-processing algorithms (i.e., cADI, PCA, TLOCI), see Fig. 2 of Chauvin et al. (2018).The analysis of more recent and better datasets (including the 2018 one) in the SHINE survey (Langlois et al. 2021) of the SPHERE instrument allows to identify by visual inspection (i.e., not based on a strict measure of S/N above the classical detection threshold at 5) CC-11 while CC-12 has not been identified yet.The detection of the two CCs in the worst epoch data (2015-05-05) emphasizes again the benefits of the proposed deep PACO algorithm.This example also illustrates the complementarity of PACO and deep PACO.Even if deep PACO does not provide tight confidence scores (i.e., its outputs can not be directly interpreted in terms of a S/N), it allows to identify candidate point-like sources.Given the locations found by deep PACO, an estimation of the PFA can be derived from the S/N extracted on the PACO detection maps at the same locations.Using this procedure, the theoretical PFA (not including systematic sources of errors) for the co-localization of the candidate point sources CC-11 and CC-12 in the two epochs is small.None of CC-11 and CC-12 are detected in the 2021 epoch of HD 95086, as shown by Figs. 3 and 4 of the supplementary material.This observation is consistent with the fact that the achievable contrast is worse for this dataset than for the 2015 and 2018 observations, likely because the observing conditions were quite average for the 2021 observations and that only half of the total integration time was spent on the target star (the 2021's epoch being recorded with the starhopping technique, and the resulting reference dataset remaining not exploited in this paper).
PACO and deep PACO detection maps from the 2015 and 2018 epochs also display a few blobs, circled in red, above the detection threshold that are not consistently detected on multiple epochs.These detections are located very near the borders of the (extended) field of view.They are likely artifacts due to the lack of temporal samples in this area to estimate the model parameters.This hypothesis is also supported by the fact that PACO detection maps are not stationary and does not follow a centered Gaussian distribution with unit variance in this area.deep PACO detection map from the 2021 epoch display a blob at 0.25", that we likely identified as a false alarm since we have experienced a few false alarms in the same area during training time due to a strong stellar leakage (see Figs. 3,4,and 22(b) of the supplementary material).

Characterization results
In Sect.4.3.1,we ground the performance of the proposed photometry estimation module on the same datasets than the considered ones for the detection stage by resorting to massive injections of synthetic sources.For that purpose, we use the ARE metric (Eq.( 15)) averaged per angular separation.We also put in perspective the results of the detection and of the characterization stages to ground the global gain brought by the proposed algorithm.In Sect.4.3.2,we propose a fast (and approximate) evaluation procedure suited when large amounts of synthetic sources are considered.Finally, in Sect.4.3.3we briefly discuss and compare photometry estimations provided by our method with estimations coming from the literature on some known sources (either exoplanets, brown dwarfs or background sources) present in the eleven SPHERE-IRDIS datasets considered in this paper.

Characterization of synthetic sources
To assess the performance of our method, we reproduce a real setting, hereafter called reference procedure, in which we estimate the flux of injected synthetic sources representative of (possibly unknown) exoplanets.To do so, we first inject synthetic test sources in the considered dataset, and then extract the patches used for training the model.It is crucial to perform these steps in that order otherwise the model could have the opportunity to learn the residual nuisance component below the test sources, and a bias could be introduced.
Following this strategy, Fig. 21 shows the ARE score on the estimated photometry of synthetic sources massively injected with the reference procedure as a function of the their contrast and of their angular separation.Figure 26 of the supplementary material gives the same type of plots for the throughput (i.e., the ratio / between the retrieved and ground-truth source's contrast).In both cases, results are averaged over the eleven SPHERE-IRDIS datasets of this study.Figure 22 complements this study by aggregating the ARE score over the source contrast.These results show that the proposed algorithm leads, in average, to better characterization performance than PCA (VIP) and PACO for angular separations larger than 0.8", with a reduction of the ARE by a factor between 1.10 and 10 with respect to PCA (VIP), and by a factor between 1.10 and 5 with respect to PACO.Closer to the star, the advantage is on average to PACO and to the proposed approach, the best of the two algorithms depending on the dataset and of the source location.The fact that PACO can perform better than the proposed algorithm is due to the complexity to train a deep model, without leak between the train and the test sets, from a unique dataset of interest.This effect occurs only at short angular separations since this is the region of the field of view where generating multiple non-redundant training samples is the most tricky.As illustrated in Sect.4.2.2, this side effect does not occur in the detection stage of the proposed approach thanks to the included whitening procedure, which removes most of the quasi-static speckles (i.e., only residual structures non-captured by the statistical model remain).Except these residual structures that we aim to capture by deep learning, each new training set thus contain, before injection of synthetic sources, a quasi-random realization of uncorrelated Gaussian noise.This key property prevents a leak between the train and the test sets as well as the memorization of the nuisance structures by the network during its training.

Efficient (and approximate) evaluation procedure
When the massive injection of synthetic sources withing multiple datasets is needed to ground the performance of a built detector, the reference procedure described and applied in Sect.4.3.1 is computationally expensive.The computational bottleneck is related to the need to train a new model for every test dataset containing a dozen of synthetic test sources.As an illustration, 30 different models must be trained to get a test set of only 300 samples.In this context and in a similar fashion to the detection stage (see Sect. 4.2.1 and Fig. 14), we also propose a fast and approximate version of the reference procedure, that is referred as the proxy procedure in the following.Instead of training a different model for each subset of injected sources, we train a unique model without additional synthetic sources mimicking the behaviour of real sources.Synthetic sources are injected a posteriori, i.e., after training.Finally, the trained model is applied on the resulting dataset.The gain in terms of algorithmic complexity (about a factor 2,000 with in average 5 synthetic sources in each subset) is similar to the one brought by the proxy procedure of the detection step.
At this stage, we have to measure the ability of the proxy procedure to provide a fair estimate of the real performance that would be achieved with the reference procedure.Figure 23 periments, we considered around 3,300 synthetic sources spread over the whole field of view.We observe that the proxy procedure leads to results very close to the ones provided by the reference procedure, excepted near the star where the proxy procedure is over-optimistic, i.e. a positive bias is present (up to a factor five, at worst).This bias at short angular separations can be attributed to data leakages between the train and the test sets in the absence of whitening procedure.Basically, as for the detection part, the model is data-dependent and the absence of whitening procedure as well as of the associated temporal shuffling of the frames induces that some parts of the nuisance component are seen and memorized by the network using the proxy procedure.This effect occurs only at short angular separations since this is the area of the field of view where training patches contain most likely some similar parts of the nuisance component.This hypothesis is also supported by the absence of bias between the proxy and the reference procedures of the detection stage.The latter encompasses a whitening procedure and a temporal shuffling of the frames that prevents data leakages between the train and test sets.So, it remains to decide for a suited cut-off in terms of angular separation to switch from the reference to the proxy procedure.For that purpose, we resort to a binary hypothesis test.It performs a paired -test (Kendall et al. 1948), where the H 0 (null) hypothesis represents the equality between the reference and the proxy procedures while the H 1 (alternative) hypothesis represents better results with the proxy procedure.From the results presented in Fig. 23(a), we conduct in Fig. 23(b) this statistical test on the same datasets.It displays the resulting -values as a function of the angular separation, and compares them with a 5% significance threshold.For the prescribed significance level, the two procedures can be considered as equivalent for angular separations larger than 0.5 arcsec.
As a conclusion of this study, when the number of synthetic sources for which the photometry should be evaluated at inference time constitutes a computational bottleneck, the efficient and approximate proxy procedure can be safely used in the main part of the field of view and the more expensive procedure of reference should be used near the star.

Characterization of real sources
In this section, we compare the photometry estimations obtained with the tested algorithms on some real known sources present in the considered datasets.When available, we also compare them with measurements published in the literature that have been obtained either with the SPHERE Data Center implementations of TLOCI and PCA or with ANDROMEDA (for HD 95086 b only).For PCA (VIP), the reported uncertainties are computed from the residual combined image (i.e., after subtraction of the on-axis PSF, derotation, and stacking) in an annulus at the same angular separation than the source of interest.For PACO, the photometry is estimated by accounting for the spatial correlations of the nuisance component.
As concerns the proposed algorithm, and as for its detection stage, the control of the uncertainties is an intricate task given that the core of the deep model works as a black-box.For that reason, we did not produce estimations of the standard-deviation, which is left for future work.For the experiments we conduct with PCA (VIP), PACO and the proposed method, we use the same pre-reduction of the raw observations and the same off-axis PSF template that has been measured prior the science observations.It is not ensured that these specificity also hold for photometry estimates extracted from the literature, which could induce (unknown) systematic variations that are not taken into account.Similarly, given the variability of the observing conditions between different observations, absolute comparisons between multiple epochs of the same target should be done with caution.
Table 8 reports the photometry estimations on known real sources.Even in the absence of absolute ground-truth, we can make some relative comparisons for a given dataset and for a given source.In that view, photometry estimates produced by the proposed method are compatible with published and/or obtained results with PCA (VIP) and PACO for almost all sources.The largest discrepancy is for CC5 of HIP 88399 (2018-04-11), where a factor six lies between the estimates from PCA (VIP) and the proposed algorithm.Very likely, the PCA (VIP) estimate is not reliable as the source is located at large angular separation, i.e. in an area of the field of view where PCA (VIP) is prone to large errors (see Fig. 21).Besides, PACO and the proposed method lead to quite close estimations, which also supports the previous claim.
Photometry estimates of CC-12 that we identified from HD 95086 datasets in Sect.4.2.3 are consistent among the different algorithms.They are also consistent between the two epochs where CC-12 has been detected.For CC-11 that we identified from the same datasets, we note a discrepancy by about 40% between the estimates obtained (i) by the proposed method, and (ii) by the other tested algorithms.However, this two groups of estimations are consistent between the 2015 and 2018 epochs.The discrepancy could be attributed to the presence of a bright background source (denoted by CC-8 in Fig. 20) that could lead to an overestimation of the photometry with algorithms that do not rely on a training step with synthetic sources.In any case, these two candidate point-like sources should be considered with caution.
Table 8.Estimated photometry on some known real sources present in the considered datasets.For HD 95086, the (candidate) point-like sources are denoted as in Fig. 20.For HIP 88399 (respectively, HIP 72192), point-like sources are denoted by CC1 to CC5 (respectively, CC1 and CC2) through increasing angular separations.The astrometry (angular separation and true-North angle) has been estimated with PACO.Photometry estimations extracted from the literature has been obtained with TLOCI by (Langlois et al. 2021) for (a), TLOCI by (Desgrange et al. 2022) for (b), ANDROMEDA by (Desgrange et al. 2022) for (c), PCA by (Chauvin et al. 2017) for (e), and TLOCI by (Cheetham et al. 2019)

CONCLUSION
We have described the key principles of a new algorithm for detecting and characterizing point-like sources at high contrast from ADI observations.The detection stage combines the statistics-based model of PACO with deep learning in a three step procedure: (i) the data are centered and whitened using the PACO framework, (ii) a CNN is trained to detect synthetic sources from the pre-processed images, and (iii) a detection map is inferred.While the CNN itself works as a black-box approach, the proposed method encompasses prior domain knowledge such as the apparent motion of sources and the expected shape of the exoplanetary signal inside the ADI datasets.More importantly, the proposed detection approach capitalizes on the statistical model of the nuisance component embedded in PACO to improve the stationarity and the contrast during a pre-processing step.Once a candidate source has been identified, its photometry can be estimated using a dedicated deep learning module, also trained in a supervised fashion.Tested on eleven SPHERE-IRDIS datasets, the proposed detection method performs better than standard algorithms of the field like PCA as well as PACO in terms of precision-recall trade-off.The detection sensitivity is improved by a factor between two and five in average with respect to PACO for the whole range of angular separations.This gain in even more important compared to baseline methods as cADI or PCA.Interestingly, we showed experimentally that the proposed approach is able to (mostly) reach the fundamental detection sensitivity driven by the photon noise limit for angular separations above 0.5".This corresponds to an optimal extraction of the deep PACO: deep learning meets PAtch COvariances 25 sought signals in that regime of separations.The gain brought by the characterization stage of the proposed approach is more moderate but non-negligible.The absolute error of photometric estimation is reduced by a factor two with respect to PACO for sources of contrast up to 10 −6 located above 0.5".Nearer the star, the advantage is on average either to PACO or to the proposed approach, depending on the dataset and of the source location.The gain brought by the characterization module of the proposed approach could allow to further constrain the physical properties of detected exoplanets, usually performed by fitting atmospheric models to the extracted photometry.
Based on the analysis of these results, several lessons can be drawn regarding the deployed methodology.As a major point, this work illustrates the feasibility to build a complex model (with millions of parameters) of the nuisance component through supervised deep learning from a single ADI dataset of observations.The underlying detection model is learned, without over-fitting, thanks to a custom data-augmentation strategy based on two key ingredients: (i) a random temporal shuffling of the individual images applied before injections of each set of synthetic training sources, (ii) a pre-processing step encompassing a whitening procedure that removes most of the spatial correlations of the nuisance component.Prior to the injection of synthetic training sources, each training set is thus formed by a quasi-random realization of (mostly) uncorrelated Gaussian noise.This property prevents a leak between the train and the test sets as well as a potential memorization of the nuisance structures by the network during its training.Beyond that aspects, the pre-processing step allows to improve the stationarity of the data, thus preventing the high number of false alarms that can occur with other approaches of the field based on supervised deep learning.Our results also emphasize that deriving a flux estimate is a more complex task than providing a qualitative result related to the presence or to the absence of a source with our hybrid modeling of the nuisance component.In particular, we illustrate numerically that the whitening process is detrimental for source characterization (hence, it is not applied) because it modifies both the shape and the amplitude of the exoplanetary signature.In the absence of the whitening procedure, a shallower architecture should be used to avoid over-fitting.
We are currently working on the extension of the proposed algorithm for the joint processing of multi-spectral datasets such as the ones provided by the SPHERE-IFS instrument using the angular plus spectral differential imaging technique.Besides, we are currently investigating the three main limitations of the proposed approach: (i) the lack of control of the uncertainties, (ii) its task-dependence which is not adapted to reconstruct spatially resolved objects like circumstellar disks, and (iii) the data-dependence of the learning procedure which does not take benefits from multiple observations to build a more general and robust model of the nuisance component.Concerning the later point, building a model from multiple observations could be a promising step to reduce the remaining gap (by a factor 10 to 30) between the current detection performance and the theoretical ultimate detection sensitivity driven by the fundamental photon noise limit.Besides, we would like to incorporate within our deep models some meta-data (e.g., monitoring of the observing conditions, telemetry of the adaptive optics), with the aim to further improve their sensitivity and robustness.with   ′ the number of patches averaged at each location  ′ of the field of view, as defined in Eq. ( 6).For overlapping square patches12 with a unit patch stride,   ′ is equal to  almost everywhere, excepted on the borders of the field of view where it progressively tends to zero.The term  can be interpreted as the correlation between the whitened off-axis PSF and the centered plus whitened observations, while  is a normalization term representing the auto-correlation of the whitened off-axis PSF.It can be noted that for each pixel  ′ of the field of view, the ratio   ′ / √︁   ′ (respectively, the ratio   ′ /   ′ ) corresponds to the S/N of detection (respectively, to the source flux) that would be estimated by PACO at pixel  ′ (Flasseur et al. 2018b).To prevent the deep model (built from the outputs of the pre-processing step) to learn only the mapping  → / √ , we resort to a residual learning procedure.It consists in evaluating the loss function (see Sect. 2.2.3) jointly on the detection map produced by the proposed algorithm and also on the PACO S/N map /

√
computed on the fly.This strategy explicit rewards the deep model to perform better than PACO.
Table A2 compares, with the same procedure as described in Sect.
2.2.3, the detection performance in terms of precision and recall of the proposed algorithm using either the whitening procedure described in Sect.2.1.2,and the whitening procedure described in this Appendix.It shows that for the eleven SPHERE-IRDIS datasets we study in details in this work, the variant approach accounting for the transformation induced by the whitening process on the off-axis PSF leads only to a slight improvement of the overall detection performance.This is due to the fact that none of the eleven considered datasets was recorded under bad observing conditions.Figure A1 gives a qualitative comparison between the two approaches for a dataset of HR 8799 (see Sect. 4 for the recording logs) impacted by the wind-driven halo effect.The variant procedure described in this appendix allows to avoid numerous evident false alarms occurring when the normalization of the pre-processed frames is omitted.
Compared to the whitening procedure described in Sect.2.1.2,the variant described in this appendix has a computational burden increased by a factor 4 × , i.e. by typically a factor between 200 and 500 for the VLT/SPHERE instrument.For practical reasons, we recommend to use the standard whitening procedure by default, as defined in Sect.2.1.2.This is also the choice we made for the presentation of the results in the main core of this paper.The alternative version of the algorithm accounting for the whitening of the offaxis PSF can be reserved to refine, in a second step, the reduction of some datasets for which obvious and numerous false alarms are experienced at inference time.

APPENDIX B: MAIN SETTINGS OF THE DETECTION AND CHARACTERIZATION STAGES OF THE PROPOSED ALGORITHM
Table B1 gives a quick-look summary of the main settings for the proposed algorithm, both for the detection and for the characterization modules.Fields are classified in three categories: pre-processing, generation of the training set, and deep learning model.This paper has been typeset from a T E X/L A T E X file prepared by the author.
MNRAS 000, 1-28 ( 2023) Table B1.Summary of the main settings used for the detection and characterization modules of the proposed algorithm. (a) We recall that the training sets are generated on the fly for the detection stage, i.e., the notion of epochs is used only to schedule the learning rate.For the characterization stage, the term epoch is used in a classical meaning so that the generated patches are seen multiple times (corresponding to the number of epochs) by the network.

parameters detection characterization
▶ pre-processing  1 and 2 give AUC under ROCs (TPR as a function of the FDR) for known real sources and for injected synthetic sources, respectively.Concerning the characterization stage, Fig. 25 gives the throughput of the tested algorithms averaged over the eleven SPHERE-IRDIS datasets we considered in this work.

Figure 1 .
Figure 1.(a) Typical observations  from the VLT/SPHERE-IRDIS instrument conducted in pupil tracking mode (i.e., with the ADI technique).Zooms around two known background faint sources are displayed in the red circles, a zoom on the nuisance component  near the star is highlighted in the green circle, and a view of the sought exoplanetary signal, taking the form of the off-axis PSF , is shown in the blue circle.Two spatio-temporal slice cuts along the black solid and dashed lines are shown on the right, as an illustration of the spatial non-stationarity of the correlations of the nuisance.(b) Illustration of the main operations performed during step 1 of the proposed approach, namely the pre-processing of the observations by statistical learning.The estimation of the mean  and of the covariance matrices C are based on the PACO model of the nuisance component.Examples of estimated spatial covariance matrices are displayed in squares for four regions of interest pickled at about 0.5 (pink), 1.0 (green), 1.5 (purple), and 2.0 (orange) arcsec.It results from this pre-processing step centered and whitened observations from which our detection and characterization models are built by supervised deep learning, see Fig. 2. Dataset: HIP 72192 (2015-05-10), see Sect. 4 for the recording logs.

Figure 2 .
Figure 2. Schematic representation of the main operations performed during the detection and characterization steps of the proposed algorithm by supervised deep learning.The first line displays a view of the main parameters defining synthetic sources injected into the pre-processed observations (see Fig. 1) at training time.The second line shows common operations performed for both the detection and the characterization steps.The left (respectively, the right) part of the third line is for operations applied solely during the detection (respectively, the characterization) step.Throughout this paper, synthetic training sources injected to build our models are highlighted in orange while (possibly unknown) real and synthetic sources that we aim to detect and to characterize at inference time are displayed in light blue in the schematic representations.Dataset: HIP 72192 (2015-05-10), see Sect. 4 for the recording logs.

Figure 3 .
Figure 3. Schematic view of the quantity , representing the number of patches contributing to the computation of the pre-processed observations  at each pixel of the field of view, as a function of the patch shape and of the tessellation of the field of view.Non-overlapping circular patches are not considered in this work since they do not allow a complete paving of the field of view.The spatial scale is not respected for the purpose of illustration.

Figure 4 .
Figure 4. Schematic illustration of the local update of the pre-processing embedded in the training step of the proposed detection algorithm.(a) Illustration of the sets P and S [] for a given training set  with three injected training sources displayed in orange.(b) Illustration of the computation of the pre-processed observations q in the presence of injected training sources from pre-computed (i.e., before injections) centered and whitened observations .The spatial scale is not respected for the purpose of illustration.

Figure 5 .
Figure 5. Example of training and validation results.(a) Examples of detection maps obtained at validation time for the best epoch (number 20).(b) Evolution of the loss function at training and validation time, as well as the evolution of the F1R accuracy metric at validation time.(c) ROCs representing the TPR, FDR and F1R as a function of the prescribed detection threshold  for the best epoch (number 20, symbolized by the gray vertical line in panel (b)).Dataset: HD 95086 (2015-05-05), see Sect. 4 for the recording logs.

Figure 6 .
Figure 6.Influence of the whitening of the spatial correlations during the pre-processing step of the characterization algorithm.ARE (see Sect. 3.2.3 for the definition of the metric) on the estimated photometry is reported as a function of the angular separation, with and without whitening of the spatial correlations.The performance of PACO are displayed as a purpose of comparison.The results are averaged azimuthally for 40,000 sources of flux drawn uniformly between 1 × 10 −6 and 3 × 10 −5 .The known real sources were masked out and were not considered.Dataset: HD 95086 (2015-05-05), see Sect. 4 for the recording logs.

Figure 7 .
Figure7.Influence of the patch size  in the characterization algorithm.ARE (seeSect.3.2.3  for the definition of the metric) on the estimated photometry is reported as a function of the angular separation, for patches between  = 15 2 and  = 31 2 pixels area.The results are averaged azimuthally for 40,000 sources of flux drawn uniformly between 1 × 10 −6 and 3 × 10 −5 .The known real sources were masked out and were not considered.Dataset: HD 95086 (2015-05-05), see Sect. 4 for the recording logs.

Figure 8 .
Figure 8. Influence of the number of training sources  in the characterization algorithm.ARE (seeSect.3.2.3  for the definition of the metric) on the estimated photometry is reported as a function of the angular separation, for  ∈ {400,000; 40,000; 20,000; 4,000} sources.The results are averaged azimuthally for sources of flux drawn uniformly between 1 × 10 −6 and 3 × 10 −5 .The mean ARE (denoted by ARE) averaged over the angular separation is also reported as an overall measure of the performance.The known real sources were masked out and were not considered.Datasets: the eleven SPHERE-IRDIS datasets considered in this work, see Sect. 4 for the recording logs.
For each training epoch, the full set of training samples is fed as input of the network in a random order.The model optimization is performed with Adam (Kingma & Ba 2014) with a learning rate of 10 −3 .As for the detection stage, the optimization of the network weights is done in PyTorch (Paszke et al. 2019) on either Tesla V100 or GTX 1080 deep PACO: deep learning meets PAtch COvariances 13 Table

Figure 9 .
Figure 9. Detection maps obtained with the selected algorithms (see Sect. 4.1).Sources are classified as true, missed and false detections.The two additional candidate point-like sources whose identification is discussed in Sect.4.2.3 are not classified.The detection threshold is set to  = 5 for cADI, cADI (VIP), PCA (VIP) and PACO.It is set to  = 0.5 for the proposed algorithm.The light blue line represents the sensor field of view (encompassed withing a  -pixels square support) while the dashed blue line represents the extended field of view (encompassed withing a -pixels square support) on which the detection can be performed due to the apparent rotation of the field induced by ADI.Dataset: HD 95086 (2015-05-05), see Sect. 4 for the observation logs.

Figure 13 .
Figure 13.Detection maps obtained with the proposed algorithm on the three SPHERE-IRDIS datasets from the EIDC data challenge, see Sect. 4 for the observation logs.Sources are classified as true, missed and false detections.The detection threshold is set to  = 0.5.

Figure 14 .
Figure14.Schematic representation of the reference and proxy procedures used to evaluate the performance of the proposed approach through massive injections of synthetic sources mimicking the behavior (i.e., same apparent motion) of real sources.

Figure 15 .
Figure 15.Comparison between the reference procedure and its proxy version for evaluation of the detection performance of the proposed algorithm.Synthetic sources are classified as true or missed detections using a detection threshold at  = 0.5.Examples of detection maps obtained in the presence of injected synthetic sources are given on three datasets of HD 95086 (2015-05-05, 2018-02-23, 2021-03-11), see Sect. 4 for the observation logs.

Figure 16 .
Figure 16.Detection results for 10,000 synthetic sources in a diagram plotting contrast versus angular separation.Each synthetic source is classified as missed, true or false detection.The angular separation of false alarms is reported on the -axis, and they are not associated to contrast value since they do not correspond to injected synthetic sources.Dataset: HIP 88399 (2018-04-11), see Sect. 4 for the observation logs.

Figure 17 .
Figure17.Contrast as a function of the angular separation.These results are based on the classification between true and missed detections of massively injected synthetic sources at various levels of contrast, see Fig.16.It can be noted that the contrast curves of cADI and PCA are over-optimistic (see text) since the targeted PFA is not reached.For that reason, the angular separations leading to a PFA locally higher than ten times the targeted PFA at 5 are not reported.The black dashed line represents the ultimate detection limit driven by the photon noise.Dataset: HIP 88399 (2018-04-11), see Sect. 4 for the observation logs.

/Figure 18 .
Figure18.TPR (in percent) of deep PACO for synthetic sources missed by PACO (i.e., for S/N ≤ 5) as a function of the angular separation (on the -axis) and of the PACO's S/N of detection (on the -axis).The black dashed line represents the equivalent PACO's detection threshold to reach TPR = 50% with deep PACO (we recall that the classical detection threshold at 5 with PACO also corresponds to TPR = 50%).Dataset: HIP 88399 (2018-04-11), see Sect. 4 for the observation logs.

Figure 19 .
Figure 19.ROCs showing the TPR as a function of the FDR for injected synthetic sources.Dataset: HIP 88399 (2018-04-11), see Sect. 4 for the observation logs.

Figure 20 .
Figure 20.Detection maps obtained with PACO and deep PACO on two epochs of the HD 95086 star observed with SPHERE-IRDIS.Symbols classify sources as true, missed, and false detections based on our analysis.The detection threshold was set to  = 5 for PACO producing a S/N map, and to  = 0.5 for deep PACO producing a pseudo-probability map.Colors classify sources as known real sources, candidate (background) sources, and detections with unknown status.Datasets: HD 95086 (2015-05-05 and 2018-01-05), see Sect. 4 for the observation logs.

Figure 21 .Figure 22 .
Figure21.Mean ARE score on the estimated photometry of injected synthetic sources as a function of the their contrast and of their angular separation.From top to bottom, the panels corresponds respectively to PCA (VIP), PACO, and the proposed algorithm.For each panel, the mean detection limit (straight line) and the mean photon noise limit (dashed line) are superimposed.The results are averaged azimuthally for 40,000 sources of flux drawn uniformly between 1 × 10 −6 and 3 × 10 −5 .Datasets: the eleven SPHERE-IRDIS datasets considered in this work, see Sect. 4 for the recording logs.

Figure 23 .
Figure 23.Comparison between the reference procedure and its proxy version for evaluation of the characterization performance of the proposed algorithm.(a) Mean ARE scores (see definition in Eq. (15)) are reported as a function of the angular separation.(b) -values are reported as a function of the angular separation, and can be compared with the 5% significance threshold given by the dashed red curve.Datasets: the eleven SPHERE-IRDIS datasets considered in this work, see Sect. 4 for the recording logs.

Figure A1 .
Figure A1.(a): Illustration of two data frames impacted by a wind-driven halo near the star.The direction of main elongation of the halo is symbolized by the white arrow.(b): Detection map obtained on the corresponding dataset with the proposed detection algorithm embedding the whitening procedure either defined in Sect.2.1.2(i.e., without normalization by the whitened offaxis PSF) or defined in Appendix A2 (i.e., with normalization by the whitened off-axis PSF).The four known exoplanets (HR 8799 b, c, d, e) are circled in green.Dataset: HR 8799 (2015-07-04), see Sect. 4 for the observation logs.

Figure 1 :Figure 2 :
Figure 1: Detection maps obtained with the selected algorithms (see Sect. 4.1 of the main paper).Sources are classified as true, missed and false detections.The detection threshold is set to τ = 5 for cADI, cADI (VIP), PCA (VIP) and PACO.It is set to τ = 0.5 for the proposed algorithm.The light blue line represents the sensor field of view while the dashed blue line represents the extended field of view on which the detection can be performed due to the apparent rotation of the field induced by ADI.Dataset: HD 95086 (2018-01-05), see Sect. 4 of the main paper for the observation logs.
The experiments were conducted by resorting to massive injections of synthetic sources, and known real sources were excluded of this study.The scores are detailed for the eleven SPHERE-IRDIS datasets considered in this work, see Sect. 4 of the main paper for the observation logs.Algorithm (1), (2), (3), (4), and (5) stands respectively for cADI, cADI (VIP), PCA (VIP), PACO, and the proposed deep PACO algorithm.For each range of angular separations, the best results are emphasized in bold.

Figure 25 :
Figure25: Mean throughput ( α/α) on the estimated photometry of injected synthetic sources as a function of the their contrast and of their angular separation.From top to bottom, the panels corresponds respectively to PCA (VIP), PACO, and the proposed algorithm.For each panel, the mean detection limit (straight line) and the mean photon noise limit (dashed line) are superimposed.The results are averaged azimuthally for sources of flux drawn uniformly between 1 × 10 −6 and 3 × 10 −5 .Datasets: the eleven SPHERE-IRDIS datasets considered in this work, see Sect. 4 of the main paper for the recording logs.

Table 1 .
Summary of the main notations.

Table 2 .
Architecture of the proposed CNN for source characterization.The shapes of the layers are indicated for a unit batch size.
, see Sect. 4 for the observation logs.

Table 7 .
Mean results of AUC for ROCs giving the TPR as a function of the FDR.The scores are averaged over the eleven SPHERE-IRDIS datasets considered in this paper, see Sect. 4 for the observation logs.
observations  ∈ R  × observations  ∈ R  × strategy for known sources (training only) (  −   ) s.t.L  L ⊤  = C −1  , ∀ ∈ P (default) or concat.⊤ C −1  (  −   )  ∈ R  × pre-processed observations  ∈ R  × ▶ generation of the training set input pre-processed observations  ∈ R  × pre-processed observations  ∈ R  × pre-processing update on injection arcs  photometry estimates  ∈ R + deep PACO: Combining statistical models with deep learning for exoplanet detection and characterization in direct imaging at high contrast supplementary material Olivier Flasseur, Théo Bodrito, Julien Mairal, Jean Ponce, Maud Langlois, Anne-Marie Lagrange This supplementary material complements the corresponding main paper by presenting detailed results for SPHERE-IRDIS datasets we considered in this work.Concerning the detection stage, Figs. 1 to 20 display detection maps, Figs.21 and 22 give contrast curves, Figs.23 and 24 represent the TPR of the proposed algorithm as a function of the PACO's S/N of detection as well as of the angular separation.Finally, Tables ∈P E ⊤( []∈ R  × } =1: sets of pre-processed patches with injections { ( [ ]∈ R  } =1:

Table 1 :
AUC for ROCs giving the TPR as a function of the FDR.The experiments were conducted by considering only known real sources.The scores are detailed for the eleven SPHERE-IRDIS datasets considered in this work, see Sect. 4 of the main paper for the observation logs.The best results are emphasized in bold.

Table 2 :
AUC for ROCs giving the TPR as a function of the FDR.