Grant Merz, Yichen Liu, Colin J Burke, Patrick D Aleo, Xin Liu, Matias Carrasco Kind, Volodymyr Kindratenko, Yufeng Liu, Detection, instance segmentation, and classification for astronomical surveys with deep learning (deepdisc): detectron2 implementation and demonstration with Hyper Suprime-Cam data, Monthly Notices of the Royal Astronomical Society, Volume 526, Issue 1, November 2023, Pages 1122–1137, https://doi.org/10.1093/mnras/stad2785
ABSTRACT
The next generation of wide-field deep astronomical surveys will deliver unprecedented amounts of images through the 2020s and beyond. As both the sensitivity and depth of observations increase, more blended sources will be detected. This reality can lead to measurement biases that contaminate key astronomical inferences. We implement new deep learning models available through Facebook AI Research’s detectron2 repository to perform the simultaneous tasks of object identification, deblending, and classification on large multiband co-adds from the Hyper Suprime-Cam (HSC). We use existing detection/deblending codes and classification methods to train a suite of deep neural networks, including state-of-the-art transformers. Once trained, we find that transformers outperform traditional convolutional neural networks and are more robust to different contrast scalings. Transformers are able to detect and deblend objects closely matching the ground truth, achieving a median bounding box Intersection over Union of 0.99. Using high-quality class labels from the Hubble Space Telescope, we find that when classifying objects as either stars or galaxies, the best-performing networks can classify galaxies with near 100 per cent completeness and purity across the whole test sample and classify stars above 60 per cent completeness and 80 per cent purity out to HSC i-band magnitudes of 25 mag. This framework can be extended to other upcoming deep surveys such as the Legacy Survey of Space and Time and those with the Roman Space Telescope to enable fast source detection and measurement. Our code, deepdisc, is publicly available at https://github.com/grantmerz/deepdisc.
1 INTRODUCTION
The rise of machine learning/artificial intelligence has allowed for rapid advancement in many image analysis tasks to the benefit of researchers who wish to work with large sets of imaging data. This active field of study, known as computer vision, has led to developments in many disciplines including medical imaging (Zhou et al. 2021), urban planning (Ibrahim, Haworth & Cheng 2020), autonomous systems (Pavel, Tan & Abdullah 2022), and more.
Tasks such as image compression, inpainting, object classification and detection, and many others have been extensively studied. Astronomy is no exception, and many methods that utilize deep learning have been applied to simulations and real survey data for tasks such as object detection, star/galaxy classification, photometric redshift estimation, image generation, deblending, and more (see Huertas-Company & Lanusse 2023 for a comprehensive review). Machine learning methods are already becoming instrumental in handling the large volume of data processed every day in survey pipelines (e.g. Bosch et al. 2018; Tachibana & Miller 2018; Mahabal et al. 2019; Malanchev et al. 2021; Russeil et al. 2022).
The next generation of astronomical surveys such as the upcoming Legacy Survey of Space and Time (LSST; Ivezić et al. 2019) at the Vera C. Rubin Observatory, the Wide-Field Imaging Survey at the Nancy Grace Roman Space Telescope (Roman; Spergel et al. 2013), and Euclid (Amiaux et al. 2012) will produce unprecedented amounts of imaging data throughout the 2020s and beyond. LSST will provide incredibly deep ground-based observations of the sky, revealing a map of the universe including objects as faint as ∼25–27 mag at a 5σ detection for 10 yr observing runs. Ground-based surveys such as the Hyper Suprime-Cam Subaru Strategic Program (HSC SSP; Aihara et al. 2018a) and the Dark Energy Survey (DES; Dark Energy Survey Collaboration 2016) have already mapped large swaths of the sky and produced catalogues of tens of millions of objects, with HSC depths being comparable to LSST. The astronomical research community is now in an era that demands robust and efficient techniques to detect and analyse sources in images.
Current surveys such as HSC already report large fractions of blended (overlapping) objects. For instance, 58 per cent of objects in the shallowest field (Wide) of the HSC survey are blended, that is, detected in a region of sky above the 5σ threshold (26.2 mag) containing multiple significant peaks in surface brightness. As depths increase, line-of-sight projections and physical mergers cause the overall number of blends to increase. This fraction rises to 66 per cent for the Deep and 74 per cent for the UltraDeep layers, which are comparable to LSST depths (Bosch et al. 2018). If blends are not identified, they will bias results from pipelines that assume object isolation. For example, Boucaud et al. (2020) show that the traditional detection/deblending methods can lead to a photometric error of >0.75 mag for ∼12 per cent of their sample of artificially blended galaxies from the Cosmic Assembly Near-infrared Deep Extragalactic Legacy survey (CANDELS, Grogin et al. 2011; Koekemoer et al. 2011). Unrecognized blends can cause an increase in the noise of galaxy shear measurements by ∼14 per cent for deep observations (Dawson et al. 2016). Deblending, or source separation, has been recognized as a high priority in survey science, especially as LSST begins preparations for first light.
Despite rigorous efforts to deblend objects, the problem of deblending remains, and in some sense will always remain, in astronomical studies. Deblending involves separating a mixture of signals in order to independently measure the properties of each individual object. This is an imaging problem analogous to the ‘cocktail party problem’, in which an attempt is made to isolate individual voices from a mixture of conversations. However, since it is impossible to trace a photon back to an individual source, astronomical deblending is characterized as an underconstrained inverse problem. Deblending methods must rely on assumptions about source properties and models of signal mixing (Melchior et al. 2021).
A first step in deblending is object detection. Many codes have been developed for source detection and classification, including FOCAS (Jarvis & Tyson 1981), NEXT (Andreon et al. 2000), and SExtractor (Bertin & Arnouts 1996). SExtractor is widely used in survey pipelines including HSC (Bosch et al. 2018) and DES (Morganson et al. 2018), but can be sensitive to configuration parameters. While SExtractor also deblends by segmenting, that is, identifying pixels belonging to unique sources, modern deblenders have been developed such as MORPHEUS (Hausen & Robertson 2020) and SCARLET (Melchior et al. 2018), with the latter implemented in the HSC and LSST pipelines. With hopes for real-time object detection and deblending algorithms in surveys such as LSST, machine learning applications to crowded fields offer a promising avenue. The use of deep neural networks, or deep learning, has seen particular success in image processing. In addition to efficiency and flexibility, neural networks may be able to overcome limitations of traditional peak-finding algorithms due to their fundamentally different detection mechanism.
There is a growing body of deep learning deblending methods in astronomy. Reiman & Göhre (2019) use a Generative Adversarial Network (GAN) to deblend small cutouts of Sloan Digital Sky Survey (SDSS, Alam et al. 2015) galaxies from Galaxy Zoo (Lintott et al. 2011). Arcelin et al. (2021) use a variational autoencoder to deblend small cutouts of simulated LSST galaxies. Hemmati et al. (2022) use GANs to deblend images with HSC resolution and recover Hubble Space Telescope (HST) resolution. On larger scales, Bretonnière, Boucaud & Huertas-Company (2021) use a probabilistic U-net model to deblend large simulated scenes of galaxies.
In addition to blending, another pressing issue with increased depth is the presence of many unresolved galaxies in the deep samples of smaller and fainter objects. This will prove difficult for star–galaxy classification schemes that rely on morphological features to distinguish between a point source star or a point source galaxy, although machine-learning methods have been employed to combat this problem (Tachibana & Miller 2018; Miller & Hall 2021). Muyskens et al. (2022) use a Gaussian process classifier to perform star/galaxy classification on HSC images. This is an important area of study, as misclassifications can introduce biases in studies that require careful measurement of galaxy properties. For instance, it has been shown that stellar contamination can be a significant source of bias in galaxy clustering measurements (Ross et al. 2011). Precise constraints of cosmological models require a correction of this systematic bias in measurements of clustering at high photometric redshifts.
The broader field of computer vision has seen a large growth in object detection, classification, and semantic segmentation models. Object detection and classification consist of identifying the presence of an object in an image and categorizing it from a list of possible classes. Semantic segmentation involves identifying the portion of an image which belongs to a specific class, that is, deblending. Put together, these tasks amount to instance segmentation. This pixel-level masking can be used to deblend objects by selecting the pixels associated with each individual object by class. The benchmark leader in deep learning instance segmentation models has been the Mask Region-based Convolutional Neural Network (Mask-RCNN) framework (He et al. 2017).
The Mask-RCNN architecture was implemented in Burke et al. (2019) to detect, deblend, and classify large scenes of simulated stars and galaxies. Other architectures have been tested in astronomical contexts, including You Only Look Once (YOLO, Bochkovskiy, Wang & Liao 2020). He et al. (2021) use a combination of the instance segmentation model YOLOv4 and a separate classification network to perform source detection and classification on SDSS images, and González, Muñoz & Hernández (2018) use a YOLO model to detect and morphologically classify SDSS galaxies. However, these models do not perform segmentation.
The rapid pace of research has led to many new variations and methods that can outperform benchmark architectures. To the benefit of computer vision researchers, Facebook AI Research (FAIR) has compiled a library of next-generation object detection and segmentation models under the framework titled detectron2 (Wu et al. 2019). This modular, fast, and well-documented library provides a fertile testing ground for astronomical survey data. In addition to a variety of architectures, pre-trained models are also provided. By leveraging transfer learning, that is, the transfer of a neural network’s knowledge from one domain to another, we can cut back on training time and costs with these pre-trained models. It is also possible to interface new models with detectron2, for example, Cheng, Parkhi & Kirillov (2022) and Li et al. (2022), taking advantage of its modular nature and flexibility.
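As an illustration of how little scaffolding detectron2 requires, the sketch below loads a pre-trained Mask R-CNN configuration from the Model Zoo and adapts it to a two-class (star/galaxy) problem. It is a minimal example, not the exact configuration used in this work; the particular zoo entry, input format, and class count shown here are illustrative placeholders.

```python
# Minimal sketch: loading a pre-trained Mask R-CNN from the detectron2 Model Zoo
# as a starting point for transfer learning; swap in other backbone configs analogously.
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.modeling import build_model
from detectron2.checkpoint import DetectionCheckpointer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 2   # star and galaxy
cfg.INPUT.FORMAT = "BGR"              # adjust to match the image channels actually used

model = build_model(cfg)                              # a GeneralizedRCNN instance
DetectionCheckpointer(model).load(cfg.MODEL.WEIGHTS)  # load the pre-trained weights
```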
In this work, we leverage the resources of the detectron2 library by testing state-of-the-art instance segmentation models on large scenes, each containing hundreds of objects. We perform object detection, segmentation, and classification simultaneously on large multiband HSC co-adds. Many deep learning applications have been tested on simulated images, but methods applied to real data are often limited by a lack of ground truth. Here, we construct a methodology for using instance segmentation models on real astronomical data, and demonstrate the potential and challenges of this framework when applied to deep images. The HSC data are ideal for testing this framework, as it represents the state-of-the-art among wide/deep surveys, and is closest in quality to upcoming LSST data. By interfacing with detectron2, we are able to test new models as the repository is updated. We compare models with different performance metrics, and test how robust they are to contrast scalings that alter the dynamic range of the data, which will be important to consider for application to other data sets.
The major contributions of this work can be summarized as (1) using instance segmentation models to deblend and classify objects in real images from HSC. This demonstrates the feasibility for future integration with wide/deep survey pipelines. We will show that the models can learn inherent features in the data that lead to classification performance gains above traditional morphological methods. (2) Comparing the performances of different models when the input data undergo different contrast scalings. There is no standard method for scaling image data in astronomical studies that use deep neural networks, so we apply a variety of pre-processing scalings to the data for each model. Dynamic ranges can vary significantly across data sets, and raw data may not be ideal for feature extraction. We test sensitivity to contrast scalings to identify models that will be more easily adapted to different data sets. (3) Interfacing our pipeline with the detectron2 framework to test state-of-the-art models. Of particular note are our tests using transformer-based architectures, an emerging framework in computer vision studies. We will show that these architectures are more robust and accurate than traditional convolutional neural networks in both deblending and classifying objects in large scenes.
This paper is organized as follows. In Section 2, we present an overview of detectron2, in which we highlight the flexibility of its modular nature and describe the portion of the available deep learning models we implemented. In Section 3, we describe the curation of our data sets, production of ground truth labels, data preparation, and our training procedure. In Section 4, we present the results of training our suite of models and assess performance with different metrics. In Section 5, we discuss the differences in model capabilities, compare the performance of our pipeline to existing results, and discuss the benefits and drawbacks of our method. In Section 6, we contextualize our findings and conclude.
2 detectron2 FRAMEWORK
We leverage the modular power of detectron2 by implementing models with varying architectures. The pre-trained models we test in detectron2’s Model Zoo have a structure that follows the GeneralizedRCNN meta-architecture provided by the codebase. This architecture is a flexible overarching structure that allows for a variety of changes, provided they support the following components: (1) a per-image feature extraction backbone, (2) region-proposal generation, and (3) per-region feature extraction/prediction. The schematic of this meta-architecture is shown in Fig. 1.
Figure 1. Generalized RCNN meta-architecture. A multichannel image along with ground truth object annotations is fed to the backbone feature extractor. These features are passed to the RPN and ROI heads to predict object locations and annotations.
The feature extraction backbone takes an input image and outputs ‘feature maps’ by running the input through a neural network, often composed of convolutional layers. In our tests, we use ResNet backbones and transformer-based backbones. ResNets are convolutional neural networks that utilize skip connections, which allow for deep architectures with many layers without suffering from the degrading accuracy problem known to plague deep neural networks (He et al. 2016). In this paper, we explore a few different ResNet backbones: ResNet50, ResNet101, and ResNeXt. A ResNet50 network consists of 50 total layers, with two at the head or ‘stem’ of the network and then four stages consisting of 3, 4, 6, and 3 convolutional layers, respectively. Each stage includes a skip connection. A ResNet101 network is similar to a ResNet50 setup, but with each stage consisting of 3, 4, 23, and 3 convolutional layers, respectively. Subsequent layers undergo a pooling operation that reduces the input resolution. We refer the reader to He et al. (2016) for details regarding these layers. ResNeXt layers work similarly to ResNet layers, but include grouped convolutions, which add an extra parallel set of transforms (Xie et al. 2017). We also test a network with deformable convolutions, in which the regularly spaced convolutional kernel is deformed by a pixel-wise offset that is learned by the network (Dai et al. 2017).
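The skip connection itself is simple; the sketch below is an illustrative residual block in PyTorch (not the detectron2 implementation, which adds bottleneck layers, normalization, and striding) showing how the input is added back to the convolved output.

```python
# Illustrative residual block: the skip connection adds the input back to the
# convolved output, which is what lets very deep ResNets train without the
# accuracy degradation discussed above.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)   # skip connection

features = ResidualBlock(64)(torch.randn(1, 64, 128, 128))  # same shape in and out
```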
The stages of a ResNet backbone produce feature maps, representing higher level image aspects such as edges and corners. While one can simply take the feature map outputted by the last layer of the backbone, this can pose a challenge in detecting objects of different scales. This motivates the extraction of features at different backbone stages (and thus scale sizes). A hierarchical feature extractor known as a feature pyramid network (FPN, Lin et al. 2017) has seen great success in object detection benchmarks. The FPN allows each feature map extracted by a ResNet stage to share information with other feature maps of different scales before ultimately passing on to the Region Proposal Network (RPN).
After the image features have been extracted, the next stage of Generalized-RCNN networks involves region proposal. This stage involves placing bounding boxes at points in the feature maps and sampling from the proposed boxes to curate a selection of possible objects. After this sampling has been done, bounding boxes are once again proposed and sent to the Region of Interest (ROI) heads, where they are compared to the ground truth annotations. The annotations consist of bounding box coordinates, segmentation masks, and other information such as class labels. Ultimately, many tasks can be done on the objects inside these regions of interest, including classification, and with the advent of Mask-RCNN frameworks, semantic segmentation. We do not include the details of the RPN and ROI heads, as these structures largely remain the same in our tests. We do test architectures with a cascade structure (Cai & Vasconcelos 2018) which involves iterating the RPN at successively higher detection thresholds to produce better guesses for object locations. For specifics, we refer the reader to Girshick (2015) and He et al. (2017), and the detectron2 codebase.
We train a suite of networks to allow for several comparisons. We use a shorthand to denote network configurations as follows.
R101c4: a ResNet101 backbone that uses features from the last residual stage.
R101fpn: a ResNet101 backbone that uses an FPN.
R101dc5: a ResNet101 backbone that uses an FPN with the stride of the last block layer reduced by a factor of 2 and the dilation increased by a factor of 2.
R50def: a ResNet50 backbone that uses an FPN and deformable convolutions.
R50cas: a ResNet50 backbone that uses a cascaded FPN.
X101fpn: a ResNeXt101 backbone that uses an FPN.
In addition to these ResNet-based models, we also test transformer-based architectures. A transformer is an encoder–decoder model that employs self-attention. Briefly, self-attention consists of applying linear operations to an encoded sequence to produce intermediate ‘query, key, and value’ tensors. A further series of linear operations and scalings is applied to these intermediate tensors to produce an output sequence, and then a final linear operation is performed on the entire output sequence. Transformer models have exploded in popularity in the domain of natural language processing due to their scalability and generalizability on sequences, which translates well to language structure. Recently, transformers have been used in computer vision tasks such as image classification and object detection. These models have been shown to be competitive with the dominant convolutional neural networks, and are seeing rapid advances in performance measures (Dosovitskiy et al. 2020; Caron et al. 2021; Liu et al. 2021; Li et al. 2022; Oquab et al. 2023). For example, MViTv2 utilizes multi-head pooling attention (MHPA, Fan et al. 2021) to apply self-attention at different image scales, allowing for the detection of features of varying sizes. To obtain the input encoded sequences, an image is first divided into patches, which are flattened and sent through a linear layer. MHPA is applied to the sequences to produce the image features. In an object detection context, these features are input to an FPN in the same way as features obtained from a ResNet in RCNN models. Another modern transformer model, the Swin Transformer (Liu et al. 2021), applies multi-head attention to image patches but, rather than a pooling operation, uses patch merging to combine features of different image patches. Swin models also use shifted window attention to allow for efficient computation and information propagation across the image. We test both MViTv2 and Swin backbones in our implementation.
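To make the query/key/value description concrete, the following sketch implements plain single-head scaled dot-product self-attention over a sequence of patch embeddings; the dimensions and random projections are placeholders, and real MViTv2/Swin blocks add multiple heads, pooling or window shifting, and learned per-layer projections.

```python
# Schematic scaled dot-product self-attention over a sequence of image-patch embeddings.
import torch

def self_attention(x, w_q, w_k, w_v):
    """x: (N_patches, d_model); w_*: (d_model, d_k) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # query, key, value tensors
    scores = q @ k.T / k.shape[-1] ** 0.5        # scaled similarity of every patch pair
    weights = torch.softmax(scores, dim=-1)      # attention weights
    return weights @ v                           # weighted combination of values

x = torch.randn(196, 256)                        # e.g. 14x14 patches, 256-dim embeddings
w_q, w_k, w_v = (torch.randn(256, 64) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)           # (196, 64) attended features
```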
3 IMPLEMENTATION
3.1 HSC co-adds
In this work, the data we use consist of multiband image co-adds of roughly 4000 × 4000 pixels from the Deep and UltraDeep fields of the HSC SSP (Aihara et al. 2018b) Data Release 3 (DR3, Aihara et al. 2022). The HSC SSP is a three-tiered imaging survey using the wide-field imaging camera HSC. The HSC instrument (Miyazaki et al. 2017) is a 1.77 deg² camera with a pixel scale of 0.168 arcsec, mounted at the prime focus of the Subaru 8.2-m telescope on Mauna Kea. The Deep+UltraDeep component of the HSC survey covers ∼36 deg² of the sky in five broad optical bands (grizy; Kawanomoto et al. 2018) to a full 5σ depth of ∼27 mag (depending on the filter). Despite limitations (e.g. sky subtraction and crowded field issues), the HSC DR3 data provide the closest match among all currently available deep-wide surveys to the expected data quality of LSST wide fields. The Deep/UltraDeep field properties are listed in Table 1. We use the g, r, and i bands.
Table 1. Deep/UltraDeep field properties for the bands used in this work.

| Filter | Median exposure (min) | Seeing (arcsec) | Depth (mag) |
|---|---|---|---|
| g | 70 | 0.83 | 27.4 |
| r | 66 | 0.77 | 27.1 |
| i | 98 | 0.66 | 26.9 |
Given the large depth of the survey, a significant portion of objects are blended in comparison to other ground-based surveys such as the DES (Dark Energy Survey Collaboration 2016). For reference, 58 per cent of objects in the shallowest field (Wide) of the HSC survey are blended. While a significant challenge, this makes the HSC fields an excellent data set for testing deblending algorithms, particularly those suited for crowded fields. The pipeline to produce the image co-adds is described in detail in Bosch et al. (2018). There are two sets of sky-subtracted co-adds. The first set consists of global sky-subtracted co-adds. The second set also uses the global sky-subtracted images, but an additional local sky subtraction algorithm is applied. This is to remove the wings of bright objects, artefacts that can cause problems in object detection algorithms. However, this process creates a trade-off with removing flux from extended objects, and Aihara et al. (2018a) empirically find a local sky subtraction scale of 21.5 arcsec to be a good balance. Ultimately, we use these local sky-subtracted images, as bright wings and artefacts can introduce problems of overdeblending or ‘shredding’, and we want our ‘ground truth’ detections to be as clean and accurate as possible. To further ensure a clean training set, we apply a few quality cuts to the sample. Some images suffer from missing data in one or more bands, especially at the edge of the imaging fields. We use the bitmasks provided in the co-add FITS files to exclude images with >30 per cent of the pixels assigned a NO_DATA flag. Given that the neural network takes multiband images, if one of the g-, r-, or i-band images is flagged in this way, we exclude the other bands as well. There remain some imaging artefacts and issues, such as saturated regions around bright stars, and we discuss how these affect network performance in Section 4.2.
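One possible form of this quality cut is sketched below. It assumes the co-add FITS layout used by the HSC/LSST pipelines, where the mask plane is an integer image and the header records each plane's bit index under an 'MP_<NAME>' keyword; the HDU index and file name are illustrative and should be checked against the actual co-adds.

```python
# Sketch of the >30 per cent NO_DATA cut, assuming an HSC co-add FITS file whose
# mask-plane HDU stores bit indices as 'MP_<PLANE>' header keywords (verify the
# HDU ordering and keyword names against your own files).
import numpy as np
from astropy.io import fits

def no_data_fraction(coadd_path, mask_hdu=2):
    with fits.open(coadd_path) as hdul:
        mask = hdul[mask_hdu].data
        bit = hdul[mask_hdu].header["MP_NO_DATA"]   # bit index of the NO_DATA plane
    return np.mean((mask & (1 << bit)) > 0)

# Keep an image only if every band passes the cut (file names are placeholders).
bands_ok = all(no_data_fraction(f"calexp-HSC-{b}.fits") < 0.3 for b in "GRI")
```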
3.2 Ground truth generation
We must provide ground truth object locations and masks to the network to perform pixel-level segmentation. We utilize the multiband deblending code scarlet (Melchior et al. 2018) to produce a model for each individual source, from which we create an object mask. scarlet utilizes constrained matrix factorization to produce a spectral decomposition of an object. It is a non-parametric model that has been demonstrated to work well on individual galaxies and blended scenes. Before we run scarlet, we extract an object catalogue using sep, the python wrapper for SExtractor. Then, each identified source is modelled and the ‘blend’, or composition of sources, is fit to the co-add image data. Once the final blend model is computed, the mask is determined by running sep on each individual model source and setting a mask threshold of 5σ above the background. Both the scarlet modelling and mask thresholding are done on the detection image, that is, the sum over all bands. The run time of this process increases with the number of objects in an image. In order to reduce runtime, we divide the 4k stitched co-add images into 16 images of ∼1000 × 1000 pixels. While scarlet on its own is a powerful deblender, the fits can take up to ∼30 min depending on the number of objects in the image, which motivates the use of efficient neural networks. After this process is complete, we compile a training set of 1000 images of 1k × 1k pixels. The distribution of the number of sources per image is shown in Fig. 2.
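The initial detection step can be sketched with sep as below; `image_gri` is an assumed (g, r, i) image stack, and the scarlet modelling that sits between this detection and the final per-source masks is omitted, so the masks here come directly from the sep segmentation map rather than from the fitted models.

```python
# Minimal sep-based detection and 5-sigma segmentation on a detection image
# (the band-summed co-add); scarlet modelling of each source is omitted.
import numpy as np
import sep

detection_image = np.sum(image_gri, axis=0).astype(np.float32)  # image_gri: (3, H, W)
bkg = sep.Background(detection_image)
data_sub = detection_image - bkg.back()

# Catalogue of sources detected at >5 sigma above the background.
objects, seg_map = sep.extract(data_sub, thresh=5.0, err=bkg.globalrms,
                               segmentation_map=True)

# One boolean mask per detected object from the segmentation map (labels start at 1).
masks = [(seg_map == i + 1) for i in range(len(objects))]
```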
Figure 2. Histogram of the number of objects detected at >5σ above the background for HSC images in the training set. The images are taken from both the Deep and UltraDeep fields.
The trade-off in using real over simulated data is that in supervised tasks, there is a lack of pre-determined labels. For the classification task, we produce object labels with a catalogue match to the HSC DR3 catalogues. We convert each detected source centre to RA and Dec. coordinates and then run the match_to_catalog_sky algorithm in astropy to find objects in the HSC catalogue within 1 arcsec. Then, we compare the i-band magnitude of the deblended source to the ‘cmodel’ magnitude of the catalogue objects and pick the object with the smallest magnitude difference. If no objects are within 1 arcsec or no objects have a magnitude difference smaller than 1, we discard the object from our labelled set. Once an object is matched, we use the HSC catalogue ‘extendedness value’ to determine classes, which is based on differences in point spread function (PSF) and extended model magnitudes. While yielding high accuracy at bright magnitudes, this metric becomes unreliable for star classification around a limiting magnitude of 24 mag in the i band (Bosch et al. 2018). We additionally discard objects with NaN values in the DR3 catalogue, as the class is indeterminate. We show an example image and the results of our labelling methodology in Fig. 3, with colour-coded classes.
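A simplified version of this catalogue match is sketched below using astropy's match_to_catalog_sky. It keeps only the single nearest catalogue neighbour rather than the best magnitude match among all neighbours within 1 arcsec, and the input arrays (`det_ra`, `det_dec`, `det_imag`, and the `hsc_*` catalogue columns) and the extendedness > 0.5 cut are assumptions for illustration.

```python
# Sketch of the class-label match: nearest HSC catalogue object within 1 arcsec,
# with an i-band magnitude difference below 1 mag, classified by extendedness.
import numpy as np
import astropy.units as u
from astropy.coordinates import SkyCoord

det_coords = SkyCoord(ra=det_ra * u.deg, dec=det_dec * u.deg)
cat_coords = SkyCoord(ra=hsc_ra * u.deg, dec=hsc_dec * u.deg)

idx, sep2d, _ = det_coords.match_to_catalog_sky(cat_coords)
dmag = np.abs(det_imag - hsc_cmodel_imag[idx])

# Keep only matches within 1 arcsec and 1 mag; label from the extendedness flag.
good = (sep2d < 1.0 * u.arcsec) & (dmag < 1.0)
labels = np.where(hsc_extendedness[idx] > 0.5, "galaxy", "star")[good]
```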
Figure 3. The ground truth masks and bounding boxes on an example image in the test set of our HSC Deep/UltraDeep field data. The image without overlaid masks/boxes is shown below for clarity. A Lupton contrast scaling is used in this visualization. Galaxies are coloured green, and stars are coloured red.
3.3 Data preparation
We employ three common methods for scaling the raw data from the co-add FITS files to RGB values: a z-scale, a Lupton scale, and a high-contrast Lupton scale. The z-scale transformation is a standardization commonly employed in computer vision tasks, given by

R = A (i − Ī)/σ_I,

where I = (i + r + g)/3 with mean Ī and standard deviation σ_I, and R is the pixel value in the red channel (the green G and blue B channels are computed analogously from the r and g bands, respectively). We set A = 10³ for training and cast the images to 16-bit integers. In addition to z-scaling, we also apply a Lupton scaling from Lupton et al. (2004). This is an asinh scaling of the form

R = (i/I) · asinh[Q (I − I_min)/stretch]/Q,

with analogous expressions for G and B. We use a stretch of 0.5 and Q = 10, set the minimum I_min to zero, and cast the images to unsigned 8-bit integers. Lupton scaling brings out the fainter extended parts of galaxies while avoiding saturation in the bright central regions. These scalings preserve the colour information of objects to aid in classification. Lastly, we also use a high-contrast Lupton scaling, in which the image brightness and contrast are doubled after applying the Lupton scaling. We test all of these scalings for each network architecture. In Fig. 4, we show an example image and a histogram of pixel values in the i, r, and g bands (corresponding to RGB colours).
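The Lupton scaling matches the asinh stretch implemented in astropy, which can be used as below; the high-contrast variant and the exact casting/clipping of the z-scaled data are our own illustrative interpretations of the description above rather than the exact preprocessing code.

```python
# Sketch of the three contrast scalings; i_band, r_band, g_band are assumed 2D co-add arrays.
import numpy as np
from astropy.visualization import make_lupton_rgb
from PIL import Image, ImageEnhance

# Lupton asinh scaling (stretch = 0.5, Q = 10, minimum = 0); astropy returns uint8 RGB.
rgb_lupton = make_lupton_rgb(i_band, r_band, g_band, minimum=0, stretch=0.5, Q=10)

# High-contrast variant: double brightness and contrast after the Lupton scaling.
img = Image.fromarray(rgb_lupton)
img = ImageEnhance.Brightness(img).enhance(2.0)
rgb_lupton_hc = np.asarray(ImageEnhance.Contrast(img).enhance(2.0))

# z-scale-style standardization per channel, scaled by A = 1e3 and cast to 16-bit integers.
I = (i_band + r_band + g_band) / 3.0
zscaled = np.stack([1e3 * (band - I.mean()) / I.std()
                    for band in (i_band, r_band, g_band)], axis=-1)
rgb_zscale = np.clip(zscaled, -32768, 32767).astype(np.int16)
```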
Figure 4. Top row: RGB images in the HSC DR3 data set with different contrast scalings. The scalings are, from left to right: Lupton, Lupton high contrast, and z-scale. Bottom: histograms of pixel values for the corresponding image in the top row. Red, green, and blue represent values in the i, r, and g filters, respectively.
We apply data augmentation to the training and test sets. Data augmentation has become a staple of many deep learning methods. It allows the network to ‘see’ more information without needing to store extra images in memory. We employ spatial augmentations of random flips and 90° rotations. We do not employ blurring or noise addition, as the real data we train on are already convolved with a PSF and contain noise. For future generalizations of this framework to different data sets, blur/noise augmentations may be useful, but for inference on test data taken under the same conditions as the training data, spatial augmentations are sufficient. We also employ a random 50 per cent crop on each image during training so that the data can fit into GPU memory. We considered applying all contrast scalings as a data augmentation, but did not find a significant improvement in network performance. However, this could be used in future work to reduce training costs, as results were on par with networks trained with only one contrast scaling.
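These spatial augmentations can be expressed with detectron2's transform API, as in the sketch below; the flip probabilities and the relative crop size are illustrative, and in practice the returned transform must also be applied to the bounding boxes and masks (normally handled by a dataset mapper).

```python
# Sketch of the spatial augmentations: random flips, 90-degree rotations, 50 per cent crop.
import detectron2.data.transforms as T

augmentations = [
    T.RandomFlip(prob=0.5, horizontal=True, vertical=False),
    T.RandomFlip(prob=0.5, horizontal=False, vertical=True),
    T.RandomRotation(angle=[0, 90, 180, 270], sample_style="choice"),
    T.RandomCrop("relative", (0.5, 0.5)),
]

aug_input = T.AugInput(image)                        # image: HxWxC numpy array (assumed)
transform = T.AugmentationList(augmentations)(aug_input)
aug_image = aug_input.image                          # apply `transform` to boxes/masks too
```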
3.4 Training
Training is done using stochastic gradient descent to update the network weights by minimizing a loss function. The loss function of these Mask-RCNN models is

L = L_cls + L_box + L_mask,
where the classification loss Lcls is −log pu or the log of the estimated probability of an object belonging to its true class u. Discrete probability distributions are calculated per class (plus a background class) for each ROI. Lbox is a smoothed L1 loss calculated over the predicted and true bounding box coordinates as given in Girshick (2015). Finally, the mask loss Lmask is the per-pixel average binary cross-entropy loss between the ground truth and predicted masks.
All networks are pre-trained on either the MS-COCO (Lin et al. 2014) or ImageNet-1k (Deng et al. 2009) data sets of terrestrial images, and so we use transfer learning to apply these models to our astronomical data sets. Transfer learning (Tan et al. 2018) is a technique in deep learning that allows a network trained to do a task in a source domain to perform the task in a different target domain. It is often used when applying a pre-trained deep learning model to a different domain than the one seen during training. By using pre-trained weights as initial conditions, training is likely to converge faster and be less prone to overfitting. We use the weights provided by detectron2 as the starting point for our training procedure. We then train the networks for 50 total epochs, that is, the entire training set is seen 50 times by the network. In order to facilitate the transfer of knowledge, we first freeze the feature extraction backbones of the models and only train the head layers in the ROI and RPN networks for 15 epochs. We use a learning rate of 0.001 for this step. Then, we unfreeze the feature extraction backbone and train the entire network for 35 epochs. We begin this step with a learning rate of 0.0001 and decrease it by a factor of 10 every 10 epochs. To see if transfer learning introduced a bias from terrestrial image pre-training, we trained a model from scratch with randomized weights and compared it to a model initialized with pre-trained weights, using the same training schedule. We found the pre-trained model to yield the best results, indicating that this step is indeed helpful.
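The two-stage schedule can be sketched in plain PyTorch as below, assuming a detectron2-built model with a `.backbone` attribute; the optimizer choice shown here, the data loading, and the per-epoch training loop around it are omitted placeholders rather than the exact training code.

```python
# Schematic two-stage transfer-learning schedule: heads only, then the full network.
import torch

def set_backbone_trainable(model, trainable):
    for p in model.backbone.parameters():
        p.requires_grad = trainable

# Stage 1: freeze the backbone, train RPN/ROI heads for 15 epochs at lr = 1e-3.
set_backbone_trainable(model, False)
opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=1e-3)

# Stage 2: unfreeze everything, train 35 epochs at lr = 1e-4, decayed 10x every 10 epochs
# (assuming sched.step() is called once per epoch inside the omitted training loop).
set_backbone_trainable(model, True)
opt = torch.optim.SGD(model.parameters(), lr=1e-4)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.1)
```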
We use two NVIDIA Tesla V100 GPUs in the HAL system (Kindratenko et al. 2020) to train on 1000 images of 500 × 500 pixels paired with object annotations. When trained in parallel across the two GPUs, our models take roughly 3 h to complete training. Transformer architectures tend to use more memory, and thus are trained on four GPUs for roughly 4 h.
4 HSC RESULTS
After training, we evaluate network performance on the test set of HSC images. The test set is taken from the patches in the UltraDeep Cosmic Evolution Survey (COSMOS; Scoville et al. 2007) field and consists of 95 images of 1000 × 1000 pixels. No test set images were seen during training. A benefit of the instance segmentation models used in this work is their ability to infer on images of variable size. Thus, despite the need to crop images during training, we are still able to utilize the full size of the images in the test set.
We utilize two object classes, galaxy and star, and evaluate classification performance with precision and recall, given by

precision = TP/(TP + FP) and recall = TP/(TP + FN).
True positives (TP) are detections that have a network confidence score above a certain threshold and can additionally be matched to a ground truth object by having an Intersection over Union (IOU) above another threshold. Fig. 5 shows an example of how the IOU is calculated for a pair of objects. False negatives (FN) are those ground truth objects that do not have a corresponding detection. False positives (FP) are those detections with a high confidence score but no matching ground truth. The IOU is defined as

IOU = area(B_pred ∩ B_gt)/area(B_pred ∪ B_gt),
or the area of the intersection over the area of the union of the predicted and ground truth bounding boxes. Precision and recall are often broken down by class, or combined into one value, the AP score,

AP = Σ_r p(r) Δr,
where p(r) is the maximum precision in a recall bin of width Δr. AP scores are computed over Nthresh equally spaced IOU thresholds from 0.5 to 0.95 and averaged. Here, Nthresh = 51.
Figure 5. IOU scores are calculated between a ground truth bounding box (solid lines) and an inferred bounding box (dotted line) in the test set of images. An IOU score can range between 0 (no overlap) and 1 (perfect overlap). This can also be done with object segmentation maps instead of bounding boxes.
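For concreteness, a minimal bounding-box IOU function consistent with the definition above is given below, with boxes represented by their (x1, y1, x2, y2) corner coordinates.

```python
# Bounding-box IOU as used in the matching and AP computation.
def bbox_iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)          # intersection corners
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(bbox_iou((0, 0, 10, 10), (5, 5, 15, 15)))      # 25 / 175 ~ 0.143
```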
AP scores on the HSC COSMOS test set are reported for all network configurations in Table 2. We report the per-class AP score for stars and galaxies separately, as well as the Small, Medium, and Large AP scores, defined by object bounding box sizes of 0–32, 32–96, and >96 pixels², respectively. For galaxies and stars, the AP score can vary significantly across network configurations. For ResNet-based architectures, AP for galaxies is consistently higher than for stars, which may be due to the higher sample size of galaxies and morphological features that make galaxies easier to distinguish than compact stars. Among ResNet-based networks, a Lupton high-contrast scaling generally gives the highest galaxy AP score, while a z-scaling always gives the highest star AP score. It appears that these networks are very sensitive to the contrast scaling used, which is not desirable for application to other data sets with different dynamic ranges. However, transformer-based architectures perform more robustly with varying contrast scalings, and outperform ResNet architectures in almost all cases. For these networks, galaxy AP scores all lie within ∼50–52, showing a gain of about 5 over the highest performing ResNet configuration. Stellar AP scores for Lupton and z-scalings lie within ∼33–35, with high-contrast Lupton scalings performing worse by an AP of ∼8. Among the Small, Medium, and Large AP metrics, transformer-based networks also outperform ResNet-based networks, in some cases seeing massive gains in AP score. The networks generally perform better on the Small and Large object categories than on Medium objects, again likely due to sample size.
Table 2. AP scores on the HSC COSMOS test set for all network configurations (larger is better). Galaxy and Star AP scores are calculated separately, whereas Small (0–32 pixels²), Medium (32–96 pixels²), and Large (>96 pixels²) object AP scores are averaged across both classes. The first six model columns use ResNet-based backbones; MViTv2 and Swin use transformer-based backbones. The best result for each row is emphasized in bold. The MViTv2 backbone gives the best results in all cases except for one.

| Category | Scaling | R101C4 | R101dc5 | R101fpn | R50cas | R50def | X101fpn | MViTv2 | Swin |
|---|---|---|---|---|---|---|---|---|---|
| Galaxies | Lupton | 23.7 | 24.6 | 40.9 | 46.3 | 41.7 | 41.4 | **51.7** | 50.8 |
| Galaxies | LuptonHC | 26.1 | 28.0 | 43.6 | 46.0 | 43.2 | 43.1 | **50.9** | 50.3 |
| Galaxies | zscale | 22.9 | 30.7 | 40.2 | 39.6 | 21.8 | 34.1 | **52.7** | 52.5 |
| Stars | Lupton | 10.3 | 9.6 | 7.3 | 7.4 | 4.3 | 2.5 | **34.1** | 33.9 |
| Stars | LuptonHC | 2.4 | 5.1 | 6.1 | 8.1 | 5.5 | 8.3 | **28.0** | 25.0 |
| Stars | zscale | 15.6 | 10.5 | 17.9 | 25.5 | 12.7 | 17.2 | **35.8** | 33.9 |
| Small | Lupton | 17.6 | 18.0 | 26.1 | 28.0 | 24.6 | 23.7 | **43.7** | 43.1 |
| Small | LuptonHC | 14.8 | 17.2 | 25.9 | 27.7 | 25.4 | 26.9 | **40.1** | 38.4 |
| Small | zscale | 19.7 | 21.5 | 30.2 | 33.2 | 18.1 | 26.8 | **44.8** | 43.8 |
| Medium | Lupton | 8.7 | 11.9 | 14.4 | 11.5 | 13.7 | 11.7 | **17.4** | 16.1 |
| Medium | LuptonHC | 7.8 | 11.1 | 13.4 | 12.7 | 10.3 | 12.6 | **16.3** | 15.5 |
| Medium | zscale | 3.8 | 9.0 | 7.2 | 7.3 | 1.6 | 3.6 | **15.1** | 14.9 |
| Large | Lupton | 16.4 | 30.9 | 18.9 | 14.3 | 19.6 | 9.3 | **43.1** | 41.5 |
| Large | LuptonHC | 15.3 | 22.8 | 14.9 | 15.0 | 11.6 | 13.0 | 38.6 | **39.7** |
| Large | zscale | 0.7 | 3.6 | 3.8 | 5.2 | 0.1 | 0.9 | **37.8** | 37.0 |
Many studies of instance segmentation models use the MS-COCO or ImageNet-1k data sets as a benchmark to judge performance through the AP score. These data consist of terrestrial images with many object classes, so they cannot necessarily be used as a direct comparison for our AP scores calculated on astronomical survey images with only two classes. However, to give the reader a sense of the range of typical values, the AP scores for models trained on terrestrial data typically range from ∼35 to 45 for convolutional backbones and push to ∼55 for transformer backbones (see the detectron2 repository for results). For a fairer comparison, we look to Burke et al. (2019), in which instance segmentation models were tested on simulated observations from the Dark Energy Camera (DECam; Flaugher et al. 2015). The authors report an AP score of 49.6 for galaxies and 48.6 for stars, averaged to a combined score of 49.0. We also train our suite of models on the DECam data set and report the results in Appendix A. More recently, He et al. (2021) use a combination of the instance segmentation model YOLOv4 (Bochkovskiy et al. 2020) and a separate classification network to perform source detection and classification on SDSS images. They report an AP score of 52.81 for their single-class detection network.
4.1 Incorrect label bias mitigation
There is an inherent bias in our measure of AP scores due to incorrect object class labels. In measurements described above, we test the network abilities to infer classes based on labels generated from HSC catalogues. However, these labels are known to become unreliable, especially for stars, around i-band magnitudes of ∼24 mag (Bosch et al. 2018). We use HSC co-adds in the COSMOS field for our test data set, and attempt to mitigate this mislabelling bias by exploiting the overlap of this field with space-based observations using the Advanced Camera for Surveys (ACS) on the HST. Because of the lack of atmospheric seeing, morphological classification of stars/galaxies using the HST COSMOS catalogue data is much more precise for faint objects, and can be used as ground truth instead of HSC labels. This will test how much poor classification behaviour is due to label generation as opposed to limitations of the models. We generate HST labels by cross-matching detected sources to the catalogue of Leauthaud et al. (2007) within 1 arcsec. If there is no object within 1 arcsec, we discard the object. There is not necessarily a one-to-one match of HSC versus HST labels, as we are cross-matching to different catalogues, but the number of objects per image remains roughly the same for either labelling scheme. We will refer to this as the HST COSMOS test set.
Given the size of the HST/HSC overlap in the COSMOS field and the size of our cutouts, there is not enough coverage to produce a sufficiently large training and test set with HST labels. Instead, we take the models trained on HSC-labelled data and test their evaluation performance on the HST COSMOS test set. To highlight the differences in class label generation, in Fig. 6 we show the number of stars and galaxies as a function of HSC i-band magnitude for the COSMOS set for both HSC and HST class labels. The unreliable quality of HSC labels at faint magnitudes is reflected in the increased counts of stars, especially the bump in stellar counts beginning at i ∼ 25 mag. Also of note are the lower star counts in the HSC COSMOS set at bright magnitudes. This is likely due to our HSC label generation procedure of discarding objects with NaN values in the HSC catalogue. Bright stars are likely to have saturated pixels in their centres, causing these error flags to appear. With HST labels, we can test against a more astrophysically accurate baseline.
Figure 6. Galaxy and star counts for our COSMOS set, with labels generated from HSC and HST catalogues. The extra counts of HSC stars at faint magnitudes are due to galaxy contamination when classification is based on the extendedness metric. The low sample of bright HSC stars follows from our catalogue matching procedure of excluding objects with NaN values.
Using this new test set, we present AP scores in Table 3. The results for galaxy/star AP scores are in line with the previous results on the HSC COSMOS test set. In all cases, transformer architectures outperform ResNet architectures and are more robust to different contrast scalings. AP scores for Small bounding box objects improve for all network configurations, Medium bounding box AP scores remain roughly the same, and Large bounding box AP scores worsen. The decrease in Large bounding box AP scores is likely due to the initial label generation step with sep, which overdeblends or ‘shreds’ large extended galaxies and saturated regions around stars. With our HSC label generation, we exclude many of the shredded regions by enforcing the i-band Δ1 mag criterion and discarding labels matched to saturated catalogue objects with NaN values. However, our HST label generation is solely based on a distance matching criterion, and so some of these shredded regions are included in the ground truth labels in the HST COSMOS test set. These spurious extra labels can lead to lower AP scores if the networks avoid shredding these regions at inference. In the next section, we examine metrics other than the AP score that are less susceptible to this effect.
Table 3. AP scores on the HST COSMOS test set for all network configurations (larger is better). Galaxy and Star AP scores are calculated separately, whereas Small (0–32 pixels²), Medium (32–96 pixels²), and Large (>96 pixels²) object AP scores are averaged across both classes.

| Category | Scaling | R101C4 | R101dc5 | R101fpn | R50cas | R50def | X101fpn | MViTv2 | Swin |
|---|---|---|---|---|---|---|---|---|---|
| Galaxies | Lupton | 25.9 | 26.8 | 42.9 | 49.4 | 43.5 | 42.8 | 51.8 | 52.4 |
| Galaxies | LuptonHC | 27.4 | 30.0 | 46.2 | 50.2 | 46.7 | 44.3 | 51.5 | 51.6 |
| Galaxies | zscale | 25.5 | 32.5 | 42.7 | 41.5 | 23.0 | 35.6 | 52.2 | 52.9 |
| Stars | Lupton | 16.2 | 15.0 | 10.9 | 10.9 | 7.1 | 3.8 | 52.9 | 53.7 |
| Stars | LuptonHC | 4.2 | 7.9 | 11.2 | 14.2 | 9.4 | 13.9 | 42.1 | 37.7 |
| Stars | zscale | 28.3 | 19.1 | 29.3 | 41.6 | 23.8 | 29.0 | 53.9 | 52.6 |
| Small | Lupton | 22.0 | 22.1 | 29.3 | 31.4 | 27.0 | 25.2 | 54.0 | 54.7 |
| Small | LuptonHC | 16.4 | 19.9 | 30.0 | 33.3 | 29.4 | 30.7 | 48.2 | 46.0 |
| Small | zscale | 28.0 | 27.1 | 37.8 | 42.9 | 24.8 | 34.1 | 54.7 | 54.3 |
| Medium | Lupton | 8.3 | 11.7 | 13.8 | 11.0 | 13.1 | 11.1 | 16.3 | 15.2 |
| Medium | LuptonHC | 7.5 | 10.8 | 12.7 | 12.2 | 9.9 | 12.0 | 15.4 | 14.6 |
| Medium | zscale | 3.7 | 8.5 | 7.3 | 7.4 | 1.7 | 3.6 | 14.1 | 14.1 |
| Large | Lupton | 6.2 | 11.1 | 7.2 | 5.9 | 7.2 | 3.6 | 15.1 | 15.0 |
| Large | LuptonHC | 5.4 | 7.9 | 5.3 | 4.8 | 4.4 | 4.8 | 13.7 | 14.0 |
| Large | zscale | 0.3 | 1.2 | 1.3 | 1.9 | 0.1 | 0.2 | 13.6 | 13.5 |
4.2 Missing and extra label bias mitigation
Since we have done the labelling ourselves using sep, scarlet and catalogue matching to produce ground truth detections, masks, and classes, traditional metrics of network performance may not be the best choice in characterizing efficacy. Consider the precision/recall and AP metric. An implicit assumption in these metrics is the completeness and purity of the ground truth labels. This assumption holds for large annotated sets of terrestrial images such as the MS-COCO set (Lin et al. 2014) commonly used as a benchmark in object detection/segmentation studies. It also holds for simulated data sets of astronomical images (Burke et al. 2019) as the ground truth object locations, masks, and classes are all known a priori when constructing the training and test set labels. However, real data of large astronomical scenes present a challenge. Given that we must generate labels without a known underlying truth, any comparisons to this ‘ground truth’ are really comparisons to the methods used to generate these labels. Issues in the label generating procedures will propagate to the performance metrics.
First, the ground truth detections are produced by running sep with a detection threshold of 5σ above the background. This causes a lack of complete labels, as some objects are missed. We could lower this threshold, but we would then run the risk of further overdeblending extended/saturated objects. This leads to the second issue: there will still remain some level of shredding that causes spurious extra objects to appear in the ground truth set, that is, a lack of pure labels. If the networks do not shred extended/saturated objects as much as sep (which is a desirable feature of the networks), then the AP metric will be lower because the networks produce fewer spurious detections than the ground truth contains. Finally, the object detection mechanisms of the neural networks used in this work are fundamentally different from the peak-finding detection used in sep.
These issues lead to cases in which the neural networks detect objects that are not labelled in our ground truth catalogue, despite being actual objects, or cases in which the networks do not detect unphysical objects that are in the ground truth. Any metric that considers true/false detections is subject to this effect. We do not wish to count these cases of fake TP/FPs, as this would lead to a reduction in performance metrics that does not reflect network classification/detection accuracy, but rather the limitations of our label generation. Therefore, we construct a set of metrics similar to the canonical precision and recall, but slightly alter our definitions of positive and negative detections. We use the definitions of precision and recall given above, but limit our metrics to the set of objects D that are matched to a ground truth detection. The set of matched detections D is determined by selecting the inferred bounding box with the highest IOU to a ground truth bounding box, above a threshold of 0.5. Then, for a given class C, TPs are the objects in D that are correctly classified, FPs are objects that are incorrectly assigned class C, and FNs are matched objects with a ground truth class C that the network assigns to a different class. With these metrics, precision and recall measure purely the classification power of the network, without bias from missing labels or extra false labels. If we assume that the network’s ability to classify remains consistent for objects outside of the matched set, we can generalize these metrics to overall classification performance.
We combine precision and recall into one metric to judge classification power, the F1 score, which is given by the harmonic mean of precision and recall,

F1 = 2 × (precision × recall)/(precision + recall).
The F1 score balances the trade-off between precision and recall, with a value close to unity being desirable. We report the F1 scores for the networks on the HST COSMOS test set in Table 4. The best-performing configuration among the ResNet architectures is the R50cas network with a z-scale scaling. A Swin network with a Lupton scaling achieves the highest overall galaxy and star F1 scores, although the MViTv2 architecture remains competitive. Nearly all transformer network configurations perform better on star/galaxy classification than the ResNet-based networks. The classification power of transformer-based networks is again more robust to contrast scalings than that of ResNet-based networks.
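A compact sketch of these matched-set metrics is given below; `matched_pairs` is assumed to be the list of (true class, predicted class) pairs produced by the IOU > 0.5 matching step described above, so missing or spurious ground truth labels never enter the counts.

```python
# Matched-set classification metrics: precision, recall, and F1 computed only over
# detections matched to a ground truth box by IOU.
def classification_f1(matched_pairs, cls):
    """matched_pairs: list of (true_class, predicted_class) for IOU-matched objects."""
    tp = sum(1 for t, p in matched_pairs if t == cls and p == cls)
    fp = sum(1 for t, p in matched_pairs if t != cls and p == cls)
    fn = sum(1 for t, p in matched_pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

f1_star = classification_f1(pairs, "star")   # `pairs` built from the matching step
```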
Table 4. F1 scores for star and galaxy classes in the HST COSMOS test set, computed for all network configurations. Columns R101C4 through X101fpn use ResNet backbones; MViTv2 and Swin use transformer backbones. The best F1 score in each row is shown in bold. Transformer networks outperform convolutional networks in all cases, especially for stars.

| Class | Scaling | R101C4 | R101dc5 | R101fpn | R50cas | R50def | X101fpn | MViTv2 | Swin |
|---|---|---|---|---|---|---|---|---|---|
| Galaxies | Lupton | 0.963 | 0.981 | 0.982 | 0.983 | 0.980 | 0.979 | **0.994** | **0.994** |
| | LuptonHC | 0.975 | 0.978 | 0.980 | 0.981 | 0.981 | 0.980 | **0.991** | 0.990 |
| | zscale | 0.982 | 0.981 | 0.982 | 0.986 | 0.969 | 0.980 | **0.994** | **0.994** |
| Stars | Lupton | 0.458 | 0.475 | 0.334 | 0.327 | 0.215 | 0.145 | 0.881 | **0.884** |
| | LuptonHC | 0.233 | 0.330 | 0.325 | 0.397 | 0.294 | 0.375 | **0.800** | 0.751 |
| | zscale | 0.690 | 0.571 | 0.615 | 0.763 | 0.603 | 0.643 | **0.873** | 0.869 |
To examine network performance on faint objects, we show precision and recall as a function of i-band magnitude for the HST COSMOS test set in Fig. 7. Galaxy recall maintains a value close to one regardless of magnitude, with fluctuations of a few per cent for some models. Galaxy precision dips for some models at bright magnitudes, which may be due to compact galaxies with bright cores resembling stars. However, these dips are more likely due to inherent limitations of the models rather than label generation, as transformer architectures produce high galaxy precision and recall across magnitude bins compared to ResNet architectures. Most ResNet architectures struggle with stellar recall, with many showing poor performance even at bright magnitudes. Stellar precision reaches near unity at bright magnitudes for all architectures, but many network configurations begin to drop in performance around i-band magnitudes of 21 mag. The best-performing networks maintain a stellar precision above 0.8 out to ∼25 mag in the i band. The transformer models we trained achieve a 99.6 per cent galaxy recall, 99.2 per cent galaxy precision, 85.4 per cent stellar recall, and 91.5 per cent stellar precision on our HST COSMOS test set, averaged over the whole magnitude range. For comparison, He et al. (2021) perform deep neural network object detection and classification of stars, galaxies, and quasars in large SDSS images. With their sample of objects covering an r-band magnitude range of 14–25 mag, they report a galaxy recall of 95.1 per cent, galaxy precision of 95.8 per cent, stellar recall of 84.6 per cent, and stellar precision of 94.5 per cent.
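The magnitude-binned curves in Fig. 7 can be reproduced with a simple binning of the matched-set outcomes; the sketch below assumes per-object i-band magnitudes and boolean TP/FP/FN flags (hypothetical array names) have already been computed as described above.

```python
import numpy as np

def binned_precision_recall(mags, is_tp, is_fp, is_fn, bin_edges):
    """Precision and recall in magnitude bins.

    mags:   array of i-band magnitudes, one per matched object
    is_tp, is_fp, is_fn: boolean arrays marking each object's outcome
    bin_edges: magnitude bin edges, e.g. np.arange(18.0, 26.5, 0.5)
    """
    precision, recall = [], []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        sel = (mags >= lo) & (mags < hi)
        tp = np.sum(is_tp & sel)
        fp = np.sum(is_fp & sel)
        fn = np.sum(is_fn & sel)
        precision.append(tp / (tp + fp) if (tp + fp) else np.nan)
        recall.append(tp / (tp + fn) if (tp + fn) else np.nan)
    return np.array(precision), np.array(recall)
```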

Figure 7. Top: galaxy precision/recall metrics as a function of object magnitude in the HST i band. Line styles represent different network architectures following the legend, and colours indicate which contrast scaling was used (red for Lupton, blue for LuptonHC, and black for z-scale). The black vertical line indicates the Deep/UltraDeep i-band 5σ magnitude of 26.9 mag. The y-axis is truncated to better show the differences across the models. Bottom: stellar precision/recall metrics as a function of object magnitude in the HST i band.
Some images contain artefacts such as bleed trails, diffraction spikes, or ‘ghost images’ often appearing around bright stars. We do not expect the networks to misidentify these artefacts as real objects, as they are largely excluded from our training set due to the label generation procedure described in Section 3.2. Indeed, the networks largely ignore these artefacts, seen for example in Fig. 8. Explicit identification of artefacts is possible, as in Tanoglidis et al. (2022) where the authors use a Mask-RCNN architecture to identify ghosts and other artefacts in DES images.

Figure 8. Left: an example image containing multiple artefacts, including blooming and optical ghosts around the bright star in the upper right and large ghosts in the lower middle of the image. Right: the inference results of an MViTv2 Lupton-scaled network. While identifying some objects within the regions of these artefacts, the network does not classify the artefacts themselves as legitimate objects. Bright regions around stars and other artefacts (regardless of location) are generally ignored, as they are not labelled in the training set and thus not ‘seen’ by the networks.
4.3 Deblending
In order to quantify deblending performance of the networks, we compute IOU scores for matched objects. The process is similar to the matching done in computing classification precision/recall. We first set a detection confidence threshold of 0.5 and then compute the bounding box IOUs for all detected and ground truth objects. For each ground truth object, we take the corresponding detected object with the highest IOU above a threshold of 0.5. We employ this threshold to avoid the biases discussed in Section 4.2. An IOU of one indicates a perfect match between the ground truth box and the inferred box. In addition to bounding box IOU, we also compute the segmentation mask IOU, which follows from equation (6), but uses the area of the true and predicted segmentation masks. We report the median IOU for all matched objects in Table 5, and show the distributions in Fig. 9. Transformer-based networks generally produce a higher bounding box IOU than ResNet-based networks, although the R50cas, R101fpn, and X101fpn networks remain competitive. Segmentation mask IOUs are lower than bounding box IOUs in all cases. This indicates that while the networks are able to identify overall object sizes quite well, the finer details of object shapes within the bounding boxes are not as well inferred.
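The segmentation mask IOU follows the same form as the bounding box IOU but uses mask areas; a minimal sketch (assuming boolean masks on a common pixel grid, and hypothetical lists of matched ground truth/predicted mask pairs) is:

```python
import numpy as np

def mask_iou(mask_true, mask_pred):
    """IOU of two boolean segmentation masks defined on the same pixel grid."""
    intersection = np.logical_and(mask_true, mask_pred).sum()
    union = np.logical_or(mask_true, mask_pred).sum()
    return intersection / union if union > 0 else 0.0

# For matched object pairs (built as in the matching sketch above), the
# reported statistic is the median over all matches, e.g.:
# median_mask_iou = np.median([mask_iou(t, p) for t, p in matched_mask_pairs])
```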

Figure 9. Bounding box IOUs of each detected object that is matched to a ground truth object. Top: results for ResNet backbones. Bottom: results for transformer backbones. The left column shows Lupton scaling, the middle column Lupton high-contrast scaling, and the right column z-scaling.
Table 5. Median bounding box IOUs for matched objects in the COSMOS HST test set. The best bounding box IOU in each row is shown in bold; median segmentation mask IOUs are given in parentheses. An IOU above 0.5 is considered a good match, and a score of 1.0 indicates perfect overlap between ground truth and inference. Columns R101C4 through X101fpn use ResNet backbones; MViTv2 and Swin use transformer backbones.

| Scaling | R101C4 | R101dc5 | R101fpn | R50cas | R50def | X101fpn | MViTv2 | Swin |
|---|---|---|---|---|---|---|---|---|
| Lup | 0.75 (0.61) | 0.78 (0.57) | 0.93 (0.63) | **0.94** (0.62) | 0.93 (0.64) | 0.93 (0.64) | **0.94** (0.64) | **0.94** (0.64) |
| LupHC | 0.76 (0.61) | 0.79 (0.58) | 0.93 (0.64) | **0.94** (0.64) | 0.93 (0.64) | 0.93 (0.64) | **0.94** (0.64) | **0.94** (0.64) |
| Zscale | 0.78 (0.61) | 0.81 (0.59) | 0.92 (0.62) | 0.93 (0.63) | 0.82 (0.65) | 0.91 (0.64) | **0.94** (0.65) | **0.94** (0.65) |
The median IOUs measure the ability of the networks to detect and segment objects, but they do not fully capture the deblending power of the networks. We examine a few close blends to get a sense of the networks’ ability to distinguish large overlapping objects. We demonstrate the deblending capabilities of the different networks in Fig. 10. In very crowded scenes, the networks are able to distinguish the individual sources, and even pick up objects that are not present in the labelled set, which may present an advantage for studies of low surface-brightness galaxies. As discussed in Section 4.2, this is likely due to the difference in object detection abilities of the RPNs compared to peak-finding methods, and highlights that the models are not limited by the training data but are able to extrapolate beyond it. It is also possible to alter inference hyperparameters such as the IOU or detection confidence thresholds, which can allow for more or fewer detections, or more or less overlap between detections. In Fig. 11, we demonstrate the effect of lowering the confidence threshold hyperparameter, allowing more low-confidence detections. While not equivalent, this is similar to lowering the detection threshold in peak-finding algorithms. There are cases in which deblending is poor, and these are typically very large galaxies with one or more very large and very close companions. In such instances, it may be better to use a different contrast scaling. In Fig. 12, a Lupton contrast scaling prevents the network from deblending multiple large sources. With the same IOU/confidence score thresholds, a z-scaling better isolates the two sources. This is likely due to the much larger dynamic range of our z-scaling, which allows for less smearing of the sources and more distinguishing power in this case. Overall, there does not seem to be a one-size-fits-all network configuration for the cases of very large and very close blends. Training on more data would likely improve the ability to detect and segment these objects.
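For reference, the sketch below shows how such inference-time thresholds can be adjusted in detectron2 after training; the config file and weight paths are placeholders, and this is an illustrative setup for a Mask R-CNN-style configuration rather than our exact training scripts.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
# Load a baseline Mask R-CNN config; a trained deepdisc config/weights
# would be substituted here (placeholder paths).
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = "path/to/trained_model.pth"  # placeholder

# Inference hyperparameters: lowering the score threshold keeps more
# low-confidence detections; raising the NMS IOU threshold allows more
# overlap between retained detections.
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.3
cfg.MODEL.ROI_HEADS.NMS_THRESH_TEST = 0.5

predictor = DefaultPredictor(cfg)
# outputs = predictor(image)          # image: HxWx3 array after contrast scaling
# instances = outputs["instances"]    # boxes, masks, classes, and scores
```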

Figure 10. Inference on a close blend. The ground truth is shown on the left. RGB images are created with a Lupton contrast scaling. The other panels show model inference of segmentation maps and classes. Top row, left to right: R101C4, R101dc5, and R101fpn. Bottom row, left to right: R50cas, R50def, and X101fpn. The colours indicate classes: green for galaxy and red for star. Differences in detections are solely due to the different backbones. While the networks do not pick up every ground truth object, they are also able to detect real objects that were missed by our ground truth labelling.

Figure 11. Inference on the same close blend as in Fig. 10, but only with a Swin architecture. The ground truth is shown in the leftmost panel, and the effect of lowering the detection confidence threshold to 0.5, 0.4, and 0.3 is shown from left to right, respectively. As the threshold is lowered, objects within a larger footprint are detected.

Figure 12. The effect of using a different contrast scaling on a close blend. We show inference of an R50cas network when trained on Lupton-scaled images (left) and z-scaled images (right). The objects are more easily distinguished with a z-scaling.
5 DISCUSSION
The effectiveness of instance segmentation models has been proven in many domains, boosted by the ability of networks to work ‘out-of-the-box’ without much fine-tuning. It has been shown that an object detection model based on the Mask R-CNN framework performs well in the classification and detection/segmentation of simulated astronomical survey images (Burke et al. 2019). In this work, we have trained and tested a broad range of state-of-the-art instance segmentation models on real data taken from the HSC SSP DR 3 to push deep learning-based galaxy detection, classification, and deblending towards real applications. Network training and evaluation performance is limited by the efficacy of our label generation methodology, a task not easily formulated when the ground truth is not completely known. This limitation also affects the choices of metrics we use to measure network performance. Classification and detection power are often combined into the AP score used throughout the instance segmentation literature. However, this may not be the best choice of metric for comparisons, as it implicitly assumes the completeness and correctness of the ground truth labels. To mitigate the effects of incorrect labels on performance metrics, we construct a test set of objects with class labels determined from more accurate space-based HST observations. However, since the AP metric artificially suffers from detections of ‘FPs’ that are true objects simply missing from the labelled set and/or from the presence of spurious ground truth detections, we further mitigate this bias by restricting performance metrics to detected objects that have a matched ground truth label.
We find that all networks perform well at classifying galaxies, even out to the faintest in the sample. Despite the wide variety of colours, sizes, and morphologies in the real imaging data, our models can identify these objects. Stellar classification is worse, likely due to the smaller sample size in the training and test sets. Transformer-based networks generally outperform ResNet-based networks in classification power for both stars and galaxies. They also appear to be more robust classifiers as magnitudes become fainter. Transformer-based models maintain near 100 per cent completeness (recall) and purity (precision) of galaxy selection across the whole sample, and above 60 per cent completeness and 80 per cent purity for stars out to i-band magnitudes of 25 mag. These models outperform the extendedness classifier used in the HSC catalogues, which, depending on cuts, yields near 100 per cent galaxy purity, roughly 90 per cent galaxy completeness, stellar completeness slightly above 50 per cent, and stellar purity slightly above 40 per cent at i-band magnitudes of 25 mag (Bosch et al. 2018). The performance increase of our models is especially noteworthy because they are able to surpass the HSC class labelling despite being trained with it. Transformer models are also more robust to different contrast scalings than traditional convolutional neural networks, indicating that they may be less susceptible to biases introduced from transfer learning with terrestrial images and more applicable to a wide range of images across surveys with different dynamic ranges.
The detection/deblending capabilities are measured by the median bounding box IOUs of the networks. Again, transformer-based networks generally outperform convolutional ResNet-based networks. The improved performance of transformer networks over convolutional-based ones may be attributable to the ability of different attention heads to encode information at different image scale sizes (Dosovitskiy et al. 2020), allowing for more overall global information propagation than CNNs. While a convolutional neural network is able to learn spatial features through sliding a kernel across an image, a transformer learns features over the entire input at once, removing any limitations due to kernel sizes. It is possible that the transformer backbones are implicitly utilizing large-scale features in the images such as the spatial clustering of objects, background noise or seeing and using these bulk properties to inform the network.
We examine a few cases of close blends to qualitatively see how the networks distinguish objects. There are cases in which the networks do not detect close objects, but these can sometimes be mitigated by altering the confidence and NMS IOU threshold hyperparameters (which can be done after training). In other cases, using a different contrast scaling helps to isolate closely blended objects.
There is room to improve both the classification and segmentation of these models in future work. One possibility is constructing a larger training set with more accurate labels. With better and larger samples of stars/galaxies, networks may perform better on classification. The more close blends of large galaxies are seen during training, the more likely the networks will be able to distinguish these scenes. More fine-tuning of hyperparameters could also be done before training, rather than running the architectures out of the box. Additionally, the use of more photometric information could help in all tasks. We use the i, r, and g bands of the HSC instrument in this work, corresponding to RGB colour images, but could further investigate the performance if we include the z and y bands.
It is possible that these networks need to be trained longer, or that the fundamentally different properties of astronomical images compared to terrestrial ones limit the ability of these architectures to extract useful features for classification. Despite our attempts to mitigate measurement biases arising from label generation, classification remains a challenge for these models at faint magnitudes. A machine learning model has already been used to classify HSC data using photometric information with better accuracy than morphological methods, but it relies on the upstream task of detection (Bosch et al. 2018). The instance segmentation models presented in this work are able to identify and assign classes after training using only an image as input.
6 CONCLUSIONS
It is a necessary consequence of the current epoch of astronomical research that machine learning algorithms must parse massive sets of images. A first step in catalogue construction is detecting objects in imaging data. Advancements in the broader computer vision community have given rise to a large ecosystem of models that perform many necessary tasks at once, including detection, segmentation, and classification. While tried and tested on terrestrial data and shown to work on simulated astronomical data, their application to real survey images remains a work in progress. Many methods rely on the object detection stage to produce measurements of individual objects. In this work, we employ a variety of instance segmentation models available through detectron2 to perform detection, deblending, and object classification simultaneously on images taken from the HSC-SSP DR 3. We carefully construct ground truth labels with existing frameworks and catalogue matching, and caution that real data give no straightforward way of producing labels. We find that the best networks perform well at classifying even the faintest galaxies in the sample, and perform better than traditional methods at classifying stars out to i-band magnitudes of ∼25 mag. We find that even when trained on less accurate class labels, the neural networks still pick up on useful features that allow inference of the true underlying class. We expect more data with accurate labels to improve performance. The best-performing models detect and deblend objects whose locations and bounding boxes closely match the ground truth. Transformer networks appear to be a promising avenue of exploration in further studies.
There are many other areas for future study. While we tested a variety of models, there are many within detectron2 that we did not implement. Some architectures are quite large and require significant resources to train. For example, we attempted to implement ViT backbones (Dosovitskiy et al. 2020) among our set of transformer-based architectures, but were limited by the available GPU memory. Many models, especially transformers, are trained with state-of-the-art computing resources at FAIR or other organizations, and subsequently retraining them demands significant resources. Tests could be done on other sets of real data, with other downstream tasks in mind. For example, González et al. (2018) investigate the application of instance segmentation models on SDSS data to classify galaxy morphologies. It would be straightforward to add additional classes, or implement a redshift estimation network using the modular nature of detectron2. In future work, we plan to add a photo-z estimator branch to the Mask-RCNN/transformer networks and interface with the LSST software rail (Redshift Assessment Infrastructure Layers).2 The availability of realistic LSST-like simulations (LSST Dark Energy Science Collaboration (LSST DESC) 2021) for training will allow us to avoid biases from label generation. The efficiency of neural networks and the ability to perform multiple tasks at once is now a necessity with the amount of survey data pouring into pipelines.
As surveys push deeper into the sky, they will produce unprecedented numbers of objects that will need to be processed. LSST will provide the deepest ground-based observations ever and survey terabytes of data every night, highlighting the need for accurate and precise object detection and classification, potentially in real time. Correctly classifying and deblending sources will be necessary for a wide range of studies, and deep instance segmentation models will be a valuable tool in handling these tasks.
ACKNOWLEDGEMENTS
We thank Dr S. Luo and Dr D. Mu at the National Center for Supercomputing Applications (NCSA) for their assistance with the GPU cluster used in this work. We thank Y. Shen for helpful discussion on the HST observations of the COSMOS field. We thank the anonymous referees for helpful comments. GM, YL, YL, and XL acknowledge support from the NCSA Faculty Fellowship, the NCSA Students Pushing Innovation (SPIN) programs, and the National Science Foundation (NSF) grant AST-2308174.
This work utilizes resources supported by the National Science Foundation’s Major Research Instrumentation program, grant no. 1725729, as well as the University of Illinois at Urbana-Champaign.
We acknowledge use of matplotlib (Hunter 2007), a community-developed python library for plotting. This research made use of astropy,3 a community-developed core python package for Astronomy (Astropy Collaboration 2013; Price-Whelan et al. 2018). This research has made use of NASA’s Astrophysics Data System.
The HSC collaboration includes the astronomical communities of Japan and Taiwan, and Princeton University. The HSC instrumentation and software were developed by the National Astronomical Observatory of Japan (NAOJ), the Kavli Institute for the Physics and Mathematics of the Universe (Kavli IPMU), the University of Tokyo, the High Energy Accelerator Research Organization (KEK), the Academia Sinica Institute for Astronomy and Astrophysics in Taiwan (ASIAA), and Princeton University. Funding was contributed by the FIRST program from Japanese Cabinet Office, the Ministry of Education, Culture, Sports, Science and Technology (MEXT), the Japan Society for the Promotion of Science (JSPS), Japan Science and Technology Agency (JST), the Toray Science Foundation, NAOJ, Kavli IPMU, KEK, ASIAA, and Princeton University.
This paper makes use of software developed for the Large Synoptic Survey Telescope. We thank the LSST Project for making their code available as free software at http://dm.lsst.org.
This paper is based on data collected at the Subaru Telescope and retrieved from the HSC data archive system, which is operated by the Subaru Telescope and Astronomy Data Center (ADC) at NAOJ. Data analysis was in part carried out with the cooperation of Center for Computational Astrophysics (CfCA), NAOJ. We are honored and grateful for the opportunity of observing the Universe from Maunakea, which has the cultural, historical and natural significance in Hawaii.
The Pan-STARRS1 Surveys (PS1) have been made possible through contributions of the Institute for Astronomy, the University of Hawaii, the Pan-STARRS Project Office, the Max-Planck Society and its participating institutes, the Max Planck Institute for Astronomy, Heidelberg and the Max Planck Institute for Extraterrestrial Physics, Garching, The Johns Hopkins University, Durham University, the University of Edinburgh, Queen’s University Belfast, the Harvard-Smithsonian Center for Astrophysics, the Las Cumbres Observatory Global Telescope Network Incorporated, the National Central University of Taiwan, the Space Telescope Science Institute, the National Aeronautics and Space Administration under grant no. NNX08AR22G issued through the Planetary Science Division of the NASA Science Mission Directorate, the National Science Foundation under grant no. AST-1238877, the University of Maryland, and Eotvos Lorand University (ELTE), and the Los Alamos National Laboratory.
Based [in part] on data collected at the Subaru Telescope and retrieved from the HSC data archive system, which is operated by Subaru Telescope and Astronomy Data Center at National Astronomical Observatory of Japan.
This research has made use of the NASA/IPAC Infrared Science Archive, which is funded by the National Aeronautics and Space Administration and operated by the California Institute of Technology.
DATA AVAILABILITY
The data underlying this article were accessed from the HSC data archive system https://hsc-release.mtk.nao.ac.jp/doc/. The derived data generated in this research will be shared on reasonable request to the corresponding author. The software used in this work is publicly available at https://github.com/grantmerz/deepdisc.
Footnotes
See https://github.com/facebookresearch/Detectron2/tree/main/projects for a comprehensive list of projects.
References
APPENDIX A: DECAM RESULTS
For a baseline comparison of network performances, we utilize the PhoSim data set created and used by Burke et al. (2019). We refer to the earlier work for a full description, but provide a brief summary here. Crowded fields as taken with DECam are produced using the photon simulator code (Peterson et al. 2015). Simulations account for equipment optics (Cheng 2017), telescope optics (Flaugher et al. 2015), and atmospheric conditions. Spiral, elliptical, and irregular galaxies are produced by sampling 3D Sérsic profiles with additional parameters for extra morphological features. Stars are modelled as point sources and created following an initial mass function from Kroupa (2001). For both stars and galaxies, SEDs and metallicities are assigned based on physical models. The cosmic star formation history (Madau & Dickinson 2014) is used to assign galaxy number density and population, while the distribution of stars is based on galactic latitude. To simulate crowded fields, the galactic overdensity is boosted by a factor of 4. A 512 × 512 pixel² image is produced in the DECam g, r, and z bands. Integration times and magnitude ranges are assigned to roughly correspond to DECaLS DR7 co-adds (Dey et al. 2019). In order to assign object masks, a g-band image without background is produced for every object in the field. The PSF is configured to ∼1 arcsec. In total, 1000 images are produced for our training set, while an additional 250 are used for validation and another 50 as our test set for evaluation. Each image contains roughly 150 objects.
Here, we present the results of two runs on the simulated DECam data, using the R101fpn and MViTv2 backbones. These backbones are chosen to compare the performance of convolutional versus transformer-based architectures. We use the same contrast scalings that were applied to the HSC data, but change the stretch parameter to 100 and Q to 10 for the Lupton and Lupton high-contrast scalings. The dynamic range of the simulated data is different from that of the HSC data, so this adjustment is made to make galaxy features more distinguishable. AP scores for each configuration are shown in Table A1. We adapt the ranges for Small, Medium, and Large bounding box sizes to match those used in Burke et al. (2019). Overall, we find that a Lupton scaling with a ResNet backbone works best for this data set, giving the highest AP scores in almost all categories. This is in contrast to the results on HSC data; however, we note that a transformer backbone is again more robust to contrast scalings. Although Burke et al. (2019) use a z-scale with an R101fpn backbone, our results differ because we use a slightly altered z-scale formula, rescaling each band by a constant σI rather than a per-band scale factor. This alteration makes galaxy classification performance worse (AP = 29.80 compared to AP = 49.6), but star classification performance better (AP = 54.32 compared to AP = 48.6). The large drop in galaxy AP suggests that the R101fpn backbone is very sensitive to the contrast scaling. All other configurations result in better galaxy and star AP scores than the Burke et al. (2019) results. Our AP scores for Small objects are lower, but those for Medium and Large objects are much higher. For the size categories, we use the same size definitions as in Burke et al. (2019), but compute an average AP over all IOU thresholds, rather than the AP at only the lowest threshold of IOU = 0.5. Thus, our results can be thought of as a kind of lower bound, as the AP score tends to increase with a lower IOU threshold.
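As an illustration of the contrast scaling adjustment, the sketch below applies a Lupton asinh scaling with the adjusted stretch and Q parameters using astropy; the band-to-channel mapping and function name are illustrative, and the scalings used for training may differ in detail.

```python
import numpy as np
from astropy.visualization import make_lupton_rgb

def lupton_scale_decam(image_g, image_r, image_z, stretch=100, Q=10):
    """Apply a Lupton et al. asinh contrast scaling to simulated DECam
    g/r/z co-adds. The stretch and Q values are raised relative to the
    HSC runs to account for the different dynamic range of the simulations."""
    # Illustrative channel mapping: reddest band (z) -> R, r -> G, g -> B.
    return make_lupton_rgb(image_z, image_r, image_g, stretch=stretch, Q=Q)

# Example with random arrays standing in for 512 x 512 pixel co-adds:
# rgb = lupton_scale_decam(*np.random.rand(3, 512, 512))
```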
Table A1. AP scores for DECam runs. Galaxy and star AP scores improve over the results of Burke et al. (2019) when different contrast scalings and backbones are applied. Transformer-based models are more robust to contrast scalings, consistent with results on real HSC data.

| Category | Scaling | R101fpn | MViTv2 |
|---|---|---|---|
| Galaxies | Lupton | 65.8 | 62.5 |
| | LuptonHC | 58.3 | 62.2 |
| | zscale | 29.8 | 60.4 |
| Stars | Lupton | 70.1 | 68.0 |
| | LuptonHC | 64.3 | 68.6 |
| | zscale | 54.3 | 66.4 |
| Small | Lupton | 68.3 | 65.7 |
| | LuptonHC | 61.8 | 65.7 |
| | zscale | 42.3 | 63.8 |
| Medium | Lupton | 36.1 | 31.6 |
| | LuptonHC | 29.0 | 46.7 |
| | zscale | 16.3 | 31.7 |
| Large | Lupton | 72.6 | 54.9 |
| | LuptonHC | 49.1 | 65.2 |
| | zscale | 38.0 | 68.0 |