Astronomaly at scale: searching for anomalies amongst 4 million galaxies

Modern astronomical surveys are producing datasets of unprecedented size and richness, increasing the potential for high-impact scientific discovery. This possibility, coupled with the challenge of exploring a large number of sources, has led to the development of novel machine-learning-based anomaly detection approaches, such as Astronomaly. For the first time, we test the scalability of Astronomaly by applying it to almost 4 million images of galaxies from the Dark Energy Camera Legacy Survey. We use a trained deep learning algorithm to learn useful representations of the images and pass these to the anomaly detection algorithm isolation forest, coupled with Astronomaly's active learning method, to discover interesting sources. We find that data selection criteria have a significant impact on the trade-off between finding rare sources such as strong lenses and introducing artefacts into the dataset. We demonstrate that active learning is required to identify the most interesting sources and reduce artefacts, while anomaly detection methods alone are insufficient. Using Astronomaly, we find 1635 anomalies among the top 2000 sources in the dataset after applying active learning, including eight strong gravitational lens candidates, 1609 galaxy merger candidates, and 18 previously unidentified sources exhibiting highly unusual morphology. Our results show that by leveraging the human-machine interface, Astronomaly is able to rapidly identify sources of scientific interest even in large datasets.


INTRODUCTION
Astronomy, like many other fields, is experiencing a data revolution, with large and rich data sets being produced at an extraordinary rate. Modern surveys, such as the Sloan Digital Sky Survey (SDSS; York et al. 2000), have already observed and classified over a billion objects across one-third of the sky (Almeida et al. 2023). The Dark Energy Spectroscopic Instrument (DESI; Dey et al. 2019) Legacy Surveys cover a smaller area of the sky than the SDSS but obtained deeper and higher quality images and have observed 1.6 billion sources, including stars, galaxies, and quasars (Dey et al. 2019). Upcoming surveys from the Vera C. Rubin Observatory and the Square Kilometre Array are expected to acquire petabytes of data and observe unprecedented numbers of astronomical phenomena. It is expected that the Vera C. Rubin Observatory will provide 32 trillion observations of 20 billion galaxies at a greater depth than previous large surveys (Ivezić et al. 2019). In addition, Euclid is poised to contribute significantly to the wealth of astronomical data (Scaramella et al. 2022) and aims to map the geometry of the dark Universe with unprecedented accuracy (Tutusaus, Sorce & Troja 2023).
Improved data quality and deeper images provide a better opportunity to detect rare and unknown astrophysical phenomena, called anomalies. Anomaly detection is often key for scientific discoveries but faces many challenges. For instance, a natural approach to look for rare objects might be to simulate data sets, as Metcalf et al. (2019) did. However, Ciprijanovic et al. (2021) showed that this can fail when applied to real data sets, as simulated and observational data often represent different data domains.
Anomalies may already exist in data sets but have been previously overlooked, as seen when Massey, Neugent & Levesque (2019) identified quasi-stellar objects in various spectroscopic data sets. Large amounts of data make it more difficult to detect anomalies as manual inspections are not possible. Additionally, there are various types of anomalies, not all of which are scientifically interesting. Citizen science projects like Galaxy Zoo (Lintott et al. 2008, 2011) have been used to accelerate data investigation. The project's many volunteers perform critical galaxy classification and anomaly detection tasks, but they may overlook some features and might not necessarily have the training to identify interesting anomalies.
Automated processes, such as machine learning (ML), have demonstrated their ability to handle complex tasks that were previously performed only by people. Supervised ML algorithms have been widely used in astronomy for classification tasks (e.g. Debosscher et al. 2007; Martinazzo, Espadoto & Hirata 2020; Soroka, Meshcheryakov & Gerasimov 2022). While supervised methods fall short as true anomaly detectors, relying on training data and classifying predefined classes, unsupervised ML excels in anomaly detection by operating without the need for prior examples of the sources of interest. In astronomy, unsupervised ML has been applied to various data types. For instance, it has been used to identify outliers in SDSS spectroscopic data through adaptations of techniques like random forests (Baron & Poznanski 2017). Similarly, anomalies in Kepler light curves have been successfully detected (Giles & Walkowicz 2019), and generative adversarial networks have been employed for anomaly detection in optical images (Storey-Fisher et al. 2021). Solarz et al. (2017) showed that anomalies vary in relevance depending on the specific study. Anomalies may range from artefacts, which are of interest to system operators but considered contaminants by astronomers, to rare but known sources such as strong gravitational lenses, which have an estimated occurrence rate of just 1 in 10^4 (Huang et al. 2021).
Unsupervised ML anomaly detection methods, when combined with additional techniques, have proven successful in astronomy. Lochner & Bassett (2021) developed astronomaly, a general framework for anomaly detection that incorporates a novel active learning (AL) approach. astronomaly combines AL with personalized user feedback, enabling users to interactively label objects and refine the anomaly detection process. Moreover, astronomaly has the capability to handle various types of data, including images, spectra, or time series data, and can leverage domain knowledge and user preferences to improve detection accuracy and efficiency. astronomaly was applied to the Galaxy Zoo data set, as well as to simulated data, in order to evaluate the performance of the AL technique. In both cases, astronomaly's AL approach nearly doubled the detection of interesting anomalies in the first 100 user-viewed objects compared to the popular anomaly detection algorithm iForest (Liu, Ting & Zhou 2008). Walmsley et al. (2022) adapted astronomaly to include a deep learning approach combined with a novel AL algorithm. A convolutional neural network (CNN) was used to learn a low-dimensional representation that captures the salient features of galaxy images. Additionally, the random forest regressor (Breiman 2001) that models user interest in astronomaly was replaced by a Gaussian process (GP; Rasmussen & Williams 2006), which allowed the use of an acquisition function to more optimally select targets for user labelling. AHUNT, as an alternative approach, refines deep features through AL in each iteration to enhance anomaly detection (Vafaei Sadr, Bassett & Sekyi 2022).
Anomaly detection is a challenging task, with very few studies done on a large scale in astronomy. While astronomaly has been shown to be effective in detecting anomalies, it has yet to be applied to a large data set. The main objective of this work is to test the capabilities and limitations of astronomaly by applying it to a large subset of the Dark Energy Camera Legacy Survey (DECaLS; Dey et al. 2019). DECaLS has yet to be extensively studied, making it excellent for searching for undiscovered anomalies. This will evaluate the performance and scalability of astronomaly as well as provide the opportunity to make new discoveries.
The paper is structured as follows: Section 2 covers DECaLS subset selection criteria and data pre-processing. An evaluation set is also created, which is used to test the performance of the different AL methods mentioned previously. Section 3 discusses the astronomaly framework, including the algorithms and parameters used for this work. Section 4.1 presents the findings from the evaluation set, followed by the results for the main DECaLS subset in Section 4.2. Lastly, Section 4.3 highlights some of the interesting anomalies detected.

DATA
The DESI Legacy Surveys encompass three distinct imaging surveys: the Beijing-Arizona Sky Survey (BASS), the Mayall z-band Legacy Survey, and DECaLS. These surveys are enriched with data from the Dark Energy Survey (The Dark Energy Survey Collaboration 2005) and select observations from the Wide-field Infrared Survey Explorer (Wright et al. 2010), notably the W1 and W2 bands, to provide additional colour information. This work uses data from the eighth Public Data Release (DR8) of DECaLS, using the three optical bands g, r, and z. DECaLS reaches depths of magnitude 24, 23.4, and 22.5 in the g, r, and z bands, respectively, which is significantly deeper than the previous large optical survey, the SDSS.

Selection cuts
DR8 of the Legacy Surveys contains more than 1.6 billion unique sources. The quantity of data alone requires a subset to be selected, due to limitations associated with downloading and storing such significant amounts of data. More information on the sources, storage, and computational requirements can be found in Appendix A. This is a common challenge, and colour or magnitude constraints are often implemented on target data (Sridhar et al. 2020; Walmsley et al. 2022). DECaLS also contains a number of sources that clearly do not have anomalous morphologies, such as stars, which should ideally be removed before searching for anomalies. While applying strict selection cuts can result in higher data quality and reduce the impact of artefacts and biases, it can also limit the scope of the analysis and remove potentially interesting anomalies. In this paper, we minimize the cuts used in order to produce a subset that is as inclusive as possible, but still of a manageable size for the resources available.
In order to implement appropriate selection cuts, we use the processing that has already been applied to DECaLS data to identify artefacts and classify sources, as outlined in Dey et al. (2019). It is important to acknowledge that the calibrations and techniques employed were not perfect. One area where challenges persist is in the detection and masking of artefacts and other bright sources. While the flagging from Dey et al. (2019) allowed the removal of the bulk of these sources, a number of artefacts remained, which will be discussed in later sections. Here we briefly outline the final selection cuts used, which are described in more detail in Appendix B.
First, we required that sources not be masked in any band; sources are generally masked due to artefacts or bright foreground stars.
Secondly, we ensure all sources are well fitted by a standard galaxy profile model. The DECaLS pipeline fits several models, including an exponential and a de Vaucouleurs (1948) model, reporting the reduced chi-square for the best-fitting model. We found that some sources had a reduced chi-square lower than one in all bands, which corresponded to artefacts, nearby masked sources, as well as faint and unresolved sources. We thus required all sources to have a reduced chi-square greater than one in at least one band. No upper limit was applied to the reduced chi-square, as this would eliminate sources with unusual morphology, which are anomalous but poorly fit by the profiles used.
A positive signal-to-noise ratio (SNR) in all three bands was also imposed. This cut also eliminated observations that do not have passes in all three bands. We also applied a minimum flux threshold in any of the bands to remove sources that are too faint to be distinguishable. The threshold value was based on visual inspections and was chosen to ensure that only relatively bright sources were included. Sources were also selected based on their size, specifically whether they have a large enough radius in either of the two DECaLS models available (de Vaucouleurs or exponential). This cut reduced the number of compact objects that are unlikely to be resolved, focusing on more extended sources. These cuts were carefully chosen after testing various combinations of them and other criteria that are not listed in the final selection. For a more detailed description of the selection criteria, including the exact query used, see Appendix B. Mao et al. (2021) imposed additional cuts by restricting the model-fitting parameters rchi_g, rchi_r, and rchi_z, but applying these to a small subset of data removed a significant number of interesting sources, so they were not implemented. The DECaLS catalogue consists of 871 359 667 resolved (non-PSF) sources, which was reduced to 3 884 404 after applying the above selection cuts. This subset will be referred to as the main subset for the remainder of the paper.
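To make the combined criteria concrete, the sketch below applies cuts of this form to a catalogue table. It is an illustration only: the column names follow the paper's notation and the Tractor catalogue conventions, while the flux and radius thresholds are placeholders; the published values and the exact query are given in Appendix B.

```python
import numpy as np
import pandas as pd

def apply_selection_cuts(cat: pd.DataFrame,
                         min_flux: float = 1.0,
                         min_radius: float = 1.0) -> pd.DataFrame:
    """Sketch of the Section 2.1 selection cuts; `min_flux` and
    `min_radius` are placeholder thresholds, not the published values."""
    bands = ["g", "r", "z"]
    # no masked pixels in any band (masking flags artefacts and bright stars)
    ok = cat["maskbits"] == 0
    # reduced chi-square greater than one in at least one band
    ok &= np.logical_or.reduce([cat[f"rchi_{b}"] > 1.0 for b in bands])
    # positive SNR in all three bands (SNR = flux * sqrt(inverse variance))
    ok &= np.logical_and.reduce(
        [cat[f"flux_{b}"] * np.sqrt(cat[f"flux_ivar_{b}"]) > 0.0
         for b in bands])
    # bright enough in at least one band
    ok &= np.logical_or.reduce([cat[f"flux_{b}"] > min_flux for b in bands])
    # sufficiently extended in either galaxy model
    ok &= (cat["shapedev_r"] > min_radius) | (cat["shapeexp_r"] > min_radius)
    return cat[ok]
```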

Image cut-out sizes
Our methodology includes using a CNN to extract good representations of the source images, as was done in Walmsley et al. (2022). CNNs generally require input images of identical dimensions, which makes choosing a cut-out size important. Moreover, since ML algorithms are affected by the angular source size in the image (Slijepcevic et al. 2024), we also needed to individually adjust the scale of the input images such that the sources are similarly sized. Walmsley et al. (2022) employed a fixed image size, which was possible because of the availability of the Petrosian radius (Petrosian 1976) for the sources, which allowed the pixel count that covered the target source to be estimated. Images were obtained from the DECaLS cut-out service, using the native telescope resolution and adjusting the visible sky area to match the radius determined by the Petrosian radii. Finally, the images in Walmsley et al. (2022) were interpolated to 424 × 424 pixels and colourized for viewing purposes. However, the Petrosian radius is not available for the entire DECaLS data set. Furthermore, the radii included in the DECaLS catalogue, which are derived by model fitting, were generally found to be much larger than the apparent visible extent of the target source. As a result, the above approach could not be directly replicated for the range of target sources used in the subset. Thus, we needed an alternative method to estimate the appropriate cut-out size for each source.
DESI has objective requirements for optical imaging, one of which is to achieve magnitude depths of at least 24, 23.4, and 22.5 in the g, r, and z bands, respectively. These are defined as the 'optimal extraction depth' of galaxies near the DESI depth limit. The definition of such a galaxy is an exponential profile with a half-light radius of 0.45 arcsec. An important part of such a profile is the ability to get a good estimate of the number of photometric pixels that make up an image of the galaxy using the equation

$n_{\mathrm{eff}} = \left[\left(4\pi\sigma^{2}\right)^{1/p} + \left(8.91\, r_{\mathrm{half}}^{2}\right)^{1/p}\right]^{p}$,  (1)

where $\sigma$ is the standard deviation for a Gaussian fit, $p = 1.15$, and $r_{\mathrm{half}}$ is the half-light radius for an exponential profile fit to the galaxy (Dey et al. 2019). The equation approximates an exponential fit to a galaxy, even though some sources follow a de Vaucouleurs profile, while others have a composite profile. This equation estimates the size of each source, from which a cut-out could be extracted. We performed visual tests on a sample of sources to confirm that the estimated cut-out size was sufficient. Fig. 1 illustrates the distribution of sources as a function of their pixel size, as determined by equation (1), for a random sample of sources. The majority of the sources within this random sample had values between 100 and 200 pixels. To ensure consistency, all source cut-outs used throughout were therefore interpolated to a standardized, yet arbitrary, resolution of 150 × 150 pixels. This is significantly fewer pixels than the 424 used in Walmsley et al. (2022), which is attributed to the inclusion of markedly smaller and fainter sources that dominate the population of the subsets investigated. It is worth noting that the secondary peak in Fig. 1, situated at approximately 350 pixels, corresponds to artefacts present within the random sample used.
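For illustration, the function below turns equation (1) into a cut-out side length, taking the side as the square root of n_eff as in Fig. 1. This is a sketch under stated assumptions: the DECam pixel scale of 0.262 arcsec per pixel, catalogue radii given in arcsec, and no additional padding, none of which is specified by the text.

```python
import numpy as np

def cutout_size_pixels(sigma_arcsec: float, r_half_arcsec: float,
                       pixel_scale: float = 0.262, p: float = 1.15) -> int:
    """Estimate a cut-out side length from equation (1); units and any
    padding used in the actual pipeline are assumptions here."""
    sigma = sigma_arcsec / pixel_scale      # convert to pixels
    r_half = r_half_arcsec / pixel_scale
    n_eff = ((4.0 * np.pi * sigma**2) ** (1.0 / p)
             + (8.91 * r_half**2) ** (1.0 / p)) ** p
    return int(np.ceil(np.sqrt(n_eff)))

# e.g. a source with an (inflated) catalogue half-light radius of 12 arcsec:
# cutout_size_pixels(0.5, 12.0) -> roughly 140 pixels on a side
```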

Finding an evaluation set
A labelled subset of data plays a crucial role in the process of algorithm selection and hyperparameter optimization. Additionally, it serves as a valuable resource for evaluating the performance of anomaly detection techniques. To construct this evaluation subset, we randomly selected 15 000 sources from the complete DECaLS data set and supplemented them with DECaLS cut-outs for 342 lens candidates sourced from Huang et al. (2020). These lenses were intentionally included to ensure the presence of interesting anomalies within the subset.
To ensure consistency and adherence to the criteria described in Section 2.1, the same selection cuts were applied to this random set. Table 1 summarizes the number of sources excluded by each selection cut. We found that a surprisingly high number of the lens candidates did not meet the selection criteria, with only 87 candidates passing all of the applied cuts, implying that the chosen cuts are not optimal for lens inclusion. The majority of failures originated from the shapedev_r and shapeexp_r cuts, the half-light radii of the de Vaucouleurs and exponential models respectively, indicating that lenses tend to have small angular diameters and are generally faint. This resulted in sources failing to meet the specified cut criteria, further reinforced by lenses also failing the flux value cut. Some lens candidates failed multiple cuts, and it was observed that three of them were incorrectly labelled as point spread functions (PSFs) in the catalogue.
After an extensive search, no generalizable criteria could be found that retain a large sample of lenses while also reducing the size of the entire data set to a manageable level. We thus elected to continue with a sample of 5000 random sources, and the 87 lenses, that passed all selection cuts. This subset of sources was then fully labelled using the labelling scheme described in Section 3 and illustrated in Fig. 2.
It is worth noting that even such a small sample is still expected to contain other interesting anomalies, including moderately interesting sources like mergers, in addition to the lenses that have been added to it. This subset formed the evaluation set that was used to assess the performance of the methods used.

METHODOLOGY
The methodology is summarized as follows: first, the images are reduced to a representative set of features using a modified CNN followed by a dimensionality reduction procedure (Section 3.1). These features are then passed to an anomaly detection algorithm, followed by AL to reduce the number of unwanted artefacts and prioritize interesting sources (Section 3.2). We evaluate the performance of our methodology using our hand-labelled subset of data (Section 3.3).

Feature extraction
Images are typically high-dimensional and often require dimensionality reduction for computational efficiency. Feature extraction achieves this by transforming images into lower-dimensional vectors that retain their essential information. This is done using an image representation function that maps images to vectors while maintaining similarity.
To create the representations of the images that serve as features in this work, a pre-trained CNN was used, following the approach of Walmsley et al. (2022). The CNN model was initially trained on a complex classification task, but its remarkable capability extends to learning relevant features for different tasks beyond its original training purpose.
CNNs have multiple layers, each of which performs a different transformation on the input image. Only the last layer performs classification by generating a probability distribution over the image's possible classes. The rest of the network extracts image information for the classification layer, producing a vector which can be used as features. The CNN in Walmsley et al. (2022) uses the EfficientNetB0 architecture (Tan & Le 2020) and is implemented in Zoobot (Walmsley et al. 2023) for various galaxy classification purposes.
To extract features from the cut-out images, we employed the same model as described by Walmsley et al. (2022). However, we used a different pre-processing method to better suit our larger and noisier data set. Since we use the estimated size of the source to extract the cut-out, cropping is not required for our data set. No augmentations of the images were performed when passing them to the CNN, as the network was used for feature extraction only. A standard sigma clipping algorithm from the Astropy Python package (Astropy Collaboration 2013, 2018, 2022) was applied to each image. The algorithm uses an iterative approach to estimate the noise in the image and masks all pixels below the 3σ threshold. Finally, the images were greyscaled by averaging the three bands, g, r, and z, into a single band.
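A minimal sketch of this pre-processing with Astropy is shown below; astronomaly's own implementation may differ in detail, and the threshold choice follows the 3σ clipping described above.

```python
import numpy as np
from astropy.stats import sigma_clipped_stats

def preprocess_cutout(image: np.ndarray) -> np.ndarray:
    """Mask background noise and greyscale an (H, W, 3) g/r/z cut-out."""
    out = np.zeros(image.shape[:2], dtype=float)
    for band in range(image.shape[2]):
        data = image[:, :, band]
        # iterative sigma clipping estimates the background statistics
        _, median, std = sigma_clipped_stats(data, sigma=3.0, maxiters=5)
        # zero out pixels below the 3-sigma noise threshold
        out += np.where(data > median + 3.0 * std, data, 0.0)
    # greyscale by averaging the three bands
    return out / image.shape[2]
```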
The CNN produces a vector that contains 1280 features. This feature vector is still a high-dimensional representation that poses computational challenges. Principal component analysis (PCA) was used to reduce the dimensionality while preserving the majority of the information (Pearson 1901). PCA is a statistical method that transforms a set of correlated variables into a set of uncorrelated variables called principal components, which account for most of the variance in the original data (Shlens 2014). By setting a variance limit of 95 per cent, the principal components retain 95 per cent of the information in the original features, while reducing the dimensionality from 1280 to 26. After the features were extracted and PCA applied to reduce the dimensionality, an anomaly detection procedure similar to that of Lochner & Bassett (2021) was applied, using the image representations as features instead of the simple morphological features that were used in that work.
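With scikit-learn, the 95 per cent variance criterion can be expressed directly; the random array below merely stands in for the real (N, 1280) CNN features, which compress far better than noise does.

```python
import numpy as np
from sklearn.decomposition import PCA

cnn_features = np.random.default_rng(0).normal(size=(1000, 1280))  # stand-in

# a float n_components keeps just enough components to explain that
# fraction of the variance; for the real features this gave 26 components
pca = PCA(n_components=0.95, svd_solver="full")
features = pca.fit_transform(cnn_features)
print(features.shape)
```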

Anomaly detection
astronomaly incorporates two widely used anomaly detection algorithms: isolation forest (iForest) and the local outlier factor (LOF) algorithm (Breunig et al. 2000; Liu, Ting & Zhou 2008), with both algorithms implemented using the scikit-learn package (Pedregosa et al. 2011). We conducted tests and determined that LOF could not effectively scale to handle the volume of data in the main subset, while iForest proved scalable to such volumes. iForest is a fast algorithm that detects anomalies by employing decision trees. It isolates a data point by randomly selecting a feature and a value within its range, then recursively splits the data into subsections based on this value. This process continues until all data points are isolated, forming a forest of trees. The number of splits required to isolate a point is referred to as the path length, and it serves as a measure of the anomaly score for that point. The underlying idea is that anomalies are more likely to be isolated from the rest of the data, with fewer splits compared to normal points. Therefore, the shorter the path length, the more anomalous the point is considered.
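As a minimal illustration of how iForest produces the ranking used throughout, with scikit-learn (the random features stand in for the PCA-reduced CNN representations):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
features = rng.normal(size=(10_000, 26))  # stand-in for the PCA features

forest = IsolationForest(n_estimators=100, random_state=42).fit(features)
# score_samples returns higher values for more "normal" points, so negate
# it to obtain a score where higher means more anomalous
anomaly_scores = -forest.score_samples(features)
ranked = np.argsort(anomaly_scores)[::-1]  # most anomalous first
```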
iForest assigns an anomaly score to each source in the data set, quantifying how much it differs from the majority of sources. We normalize this score such that a higher value indicates a more anomalous object, and use it to rank the entire subset from most to least anomalous. AL can then be applied, where the graphical interface of astronomaly allows users to manually rate objects on a scale of 0 to 5 according to how interesting they are.
AL is an approach that allows algorithms to select the most informative examples to label in order to improve model performance. The classification of anomalies is often subjective and depends on the user's judgement to identify objects of particular interest. By focusing on the relevant samples that provide the most information, AL aims to reduce the amount of labelled data needed. This is useful in situations where labelling data is expensive, time-consuming, and requires human expertise, or where no labels exist and only a subset of the data is sufficient to achieve good results. astronomaly uses these user-provided labels to update the anomaly scores for the entire data set, leading to a revised ranking order.
Two different AL approaches were evaluated in this work. The first approach is the novel AL algorithm introduced by Lochner & Bassett (2021), which is integrated into astronomaly and referred to in this study as the neighbour score (NS) algorithm. The second is the one used in Walmsley et al. (2022), a GP (Rasmussen & Williams 2006) that can model smooth distributions, hereon referred to as the direct regression (DR) algorithm.
The primary objective of the NS algorithm is to adjust anomaly scores based on user-provided labels. It uses training data consisting of a small number of human-provided labels and employs a random forest regression algorithm (Liaw & Wiener 2002) to predict user scores for all data instances based on these labels. By calculating each instance's distance from its nearest labelled neighbour, the method effectively identifies areas of the feature space where the algorithm displays uncertainty. In these regions, it returns scores that are close to the original anomaly scores. Conversely, in regions with ample training data, the predicted user scores modulate the anomaly scores. The algorithm employs a scoring function that combines nearest neighbour distances, predicted user scores, and original anomaly scores to compute the final anomaly scores for each instance.
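The following is a simplified sketch of this idea, not astronomaly's exact scoring function: a random forest predicts user interest, and the distance to the nearest labelled neighbour controls how strongly that prediction overrides the raw anomaly score.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import NearestNeighbors

def neighbour_score(features, anomaly_scores, labelled_idx, user_scores,
                    scale=1.0):
    """Simplified illustration of the NS idea. `anomaly_scores` are
    assumed normalized to [0, 1]; `user_scores` are the 0-5 labels for
    the rows in `labelled_idx`."""
    # regressor predicts the user's interest score everywhere
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(features[labelled_idx], user_scores)
    predicted_interest = rf.predict(features) / 5.0  # rescale to [0, 1]

    # distance to the nearest labelled point measures how much to trust
    # the regressor in each region of feature space
    nn = NearestNeighbors(n_neighbors=1).fit(features[labelled_idx])
    dist, _ = nn.kneighbors(features)
    weight = np.exp(-dist[:, 0] / scale)  # ~1 near labels, ~0 far away

    # near labels: the predicted interest dominates;
    # far from labels: fall back on the raw anomaly score
    return weight * predicted_interest + (1.0 - weight) * anomaly_scores
```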
In contrast, the DR approach skips the purely unsupervised anomaly detection step entirely. Instead, it attempts to use AL to directly model the user's 'interestingness' score. The approach of Walmsley et al. (2022) is to select a small set of random examples for labelling and then iteratively query the user (sometimes called the oracle in the machine learning literature) with a new set of examples explicitly selected, using an 'acquisition' function, in order to improve the regression algorithm. While any regression algorithm can be used, it must be able to produce uncertainty information in order to compute the acquisition function. GPs are ideally suited to this task as they model uncertain or noisy functions. They assign probabilities to potential data-fitting functions and update these probabilities as data points are labelled, transitioning from a prior to a posterior distribution. GPs rely on a kernel to shape function behaviour and smoothness, with hyperparameters adjusted based on data likelihood to maximize the fit to observed data.
The GP incorporated in astronomaly uses a combination of the Matérn kernel and the WhiteKernel to model the relationships between data points and account for noise in the predictions. Both approaches are implemented in Python through the scikit-learn package (Pedregosa et al. 2011).
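The sketch below shows how such a regressor can be set up with scikit-learn and used to rank unlabelled sources; the upper-confidence-bound acquisition is one possible choice for illustration and is not necessarily the acquisition function used by Walmsley et al. (2022).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

rng = np.random.default_rng(0)
features = rng.normal(size=(5_000, 26))      # stand-in for the PCA features
labelled_idx = rng.choice(5_000, size=100, replace=False)
user_scores = rng.integers(0, 6, size=100)   # stand-in 0-5 labels

kernel = Matern(length_scale=1.0, nu=1.5) + WhiteKernel(noise_level=1.0)
gp = GaussianProcessRegressor(kernel=kernel)
gp.fit(features[labelled_idx], user_scores)
mean, std = gp.predict(features, return_std=True)

# upper-confidence-bound acquisition: query sources that are predicted to
# be interesting *or* whose interest is highly uncertain
acquisition = mean + 1.0 * std
query_idx = np.argsort(acquisition)[::-1][:100]
```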

Evaluation
To evaluate the performance of the two AL approaches, the fully labelled evaluation set described in Section 2.3 was used as a benchmark, which allowed the recall of the algorithms to be determined. iForest was applied to compute the anomaly score for each data point in the data set. In the NS algorithm, the process involved selecting the top 100 images with the highest anomaly scores. These images were then labelled through the astronomaly interface to establish a new ranking order based on user input. From this new ranking, the top 100 sources were identified and labelled, excluding those already labelled in previous iterations. This iterative process continued until a total of 500 sources were labelled and used for training. For the DR algorithm, the first 100 sources were manually labelled based on iForest scores, and the sources were then sorted by acquisition scores. The retraining procedure was repeated iteratively until 500 sources had been labelled and trained upon. Throughout this process, the accuracy of each labelling iteration was assessed to evaluate the performance and effectiveness of the algorithms.
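The iterative protocol itself is compact. In the sketch below, `simulated_oracle` is a hypothetical stand-in for a human scoring sources through the interface, and a plain random forest regression stands in for the full NS update:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
features = rng.normal(size=(5_000, 26))  # stand-in for the PCA features
anomaly_scores = rng.random(5_000)       # stand-in for iForest scores

def simulated_oracle(batch):
    # hypothetical stand-in for a human labelling sources through the
    # interface; here it simply assigns random 0-5 scores
    return {int(i): int(rng.integers(0, 6)) for i in batch}

labelled: dict = {}
scores = anomaly_scores.copy()
while len(labelled) < 500:
    ranked = np.argsort(scores)[::-1]                      # most anomalous first
    batch = [i for i in ranked if int(i) not in labelled][:100]
    labelled.update(simulated_oracle(batch))
    # retrain on everything labelled so far and re-rank the full data set
    idx = np.fromiter(labelled.keys(), dtype=int)
    y = np.fromiter(labelled.values(), dtype=float)
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    scores = rf.fit(features[idx], y).predict(features)
```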
Fig. 2 illustrates typical sources with different labels (ranging from 0 to 5) assigned as scores during the AL process. Score 0 includes artefacts, masked sources, and low SNR sources. Scores 1 and 2 contain typical elliptical and spiral galaxies, respectively. Score 3 showcases more interesting spiral galaxies with unusual structures. Score 4 is reserved for sources with highly unusual morphologies, while score 5 contains clearly interacting sources, lenses, and other unidentified sources considered anomalous. It is important to note that labelling is subjective and that this served as a guideline only. It does not comprehensively capture all variations within the data set, nor should it be viewed as a rigid classification system.

RESULTS
To evaluate the performance of anomaly detection methods, particularly in scenarios where the data set contains an unknown number of interesting anomalies, we employ two types of plots: a recall plot and a Uniform Manifold Approximation and Projection (UMAP) plot (McInnes, Healy & Melville 2018).
Recall plots, as depicted in Figs 3 and 4, serve to illustrate how the anomaly detection algorithm ranks sources from most to least anomalous based on their anomaly scores. In these plots, the x-axis represents the position of the sources in the ranked list, while the y-axis represents the recall for the specific class of interest. Essentially, the y-value increases when a source belongs to the class of interest. A steeper slope at the beginning of the plot indicates better performance, as it means that a user has to search through fewer sources to find interesting anomalies. This ranking of sources is crucial because, in the astronomaly interface, the sources on the left would be displayed first for labelling when using the score-based ranking. Other ordering methods, such as random ordering, are also available in the interface.
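For reference, such a curve reduces to a cumulative sum over the ranked list, as in this sketch with toy data:

```python
import numpy as np

def recall_curve(ranked_idx: np.ndarray, interesting: set) -> np.ndarray:
    """Cumulative recall of the class of interest along a ranked list."""
    hits = np.fromiter((int(i) in interesting for i in ranked_idx),
                       dtype=float)
    return np.cumsum(hits) / len(interesting)

# toy example: 1000 ranked sources, 20 of which are interesting
rng = np.random.default_rng(1)
ranked = rng.permutation(1000)
interesting = set(rng.choice(1000, size=20, replace=False).tolist())
recall = recall_curve(ranked, interesting)  # plot against rank for Fig. 3-style curves
```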
To visualize the high-dimensional features in a two-dimensional space, the UMAP technique was used in this work (McInnes, Healy & Melville 2018). UMAP is a non-linear dimensionality reduction technique that retains the topological structure of the data. This means it can capture both global and local relationships among data points. In a UMAP plot, clusters are groups of points that are close together in feature space, suggesting that the data points within the clusters share similarities or patterns that differentiate them from other data points. UMAP can also highlight outliers, which are data points that significantly differ from the majority of the data. Outliers may appear as isolated points far from any clusters or as points located on the edges of clusters but not tightly grouped with them. By embedding high-dimensional features into a two-dimensional space, UMAP visualizations, like the one shown in Fig. 5, help provide valuable insights into the underlying data structure and the performance of the anomaly detection methods.
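A sketch of producing such an embedding with the umap-learn package is given below; the parameter values are illustrative defaults, not necessarily those used for Fig. 5.

```python
import numpy as np
import umap  # provided by the umap-learn package

rng = np.random.default_rng(0)
features = rng.normal(size=(5_000, 26))  # stand-in for the PCA features

# embed into two dimensions; the result can be scatter-plotted, coloured
# by user label, to produce a figure like Fig. 5
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1,
                      random_state=42).fit_transform(features)
print(embedding.shape)  # (5000, 2)
```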

Evaluation set
The comparison of the NS, DR, and iForest-only anomaly detection methods is illustrated in Fig. 3. The plot shows how many of the known anomalies in the evaluation set were detected as a function of the sample size. The evaluation set consists of a total of 184 anomalies, including the 87 lenses injected from a known catalogue. The performance of iForest on the evaluation set yielded somewhat surprising results, as it detected only 15 anomalies among the top 500 sources. However, when the DR approach was applied and trained with 100 labels, it detected 51 anomalies, while the NS method found 45. Both AL methods demonstrated a significant improvement in anomaly detection compared to using iForest alone. Further increasing the number of labels to 500 resulted in even better anomaly detection performance, with both AL methods detecting 84 anomalies within the top 500 sources.
To gain a deeper understanding of why iForest struggled to detect a significant number of anomalies, a more detailed analysis was conducted on the sources and their associated scores. Fig. 4 presents a normalized plot of the sources in the evaluation set, categorized into three distinct classes: artefacts, galaxy merger candidates, and known lenses, as these were the sources of particular interest. To directly compare classes with very different numbers of objects, we normalize the recall in each class to a scale of 0 to 1. This plot reveals the reason for iForest's poor performance: sources with a high anomaly score are predominantly artefacts. While technically anomalies, these sources may not be particularly interesting from a scientific perspective.
Fig. 4 provides additional evidence of the enhanced detection rates achieved through the application of the NS algorithm. The galaxy merger and gravitational lensing candidates have been shifted higher in the order (to the left and upwards), while the artefacts have been moved lower (down and to the right). This outcome is quite important, as it demonstrates that AL can rapidly remove artefacts that may have been missed by automated pipelines, allowing a scientist to focus their attention on interesting anomalies.
Fig. 5 presents a UMAP plot for the evaluation set, where all of the labelled classes are shown. This offers valuable insights into the challenges encountered when using iForest as a standalone anomaly detection method. The interesting anomalies are located on the edge of the overall structure of the plot, but the artefacts are much more likely to be detected as anomalies by iForest given the position and distance of their sub-cluster. These sub-clusters highlight the importance of AL algorithms, which identify these regions of interest in the feature space and are crucial for eliminating artefacts.
The feature space in Fig. 5 is markedly different from that of Walmsley et al. (2022), where most of the interesting anomalies were deep in the centre of the plot. This is likely due to the fact that the Galaxy Zoo data represent a specific subset of DECaLS with very different properties. The Galaxy Zoo data set is well curated and subject to more restrictive magnitude cuts, among other criteria, resulting in the selection of galaxies that are generally well resolved. This results in a very different structure in the feature space, which explains why iForest, combined with the NS algorithm, worked well here but failed to find interesting anomalies in the Galaxy Zoo data with the CNN-based features.
All of these classes are interspersed among the common sources (those with a score of 1), indicating that this class is quite diverse and shares features with all of the other classes. The class with a score of 3 is closest to the interesting anomalies, which is likely due to the similarities between spiral galaxies with moderately interesting morphologies (which constitute most of the score 3 class) and the galaxy merger candidates in the score 5 class. The score 4 class contains too few sources to draw clear conclusions about it.

Application on a large scale
The NS algorithm was used for the application to the main subset, which consists of 3 884 404 unlabelled sources. The volume of this subset forms the main challenge of this work, as astronomaly has not been applied on such a scale before. The same approach as in the evaluation set was followed, excluding the use of the DR algorithm due to its computational demands and limited discernible benefits; see Appendix A for more details. Each AL iteration of the NS algorithm was done with 2000 labels until 10 000 sources were labelled. The top 2000 sources were investigated and fully labelled in each iteration. This entire process only took the lead author several hours using astronomaly's interactive interface.
The results are presented in Fig. 6, showing the anomalies and their ranks among the top 2000 sources. The left panel includes labelled sources and the right panel shows only the new or unseen anomalies, demonstrating the power of the algorithm to detect new anomalies. The line for 0 labels, corresponding to iForest only, is not visible because it detected only one interesting anomaly in the top 2000 sources. However, this does not mean that iForest failed to detect anomalies. In fact, most of the top 2000 sources (1763) initially detected by iForest were artefacts or masked sources: obvious anomalies in the data set, but not of interest. The plots demonstrate the necessity of AL, as more labels lead to a higher number of interesting sources within the top 2000. However, as the panel on the right in Fig. 6 shows, the number of new anomalies seen in the top 2000 drops sharply when more than 4000 labels are used. This suggests a point of diminishing returns, where adding more labels would not necessarily lead to more anomalies being found when looking at the same number of sources. In such an instance, discovering additional anomalies would require investigating a larger number of sources rather than increasing the number of labels.

Figure 6. Recall as a function of rank for increasing numbers of labels. The plot on the left shows the number of anomalies including those that have been previously labelled, whereas the plot on the right excludes the labelled anomalies, showing the unlabelled sources the algorithm determines as being the most likely to be interesting. It is clear that the number of "new" anomalies found in the top 2000 increases as more labels are used for training. However, there is a point of diminishing returns, as the number of "new" anomalies decreases rapidly when more than 4000 labels are used. It should also be noted that the line for 0 labels is not visible on either plot since there was only one source detected.

Follow-up investigations
The 10 000 labels from the main subset were scored between 0 and 5 using the labelling scheme described in Section 3 and illustrated in Fig. 2. The majority of the sources (4861) received a score of 0, indicating that most were artefacts or masked sources. The next largest group (2408) received a score of 1, the common galaxy types, followed by 1648 sources given a score of 5, indicating a large number of interesting anomalies. The remaining labels were distributed among the scores of 2 (519), 3 (288), and 4 (276), which represent galaxies with slightly disturbed morphology.
The 1648 anomalies were analysed, and 18 unclassified sources (Fig. 7), eight gravitational lens candidates (Fig. 8), and 1609 galaxy merger candidates (Fig. 9) were detected. Out of the 1648 sources labelled as anomalies, further investigation determined that 13 were either artefacts, other uninteresting sources mislabelled as lens candidates, or sources that form part of another detected source and could be considered duplicate detections. In an attempt to identify these anomalies, they were cross-matched with the Simbad data base (Wenger et al. 2000) using different cross-matching distances and all of the available data sets on Simbad. A total of 1209 matches at 30 arcsec and 792 matches at 10 arcsec were obtained, generally corresponding to already labelled galaxies. However, there was no definitive data set to match against that could confirm the nature of the anomalies detected, and a manual follow-up would be necessary to confirm their nature, as even a 10 arcsec difference can lead to potential mismatches.
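For reference, a positional cross-match of this kind can be done with astropy's SkyCoord; the sketch below assumes the anomaly and catalogue coordinates (for example retrieved via astroquery's Simbad module) are already available as arrays in degrees.

```python
import numpy as np
from astropy import units as u
from astropy.coordinates import SkyCoord

def crossmatch(anom_ra, anom_dec, cat_ra, cat_dec, radius_arcsec=30.0):
    """Match anomaly positions against an external catalogue; all
    coordinate arrays are assumed to be in degrees."""
    anomalies = SkyCoord(ra=np.asarray(anom_ra) * u.deg,
                         dec=np.asarray(anom_dec) * u.deg)
    catalogue = SkyCoord(ra=np.asarray(cat_ra) * u.deg,
                         dec=np.asarray(cat_dec) * u.deg)
    # nearest catalogue neighbour for every anomaly
    idx, sep2d, _ = anomalies.match_to_catalog_sky(catalogue)
    matched = sep2d < radius_arcsec * u.arcsec
    return idx, matched
```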

Sources with highly unusual morphology
We searched for matches in Simbad (Wenger et al. 2000) for the 18 unidentified sources. Table 2 summarizes the source locations and any known identifiers, with a link to the Simbad entry. Almost half of the sources have no matches or identifications in any other data sets. The sources can be seen in Fig. 7 along with their labels. Here we summarize our initial investigations of the sources, based on information found in Simbad and the Data Aggregation Service.
(i) U1 - This appears to be a ring-shaped starburst galaxy (Toba et al. 2014) at a spectroscopic redshift of 0.077 (Ahumada et al. 2020). The northern extension may be a tidal tail or interacting galaxy, but may also be coincident as it has an imprecise photometric redshift of 0.11.
(ii) U2 - This puzzling object has detectable emission in radio (NVSS, Condon et al. 1998), infrared (2MASS, Skrutskie et al. 2006), and ultraviolet (GALEX, Martin et al. 2005). Two competing spectroscopic redshifts are available, 0.071 (GLADE, Dálya et al. 2018) and 0.2 (Milliquas, Flesch 2023). The latter seems to correspond to the bright red source, which could be a coincident quasar (given the selection criteria of Milliquas). A detailed analysis would be required to determine which of these sources are associated and whether the ring is a lens or an unusually red ring galaxy with a bright patch of star formation.
(iii) U3 - This source, apparently half red and half blue, has the same visual appearance in PanSTARRS (Chambers et al. 2016), indicating its appearance is not due to a processing error. Unfortunately, only photometric redshifts from DECaLS are available, with an implied redshift of 0.1120 for the red source and 0.1050 for the bright blue bulge in the centre. The uncertainties are too large to determine if these are coincident sources or not. The entire source has associated ultraviolet emission from GALEX and the red part is clearly detected in 2MASS. We have no immediate explanation for the dual-colour nature of this source. It could be two coincident galaxies with a remarkable chance alignment, interacting galaxies of very different colours and nature, or something else entirely.
(iv) U4 - This source has a star-forming ring and two apparent cores that may be merging. A photometric redshift of 0.0775 (GLADE, Dálya et al. 2018) is available for one of the cores. The nearby galaxy, LEDA 213095, has a photo-z of 0.0734 (GLADE, Dálya et al. 2018), meaning it could potentially be interacting and triggering the star formation.
(v) U5 - A highly disturbed, star-forming ring galaxy. Not surprisingly, this object is bright in radio (NVSS, Condon et al. 1998) and ultraviolet (GALEX, Martin et al. 2005). Only the bright ring has spectroscopic information, with a redshift of 0.0459 (6dF, Jones et al. 2009). The neighbouring source has a negative photo-z from DECaLS. It is thus not clear if these galaxies are interacting or coincident. At the bottom of the cut-out, a faint third galaxy can be seen for which no redshift information is available.
(vi) U6 - This object appears to be a lens but has the unusual feature of two foreground objects, neither of which is in the centre. A competing explanation is that the pair of galaxies is disturbing a third galaxy, completely disrupting its morphology. The left and right foreground galaxies have photometric redshifts of 0.3117 ± 0.1321 and 0.1224 ± 0.0377, respectively, which suggests that one of them may be coincident. However, given the complexity of the system and the large uncertainty of the redshift of the left source, we cannot be certain they are unrelated. The entire system is visible in ultraviolet (GALEX, Martin et al. 2005) and the two foreground sources are faintly detected in infrared (2MASS, Skrutskie et al. 2006).
(vii) U7 - A tidal tail due to interaction with a neighbouring galaxy. Both galaxies have a redshift of 0.0371 (GLADE, Dálya et al. 2018), although a redshift error is not listed, so they may still be coincident sources.
(viii) U8 - It is difficult to determine if the two red galaxies are associated with the oddly shaped blue galaxy. The redshift estimates for the parts of this system vary significantly, although the DECaLS photometric redshifts place the red galaxies at a similar redshift (0.22) to the spectroscopic redshift of the blue galaxy (6dF, Jones et al. 2009). However, the group-finding algorithm of Eke et al. (2004) places these galaxies in a group at redshift 0.11. It seems likely that it is the group interaction that has warped the morphology of the blue galaxy, triggering star formation, but careful data analysis is needed to confirm this.
(ix) U9 - The two galaxies seen here have a similar photometric redshift in the DECaLS catalogue of 0.17, but the blue region in the middle registers a very different redshift of 0.34. It is possible that these galaxies are simply coincident, or they may in fact be interacting and creating a star-forming region between them which is not well estimated by photo-z algorithms.
(x) U10 - With very different redshift estimates for each part of this system, these sources could very well be coincident. However, it seems quite a dramatic alignment of sources, so some amount of interaction and star formation may be more plausible. Additional spectroscopic observations would be needed to determine the nature of this interesting group.
(xi) U11 - This unusual system also has disagreeing photometric redshifts, so it may be a chance alignment, although it visually appears to be a merging system.
(xii) U12 - This unusual group of galaxies has been detected with the group-finding algorithm of Eke et al. (2004) at redshift 0.2137.

(xiii) U13 -This system was identified as a possible compact group in Zheng & Shen (2020) at redshift 0.02640 with possible interactions.The main galaxy is also the host of the supernova SN 2012T (Asiago Supernova Catalogue, R. et al. 1999).
(xv) U15 -With three different photometric redshift estimates in this group, these are likely to be coincident galaxies although the usual caveats about photo-z uncertainties apply.
(xvi) U16 -All three photometric redshifts from DECaLS for this system are very different so again, they are either coincident galaxies or have poorly estimated redshifts.

Gravitational lens candidates
Fig. 8 and Table 3 show the strong gravitational lens candidates that have been identified in the top 10 000 anomalies. The source labelled L1 has been cross-matched, identified, and confirmed to be a strong gravitational lens. The sources L2 through L5 have been cross-matched with other catalogues and identified as strong lens candidates through the combined strong lens catalogue created in Grespan et al. (in preparation). The remaining sources are suspected strong lenses based on visual characteristics within the cut-outs, but have not been listed in any known catalogue. It should be noted that only fairly obvious lenses were labelled as interesting, and there may be many more candidates in the list of anomalies that could potentially be identified by an expert in the field.
Confirming the nature of these sources is challenging without significant additional analysis and spectroscopic follow-up. We found that the photometric redshift information available to us does not appear to be particularly reliable for these lensed systems. For instance, the DECaLS photometric redshifts for the sources in L1, which is a confirmed lens system, place the lens at a higher redshift than the background source, which is obviously incorrect. Of particular interest would be to confirm whether the system U6 of Fig. 7 is actually a lensed system and whether the two sources in the foreground are coincidental or in fact part of the lens. Similarly, the pair of sources in L8 would need spectroscopic information to determine whether they are both involved in lensing the background source or not.

Figure 8. The strong gravitational lens candidates identified in the top 10 000 anomalies. More information can be found in Table 3. The angular scale bar within each image represents 5 arcsec.

Table 3. The gravitational lens candidates that have been identified in the top 10 000 anomalies. The second and third columns show the right ascension and declination in degrees, respectively. The last column indicates whether each candidate has been confirmed to be a lens, is a known lens candidate, or has not yet been matched to any other catalogue.

Galaxy merger candidates
Fig. 9 shows some of the more striking examples of mergers detected with astronomaly. The 1609 galaxy merger candidates were compared with the Catalog of Morphologically Identified Merging Galaxies (Hwang & Chang 2009), but the sky coverage of this catalogue, 422 deg², is significantly smaller than that of DECaLS, 14 000 deg², so not many matches were expected. In addition, the different data cuts applied make a direct comparison challenging. However, six matches were found at a cross-matching distance of 30 arcsec. These six sources were visually confirmed to match the sources in the catalogue. While spectroscopic follow-up would be needed to further investigate the rest of the sources shown in Fig. 9, the presence of significant tidal streams and interaction suggests the majority of these are high-probability merger candidates. These results show that astronomaly could be used to build a large, albeit incomplete, merger catalogue. To gauge the theoretical completeness of such a catalogue, we compared the precision values for the identified merger candidates from our evaluation subset, first described in Section 2.3, with a supervised method from O'Ryan et al. (2023), which used the same CNN to classify mergers. It is important to note that O'Ryan et al. used data from the European Space Agency's Hubble Space Telescope Science Archive, which has significantly higher data quality compared to DECaLS. Unfortunately, as no similar work has been published on DECaLS data, this was the best comparison that could be made at this time. Deriving conclusions from comparisons in performance metrics should therefore be done cautiously because of the discrepancy in data quality.

Figure 9. Some of the galaxy merger candidates that were found in the top 10 000 of the main set. The images displayed are those that were visually the most interesting out of the 1609 merger candidates. The angular scale bar within each image represents 10 arcsec. More information can be found in Table 4.
The entire main subset was initially ordered by the 'predicted_user_score', discussed in Section 3.2, which we found best identified merger candidates for manual labelling. For more details on the scores implemented in astronomaly, refer to Lochner & Bassett (2021). This ordering can be considered the equivalent of a classifier probability, allowing the application of different thresholds to 'classify' a source as a merger. We then extracted the index (position in the ordered list) for the evaluation subset, for which we have labels. We considered a merger candidate to be the positive class, and all others (including other genuine anomalies) the negative class. By applying different thresholds in the index, we can obtain candidate merger samples of differing purity. To compare with O'Ryan et al. (2023), we selected thresholds resulting in specified recall values and report the corresponding precision in Table 5. Our unsupervised method excelled with high precision up to a 50 per cent recall threshold, outperforming the supervised method. However, the supervised method demonstrated superior precision at higher recall thresholds, emphasizing its strength in completeness. Our method's higher initial precision makes it well suited to efficiently obtaining a sample of desired sources. This subset can then serve as valuable training data for a subsequent supervised method.
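The precision values at fixed recall reduce to a simple computation over the ranked labels. A minimal sketch, assuming a binary label array ordered by 'predicted_user_score' and a reachable target recall:

```python
import numpy as np

def precision_at_recall(ranked_labels: np.ndarray,
                        target_recall: float) -> float:
    """Precision of the smallest top-of-the-list sample that reaches the
    target recall. `ranked_labels` is 1 for mergers (positive class) and
    0 otherwise, ordered from most to least merger-like."""
    cum_tp = np.cumsum(ranked_labels)
    recall = cum_tp / ranked_labels.sum()
    k = int(np.argmax(recall >= target_recall)) + 1  # first rank reaching it
    return cum_tp[k - 1] / k

# toy example: 200 ranked sources, mergers concentrated near the top
rng = np.random.default_rng(0)
labels = (rng.random(200) < np.linspace(0.8, 0.05, 200)).astype(int)
print(precision_at_recall(labels, 0.5))
```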
In addition, as mergers make up the majority of anomalous sources, one could explore the application of either a second round of AL or a trained classifier to actively remove them, which would help highlight more unusual sources.

CONCLUSIONS
Anomaly detection on a large scale is critical for scientific discovery in current and future astronomical data sets. Computational challenges are abundant and, more often than not, supervised methods are used to detect "new" sources that are of the same type as ones already identified. The search for novel anomalies requires unsupervised methods and has previously been applied on relatively modest scales using frameworks like astronomaly. In this work, we have applied astronomaly on a much larger scale: 3 884 404 galaxies obtained from DECaLS DR8. This was a test to determine if these methods could be applied on significantly larger scales and if they could find interesting sources in the relatively unexplored DECaLS data set.
We looked at various options for selecting a DECaLS subset to study and discovered that the choice of data selection criteria had a significant impact on the balance between scalability and discovery, which is important for detecting anomalies. If the selection criteria incorporated are too strict, anomalies might be overlooked. The selection criteria we used impacted the number of gravitational lens candidates identified within our evaluation set and would consequently constrain the number of such candidates within the larger data set used. We were unable to identify a set of cuts that could reduce the data set size while retaining all the lens candidates. Exploring criteria that have less impact on the count of lens candidates could be a worthwhile avenue for future research. To ensure a thorough investigation, the data selection should be as broad and unrestricted as possible, while still remaining manageable enough in size to analyse.
As the size of the data set increases, the challenges of large data volumes, storage, and computational complexity become more important. One of the main challenges that we experienced was the transfer of data from the host server to a local computer, which took several weeks as the data had to be validated too. As the size and complexity of data sets increase, such as in the case of the LSST and the SKA, the traditional approach of moving the data to the compute nodes becomes impractical and costly. Bringing the compute to the data is therefore essential for anomaly detection in big data sets.
Pre-processing can also have a significant impact on the anomalies detected. Following Walmsley et al. (2022), we greyscaled the images before feature extraction through a straightforward averaging of the three optical channels. However, it is worth noting that using established band weightings, such as those available in OpenCV, could potentially offer a more suitable approach for creating the greyscale images (Etsebeth 2020). Furthermore, our use of the basic sigma clipping procedure, as described in Walmsley et al. (2022), proved effective in reducing noise but at times could not accommodate nearby bright sources within the cut-outs. Alternative techniques, like the sigma clipping and masking procedure used in Lochner & Bassett (2021), may provide improved results by ensuring the removal of all sources that are not part of the target object.
For feature extraction, we showed that using the pre-trained CNN from Walmsley et al. (2022) to obtain representations of images works extremely well for DECaLS data, further demonstrating the value of CNNs as general-purpose feature extractors. This adaptability reduces the need for expensive labelling and allows the reuse of networks for other, unsupervised tasks. It also highlights the value of training networks to solve complex classification tasks on large data sets and publicly releasing the trained network weights and code for use by the community on other tasks and data sets.
We applied astronomaly using the anomaly detection algorithm iForest on the features extracted using the CNN. iForest scaled well to the considerable amount of data we had and was able to effectively identify artefacts, which are anomalies but are not scientifically interesting. The fact that a large number of artefacts were present in the list of anomalies, despite our selection cuts, highlights the difficulty of detecting artefacts with automated flagging techniques, but also the value of unsupervised methods in detecting artefacts that were missed. However, iForest alone could not differentiate between interesting and less interesting anomalies. We found that AL, using a relatively small amount of human labelling, is critical to success in anomaly detection for large, uncurated astronomical data sets. Using astronomaly's interface, it only took one person a few hours to label enough data to enable significant discoveries.
Fig. 6 shows that AL algorithms can be used to enhance the performance of anomaly detection algorithms by prioritizing the most relevant anomalies and filtering out less interesting ones. We compared two different AL approaches and found similar performance (Fig. 3). We chose the NS algorithm due to computational challenges with the direct regression method at larger scales. While more computationally efficient GP methods are available, the tests on the evaluation set suggested that the extra implementation effort might not justify potential performance gains. This may not be the case for other data sets (e.g. Walmsley et al. 2021).
The application of anomaly detection, in conjunction with the NS active learning method using a total of 10 000 labels, to the data set of 3 884 404 sources identified 1635 interesting anomalies in the top 2000 sources. Of these, eight gravitational lens candidates were identified, five of which are listed as candidates in other catalogues. In addition, 1609 sources were identified that contain galaxies exhibiting signs of a gravitational merger event. We compared astronomaly's ability to detect merger candidates with that of a supervised method. Unsurprisingly, the supervised approach outperformed the unsupervised method in completeness, but we did find our method was able to more rapidly recover a pure sample of merger candidates, which could then be used as a training set for subsequent classifiers. Unfortunately, the number of lenses detected was too small to perform a similar analysis, but it is not unreasonable to assume that, with enough examples, astronomaly could also be effective at finding an initial sample of lenses or other sources with unusual morphology.
Finally, 18 sources, shown in Fig. 7, were found that are, to the best of our knowledge, unstudied. These unusual sources vary in morphology and require additional investigation to identify their nature. They include ring galaxies exhibiting strange colours and morphology, a source that is half red and half blue, a potential strongly lensed system with a pair of sources acting as the lens, several known interacting groups, and some sources that are either interacting or coincidental alignments. Moreover, it is important to note that these sources were all contained within only the top 2000 most anomalous sources after applying 10 000 labels. This opens up the potential for significantly more interesting sources to be identified.
Our results show that the modern anomaly detection techniques included in Astronomaly scale well to large data sets and are capable of rapidly detecting scientifically interesting anomalies. As the number and quality of anomalies detected can be affected by selection cuts, these should be avoided as far as possible, for example by leveraging computationally intensive unsupervised frameworks running on remote data centres. This work paves the way for scientific discovery with anomaly detection in large data sets, such as those expected from the Vera C. Rubin Observatory, Euclid and the Square Kilometre Array.

ACKNOWLEDGEMENTS
We acknowledge the use of the ilifu cloud computing facility (www.ilifu.ac.za), a partnership between the University of Cape Town, the University of the Western Cape, the University of Stellenbosch, Sol Plaatje University and the Cape Peninsula University of Technology. The ilifu facility is supported by contributions from the Inter-University Institute for Data Intensive Astronomy (IDIA), a partnership between the University of Cape Town, the University of Pretoria, the University of the Western Cape, the Computational Biology division at UCT, and the Data Intensive Research Initiative of South Africa (DIRISA).

Figure 1. Distribution of the number of sources with varying pixel values as determined by equation (1) for a random sample of sources. The number of pixels is given by the square root of N_eff in equation (1).

Figure 2. Examples of sources in the data set with their labels. The first row contains sources such as artefacts, masked sources and low SNR sources. At the bottom are sources that would be considered interesting anomalies, such as galaxy mergers, strong gravitational lenses and other sources that are not readily identifiable. The angular scale bar within each image represents 5 arcsec.

Figure 3. Performance of the NS and DR active learning algorithms on the evaluation set. The algorithms were applied in iterations of 100 labels, with the results from 100 and 500 labels illustrated. Both algorithms show comparable performance.

Figure 4. Recall of three types of anomaly in the evaluation set. The dashed lines represent the initial results from iForest, while the solid lines represent the results obtained from the NS algorithm trained with 200 labels. The plot has been normalized to emphasize the performance gains of AL. AL improves the detection rates of interesting sources while decreasing the impact of unwanted artefacts.

Figure 5. UMAP plot of the evaluation set. The different scores/labels show that subclusters form within the feature space but are surrounded by the more common, uninteresting sources. It is interesting to note that the artefacts, represented in pink, form a relatively distinct cluster. The anomalies show a similar pattern, forming a very dense cluster.

Figure 6. Recall as a function of rank for increasing numbers of labels. The plot on the left shows the number of anomalies including those that have been previously labelled, whereas the plot on the right excludes the labelled anomalies, showing the unlabelled sources the algorithm determines as most likely to be interesting. It is clear that the number of "new" anomalies found in the top 2000 increases as more labels are used for training. However, there is a point of diminishing returns, as the number of "new" anomalies decreases rapidly when more than 4000 labels are used. It should also be noted that the line for 0 labels is not visible on either plot, since only one source was detected.

Figure 7. The anomalies shown here are difficult to identify with a quick visual inspection and need further investigation to determine their nature and origin. Table 2 lists more information on these anomalies. The angular scale bar within each image represents 10 arcsec.

Figure 8. The strong gravitational lens candidates detected in the top 10 000 anomalies. More information about these sources can be found in Table 3. The angular scale bar within each image represents 5 arcsec.
collaboration. Funding for the DES Projects has been provided by the U.S. Department of Energy, the U.S. National Science Foundation, the Ministry of Science and Education of Spain, the Science and Technology Facilities Council of the United Kingdom, the Higher Education Funding Council for England, the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign, the Kavli Institute of Cosmological Physics at the University of Chicago, the Center for Cosmology and Astro-Particle Physics at the Ohio State University, the Mitchell Institute for Fundamental Physics and Astronomy at Texas A&M University, Financiadora de Estudos e Projetos, Fundacao Carlos Chagas Filho de Amparo a Pesquisa do Estado do Rio de Janeiro, Conselho Nacional de Desenvolvimento Cientifico e Tecnologico and the Ministerio da Ciencia, Tecnologia e Inovacao, the Deutsche Forschungsgemeinschaft and the Collaborating Institutions in the Dark Energy Survey. The Collaborating Institutions are Argonne National Laboratory, the University of California at Santa Cruz, the University of Cambridge, Centro de Investigaciones Energeticas, Medioambientales y Tecnologicas-Madrid, the University of Chicago, University College London, the DES-Brazil Consortium, the University of Edinburgh, the Eidgenossische Technische Hochschule (ETH) Zurich, Fermi National Accelerator Laboratory, the University of Illinois at Urbana-Champaign, the Institut de Ciencies de l'Espai (IEEC/CSIC), the Institut de Fisica d'Altes Energies, Lawrence Berkeley National Laboratory, the Ludwig Maximilians Universitat Munchen and the associated Excellence Cluster Universe, the University of Michigan, NSF's NOIRLab, the University of Nottingham, the Ohio State University, the University of Pennsylvania, the University of Portsmouth, SLAC National Accelerator Laboratory, Stanford University, the University of Sussex, and Texas A&M University. BASS is a key project of the Telescope Access Program (TAP), which has been funded by the National Astronomical Observatories of China, the Chinese Academy of Sciences (the Strategic Priority Research Program "The Emergence of Cosmological Structures" Grant # XDB09000000), and the Special Fund for Astronomy from the Ministry of Finance. BASS is also supported by the External Cooperation Program of the Chinese Academy of Sciences (Grant # 114A11KYSB20160057) and the Chinese National Natural Science Foundation (Grant # 12120101003, # 11433005). The Legacy Survey team uses data products from the Near-Earth Object Wide-field Infrared Survey Explorer (NEOWISE), which is a project of the Jet Propulsion Laboratory/California Institute of Technology. NEOWISE is funded by the National Aeronautics and Space Administration. The Legacy Surveys imaging of the DESI footprint is supported by the Director, Office of Science, Office of High Energy Physics of the U.S. Department of Energy under Contract No. DE-AC02-05CH1123; by the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility under the same contract; and by the U.S. National Science Foundation, Division of Astronomical Sciences under Contract No. AST-0950945 to NOAO.

Table 1. The results of applying different selection criteria to the evaluation set. Each row represents a criterion that was applied. The first column indicates the name of the criterion, the second column shows how many lenses (out of 342) were excluded by that criterion, and the third column shows how many of the 15 342 sources were excluded by that criterion.

Table 2. Information about the 18 initially unidentified anomalous sources detected in the main set. The second and third columns give the right ascension and declination in degrees, respectively. The fourth column lists any known names of the target from other surveys or catalogues, if matched. Blank entries are unmatched sources.

Table 4. Coordinates of the galaxy merger candidates shown in Fig. 9. The Entry columns indicate the corresponding image as shown in Fig. 9, and the following columns give the right ascension and declination in degrees, respectively.

Table 5. Comparison of precision values based on predicted user score with a supervised method implemented by O'Ryan et al. (2023). Precision values are calculated at the various specified recall thresholds.