Weakly-Supervised Anomaly Detection in the Milky Way

Large-scale astrophysics datasets present an opportunity for new machine learning techniques to identify regions of interest that might otherwise be overlooked by traditional searches. To this end, we use Classification Without Labels (CWoLa), a weakly-supervised anomaly detection method, to identify cold stellar streams within the more than one billion Milky Way stars observed by the Gaia satellite. CWoLa operates without the use of labeled streams or knowledge of astrophysical principles. Instead, we train a classifier to distinguish between mixed samples for which the proportions of signal and background samples are unknown. This computationally lightweight strategy is able to detect both simulated streams and the known stream GD-1 in data. Originally designed for high-energy collider physics, this technique may have broad applicability within astrophysics as well as other domains interested in identifying localized anomalies.


Motivation
The history of our home galaxy, the Milky Way, has been marked by the ongoing aggregation of stars, gas, and dark matter from various sources throughout the Universe.These accumulation events include mergers with other galaxies as well as smaller-scale gravitationally-bound groupings of stars such as globular clusters.Though many such collisions occurred in the distant past, lingering remnants from more recent collisions contain crucial information revealing the Milky Way's merger history [1][2][3][4][5][6], underlying gravitational potential [7][8][9][10][11][12], and dark matter content [13][14][15][16][17][18][19][20].Since 1971, astronomers have observed collections of stars called stellar streams: thin, ribbon-like arcs orbiting the Milky Way's galactic center [21].These dynamically cold associations of stars are thought to be the result of gravitational tidal forces from the Milky Way disrupting and warping nearby low-mass progenitors -dwarf galaxies or globular clusters -until the stars are no longer gravitationally self-bound.Due to their shared origin, the stars tend to share many characteristics ranging from proper velocity to age.
To date, nearly 100 stellar streams have been identified in the Milky Way [22].They are challenging to discover and study due to their sparse densities and wide angular extents.Gaining a more detailed understanding of the structure of known streams as well as uncovering additional streams will be critical for deepening our understanding of the Milky Way.An extensive catalogue of high-precision stream measurements would greatly improve our estimation of the particularites of Galactic large-scale and small-scale structures, including contributions from cold dark matter.

Related Work
Traditional methods of identifying stellar streams look for groupings of stars that are similar along various metrics: color and magnitude [23,24], velocity [25][26][27], or position along great circle paths across the sky [28,29].More recently, an automated technique called Streamfinder [30], leverages both position and kinematic information to construct volumes called "hypertubes" in a multidimensional phase space around a stream candidate, then iterates to optimize a statistic similar to a log-likelihood to determine the best parameters for a given hypertube.While this algorithm has led to the discovery of multiple low-density stellar streams, it makes several assumptions based on astrophysical principles.For instance, it assigns the distance to a stream based on a particular choice of isochrone and assumes a specific Milky Way potential to calculate a stream's orbit.Data mining and clustering techniques such as DBSCAN have also been applied to Gaia data for stellar stream searches [31].
Machine learning techniques have also been deployed in search of stellar streams.Recently, an unsupervised machine learning technique called Via Machinae [32,33], based on the ANODE method [34] originally designed for high-energy particle physics searches, was applied towards the automated discovery of stellar streams within the Gaia Data Release 2 catalogue.Via Machinae combines a normalizing flow density estimation technique [35] for anomaly detection with a line-finding algorithm to identify stellar stream candidates without the use of detailed assumptions about isochrones or stream orbits.

CWoLa: Classification Without Labels
Machine learning has been widely and successfully applied in the context of fundamental physics to the classification and description of various physical phenomena ranging from subatomic to cosmological scales.These techniques excel at identifying complex patterns in a dataset without imposing any prior assumptions about their distributions.When dealing with real data with partially inaccurate or incomplete labels, weakly-supervised machine learning methods can be particularly helpful.
The fields of high-energy particle physics and astrophysics share a common interest in identifying localized features, meaning overdensities of data concentrated in contained regions of phase space, within vast and high-dimensional datasets.Model-independent forms of anomaly detection -the process of identifying these localized features that deviate from a dataset's typical characteristics -can efficiently filter these large datasets in an unbiased manner and aid in potential discoveries.
In this paper, we demonstrate the first astrophysical application of Classification Without Labels (CWoLa1 ) [36], a weakly-supervised machine learning technique based on a simple, lightweight neural network classifier.Like ANODE [34], the neural network architecture used in Via Machinae [32,33], CWoLa was originally designed for identifying particles within high-energy particle physics datasets.As such, CWoLa has been applied as a promising model-agnostic anomaly detection method for searching for localized features such as the potential signatures of new fundamental particles at the Large Hadron Collider (LHC) [37][38][39], but until now has not been used on astrophysics datasets.
This analysis uses the same datasets as in Via Machinae and follows the same general framework.We scan across regions defined by one proper motion coordinate and train neural networks to assign an anomaly score to stars in these regions using five input variables: two angular position coordinates, one proper motion coordinate (the one not used for the scan), magnitude, and color.In both analyses, we scan along proper motion because we expect stellar streams to be kinematically cold and therefore relatively localized along this feature compared to background stars, though we do not make any requirements that the streams fall on a particular orbit or stellar isochrone.Once the anomaly scores are assigned, we look at the subsets of stars from each scanning window with the highest anomaly scores, cluster them, and apply some fiducial selections to further refine them.We apply these techniques on a known stream called GD-1 as well as simulated stellar streams as benchmarks.
Despite these overarching similarities, this analysis also departs from that of Via Machinae in several ways.Most significantly, the choice of neural network model to assign anomaly scores and the post-training clustering method are different.These alternative choices in methodology have been implemented with the aim of achieving similar performance in anomaly detection with a much more computationally lightweight framework.A detailed discussion of similarities, differences, and potential benefits or drawbacks between these analyses is presented in Section 5.

Outline
This paper is organized as follows.First, in Section 2, we describe the Gaia dataset and how it is processed for use in our anomaly detection studies.Then, in Section 3, we explain the methodology of applying CWoLa for anomaly detection and our particular implementation of CWoLa on Gaia data, including how we define the signal and sideband regions as well as the neural network model architecture and training procedure.The results of applying CWoLa to the known stellar stream GD-1 are listed in Section 4. Finally, in Section 5 we conclude with a discussion of CWoLa's potential usefulness in aiding future stellar stream discoveries and some further steps for this work that can bring us closer to that goal.

Gaia Dataset
The Gaia catalogues are extensive astronomical data releases mapping the stars populating the Milky Way [40].The Gaia satellite itself was launched in 2013, and its data catalog is being released in discrete stages throughout its operational lifespan through 2025.The data analyzed for this work comes from Gaia's Data Release 2 (DR2), the collection of Gaia's observations from July 25, 2014 to May 23, 2016 [41].Gaia DR2 contains position, proper motion, and photometric information for approximately 1.3 billion stars, representing around 1% of the total star population of the Milky Way.While this analysis was already in process, Gaia released two additional data releases: Early Data Release 3 (eDR3) and Data Release 3 (DR3).Were this analysis to be extended using eDR3 or DR3 data, we would likely see some further improvements due to reduced measurement errors.Other changes in eDR3 and DR3  include improved distance and radial velocity measurements for a small subset of stars, but these variables are not considered in this analysis.
While Gaia DR2 also contains information on parallax and mean radial velocity, these variables are not considered as input variables to CWoLa for this analysis.We exclude parallax because it is not as reliable a feature as our other observables for stars as distant as the stream members in which we are most interested.However, we do use parallax to apply a cut on the Gaia data to restrict stars to a maximum parallax of 1, meaning stars at a distance of at least 1 kpc.Mean radial velocity, or motion along the axis between the Earth and each star, is also measured for some stars in the Gaia dataset, but Gaia DR2 contains radial velocities for only about 7 million stars, representing less than a percent of the overall star catalog.We therefore omit radial velocity in order to maximize available training statistics.
As in [32], we use the stellar stream GD-1, discovered in 2006 [42], as the main candidate for evaluating the performance of CWoLa as a stream-finding technique.Most likely the remains of a tidally-disrupted globular cluster, GD-1 consists of primarily metal-poor stars totaling approximately 2 × 10 4 M [43].It lies at a distance of approximately 10 kpc from the Sun and 15 kpc from the Galactic Center.GD-1 is especially narrow [42], dynamically cold [44], and bright [15] compared to other stellar streams in the Milky Way.It also contains various known physical peculiarities including gaps and wiggles [45], offshoots ("spurs") and overdensities ("blobs") [16], and even a surrounding "cocoon" of stars [46].The faithful reconstruction of some of these density perturbations can therefore serve as an additional metric indicating the physical validity of CWoLa's outputs.

Data Preprocessing
We use the same dataset of Gaia stars as in Ref. [32].Following the same data processing methodology, we train our model on a series of 21 overlapping circular "patches" of the Gaia dataset with radius 15 • .While the natural angular position coordinates in DR2 are right ascension (α) and declination (δ), we use rotated and centered coordinates φ and λ (as well as rotated proper motion coordinates µ φ and µ λ ) such that each patch has a Euclidean distance metric and is centered at (α 0 , δ 0 ) = (0°, 0°).This transformation is performed using Astropy [47][48][49].Each patch's center location is also documented in [32].
Beyond these two rotated and centered angular positions, we consider four additional features associated with each star in the dataset: two angular proper motions (µ φ * , where µ φ * ≡ µ φ cos(λ), and µ λ ), color (b−r, where b represents the brightness of the blue photometer and r represents the brightness of the red photometer), and magnitude (g).Distributions of the six relevant variables used in the analysis (φ, λ, µ φ * , µ λ , b − r, and g) are shown for one such patch in Figure 1.These patches cover an irregularly-shaped region stretching between Following these selections, 1,957 of the approximately 8 million total stars considered for this analysis are labeled as likely belonging to the GD-1 stream using the catalogues developed by Refs.[50,51].This choice of labeling is based on selections in position, proper motion, and along an isochrone in color and magnitude.While these labels cannot be considered fully accurate or complete, they serve as a helpful reference for evaluating our model's efficacy.

Classification Without Labels (CWoLa)
CWoLa is a weakly-supervised machine learning technique designed to find anomalous features in a dataset that are localized along at least one dimension.It was originally designed for applications within high-energy particle physics, where mixures of particle classes with unknown proportions of signal and background are common.It detects anomalies by scanning along a localized dimension -for instance, the invariant mass of the final state of a particle physics even -and learning to distinguish between mixtures of data classes where the precise  class proportions within each mixture need not be known.Simply by learning to differentiate regions with higher vs. lower proportions of signal, i.e. "signal" vs. "sideband" regions, CWoLa can be a powerful indicator of patterns of anomalous events.
Consider a signal-enriched mixture M 1 and a signal-depleted mixture M 2 , as shown in Figure 2a."Signal" refers to an object class of interest -here, a member of a localized anomalous feature, such as a stellar stream, that one would like to detect -while "background" refers to objects not part of the anomaly.Both mixtures contain signal and background events, but the signal-enriched mixture has significantly more signal events relative to the signal-depleted mixture (i.e., f 1 > f 2 , where f i indicates the fraction of signal events in each mixture).We exploit the fact (see proof of Theorem 1 in [36]) that an optimal classifier trained to distinguish events between M 1 and M 2 is the same as an optimal classifier trained in a fully-supervised manner to distinguish signal from background events.Importantly, the exact proportions of signal in each mixture (f 1 and f 2 ) need not be known for this to hold.This theorem relies on the Neyman-Pearson lemma [52] that states that an optimal classifier h( x) is any function monotonic to the likelihood ratio constructed from the probability distributions of signal and background p S ( x) and p B ( x) for input variables x.
We can apply CWoLa as a model-agnostic, data-driven anomaly detection technique [37,38] by identifying a certain feature of our dataset that might contain a localized anomaly, as illustrated in Figure 2b.We then train a fully-supervised classifier to distinguish between events from a "signal region" and a surrounding "sideband region", as defined by ranges of this feature.The inputs to this classifier are auxiliary variables that should be decorrelated from the characteristic used to define the signal and sideband regions if no anomaly is present.If an anomaly is present and contained primarily in the signal region, then we expect the anomalous events to be ranked more highly by the classifier.We can then repeat this process by sliding the signal and sideband windows across a range of values.For each choice of signal and sideband, we apply a threshold on the classifier output score (i.e. the top N events or a top percentile of the test set) such that only the highest-score events remain.When an anomaly is present, these highest-score events will contain an enhanced signal-to-noise ratio of events.The CWoLa anomaly search makes two key assumptions during its procedure.First, it requires that the anomaly is localized in the dimension over which we search.Because stellar streams are kinematically cold, they are relatively localized in both coordinates of proper motion.We select µ λ as the primary coordinate used to define the signal and sideband regions for each patch of DR2.A histogram demonstrating the highly localized nature of µ λ within an example patch of GD-1 is shown in Figure 3. Second, it expects that background and signal events are indistinguishable between the signal and sideband regions.This is a reasonable assumption, as shown for an example patch of data in Figure 4.

Defining Signal & Sideband Regions
For each of the 21 patches of DR2 considered in this study, we construct a signal region to contain the bulk of the stream stars available and neighboring sideband regions such that the background stars in both signal and sideband regions are as close to indistinguishable as possible.Ideally, the stars in both regions should have similar characteristics: background stars in the signal region should closely resemble background stars in the sideband region, and same logic applies for the stream stars.Stars belonging to a stellar stream make up a small fraction of total stars within each patch, so in our case, each signal region will still be dominated by background stars not labeled as belonging to the stream.However, each signal region should have a higher signal-to-background ratio than the sideband region.Signal regions are ideally defined by a range of µ λ values that encompass the bulk of the stream stars.There are many valid ways to define these regions in general, and in some cases, the best definitions may be model-dependent.

Signal and Sideband Distributions
For this proof-of-concept result, we opted for idealized signal and sideband limits based on where we know the stream to be concentrated in proper motion.However, it is crucial to note that for a full-scale anomaly search, one could scan across a range of µ λ values, meaning one should still be sensitive to streams even if different signal and sideband region were selected.
In this case, we define the signal region in each patch as the region within one standard deviation of the median µ λ of the GD-1 stars in the patch.The sideband region within each patch is then defined as the stars falling within [−3σ, −σ] or [σ, 3σ] of the median.Given that the signal region encompasses a bulk of the stream, the sideband regions will have significantly fewer stream stars and will be signal-depleted, as desired.In practice, the average span of the signal regions across the 21 patches in µ λ was 2.34 ± 0.36 mas/year and the average span of the sideband regions was 7.02 ± 1.09 mas/year.

Neural Network Architecture and Training Procedure
We implement CWoLa with a neural network built in Keras [53] with a TensorFlow backend [54].The model consists of 3 hidden fully-connected layers, each with a layer size of 256 nodes and a ReLU activation [55].Each fully-connected layer is followed by a dropout operation with a dropout rate of 20% [56].These layers are followed by a final output layer of a single node with a sigmoid activation.Hyperparameter values for layer size, batch size, and number of k-folds (described below) was chosen via an optimization using Optuna [57].
For each of the 21 Gaia patches considered in our search for GD-1, we train a series of classifiers to separate stars labeled as part of the signal region from stars labeled as part of the sideband region.This quality defines CWoLa as being "weakly-supervised": it operates with a little more information than a fully unsupervised network, as we expect an optimal signal region to contain a higher fraction of GD-1 stars than in the sideband region, but it does not have access to the actual GD-1 labels.The training procedure, which closely aligns with other CWoLa searches [37,38], unfolds as follows: 1. k-folding: We implement stratified k-folding (k = 5) to randomly divide all the stars in a given patch into five sections, or "folds".Each fold is chosen such that the overall percentage of stars labeled as "signal" vs. "sideband" is also maintained within each fold.The first fold (20% of all stars) is reserved as a test set.The second fold (another 20% of all stars) is used as a validation set.The remaining 60% of stars are used for training.
2. Train: Next, a classifier with the architecture specified above is trained on the training set with a batch size of 10,000 for up to 100 epochs, though early stopping with a patience of 30 typically halts the training process well before this limit.The large batch size is necessary due to the low number of labeled stream stars in the overall dataset -for example, one patch on the tail end of GD-1 has a stream star population of just 0.15%.Large batch sizes therefore help ensure that more than a handful of stream stars will be seen at a time by the network during training.We use the binary crossentropy loss function and Adam optimizer [58].The validation set is used to monitor the validation loss for early stopping.

Repeat:
The classifier training is repeated twice more, each time with a random initialization of trainable parameters.Of the three distinct trainings, the weights are stored for the model with the lowest validation loss.

Cycle through validation sets:
This process is repeated using each of the remaining folds as a validation set with the exception of the test set, which remains unchanged.
For each configuration, the remaining three folds besides those used for the test and validation sets are used for training.

5.
Evaluate on test set: Each of the best models trained using the four k-fold options for the validation set is evaluated on the test set.The final CWoLa score for each star in the test set is defined as the average across the four scores.
6. Combine test sets: This entire process is repeated, cycling through each of the five possible k-folds as a test set.These test sets are then concatenated into a single dataset such that every star in the patch ends up in the test set exactly once.

Model Evaluation
After training the neural network classifiers inherent in the CWoLa methodology, a series of fiducial selections are applied to each patch to further refine the results and optimize for the highest possible signal-to-noise ratio.The fiducial selections used are almost identical to their counterparts in [32,33]: • g < 20.2, to ensure uniform acceptance by the Gaia satellite • |µ λ | > 2 mas/year or |µ * φ | > 2 mas/year, to remove very distant stars that are predominantly concentrated near zero proper motion and therefore not equally distributed throughout the patch • 0.5 ≤ b − r ≤ 1, to isolate old and low-metallicity stellars streams in color space.
Unlike in Via Machinae, however, we do not need to apply a cut restricting the patch radius from 15°→ 10°after training, as CWoLa is not influenced by the edge effects seen in the ANODE implementation.
Via Machinae employs a sophisticated line-finding strategy using modified Hough transforms [59] to search for line-like structures in the identified anomalous stars and then combines these line segments into an overall stream candidate.We achieve similar results with a relatively lighter computational load using k-means clustering [60] (k = 2) in proper motion space.Following the grouping of stars into two clusters, we select the cluster with the largest population of stars and discard the stars in the other cluster.This is motivated by our expectation that the stellar stream should be kinematically cold, therefore the velocities of its constituent stars should be densely clustered in velocity space.This clustering strategy is likely best used as a post-discovery tool and may not perform well in contexts with high numbers of contaminant stars not belonging to a stream.In these cases, opting for a larger k > 2 or a line-finding technique could instead be a better choice.
Following these fiducial selections, model performance was evaluated by applying the classifier to stars in the combined test set equivalent to the entire patch.The output scores were sorted from highest to lowest, where higher values indicated that the model ranked those stars as more likely to belong to the signal region than the sideband region.The top N = 250 stars, ranked by neural network output scores, are chosen for evaluation.
The number 250 was chosen following an optimization for both purity (percentage of topranked stars overlapping with labeled GD-1 stars) and completeness (percentage of labeled GD-1 stars covered by CWoLa's top-ranked stars).In principle, however, one could isolate a different absolute number or relative percentage of top stars, though it would be advisable to stay under the average of 430 labeled stream stars per patch.
It is worth emphasizing that this method of model evaluation requires ground truth labels.In the absence of reliable stream labels, or in the case of discovering a new stream, we must employ different methods to evaluate model performance, not to mention a modified strategy for the model implementation itself.We discuss this further in Section 5.

Results
Before looking at real Gaia data, we also evaluated the performance of CWoLa when applied to 100 randomly-chosen simulated stellar streams.As shown in Appendix A, with just two passes of CWoLa, 96% of streams are identified with nonzero purity, of which 69% are identified with a purity larger than 50%.

GD-1 Stream Identification
The combined results of applying the CWoLa technique to each of the 21 patches of Gaia DR2 are shown in Figure 5. Results for individual patches are detailed in Appendix B. Across the 21 patches, 1,498 unique GD-1 stars pass our fiducial selections.1,360 unique stars are identified in the combined top N = 250 stars for each CWoLa patch.Of these, 760 are part of the labeled GD-1 star set [51].Thus, across the entire stream, we achieve a purity of 56% and a completeness of 51%.In our optimization studies, we found that stream purity plateaued at a maximum value of 78% using the top N = 25 stars in each patch, but this choice of N only yields a competeness of 13%.Conversely, choosing N = 300 yields a reduced purity of just 30%, but a higher completeness of 54%.

GD-1 CWoLa
Figure 6.The CWoLa-identified stars across all patches are compared with the labeled GD-1 stars from [51] in stream-aligned coordinates φ 1 and φ 2 .This perspective highlights that CWoLa has identified several of the density perturbations unique to GD-1: two sparsely-populated "gaps" near φ 1 = −40°and φ 1 = −20°; an offshoot, or "spur", near φ 1 = −35°; and an overdensity of stars, or "blob", near φ 1 = −15°.CWoLa PDF Figure 7. Three subsets of the CWoLa-identified stars (the "blob", "spur", and a control region) are selected and fit with a three-component Gaussian mixture model to highlight the kinematic qualities of the additional feature, if present.In each case, the GD-1 stream corresponds to the primary narrow peak centered near φ 2 = 0°.In the first two plots, we see clear indications of a second peak representing each feature.).This appears to support the general trend observed in Figure 6 of [46].
The majority of GD-1 is quite narrow, with an average angular width of approximately 0.5° [46], and dense, with approximately 100 stars per 5°bin between φ 1 = −60°and φ 1 = 0°.Within this region, CWoLa can reliably identify the stream stars.The tail ends of GD-1 (φ 1 ≤ −60°and φ 1 ≥ 0°) are more sparsely populated, with about half the average population per bin of the main body of the stream, and less localized in µ λ , meaning stream stars in this region are harder to identify using CWoLa.Some stars in these regions may also have been excluded from the 21 patches due to their proximity to the Galactic disc or the presence of nearby dust.These regions also tend to include stars with small proper motions, meaning the stream stars are more likely to be overwhelmed by distant background stars.
We can also analyze these results in a rotated set of position coordinates φ 1 and φ 2 [43] designed to align with the main body of the stream, as shown in Figure 6.This perspective highlights that CWoLa has identified several of the density perturbations unique to GD-1: two sparsely-populated "gaps" near φ 1 = −40°and φ 1 = −20°; an offshoot, or "spur", near φ 1 = −35°; and an overdensity of stars, or "blob", near φ 1 = −15°.We can more quantitatively demonstrate the identification of these features, as in Figure 5 of [50], by looking at histograms of φ 2 in various ranges of φ 1 as shown in Figure 7.This study highlights the "spur" and "blob" in particular by fitting a histogram of stars near each feature, along with a third control region, with a three-component Gaussian mixture model assuming a background, the GD-1 stream, and the feature ("blob" or "spur").
As mentioned above, the underdense regions, or "gaps", in GD-1 are typically observed near φ 1 ≈ −40°and φ 1 ≈ −20°.A third underdense region has also been identified near φ 1 ≈ −3° [44].Our results would not be inconsistent with this third gap, as the density in this area for the CWoLa-identified stars is indeed low, on par with the densities seen at the other two "gaps", but since this region is so close to the furthest extent of the CWoLaidentified stars, it is not clear whether this underdensity is a feature from the stream or a reflection of the diminished purity of stars in the patches on the ends of the stream.
Another reported feature of GD-1 is a wider "cocoon" of stars with a width of around 1°s urrounding a much denser core of the stream [46].To probe this feature, we first calculate the median φ 2 in broad 5°bins of φ 1 to find a smoothed trajectory for the stream.Then, we shift each CWoLa-identified star by its median φ 2 location (see Figure 8a).Once the stream has been centered around this path, we make a 3σ selection, as in [46], then plot the histogram of shifted φ 2 (see Figure 8b).Fitting the distribution to a two-component Gaussian mixture model reveals a narrow peak with a standard deviation of σ ≈ 0.3°(the core of the stream) and an additional wider peak with σ ≈ 1.7°.This appears to support the observation of such a "cocoon" of more diffuse stars surrounding the central core of the stream.

Towards an Augmentation of the GD-1 Stream Labeling
Beyond identifying cold stellar streams without labels, the CWoLa technique may also be useful for improving the labeling systems that indicate which stars belong to a particular stream.For instance, CWoLa can identify promising stellar candidates that were not labeled as GD-1 members in [51], but nevertheless have properties closely aligned with labeled stars.As shown in Section 4, 1,360 unique stars are identified by CWoLa across all 21 patches, and of these, 760 (56%) are part of the labeled GD-1 star set.We can further refine the remaining 600 unlabeled stars by identifying the subset of individual stars s that minimize the Euclidean distance d to their respective closest labeled GD-1 star in the test set s along the 5 dimensions used as CWoLa inputs, individually standardized to have µ = 0 and σ = 1: (φ, λ, µ φ * , color (c ≡ b − r), and magnitude (g)): (4.1) Isolating the stars yielding the smallest 10% of distances d reveals 60 additional stars, shown in Figure 9 and listed explicitly in Appendix C, that appear to align with the labeled GD-1 stars across these six dimensions and would be interesting to investigate as potential GD-1 member candidates.A detailed cross-checking of these candidate GD-1 stream stars with other, more precise GD-1 stream catalogs will be pursued in future work.

Discussion
It is evident that CWoLa successfully identifies significant portions of GD-1, as measured by not only purity and completeness but also on the faithful reconstruction of physical characteristics and density perturbations characteristic of this stream.Additionally, CWoLa is highly effective at identifying simulated streams.
While the analysis presented here shares many core strategic components and the same dataset with Via Machinae, this analysis differs primarily in terms of the mechanisms for how to assign anomaly scores to stars, how to divide the sky into subsections for scanning, and how to cluster stars post-training.CWoLa is implemented via a comparatively simple, lightweight, and easy-to-train neural network-based classifier instead of a normalizing flow model to approach the same problem of anomaly detection.When applied to the same example stellar stream, GD-1, CWoLa is able to identify stars with comparable purity with much less computational overhead.We find 760 labeled stars overall, yielding a 56% purity, while Via Machinae's first iteration [32] found 738 stars, yielding a 49% purity.Via Machinae's latest iteration [33], which includes additional fiducial selections and an augmented scan over both proper motion variables, increases its star yield to 820, or 65% purity.
It is worth emphasizing that our approach does not apply any kind of line-finding or protoclustering algorithms as is done in Via Machinae -the anomalous stars here are simply combined and filtered via k-means clustering.This lightweight clustering strategy is particularly useful in a post-discovery context in which we are interested in refining stream membership catalogues.Another important distinction between these techniques is that CWoLa uses signal and sideband regions of varying widths that are chosen for each patch based on the location of the signal, while Via Machinae searches over regions of interest defined by the orthogonal proper motion coordinate with a fixed width of 6 mas/year with centers spaced 1 mas/year apart.The fiducial selections also differ slightly between these implementations: Via Machinae restricts each patch to the innermost 10°circle in position space to avoid edge effects, but CWoLa does not exhibit these effects and thus we do not impose this selection.
Training normalizing flows such as those in ANODE can be a time-intensive task, while completing the full CWoLa training paradigm for GD-1 takes just 15 minutes per patch on an NVIDIA A40 GPU.Running over each patch is embarrassingly parallel and can easily be run simultaneously based on GPU access or using multiprocessing on CPUs.Running all 21 patches on a single GPU takes about 5 hours in total, making it quite feasible for researchers to optimize their signal and sideband region definitions as well as to combine the results from scans of multiple variables.
This training time can be even further reduced by applying the fiducial selections to the samples before training -cutting training time roughly in half and yielding an overall purity of 44%.For a 100% improvement in training time, this technique only reduces the final purity of the identified stars by about 20%, making this a valuable option particularly for coarsegrained scans across wide areas of phase space.CWoLa and Via Machinae can therefore be thought of as complementary tools for stream detection under different circumstances or computational constraints.
If CWoLa can be used to identify known streams, it may also be used to potentially find new, undiscovered streams.Some additional challenges will arise when extending CWoLa to look for new stellar streams within the full Gaia dataset.Detected anomalies are not necessarily guaranteed to be stellar streams, since CWoLa could identify any localized anomalous features.A lack of ground truth labeling for a stream would also require a reevaluation of our performance metrics -for instance, streams would need to be evaluated using the standard anomaly detection technique of performing a series of selections (e.g. a range of percentiles of the neural network score, or hand-picked thresholds based on the background rate in the sideband region) on a histogram of proper motion.If no anomaly is present, these increasingly harsh selections will reduce the sample statistics without significantly altering the histogram shape, as shown in Figure 10a.However, if an anomaly is present and identified by CWoLa, a new shape will emerge with increasingly harsh selections on the distribution in question, as shown in Figure 10b.Additionally, multiple passes of CWoLa might be needed with different choices of signal and sideband region widths if the approximate width of the anomaly is not a priori known.Searching for new stellar streams will require scanning over the full range of proper motion values in the dataset, since we will not know where new streams might be localized.By applying CWoLa in a coarse sliding-window fashion across µ λ , regions of interest may be identified.These regions can be further studied through finer scans until anomalous data points are identified.When combining multiple patches together for an overall result, we might also need to additionally employ a line-finding algorithm for identifying larger-scale stream-like results, such as the modified Hough transform used in Via Machinae [32].

Conclusion
We have demonstrated a new application of "CWoLa hunting", an anomaly detection technique based on the weakly-supervised machine learning classifier CWoLa that is designed to detect localized anomalies in a model-agnostic manner.CWoLa is shown to be easy to train, highly computationally efficient, and, most importantly, effective at identifying anomalies including the stellar stream GD-1 and dozens of simulated streams with high purity.The GD-1 candidate stars identified by CWoLa exhibit the same density perturbations and physical characteristics (the "spur", "blob", "cocoon", gaps, and overdense regions) noted in several independent studies of the stream.The neural network output scores also give clues as to which stars might have been accidentally omitted from more formal GD-1 labeling schemes, suggesting several promising candidates.The successful application of CWoLa in this study shows that CWoLa has strong potential to improve the signal-to-noise ratio on the membership of known streams as well as to potentially reveal previously undetected streams throughout the Galactic halo.CWoLa has broad applicability as a weakly-supervised anomaly detection technique outside of high-energy physics -and, potentially, it could be applied into still more areas of fundamental science.

Reproducing these Results
A codebase with instructions on how to reproduce each of the plots in this paper is located at https://github.com/hep-lbdl/GaiaCWoLa.The datasets needed to fully reproduce the plots in this paper (with CWoLa already applied) are publicly available [61].The full 21 patches covering GD-1 are also publicly available [62].

A Simulated Stellar Streams
Our implementation of CWoLa for stellar stream discovery was first tested on simulated stellar streams.These simulated streams are frequently highly localized in the proper motion coordinate µ λ , meaning they may tend to be easier to find with our methods than a real stream such as GD-1.
The streams were simulated using the Gala [63] Python package to evolve stars in a mock globular cluster along an orbit through the simulated Milky Way potential [64], with the center of the stream randomly placed on the sky with a distance randomly chosen between 5 and 20 kpc from the Earth.Stellar properties for the stream components were generated from a MIST [65][66][67][68][69][70] isochrone, assuming [Fe/H] = −1.6 and an age of 10 billion years.Observational errors compatible with the Gaia DR2 dataset were added to the synthetic stream stars using the PyGaia [71] package.
An example simulated stream, representing just 1,161 (0.13%) of the 886,677 stars in the simulated patch, is shown in angular position space in Figure 11.The simulated streams are presented as standalone patches, so CWoLA is applied to just one simulated patch at a time.No fiducial cuts are applied to the simulated patches.As a proof of concept, we choose idealized signal and sideband limits with prior knowledge of the location of the stellar stream: the sideband is defined as the window of total width σ 4 surrounding the median µ λ value of the stream, while the sideband is defined as the

CountsFigure 1 .
Figure 1.Two-dimensional histograms of the six features used in this analysis are illustrated for a single patch in the sky containing some GD-1 stars.This patch is centered at (l = 207.0,b = 50.2).The top row shows the full patch with no selections applied.The second row shows the patch with fiducial selections applied: g < 20.2 to reduce streaking; |µ λ | > 2 mas/year or |µ * φ | > 2 mas/year to remove too-distant stars, and 0.5 ≤ b − r ≤ 1 to focus on identifying cold stellar streams.The third row indicates the six features for the GD-1 stream following the fiducial selections.

Figure 2 .
Figure2.Signal-enriched and signal-depleted groups are pictured above.The green data points labeled "S" represent signal events, while the red data points labeled "B" represent background events.The signal and sideband regions are chosen such that more signal events (shown in orange) are located in the central signal region than the surrounding sideband region.

Figure 3 .
Figure 3. Stars associated with the stellar stream GD-1 are highly localized in µ λ space in comparison with background stars for the same patch of Gaia data seen in Figure 1.The signal region, shown in the darkest regions in each plot, is defined by taking ±1σ from the median µ λ value for the stream stars, which in this case is [−13.6,−11.4].The sideband region is defined by taking ±3σ from the stream's median µ λ value, excluding the signal region: [−15.8,−13.6) & (−11.4,−9.3].

Figure 4 .
Figure 4. Distributions for the five neural network inputs are compared for both GD-1 stars (in red) and background stars (in grey) across signal and sideband regions.The patch shown here is the same example patch from Figure1.For both stream and background stars, the distributions for these five variables across the signal and sideband regions are approximately indistinguishable.

Figure 5 .
Figure 5.The full scope of stars identified by the CWoLa method in overlapping patches across the angular range corresponding to GD-1.Light gray dots indicate the ground truth labeling of GD-1 stars [51], while the top 250 stars identfied by CWoLa in each patch are indicated in colored dots.The colors are chosen to correlate with each star's α value.

1 − 5 Figure 8 .
Figure 8.The width of the CWoLa-identified stars is determined by first calculating the median stream position in φ 2 for 10 bins of φ 1 (the overlaid red line in (a)).The φ 2 coordinates are then shifted by these median values, yielding the histogram in (b).In (b), we use a 2-component Gaussian Mixture Model to show two individual Gaussian components with σ ≈ 0.3°(the core of GD-1) and σ ≈ 1.7°(the "cocoon").This appears to support the general trend observed in Figure6of[46].

Figure 9 .
Figure 9. Isolating the subset of the top stars identified by CWoLa with the 10% smallest 5dimensional Euclidean distances d to the nearest labeled star reveals 60 additional stellar candidates for GD-1 membership that may have been omitted from the GD-1 ground truth labeling.
When the signal region does not contain the stream, there is no obvious bump anywhere in proper motion space as the cut percentage gets larger.When the signal region contains the stream, an obvious bump forms in the same area where the stream is localized.

Figure 10 .
Figure 10.A demonstration of a scan for which the anomaly location is not previously known.

Figure 11 .
Figure 11.Distributions in position, velocity, and color space for a simulated patch as well as the simulated stream contained within it.While both background stars and simulated stream stars are both highly concentrated in velocity space, the stream stars' peak proper motions are located further from (µ * φ , µ λ ) = (0, 0) than those of the background stars.

Figure 12 .
Figure 12.Simulated streams are far more concentrated in angular velocity space than a typical stream in the Gaia dataset.As a result, the signal and sideband regions are defined within a much narrower band around the median stream µ λ .The signal region is defined within ±σ/4, or [−6.3, −3.8], while the sideband region is defined as ±σ/2, excluding the signal region: [−7.6, −6.3) & (−3.8, 2.6].The stream stars are almost exclusively contained within the signal region.