The Hierarchical Structure of Galactic Haloes: Differentiating Clusters from Stochastic Clumping with AstroLink

We present AstroLink, an efficient and versatile clustering algorithm designed to hierarchically classify astrophysically-relevant structures from both synthetic and observational data sets. We build upon CluSTAR-ND, a hierarchical galaxy/(sub)halo finder, so that AstroLink now generates a two-dimensional representation of the implicit clustering structure as well as ensuring that clusters are statistically distinct from the noisy density fluctuations implicit within the $n$-dimensional input data. This redesign replaces the three cluster extraction parameters from CluSTAR-ND with a single parameter, $S$ -- the lower statistical significance threshold of clusters, which can be automatically and reliably estimated via a dynamical model-fitting process. We demonstrate the robustness of this approach compared to AstroLink's predecessors by applying each algorithm to a suite of simulated galaxies defined over various feature spaces. We find that AstroLink delivers a more powerful clustering performance while being $\sim27\%$ faster and using less memory than CluSTAR-ND. With these improvements, AstroLink is ideally suited to extracting a meaningful set of hierarchical and arbitrarily-shaped astrophysical clusters from both synthetic and observational data sets -- lending itself as a great tool for morphological decomposition within the context of hierarchical structure formation.


INTRODUCTION
Correctly identifying spatial, kinematic, and chemical structures within simulated and observational data sets is crucial for advancing our understanding of the Universe and allows us to make predictions about structure formation and evolution.With an ever-increasing volume of data, it is essential to adopt data-mining approaches for astrophysical structure identification, as manual inspection methods cannot reliably be used to identify all relevant structure.
Data mining algorithms that find structure are referred to as clustering algorithms.Clustering algorithms, used for identifying astrophysical structures in observational data, often incorporate physical models to classify groups.These models commonly include constraints on orbital motion influenced by the parent structure's gravitational potential, as seen in algorithms such as StreamFinder (Malhan & Ibata 2018) and the xGC3 suite (Johnston et al. 1996;Mateu et al. 2011Mateu et al. , 2017)).They may also apply data projections or transformations to target specific structure types, e.g.StarGO (Yuan et al. 2018), HSS (Pearson et al. 2022), and Via Machinae (Shih et al. 2021).While these algorithms effectively reveal the structures they are designed to target, these constraints limit their ability to discover a broad range of structure types and their inter-relationships.
Clustering algorithms built for application to synthetic data -often called galaxy/(sub)halo finders -can be categorized into three types: configuration space finders e.g.SubFind (Springel et al. 2001), AHF ★ E-mail: william.oliver@iwr.uni-heidelberg.de(Knollmann & Knebe 2009), and CompaSO (Hadzhiyska et al. 2021), phase space finders e.g.6DFOF (Diemand et al. 2006), HSF (Maciejewski et al. 2009), ROCKSTAR (Behroozi et al. 2012), and VE-LOCIraptor (Elahi et al. 2019), and tracking finders e.g.SURV (Tormen et al. 2004;Giocoli et al. 2008) and HBT+ (Han et al. 2017).Configuration space finders primarily use 3D spatial particle positions to identify physical overdensities, particle velocities are also often incorporated to refine these self-bound haloes.Phase space finders use both 3D spatial positions and 3D velocities of particles to find clusters.While tracking finders use either configuration or phase space to construct haloes but are assisted by particle tracking in their determination of such groups at later times.
While simulation-specific astrophysical clustering algorithms have evolved over recent decades, many still rely on the Spherical Overdensity (SO; Press & Schechter 1974) or the Friends-Of-Friends (FOF; Davis et al. 1985) algorithms at their core.The SO algorithm constructs galaxies and (sub)haloes by locating spatial density peaks and expanding spherical volumes from them until a specified overdensity is reached.The FOF algorithm groups particles connected by distances less than the linking length (  ), often set at 0.2 mean (corresponding to overdensities with Δ ≈ 200), where  mean is the mean particle separation within the simulation box.These algorithms have established themselves as the primary means of defining galaxies and (sub)haloes, especially within synthetic data.This isn't inherently negative, as structures identified by these methods, followed by an unbinding process, represent true self-bound groups.Moreover, numerous comparative studies have shown strong agreement among modern galaxy/(sub)halo finders, including those built on the SO and FOF algorithms (Knebe et al. 2011(Knebe et al. , 2013;;Onions et al. 2012Onions et al. , 2013;;Elahi et al. 2013;Avila et al. 2014;Lee et al. 2014;Behroozi et al. 2015).However, the SO and FOF algorithms may overlook diffuse debris or anisotropic clusters and do not produce hierarchical clusterings.This often necessitates their iterative application in order to identify clusters from structure with diverse characteristics.Consequently, modern galaxy/(sub)halo finders may struggle to capture all relevant structures, though some are better able to identify more that just self-bound objects Elahi et al. (2013).
A versatile clustering algorithm is needed to overcome the limitations of both observation-specific and simulation-specific astrophysical clustering methods while preserving their advantages.Such an algorithm should be applicable to both observational and synthetic data sets of varying sizes and feature types, without imposing model constraints.It should also provide an adaptive measure of hierarchical clustering structure, accommodating the discovery of structures at different overdensity levels.Generalised algorithms such as OPTICS (Ankerst et al. 1999) and HDBSCAN (Campello et al. 2015;McInnes et al. 2017) have recently gained popularity in astrophysical structure discovery -e.g.Costado et al. (2016) (Aubert et al. 2004), share similarities to these algorithms -however, neither OPTICS nor HDBSCAN are designed with astrophysics specifically in mind -in fact, few generalised codes have been e.g.EnLink (Sharma & Johnston 2009), FOPTICS (Fuentes et al. 2017), and CluSTAR-ND (Oliver et al. 2022).
We develop a novel and almost entirely data driven astrophysical clustering algorithm AstroLink by improving upon the CluSTAR-ND algorithm.First, we briefly summarise the CluSTAR-ND algorithm in Sec. 2. We then describe the AstroLink algorithm throughout Sec. 3. In Sec. 4 we show the algorithm in practice and compare it to its predecessors, CluSTAR-ND and Halo-OPTICS (Oliver et al. 2021).Then finally, we make our conclusions and present our ideas for future work in Sec. 5.

AN OVERVIEW OF THE CLUSTAR-ND ALGORITHM
The Clustering Structure via Transformative Aggregation and Rejection in N-Dimensions algorithm (CluSTAR-ND; Oliver et al. 2022) generates a hierarchical clustering that represents galaxies and their substructures from input data of any size and number of features.This algorithm improves on Halo-OPTICS (Oliver et al. 2021) by enhancing its computational efficiency and extending its applicability to -dimensional data sets.Halo-OPTICS suitably adapts the OPTICS algorithm (Ankerst et al. 1999) for use with astrophysical data sets.Furthermore, OPTICS is itself a hierarchical extension of the DBSCAN algorithm (Ester et al. 1996), which is conceptually an extension of the FOF algorithm that substitutes the point-point linking scheme for the more robust neighbourhood-neighbourhood linking scheme.Due to this chain of algorithm succession, CluSTAR-ND exhibits elements from each of these algorithms in the following approach that: Optionally identifies FOF field haloes from the input data (otherwise treating the input data as a single FOF halo); (ii) Optionally transforms the -dimensional field halo data via a Principle Component Analysis (PCA) transformation; (iii) Estimates the local density field from the transformed data; (iv) Initiates groups with points at local maxima within this field; (v) Aggregates points in order of decreasing density to the group whose members belong to that point's neighbourhood; (vi) Merges groups when at least one of each of their members are part of a next-to-be-aggregated point's neighbourhood; (vii) Concurrently monitors whether a group, just prior to merging, meets cluster criteria; (viii) Removes outliers from clusters; (ix) And then optionally repeats steps (ii) -(ix) using the first level of the hierarchy as input data.
Step (i) allows users to ensure that root-level clusters in the hierarchy are galaxies/field haloes.CluSTAR-ND automatically optimizes and selects the neighbourhood linkage size,  link , and the overdensity factor,  threshold , based on the user's choice of  den as well as the dimensionality of the input data.
The parameter  reject is set at 0.9 following the analysis performed in Sec. 3.3 in Oliver et al. (2021).
After constructing the hierarchy, step (viii) eliminates outliers from each cluster, ensuring that all remaining points have a density greater than  cut ≔ min{  | lof(  ) <  outlier }.Here lof (  ) is defined in Eq. 8 of Oliver et al. (2022) and serves as a kernel density analogue of the local-outlier-factor formalised in Breunig et al. (1999).The parameter  outlier is optimally set to 2.5, allowing for a moderate level of outlier removal without compromising the algorithm's clustering capability.Finally, if adaptive = 2, step (ix) can be used to adjust the governing distance metric for each level of substructure in the hierarchy so that clusters are dense relative to their parent cluster.For more details on CluSTAR-ND, please refer to Oliver et al. (2022).

ASTROLINK: BUILDING UPON CLUSTAR-ND
CluSTAR-ND performs well in many scenarios, but its cluster extraction process lacks the adaptability required to produce meaningful results from data sets with high or fluctuating noise levels.In these cases, users would be forced to manually adjust CluSTAR-ND parameters, which is non-trivial due to the complex relationship between these parameters and the output.As such, we design AstroLink to remedy these drawbacks so that it can extract clusters with a single data-driven parameter,  -the lower threshold of statistical significance of clusters.This approach gives AstroLink the adaptability to accommodate different noise levels in the data while offering users an intuitive means of controlling the output.To maintain algorithmic transparency and for completeness, we now present each step of the AstroLink algorithm as shown in Fig. 1, some of which are the same as is presented in Oliver et al. (2022) regarding CluSTAR-ND.

Root-level Clusters
In CluSTAR-ND, the root-level clusters are found first using the FOF algorithm over the spatial positions -ensuring that root-level clusters of the output are physically defined as galaxies/haloes.We have eliminated this step from AstroLink so that it can be seamlessly applied to various types of astrophysical data, including those that do not necessitate halo identification based on spatial coordinates before identifying substructure.Consequently, the output of AstroLink can be interpreted as a natural and mathematical clustering of the input data.If the user still requires the identification of galaxies/haloes prior to finding substructure, they should first apply an implementation of the FOF algorithm before then providing these results to AstroLink.Ordinarily, the input data is labelled as the root-level cluster -making the hierarchy labelling of AstroLink compatible with CluSTAR-ND.

Transforming the Input Data
The first step of AstroLink is optional such that if adaptive = 1 (default), the input data () is re-scaled so that each component has unit variance.This is used to remove any unwanted global overdependencies (due to unit choice or otherwise) on a subset of 's features before a search for substructure is conducted.This functions the same as the PCA transformation in step (ii) of CluSTAR-ND, but is faster as no rotation is being made (the output is unaffected by this since the Euclidean distance is rotation-invariant).Hence by calculating Euclidean distances on the transformed data, AstroLink effectively calculates Mahalanobis distances (Mahalanobis 1936) on the input data -which guarantees that any resultant substructures will be dense with respect to the global shape of the input data -although this may be sub-optimal if the input data is strongly multi-modal in a subset of the input data's feature space.To skip this step, the user can set adaptive = 0, which may improve clustering results if the relative-scaling between dimensions is meaningful.Unlike in CluSTAR-ND, AstroLink does not provide the option to set adaptive = 2 and use a recursive PCA transformation (i.e.step (ix)) as this was not shown to consistently increase clustering power.

Estimating Local Density
After the optional transformation step, AstroLink estimates the local density of each point in a similar way to CluSTAR-ND.First the set of  den nearest neighbours,   den , are found for each data point -the default value of which is  den = 20.The local density is then estimated by using these sets in conjunction with a multivariate Find the ordered list () and the groups ( ≤ and  ≥ ) using the process detailed in Sec.3.4 and in Fig. 2.
Identify clusters as any  ≤ ∈  ≤ that is a ≥   noise outlier from the stochastic clumping in  as detailed in Sec.3.5.
Output the hierarchy of clusters.Epanechnikov kernel2 (Epanechnikov 1969) and a balloon estimator (Sain 2002) such that; where Here  is the dimensionality of the feature space, and (  ,   ) is the Euclidean distance between points  and  which, if adaptive = 1, corresponds to the Mahalanobis distance on the input data.Note that  () ≔ 0 for  > 1.
In CluSTAR-ND this estimate of the density was used directly, however in AstroLink its logarithm is taken and re-scaled it between 0 and 1 such that log ρ ≔ log(  / min ) log( max / min ) . (2) This transform renders all noisy fluctuations on the same scale regardless of their real density ( log ρ is approximately constant given fixed  den and ; Sharma & Johnston 2009) and it also allows us to see how large the affect of these fluctuations are compared to the range of densities within the data.For simplicity, we now refer to the set of the scaled (local) log-density values for all points as log ρ.In panel 1, the edges () have been computed and the empty sets (,  ≤ ,  ≥ ) have been initialised.In panel 2, the first edge has been processed -connecting two points which are added to an ordered list,  1 .In panel 3, more edges have been processed so that three ordered lists now exist.In panel 4, ordered lists  1 and  2 have been merged to create  4 -the groups specified by  1 and  2 are also noted and stored within  ≤ and  ≥ respectively.In panel 5, all edges have been processed and all points have been appended to ordered lists -however multiple ordered lists still exist.In panel 6, ordered lists  3 and  4 are merged to create  5 -as if there was an additional edge connecting the ordered lists (illustrated with a dashed blue line).Following this process, AstroLink has produced an ordered list,  (=  5 in this case), and a hierarchy of groups that embody the structure within the data set.

Improving the Aggregation Process
With a measure of local density computed, AstroLink now aggregates the data points together to form a hierarchy of groups.Conceptually (see Fig. 2 and the text below for specific algorithmic details), this process expands closed iso-density surfaces from high to low density.Whenever the iso-density value is lowered to that of a local density maxima, a new surface is born, and whenever a surface envelopes a new point, the point is appended to the end of an ordered list of the points located within that surface.We refer to the set of these ordered lists as .As the iso-density value is lowered further, these surfaces (and their ordered lists) will merge together two at a time -at which stage their start/end positions (as they appear within the newly-merged ordered list) are stored within  ≤ and  ≥ (for whichever list is smaller or larger respectively).Once this process has finished, all surfaces will have merged, one ordered list () will remain in , and a hierarchy of all density-based groups can be retrieved from it by acquiring their start/end positions from the sets  ≤ and  ≥ .
For AstroLink to be able to distinguish between clusters and noise in a way that is adaptable to varying levels of noise contamination, it can only identify groups as clusters after it has found and compared each group of points in order to judge them as either more noiselike or more cluster-like (refer to Sec. 3.5 for these details).This is unlike CluSTAR-ND, which identifies clusters during its aggregation process.Hence this change provides an opportunity to simplify and improve the effectiveness of the AstroLink aggregation process.

Connectivity during the Aggregation Process
Instead of actually constructing iso-density surfaces and checking which points become enveloped by them as the iso-density value is lowered, AstroLink relies upon the connections (referred to as edges from hereon) formed between nearest neighbours to decide which ordered list in  a point should be appended to.Oliver et al. (2022) found that by choosing  link = max{ceil(12.0−2.2 − 23.0 den −0.6 + 10.0), 7} as the number of nearest neighbours used for these edges, cluster resolution is maximised whilst algorithm run-time and the production of artificial disconnections during the aggregation process are minimised (refer to Sec. 6.1 of Oliver et al. (2022) for further details).Due to the similarities between AstroLink and CluSTAR-ND, the first step of the aggregation process is therefore to reduce the nearest neighbour lists from   den to   link .As such, the value of the  link parameter will be automatically chosen as above ( link = , default) unless the user specifies otherwise.

Aggregation
In addition to using the edges between nearest neighbours to decide which ordered list in  a point should be appended to, AstroLink also calculates a weight for each edge and uses these directly to decide the order in which each point is appended.The set of edges () is computed such that    = min{log ρ , log ρ  } where   ∈  and   ∈   link , , empty sets for ,  ≤ ,  ≥ are also initialised at this stage -as shown in panel 1 of Fig. 2. The edges are then processed in descending order such that for each edge,    , an action is taken according to whether certain conditions are satisfied by the two points at the edge's vertices,   and   , such that: If neither point belongs to a list in , create a new list with both points and add this to .This is equivalent the birth of a Notice that no action is taken if both points belong to the same list in .In fact, the edges for which an action is taken (highlighted in each panel of Fig. 2) are the same as those that belong to the minimum spanning forest (not tree, see below) that can be created using edge weights  ′   =  (   ) -where  is some strictly decreasing function, e.g. () = 1/.In other words (and since the edges are considered sequentially in decreasing order), the aggregation process resembles Kruskal's minimum spanning tree (MST) (Kruskal 1956) algorithm interwoven with additional steps for tracking the hierarchy of groups that exist within the input data.
However, it is not always possible to construct an MST from a set of edges unless they define a connected graph.As AstroLink uses ( link − 1) edges (see Sec. 3.4.1) to connect  points, the edges will not always define a connected graph.The consequence of this is that after each of the edges in  have been processed, multiple ordered lists may still remain in .In this scenario, we imagine that there now exists some additional edges that connect pairs of points within each in the remaining ordered lists, so that the ordered lists are merged together in order of descending size -this is illustrated in panel 6 of Fig. 2. In effect, action (iii) is repeated until a single ordered list exists in  -which from hereon we refer to as .

The Ordered-Density Plot
The ordered list, , can be used to construct the AstroLink analogue of the OPTICS reachability plot -the ordered-density plot.The ordered-density plot is a simple visualisation of the clustering structure as the complexity of the -dimensional feature space has been reduced to the simplicity of a 2-dimensional dendrogram.To construct the ordered-density plot, we need to compose the set of scaled log-densities with the ordered list which gives a function such that  () = log ρ , ∀ ∈ .Fig. 3 depicts the ordered-density plots for a series of input data sets consisting of two 1-dimensional standard normal distributions at various separations.
The ordered-density plot reveals the input data points linked by decreasing density and based on their shared neighbourhood connectivity.Peaks in the plot represent overdensities as the points within them are both denser than their surrounds and ordered consecutively due to being locally connectable to each other.Fig. 3 portrays two main peaks that become increasingly prominent as the separation between the distributions grows.Within these peaks are a series of smaller peaks that correspond to the stochastic clumping that arises whenever a data set contains spatial randomness -as is the case here due to the data sets being created via random sampling.As the distribution separation increases, these smaller peaks become far less prominent than the two main peaks which indicates that a set of meaningful clusters is identifiable using the ordered-density plot.

The Aggregation Tree
A hierarchy of groups, defined by the sets  ≤ and  ≥ , is also generated during the aggregation process -an example of which is shown in Fig. 4. For every group denoted in  ≤ there is exactly one (larger) group denoted in  ≥ , as these two groups are merged whenever a saddle point in the density field is found during the aggregation process -i.e.whenever action (iii) is performed.As such, the groups in  ≤ are smaller and are generally less prominently peaked within the ordered-density plot than their counterparts from  ≥ .Hence, the groups in  ≤ tend to be more noise-like than those in  ≥ .This makes any true clusters within  ≤ appear more distinct from the other groups in  ≤ than any true clusters within  ≥ do when compared to the other groups in  ≥ .
A hierarchy of clusters determined from  ≤ alone is structurally akin to that produced by the SUBFIND (Springel et al. 2001) and EnLink (Sharma & Johnston 2009) algorithms.While this is suitable for some clustering scenarios, as evident in Fig. 4, finding clusters from  ≤ alone can only capture one of the two standard normal distributions.Thus, in many scenarios, the hierarchy from  ≥ is also needed in order to identify all relevant clusters.

Identifying Clusters
AstroLink now identifies clusters by assessing the statistical distinctiveness of groups from the aggregation tree within the ordereddensity plot.By measuring the clusteredness of each group and fitting a model to the distribution of this measure for all groups in  ≤ , groups can then be classified as inliers (noise) or outliers (clusters) from this distribution.An optional hierarchy correction can then also be performed by incorporating the groups in  ≥ .

The Clusteredness of a Group
Intuitively, the measure of a group's clusteredness should consider its overdensity and also account for the magnitude of intra-group noise.Due to the logarithmic scaling of densities from Eq. 2, vertical differences on the ordered-density plot represent the logarithm of a ratio of densities.As such, we can calculate the logarithm of a group's overdensity factor by finding the difference between some characteristic log ρ value of that group and the log ρ value of its surrounds.The log ρ value at its surrounds is taken as the log ρ saddle-point that was used to merge the group into the aggregation tree -which is the maximum log ρ that can be found at the group's boundary.The group's characteristic log ρ value is taken to be maximum log ρ found within the group minus a measure of the intra-group noise.As such, we construct a statistical measure of clusteredness, which we call the prominence, such that for any group, ; P  = log ρmax, − log ρboundary, − log ρnoise,g . (3) Here log ρmax, is the maximum value of log ρ within , log ρboundary, is the value of log ρ at the saddle point that connects that  to another, and log ρnoise,g is a term that accounts for intra-group noise.The noise-correction term is calculated as the rootmean-square of the log ρmax,child − log ρboundary,child values for each of the direct children (excluding children of children) groups of .
As such, this measure may be viewed as the logarithm of a signal to noise ratio.The top panel of Fig. 5 shows a visualisation of the calculation of P  for the largest group from  ≤ as shown in Fig. 4.

A Model for Group Prominences
AstroLink now fits a descriptive model, , on a case-by-case basis to the prominences of the groups from  ≤3 by minimising the corresponding negative log-likelihood.For density estimates of Poisson noise computed with a fixed number of nearest neighbours, the probability distribution of  should be approximately Gaussian (Sharma & Johnston 2009).As such, the prominence (which is effectively an absolute difference of these estimates) should belong to a half-normal distribution.With real data however, some critical assumptions are violated; i.e. generally the nearest neighbours are not drawn from homogeneous Poisson noise; the density estimation and aggregation processes impose small-scale smoothing effects on the density field that suppress the production of groups with fewer points than   den and   link respectively; etc.As such, we construct  to be a combination of two probability distributions -one for noise and one for clusters -such that where  is a random variable representing prominence,  () is the Heaviside step function, and  ∈ [0, 1] is a parameter of the model.
For  noise , we have used the second-order correction estimate of the Akaike information criterion (AIC & AICc; Akaike 1974;Hurvich & Tsai 1989) and the Bayesian information criterion (BIC; Schwarz 1978) to assess the suitability of the Log-Normal, Inverse Gaussian, Gamma, Beta, Beta-Prime, Generalised Gamma, and Generalised Beta-Prime distributions when AstroLink is applied to a range of -point -dimensional uniform distributions on the unit hypercube (for various combinations of  and ).With both criteria, we found that using the Beta distribution for  noise is best (with parameters ,  ≥ 1 and typically  ≫ , i.e. uni-modal with a positively-skewed tail).We also found that using a Uniform distribution for  clusters provides enough shape flexibility so that  noise remains largely unaffected by the presence of outliers.This can be seen in the bottom panel of Fig. 5 as the presence of a  > 6 outlier does not detract from the fitting-quality of  noise .Using the model parameters (, , ),  noise ( ) and  clusters ( ) are also weighted so that  ( ) is normalised and continuous on the interval  ∈ [0, 1].

The Statistical Significance of Clusters
To label groups in  ≤ as clusters, AstroLink now simply identifies all such groups that have a statistical significance (S) that is greater than  noise .To be clear, any group ( ∈  ≤ ) will now be labelled as a cluster if its prominence (P  ) satisfies where  N (0,1) and   (,) are the standard normal and Beta cumulative distribution functions respectively.The parameters  and  are derived by fitting the model outlined in Sec.3.5.2 to the distribution of group prominences.With this, the group prominences are transformed so that the group significances follow a standard normal with a long positive tail of outliers containing the significances of increasingly more clustered groups.The  parameter is therefore a measure of how clustered -relative to the noise present within the data -the resultant clusters are.The value of  may be chosen by the user but the default value is  = , in which case AstroLink calculates it according to the model parameter, , i.e.  =  N (0,1) −1   (,) () .Since  marks the prominence value for which the groups transition from noise to clusters, it can therefore be used to automatically estimate an appropriate value for .

Correcting the Hierarchy
To ensure all relevant clusters are included in the final hierarchy, AstroLink can optionally label the groups from  ≥ as clusters as well as those from  ≤ .If h_style = 1 (default), then for each cluster already found, it's corresponding sibling group in  ≥ is considered if it also satisfies Eq. 4. Then for each of these cluster-like sibling groups in  ≥ , AstroLink labels it a cluster if it is the smallest such group in  ≥ that shares its starting position within the ordered list, .The latter condition simplifies and reduces the depth of the resultant hierarchy by eliminating clusters-within-clusters that only differ by a small number of points.With these additional clusters the resultant hierarchy is similarly styled to CluSTAR-ND, however now each cluster is a more meaningful and statistically interpretable group.Fig. 6 now depicts the final hierarchy of clusters extracted from the same toy data set shown in Figs. 3 -5.This step can be ignored by setting h_style = 0.

PERFORMANCE ANALYSIS
We now assess the performance of AstroLink by applying it to a series of synthetic galaxies each with a set of ground-truth labels.We analyse the time and space complexity of AstroLink as well as it's ability to produce a meaningful set of astrophysical clusters in comparison to its algorithmic predecessors -CluSTAR-ND and Halo-OPTICS.Lastly, we show some of these outputs visually.

Synthetic Data
To assess AstroLink's performance we use Galaxia (Sharma et al. 2011) to produce random samples of the 11 ΛCDM stellar haloes from Bullock & Johnston (2005) and the complementary 6 artificial stellar haloes from Johnston et al. (2008).Each of the original ΛCDM stellar haloes are simulated using a hybrid semi-analytical and hydrodynamic -body approach.Satellites are first modelled as -body dark matter systems within a parent galaxy whose disk, bulge, and halo are defined by time dependent semi-analytic functions.The simulation uses a ΛCDM cosmology with parameters Ω  = 0.3, Ω Λ = 0.7, Ω  ℎ 2 = 0.024, ℎ = 0.7, and  8 = 0.9.Semi-analytical models then assign stellar populations to each satellite while a chemical enrichment model (Robertson et al. 2005;Font et al. 2006) calculates age-appropriate metallicities for the star particles.This process yields satellites whose structural properties are in agreement with the Local Group's dwarf galaxies.With this regime each satellite has three main model parameters; the time since accretion ( acc ), the luminosity ( sat ), and the orbital circularity ( = / circ ).The distribution of these parameters specify the accretion history of a halo and as such a further 6 artificial haloes were created in Johnston et al. (2008) for the purpose of studying the effects of different accretion histories on the properties of haloes.These 6 artificial haloes are characterised by accretion histories that are predominantly radial ( < 0.2); circular ( > 0.7); old ( acc > 11 Gyr); young ( acc < 8 Gyr); high luminosity ( sat > 10 7  ⊙ ); and low luminosity ( sat < 10 7  ⊙ ).
For these galaxies, Galaxia provides spatial (x = (, , )), kinematic (v = (  ,   ,   )), and chemical (m = ([Fe/H],[/Fe])) quantities for each particle, along with ground-truth labels indicating the particle's satellite of origin and also whether each satellite is selfbound.More details on these simulations can be found in Sec.3.4 of Sharma et al. (2011) and references therein.With this information, we apply AstroLink to the various galaxies and assess its performance.

Time Complexity
We now compare the run-time of AstroLink to that of CluSTAR-ND and Halo-OPTICS.Fig. 7 shows the run-times of the three algorithms when applied to the spatial positions of the synthetic galaxies from Galaxia.All runs were performed using a single core of a 13 th Gen Intel i7-13700HX processor with a clock-speed of 5.00 GHz and all codes are written using Python3 -each making use of optimised numerical packages such as NumPy (Harris et al. 2020), SciPy (Virtanen et al. 2020), Scikit-learn (Pedregosa et al. 2011), and Numba (AstroLink & CluSTAR-ND only;Lam et al. 2015) to differing degrees.
It is clear from Fig. 7 that AstroLink outperforms its predecessors in terms of running time.The time complexities of AstroLink and CluSTAR-ND are O (log()) while Halo-OPTICS exhibits superquadratic time complexity at O ( 2.17 ).For Halo-OPTICS, this results from a combination of it performing a radial search about every data point and that the cuspyness of the haloes increase with their size -this is discussed further in Oliver et al. (2022).This does not impact AstroLink or CluSTAR-ND, as they only perform a  den nearest neighbour search for each data point.While this  den nearest neighbour search still contributes the largest portion of their total run-time.Comparatively, AstroLink is approximately 1.27 times faster than CluSTAR-ND and between 2400-9000 times faster than Halo-OPTICS over these data sets.
These run-times can be further improved by using multiple cores in parallel.For AstroLink, all steps are parallelised over shared memory -except the part of the aggregation process resembling Kruskal's MST.However, the majority of the computation in this step can also be parallelised (Bader & Cong 2006) by combining Prim's MST algorithm (Jarník 1930;Prim 1957;Dĳkstra 2022) and Boruvska's MST algorithm (Borůvka 1926;Choquet 1938;Sollin 1965) -an implementation we leave as future work.and Halo-OPTICS when applied to the spatial positions (x) of particles within the synthetic galaxies from Galaxia.The curves of best fit correspond to the time complexities for the algorithms.AstroLink and CluSTAR-ND both exhibit a run-time improvement of many orders of magnitude over Halo-OPTICS, and even though their time complexities are the same, AstroLink's smaller constant factor means that (asymptotically) it is an additional ∼ 3 times faster than CluSTAR-ND.Furthermore, since AstroLink may be run in parallel for a larger portion of its computation, it is the most-suitable of the three algorithms for applications to large data sets.

Space Complexity
In terms of memory footprint, each algorithm has  () space complexity, although with differing constant factors.CluSTAR-ND has the largest of these factors since it stores lists of indices for each cluster, while AstroLink and Halo-OPTICS compute an ordered list and then only store the start and end positions of each cluster within this list.Conversely, Halo-OPTICS has the smallest constant factor, as it only stores indices of the radial nearest neighbours for one point at a time -whereas AstroLink and CluSTAR-ND keep copies of the  den nearest neighbours of every point.Consequently, AstroLink boasts the median constant factor and thus it presents a trade-off between time and space complexity while improving interpretability and reliability.

Clustering Power
We now assess the capability of AstroLink to retrieve a set of astrophysical clusters in comparison to CluSTAR-ND and Halo-OPTICS.We first define an information-based statistic that represents the portion of meaningful classification that has been made by the algorithms, we then run each algorithm over the synthetic galaxies, before finally comparing their outputs.

Measuring Clustering Power
We use the normalised random-adjusted mutual-information presented (NRAMI) in Sec.5.1.1 of Oliver et al. (2022) to assess the goodness-of-fit between a clustering (  ) produced by As-troLink/CluSTAR-ND/Halo-OPTICS and the ground-truth labels (  ) provided by Galaxia for a given galaxy .To prepare the hierarchy of clusters from these algorithms for comparison to the flat ground-truth labels, we first assign a unique label to every point of the input data that corresponds to the smallest cluster that it is predicted to be a part of -we refer to this flattened set of labels as  *  .The knowledge gained about a galaxy's satellites from a clustering algorithm can then be quantified by the mutual information (Shannon 1948)  . (5) Here  (  ) is the entropy of   and  ( *  |  ) is the conditional entropy of  *  given   .The portion of relevant information learnt can then be calculated by normalising this to ensure that the expected knowledge gained about   from a random re-clustering of  *  (i.e. a set of randomly assigned labels with equal sizes as those in  *  ) is represented by a value of 0, and that similarly, the knowledge gained from a perfect clustering ( *  =   ) is represented by a value of 1.Hence, if M represents some feature space combination upon which the predicted clusters are dependent, the measure we use to assess an algorithm's clustering power is defined as where  [ ( *  (M);   )] is the expected mutual information between a random re-clustering,  *  (M), and   .Refer to Sec. 5.1.1 of Oliver et al. (2022) for more details on this measure.

Comparison
We now compare each algorithm's clustering power by applying them to each synthetic galaxy over various combinations of the available feature space.We select M from the combinations of the spatial (x), kinematic (v), and chemical (m) feature spaces, i.e. (x), (v), (x, m), (v, m), (x, v), and (x, v, m).We apply CluSTAR-ND and Halo-OPTICS using their default settings (refer to Oliver et al. 2022), but since their cluster extraction parameters have been optimised for these galaxies we alter the significance threshold parameter, , of AstroLink in this comparison.
For values of  in the range of ∼ 3 to ∼ 6, the NRAMI between the AstroLink output and the ground-truth hovers around values that are consistent with the CluSTAR-ND output.By default ( = ), the data-driven value of  falls within this range for all clustering scenarios -and so naively one can expect to achieve a roughly equal clustering power when using AstroLink's default settings.However since the output of AstroLink is more easily interpretable and the significance parameter is intuitive to adjust, we set the  parameter on a case-by-case basis in order to produce the best possible value of the NRAMI  (M).This allows us to see what AstroLink is realistically capable of achieving in a practical use-case/data-exploration scenario.
Fig. 8 depicts the clustering performance of each algorithm when applied to each galaxy and feature space combination as well as the difference in clustering power for each algorithm.Here AstroLink is shown to generally outperform the other algorithms, and even when it does not, the difference is most-often negligible.It is also important to note that AstroLink is effectively being penalised by the NRAMI measure -since it tends to produce a shallower hierarchy than the other two codes.With fewer levels to the hierarchy there are fewer groupings at high densities which improves the interpretability of the AstroLink output but in this case imposes a penalty to AstroLink in this comparison.Within this figure, it can also be noted that if the feature space of the input data has a larger number of more informative dimensions then the clustering power will generally be improved.Additionally, the type of clusters within the galaxy has a large effect on the clustering power -young or high luminosity satellites are well-matched whereas those that are old or that are orbiting on predominantly radial trajectories are more difficult to recover.

Visualising Clustering Structure
We now also visually present the AstroLink output in order to further demonstrate its performance and usability.The ordered-density plot produced by AstroLink holds the information attainable from the estimated density field4 , so in Fig. 9 we illustrate how this information varies for the predominantly young artificial galaxy from Johnston et al. (2008) when using different feature spaces to find the clustering of the data.Here it can be seen that the complexity of the clustering structure increases with the number of informative dimensions and hence visualising the ordered-density plot provides valuable insight to the user.Furthermore, the ordered-density plot can be used as a guide to the user when fine-tuning the significance threshold () -as unclassified structure can be classified by lowering  and, vice versa, classified noise can be declassified by raising .
Fig. 10 depicts the spatial distributions of those clusters shown coloured in the corresponding ordered density plots of Fig. 9.Each panel gives an indication as to which types of clusters AstroLink can retrieve given that the input data is expressed through certain feature space combinations.We see here that spatially compact clusters are best described by spatial features, stream-like clusters are best described by kinematic and chemical features, and combining these feature sets tends to give good results for various cluster types -permitting their retrieval simultaneously.Notably, the addition of chemical abundances to the input data's feature space gives AstroLink more knowledge about otherwise phase-mixed clusters.These observations agree with theoretical and intuitive predictions about how one would expect a generalised astrophysical clustering algorithm to perform.

CONCLUSIONS
We have presented AstroLink, a generalised astrophysical clustering algorithm that produces a set of arbitrarily-shaped hierarchical clusters from an arbitrarily-shaped point-based data set such that the clusters found are statistical outliers from noisy density fluctuations.The algorithm is unsupervised, requires little to no input from the user (other than the trivial requirement to supply the data), and has an intuitive set of hyperparameters that can be quickly and intuitively adjusted to ensure that the resulting clusters meet the user's needs.
We have also demonstrated the robustness of AstroLink's performance in comparison to its predecessors, CluSTAR-ND and Halo-OPTICS.We find that AstroLink classifies satellites from simulated galaxies with greater accuracy while requiring less run-time and memory allocation.Similarly to Halo-OPTICS, it also produces a visual representation of the input data's implicit clustering structure.However unlike Halo-OPTICS, the ordered-density plot is determinable and more easily interpretable -owing to the fact that absolute vertical differences correspond directly to overdensity factors.Clusters are represented with a colour cycle of nine colours (also depicted in Fig. 10) and have been identified with AstroLink's default settings (unlike in Fig. 8,  is now estimated from the data) .
While there are still ways to improve AstroLink's output (e.g.defining a locally-adaptive metric before estimating density, propagating data point uncertainties into cluster uncertainties, using an supplementary model-based approach that attributes additional data points to the AstroLink clusters, etc.).AstroLink is well-suited to the clustering problem of finding astrophysically relevant groups from large-scale data sets in both simulated and observational settings.

Figure 1 .
Figure 1.A high-level activity chart for the AstroLink algorithm.The nodes in the middle of the chart each correspond to specific subsections within Sec. 3 -as indicated in the text of each node.

Figure 2 .
Figure 2.An illustration of the AstroLink aggregation process (using  den = 10 &  link = 5) with new stages depicted in each panel.Here; a red-blue gradient indicates high-low log ρ values/edge weightings; unused points/edges are transparent; dotted contour lines represent the growing isodensity surfaces (as in the analogy from Sec. 3.4); and similarly, solid coloured lines represent those that envelope groups stored in  ≤ and  ≥ (these groups are also shown in the growing ordered-density plots within the inset panels).In panel 1, the edges () have been computed and the empty sets (,  ≤ ,  ≥ ) have been initialised.In panel 2, the first edge has been processed -connecting two points which are added to an ordered list,  1 .In panel 3, more edges have been processed so that three ordered lists now exist.In panel 4, ordered lists  1 and  2 have been merged to create  4 -the groups specified by  1 and  2 are also noted and stored within  ≤ and  ≥ respectively.In panel 5, all edges have been processed and all points have been appended to ordered lists -however multiple ordered lists still exist.In panel 6, ordered lists  3 and  4 are merged to create  5 -as if there was an additional edge connecting the ordered lists (illustrated with a dashed blue line).Following this process, AstroLink has produced an ordered list,  (=  5 in this case), and a hierarchy of groups that embody the structure within the data set.

Figure 3 .
Figure3.Examples of ordered-density plots (bottom) generated by As-troLink (using default settings) for three 1-dimensional data sets that have been randomly sampled from two standard normal distributions with varying separations between them (top).We can see that as the separation grows, it becomes easier to distinguish the two distributions as individual clusters.

Figure 4 .
Figure 4.The AstroLink aggregation tree and its relation to the ordereddensity plot for the 1-dimensional data set with the largest separation from Fig. 3.The markings denote the groups,  ≤ (blue) and  ≥ (semi-transparent yellow-green), found during the aggregation process.

Figure 5 .
Figure 5.An illustration of the prominence calculation (top) for the largest group in  ≤ and the resulting prominence distribution (bottom) for all such groups (groups are shown in Fig. 4).The largest group is objectively more clustered, and hence is a clear outlier ( > 6), from the others.The solid red line indicates the model for group prominences (detailed in Sec.3.5.2) and the solid blue line, positioned at  = , is labelled with the corresponding data-driven estimate of  (detailed in Sec.3.5.3).

Figure 6 .
Figure6.Clusters identified from the toy example of Fig.3using AstroLink with default settings.Shown here is the 1-dimensional probability distribution alongside the input data that was sampled from it (top) and the ordered-density plot, the final hierarchy tree, and the clusters' significances (bottom).This data is coloured according to the cluster that it has been allocated to by AstroLink.AstroLink has differentiated the two Gaussian distributions from the noisy density fluctuations within them with a statistical significance of 7.24 noise and 6.28 noise respectively.

Figure 7 .
Figure7.The run-times and time-complexities of AstroLink, CluSTAR-ND, and Halo-OPTICS when applied to the spatial positions (x) of particles within the synthetic galaxies from Galaxia.The curves of best fit correspond to the time complexities for the algorithms.AstroLink and CluSTAR-ND both exhibit a run-time improvement of many orders of magnitude over Halo-OPTICS, and even though their time complexities are the same, AstroLink's smaller constant factor means that (asymptotically) it is an additional ∼ 3 times faster than CluSTAR-ND.Furthermore, since AstroLink may be run in parallel for a larger portion of its computation, it is the most-suitable of the three algorithms for applications to large data sets.

Figure 8 .
Figure 8.The NRAMI between for the AstroLink, CluSTAR-ND, and Halo-OPTICS algorithms when applied to each synthetic galaxy and feature space combination (top) and the differences in the NRAMI between the AstroLink algorithm and its predecessors for each of these clustering scenarios (bottom).The NRAMI of a hypothetical set of clusterings that exactly classify the selfbound satellites within each MW type galaxy are also shown in the top panel as a reference.AstroLink outperforms CluSTAR-ND in ∼ 75% of comparisons and Halo-OPTICS in ∼ 50% of comparisons.

Figure 9 .
Figure 9. Ordered density plots produced by AstroLink when applied to the artificial galaxy, newhalo_young, over various feature space combinations.Clusters are represented with a colour cycle of nine colours (also depicted in Fig.10) and have been identified with AstroLink's default settings (unlike in Fig.8,  is now estimated from the data)

Figure 10 .
Figure 10.Illustrations of the spatial positions of clusters found by applying AstroLink to the artificial galaxy, newhalo_young, over various feature spaces.Clusters found by AstroLink are shown with the same colours as in the ordered-density plots of Fig. 9.
While step (ii) optionally removes any unwanted global (intra-galaxy/halo) over-dependencies on a subset of features before searching for substructure -i.e. if adaptive ≥ 1, a PCA transformation is applied and the components are re-scaled with unit variance per component.Step (iii) uses each point's  den nearest neighbours to estimate the local density -reducing the complexity of an -dimensional data set to the simplicity of a scalar value.