Detection of Open Cluster Members Inside and Beyond Tidal Radius by Machine Learning Methods Based on Gaia DR3

In our previous work, we introduced a method that combines two unsupervised algorithms, DBSCAN and GMM. We applied this method to 12 open clusters based on Gaia EDR3 data and demonstrated its effectiveness in identifying reliable cluster members within the tidal radius. However, studying cluster morphology requires a method capable of detecting members both inside and outside the tidal radius. By incorporating a supervised algorithm into our approach, we successfully identified members beyond the tidal radius. In the current work, we first applied DBSCAN and GMM to identify reliable cluster members. We then trained the Random Forest algorithm on the data selected by DBSCAN and GMM. With the trained Random Forest, we can identify cluster members outside the tidal radius and observe cluster morphology across a wide field of view. We applied our method to 15 open clusters based on Gaia DR3, which span a wide range of metallicities, distances, numbers of members, and ages. Additionally, we calculated the tidal radius of each of the 15 clusters using the King profile and detected stars both inside and outside this radius. Finally, we investigated mass segregation and the luminosity distribution within the clusters. Overall, our approach significantly improved the estimation of the tidal radius and the detection of mass segregation compared to previous work. We found that in Collinder 463, low-mass stars do not segregate in comparison to high-mass and middle-mass stars. We also detected a luminosity peak in the clusters, in some cases located far from the center, beyond the tidal radius.


INTRODUCTION
According to accepted theories, stars are born within a single molecular cloud as a cluster. As a result, cluster members share the same physical parameters and chemical elements. Additionally, there exists an interaction between the Galaxy and clusters that affects cluster formation and morphology. To gain a comprehensive understanding of a cluster, including aspects such as the initial and present mass function, cluster morphology, planet formation theories, tidal tails, and interactions between galaxies and clusters, we must identify not only members within the tidal radius but also those outside it, such as cluster escape members (Freeman & Bland-Hawthorn (2002), Friel (1995), Hoogerwerf & Aguilar (1999), de Bruijne (1999)). Several theories have been proposed to describe the birth and formation of stars within clusters, such as the hierarchical theory (Kruijssen (2012)) or the centered formation theory (Lada & Lada (2003)). By studying the morphology of clusters in a wide field of view, we can determine which theory is more accurate (Hu et al. (2023)). Meanwhile, reliable cluster membership allows for the determination of the mass distribution of stars and the fraction of binary star systems within the cluster. This information can then be compared with simulation methods, such as N-body simulations (Almeida et al. (2023), Maíz Apellániz (2019), Olivares et al. (2019), Pauwels et al. (2023)).
★ E-mail: mnoormohammadi@aut.ac.ir † E-mail: khakian@aut.ac.ir ‡ E-mail: atefeh@ipm.ir
Extended stellar coronae and tidal tails play an important role in the study of cluster formation, evolution, and interactions between galaxies and clusters. To study them, we need to observe clusters across a wide field of view, covering distances of up to hundreds of arcminutes (Bhattacharya et al. (2022), Meingast et al. (2021), Tarricq et al. (2022), Jerabkova et al. (2021), Tang et al. (2019), Ye et al. (2021), Hu et al. (2023), Lodieu et al. (2019), Bhattacharya et al. (2021)). The first and most crucial step in the study of star clusters is to identify reliable members. To achieve this goal, we require accurate and comprehensive data, along with methods that can work with these data and yield robust results. Membership determination within a star cluster relies on two primary kinds of information: astrometric and photometric parameters (Kos et al. (2018), Kraus & Hillenbrand (2007), González-Díaz et al. (2019), Krone-Martins et al. (2010)).
Because stars within clusters originate from a common interstellar cloud, they share similar astrometric characteristics such as position, parallax, and proper motion. Additionally, these stars show a clear main sequence and, in the case of an old cluster, a red giant branch.
In the current century, machine learning has become one of the popular and powerful methods for identifying relevant patterns within large datasets. To achieve high accuracy, machine learning algorithms require data with high precision. The Gaia data releases contain information about billions of stars in our Galaxy, with high-accuracy astrometric and photometric parameters. Many studies have used machine learning methods based on the Gaia data releases to identify members of star clusters, some of which are mentioned here: Cantat-Gaudin et al. (2018) used K-means and UPMASK; Gao (2019), Gao (2018b), Gao (2018c), and Gao (2018a) used GMM and Random Forest; Wilkinson et al. (2018) and Bhattacharya et al. (2017) used DBSCAN; Agarwal et al. (2021) used KNN-GMM; and Noormohammadi et al. (2023) used DBSCAN and GMM. All these works filtered the data in some way.
In our previous work (Noormohammadi et al. (2023)), we identified reliable cluster members using a combination of two unsupervised machine learning algorithms: DBSCAN and GMM. The process involved three steps. First, the data were filtered based on astrometric and photometric conditions. Next, DBSCAN identified reliable candidates using proper motion and parallax information. Finally, GMM detected reliable members from the candidates based on their position, parallax, and proper motion. We compared our method with other machine learning methods using Gaia DR2, because those methods had been applied to Gaia DR2 (Cantat-Gaudin et al. (2018), Gao (2018c), Agarwal et al. (2021)). We showed that our method detected cluster members better than the other methods in the cluster-dense region. Some of the members detected by DBSCAN were assigned a low membership probability by GMM because they lay outside the cluster-dense region. Nevertheless, some of these outer members lie within the range of proper motion, parallax, and CMD of the members to which GMM assigned a high probability. To identify these members, some of which could be considered escape members, we introduce a method that combines three algorithms: DBSCAN, GMM, and Random Forest. This method can find members within a large field of view around a cluster and detect not only the cluster members but also the cluster escape members, thus presenting a better view of the cluster morphology.
In this work, we applied our method to 15 open clusters: nine of them were studied in our previous work (with Gaia EDR3), and six of them are new.
In Section 2, the data conditions for the 15 open clusters are explained. In Section 3, the method is explained with a focus on the new step. The results of each step are shown in Section 4. In Section 5, we discuss our results by determining the tidal radius, studying mass segregation, and analyzing cluster luminosity. Finally, in Section 6, we summarize our work.

DATA
In 2013, Gaia was launched to provide comprehensive information about stars in the Milky Way. The first release of Gaia data (Gaia DR1) contained around 1.14 billion sources, with more than 2 million having full astrometric parameters. The second release (Gaia DR2) included around 1.62 billion stars, with more than 1.33 billion having full astrometric parameters. In 2020, Gaia published the latest edition of data, which encompassed around 1.8 billion stars, with more than 1.46 billion having full astrometric parameters (https://www.cosmos.esa.int/web/gaia/dr3). The accuracy of the astrometric and photometric parameters in Gaia Data Release 3 is shown in Table 1. As shown in Table 1, stars brighter than 20 mag have astrometric uncertainties below 0.5 (mas for parallax, mas yr$^{-1}$ for proper motion). However, from 20 mag to 21 mag, the uncertainties grow to more than 1.00 in the same units.
The latest Gaia data release (Gaia DR3) is used in this work. A radius of 150 arcmin around each of these clusters contains the member stars and a high fraction of the escape members, making it a suitable value for the search radius.

METHOD
In this work, three machine-learning algorithms are used to identify star cluster members and stars that are outside the tidal radius. In the previous work (Noormohammadi et al. (2023)), a machine learning method was presented to identify reliable members of 12 open clusters based on Gaia EDR3. In this work, we extended that method by adding one supervised algorithm in order to detect members beyond the cluster-dense region. The new method consists of three steps, each of which is described in the following.
Having observed some indications of proper motion and the CMD of the cluster, the detection data are sent to the next step. In this work, DBSCAN performed well with large data sizes in three dimensions.
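As a minimal sketch of the DBSCAN step, the snippet below groups stars in the three-dimensional space of proper motion and parallax and keeps the most populated group as member candidates. The function name `dbscan_candidates`, the `eps` and `min_pts` values, and the synthetic data are illustrative assumptions and are not the settings used in this work.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def dbscan_candidates(pmra, pmdec, parallax, eps=0.5, min_pts=20):
    """Return a boolean mask selecting the most populated DBSCAN group.

    eps and min_pts play the role of Eps and MinPts; the defaults here
    are placeholders chosen only for the synthetic example below.
    """
    X = np.column_stack([pmra, pmdec, parallax])
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
    clustered = labels[labels >= 0]          # label -1 marks noise
    if clustered.size == 0:
        return np.zeros(labels.shape, dtype=bool)
    largest = np.bincount(clustered).argmax()
    return labels == largest


# Synthetic example (assumed values): a broad field population plus a
# compact group in proper motion and parallax.
rng = np.random.default_rng(0)
field = rng.normal([0.0, 0.0, 1.0], [5.0, 5.0, 1.0], size=(2000, 3))
cluster = rng.normal([3.0, -2.0, 2.0], [0.1, 0.1, 0.05], size=(200, 3))
data = np.vstack([field, cluster])
mask = dbscan_candidates(data[:, 0], data[:, 1], data[:, 2])
```

In practice, `eps` and `min_pts` would be tuned per cluster, which is the role that MinPts and Eps play in the iteration between DBSCAN and GMM described in the text.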

GMM (Gaussian mixture models)
The output of DBSCAN is used as input for the GMM (Gaussian Mixture Model); in this way, DBSCAN prepares the data so that they satisfy the conditions of the GMM algorithm. The GMM algorithm can detect data that follow the same Gaussian distribution if three conditions are satisfied: 1) the data are accurate, 2) the signal-to-noise ratio is significant, and 3) the cluster stands out as a structure among the field stars. Because of these conditions, some works eliminate huge volumes of data by filtering on conditions such as astrometric parameters. In this work, however, we achieve this by using DBSCAN in the first stage.
In the next stage, the GMM algorithm was applied to five parameters: position in RA and DEC, proper motion in RA and DEC, and parallax. Before applying the algorithm to the clusters, all data were normalized using the scale function from the scikit-learn library (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html). At the final stage, we analyzed the members detected by GMM based on proper motion and the CMD. If the selected data were without contamination (such as field stars), we returned to the DBSCAN step, increased the values of MinPts and Eps, and then applied GMM again. We must be cautious, as continuing this process may still result in contaminated data. The threshold represents the appropriate values of MinPts and Eps in DBSCAN, which detect the maximum number of reliable cluster members while optimally eliminating field stars. Since the GMM algorithm was applied to the position parameters (RA, DEC), some of the outer members were eliminated automatically. Some of these eliminated data lie in the range of proper motion and parallax of the cluster members and are also consistent with the CMD of the cluster members.
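The GMM stage can be sketched as follows: the five parameters are normalized with `sklearn.preprocessing.scale`, a two-component mixture is fitted, and the posterior probability of the cluster component is read off for each star. The helper name `gmm_membership`, the rule of treating the most compact component as the cluster, and the synthetic data are assumptions made for illustration only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import scale


def gmm_membership(features, n_components=2, random_state=0):
    """Membership probability from a Gaussian mixture on scaled features.

    features: array of shape (n_stars, 5) holding RA, DEC, pmRA, pmDEC, parallax.
    Assumption: the cluster is the component with the smallest covariance
    determinant, i.e. the most compact one.
    """
    X = scale(features)
    gmm = GaussianMixture(n_components=n_components,
                          random_state=random_state).fit(X)
    proba = gmm.predict_proba(X)
    compact = int(np.argmin([np.linalg.det(c) for c in gmm.covariances_]))
    return proba[:, compact]


# Synthetic example (assumed values): 600 field stars and 300 cluster stars.
rng = np.random.default_rng(1)
field = rng.normal([100.0, 20.0, 0.0, 0.0, 1.0],
                   [1.0, 1.0, 2.0, 2.0, 0.5], size=(600, 5))
cluster = rng.normal([100.0, 20.0, 4.0, -3.0, 2.5],
                     [0.1, 0.1, 0.1, 0.1, 0.03], size=(300, 5))
prob = gmm_membership(np.vstack([field, cluster]))
members = prob > 0.8   # the probability cut used for reliable members
```

Because RA and DEC enter the mixture, stars far from the dense core tend to receive low probabilities even when their proper motion and parallax match, which is the behaviour the Random Forest step is meant to compensate for.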

Random Forest
In this work, after reliable cluster members were found by DBSCAN and GMM, the Random Forest algorithm was used to detect outer members that lie in the range of proper motion and parallax of the cluster members and match their CMD. Random Forest can analyze astrometric and photometric parameters and does not require normalized data. At this stage, we can identify data points that may correspond to escaping members of the cluster. These data points typically reside in the outer layer of the cluster. Additionally, this step provides us with the optimal field of view for observing the cluster. This field of view reveals the morphology of the cluster both inside and outside the tidal radius. In the next step, the data were divided into three samples:
1. Data that were not detected by DBSCAN were considered as field stars. To obtain suitable data for training the Random Forest algorithm, we filtered the field stars based on the range of parallax values among the cluster members. This range was determined from the members detected by GMM with a probability higher than 0.8. The selected parallax range extends above the maximum and below the minimum parallax value among the cluster members, except for Alessi 01, which has few members. The details of the parallax range and the number of field stars used as training data are shown in Table 2.
2. The stars detected by DBSCAN but assigned a probability lower than 0.8 by the GMM were considered as suspicious stars.
3. The stars that were detected by the GMM algorithm with a probability higher than 0.8 were considered as cluster members.
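The three samples above can be expressed as boolean masks. The sketch below is one possible reading of that split; the function name `split_samples` and the `margin` parameter, which widens the parallax window slightly beyond the members' minimum and maximum, are hypothetical and introduced only for illustration.

```python
import numpy as np


def split_samples(dbscan_mask, gmm_prob, parallax, threshold=0.8, margin=0.0):
    """Split stars into members, suspicious stars, and training field stars.

    dbscan_mask: True for stars selected by DBSCAN.
    gmm_prob:    GMM membership probability (0 for stars not passed to GMM).
    margin:      assumed widening of the parallax window, in the same
                 units as parallax.
    """
    members = dbscan_mask & (gmm_prob > threshold)
    suspicious = dbscan_mask & ~members
    lo = parallax[members].min() - margin
    hi = parallax[members].max() + margin
    # Field stars: not selected by DBSCAN, but inside the parallax window.
    field = ~dbscan_mask & (parallax >= lo) & (parallax <= hi)
    return members, suspicious, field, (lo, hi)


# Tiny worked example with six stars (values are made up).
dbscan_mask = np.array([True, True, True, False, False, False])
gmm_prob = np.array([0.95, 0.90, 0.40, 0.0, 0.0, 0.0])
parallax = np.array([2.0, 2.2, 2.1, 2.1, 5.0, 1.95])
members, suspicious, field, window = split_samples(
    dbscan_mask, gmm_prob, parallax, margin=0.1)
```

Here the first two stars become members, the third is suspicious, and only the field stars whose parallax falls inside the widened window are kept for training.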
In step three, the Random Forest algorithm was trained using the field stars and the cluster members. We performed a train-test split using the train_test_split function from the sklearn.model_selection module (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), with a 30 percent split (10 percent for Alessi 01). Additionally, to investigate the best values of the Random Forest parameters, we calculated the F1 score, which is shown in Table 2. We also analyzed the confusion matrix for each cluster, as depicted in Fig 6.
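A minimal sketch of this training and evaluation step is given below. It assumes that the 30 percent split refers to the held-out test fraction and adds a stratified split, which the text does not specify; the function name `train_member_classifier` and the synthetic feature values are likewise illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split


def train_member_classifier(member_feats, field_feats, test_size=0.3, **rf_params):
    """Train a Random Forest on members (label 1) and field stars (label 0)."""
    X = np.vstack([member_feats, field_feats])
    y = np.concatenate([np.ones(len(member_feats), dtype=int),
                        np.zeros(len(field_feats), dtype=int)])
    # Assumption: stratified split with a fixed seed for reproducibility.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0, stratify=y)
    params = dict(n_estimators=100, max_depth=20, criterion="gini", random_state=0)
    params.update(rf_params)
    clf = RandomForestClassifier(**params).fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    return clf, f1_score(y_test, y_pred), confusion_matrix(y_test, y_pred)


# Synthetic features (assumed): pmRA, pmDEC, parallax, G, BP-RP.
rng = np.random.default_rng(2)
member_feats = rng.normal([4.0, -3.0, 2.5, 14.0, 1.0],
                          [0.2, 0.2, 0.05, 2.0, 0.4], size=(300, 5))
field_feats = rng.normal([0.0, 0.0, 2.5, 16.0, 1.2],
                         [3.0, 3.0, 0.1, 2.0, 0.5], size=(700, 5))
clf, f1, cm = train_member_classifier(member_feats, field_feats)
```

The returned confusion matrix has rows for the true class and columns for the predicted class, so its off-diagonal entries count field stars taken for members and members taken for field stars.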
The hyperparameters were chosen to achieve a high F1 score and a good confusion matrix. After that, the trained model was applied to the suspicious stars based on five parameters: three astrometric (proper motion in RA, proper motion in DEC, and parallax) and two photometric (G magnitude and Bp-Rp color index). The members detected by Random Forest should be evaluated against the cluster members that have a probability higher than 0.8, based on proper motion, parallax, and CMD. If stars detected by Random Forest lay within the range of proper motion and parallax and on the CMD of the high-probability cluster members (higher than 0.8) that were detected by GMM in five dimensions (RA, DEC, pmRA, pmDEC, Parallax), they were considered as members outside the tidal radius. For all clusters we used n_estimators=100, max_depth=20, criterion=gini, and random_state=0, except for King 06 (n_estimators=50, max_depth=10), NGC 2423 (n_estimators=50, max_depth=20), and Melotte 72 (n_estimators=60, max_depth=20), each with the same criterion and random_state. We analyzed the stars detected by Random Forest based on their probability and selected the proper data based on the color-magnitude diagram. Finally, we selected cluster member stars with a probability higher than 0.5 for Alessi 01, King 06, NGC 752, M 38, M 41, M 47, and M 67, and higher than 0.6 for NGC 2423, ...
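Applying the trained classifier to the suspicious stars can be sketched as below. The probability cut follows the text, and the range check keeps only stars whose proper motion and parallax fall inside the interval spanned by the high-probability members. The helper name `select_outer_members`, the column order, and the synthetic data are assumptions; the CMD comparison described in the text is not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def select_outer_members(clf, suspicious_feats, member_feats, prob_threshold=0.5):
    """Select suspicious stars classified as members and inside the member range.

    Columns are assumed to be pmRA, pmDEC, parallax, G, BP-RP; the range
    check uses the first three (astrometric) columns only.
    """
    member_col = list(clf.classes_).index(1)
    prob = clf.predict_proba(suspicious_feats)[:, member_col]
    lo = member_feats[:, :3].min(axis=0)
    hi = member_feats[:, :3].max(axis=0)
    astro = suspicious_feats[:, :3]
    in_range = np.all((astro >= lo) & (astro <= hi), axis=1)
    return (prob > prob_threshold) & in_range, prob


# Synthetic training data (assumed values), then 50 member-like and
# 50 field-like suspicious stars.
rng = np.random.default_rng(3)
m_loc, m_scale = [4.0, -3.0, 2.5, 14.0, 1.0], [0.2, 0.2, 0.05, 2.0, 0.4]
f_loc, f_scale = [0.0, 0.0, 2.5, 16.0, 1.2], [3.0, 3.0, 0.1, 2.0, 0.5]
member_feats = rng.normal(m_loc, m_scale, size=(300, 5))
field_feats = rng.normal(f_loc, f_scale, size=(700, 5))
X = np.vstack([member_feats, field_feats])
y = np.concatenate([np.ones(300, dtype=int), np.zeros(700, dtype=int)])
clf = RandomForestClassifier(n_estimators=100, max_depth=20,
                             criterion="gini", random_state=0).fit(X, y)
suspicious = np.vstack([rng.normal(m_loc, m_scale, size=(50, 5)),
                        rng.normal(f_loc, f_scale, size=(50, 5))])
selected, prob = select_outer_members(clf, suspicious, member_feats)
```

Position is deliberately absent from the feature set, so a star far from the cluster center is judged only on its kinematics, parallax, and photometry.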
Figure 7. The position of the clusters' inner and outer members among the field stars. Black dots represent the field stars, red dots indicate stars that were selected by the Random Forest algorithm, and blue dots show stars that were selected by the GMM algorithm with a probability higher than 0.8.

RESULTS
Fig 1 shows the distribution of member candidates among field stars for six clusters in two parameters: pmRA and pmDEC. As seen in Fig 1, the DBSCAN selection data reveal a dense distribution among the sample sources. This indicates that DBSCAN can detect data among huge sample sources using just two filters: positive parallax and stars brighter than 20 mag. Fig 2 to 5 show stars that were selected by the GMM algorithm in five parameters (RA, DEC, pmRA, pmDEC, and Parallax). In the Gaussian Mixture Model (GMM), we selected a cluster number equal to 2, corresponding to the cluster and the field, for all clusters except Melotte 72.
To distinguish members of Melotte 72 from field stars, we used a GMM cluster number of three. In the case of this specific cluster, all data points within the other two GMM clusters were considered as suspect data, subject to a decision by the Random Forest algorithm in the final step. If these stars are indeed members of the cluster, they were identified by the Random Forest in the last stage. The confusion matrix is shown in Fig 6.
In this work, stars that have a probability higher than 0.8 are considered as cluster members. As seen in Fig 2, members that lie at a larger radius from the cluster center (outside the cluster-dense region) have a probability lower than 0.8; nevertheless, some of them can be selected as escape members. Fig 5 shows a clear main sequence and, for older clusters, a red giant branch. Fig 7 shows the data that were detected with the Random Forest algorithm among the GMM-detected members and field stars based on five parameters (pmRA, pmDEC, Parallax, G magnitude, and Bp-Rp color index). Position parameters were not used in the Random Forest model, in order to obtain the best view of cluster morphology. As seen in Fig 7, stars selected by Random Forest lie farther out than the cluster center, but these members are in the range of proper motion, parallax, and CMD of the members that were selected by GMM with a probability higher than 0.8, as shown in Fig 8 to Fig 10. As shown in Fig 7, the morphology of the clusters can be observed in detail, including their members, corona, and tidal tails. This method can detect members of the smallest clusters, such as Alessi 01 and Melotte 72, even far away from the cluster center. In the case of Melotte 72, which is at a large distance from Earth, as depicted in Fig 7, the candidate members fall within a large distance range of approximately 100 parsecs from the cluster. In Table 2, we present the selection data at each step. Notably, for the rich clusters (M 35 and NGC 2099), the Random Forest algorithm detected more members than for the other clusters. Table 3 displays the physical parameters of the members selected by the GMM with a probability higher than 0.8 and of those identified by the Random Forest, together with a comparison with Cantat-Gaudin et al. (2018). In this method, the selected parameters correspond with the physical characteristics detected by GMM with a probability higher than 0.8. In the case of the oldest cluster in this study (NGC 188), only a few members were detected using Random Forest. This observation could be attributed to its age and dense shape.
As seen in Fig 10, stars detected by Random Forest lie on the main sequence, the red giant branch, and also in the binary region. These stars could be studied in other works to discuss star formation theory, check simulation codes related to cluster star evolution, survey the chemical elements of clusters, study cluster morphology, calculate the gravitational effect of the Galaxy on the cluster, estimate the cluster's initial mass, and determine a reliable value for the cluster's age.
Looking at the region of members detected by GMM in Fig 7, Random Forest selected only a few additional members in the cluster-dense regions detected by GMM. This could indicate that the GMM algorithm detected cluster members in five dimensions very well in the cluster-dense regions. The distances of the clusters are obtained from Bailer-Jones (2023).

DISCUSSION
To determine the distribution of cluster members, we first found the tidal radius by fitting the King profile (King (1962)). For this, we divided the cluster regions into several concentric rings. Next, we calculated the number density of stars in each ring using Equation 1, where $N_i$ is the number of stars in each ring and $R_i$ is the distance from the center of the cluster for each ring. After that, the King profile was fitted using Equation 2, where $f_{bg}$ is the background surface density, $f_0$ is the peak density, and $R_c$ is the cluster core radius. Finally, the tidal radius was calculated by Equation 3 (Santos et al. (2005)), where $\sigma_{bg}$ is the uncertainty of the background surface density. Fig 11 displays the fitted King profile for the detected members. As seen in Fig 11, the number density of stars decreases significantly beyond the tidal radius. The stars within and outside the tidal radius are shown in Fig 12. As seen in Fig 12, stars within the tidal radius show dense regions. However, stars beyond the tidal radius exhibit a scattered distribution. Some members that were detected with Random Forest lie inside the tidal radius. The Random Forest detections have improved the tidal radius calculation. Table 4 shows the tidal radius and the members within and outside the tidal radius for each cluster. However, one luminosity peak has been observed either inside or outside the cluster's tidal radius, which will be studied in future works.
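The fitting procedure can be sketched as follows. The equations themselves are not reproduced in this text, so the forms used here are the ones commonly adopted in the open-cluster literature and are an assumption: ring density $N_i / (\pi (R_{i+1}^2 - R_i^2))$, King profile $f(r) = f_{bg} + f_0 / (1 + (r/R_c)^2)$, and tidal radius $R_c \sqrt{f_0 / (3\sigma_{bg}) - 1}$. The function names and numerical values are illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit


def ring_density(r, edges):
    """Number density of stars in concentric rings bounded by `edges`."""
    counts, _ = np.histogram(r, bins=edges)
    area = np.pi * (edges[1:] ** 2 - edges[:-1] ** 2)
    centers = 0.5 * (edges[1:] + edges[:-1])
    return centers, counts / area


def king_profile(r, f_bg, f_0, r_c):
    """Assumed King (1962) surface density form with a constant background."""
    return f_bg + f_0 / (1.0 + (r / r_c) ** 2)


def tidal_radius(f_0, r_c, sigma_bg):
    """Assumed form of the tidal radius from the fitted parameters."""
    return r_c * np.sqrt(f_0 / (3.0 * sigma_bg) - 1.0)


# Ring density for three stars at radii 0.5, 1.5, 1.6 and rings [0, 1], [1, 2].
centers, dens = ring_density(np.array([0.5, 1.5, 1.6]), np.array([0.0, 1.0, 2.0]))

# Fit a noiseless synthetic profile (assumed parameters) to check the recovery.
r = np.linspace(0.5, 60.0, 40)
density = king_profile(r, 0.2, 5.0, 4.0)
popt, pcov = curve_fit(king_profile, r, density,
                       p0=[0.1, 1.0, 1.0], bounds=(0.0, np.inf))
sigma_bg = 0.05  # would normally come from the fit uncertainty on f_bg
r_t = tidal_radius(popt[1], popt[2], sigma_bg)
```

With real data, `sigma_bg` could be taken as the square root of the corresponding diagonal element of `pcov`, and the ring edges would be chosen so that each ring holds enough stars for a stable density estimate.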

CONCLUSION
For a comprehensive study of star clusters, including aspects such as membership inside and outside the tidal radius, tidal tail morphology, formation and evolution of stars within clusters, and determination of cluster ages, we require a method capable of identifying reliable members across the wide field of view encompassing these clusters. In our previous work, we successfully identified reliable cluster members by combining two unsupervised machine-learning algorithms: DBSCAN and GMM. Applying our method to 12 distinct open clusters, we demonstrated its effectiveness in identifying reliable members within the tidal radius. However, the method also detected outer members that lay within the range of proper motion, parallax, and color-magnitude diagram of the members selected with high probability. In the current study, we take a step further from our previous work by incorporating a supervised machine learning algorithm, Random Forest. With this method, we successfully identified outer members of 15 open clusters across a wide field of view, revealing the morphology of the clusters at greater distances. Additionally, by fitting the King profile, we calculated the tidal radius and detected members beyond this radius. With a comprehensive view of the cluster members, we searched for mass segregation in the clusters under study and explored cluster luminosity. We found one peak of cluster luminosity far away from the cluster center; in some clusters, the peak is outside the tidal radius. The data obtained using this approach hold significant value for research on cluster evolution, evaporation processes, interactions between the Galaxy and clusters, and theories related to star formation within these clusters.

Figure 9. The parallax of the clusters' inner and outer members. Red lines represent stars that were selected by Random Forest, while blue lines indicate stars that were selected by GMM with a probability higher than 0.8.

Figure 11. The fitting of the King profile to cluster density.

Figure 14. The average luminosity of main sequence stars across different areas of the cluster. The red line represents the tidal radius.

Table 1. Data uncertainties in Gaia DR3 (Gaia Collaboration et al. (2023)).

The data for the 15 open clusters were obtained from the Gaia Data Release 3 (Gaia Collaboration et al. (2023)). These clusters include NGC 2099, M 67, M 41, M 48, M 38, M 47, Alessi 01, Melotte 18, King 06, NGC 2343, NGC 188, Collinder 463, M 34, M 35, and NGC 752. These clusters exhibit a variety of properties in terms of age, metallicity, and number of members, which allows for a proper evaluation of the method. For this analysis, stars within a radius of 300 arcminutes for NGC 752 and 150 arcminutes for the other clusters, with positive parallax and magnitude brighter ...

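Table 2. Results in every step. (1): Cluster name. (2): The number of sample sources from each cluster when filtered by photometric and astrometric conditions in this work. (3): Stars that DBSCAN selected among sample sources. (4) and (5): Two free parameters of the DBSCAN algorithm (refer to the text for more details). (6): Number of stars that are selected by the GMM algorithm as cluster members with a membership probability higher than 0.5. (7): Cluster members with a membership probability higher than 0.8. (8): Cluster members detected by Random Forest. (9): The evaluated F1 score value for Random Forest. (10): Range of parallax of field stars for training Random Forest. (11): Number of field stars for training Random Forest.

Table 3. Physical parameters of clusters.

MNRAS 000, 1-?? (2022)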