200,000 Candidate Very Metal-poor Stars in Gaia DR3 XP Spectra

Very metal-poor stars ($\rm[Fe/H]<-2$) in the Milky Way are fossil records of early chemical evolution and of the assembly and structure of the Galaxy. However, they are rare and hard to find. Gaia DR3 has provided over 200 million low-resolution ($R \approx 50$) XP spectra, which offer an opportunity to greatly increase the number of candidate metal-poor stars. In this work, we utilise the \texttt{XGBoost} classification algorithm to identify $\sim$200,000 very metal-poor star candidates. Compared to past work, we increase the candidate metal-poor sample by about an order of magnitude, with comparable or better purity. First, we develop three classifiers for bright stars ($BP < 16$): Classifier-T (for Turn-off stars), Classifier-GC (for Giant stars with high completeness), and Classifier-GP (for Giant stars with high purity), with expected purity of 52\%/45\%/76\% and completeness of 32\%/93\%/66\%, respectively. These three classifiers yield 11,000/111,000/44,000 bright metal-poor candidates, respectively. We then apply Classifier-T and Classifier-GP to faint stars ($BP > 16$) and obtain 38,000/41,000 additional metal-poor candidates with purity 29\%/52\%, respectively. We make our metal-poor star catalogs publicly available for further exploration of the metal-poor Milky Way.


INTRODUCTION
Very metal-poor stars (VMP, [Fe/H] < −2; Beers & Christlieb 2005) are fossil records of the early chemical enrichment history of the universe. The most metal-poor stars are likely to be some of the oldest stars that exist today, and their atmospheres contain information about the abundance pattern of gas in the early universe (e.g., Frebel & Norris 2015). Chemical abundances of a large sample of metal-poor stars can advance our understanding of early nucleosynthesis and thus constrain the early stellar masses, rotation rates, mixing processes, explosion energies, compact remnant masses (neutron stars or black holes), thermohaline convection and other stellar properties (e.g., Heger & Woosley 2010; Limongi & Chieffi 2012; Wanajo 2018; Jones et al. 2019; Ishigaki et al. 2021). Moreover, chemical abundances for these stars, together with kinematic data, can be utilised to understand the accretion history and early formation of the Milky Way (e.g., Hawkins et al. 2015; Das et al. 2020; Horta et al. 2021; Conroy et al. 2022; Belokurov & Kravtsov 2022; Rix et al. 2022; see Helmi 2020 for a review).
However, metal-poor stars are rare and difficult to find. Metal-poor stars only make up ∼0.1% of Milky Way stars (e.g., Starkenburg et al. 2016; El-Badry et al. 2018), and only a few thousand metal-poor stars have been spectroscopically confirmed in past surveys (e.g., Placco et al. 2018; Li et al. 2018; Chiti et al. 2021a). The typical approach is to first find metal-poor candidates and then follow them up with medium- or high-resolution spectra to obtain more detailed information (e.g., Beers & Christlieb 2005). Objective-prism surveys, photometric surveys and some wide-area spectroscopic surveys are the major ways to search for metal-poor stars.
Objective-prism surveys (Bond 1970; Bidelman & MacConnell 1973; Bond 1980) were once the most effective method to search for candidate metal-poor stars. They utilised low-resolution spectra (R ≈ 400) to estimate the strength of the Ca II K line at 393.36 nm. The HK-I, HK-II, and Hamburg/ESO surveys (Beers et al. 1985, 1992; Frebel et al. 2006; Christlieb et al. 2008; Beers et al. 2017) found a total of ∼4500 VMP stars (Limberg et al. 2021a). More recently, photometric surveys have been utilised to identify candidate metal-poor stars. The SkyMapper Southern Sky Survey (SMSS) utilises the SkyMapper v filter, which is sensitive to the Ca II H&K absorption features, together with SkyMapper u, g, i photometry to derive metallicities (Onken et al. 2019; Chiti et al. 2021a). Analogously, Pristine utilises a narrow-band filter centred on the Ca II H&K absorption lines, combined with SDSS broad-band g and i photometry, to derive metallicities (Starkenburg et al. 2017; Aguado et al. 2019). The Javalambre Photometric Local Universe Survey (J-PLUS; Cenarro et al. 2019) and the Southern Photometric Local Universe Survey (S-PLUS; Mendes de Oliveira et al. 2019) are also photometric surveys, which utilise four SDSS-like filters (g, r, i, z), one modified SDSS filter (u), and seven narrow-band filters to identify low-metallicity stars in the Galactic halo (Placco et al. 2021, 2022; Galarza et al. 2022). Another photometric selection method is Best & Brightest (Schlaufman & Casey 2014), which utilises all-sky APASS optical, 2MASS near-infrared, and WISE mid-infrared photometry to identify bright metal-poor star candidates through their lack of molecular absorption near 4.6 microns (Placco et al. 2019; Reggiani et al. 2020; Limberg et al. 2021b).
Besides the aforementioned dedicated efforts, some large surveys directly observe samples of stars at intermediate spectral resolution and estimate their metallicity, e.g., the SEGUE, LAMOST, and RAVE surveys. These surveys have found several thousand metal-poor stars. The Sloan Digital Sky Survey (SDSS; Eisenstein et al. 2011) and its Sloan Extension for Galactic Understanding and Exploration (SEGUE; Yanny et al. 2009; R ≈ 2,000), comprising SEGUE-1 and -2, motivated several high-resolution follow-up campaigns (e.g., Aoki et al. 2012). The Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) survey (R ≈ 1800; Deng et al. 2012) has also triggered high-resolution observations (e.g., Li et al. 2022); LAMOST-I (DR7) released more than seven million spectra of stars in the Milky Way. The RAdial Velocity Experiment (RAVE; R ≈ 7,000; Kunder et al. 2017) delivered spectra for about 480,000 stars. However, the number of candidate metal-poor stars found by each survey ranges from a few dozen to at most a few thousand, which is too small for a statistical investigation of metal-poor stars, especially in the extremely metal-poor ([Fe/H] < −3) or ultra metal-poor ([Fe/H] < −4) regimes. Thus we need a survey that can provide a much larger number of stellar spectra to enable us to find such objects.
The Gaia mission has brought a revolutionary change to Milky Way astronomy by providing astrometric data for billions of stars (Gaia Collaboration et al. 2016, 2022). Gaia Data Release 3 (DR3) released 200 million low-resolution XP spectra (R ≈ 50; De Angeli et al. 2022). Because of their low resolution, the XP spectra cannot provide detailed elemental abundances of stars. Additionally, Gaia GSP-Phot does not provide accurate metallicity estimates for the most metal-poor stars (Andrae et al. 2022). However, several works have demonstrated that these low-resolution XP spectra can be utilised to estimate effective temperature, surface gravity, and metallicity (e.g., Xylakis-Dornbusch et al. 2022; Andrae et al. 2023; Zhang et al. 2023). Thus, these 200 million low-resolution XP spectra give us an opportunity to greatly increase the number of candidate metal-poor stars, if we can make full use of them.
In this work, we identify metal-poor stars in the Gaia DR3 XP spectra using the XGBoost classification algorithm. In Section 2, we describe the XP spectra and other data utilised in this work. In Section 3, we introduce XGBoost, discuss the training process, and evaluate the performance of the models. In Section 4, we apply the XGBoost models to the XP spectra and discuss the resulting predictions. In Section 5, we compare our work with other surveys and projects and utilise existing high-resolution spectroscopic data to validate the performance of our models. Finally, we summarize this work in Section 6.
Gaia XP spectra: Gaia DR3 released low-resolution blue and red photometer spectra (BP/RP or XP spectra) for 210 million stars. Metallicities were derived from these spectra by Gaia GSP-Phot, but they are not accurate at low metallicities (Andrae et al. 2022). Thus, it is not efficient to directly utilise the GSP-Phot metallicity [M/H] in Gaia DR3 to search for metal-poor stars. The XP spectra have wide wavelength coverage (330 to 1050 nm) and low resolution. Because of this wide coverage, they include strong lines valuable for metallicity estimation, such as Ca II K and the Ca II infrared triplet, as well as the information carried by broad-band or narrow-band photometry. Thus, in principle, XP spectra can be utilised to detect metal-poor stars.
The XP spectra are released as Hermite function coefficients rather than fluxes versus wavelength (Carrasco et al. 2021). In order to avoid information loss (Carrasco et al. 2021), the inputs to the XGBoost model are the XP spectral coefficients rather than the corresponding sampled XP spectra. XGBoost requires the input vectors to be of the same length, so we do not truncate the XP coefficients.
Before passing the XP coefficients to the model, we first normalize and deredden them. We normalize the XP coefficients by their first coefficient to remove apparent magnitude information. Additionally, to take reddening into account, we determine three sets of extinction coefficients (two vectors and one matrix) that correct the normalized XP coefficient vectors C for extinction. The correction also involves Ĉ, a truncated XP coefficient vector containing the first 10 elements. We fit for the extinction coefficients by taking high-extinction stars in APOGEE and matching them with stars of similar log g (surface gravity), T_eff (effective temperature) and metallicity, but at low extinction. The extinction utilised in this analysis is from the 2-D map of Schlegel et al. (1998).
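The normalization step described above can be illustrated with a minimal sketch. It assumes that each star's XP coefficients are stored as one row of a NumPy array; the function name `normalize_xp` and the toy values are hypothetical. The dereddening step is not shown, because it depends on the fitted extinction coefficients.

```python
import numpy as np

def normalize_xp(coeffs):
    """Divide each star's XP coefficient vector by its first coefficient.

    coeffs: array of shape (n_stars, n_coeffs); the layout is an assumption.
    """
    coeffs = np.atleast_2d(np.asarray(coeffs, dtype=float))
    first = coeffs[:, :1]  # shape (n_stars, 1), broadcasts across columns
    return coeffs / first

# Toy example: two stars with the same spectral shape but a 10x flux difference.
star = np.array([100.0, 20.0, -5.0, 1.0])
xp = np.vstack([star, 10.0 * star])
norm = normalize_xp(xp)
```

After normalization the two toy stars have identical coefficient vectors, which is the sense in which the apparent magnitude information is removed.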
LAMOST DR7 and APOGEE DR17 metallicity: In order to train our model to identify metal-poor stars, we need a sample of stars that already have reliable metallicity estimates to provide true labels. We utilise the spectroscopic metallicities from LAMOST DR7 and APOGEE DR17. LAMOST spectra (R ≈ 1800) cover the optical band from 370 to 900 nm. APOGEE spectra (R ≈ 22,500) are a good complement to LAMOST because they cover the infrared band from 1.51 to 1.70 μm, which is better suited to dust-extincted regions, i.e., the Galactic disk and bulge. In total, we have 4 × 10^6 LAMOST and 6.5 × 10^5 APOGEE stars.
Data Queries and Quality Cuts: We utilised the Whole Sky Database (WSDB), which has ingested the entire catalogs for APOGEE DR17 and LAMOST DR7, for all queries (see Appendix C for the ADQL queries). We did not apply any significant quality cuts, but we do not expect this to significantly affect the results, for the following reasons. First, classification models are less sensitive to quality cuts than regression models. Second, after comparing the overlapping very metal-poor stars in LAMOST and APOGEE, we found that even if a star is flagged as having a bad spectral fitting solution in either APOGEE or LAMOST, it often still carries sufficient information about whether it is very metal-poor. For example, applying LAMOST quality cuts of SNR > 20 and feh_err < 0.5 (e.g., Zhang et al. 2023) removed 22% of our metal-poor training set, yet overlapping APOGEE spectra suggested that 84% of these stars were in fact very metal-poor. For APOGEE, metal-poor stars run up against the edge of the spectral grid, so using quality flags (e.g., FE_H_FLAG = 0) removed all stars with [Fe/H] < −2.25, even though they are very metal-poor in LAMOST.

Training and testing sets
The Gaia XP spectra with LAMOST or APOGEE metallicities form the training and testing sets in this work. We combine the two surveys directly because the average difference between the LAMOST and APOGEE [Fe/H] is 0.007 dex, which is well below the typical metallicity uncertainty of LAMOST (> 0.2 dex) or APOGEE (> 0.1 dex). We therefore conclude that these surveys are on similar metallicity scales within the range of parameters tested. Before training, we impose constraints on the training and testing sets in intrinsic colour (BP − RP)_0, magnitude BP and extinction E(B − V).
For the training set, we only consider stars with (BP − RP)_0 > 0.5 because, as shown in the right panel of Figure 1, we do not have metal-poor samples with (BP − RP)_0 < 0.5. Note that the method utilised to calculate (BP − RP)_0 excludes almost all stars with E(B − V) > 2, because the extinction coefficients should not be extrapolated outside the extinction range of this algorithm, as described in https://www.cosmos.esa.int/web/gaia/edr3-extinction-law. Additionally, we exclude fainter stars (BP > 16) from the training set, because XP spectra with BP > 16 generally do not have a high signal-to-noise ratio (S/N < 300). Thus, the training set only contains stars that satisfy the criteria above: (BP − RP)_0 > 0.5, E(B − V) < 2 and BP < 16. For the testing set, however, we only constrain the data by 0.5 < (BP − RP)_0 < 1.6 and E(B − V) < 2, because we aim to see whether our classifiers trained on bright stars (BP < 16) can be utilised to identify faint metal-poor stars (BP > 16). After applying these cuts, we obtain 2.5 × 10^6 LAMOST stars and 4.5 × 10^5 APOGEE stars with XP spectra available, of which 4088 and 1295, respectively, are metal-poor stars with [Fe/H] < −2. In total, we utilise 2.9 × 10^6 spectra for training and testing, of which 0.2% are metal-poor stars. We select 4 × 10^5 of them as the testing set and 2.5 × 10^6 as the training set.
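The colour, magnitude and extinction cuts quoted above can be expressed as boolean masks. The following is a minimal sketch, not the pipeline used in this work: the function name `selection_masks`, the argument names and the toy values are hypothetical, and only the cuts stated in the text are applied. The mask at (BP − RP)_0 = 0.8 anticipates the turn-off/giant split described in the following paragraph.

```python
import numpy as np

def selection_masks(bp_rp0, bp_mag, ebv):
    """Return (train, test, turnoff) boolean masks from the cuts quoted in the text."""
    bp_rp0 = np.asarray(bp_rp0, dtype=float)  # intrinsic colour (BP - RP)_0
    bp_mag = np.asarray(bp_mag, dtype=float)  # apparent BP magnitude
    ebv = np.asarray(ebv, dtype=float)        # extinction E(B - V)
    # Training eligibility: red enough, not too extincted, and bright.
    train = (bp_rp0 > 0.5) & (ebv < 2.0) & (bp_mag < 16.0)
    # Testing eligibility: colour window and extinction cut only.
    test = (bp_rp0 > 0.5) & (bp_rp0 < 1.6) & (ebv < 2.0)
    # Colour split between the turn-off and giant samples.
    turnoff = bp_rp0 < 0.8
    return train, test, turnoff

bp_rp0 = np.array([0.4, 0.6, 1.0, 0.9])
bp_mag = np.array([15.0, 15.0, 17.0, 15.0])
ebv = np.array([0.1, 0.1, 0.1, 2.5])
train, test, turnoff = selection_masks(bp_rp0, bp_mag, ebv)
```

In this toy case the faint third star is eligible for testing but not for training, which mirrors the intent of evaluating bright-star classifiers on faint stars.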
Figure 1 shows the colour-magnitude diagram of our training and testing sample. The horizontal axis is the intrinsic colour (BP − RP)_0 and the vertical axis is the absolute G magnitude (without any parallax cut here). The left and middle panels show the stars from the LAMOST and APOGEE surveys in the training and testing sets, which comprise main-sequence turn-off, dwarf, and giant stars. The right panel shows the distribution of metal-poor stars in the training and testing sets. The majority of metal-poor stars are turn-off and giant stars. Figure 1 suggests that our algorithm should only confidently identify metal-poor giants and turn-off stars, because identifying metal-poor stars in other evolutionary stages would require extrapolation. Note that it is harder to find very metal-poor turn-off stars than giants. Because the resolution of the XP spectra is low, the information we can obtain from them is close to what we can obtain from narrow-band photometric surveys, and the photometric features of turn-off stars are less metallicity dependent because these stars are hotter and their absorption features are suppressed. Consequently, we utilise different models to find metal-poor turn-off and giant stars. We divide the training and testing sample into two parts, according to whether (BP − RP)_0 < 0.8 or > 0.8. The models trained on the former subset are responsible for finding metal-poor turn-off stars, and the models trained on the latter are responsible for metal-poor giant stars. As shown in the right panel of Figure 1, our dataset does not contain many metal-poor dwarf stars, so we do not expect to find low-metallicity dwarf stars in this work.
The (BP − RP)_0, BP, and |b| distributions of the Gaia DR3 data and of the training and testing sets are shown in Figure 2. Note that in this plot we only include Gaia DR3 stars with (BP − RP)_0 > 0.5 and E(B − V) < 2 that have XP spectra. The distributions of the Gaia data differ substantially from those of our training and testing sets, especially in (BP − RP)_0 and Galactic latitude b, which reminds us that the metal-poor candidates we find may only be a small fraction of the total.

MODEL TRAINING AND VALIDATION
We choose the XGBoost algorithm to find metal-poor stars because it is a powerful and flexible algorithm that has been utilised in a variety of sub-fields of astrophysics (e.g., Li et al. 2021; He et al. 2022; Pham & Kaltenegger 2022; Lucey et al. 2022; Rix et al. 2022).
The algorithmic principles of XGBoost are not complex. In short, XGBoost repeatedly builds decision trees to fit the residuals from the previous trees, until the residuals stop shrinking or it reaches the maximum number of trees, which is a free parameter. It then sums the results from each tree, weighted by a learning rate (η), and plugs this value into the sigmoid function, σ(x) = 1/(1 + e^(−x)), to calculate the probability of the input belonging to a certain category. For a detailed description of XGBoost, see Chen & Guestrin (2016).
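The final step of this description, summing the tree outputs with a learning-rate weight and passing the result through the sigmoid function, can be sketched as follows. This is a schematic illustration of the text above and not XGBoost's internal implementation; the function names and the toy tree outputs are hypothetical.

```python
import numpy as np

def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def boosted_probability(tree_outputs, eta):
    """Combine per-tree raw scores into a class probability.

    tree_outputs: array of shape (n_trees, n_samples) with each tree's raw score.
    eta: learning rate applied to every tree's contribution.
    """
    raw = eta * np.sum(np.asarray(tree_outputs, dtype=float), axis=0)
    return sigmoid(raw)

# Three toy trees scoring two samples.
tree_outputs = np.array([[2.0, -1.0],
                         [1.0, -2.0],
                         [1.0, 3.0]])
prob = boosted_probability(tree_outputs, eta=0.5)
```

For the first sample the weighted sum is 2, giving a probability of about 0.88; for the second the tree scores cancel and the probability is 0.5.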
In this work, we utilise the coefficients of the normalized and dereddened XP spectra, together with their corresponding [Fe/H] from LAMOST or APOGEE, to compose the training and testing sets and train the XGBoost models to identify metal-poor stars in Gaia DR3. We describe the training process and the performance of the trained models in this section.

Training process
In this work, we choose a multi-class classification algorithm to identify metal-poor stars. The metallicity ([Fe/H]) of the training and testing samples ranges from −2.5 to +1.0. We utilise XGBoost models to classify the stars into four metallicity intervals: [Fe/H] < −2.0, −2.0 < [Fe/H] < −1.5, −1.5 < [Fe/H] < −1.0, and −1.0 < [Fe/H] < +1.0, with probabilities p_0, p_1, p_2, p_3, respectively. When a star's p_0 is larger than its other probabilities, it is classified as a metal-poor star. The prediction uncertainty can be calculated from the probabilities of the multi-class result; see Appendix A for more details. We choose the XGBoost classification algorithm, rather than the regression algorithm, for the following reasons. (I) The minimum [Fe/H] of the training and testing set is −2.5 because of limitations of the LAMOST and APOGEE analyses, even though we know that metal-poor stars with [Fe/H] < −2.5 exist in the data set. (II) Regression would waste a lot of computational power on determining the specific metallicity of non-metal-poor stars ([Fe/H] > −2.0), which we are not interested in. (III) Unlike a regression algorithm, a classification algorithm can more easily trade off completeness against purity. For samples that are difficult to identify, for example turn-off stars and faint stars, we can sacrifice completeness for higher purity.
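The four metallicity classes and the decision rule above can be written compactly. This is a minimal sketch: the helper names and toy values are hypothetical, and the probability array stands in for the output of a trained multi-class model.

```python
import numpy as np

# Class boundaries in [Fe/H] quoted in the text; class 0 is [Fe/H] < -2.0.
BIN_EDGES = np.array([-2.0, -1.5, -1.0])

def feh_to_class(feh):
    """Map [Fe/H] values to class labels 0, 1, 2, 3."""
    return np.digitize(np.asarray(feh, dtype=float), BIN_EDGES)

def is_metal_poor(proba):
    """Flag stars whose class-0 probability is the largest of the four."""
    proba = np.asarray(proba, dtype=float)
    return np.argmax(proba, axis=1) == 0

labels = feh_to_class([-2.3, -1.7, -1.2, 0.1])
proba = np.array([[0.6, 0.2, 0.1, 0.1],
                  [0.3, 0.4, 0.2, 0.1]])
flags = is_metal_poor(proba)
```

Only the first toy star is flagged, because its class-0 probability exceeds the other three.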
We utilise the completeness and purity calculated on the test set to evaluate the performance of the models. Completeness is the fraction of all true metal-poor stars that our model recovers. Purity is the fraction of stars predicted to be metal-poor by our models that are truly metal-poor. Positive and negative samples here refer to metal-poor ([Fe/H] < −2) and non-metal-poor ([Fe/H] > −2) stars, respectively. We divide the input samples into two training sets according to their intrinsic colour, 0.5 < (BP − RP)_0 < 0.8 and (BP − RP)_0 > 0.8, as shown in Figure 1, to find metal-poor turn-off and giant stars, respectively. Metal-poor giant stars make up 0.26% of the training set with (BP − RP)_0 > 0.8. Metal-poor turn-off stars are much rarer, making up only 0.06% of the training set with 0.5 < (BP − RP)_0 < 0.8. We therefore expect metal-poor turn-off stars to be more difficult to find than metal-poor giant stars.
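In terms of true positives (TP), false negatives (FN) and false positives (FP), these definitions correspond to completeness = TP / (TP + FN) and purity = TP / (TP + FP). A minimal sketch with hypothetical names and toy labels:

```python
import numpy as np

def completeness_purity(y_true_mp, y_pred_mp):
    """Completeness and purity for boolean metal-poor labels and predictions."""
    y_true = np.asarray(y_true_mp, dtype=bool)
    y_pred = np.asarray(y_pred_mp, dtype=bool)
    tp = np.sum(y_true & y_pred)    # metal-poor and predicted metal-poor
    fn = np.sum(y_true & ~y_pred)   # metal-poor but missed
    fp = np.sum(~y_true & y_pred)   # not metal-poor but predicted metal-poor
    completeness = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    purity = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    return completeness, purity

# Toy test set: 4 true metal-poor stars, 6 others; 3 hits, 1 miss, 1 false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
completeness, purity = completeness_purity(y_true, y_pred)
```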
In preliminary tests, we found that the extreme imbalance between positive ([Fe/H] < −2) and negative ([Fe/H] > −2) samples badly hinders the training process. To solve this problem, we processed the training sets in the following two steps. Step I: Utilise random under-sampling to remove over-represented metal-rich stars from the training set. We define NPR as the negative ([Fe/H] > −2) to positive ([Fe/H] < −2) ratio of the training set after under-sampling, and vary it from 1 to the maximum value that the training set allows.
Step II: Adopt the Synthetic Minority Over-sampling Technique (SMOTE) to populate the metal-poor stars in the under-sampled training set. SMOTE is an over-sampling method that synthesizes new examples of the minority class by selecting neighbouring examples in feature space and placing a new sample at a point along the line connecting them (Chawla et al. 2002).
When training XGBoost, many hyper-parameters can be adjusted, such as the learning rate (η), the maximum depth of a tree, and the minimum loss reduction required to make a further partition on a leaf node of the tree (γ). To find the optimal set of parameters, we utilise RandomizedSearchCV from scikit-learn (Pedregosa et al. 2011), which evaluates points randomly selected from a predefined box in hyper-parameter space, listed below.
• n_estimators: from 100 to 1200 in steps of 50
• max_depth: from 2 to 15 in steps of 1
• learning_rate: from 0.05 to 1 in steps of 0.05
• subsample: from 0.5 to 1 in steps of 0.05
• colsample_bytree: from 0.3 to 0.9 in steps of 0.05
• min_child_weight: from 1 to 20 in steps of 1
• gamma: from 0 to 0.7 in steps of 0.02
In this work, finding metal-poor stars involves a trade-off between purity and completeness. For each NPR, we utilise RandomizedSearchCV to find the optimal set of parameters. Figure 3 shows the completeness and purity of the optimized model as a function of the training set NPR. The purple curves refer to the classifiers trained to find metal-poor giant stars, and the red curves refer to the classifier trained to find metal-poor turn-off stars. From Fig. 3 we see that increasing the NPR of the training set increases the purity but decreases the completeness of the classifiers, and that it is much easier to find metal-poor giant stars than metal-poor turn-off stars, as discussed above. The three vertical lines indicate the NPR values chosen for Classifier-GP (green, 386), Classifier-GC (yellow, 40), and Classifier-T (blue, 1000). Classifier-GC (Giant Complete) denotes the model utilised to find metal-poor giants with high completeness, Classifier-GP (Giant Pure) denotes the model utilised to find metal-poor giants with high purity, and Classifier-T (Turn-off) denotes the model utilised to find metal-poor turn-off stars. The (completeness, purity) values for Classifier-T, Classifier-GC, and Classifier-GP are (40.0%, 47.2%), (94.6%, 47.2%), and (72.7%, 74.1%), respectively, derived by 3-fold cross-validation.
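The two resampling steps described above (random under-sampling to a target NPR, followed by SMOTE-style over-sampling) can be sketched with NumPy alone. This is a simplified illustration, not the implementation used in this work: the function names are hypothetical, the number of synthetic samples `n_new` and the neighbour count `k` are illustrative choices, and the interpolation is a bare-bones version of the scheme of Chawla et al. (2002).

```python
import numpy as np

def undersample_to_npr(X, y, npr, rng):
    """Keep all positives (y == 1) and a random subset of negatives (y == 0)
    so that n_negative / n_positive is approximately npr."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_keep = min(len(neg), int(round(npr * len(pos))))
    keep_neg = rng.choice(neg, size=n_keep, replace=False)
    idx = np.concatenate([pos, keep_neg])
    return X[idx], y[idx]

def smote_like(X_pos, n_new, k, rng):
    """Create n_new synthetic positives by interpolating between a random
    positive and one of its k nearest positive neighbours (requires k < len(X_pos))."""
    n = len(X_pos)
    dist = np.linalg.norm(X_pos[:, None, :] - X_pos[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]    # indices of the k nearest positives
    base = rng.integers(0, n, size=n_new)
    partner = neighbours[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))                    # position along the connecting line
    return X_pos[base] + lam * (X_pos[partner] - X_pos[base])

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.zeros(1000, dtype=int)
y[:10] = 1                                          # 10 toy "metal-poor" stars

X_u, y_u = undersample_to_npr(X, y, npr=40, rng=rng)
X_pos = X_u[y_u == 1]
X_syn = smote_like(X_pos, n_new=390, k=5, rng=rng)
```

Because every synthetic point is a convex combination of two real positives, the synthetic sample stays inside the region already occupied by the minority class.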

Models evaluation
Comparing the distributions for Classifier-GC and Classifier-GP in the left panel, we see that Classifier-GP effectively removes the false-positive (FP) stars, although it loses some true-positive (TP) stars. On the other hand, the right panel shows that Classifier-GP loses some metal-poor stars with [Fe/H] < −2.8, which is the cost of high purity. This is why we provide Classifier-GC as a supplement to Classifier-GP: Classifier-GC provides a high-completeness dataset and Classifier-GP provides a high-purity dataset. The good news for Classifier-GC is that most of the stars it misclassifies as metal-poor still have rather low metallicity, close to the [Fe/H] = −2 boundary.
The completeness and purity of the classifiers in different (BP − RP)_0, BP, and |b| intervals are shown in Figure 5. We utilise different colours and symbols to denote different models, and dashed and solid lines to denote faint and bright stars, respectively. We first discuss the performance of the classifiers on bright stars (BP < 16). Panels (a) and (d) show the performance of the classifiers as a function of (BP − RP)_0. Classifier-T has a purity comparable to that of Classifier-GP and Classifier-GC at their blue end, but its completeness is lower than that of these two models: because it is harder to find metal-poor turn-off stars, we have to sacrifice completeness for high purity, as discussed above. Panels (b) and (e) show the performance of the classifiers as a function of brightness. Bright stars tend to have higher purity and completeness than faint stars, because bright stars typically have a higher signal-to-noise ratio. Panels (c) and (f) show the performance as a function of |b|. The completeness and purity of our classifiers are lower at low latitude, because extinction makes classification more difficult in this region even with our extinction correction of the coefficients, and because the higher contamination rate of metal-rich ([Fe/H] > −2) stars statistically decreases the purity. Note that, because there are few metal-poor turn-off stars at low or high Galactic latitude in our training and testing sets, we increased the bin size for turn-off stars in these two panels to avoid statistical fluctuations.
Most of the stars with XP spectra released in Gaia DR3 are faint (BP > 16), so it is worthwhile to evaluate how the classifiers, which are trained on bright stars, perform on faint stars. We utilise Classifier-T and Classifier-GP to make predictions for faint stars. As shown by the dashed lines and open symbols in Figure 5, the overall purity is 29% for Classifier-T and 52% for Classifier-GP. This purity is better than we expected, so we include the faint stars in our catalog. However, as shown in panels (d), (e), and (f), the completeness for faint turn-off candidates is very low, less than 10%, which means that the faint metal-poor turn-off stars in our final catalogs make up only a very small fraction of the total. Because of the low S/N of faint stars, it is harder to find the genuine metal-poor ones. Under this circumstance, purity has a higher priority than completeness. A Shannon-entropy cut can be applied to the final results to increase their purity. More details about the Shannon-entropy cut are given in Appendix A.
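As a rough illustration of such a cut, the Shannon entropy of the four class probabilities can be computed per star and candidates above a chosen entropy can be discarded. This is a sketch under stated assumptions: the exact definition used in Appendix A is not reproduced here, and the logarithm base, the threshold value and the function name are hypothetical.

```python
import numpy as np

def shannon_entropy(proba, eps=1e-12):
    """Shannon entropy (in bits) of each row of class probabilities."""
    p = np.clip(np.asarray(proba, dtype=float), eps, 1.0)  # avoid log(0)
    return -np.sum(p * np.log2(p), axis=1)

proba = np.array([[0.25, 0.25, 0.25, 0.25],   # maximally uncertain prediction
                  [1.00, 0.00, 0.00, 0.00],   # fully confident prediction
                  [0.70, 0.20, 0.05, 0.05]])
entropy = shannon_entropy(proba)

ENTROPY_MAX = 1.5                              # hypothetical threshold in bits
keep = entropy < ENTROPY_MAX                   # retain only confident candidates
```

A uniform prediction over four classes has the maximum entropy of 2 bits, so lowering the threshold keeps progressively more confident candidates, trading completeness for purity.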

RESULTS
We have three reliable classifiers, Classifier-T, Classifier-GC, and Classifier-GP.We now classify the 200 million XP spectra released in Gaia DR3, and obtain three corresponding candidate metal-poor star catalogs, as shown in Table 1, Table 2, and Table 3, which in total contain 200,000 metal-poor candidates.
The colour-magnitude diagram for these candidate metal-poor stars, without any parallax quality cut, is shown in Figure 7. The left/middle/right panel shows the colour-magnitude diagram for the candidate metal-poor stars identified by Classifier-T/Classifier-GC/Classifier-GP. From these panels we confirm that the candidate metal-poor stars in our catalogues are dominated by turn-off stars and giant stars. However, a small number of dwarf stars are present in the cooler regions of the main sequence, as shown in the middle and right panels (M_G > 4, below the red dashed line). These red dwarf stars may be wrongly classified as metal-poor stars, because there are almost no red dwarf stars in the training sets for Classifier-GP and -GC. Table 4 shows that red dwarfs make up only a very small fraction of the metal-poor stars found by Classifier-GC and -GP, i.e., 1.7% for Classifier-GC, 0.7% for Classifier-GP (BP < 16) and 6.5% for Classifier-GP (BP > 16). Since the risk of contamination is higher for these stars, we include the absolute G-band magnitude M_G in our final catalogues so that users can filter out potential dwarf contamination.
The distance distributions and Galactic coordinate projections of the candidates are shown in Figures 8 and 9. Figure 8 shows the distance distributions of the candidate metal-poor stars. The distances are calculated by inverting the Gaia DR3 parallax. The distance to the Galactic centre is marked by the red dashed line (∼8 kpc from the Sun; Bland-Hawthorn & Gerhard 2016). The blue lines show the distribution of candidate metal-poor turn-off stars, and the orange and green lines show the distributions of candidate metal-poor giant stars. Compared to the candidate metal-poor giant stars, the turn-off stars are located closer to the Sun, as expected given their lower luminosities. The giants are distributed around the Galactic centre. This result indicates that the Galactic centre contains a large number of metal-poor stars, i.e., the Milky Way hosts an ancient, metal-poor, and centrally concentrated stellar population (e.g., Rix et al. 2022). Figure 9 shows the sky map of the candidate metal-poor stars we found in Gaia DR3. Because the dereddening process excludes almost all of the high-extinction stars (E(B − V) > 2), we do not obtain many stars at low Galactic latitude, as shown in Figure 9. Bulge stars and halo stars dominate our sample.
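Inverting the parallax amounts to d [kpc] = 1 / parallax [mas]. A minimal sketch follows; the function name is hypothetical, and returning NaN for non-positive parallaxes is an assumption of this example rather than a choice described in the text.

```python
import numpy as np

def distance_kpc(parallax_mas):
    """Distance in kpc from parallax in milliarcseconds via simple inversion."""
    plx = np.asarray(parallax_mas, dtype=float)
    dist = np.full(plx.shape, np.nan)
    good = plx > 0                      # inversion is undefined for plx <= 0
    dist[good] = 1.0 / plx[good]
    return dist

# A parallax of 0.125 mas corresponds to 8 kpc, roughly the distance
# to the Galactic centre quoted in the text.
dist = distance_kpc([0.125, 1.0, -0.1])
```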
The bright spots in the Galactic coordinate projections are globular clusters (Harris 2010). After testing, we found that, compared with Classifier-GC, which includes many globular clusters with −1.5 < [Fe/H] < −1.0, Classifier-GP excludes all of the globular clusters with average metallicity larger than −1.5 and most of those with average metallicity between −2 and −1.5, but keeps all of the globular clusters with metallicity less than −2. This demonstrates that Classifier-GP has higher purity than Classifier-GC. Note that the Galactic coordinate projections are also affected by the Gaia scanning law (see De Angeli et al. 2022) and by crowding issues for XP spectra in globular clusters.

DISCUSSION
In this work, according to Table 4, we add up the numbers of metal-poor candidates found by Classifier-T, Classifier-GC (BP < 16 at all M_G) and Classifier-GP (BP > 16 at all M_G), and obtain a total of 200,000 candidate metal-poor stars. Weighting each subsample by its purity in Table 4, we expect the catalog to contain 88,000 genuine metal-poor stars (overall purity of 44%).
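The purity-weighted estimate can be checked with a few lines of arithmetic. The sketch below uses the rounded candidate counts and purities quoted in the abstract as stand-ins for the Table 4 values, which are not reproduced here; the dictionary keys are only labels.

```python
# (number of candidates, expected purity) per subsample, rounded values
# taken from the abstract as a stand-in for Table 4.
subsamples = {
    "Classifier-T, BP < 16": (11_000, 0.52),
    "Classifier-GC, BP < 16": (111_000, 0.45),
    "Classifier-T, BP > 16": (38_000, 0.29),
    "Classifier-GP, BP > 16": (41_000, 0.52),
}

total = sum(n for n, _ in subsamples.values())
expected_genuine = sum(n * p for n, p in subsamples.values())
overall_purity = expected_genuine / total

print(f"{total} candidates, ~{expected_genuine:.0f} genuine, "
      f"purity {overall_purity:.1%}")
```

With these rounded inputs the totals come out at about 201,000 candidates and 88,000 expected genuine metal-poor stars, consistent with the 200,000 and 44% quoted above.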

Comparing with other surveys
Table 6 shows our results compared to previous photometric selections. Huang et al. (2022) utilised SMSS DR2 and Gaia EDR3 photometry to estimate metallicities for 24 million stars. They obtained half a million very metal-poor ([Fe/H] < −2.0) stars and over 25,000 extremely metal-poor ([Fe/H] < −3.0) stars. Of the very metal-poor candidates in Huang et al. (2022), 48,270 are also predicted to be very metal-poor by our classifiers. Chiti et al. (2021a) utilised SMSS DR2 photometry to derive photometric metallicities. They present more reliable metallicities of ∼280,000 stars with −3.75 ⩽ [Fe/H] ⩽ −0.75 down to g = 17; 18,640 of them are candidate metal-poor stars ([Fe/H] < −2). After validation against our training and testing set, we found their purity to be 49%; and there ... Their model is a regression model which tries to fit the metallicity for all stars, especially for metal-rich stars. As a result, their model may not do as well for metal-poor stars, which are only a very small part of the whole. In contrast, our models are more specialized and only focus on finding metal-poor stars. (III) Because we choose a classification algorithm rather than regression, we can trade off completeness against purity. For stars that are difficult to classify, for example turn-off stars, we can sacrifice completeness for higher purity by adjusting the NPR (see Fig. 3) and using SMOTE. (IV) The Gaia XP spectra we utilise have been dereddened, which may make our predictions more accurate, even without WISE photometry. Out of the 148,000 very metal-poor candidates in Andrae et al. (2023), 65,949 are found to be very metal-poor by our classifiers. Overall, we suggest that researchers and observers utilise this work together with Andrae et al. (2023) to decide which metal-poor candidates to follow up.
Zhang et al. (2023) utilised a forward model to estimate stellar parameters ([Fe/H], T_eff and log g), revised distances and extinctions for 220 million stars with XP spectra. However, the metallicities derived by the forward model tend to be overestimated at the very metal-poor end, and are even more biased than those derived by Andrae et al. (2023). We think this bias is caused by the imbalance between the numbers of metal-poor and non-metal-poor stars in their training set.
Martin et al. (2023) used the spectroscopic and photometric information of 219 million stars from Gaia DR3 to calculate synthetic narrow-band CaHK magnitudes sensitive to metallicity. These CaHK magnitudes mimic the observations of the Pristine survey. They derived photometric metallicities for 30 million high signal-to-noise FGK stars, and identified 200,000 very metal-poor candidates and 8,000 extremely metal-poor candidates ([Fe/H]_phot < −2 and [Fe/H]_phot < −3, respectively). Because their data were released while this paper was already in review, we do not consider their results in our comparisons.

Validation with existing high-resolution spectra
Previous studies have obtained many high-resolution follow-up observations of candidate metal-poor stars. We can utilise these confirmed metal-poor stars to evaluate the completeness of our XGBoost models. The results are shown in Table 8. In this table, we utilise 6 metal-poor halo star data sets, 3 metal-poor bulge data sets, 1 metal-poor disk star data set and 1 carbon-enhanced metal-poor (CEMP; [C/Fe] > +0.7) data set to test our models. For each data set, we exclude stars with dereddened colour (BP − RP)0 < 0.5 or E(B − V) > 2. Then we divide each data set into turn-off metal-poor stars ((BP − RP)0 < 0.8) and giant metal-poor stars ((BP − RP)0 > 0.8). The total numbers of these stars are shown in the third and fourth columns. Finally, we utilise Classifier-T on the turn-off stars and Classifier-GC and Classifier-GP on the giant stars, and obtain the corresponding completeness, marked as completeness-T, completeness-GC and completeness-GP. This table shows that the completeness for these data sets is very close to the results from our test set, especially for the halo stars. We also test our classifiers on CEMP stars, as shown in the last row of Table 8. As might be expected, the completeness of the classifiers for CEMP stars is not as high as for other metal-poor stars, potentially because the enhanced carbon makes the metal-poor star spectra look more metal-rich.

We then utilise the resampled training sets to train the models. Finally, we obtain three classifiers, Classifier-T, Classifier-GC, and Classifier-GP, and utilise them to identify metal-poor turn-off and giant stars among the Gaia DR3 stars with XP spectra. We present the histograms of the testing results and the completeness/purity distributions for these models in Figure 4 and Figure 5.
In total, we obtained 200,000 metal-poor candidates with an overall purity of 44%. This number of metal-poor candidates is around an order of magnitude larger than in previous work (e.g., Best & Brightest, SkyMapper, and Pristine), with similar or even better purity.
We make the full catalog available in the supplementary online material (Tables 1, 2 and 3).

Comparison to Andrae et al. (2023)

Figure 1. Colour-magnitude diagram of our training and testing sets. The horizontal axis is the Gaia intrinsic colour (BP − RP)0 and the vertical axis is the Gaia absolute magnitude. The LAMOST and APOGEE samples, which primarily comprise main-sequence turn-off, giant, and dwarf stars, are shown in the left and middle panels. The right panel shows the metal-poor stars from LAMOST and APOGEE, which are primarily turn-off and giant stars. We divide the training and testing sets into two parts according to (BP − RP)0, as shown by the red dashed line. The samples to the left/right of the red dashed line are utilised to train the models that identify turn-off/giant metal-poor stars.
Completeness and purity are defined as:
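In terms of the true positive (TP), false positive (FP) and false negative (FN) counts used in this work, the standard definitions, which we assume are the ones intended here, are completeness = TP / (TP + FN) and purity = TP / (TP + FP). A minimal sketch follows; the function name and the example counts are hypothetical and are not taken from our tables.

```python
from typing import Tuple


def completeness_and_purity(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (completeness, purity) from confusion-matrix counts.

    Assumed standard definitions:
      completeness = TP / (TP + FN)  (fraction of true metal-poor stars recovered)
      purity       = TP / (TP + FP)  (fraction of candidates that are truly metal-poor)
    """
    if tp + fn == 0:
        raise ValueError("completeness is undefined without any positive stars")
    if tp + fp == 0:
        raise ValueError("purity is undefined without any predicted candidates")
    return tp / (tp + fn), tp / (tp + fp)


# Hypothetical counts, for illustration only.
completeness, purity = completeness_and_purity(tp=66, fp=21, fn=34)
```

Raising the decision threshold of a classifier typically lowers FP (higher purity) at the cost of a larger FN (lower completeness), which is the trade-off explored with the different classifiers.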

Figure 2. Distributions of (BP − RP)0, magnitude, and absolute Galactic latitude for the training set, the testing set, and the Gaia DR3 stars with XP spectra, restricted to (BP − RP)0 > 0.5 and E(B − V) < 2. We display only a randomly selected 2% of the Gaia DR3 stars with XP spectra.
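The selection cuts quoted in this work (dereddened colour above 0.5, reddening below 2, and a turn-off/giant boundary at a dereddened colour of 0.8) can be sketched as a simple routing function. This is an illustrative sketch only: the function and constant names are hypothetical, the colour is assumed to be the dereddened Gaia colour and the reddening to be E(B − V), and the treatment of stars lying exactly on a boundary is an assumption.

```python
from typing import Optional

MIN_COLOUR0 = 0.5             # stars bluer than this are excluded
MAX_EBV = 2.0                 # stars more reddened than this are excluded
TURNOFF_GIANT_BOUNDARY = 0.8  # colour separating turn-off stars from giants


def route_star(colour0: float, ebv: float) -> Optional[str]:
    """Return 'turn-off', 'giant', or None if the star fails the cuts."""
    if colour0 < MIN_COLOUR0 or ebv > MAX_EBV:
        return None
    # Turn-off stars would go to Classifier-T, giants to Classifier-GC/GP.
    return "turn-off" if colour0 < TURNOFF_GIANT_BOUNDARY else "giant"
```

For example, `route_star(0.6, 0.1)` selects the turn-off branch, while a star with reddening above the limit is dropped regardless of colour.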

Figure 3. The completeness and purity of the classifiers as a function of the training NPR. The horizontal axis is the negative-to-positive ratio of the training sample; the vertical axis is the completeness and purity of the models. At each NPR, the classifiers were run with the optimized set of hyper-parameters. The vertical lines of different colours mark the NPR chosen for Classifier-GC, Classifier-GP, and Classifier-T. The corresponding completeness and purity can be read off at the vertical lines. The purple curves refer to the classifiers trained to find metal-poor giant stars and the red curves refer to the classifiers trained to find metal-poor turn-off stars.
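The NPR of a training sample can be set by randomly discarding negative (non-metal-poor) examples. The sketch below illustrates this under stated assumptions: the function name, the fixed random seed, and the toy identifiers are hypothetical, and it is not the exact procedure used to produce this figure.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def undersample_negatives(
    positives: Sequence[T], negatives: Sequence[T], npr: float, seed: int = 0
) -> List[T]:
    """Randomly keep negatives so that len(kept) / len(positives) is about `npr`."""
    if npr <= 0:
        raise ValueError("npr must be positive")
    # Never ask for more negatives than are available.
    n_keep = min(len(negatives), round(npr * len(positives)))
    rng = random.Random(seed)
    return rng.sample(list(negatives), n_keep)


# Toy example: 10 positives, 1000 negatives, target NPR of 5.
toy_positives = list(range(10))
toy_negatives = list(range(100, 1100))
kept_negatives = undersample_negatives(toy_positives, toy_negatives, npr=5)
```

A lower NPR makes the classifier see relatively more metal-poor stars during training, which generally favours completeness; a higher NPR generally favours purity.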
After the training process, we utilise the testing sets to evaluate the performance of the classifiers as a function of [Fe/H], apparent magnitude, (BP − RP)0, and absolute Galactic latitude |b|. Typically, three factors affect the performance of the classifiers: stellar type (turn-off or giant stars), brightness, and reddening. In this work, we utilise the intrinsic colour (BP − RP)0 to denote the type of star, because we do not have metal-poor dwarf stars in the training and testing sets, as shown in Figure 1. The apparent magnitude denotes the brightness of the stars. Additionally, |b| can be utilised as an indicator of reddening, because stars in low-|b| regions, such as the disk and bulge, often suffer severe extinction. The metallicity distributions for the testing set are shown in Figure 4. The left panel shows the stars classified as metal-poor by the different classifiers, i.e. the True Positive (TP) and False Positive (FP) samples, and the right panel shows the False Negative (FN) samples.

Figure 4. The left panel shows the metallicity distribution of stars that are predicted to be metal-poor. The dashed line marks the boundary between true positive and false positive samples. The right panel shows the metallicity distribution of false negative stars (i.e. metal-poor stars missed by XGBoost).

Figure 5. The completeness and purity of the different classifiers as a function of intrinsic colour (BP − RP)0, apparent magnitude, and absolute Galactic latitude |b|.

Figure 6. Distributions of (BP − RP)0, magnitude, and |b| for the metal-poor candidates we found in Gaia DR3 with the different classifiers. Note that the |b| distribution for giants is skewed to very low |b| because those stars lie mostly towards the inner Galaxy (bulge/inner halo), as seen in Figure 8.

Figure 7. Colour-magnitude diagram of the metal-poor stars we found in Gaia DR3 with the different classifiers. The horizontal axis is the dereddened colour (BP − RP)0 and the vertical axis is the absolute magnitude (for stars without any parallax-quality cut).

Figure 8. The distance distribution of the candidate metal-poor stars we found in Gaia DR3. The red dashed line in the left panel marks the Galactic centre. Lines of different colours show the candidate metal-poor stars identified by the different classifiers.
Very metal-poor stars ([Fe/H] < −2) record the chemical enrichment history, accretion events, and early stages of the Milky Way. However, they are rare and difficult to find. In this work, we train XGBoost models to identify metal-poor stars in Gaia DR3. The inputs to the models are the coefficients of normalized and dereddened XP spectra. The classifiers split the stars into the [Fe/H] intervals −2.5 < [Fe/H] < −2, −2 < [Fe/H] < −1.5, −1.5 < [Fe/H] < −1, and −1 < [Fe/H] < +1. Because of the extreme imbalance between positive and negative samples, we randomly exclude some negative samples and utilise the SMOTE algorithm to over-sample the training sets.
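The over-sampling step can be illustrated with a simplified SMOTE-like interpolation: each synthetic positive is placed at a random point on the segment joining a minority sample to one of its nearest minority neighbours. This is a minimal sketch and not the implementation used in this work; the function name, the default number of neighbours, and the toy feature vectors are assumptions for illustration, and an established implementation (for example the `SMOTE` class of the imbalanced-learn package) is preferable in practice.

```python
import math
import random
from typing import List, Sequence


def smote_like_oversample(
    minority: Sequence[Sequence[float]], n_new: int, k: int = 5, seed: int = 0
) -> List[List[float]]:
    """Create `n_new` synthetic minority samples by neighbour interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to `base`.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        # Interpolate at a random fraction t in [0, 1) along the segment.
        t = rng.random()
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy two-dimensional feature vectors standing in for XP coefficients.
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_synthetic = smote_like_oversample(toy_minority, n_new=10, k=2)
```

Because every synthetic sample lies between two real positives, the over-sampled set stays within the region of feature space already occupied by the metal-poor training stars.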

Figure 9. Galactic coordinate projections of the candidate metal-poor stars we found with Classifier-GC, Classifier-GP, and Classifier-T. The area of each HEALPix pixel is 3.36 deg².

Figure A2. Distributions of the mean class probabilities (subscripts 0, 1, 2, 3) with 1σ error bars for our three catalogues in different Shannon entropy intervals.

Figure A3. Number and purity of the remaining faint (BP > 16) metal-poor turn-off candidates as a function of the Shannon entropy threshold.
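The Shannon entropy of the per-class probabilities measures how confident a classifier is about a given star, and can be used to filter candidates. The sketch below is illustrative only: the use of the natural logarithm, the choice to keep stars at or below the threshold, the function names, and the example probabilities are all assumptions rather than details taken from this work.

```python
import math
from typing import List, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy H = -sum(p * ln p), with 0 * ln 0 taken as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def keep_confident(
    candidates: Sequence[Sequence[float]], max_entropy: float
) -> List[Sequence[float]]:
    """Keep candidates whose class-probability entropy is at most `max_entropy`."""
    return [c for c in candidates if shannon_entropy(c) <= max_entropy]


# Hypothetical four-class probabilities for two candidates.
toy_candidates = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction, low entropy
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain, entropy ln(4)
]
toy_kept = keep_confident(toy_candidates, max_entropy=0.5)
```

Lowering the threshold removes the most ambiguous candidates first, so the number of remaining stars falls while the purity is expected to rise.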

Table 1 :
Metal-poor turn-off candidates found by Classifier-T. The four probabilities (subscripts 0, 1, 2, 3) refer to the probabilities of a star lying in each of the [Fe/H] intervals. (This table is available in its entirety in the online supplementary material.)

Table 2 :
Metal-poor giant candidates found by Classifier-GC. (This table is available in its entirety in the online supplementary material.)

Table 3 :
(This table is available in its entirety in the online supplementary material.)

Of the candidates of Chiti et al. (2021a), 9,218 are also predicted to be metal-poor by our Classifiers. The Pristine survey does not publicly release its data, but according to Starkenburg et al. (2017) and Youakim et al. (2020), Pristine had covered a sky area of ∼2500 deg² at the time of those papers. In each ∼deg² field, they find ∼7 stars with [Fe/H] < −2.5 down to a magnitude of 18. The purity of Pristine in finding stars with [Fe/H] < −2.5 is 49% (Aguado et al. 2019). The Best & Brightest initiative selected over 11,000 candidate VMP ([Fe/H] < −2) and EMP ([Fe/H] < −3) stars, with overall purities of 30% and 5%, respectively (Schlaufman & Casey 2014; Placco et al. 2019; Limberg et al. 2021b). Compared with other surveys, our work increases the number of candidate metal-poor stars by about an order of magnitude, with similar or higher purity. The comparison results are shown in Table 6.

Recently, Andrae et al. (2023) utilised XGBoost and XP spectra, together with 38 narrowband colours derived from XP spectra and broadband photometry (Gaia: G, BP, RP and CatWISE: W1, W2), to derive metallicity, Teff and log g for 175 million stars. They reduced the temperature-extinction degeneracy by introducing CatWISE W1 and W2, which extend into the infrared, into the model. The metallicities were derived with an XGBoost regression model, with true labels from APOGEE augmented by a set of very metal-poor stars (Li et al. 2022). Because we both utilise the XGBoost algorithm and deal with the same data set, it is worth comparing our results with theirs. The comparisons are shown in Table 7. Table-1 and table-2 are the two tables published by Andrae et al. (2023). In short, table-2 is a high-accuracy subset of the bright (G < 16) giant stars of table-1. Table 7 shows that, for giant candidates, Classifier-GP has higher purity and more candidates compared with table-1

Table 4 :
The number, purity and completeness of metal-poor candidates we found by Classifier-T, Classifier-GC and Classifier-GP in different   and BP ranges.

Table 5 :
A summary table for Sections 3 and 4.

Table 6 :
Comparison to other photometric surveys. The purities listed are obtained from comparison with LAMOST DR7 and APOGEE DR17, except for Pristine, whose purity is taken from Aguado et al. (2019) and refers to [Fe/H] < −2.5 rather than [Fe/H] < −2.

Table 8 :
Prediction results for metal-poor stars that are confirmed by high-resolution spectra.