ABSTRACT

The Ly|$\alpha$| emission line is a characteristic feature of high-z galaxies and serves as a probe of cosmic reionization. While previous works present various correlations between Ly|$\alpha$| emission and the physical properties of host galaxies, it is still unclear which characteristics predominantly determine the Ly|$\alpha$| emission. In this study, we introduce a neural network approach that handles multiple galaxy properties simultaneously. We construct a neural-network-based prediction model that identifies Ly|$\alpha$| emitters (LAEs) from six physical properties: star formation rate, stellar mass |$M_*$|, UV absolute magnitude |$M_\mathrm{UV}$|, age, UV slope |$\beta$|, and dust attenuation |$E(B-V)$|, all obtained by spectral energy distribution fitting. The network is trained with galaxy samples from the VANDELS and MUSE spectroscopic surveys and achieves a true positive rate of 77 per cent and a false positive rate of 14 per cent. The permutation feature importance method shows that |$\beta$|, |$M_\mathrm{UV}$|, and |$M_*$| are important for the prediction of LAEs. As an independent validation, we find that 91 per cent of LAEs spectroscopically confirmed by the JWST have a probability of LAE higher than 70 per cent in this model. This prediction model enables the efficient construction of a large LAE sample in a wide and continuous redshift space using only photometric data. We apply the prediction model to the JWST photometric galaxy sample and obtain a Ly|$\alpha$| fraction consistent with previous studies. Moreover, we demonstrate that the difference between the distributions of LAEs predicted by the model and the spectroscopically identified LAEs provides a strong constraint on the H ii bubble size.

1 INTRODUCTION

Ly|$\alpha$| emission, theoretically predicted by Partridge & Peebles (1967), is one of the most powerful probes of the high-z Universe. Significant efforts are invested in detecting Ly|$\alpha$| emission because it is one of the strongest intrinsic features in high-z galaxy spectra and is therefore readily observable. Ly|$\alpha$| is also a powerful tool for exploring cosmic reionization because it is scattered by neutral hydrogen gas. The Ly|$\alpha$| fraction, defined as the number ratio of Ly|$\alpha$| emitters (LAEs) to all galaxies, is one of the most frequently used diagnostics of cosmic reionization. Previous studies have reported that the Ly|$\alpha$| fraction drops sharply from |$z=6$| to |$z=7$|, whereas it increases continuously at |$4 \lt z \lt 6$| (Stark, Ellis & Ouchi 2011; Schenker et al. 2014). This drop is thought to indicate a neutral intergalactic medium (IGM) in the epoch of reionization (EoR). Recent work continues to measure the Ly|$\alpha$| fraction (Fuller et al. 2020; Bolan et al. 2022; Goovaerts et al. 2023; Jones et al. 2024), constraining the evolution of the neutral fraction.

Increasing the sample size of LAEs is important for carrying out a wide range of studies of the high-z Universe. However, spectroscopic observations, while a valuable tool for identifying Ly|$\alpha$|, require long observing times, which makes it challenging to increase the sample size. Although LAE selection using narrow-band (NB) filter observations allows for efficient surveys over a wide field of view, the wavelength range that NB filters cover is considerably limited. This limitation inevitably makes LAE samples discrete with respect to redshift. For spectroscopic samples, the limitation is somewhat relaxed, but it is still difficult to detect Ly|$\alpha$| emission at wavelengths obscured by the strong OH sky lines.

To understand the physical mechanism of Ly|$\alpha$| emission, previous studies have searched for correlations between Ly|$\alpha$| and the physical properties of host galaxies. In general, Ly|$\alpha$| luminosity increases with star formation rate (SFR) and halo mass (Khostovan et al. 2019). Typical cosmological simulations assign Ly|$\alpha$| luminosities using the relation between Ly|$\alpha$| and SFR or halo mass. However, Ly|$\alpha$| emission varies due to various internal and surrounding effects. Ly|$\alpha$| emission is sensitive to the presence of dust, and less dusty galaxies tend to show Ly|$\alpha$| emission (Sobral et al. 2018b; Santos et al. 2020). Because the UV slope |$\beta$| is thought to reflect the amount of dust, a bluer UV slope can be indicative of Ly|$\alpha$| emission. Younger age (Arrabal Haro et al. 2020) and lower mass (Khostovan et al. 2019; Santos et al. 2020) can also be factors determining Ly|$\alpha$| emission. LAEs are suggested to have smaller sizes than normal star-forming galaxies (Law et al. 2012; Malhotra et al. 2012; Marchi et al. 2018; Paulino-Afonso et al. 2018; Shibuya et al. 2019; Ribeiro et al. 2020). Smaller sizes can enhance the escape of ionizing photons. Given the positive correlation between the ionizing photon escape fraction, |$f_\mathrm{esc}$|, and the Ly|$\alpha$| photon escape fraction, |$f_\mathrm{esc, Ly\alpha }$| (Maji et al. 2022; Begley et al. 2024), small galaxies tend to show Ly|$\alpha$| emission. Some studies report that the spatial offset between the host galaxy and the Ly|$\alpha$| emission is a factor shaping Ly|$\alpha$| emission (Hoag et al. 2019; Lemaux et al. 2021). Trainor et al. (2016) argue that the velocity offset between the systemic redshift and Ly|$\alpha$| helps the escape of Ly|$\alpha$| photons. Gazagnes et al. (2020) suggest that the H i covering fraction regulates the Ly|$\alpha$| emission. Metallicity is also thought to be related to Ly|$\alpha$| emission (Trainor et al. 2016). Hard UV emission from low-metallicity stellar populations ionizes the neutral hydrogen, which enhances the creation and escape of Ly|$\alpha$| photons. Although much effort has been expended, understanding which parameters determine Ly|$\alpha$| radiation is hampered by the complicated nature of Ly|$\alpha$| emission.

Previous studies have tried to predict Ly|$\alpha$| emission using the correlation between Ly|$\alpha$| and the physical properties of host galaxies (McCarron et al. 2022; Chávez Ortiz et al. 2023; Foran et al. 2023; Napolitano et al. 2023, but see Bolan et al. 2024). However, prediction using linear regression is not entirely accurate because Ly|$\alpha$| emission correlates with multiple galaxy properties and the scatter of each correlation is large. Some work uses multiparametric analysis to predict Ly|$\alpha$| emission (Runnholm et al. 2020; Hayes et al. 2023). Recent work has begun to use machine learning techniques to deal with the multivariable problem of the complex Ly|$\alpha$| emission. McCarron et al. (2022) use the k-nearest neighbours method, and Napolitano et al. (2023) use the random forest method to predict Ly|$\alpha$| emission from galaxies.

In this paper, we use a neural network to predict Ly|$\alpha$| emission from galaxy properties. A neural network can handle non-linear relations between input parameters and an output value; it is therefore suitable for learning the complex relations between the physical properties and Ly|$\alpha$| emission. By using the neural network as a prediction model, the presence or absence of Ly|$\alpha$| emission can be predicted from physical properties that can be estimated from broad-band (BB) photometry alone. We can efficiently detect LAEs in a wide field of view and in a wide and continuous redshift range, overcoming the limitations of spectroscopic observation and NB imaging. Recently, JWST observations have started to provide a number of reionization-era galaxies. The discovery of LAEs even at |$z\gt 7$|, where H i IGM absorption should be intense, suggests the existence of ionized bubbles (Jones et al. 2024; Witstok et al. 2024). The substantial increase in the number of LAEs predicted by the network will allow us to explore the evolution of the neutral fraction and the structure of reionization.

This paper is structured as follows. Section 2 describes the observational data and their analysis for the construction of a training data set. The training and evaluation of the prediction model are described in Section 3. In Section 4, we discuss the correlation between input parameters and the probability of Ly|$\alpha$| inferred by the model. We apply the prediction model to galaxies detected by JWST observation and discuss the application to the study of reionization.

Throughout this paper, we use the AB magnitude system (Oke & Gunn 1983). We assume a |$\Lambda$|CDM cosmology with |$h = 0.7$|⁠, |$\Omega _\mathrm{m} = 0.3$|⁠, and |$\Omega _\Lambda = 0.7$|⁠. We use cMpc (pMpc) to indicate comoving (physical) scales.

2 DATA

We collect spectroscopic data from the VANDELS (McLure et al. 2018) and MUSE (Bacon et al. 2017; Herenz et al. 2017) spectroscopic surveys. The VANDELS survey targets the UDS (Ultra Deep Survey) and CDFS (Chandra Deep Field South) fields. The target selection is based on the photometric redshift of star-forming galaxies with |$i \lt 27$|. The survey is carried out using VIMOS on the VLT. The wavelength range of the spectra is |$4800\, \mathrm{\mathring{\rm A}} \lt \lambda \lt 9800\, \mathrm{\mathring{\rm A}}$|, which captures the Ly|$\alpha$| emission at |$2.9 \lt z \lt 7.0$|. The final data release of VANDELS (Garilli et al. 2021) contains 2087 galaxies at |$1 \lt z \lt 6.5$|.

The MUSE survey is carried out in the GOODS-S (CDFS) and COSMOS fields using MUSE on the VLT. The MUSE survey comprises observations at two different depths. MUSE-Wide (Herenz et al. 2017; Urrutia et al. 2019) covers the COSMOS and GOODS-S fields, and MUSE-Deep (Bacon et al. 2017, 2023; Inami et al. 2017) targets the HUDF (Hubble Ultra-Deep Field) located at the centre of the GOODS-S field. The 1|$\sigma$| emission line detection sensitivity of the MUSE-Wide data is |$1 \times 10^{-19}\, \mathrm{erg\, s^{-1}\, cm^{-2}\, \mathring{\rm A}^{-1}}$|, and that of the MUSE-Deep data is |$5.5 \times 10^{-20}\, \mathrm{erg\, s^{-1}\, cm^{-2}\, \mathring{\rm A}^{-1}}$|. The wavelength range of the spectra is |$4800\, \mathrm{\mathring{\rm A}} \lt \lambda \lt 9300\, \mathrm{\mathring{\rm A}}$|. Object detection relies on the presence of emission lines, which favours the selection of low-mass galaxies. Therefore, the MUSE survey complements the galaxy sample of the VANDELS survey, which is basically a magnitude-limited sample. Kerutt et al. (2022) provide Ly|$\alpha$| measurements of LAEs detected in the MUSE-Wide and MUSE-Deep surveys. Schmidt et al. (2021) provide a catalogue of 2052 galaxies at |$1.5 \lt z \lt 6.4$|. Fig. 1 shows the relation between the redshift and |$M_\mathrm{UV}$| of the VANDELS and MUSE samples.

Figure 1.

The relation between the redshift and |$M_\mathrm{UV}$| of galaxies of VANDELS (blue) and MUSE (orange). |$M_\mathrm{UV}$| is calculated using SED fitting (see Section 2.3).

2.1 Ly|$\alpha$| flux measurement

We measure the Ly|$\alpha$| flux of spectroscopically observed galaxies from VANDELS and MUSE at |$3 \lesssim z \lesssim 6$|. In this redshift range, we assume that reionization is complete and that the attenuation of Ly|$\alpha$| emission by the global neutral fraction is negligible. For the VANDELS sample, we use the spectroscopic data to measure the Ly|$\alpha$| flux and equivalent width (EW), which are not included in the catalogue of Garilli et al. (2021). First, we fit the continuum redwards of Ly|$\alpha$| in the rest-frame wavelength range of |$1250\, \mathrm{\mathring{\rm A}} \lt \lambda \lt 2500 \, \mathrm{\mathring{\rm A}}$| with a power law of index |$\alpha$| (|$A\lambda ^\alpha$|). The UV continuum is then subtracted using the fitted power law. We determine the Ly|$\alpha$| line flux by directly integrating the continuum-subtracted spectrum over |$1215.67 \pm 3\, \mathrm{\mathring{\rm A}}$|. The Ly|$\alpha$| EW is derived by dividing the Ly|$\alpha$| flux by the UV continuum flux at the position of Ly|$\alpha$| (|$\lambda = 1215.67\, \mathrm{\mathring{\rm A}}$|), which is calculated from the power-law fit redwards of Ly|$\alpha$| as

|$f_\mathrm{cont}(1215.67\, \mathrm{\mathring{\rm A}}) = A \times 1215.67^{\alpha},$| (1)

where A and |$\alpha$| are the fitting parameters. For strong emitters (|$\mathrm{EW}_0 \gt 10\, \mathrm{\mathring{\rm A}}$|), which often have Ly|$\alpha$| emission extending beyond |$\pm 3 \, \mathrm{\mathring{\rm A}}$|, we determine the integration window by fitting the Ly|$\alpha$| emission line with a Gaussian function. When the Ly|$\alpha$| |$\mathrm{EW}_0$| calculated over |$1215.67 \pm 3\, \mathrm{\mathring{\rm A}}$| is larger than |$10\, \mathrm{\mathring{\rm A}}$|, we recalculate the Ly|$\alpha$| flux by directly integrating the observed flux over |$\pm 3\, \sigma$| around the central wavelength of the Ly|$\alpha$| emission. Because the Ly|$\alpha$| emission often has an asymmetric or double-peaked profile, we use the directly integrated flux instead of the total area of the fitted Gaussian function. The typical Gaussian |$\sigma$| is |$\sim 1\, \mathrm{\mathring{\rm A}}$|. We assume that the Ly|$\alpha$| flux extending beyond the |$\pm 3\sigma$| range is negligible.
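The basic measurement described above (power-law continuum fit, continuum subtraction, direct integration over a fixed window) can be illustrated with a minimal Python sketch. It is not the pipeline used for the VANDELS spectra: the function name, the log-space least-squares fit, and the trapezoidal integration are assumptions made for illustration, and the Gaussian-based window for strong emitters is omitted.

```python
import numpy as np

LYA = 1215.67  # rest-frame Lya wavelength in Angstrom


def measure_lya(wave_rest, flux, half_width=3.0):
    """Return (line flux, rest-frame EW) for a rest-frame spectrum.

    wave_rest : rest-frame wavelengths in Angstrom (ascending)
    flux      : flux density per Angstrom on the same grid
    """
    wave_rest = np.asarray(wave_rest, dtype=float)
    flux = np.asarray(flux, dtype=float)

    # Fit f = A * lambda^alpha redwards of Lya as a straight line in
    # log-log space (assumption: only positive-flux bins are used).
    cont = (wave_rest > 1250.0) & (wave_rest < 2500.0) & (flux > 0)
    alpha, log_a = np.polyfit(np.log10(wave_rest[cont]), np.log10(flux[cont]), 1)
    continuum = 10.0 ** log_a * wave_rest ** alpha

    # Integrate the continuum-subtracted spectrum over LYA +/- half_width
    # with the trapezoidal rule.
    line = np.abs(wave_rest - LYA) <= half_width
    x = wave_rest[line]
    y = flux[line] - continuum[line]
    line_flux = np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x))

    # EW = line flux divided by the fitted continuum at the Lya wavelength.
    ew0 = line_flux / (10.0 ** log_a * LYA ** alpha)
    return line_flux, ew0
```

For noisy spectra a fit in linear flux space with inverse-variance weights would likely be preferable to the log-space fit used here.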

To calculate the error of the Ly|$\alpha$| flux, we use a Monte Carlo simulation. We add random Gaussian errors, drawn according to the noise spectrum, to each wavelength bin of the spectrum. We generate 1000 noise-added spectra and calculate the Ly|$\alpha$| flux of each spectrum in the same way as described above. The flux and 1|$\sigma$| error are given by the median and the 16th–84th percentile range of the distribution of the flux measurements for the 1000 spectra, respectively. We visually inspect the spectra and remove objects whose Ly|$\alpha$| lines are contaminated by severe noise or sky emission lines.
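A minimal sketch of this Monte Carlo error estimate is given below. The function name and the `measure` callable (any function mapping a flux array to a scalar, such as a line-flux measurement) are hypothetical, and the noise in different bins is assumed to be independent.

```python
import numpy as np


def monte_carlo_flux(flux, noise, measure, n_iter=1000, seed=0):
    """Median and lower/upper 1-sigma errors of `measure` over noise realisations.

    flux, noise : spectrum and its 1-sigma noise spectrum (same shape)
    measure     : callable mapping a flux array to a scalar
    """
    flux = np.asarray(flux, dtype=float)
    noise = np.asarray(noise, dtype=float)
    rng = np.random.default_rng(seed)

    # Each wavelength bin receives Gaussian noise scaled by its own sigma.
    realisations = flux + rng.normal(size=(n_iter, flux.size)) * noise
    values = np.array([measure(spec) for spec in realisations])

    p16, p50, p84 = np.percentile(values, [16, 50, 84])
    return p50, p50 - p16, p84 - p50
```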

Talia et al. (2023) calculate the EW of VANDELS galaxies by fitting a model spectrum that includes the stellar continuum and emission and absorption lines. Their EW measurements are mostly consistent with those of this study for objects with positive EW values (i.e. Ly|$\alpha$|-detected objects). There is some systematic offset for negative EW values because we do not model Ly|$\alpha$| absorption. This does not affect our study because we are only interested in whether or not a galaxy shows Ly|$\alpha$| in emission. As described in Section 3, we select LAEs based on visual inspection.

We use the Ly|$\alpha$| flux and EW of the MUSE sample measured in Kerutt et al. (2022). The measurement procedures of Kerutt et al. (2022) are as follows. Ly|$\alpha$| emission lines are fitted using an asymmetric Gaussian function. To calculate EW, continuum flux is estimated using HST bands. UV continuum slope is taken into account by using two HST bands if possible. The UV continuum is assumed to be a power-law function (⁠|$f (\lambda) \propto \lambda ^\alpha$|⁠).

The flux calibration of VANDELS spectra is carried out using i-band photometry. This corrects for the flux loss in VANDELS spectra. The measurement of Ly|$\alpha$| flux using the MUSE IFU can be performed beyond the aperture used in photometry. This leads to the extraction of an extended component of Ly|$\alpha$| emission, whereas the VANDELS observation captures only the Ly|$\alpha$| flux from the core component. To calibrate this difference, we compare 25 objects appearing in both the VANDELS and MUSE catalogues. The median and 16th–84th percentile range of the ratio of the VANDELS Ly|$\alpha$| flux to the MUSE Ly|$\alpha$| flux is |$0.55^{+0.56}_{-0.31}$|. We regard this value as the correction factor and multiply the Ly|$\alpha$| flux and EW of MUSE galaxies by 0.55.
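The derivation of such a correction factor can be sketched as follows. The function name is hypothetical and the arrays stand for the matched VANDELS and MUSE fluxes; no values from the 25 matched objects are reproduced here.

```python
import numpy as np


def flux_ratio_summary(flux_vandels, flux_muse):
    """Median VANDELS/MUSE flux ratio with upper and lower percentile offsets."""
    ratio = np.asarray(flux_vandels, dtype=float) / np.asarray(flux_muse, dtype=float)
    p16, p50, p84 = np.percentile(ratio, [16, 50, 84])
    # Returned as (median, +upper, -lower), matching the a^{+b}_{-c} notation.
    return p50, p84 - p50, p50 - p16


def apply_correction(muse_values, factor):
    """Scale MUSE fluxes (or EWs) by the median ratio to match VANDELS."""
    return factor * np.asarray(muse_values, dtype=float)
```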

2.2 Broad-band photometry

In addition to the spectroscopic measurements, we collect BB photometry to obtain the physical properties of the sample. BB photometry of VANDELS galaxies is included in the VANDELS catalogue (Garilli et al. 2021). Near-UV to near-IR photometry is obtained with HST and Spitzer/IRAC in the central regions of UDS and CDFS. Photometric data for the regions that are not covered by HST and Spitzer/IRAC are taken from ground-based observations, such as CFHT, Subaru, VISTA, and UKIRT.

Since the BB photometry of the MUSE sample is not included in the MUSE catalogue, it is extracted from CANDELS (Guo et al. 2013), 3D-HST (Skelton et al. 2014), and UVUDF (Rafelski et al. 2015). Cross-matching of the sources and UV continuum counterparts is performed in Kerutt et al. (2022). Following their results, we obtain 720 objects matched in CANDELS, 354 in 3D-HST, and 364 in UVUDF after excluding duplicates among these surveys. Table 1 summarizes the number of galaxies and the list of available filters for the VANDELS and MUSE surveys.

Table 1.

The overview of the number of galaxies and the list of available filters for the VANDELS and MUSE surveys.

Spectroscopic survey; Target field; Photometry; Number of galaxies; Available filters
VANDELS; CDFS; HST; 410; U, F435W, F606W, F755W, F814W, F850LP, F098M, F105W, F125W, F160W, |$K_s$|, IRAC1, IRAC2
VANDELS; CDFS; Ground; 335; U, B, IA484, IA527, F606W, IA624, |$R_c$|, IA679, IA738, IA767, Z, F850LP, Y, J, H, |$K_s$|, IRAC1, IRAC2
VANDELS; UDS; HST; 397; u, B, V, F606W, r, i, F814W, z, Y, F125W, J, F160W, H, |$K_s$|, K, IRAC1, IRAC2
VANDELS; UDS; Ground; 355; u, B, V, r, i, z, |$z^{++}$|, Y, J, H, K, IRAC1, IRAC2
MUSE; GOODS-S; CANDELS; 720; U, F435W, F606W, F755W, F814W, F850LP, F098M, F105W, F125W, F160W, |$K_s$|, IRAC1, IRAC2, IRAC3, IRAC4
MUSE; GOODS-S; 3D-HST; 65; U, F435W, B, V, F606W, R, |$R_c$|, F755W, I, F850LP, F125W, J, F140W, F160W, H, |$K_s$|, IRAC1, IRAC2, IRAC3, IRAC4
MUSE; COSMOS; 3D-HST; 289; u, B, g, V, F606W, r, i, F814W, z, |$z^{++}$|, Y, F125W, J, F140W, F160W, H, |$K_s$|, IRAC1, IRAC2, IRAC3, IRAC4
MUSE; GOODS-S; UVUDF; 364; F435W, F606W, F755W, F850LP, F105W, F125W, F140W, F160W

2.3 SED fitting

We estimate the galaxy properties using the spectral energy distribution (SED) fitting code cigale (Boquien et al. 2019). We adopt the single stellar population model of Bruzual & Charlot (2003) and the Chabrier initial mass function (Chabrier 2003). We adopt a |$\tau$|-model star formation history with |$\tau = 10\!-\!120\, \mathrm{Myr}$|. The search ranges of the model parameters are |$4\!-\!1600\, \mathrm{Myr}$| for age, |$0.05\!-\!1Z_\odot$| for metallicity, |$0.0 \lt E(B-V) \lt 1.0$| for dust attenuation, and |$-4 \lt \log U \lt -2$| for the ionization parameter. The dust attenuation curve follows the Calzetti extinction law (Calzetti et al. 2000). The redshift is fixed at the spectroscopic redshift, which is taken from the catalogue of Garilli et al. (2021) for the VANDELS sample and Schmidt et al. (2021) for the MUSE sample. For more information about the spectroscopic redshift measurement, see Pentericci et al. (2018) and Schmidt et al. (2021) for VANDELS and MUSE, respectively.

We remove galaxies that fail SED fitting from the sample. To make sure that SED fitting is performed correctly, we exclude galaxies with reduced |$\chi ^2 \gt 2.5$|. Although the redshifts of the sample are spectroscopically confirmed, we also require that the photometric redshift is accurately determined. The reason is that, when we apply the prediction model to galaxies that have no spectroscopic observation, we use the photometric redshift for SED fitting; in Section 2.4, for example, we fix the redshift at the photometric redshift during SED fitting. If the photometric redshift is poorly determined, SED fitting returns incorrect physical properties, and the prediction model cannot accurately infer LAEs from them. For this reason, we require |$\Delta z =|z_\mathrm{phot}-z_\mathrm{spec}|/(1+z) \lt 0.15$|, using the photometric redshifts from the catalogues of VANDELS, CANDELS, 3D-HST, and UVUDF. As shown in Fig. 2, most of the faint, lower mass objects come from the MUSE survey, and they are all classified as LAEs. This is because the MUSE survey is based on emission line detection while the VANDELS survey is a mass-limited sample. To mitigate the impact of this difference on the prediction model, we remove objects with |$M_* \lt 10^8 M_\odot$|. Active galactic nuclei that are identified by broad emission lines in the VANDELS survey (Garilli et al. 2021) are excluded from the final sample. For the MUSE sample, we remove two Type 2 QSOs that are mentioned in Urrutia et al. (2019). The training data set contains 926 galaxies from VANDELS and 507 from MUSE.
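The quality cuts listed above can be expressed as a boolean mask, as in the following minimal sketch. The function and argument names are hypothetical, and normalizing |$\Delta z$| by the spectroscopic redshift is an assumption, since the text writes the denominator simply as |$(1+z)$|. The AGN and QSO exclusions are not included.

```python
import numpy as np


def training_sample_mask(z_phot, z_spec, chi2_red, log_mstar):
    """True for galaxies passing the photo-z, chi^2, and stellar-mass cuts."""
    z_phot = np.asarray(z_phot, dtype=float)
    z_spec = np.asarray(z_spec, dtype=float)
    chi2_red = np.asarray(chi2_red, dtype=float)
    log_mstar = np.asarray(log_mstar, dtype=float)

    # Assumption: (1 + z) in the Delta z definition uses z_spec.
    dz = np.abs(z_phot - z_spec) / (1.0 + z_spec)

    good_photoz = dz < 0.15
    good_fit = chi2_red <= 2.5        # galaxies with reduced chi^2 > 2.5 are excluded
    massive_enough = log_mstar >= 8.0  # objects with M* < 10^8 Msun are removed
    return good_photoz & good_fit & massive_enough
```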

Figure 2.

The distribution of six physical properties (⁠|$\beta$|⁠, |$E(B-V)$|⁠, SFR, |$M_\mathrm{UV}$|⁠, |$M_*$|⁠, and age). Blue, orange, and green histograms show non-LAEs from VANDELS, LAEs from VANDELS, and LAEs from MUSE, respectively. Galaxies taken from MUSE are all classified as LAEs because MUSE catalogue selects objects based on line detections.

2.4 COSMOS2020 and SC4K samples

To validate the trained model with galaxies that have no spectroscopic measurement, we use two photometric samples: COSMOS2020 and SC4K. COSMOS2020 (Weaver et al. 2022) provides a deep multiband catalogue of galaxies, with photometric redshifts derived from UV to IR photometry. The majority of the galaxies in the COSMOS2020 catalogue are expected to be non-LAEs.

SC4K (Sobral et al. 2018a) is an NB and medium-band (MB) selected LAE catalogue. SC4K uses 16 filters to identify |$\sim 4000$| LAEs in the COSMOS field at several discrete redshift ranges at |$2 \lt z \lt 6$|⁠. To estimate the physical properties of SC4K LAEs, we obtain BB photometry by cross-matching with the COSMOS2020 catalogue. The number of SC4K LAEs matched in the COSMOS2020 catalogue is 3453.

For a fair comparison of the two catalogues, we select galaxies whose photometric redshift from the COSMOS2020 catalogue falls within the redshift ranges of SC4K LAEs. After cross-matching the SC4K and COSMOS2020 samples, the LAE candidates of SC4K account for only 1.5 per cent of the COSMOS2020 galaxies within the redshift ranges of the SC4K LAEs. Note that LAE candidates of SC4K, which show a flux excess in NB or MB, have large EWs with |$\mathrm{EW}_0\gt 25\, \mathrm{\mathring{\rm A}}$| (Sobral et al. 2018a), while the COSMOS2020 sample may include LAEs with smaller EWs.

The physical properties of COSMOS2020 and SC4K galaxies are estimated by SED fitting using cigale in the same way as in Section 2.3. Because spectroscopic redshifts are not available, we fix the redshift at the photometric redshift for COSMOS2020 galaxies. The redshift of each SC4K LAE is fixed at the value for which Ly|$\alpha$| falls at the central wavelength of the NB/MB filter in which it is detected.

As COSMOS2020 contains a wide variety of galaxies, we select galaxies using the same selection criteria as for the training data set. The criteria are at least ten photometric bands with |$S/N \gt 5$|, |$-22 \lt M_\mathrm{UV} \lt -18$|, |$M_* \gt 10^8M_\odot$|, and reduced |$\chi ^2 \lt 2.5$|. For the SC4K sample, we apply only the criteria of |$-22 \lt M_\mathrm{UV} \lt -18$|, |$M_* \gt 10^8M_\odot$|, and reduced |$\chi ^2 \lt 2.5$|, omitting the |$S/N$| requirement, because SC4K LAEs are detected in fewer filters than COSMOS2020 galaxies. The resulting samples contain 67 068 galaxies from COSMOS2020 and 2273 from SC4K.
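A minimal sketch of these photometric selection criteria is shown below. The function names, the array layout (one row per galaxy, one column per band), and the use of NaN for missing bands are assumptions for illustration; the `require_snr` switch mimics dropping the |$S/N$| requirement for a sample such as SC4K.

```python
import numpy as np


def n_bands_detected(flux, flux_err, snr_min=5.0):
    """Number of bands per galaxy with S/N > snr_min.

    flux, flux_err : arrays of shape (n_galaxies, n_bands); NaN marks a missing band.
    """
    flux = np.asarray(flux, dtype=float)
    flux_err = np.asarray(flux_err, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        snr = flux / flux_err
    # Comparisons with NaN are False, so missing bands are never counted.
    return np.sum(snr > snr_min, axis=1)


def photometric_sample_mask(n_det, m_uv, log_mstar, chi2_red, require_snr=True):
    """Apply the M_UV, stellar-mass, chi^2, and (optionally) S/N-count cuts."""
    m_uv = np.asarray(m_uv, dtype=float)
    log_mstar = np.asarray(log_mstar, dtype=float)
    chi2_red = np.asarray(chi2_red, dtype=float)
    mask = (m_uv > -22.0) & (m_uv < -18.0) & (log_mstar > 8.0) & (chi2_red < 2.5)
    if require_snr:
        mask = mask & (np.asarray(n_det) >= 10)
    return mask
```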

2.5 JWST sample

We apply the prediction model to galaxies detected by JWST observation in Section 4.4. We use public JWST imaging data from CEERS (ERS-1345; Finkelstein et al. 2023), the COSMOS and UDS fields from PRIMER (GO-1837; Dunlop et al. 2021), and the GOODS-N/-S fields from FRESCO (Oesch et al. 2023) and GO-1963 (Williams et al. 2021). The photometry is performed with Grizli (Brammer 2023) on the HST and JWST images. The images and multiband catalogues are available online. We use the multiband catalogue of the public release version 6.

SED fitting is carried out in the same way as in Section 2.3 using cigale to estimate the physical properties of the galaxies. The redshift is fixed at the best-fitting photometric redshift determined by eazy (Brammer, van Dokkum & Coppi 2008).

Similar to Section 2.4, we select objects with |$-22 \lt M_\mathrm{UV} \lt -18$|, |$10^8 \lt M_*/M_\odot \lt 10^{11}$|, reduced |$\chi ^2 \lt 2.5$|, and a redshift range of |$2 \lt z \lt 6$|. Galaxies with a calculated stellar mass of |$M_* \gt 10^{11}\, M_\odot$| are excluded because their SED fits are poorly constrained. Additionally, to ensure the accuracy of the photometric redshift, we exclude galaxies whose flux at wavelengths shorter than Ly|$\alpha$| is more than 3|$\sigma$| higher than the model spectrum calculated by cigale. The number of galaxies satisfying the conditions above is 6938. The median stellar mass of the JWST sample is |$10^{8.5}\, M_\odot$|. The JWST sample thus contains less massive galaxies than the training data set, which has a median stellar mass of |$10^{9.1}\, M_\odot$|, and COSMOS2020, which has |$10^{9.3}\, M_\odot$|.

3 PREDICTION OF LAE FROM PHYSICAL PROPERTIES

3.1 Training data set

We construct a prediction model using the training data set consisting of galaxies from VANDELS and MUSE. For the prediction of LAEs, we use six parameters derived from SED fitting (SFR, stellar mass |$M_*$|, UV absolute magnitude |$M_\mathrm{UV}$|, age, UV slope |$\beta$|, and dust attenuation |$E(B-V)$|). We use these six parameters because they are thought to be related to Ly|$\alpha$| emission and can be estimated by SED fitting of BB photometry. To train the model, we categorize the VANDELS data into two labels: LAE and non-LAE. LAEs are selected based on visual inspection. It may seem better to classify the two with a fixed EW threshold, as in LAE surveys using the NB technique. However, LAEs detected with NB techniques tend to be biased towards high Ly|$\alpha$| EW. Since our galaxy sample is based on spectroscopic data, we can use the detailed analysis of the Ly|$\alpha$| flux described in Section 2. The measurement reveals a number of galaxies with faint (|$\mathrm{EW}_0 \lt 10\, \mathrm{\mathring{\rm A}}$|) but clear Ly|$\alpha$| emission lines. On the other hand, a non-LAE sometimes shows |$\mathrm{EW}_0 \sim 5\, \mathrm{\mathring{\rm A}}$| owing to the uncertainty of the spectrum. Thus, selecting LAEs with an EW threshold may miss these faint LAEs or suffer from contaminants. Furthermore, some of the galaxies have both emission and absorption at the position of the Ly|$\alpha$| line. This makes a precise measurement of the Ly|$\alpha$| EW difficult, because the uncertainty of the EW measurement increases as the EW decreases. For these reasons, we rely on visual inspection to select LAEs. As a result, the galaxies with |$\mathrm{EW}_0 \gt 7\, \mathrm{\mathring{\rm A}}$| are all classified as LAEs by our visual inspection. The boundary between LAEs and non-LAEs roughly corresponds to |$\mathrm{EW}_0=3\, \mathrm{\mathring{\rm A}}$|. We confirm that non-LAEs are distributed approximately around |$\mathrm{EW}_0=0\, \mathrm{\mathring{\rm A}}$|. Since the MUSE sample is originally selected by the detection of Ly|$\alpha$| emission lines, all MUSE galaxies are classified as LAEs. The final data set consists of 520 LAEs and 406 non-LAEs from VANDELS and 507 LAEs from MUSE. The data set is split into an 80 per cent training sample and a 20 per cent test sample.

Fig. 2 shows the distributions of the six physical properties, and Fig. 3 shows the relation between EW and the six physical properties. As shown by previous studies (Sobral et al. 2018b; Khostovan et al. 2019; Arrabal Haro et al. 2020; Santos et al. 2020), LAEs tend to have lower SFR, lower mass, fainter UV magnitude, younger age, bluer UV slope, and less dust content.

Figure 3.

The relation between the EW|$_0$| of Ly|$\alpha$| and six physical properties (⁠|$\beta$|⁠, |$E(B-V)$|⁠, SFR, |$M_\mathrm{UV}$|⁠, |$M_*$|⁠, and age). Colours are the same as Fig. 2. The error bar at the bottom right on each panel shows the typical error of each parameter and EW|$_0$|⁠.

3.2 Neural network

We use a neural network to predict the probability that a galaxy exhibits Ly|$\alpha$| emission lines. The input parameters are six physical properties derived from SED fitting (SFR, |$M_*$|⁠, |$M_\mathrm{UV}$|⁠, age, |$\beta$|⁠, and |$E(B-V)$|⁠).

We construct a neural network architecture using TensorFlow/Keras (Chollet et al. 2015; Developers 2023). The neural network consists of five hidden layers with 64 nodes per layer. To determine the optimal number of hidden layers, we test several architectures with different numbers of hidden layers. We find that networks with fewer than four hidden layers do not have enough capacity to separate LAEs and non-LAEs. On the other hand, more than five hidden layers do not improve the performance. We use ELU with a parameter |$\alpha =0$| as the activation function after each hidden layer. We apply a 25 per cent dropout after each hidden layer to prevent overfitting. The final layer of the neural network is a dense layer with a single node, using a sigmoid activation function to ensure that the output values are between 0 and 1. The learning rate is initially set to 0.001 and decays by a factor of 0.96 every 100 steps. We use the Adam optimizer (Kingma & Ba 2015) and the binary cross-entropy loss. Training proceeds for a maximum of 400 epochs, but it is stopped at the epoch that gives the minimum validation loss to avoid overfitting.
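To make the architecture concrete, the following NumPy sketch reproduces the forward pass and the learning-rate decay described above. It is not the TensorFlow/Keras implementation: the function names, the Glorot-uniform initialization, and the smooth (non-staircase) form of the decay are assumptions, and the training loop (Adam, binary cross-entropy, early stopping) is omitted.

```python
import numpy as np


def elu(x, alpha=0.0):
    # ELU: x for x > 0, alpha * (exp(x) - 1) otherwise.
    # For alpha = 0 the negative branch evaluates to zero.
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(rng, n_in=6, n_hidden=64, n_layers=5):
    """Weights and biases for n_layers hidden layers plus one output node."""
    sizes = [n_in] + [n_hidden] * n_layers + [1]
    params = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        # Assumed Glorot-uniform initialization (not specified in the text).
        limit = np.sqrt(6.0 / (fan_in + fan_out))
        weights = rng.uniform(-limit, limit, size=(fan_in, fan_out))
        params.append((weights, np.zeros(fan_out)))
    return params


def forward(params, x, rng=None, dropout=0.25):
    """Score in (0, 1) per galaxy. Pass `rng` to enable dropout (training mode)."""
    h = np.asarray(x, dtype=float)
    for weights, bias in params[:-1]:
        h = elu(h @ weights + bias)
        if rng is not None:
            keep = rng.random(h.shape) >= dropout
            h = h * keep / (1.0 - dropout)  # inverted dropout
    weights, bias = params[-1]
    return sigmoid(h @ weights + bias)[:, 0]


def learning_rate(step, lr0=1e-3, decay=0.96, every=100):
    # Assumed smooth exponential decay: a factor of `decay` per `every` steps.
    return lr0 * decay ** (step / every)
```

Calling `forward(params, x)` without `rng` corresponds to inference, where dropout is disabled.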

During the training, we employ 5-fold cross-validation in which the training sample is divided into five parts, each with 20 per cent. We train the network five times, each time using a different 20 per cent part as the validation set and the remaining 80 per cent part for training. After that, we take the average of the outputs of five networks. This ensures that our network does not overfit to any specific subset of the data.
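The fold construction and the averaging of the five networks can be sketched as follows. The function names are hypothetical, and each trained network is represented only as a callable that maps inputs to scores.

```python
import numpy as np


def kfold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, val_idx); every sample is used for validation exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for k in range(n_folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        yield train_idx, folds[k]


def ensemble_predict(models, x):
    """Average the outputs of the per-fold models (each a callable x -> scores)."""
    return np.mean([model(x) for model in models], axis=0)
```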

We also employ a Monte Carlo approach to account for the uncertainty in the input physical properties. We add random Gaussian errors to the six input parameters, with the standard deviation of each error set to the uncertainty of the corresponding parameter in the cigale output. For each run, the network is trained with an uncertainty-added data set in the same way as described above. We repeat this procedure ten times, keeping the sample splitting of the 5-fold cross-validation fixed throughout. The output score is the average of the outputs of the ten uncertainty-added models. We regard this score as the probability of LAE, |$P(\mathrm{LAE})$|, though it is not strictly a probability. This ensemble approach improves the robustness of our predictions and ensures that the model takes into account the uncertainty in the measured physical properties.
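The Monte Carlo procedure can be sketched schematically as below. The function `toy_train` is a hypothetical stand-in for the actual network training (a one-dimensional logistic rule on the first feature), and evaluating each model on the unperturbed inputs is an assumption of this sketch; only the perturb, retrain, and average structure follows the text.

```python
import math
import random


def perturb(features, errors, rng):
    """Add Gaussian noise to every input, with sigma set to its quoted uncertainty."""
    return [[rng.gauss(value, sigma) for value, sigma in zip(row, err)]
            for row, err in zip(features, errors)]


def toy_train(features, labels):
    """Hypothetical stand-in for network training: logistic rule on feature 0."""
    pos = [row[0] for row, y in zip(features, labels) if y == 1]
    neg = [row[0] for row, y in zip(features, labels) if y == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    midpoint = 0.5 * (mean_pos + mean_neg)
    sign = 1.0 if mean_pos > mean_neg else -1.0
    return lambda rows: [1.0 / (1.0 + math.exp(-sign * (r[0] - midpoint))) for r in rows]


def monte_carlo_score(features, errors, labels, train_fn, n_realizations=10, seed=0):
    """Average the outputs of models trained on n noise realizations of the inputs."""
    rng = random.Random(seed)
    totals = [0.0] * len(features)
    for _ in range(n_realizations):
        model = train_fn(perturb(features, errors, rng), labels)
        for i, p in enumerate(model(features)):
            totals[i] += p
    return [t / n_realizations for t in totals]


# Toy data: a single feature playing the role of the UV slope, with 0.1 uncertainty.
features = [[-2.5], [-2.6], [-2.4], [-2.5], [-1.5], [-1.4], [-1.6], [-1.5]]
errors = [[0.1]] * len(features)
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(monte_carlo_score(features, errors, labels, toy_train))
```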

3.3 Validation

Here, we check how well the network performs with the remaining 20 per cent of the sample, the test sample, which we do not use for training. Fig. 4 shows the relation between the output |$P(\mathrm{LAE})$| of the trained network and the input physical parameters. Similar to Fig. 3, |$P(\mathrm{LAE})$| has a strong correlation with |$\beta$| and |$M_*$| and a moderate correlation with SFR and |$M_\mathrm{UV}$|, whereas it is less sensitive to age and |$E(B-V)$|. Fig. 5 shows the distribution of the output probability for the test sample. The LAEs in the test sample tend to have a higher probability, while non-LAEs show the opposite trend. As shown in Fig. 5, no galaxy has |$P(\mathrm{LAE}) \lt 0.1$|. This is because the parameter distributions of non-LAEs typically overlap with those of LAEs: there exist LAEs with |$\beta \gt -2$| or |$E(B-V) \gt 0.2$|, where non-LAEs are dominant. As a result, none of the input parameters is a distinctive indicator of non-LAEs. On the other hand, a large fraction of LAEs have |$P(\mathrm{LAE}) \gt 0.9$|. This is because the training data set does not include low-mass non-LAEs, so almost all of the galaxies with |$M_\mathrm{UV} \gt -19$| or |$M_* \lt 10^9\, M_\odot$| are classified as LAEs. In such a case, the network preferentially predicts low-mass galaxies as LAEs with |$P(\mathrm{LAE}) \gt 0.9$|. Fig. 6 shows the receiver operating characteristic (ROC) curve of the trained model. When we define LAEs as |$P(\mathrm{LAE}) \gt 0.7$|, the predicted LAE sample has a 77 per cent true positive rate (TPR) and a 14 per cent false positive rate (FPR). Similarly, for the threshold of |$P(\mathrm{LAE}) \gt 0.5$|, the TPR is 91 per cent and the FPR is 38 per cent, and for the threshold of |$P(\mathrm{LAE}) \gt 0.8$|, the TPR is 67 per cent and the FPR is 7 per cent. The threshold of |$P(\mathrm{LAE}) \gt 0.8$| is preferable when the purity of a sample is important, while |$P(\mathrm{LAE}) \gt 0.5$| is suitable for reducing missed detections despite potential contamination. It should be noted that these values show no z-dependence, and the ROC curve does not change significantly with redshift. The area under the ROC curve (AUC) is 0.88. If we select LAEs based only on the UV slope |$\beta$|, the AUC is 0.82. Similarly, when selecting LAEs based only on |$M_*$| or only on SFR, the AUCs are 0.84 and 0.77, respectively. This demonstrates the advantage of the neural network, which handles multiple variables simultaneously, over selection based on a single variable.
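The threshold-dependent rates and the AUC quoted above follow from the standard definitions, which can be written out in a short sketch. The scores and labels below are invented toy values, not the test sample of this work, and the pairwise formulation of the AUC (the probability that a randomly chosen LAE receives a higher score than a randomly chosen non-LAE) is one of several equivalent ways to compute it.

```python
def rates(scores, labels, threshold):
    """TPR = TP / (TP + FN) and FPR = FP / (FP + TN) for the selection score > threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s <= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s > threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s <= threshold)
    return tp / (tp + fn), fp / (fp + tn)


def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: label 1 = LAE, label 0 = non-LAE.
toy_scores = [0.95, 0.85, 0.75, 0.60, 0.65, 0.30, 0.20, 0.72]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
for threshold in (0.5, 0.7, 0.8):
    print(threshold, rates(toy_scores, toy_labels, threshold))
print(auc(toy_scores, toy_labels))
```

Raising the threshold lowers both rates, which is the purity versus completeness trade-off described in the text.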

Figure 4.

The relation between the output probability of |$P(\mathrm{LAE})$| and the six input parameters for VANDELS LAEs (orange), MUSE LAEs (green), and non-LAEs (blue).

Figure 5.

The distribution of output probability of |$P(\mathrm{LAE})$| for LAEs (orange), and non-LAEs (blue).

Figure 6.

The ROC curve of the trained model. True positive rate (TPR) and false positive rate (FPR) are defined as |$\mathrm{TPR} = \mathrm{TP}/(\mathrm{TP} + \mathrm{FN})$| and |$\mathrm{FPR} = \mathrm{FP}/(\mathrm{TN} + \mathrm{FP})$|, where TP, FN, FP, and TN are true positive, false negative, false positive, and true negative, respectively. The value along the curve indicates the threshold of |$P(\mathrm{LAE})$|.

As described above, the MUSE sample occupies the fainter parameter space, all of which are classified as LAEs. To test the impact of adding the MUSE sample, we train the network with only the VANDELS sample. The AUC does not change when we classify the VANDELS LAEs and non-LAEs using the model. Therefore, the addition of the MUSE sample does not degrade the performance of the model. Adding the MUSE sample extends the range of the input parameters, in particular fainter objects, as shown in Fig. 2 without sacrificing the performance of the prediction model. |$P(\mathrm{LAE})$| is found not to correlate with Ly|$\alpha$| flux or EW. Therefore, unfortunately, it seems difficult to predict Ly|$\alpha$| flux and EW with this model.

Napolitano et al. (2023) detect LAEs using a random forest classifier. Fig. 10 of Napolitano et al. (2023) reports the total numbers of TP, FP, TN, and FN for their test sample. From these numbers, the TPR is 65 per cent and the FPR is 9 per cent. This is similar to our result when adopting the threshold of |$P(\mathrm{LAE}) \gt 0.8$|. Although various techniques are available for this kind of analysis, this comparison shows that a neural network and a random forest exhibit similar performance. In this study, we select a neural network due to its robust performance and ability to handle complex data structures.

4 DISCUSSION

4.1 Feature importance

We discuss which input parameters have the strongest influence on the determination of |$P(\mathrm{LAE})$|. The permutation feature importance (PFI; Altmann et al. 2010) is one method to evaluate the relative importance of the input parameters. It is derived by shuffling the values of a specific input variable in the data set and assessing the impact on the model’s performance. When an input parameter is important to the model output, the model performance degrades significantly after its values are shuffled. On the other hand, a parameter is less important if the model performance does not vary much. Therefore, the difference in performance can be regarded as the importance of the input parameter. Fig. 7 shows the importance of the input parameters derived by the PFI method. |$\beta$|, |$M_\mathrm{UV}$|, and |$M_*$| are found to significantly affect the model’s output, as inferred from the correlations shown in Fig. 4. On the other hand, SFR, |$E(B-V)$|, and age show importance consistent with zero. However, the PFI method can provide misleading results for correlated features, because the model can infer the shuffled information from other correlated parameters. Therefore, low importance does not necessarily mean that a parameter is irrelevant to the model output when it is correlated with other parameters. Another possibility is that a parameter with importance consistent with zero has a correlated parameter that is more strongly connected to the model output. For example, |$E(B-V)$| is physically connected with |$\beta$|, and this correlation can degrade the measured importance of |$E(B-V)$|. Similarly, SFR has a strong correlation with |$M_*$|, and the importance of SFR might be underestimated. The age does not show an apparent correlation with the other parameters; therefore, we interpret that the age does not impact the model output. This might also be due to the large uncertainty of the age estimate from SED fitting, as shown in Fig. 3.
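The shuffling procedure can be made concrete with a minimal sketch. The performance metric (plain accuracy here), the number of repeats, and the toy model are assumptions chosen for illustration; the text does not specify which metric is used for Fig. 7.

```python
import random


def accuracy(predictions, labels):
    return sum(1 for p, y in zip(predictions, labels) if (p > 0.5) == (y == 1)) / len(labels)


def permutation_importance(predict, features, labels, metric, n_repeats=10, seed=0):
    """Mean drop in the metric after shuffling one input column at a time."""
    rng = random.Random(seed)
    baseline = metric(predict(features), labels)
    importances = []
    for col in range(len(features[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in features]
            rng.shuffle(shuffled)
            permuted = [row[:col] + [value] + row[col + 1:]
                        for row, value in zip(features, shuffled)]
            drops.append(baseline - metric(predict(permuted), labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy model: the output depends on column 0 only; column 1 is ignored.
toy_features = [[1.0, float(i % 3)] for i in range(10)] + [[0.0, float(i % 3)] for i in range(10)]
toy_labels = [1] * 10 + [0] * 10


def toy_predict(rows):
    return [0.9 if row[0] > 0.5 else 0.1 for row in rows]


print(permutation_importance(toy_predict, toy_features, toy_labels, accuracy))
```

In this toy case the ignored column has exactly zero importance. If the second column were instead a near copy of the first and the model used both, shuffling either one alone would degrade the output less than their joint relevance would suggest, which is the caveat for correlated features noted above.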

Figure 7.

Relative importance of the six parameters estimated by permutation feature importance. The error bar shows the standard deviation of values derived for 50 networks (5-fold cross-validation |$\times$| 10 Monte Carlo simulations).

4.2 Misclassified objects

UV slope |$\beta$| is the primary factor determining |$P(\mathrm{LAE})$|; a bluer UV slope leads to the emergence of Ly|$\alpha$| emission. However, the training data set contains a distinct population of LAEs with a red UV slope. The prediction model tends to miss these red LAEs, leading to false negatives. Several observational studies suggest the existence of old LAEs, with ages significantly above 100 Myr (Finkelstein et al. 2009; Pentericci et al. 2009; Iani et al. 2024). Shapley et al. (2001) proposed two distinct old and young populations in Ly|$\alpha$| bright phases: a galaxy appears as an LAE (blue LAE) during its initial starburst epoch when it is still dust-free, then it becomes a dusty Lyman break galaxy (LBG) with Ly|$\alpha$| absorption after ISM enrichment. Ly|$\alpha$| emission could appear again as a red LAE when the galaxy becomes less dusty, at least along the line of sight, due to an outflow when it is older than a few |$\times 10^8\, \mathrm{yr}$|. Understanding the physical mechanism of red LAEs and adopting other parameters that are also sensitive to Ly|$\alpha$| emission are required to correctly detect red LAEs. As shown in Fig. 4, there is a strong correlation between the probability of LAEs and the stellar mass, with low-mass galaxies showing higher probabilities of LAEs. This result seems to be reasonable given that lower mass galaxies tend to emit Ly|$\alpha$|. However, this correlation contributes to false positives because not all low-mass galaxies emit Ly|$\alpha$|. Besides, the prediction model might be biased towards misclassifying low-mass galaxies as LAEs because all lower mass galaxies in the training data set are classified as LAEs, as shown in Fig. 4. Even if a galaxy has a blue UV continuum, Ly|$\alpha$| emission can be obscured due to a viewing angle effect. In a patchy ISM scenario, an ionized channel that opens towards the line of sight boosts the escape of Ly|$\alpha$| photons, but otherwise, Ly|$\alpha$| photons are scattered by a neutral cloud (Smith et al. 2019).

Galaxies with the most pronounced Ly|$\alpha$| emission generally have the smallest Ly|$\alpha$| velocity offsets (e.g. Tang et al. 2024). This may reflect lower H i column densities that allow Ly|$\alpha$| to escape without diffusing to large velocities. Furthermore, introducing a factor related to H i content, such as size (Pucha et al. 2022), may improve the model performance. As described in Section 2, we added low-mass LAEs from the MUSE survey to complement the VANDELS sample. This sample bias can influence the prediction results as discussed above. However, our prediction model realizes the classification of LAEs for at least moderate-mass galaxies with the current limited data set. For the more accurate prediction of low-mass LAEs, we need to construct a prediction model trained with an unbiased data set regarding stellar mass using deeper observations. While adding the MUSE sample to the training data set does not degrade the classification of moderate-mass LAEs mainly from the VANDELS survey as shown in Section 3.3, including even lower mass non-LAEs from e.g. JWST surveys will enhance the model’s performance, especially for classification of low-mass galaxies.

4.3 Validation with other galaxy survey data

To test whether the trained model can be applied to other observational samples (especially samples without spectroscopy), we apply the model to two publicly available galaxy samples, COSMOS2020 and SC4K. We calculate |$P(\mathrm{LAE})$| for COSMOS2020 and SC4K galaxies using the trained neural network. Fig. 8 shows the distribution of |$P(\mathrm{LAE})$| of COSMOS2020 and SC4K galaxies. As expected, SC4K LAEs show higher |$P(\mathrm{LAE})$|, while COSMOS2020 galaxies have a lower probability. This result shows that the neural network is applicable to galaxies for which spectroscopic redshifts are not available. Assuming all of the SC4K galaxies are LAEs, the TPR is 72 per cent when selecting LAEs with |$P(\mathrm{LAE}) \gt 0.7$|. Note that SC4K selects LAE candidates, which may contain contaminants. While the |$P(\mathrm{LAE})$| distribution of the COSMOS2020 galaxies is biased towards non-LAEs as expected, it has relatively more galaxies with |$P(\mathrm{LAE})\gt 0.7$| than in Fig. 5, even though many of them are expected to be non-LAEs. This may be attributed to the fact that COSMOS2020 contains a small portion of LAEs, especially those with |$\mathrm{EW}_0 \lt 25\, \mathrm{\mathring{\rm A}}$|.

Figure 8.

The distribution of |$P(\mathrm{LAE})$| for COSMOS2020 (blue) and SC4K (orange) inferred by the prediction model.

4.4 Application to JWST photometric data

JWST has been providing a deep and detailed view of distant galaxies in the middle of the reionization phase. JWST provides narrow but deep optical–to–IR multicolour data, but Ly|$\alpha$| optical spectroscopy at |$z\lt 7$| is limited, and Ly|$\alpha$| from the reionization epoch at |$z\gt 7$| is hampered by IGM absorption. By applying our prediction model to these data, we can not only predict the presence or absence of Ly|$\alpha$| emission lines at |$z\lt 7$|, but also predict whether galaxies in the reionization era intrinsically emit Ly|$\alpha$| before undergoing IGM attenuation. By observing Ly|$\alpha$| emission lines for the sample, a more accurate Ly|$\alpha$| fraction can be derived, which constrains the reionization history. In principle, this method can determine whether an observed non-LAE was originally an LAE, and thus it allows spatial mapping of the cosmic neutral fraction (e.g. Yoshioka et al. 2022) to reveal the patchy reionization process. We first show the results of applying the prediction model to post-reionization galaxies at |$3 \lt z \lt 6$|, which is the same redshift range as that of the galaxies in the training data set. Then, we apply it to reionization-era galaxies and constrain the size of ionized bubbles at |$z \gt 7$|.

4.4.1 LAE prediction

We derive |$P(\mathrm{LAE})$| of galaxies detected with JWST using the prediction model. The correlation between |$P(\mathrm{LAE})$| and the input parameters is shown as blue points in Fig. 9. The galaxies tend to have higher |$P(\mathrm{LAE})$| than those in the training data set because JWST observations detect fainter galaxies. As in Section 3, we define LAEs as galaxies predicted to have |$P(\mathrm{LAE}) \gt 0.7$|, and non-LAEs as those with |$P(\mathrm{LAE}) \lt 0.7$|.

Figure 9.

The relation between the six parameters and |$P(\mathrm{LAE})$| inferred by the prediction model for all JWST galaxies (blue) and spectroscopically confirmed LAEs (orange) in the JWST fields.

4.4.2 Comparison with spectroscopically confirmed LAEs

There are several spectroscopically confirmed LAEs in JWST fields. Urbano Stawinski et al. (2024) present 126 LAEs at |$2.8 \lt z \lt 6.5$| in the CEERS field, whose Ly|$\alpha$| emission was detected using Keck/DEIMOS before the launch of JWST in 2021 and confirmed by visual inspection. After the advent of JWST, Jones et al. (2024) and Saxena et al. (2024) detect 16 LAEs at |$5.6 \lt z \lesssim 8$| using JWST/NIRSpec observations. Jones et al. (2024) select LAEs with EW greater than the 3|$\sigma$| error. We evaluate the prediction model using these spectroscopically confirmed LAE samples. For Urbano Stawinski et al. (2024), 73 out of 126 LAEs are included in the JWST imaging catalogue, and the physical properties from the SED fitting described above are available for 64 galaxies with reduced |$\chi ^2\lt 2.5$|. Four LAEs of Jones et al. (2024) are included in the JWST imaging catalogue. The spectroscopic and photometric redshifts of these LAEs agree well. The predicted |$P(\mathrm{LAE})$| values of the spectroscopically confirmed LAEs are also shown in Fig. 9. Of the 68 spectroscopically confirmed LAEs, 62 (91 per cent) have |$P(\mathrm{LAE})$| higher than 0.7, satisfying the criterion defined above. While this high success rate might come from the fact that most of the JWST LAEs are low-mass galaxies, which bias the prediction model towards LAEs, the high |$P(\mathrm{LAE})$| values of galaxies with |$M_* \lt 10^9\, M_\odot$| are reasonable given their blue UV slope |$\beta \lt -2$|. When the sample is limited to LAEs with |$M_* \gt 10^9\, M_\odot$|, the prediction model shows a 67 per cent success rate (12 out of 18). In addition to the results in Section 4.3, this result shows that our prediction model is applicable to galaxy samples without spectroscopy, including those from JWST observations.

4.4.3 Ly|$\alpha$| fraction

Ly|$\alpha$| fraction defined as a ratio of the number of LAEs to star-forming galaxies existing in the Universe is widely used to constrain the evolution of the cosmic neutral fraction in EoR. Here we show the redshift evolution of Ly|$\alpha$| fraction at |$3 \lt z \lt 6$| predicted by the model. Because the EW distribution depends on |$M_\mathrm{UV}$|⁠, previous studies draw the Ly|$\alpha$| fraction at |$-21.75 \lt M_\mathrm{UV} \lt -20.25$| and |$-20.25 \lt M_\mathrm{UV} \lt -18.75$| separately. Many of the training data set galaxies have similar |$M_\mathrm{UV}$| to these two |$M_\mathrm{UV}$| ranges, so we calculate Ly|$\alpha$| fraction of JWST galaxies with the prediction model in the same |$M_\mathrm{UV}$| ranges.

Fig. 10 shows the evolution of the Ly|$\alpha$| fraction. We derive the uncertainty of the Ly|$\alpha$| fraction arising from a Bernoulli trial and from completeness and contamination in the LAE sample. The uncertainty from the Bernoulli trial is given by a binomial proportion confidence interval (Kusakabe et al. 2020). The completeness and contamination are estimated from the performance of the prediction model (Section 3.3). The Ly|$\alpha$| fraction increases at |$3 \lt z \lt 6$|, similar to the previous results, while the amplitude is systematically higher. This can be attributed to the difference in the definition of LAEs; previous studies select LAEs that have EW|$_0$| of Ly|$\alpha$| emission higher than |$25 \, \mathrm{\mathring{\rm A}}$|, whereas we select LAEs by visual inspection. In our selection, as described in Section 3.1, galaxies in the training data set with smaller |$\mathrm{EW}_0$| down to |$3\, \mathrm{\mathring{\rm A}}$| are also classified as LAEs. This leads to the detection of more LAEs by the prediction model, resulting in a higher Ly|$\alpha$| fraction in Fig. 10. To fairly compare the estimated Ly|$\alpha$| fraction with the previous studies, we estimate the fraction of LAEs with an |$\mathrm{EW}_0$| greater than |$25\, \mathrm{\mathring{\rm A}}$| among all LAEs. Using the training data set consisting of VANDELS and MUSE galaxies, we calculate the number ratio of LAEs with |$\mathrm{EW}_0 \gt 25 \, \mathrm{\mathring{\rm A}}$| to total LAEs. This number ratio depends on the UV absolute magnitude of galaxies: 0.22 and 0.48 for |$-21.75 \lt M_\mathrm{UV} \lt -20.25$| and |$-20.25 \lt M_\mathrm{UV} \lt -18.75$|, respectively. We convert the Ly|$\alpha$| fraction of the model-predicted JWST LAEs to that for |$\mathrm{EW}_0 \gt 25 \, \mathrm{\mathring{\rm A}}$| by multiplying by the above number ratios. After the correction, the expected Ly|$\alpha$| fraction |$X_\mathrm{Ly\alpha }^{25}$| is consistent with the previous results for both |$-21.75 \lt M_\mathrm{UV} \lt -20.25$| and |$-20.25 \lt M_\mathrm{UV} \lt -18.75$|. The Ly|$\alpha$| fraction reproduces the increasing trend at |$z \lt 6$|. Unfortunately, the number of galaxies to which SED fitting can be applied is still too small to show model predictions of the Ly|$\alpha$| fraction without the IGM attenuation at |$z\gt 6$|, but if this becomes possible in the future, it will help to understand the reionization history.
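The arithmetic of the EW correction and of the Bernoulli uncertainty can be sketched as follows. The galaxy counts are hypothetical, the Wilson score interval is used only as one common choice of binomial proportion confidence interval (the text does not state which form is adopted), and the completeness and contamination terms are omitted. Only the ratios 0.22 and 0.48 are taken from the text.

```python
import math

# Number ratio of LAEs with EW_0 > 25 A to all LAEs, per M_UV bin (values from the text).
EW25_RATIO = {"bright": 0.22, "faint": 0.48}  # -21.75 < M_UV < -20.25, -20.25 < M_UV < -18.75


def wilson_interval(n_success, n_total, z=1.0):
    """Wilson score interval for a binomial proportion (z = 1 is roughly 68 per cent)."""
    p = n_success / n_total
    denom = 1.0 + z * z / n_total
    centre = (p + z * z / (2.0 * n_total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n_total + z * z / (4.0 * n_total ** 2)) / denom
    return centre - half, centre + half


def lya_fraction(n_lae, n_total, mag_bin):
    """Predicted Lya fraction and its EW_0 > 25 A equivalent for one M_UV bin."""
    x = n_lae / n_total
    low, high = wilson_interval(n_lae, n_total)
    ratio = EW25_RATIO[mag_bin]
    # Simple rescaling; the uncertainty of the ratio itself is ignored in this sketch.
    return {"X": x, "X_interval": (low, high),
            "X25": x * ratio, "X25_interval": (low * ratio, high * ratio)}


# Hypothetical counts: 50 predicted LAEs among 100 galaxies in the faint bin.
print(lya_fraction(50, 100, "faint"))
```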

Figure 10.

The redshift evolution of Ly|$\alpha$| fraction for |$-21.75 \lt M_\mathrm{UV} \lt -20.25$| (left) and |$-20.25 \lt M_\mathrm{UV} \lt -18.75$| (right). Blue points are calculated from LAEs selected with |$P(\mathrm{LAE}) \gt 0.7$| from JWST galaxies. Orange points indicate the fraction of LAEs with |$\mathrm{EW}_0 \gt 25 \, \mathrm{\mathring{\rm A}}$| for JWST sources calibrated by the EW–|$M_\mathrm{UV}$| relation of the training data set. Black points show the previous results taken from Stark et al. (2011), Curtis-Lake et al. (2012), Ono et al. (2012), Schenker et al. (2014), Tilvi et al. (2014), and Cassata et al. (2015).

4.4.4 LAE spatial distribution

Fig. 11 shows the 3D distribution of predicted LAEs in the CEERS field. This demonstrates that the prediction model can detect LAEs continuously over a wide redshift range. There seem to be gaps in the distribution of LAEs at |$z=4.0$| and 5.5. This is because the number of galaxies included in the JWST catalogue is small at these redshifts. As described in Section 3.3, the performance of the prediction model is independent of redshift. The relation between the distribution of LAEs and the underlying large-scale structure is thought to be a key to understanding galaxy formation history. Fig. 11 also shows the density distribution of LAEs and non-LAEs in redshift slices of |$3.0 \lt z \lt 3.5$| and |$4.5 \lt z \lt 5.0$|. As discussed in Section 3.3, the network tends to predict |$P(\mathrm{LAE}) \gt 0.8$| when galaxies are faint (|$M_\mathrm{UV} \gt -19$|). To make a fair comparison of the distributions of LAEs and non-LAEs with balanced numbers of galaxies, we only use galaxies with |$M_\mathrm{UV} \lt -19.5$|. The density distribution is calculated using the kernel density estimation method with a Gaussian kernel. The kernel size is determined by the typical clustering scale of LAEs. The correlation length of clustering of LAEs at |$z=3$| is |$\sim 2\, \mathrm{cMpc}$| (Ouchi et al. 2010; Ito et al. 2021), which corresponds to |$0.02\, \mathrm{deg}$| at |$z=3$|. Interestingly, an LAE overdensity coincides with a non-LAE overdensity at |$3.0 \lt z \lt 3.5$|, while high-density regions of LAEs seem to be segregated from the non-LAE overdensities at |$4.5 \lt z \lt 5.0$|. This result may provide an interesting example regarding the arguments over whether LAEs trace the underlying dark matter distribution. Bielby et al. (2016) show that the clustering properties of LAEs can be understood as those on the low-mass side of LBGs. Sobral et al. (2018a) argue that bright LAEs are good tracers of the most overdense regions in the Universe. On the other hand, Shimakawa et al. (2017) report that LAEs are deficient in a protocluster core. Momose et al. (2021) propose a possible selection bias in which Ly|$\alpha$| emission behind H i IGM overdense regions is hard to detect. Ito et al. (2021) find that the bias parameter of LAEs differs from that of general star-forming galaxies selected with photometric redshift. It would be interesting to quantify the differences in the distributions of LAEs and non-LAEs, but that is beyond the scope of this paper.
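The overdensity map can be illustrated with a minimal two-dimensional Gaussian kernel density estimate. The positions and the grid are toy values, and taking the kernel standard deviation to be exactly |$0.02\, \mathrm{deg}$| is an assumption of this sketch; the text only states that the kernel size is set by the clustering scale. The last function evaluates |$\delta = \sigma /\bar{\sigma }-1$| as defined in the caption of Fig. 11.

```python
import math


def gaussian_kde_2d(points, grid, bandwidth):
    """Surface density on each grid node from a sum of isotropic 2D Gaussian kernels."""
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2)
    density = []
    for gx, gy in grid:
        total = sum(math.exp(-((gx - x) ** 2 + (gy - y) ** 2) / (2.0 * bandwidth ** 2))
                    for x, y in points)
        density.append(norm * total)
    return density


def overdensity(density):
    """delta = sigma / mean(sigma) - 1 on the same grid."""
    mean = sum(density) / len(density)
    return [d / mean - 1.0 for d in density]


# Toy sky positions in degrees: a small clump near (0.10, 0.10) and one isolated source.
toy_points = [(0.10, 0.10), (0.11, 0.10), (0.10, 0.11), (0.30, 0.30)]
toy_grid = [(i * 0.05, j * 0.05) for i in range(9) for j in range(9)]
delta = overdensity(gaussian_kde_2d(toy_points, toy_grid, bandwidth=0.02))
peak = delta.index(max(delta))
print(toy_grid[peak], max(delta))
```

Applying the same function separately to the LAE and non-LAE positions within a redshift slice gives two maps on a common grid that can be compared directly.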

Figure 11.

Top panel shows the 3D distribution of LAEs detected by the prediction model in the CEERS field. The grey shaded region indicates the redshift slices of |$3.0 \lt z \lt 3.5$| and |$4.5 \lt z \lt 5.0$|⁠, where the density distribution of LAEs and non-LAEs are shown in the bottom panels. Bottom four panels show the density distribution of LAEs (upper) and non-LAEs (lower) in the CEERS field in the redshift slices of |$3.0 \lt z \lt 3.5$| (left) and |$4.5 \lt z \lt 5.0$| (right), respectively. The colour map represents the overdensity (⁠|$\delta = \sigma /\bar{\sigma }-1$|⁠), where |$\sigma$| is the density distribution calculated with the kernel density estimation. The black points indicate LAEs (top) and non-LAEs (bottom).

4.4.5 Constraints on the ionized bubble size at |$z \sim 7.18,\ 7.49$| in the CEERS field

In the reionization epoch, Ly|$\alpha$| emission is basically difficult to detect because it is attenuated by the neutral H i IGM. Sometimes, however, Ly|$\alpha$| emission is observed even at |$z\gt 7$|⁠, especially in the overdense region of LAE, which is likely to be an ionized bubble. Jung et al. (2024), Chen et al. (2024), and Napolitano et al. (2024) detected Ly|$\alpha$| emission in the CEERS field at |$z=7.1$| and |$z=7.5$|⁠. Chen et al. (2024) expect large ionized bubbles (⁠|$R_\mathrm{ion} \gt 1\, \mathrm{pMpc}$|⁠), which are thought to be carved out by the LAEs and surrounding fainter galaxies. On the other hand, Napolitano et al. (2024) suggest moderate size ionized regions (⁠|$R_\mathrm{ion} \lesssim 1\, \mathrm{pMpc}$|⁠) around LAEs.

In the moderately neutral Universe, Ly|$\alpha$| emission is attenuated by the IGM. We assume that the relation between galaxy physical properties and the Ly|$\alpha$| emission does not change with redshift, so that our prediction model can predict whether galaxies in the EoR intrinsically emit Ly|$\alpha$|. Therefore, we can assess whether they have suffered IGM attenuation. The comparison of the predicted LAE probability of each galaxy with the observed or unobserved Ly|$\alpha$| emission can constrain the size of the ionized bubble. To achieve this goal, we use spectroscopically confirmed galaxies reported in Jung et al. (2022), Nakajima et al. (2023), Tang et al. (2024), Chen et al. (2024), and Napolitano et al. (2024) in the CEERS field. |$P(\mathrm{LAE})$| of these galaxies is predicted using the network in the same way as in Section 4.4, except that the redshift is fixed at the spectroscopic redshift in the SED fitting. We use only HST photometry for galaxies located outside the NIRCam footprints. We remove galaxies with |$M_* \lt 10^8M_\odot$|, as for the training data set. Fig. 12 shows the galaxy distribution with the predicted |$P(\mathrm{LAE})$|.

Figure 12.

The distribution of spectroscopically confirmed galaxies around the expected ionized regions. The filled and open circles show galaxies with detected Ly|$\alpha$| emission and not detected in their spectra, respectively. The circles are coloured with the value of |$P(\mathrm{LAE})$|⁠. The black circles show no |$P(\mathrm{LAE})$| value because the galaxies are |$M_* \lt 10^8 M_\odot$|⁠. The black open triangles show the galaxies with no UV counterpart in JWST/HST photometry or no spectral coverage at Ly|$\alpha$|⁠. The orange shades are the expected ionized regions. The NIRCam footprint of the CEERS field is shown on the RA-DEC plane at the far end.

At |$z = 7.18$|, three LAEs are closely located, and one LAE is at |$z = 7.10$|. The distance between the two structures is |$4.3 \, \mathrm{pMpc}$| in 3D space. Two galaxies between these structures do not show Ly|$\alpha$| emission, even though one of them is predicted to be an LAE with |$P(\mathrm{LAE}) \gt 0.9$| and would have been intrinsically Ly|$\alpha$| emitting. While its stellar mass is low (|$10^{8.2}M_\odot$|), it is reasonable that it is an intrinsic LAE given its UV slope |$\beta$| (|$-2.4$|). The lack of Ly|$\alpha$| emission in the observed spectrum is attributed to the high neutral fraction in the IGM. Therefore, it is reasonable to consider that each of the LAE structures at |$z = 7.10$| and |$z = 7.18$| has a moderate-size ionized bubble (|$R \lt 1 \, \mathrm{pMpc}$|), rather than that the galaxies within |$z = 7.10\!-\!7.18$| reside in a single large ionized bubble.

Similarly, at |$z \sim 7.5$|, six LAEs are reported across the CEERS field. One LAE is located in the north-east, and five LAEs are within a distance of |$2.2 \, \mathrm{pMpc}$| in the south-west. Between the two structures, there are four spectroscopically confirmed galaxies, two of which do not show Ly|$\alpha$| emission in their spectra. One of the galaxies without Ly|$\alpha$| emission is also predicted to be an intrinsic LAE with |$P(\mathrm{LAE}) \gt 0.9$|. Its stellar mass is |$10^{8.8}M_\odot$|, |$M_\mathrm{UV}$| is |$-20.7$|, and UV slope |$\beta$| is |$-2.4$|. We regard the system at |$z = 7.5$| as consisting of at least two ionized regions, while it is not clear whether the south-west structure is in a large ionized bubble or smaller ionized bubbles (Napolitano et al. 2024).

5 SUMMARY

In this study, we develop a model to predict the probability that a galaxy shows Ly|$\alpha$| emission based on the neural network approach. Our main results are summarized as follows.

  • As a training data set of the neural network, we collect spectroscopic information of Ly|$\alpha$| from VANDELS and MUSE spectroscopic surveys. We measure Ly|$\alpha$| flux and EW of Ly|$\alpha$| emission line. We conduct SED fitting using cigale to derive the physical properties of the galaxies. The spectra of galaxies are visually inspected to construct a training data set of galaxies with and without Ly|$\alpha$| emission. In total, the training data set consists of 1027 LAEs and 406 non-LAEs.

  • We train a neural network that predicts whether a galaxy has a Ly|$\alpha$| emission line from six physical parameters, i.e. SFR, stellar mass, UV absolute magnitude |$M_\mathrm{UV}$|⁠, age, UV slope |$\beta$|⁠, and dust attenuation |$E(B-V)$|⁠. We employ a Monte Carlo approach to account for the uncertainty in the input physical parameters. The trained prediction model shows the performance of 77 per cent true positive rate and 14 per cent false positive rate when we define LAEs as |$P(\mathrm{LAE}) \gt 0.7$|⁠. The area under the ROC curve is 0.88.

  • Using the permutation feature importance method, we find that |$\beta$|, |$M_\mathrm{UV}$|, and |$M_*$| are important for the prediction of LAEs.

  • We apply the prediction model to COSMOS2020 sources and SC4K LAEs. Although these galaxies do not have spectroscopic information, our model outputs a reasonable |$P(\mathrm{LAE})$| for each of them, thus validating its applicability to galaxy samples other than the training data set.

  • We use public JWST observations in the CEERS, COSMOS, GOODS-N, GOODS-S, and UDS fields to select LAEs with the prediction model. Of the spectroscopically confirmed LAEs in the JWST fields, 91 per cent are evaluated as |$P(\mathrm{LAE}) \gt 0.7$|, which supports the validity of the prediction model.

  • We calculate the Ly|$\alpha$| fraction of the JWST galaxies at |$3 \lt z \lt 6$| based on the output |$P(\mathrm{LAE})$| of the prediction model. The Ly|$\alpha$| fraction reproduces the increasing trend in this redshift range found in previous studies.

  • Using LAEs detected by the prediction model from JWST observations, we show a continuous spatial distribution of LAEs over |$3 \lt z \lt 6$| in the CEERS field. We find some similarities (|$3.0 \lt z \lt 3.5$|) and discrepancies (|$4.5 \lt z \lt 5.0$|) between the density distributions of LAEs and non-LAEs. We also investigate the galaxy distribution around the expected ionized bubbles at |$z=7.18$| and |$z=7.49$|. The comparison between the predicted |$P(\mathrm{LAE})$| and the spectroscopically observed Ly|$\alpha$| flux suggests that the ionized structures at |$z=7.18$| and |$z=7.49$| consist of separate, moderate-size (|$R_\mathrm{ion} \lesssim 1 \, \mathrm{pMpc}$|) ionized bubbles rather than being enclosed by one large (|$R_\mathrm{ion} \gt 2\, \mathrm{pMpc}$|) ionized bubble.
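
The evaluation steps summarized above (the true and false positive rates at a |$P(\mathrm{LAE})$| threshold, and the permutation feature importance) can be illustrated with a minimal Python sketch. This is not the code used in this work. The function names, the toy predictor, and the mock data are hypothetical, and the use of classification accuracy as the score for the permutation test is an assumption made for illustration only.

```python
import numpy as np


def rates_at_threshold(y_true, p_lae, threshold=0.7):
    """Return (TPR, FPR) when sources with P(LAE) > threshold are classed as LAEs."""
    y_true = np.asarray(y_true, dtype=bool)
    predicted = np.asarray(p_lae, dtype=float) > threshold
    tpr = np.sum(predicted & y_true) / np.sum(y_true)
    fpr = np.sum(predicted & ~y_true) / np.sum(~y_true)
    return tpr, fpr


def permutation_importance(predict, features, y_true, n_repeats=20,
                           threshold=0.7, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn.

    Accuracy is an assumed score; other metrics could be substituted.
    """
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    y_true = np.asarray(y_true, dtype=bool)

    def accuracy(x):
        return np.mean((predict(x) > threshold) == y_true)

    baseline = accuracy(features)
    drops = np.zeros(features.shape[1])
    for j in range(features.shape[1]):
        for _ in range(n_repeats):
            shuffled = features.copy()
            # Break the link between feature j and the labels.
            shuffled[:, j] = rng.permutation(shuffled[:, j])
            drops[j] += baseline - accuracy(shuffled)
    return drops / n_repeats


def toy_predict(x):
    """Hypothetical stand-in for a trained network: depends on column 0 only."""
    return 1.0 / (1.0 + np.exp(-5.0 * x[:, 0]))


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    x = rng.normal(size=(500, 2))  # two mock features, not real galaxy data
    y = x[:, 0] > 0.0              # mock labels set by column 0
    print(rates_at_threshold(y, toy_predict(x)))
    print(permutation_importance(toy_predict, x, y))
```

In this toy setting, shuffling the second column leaves the predictions unchanged, so its importance is zero, whereas shuffling the first column degrades the accuracy. For a real model, `predict` would wrap the trained network and `features` would hold the six SED-derived parameters.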

ACKNOWLEDGEMENTS

We thank the anonymous referee for the constructive comments that improved the quality of the paper. This research was supported by IGPEES, WINGS Program, the University of Tokyo. This research was supported by a grant from the Hayakawa Satio Fund awarded by the Astronomical Society of Japan. NK was supported by the Japan Society for the Promotion of Science through Grant-in-Aid for Scientific Research 21H04490. YT is supported by the Forefront Physics and Mathematics Program to Drive Transformation (FoPM), a World-leading Innovative Graduate Study (WINGS) Program, the University of Tokyo, Iwadare Scholarship Foundation and JSPS KAKENHI Grant Number JP23KJ0726. Some of the data products presented herein were retrieved from the Dawn JWST Archive (DJA). DJA is an initiative of the Cosmic Dawn Center (DAWN), which is funded by the Danish National Research Foundation under grant DNRF140.

Software: adstex (https://github.com/yymao/adstex), astropy (Astropy Collaboration 2013, 2018), jupyter (Kluyver et al. 2016), keras (Chollet et al. 2015), matplotlib (Hunter 2007), numpy (Harris et al. 2020), pandas (McKinney 2010; The pandas development Team 2023), scikit-learn (Pedregosa et al. 2011), scipy (Virtanen et al. 2020), tensorflow (Developers 2023), uncertainties (http://pythonhosted.org/uncertainties/).

DATA AVAILABILITY

The data from VANDELS are available at http://vandels.inaf.it/. The MUSE catalogue of Schmidt et al. (2021) is available at https://cdsarc.u-strasbg.fr/viz-bin/cat/J/A+A/654/A80. The data from CANDELS are available at https://archive.stsci.edu/hlsp/candels. The data from 3D-HST are available at https://archive.stsci.edu/prepds/3d-hst/. The data from UVUDF are available at https://archive.stsci.edu/hlsp/uvudf. The data from COSMOS are available at https://cosmos.astro.caltech.edu/. The data from SC4K are available at https://academic.oup.com/mnras/article/476/4/4725/4858393. The photometric catalogues of JWST are available at https://dawn-cph.github.io/dja/index.html.

REFERENCES

Altmann A., Toloşi L., Sander O., Lengauer T., 2010, Bioinformatics, 26, 1340

Arrabal Haro P. et al., 2020, MNRAS, 495, 1807

Astropy Collaboration, 2013, A&A, 558, A33

Astropy Collaboration, 2018, AJ, 156, 123

Bacon R. et al., 2017, A&A, 608, A1

Bacon R. et al., 2023, A&A, 670, A4

Begley R. et al., 2024, MNRAS, 527, 4040

Bielby R. M. et al., 2016, MNRAS, 456, 4061

Bolan P. et al., 2022, MNRAS, 517, 3263

Bolan P. et al., 2024, MNRAS, 531, 2998

Boquien M., Burgarella D., Roehlly Y., Buat V., Ciesla L., Corre D., Inoue A. K., Salas H., 2019, A&A, 622, A103

Brammer G., 2023, grizli. Zenodo

Brammer G. B., van Dokkum P. G., Coppi P., 2008, ApJ, 686, 1503

Bruzual G., Charlot S., 2003, MNRAS, 344, 1000

Calzetti D., Armus L., Bohlin R. C., Kinney A. L., Koornneef J., Storchi-Bergmann T., 2000, ApJ, 533, 682

Cassata P. et al., 2015, A&A, 573, A24

Chabrier G., 2003, PASP, 115, 763

Chávez Ortiz Ó. A. et al., 2023, ApJ, 952, 110

Chen Z., Stark D. P., Mason C., Topping M. W., Whitler L., Tang M., Endsley R., Charlot S., 2024, MNRAS, 528, 7052

Chollet F. et al., 2015, Keras. Available at:

Curtis-Lake E. et al., 2012, MNRAS, 422, 1425

Developers T., 2023, TensorFlow. Zenodo

Dunlop J. S. et al., 2021, PRIMER: Public Release IMaging for Extragalactic Research, JWST Proposal. Cycle 1, ID. #1837.

Finkelstein S. L., Rhoads J. E., Malhotra S., Grogin N., 2009, ApJ, 691, 465

Finkelstein S. L. et al., 2023, ApJ, 946, L13

Foran G., Cooke J., Reddy N., Steidel C., Shapley A., 2023, Publ. Astron. Soc. Aust., 40, e052

Fuller S. et al., 2020, ApJ, 896, 156

Garilli B. et al., 2021, A&A, 647, A150

Gazagnes S., Chisholm J., Schaerer D., Verhamme A., Izotov Y., 2020, A&A, 639, A85

Goovaerts I. et al., 2023, A&A, 678, A174

Guo Y. et al., 2013, ApJS, 207, 24

Harris C. R. et al., 2020, Nature, 585, 357

Hayes M. J., Runnholm A., Scarlata C., Gronke M., Rivera-Thorsen T. E., 2023, MNRAS, 520, 5903

Herenz E. C. et al., 2017, A&A, 606, A12

Hoag A. et al., 2019, MNRAS, 488, 706

Hunter J. D., 2007, Comput. Sci. Eng., 9, 90

Iani E. et al., 2024, ApJ, 963, 97

Inami H. et al., 2017, A&A, 608, A2

Ito K. et al., 2021, ApJ, 916, 35

Jones G. C. et al., 2024, A&A, 683, A238

Jung I. et al., 2022, preprint ()

Jung I. et al., 2024, ApJ, 967, 73

Kerutt J. et al., 2022, A&A, 659, A183

Khostovan A. A. et al., 2019, MNRAS, 489, 555

Kingma D. P., Ba J., 2015, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Kluyver T. et al., 2016, Positioning and Power in Academic Publishing: Players, Agents and Agendas, 87

Kusakabe H. et al., 2020, A&A, 638, A12

Law D. R., Steidel C. C., Shapley A. E., Nagy S. R., Reddy N. A., Erb D. K., 2012, ApJ, 759, 29

Lemaux B. C. et al., 2021, MNRAS, 504, 3662

Maji M. et al., 2022, A&A, 663, A66

Malhotra S., Rhoads J. E., Finkelstein S. L., Hathi N., Nilsson K., McLinden E., Pirzkal N., 2012, ApJ, 750, L36

Marchi F. et al., 2018, A&A, 614, A11

McCarron A. P. et al., 2022, ApJ, 936, 131

McKinney W., 2010, in van der Walt S., Millman J., eds, Proceedings of the 9th Python in Science Conference, Austin, Texas. p. 56

McLure R. J. et al., 2018, MNRAS, 479, 25

Momose R., Shimasaku K., Nagamine K., Shimizu I., Kashikawa N., Ando M., Kusakabe H., 2021, ApJ, 912, L24

Nakajima K., Ouchi M., Isobe Y., Harikane Y., Zhang Y., Ono Y., Umeda H., Oguri M., 2023, ApJS, 269, 33

Napolitano L. et al., 2023, A&A, 677, A138

Napolitano L. et al., 2024, A&A, 688, A106

Oesch P. A. et al., 2023, MNRAS, 525, 2864

Oke J. B., Gunn J. E., 1983, ApJ, 266, 713

Ono Y. et al., 2012, ApJ, 744, 83

Ouchi M. et al., 2010, ApJ, 723, 869

Partridge R. B., Peebles P. J. E., 1967, ApJ, 147, 868

Paulino-Afonso A. et al., 2018, MNRAS, 476, 5479

Pedregosa F. et al., 2011, J. Mach. Learn. Res., 12, 2825

Pentericci L., Grazian A., Fontana A., Castellano M., Giallongo E., Salimbeni S., Santini P., 2009, A&A, 494, 553

Pentericci L. et al., 2018, A&A, 616, A174

Pucha R., Reddy N. A., Dey A., Juneau S., Lee K.-S., Prescott M. K. M., Shivaei I., Hong S., 2022, AJ, 164, 159

Rafelski M. et al., 2015, AJ, 150, 31

Ribeiro B. et al., 2020, preprint ()

Runnholm A., Hayes M., Melinder J., Rivera-Thorsen E., Östlin G., Cannon J., Kunth D., 2020, ApJ, 892, 48

Santos S. et al., 2020, MNRAS, 493, 141

Saxena A. et al., 2024, A&A, 684, A84

Schenker M. A., Ellis R. S., Konidaris N. P., Stark D. P., 2014, ApJ, 795, 20

Schmidt K. B. et al., 2021, A&A, 654, A80

Shapley A. E., Steidel C. C., Adelberger K. L., Dickinson M., Giavalisco M., Pettini M., 2001, ApJ, 562, 95

Shibuya T., Ouchi M., Harikane Y., Nakajima K., 2019, ApJ, 871, 164

Shimakawa R. et al., 2017, MNRAS, 468, L21

Skelton R. E. et al., 2014, ApJS, 214, 24

Smith A., Ma X., Bromm V., Finkelstein S. L., Hopkins P. F., Faucher-Giguère C.-A., Kereš D., 2019, MNRAS, 484, 39

Sobral D., Santos S., Matthee J., Paulino-Afonso A., Ribeiro B., Calhau J., Khostovan A. A., 2018a, MNRAS, 476, 4725

Sobral D. et al., 2018b, MNRAS, 477, 2817

Stark D. P., Ellis R. S., Ouchi M., 2011, ApJ, 728, L2

Talia M. et al., 2023, A&A, 678, A25

Tang M. et al., 2024, MNRAS, 531, 2701

The pandas development Team, 2023, pandas-dev/pandas: Pandas. Zenodo

Tilvi V. et al., 2014, ApJ, 794, 5

Trainor R. F., Strom A. L., Steidel C. C., Rudie G. C., 2016, ApJ, 832, 171

Urbano Stawinski S. M. et al., 2024, MNRAS, 528, 5624

Urrutia T. et al., 2019, A&A, 624, A141

Virtanen P. et al., 2020, Nat. Methods, 17, 261

Weaver J. R. et al., 2022, ApJS, 258, 11

Williams C. C. et al., 2021, UDF medium band survey: Using H-alpha emission to reconstruct Ly-alpha escape during the Epoch of Reionization, JWST Proposal. Cycle 1, ID. #1963.

Witstok J. et al., 2024, A&A, 682, A40

Yoshioka T. et al., 2022, ApJ, 927, 32

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.