HAYATE: Photometric redshift estimation by hybridising machine learning with template fitting

Machine learning photo-z methods, trained directly on spectroscopic redshifts, provide a viable alternative to traditional template fitting methods but may not generalise well on new data that deviates from that in the training set. In this work, we present a Hybrid Algorithm for WI(Y)de-range photo-z estimation with Artificial neural networks and TEmplate fitting (HAYATE), a novel photo-z method that combines template fitting and data-driven approaches and whose training loss is optimised in terms of both redshift point estimates and probability distributions. We produce artificial training data from low-redshift galaxy SEDs at z<1.3, artificially redshifted up to z=5. We test the model on data from the ZFOURGE surveys, demonstrating that HAYATE can function as a reliable emulator of EAZY for the broad redshift range beyond the region of sufficient spectroscopic completeness. The network achieves precise photo-z estimations with smaller errors ($\sigma_{NMAD}$) than EAZY in the initial low-z region (z<1.3), while being comparable even in the high-z extrapolated regime (1.3<z<5). Meanwhile, it provides more robust photo-z estimations than EAZY with the lower outlier rate ($\eta_{0.2}\lesssim 1\%$) but runs $\sim100$ times faster than the original template fitting method. We also demonstrate HAYATE offers more reliable redshift PDFs, showing a flatter distribution of Probability Integral Transform scores than EAZY. The performance is further improved using transfer learning with spec-z samples. We expect that future large surveys will benefit from our novel methodology applicable to observations over a wide redshift range.


INTRODUCTION
Wide-field imaging surveys are a fundamental driver of astronomical discovery in the fields of galaxy evolution and cosmology. Galaxy redshifts are a key component in the application of the large-survey data, representing the measurement of galaxy distances. They are crucial for identifying objects present in the early Universe, tracing the evolution of galaxy properties over cosmic time and constraining cosmological models.
There are two major methods for determining galaxy redshifts: using spectroscopically identified spectral line features (spectroscopic redshifts, hereafter spec-z's), or via multiband photometry (photometric redshifts, hereafter photo-z's; Baum 1962; Butchins 1981; Connolly et al. 1995; Hildebrandt et al. 2010). Spec-z's are typically much more accurate but more observationally costly than photo-z's; there is a trade-off between the sample size of a dataset and the precision of redshift estimates (Salvato et al. 2019). In the context of upcoming large surveys, extragalactic astronomy will benefit from photo-z estimation at an unprecedented level, as follow-up spectroscopy can never keep pace with present and future large imaging surveys, e.g., the Vera C. Rubin Observatory's Legacy Survey of Space and Time (LSST; LSST Science Collaboration et al. 2009), the Dark Energy Survey (DES; Dark Energy Survey Collaboration et al. 2016), the Nancy Grace Roman Space Telescope (Spergel et al. 2015), the James Webb Space Telescope (JWST; Finkelstein et al. 2015), the Hyper Suprime-Cam Subaru Strategic Program (HSC-SSP; Aihara et al. 2018; Aihara et al. 2022), the Euclid mission (Euclid Collaboration et al. 2020) and the Kilo-Degree Survey (KiDS; Hildebrandt et al. 2021). Thus, efficient and accurate estimation of photo-z's is a topic that has fundamental importance in various fields of research.
★ E-mail: stanigawa@swin.edu.au
There are two main approaches to photo-z estimation. One is the template fitting method, a kind of model fitting approach (e.g., Arnouts et al. 1999; Benítez 2000; Bolzonella et al. 2000; Feldmann et al. 2006; Brammer et al. 2008; Eriksen et al. 2019), while the other is the data-driven method of empirical modelling based on spec-z's, namely machine learning (ML; e.g., Carrasco Kind & Brunner 2013; Graff et al. 2014; Almosallam et al. 2016; Sadeh et al. 2016; Cavuoti et al. 2017a; Izbicki et al. 2017; Graham et al. 2018). The main advantage of template fitting is that it can generally be applied at any redshift. It is, however, unable to learn from data to improve performance, which is fundamentally constrained by the template set. In contrast, the benefit of the data-driven method is generalisation to "unseen data" via learning from the given dataset. It potentially outperforms template fitting by learning a mapping from photometry to redshift and bypassing potentially unrepresentative templates.
This also reduces the computational demands for photo-z estimation compared to the one-on-one matching between individual objects and templates. However, it typically cannot be expected to work outside the redshift range present in the spec-z training set.
Template fitting methods, in which photo-z's are derived from matching the broad- or medium-band photometry of an observed galaxy to pre-defined SED templates, have proven to be very useful. The template library commonly employed for photo-z study has been updated over the past few decades, exploiting observed (e.g., Bolzonella et al. 2000; Ilbert et al. 2006, 2009; Salvato et al. 2009) and synthetic (e.g., Carnall et al. 2018; Battisti et al. 2019; Boquien et al. 2019; Bellstedt et al. 2020) galaxy SEDs. With this method we can estimate photo-z's for any region of colour space at any redshift. However, photo-z estimation with this technique still relies on a limited set of pre-defined templates (which may be more or less representative of the observed galaxy population) as well as on the fitting algorithm. The template fitting method is also often computationally intensive and ill-suited to ongoing and future large survey projects, which will require feasible solutions for analysing unprecedentedly large datasets in peta-scale regimes, depending on the science cases.
ML techniques employ an algorithmic model for learning from a given dataset to capture its underlying patterns and then utilise the learned model to make predictions on new data. They are able to learn from large volumes of data and automatically capture inherent patterns therein that may not be apparent to humans. In the context of photo-z prediction, this represents a promising route to estimate redshifts from an unprecedentedly huge dataset composed of multiband photometric data associated with spec-z information.
Different ML algorithms have been utilised in previous works on photo-z estimation. Carrasco Kind & Brunner (2013) introduced a photo-z method based on prediction trees and random forest (RF) techniques (Breiman & Schapire 2001). The Multi Layer Perceptron with Quasi Newton Algorithm (MLPQNA; Brescia et al. 2013, 2014) contributed to many photo-z works as an excellent demonstration of feed-forward neural networks. Sadeh et al. (2016) applied multiple ML methods to their model that utilises artificial neural networks (ANNs) and boosted decision trees, while Jones & Singal (2017) presented a Support Vector Machine (SVM) classification algorithm for photo-z estimation. These ML-based photo-z methods are generally trained to learn the complex relationship between photometry and distance of observed galaxies. Most of them have been tested on the publicly available data from the PHoto-z Accuracy Testing (PHAT) program (Hildebrandt et al. 2010; Cavuoti et al. 2012), performing comparably in terms of photo-z accuracy.
ANNs, which are inspired by the biological neural networks of the human brain (Mcculloch & Pitts 1943; Hopfield 1982), have been one of the most popular ML algorithms used in photo-z studies. They can theoretically approximate any complex function based on the Universal Approximation Theorem (Cybenko 1989; Hornik 1991), allowing a model to map nonlinear relationships between photometry and redshift. In particular, major advances have been produced by exploiting the flexibility of fully connected neural networks (FCNNs), in which each neuron in one layer is connected to all neurons in the next layer.
A major stumbling block for ML-based photo-z approaches is incompleteness in the spectroscopic training samples commonly used as the ground truth redshift. This limitation could prevent a trained model from functioning as intended, i.e. generalising robustly to new examples outside the training set. In particular, spec-z catalogues used for training are typically biased towards the bright part of the magnitude parameter space and are incomplete for high-z objects as well. This also explains why photo-z estimations at high redshifts still rely on existing template fitting methods rather than ML techniques, although the latter are more common at $z \lesssim 1$. Moreover, training-based methods do not generally allow for reliable extrapolation beyond the range of data that is well represented by the training set. The target redshift range for ML is therefore limited to low-z regions of sufficient spectroscopic completeness, where the success rate in obtaining accurate redshifts is higher for brighter objects.
Furthermore, both template- and ML-based photo-z codes generally fall short in producing valid probability density functions (PDFs) of redshift, which fully characterise the results of photo-z estimation (Schmidt et al. 2020). Per-galaxy photo-z PDFs have been commonly applied to estimate the ensemble redshift distribution $N(z)$ of a sample of galaxies, an estimator critical to cosmological parameter constraints from weak gravitational lensing analysis (e.g., Mandelbaum et al. 2008; Sheldon et al. 2012; Bonnett et al. 2016; Hildebrandt et al. 2017). Schmidt et al. (2020) demonstrated the vulnerability of each single model to a specific flaw in the population of output PDFs in spite of the precise photo-z point estimates. We still lack a model that can produce well-calibrated redshift PDFs and that can be readily adapted to new studies of galaxy evolution and cosmology. Wolf (2009) proposed an example solution for producing accurate redshift distributions from stacked PDFs, although addressing not typical galaxies but specifically quasars under certain conditions. Combining $\chi^2$ template fits and empirical approaches likely preserves both benefits in one framework; empirical training sets can complement unreliable PDFs generated with the $\chi^2$ technique based on imperfect templates, provided they match the distribution and calibration of the query samples. This, however, essentially requires an appropriate treatment of the error scale used for smoothing the appearance of samples in feature space and controlling the width of derived PDFs.
Traditional ML approaches have generally delivered better performance than template-based methods within the range of training spec-z coverage (Newman & Gruen 2022). The trade-off between the strengths of ML and template fitting inspires the hybridisation of their distinctive advantages. Training the model on simulated photometry is one strategy to overcome the challenges of assembling a complete, reliable and unbiased training sample of sufficient size. Artificial SED samples are often generated using a stellar population synthesis (SPS) code with arbitrary selection of free parameters (e.g., Eriksen et al. 2020; Ramachandra et al. 2022). Zhou et al. (2021) applied a set of best-fit SEDs for the COSMOS catalogue using the template fitting code LePhare, produced based on typical SPS spectra (Bruzual A. & Charlot 1993; Bruzual & Charlot 2003). A complete training set of simulated galaxies should compensate for the sparse sampling of spec-z data, allowing for interpolation between spectroscopically observed objects and even extrapolation to the faintest ones (Newman & Gruen 2022). The fidelity of the mock training samples is still liable to many stellar evolution uncertainties that have long plagued SPS models (Conroy 2013). Constructing such an ideal SED dataset requires further improvements to SPS models and to our knowledge of the underlying galaxy population.
Alternatively, the template fitting code EAZY (Brammer et al. 2008) provides more flexible galaxy SEDs by fitting a linear combination of basic spectral templates to the observed photometry on-the-fly. Brammer et al. (2008) developed a minimal template set of synthetic SEDs representing the "principal components", following the template-optimisation routines introduced by Blanton & Roweis (2007). The template set is calibrated with semi-analytical models, which are complete to very faint magnitudes, rather than with biased spectroscopic samples, along with a template error function to account for wavelength-dependent template mismatch. The applicability of EAZY to diverse redshift coverage has been demonstrated with a plethora of photometric catalogues (e.g., Treister et al. 2009; Wuyts et al. 2009; Cardamone et al. 2010; Muzzin et al. 2013; Nanayakkara et al. 2016; Straatman et al. 2016; Strait et al. 2021). In particular, the reliability of EAZY photo-z's was thoroughly assessed with comprehensive photometric samples presented by Straatman et al. (2016, hereafter S16), which include medium-bandwidth filters from the FourStar galaxy evolution (ZFOURGE) surveys.
In this work, we present a novel hybrid photo-z method that combines template fitting and data-driven approaches to exploit the best aspects of both. Our photo-z network is trained with mock photometric data generated based on the ensemble of template SEDs provided by EAZY. This is particularly motivated by exploiting knowledge of galaxy SEDs at low-z, where template fitting is assumed to be reliable, and applying their rest-frame SEDs to a higher redshift range. The full training set of mock SEDs is thus generated by redshifting best-fit SEDs derived with EAZY for the S16 photometric catalogue objects of $z \lesssim 1$, whose simulated redshifts are distributed in a broader range up to $z = 5$. We develop photo-z convolutional neural networks (CNNs; Lecun et al. 1998; LeCun et al. 2004) optimised to simultaneously produce both a well-calibrated set of redshift PDFs and accurate point estimates. The trained model is tested with updated S16 spectroscopic samples, and its performance is evaluated based on photo-z metrics commonly used for measuring the quality of both output PDFs and the corresponding point estimates.
Our ML strategy benefits from recent advances in the field of domain adaptation (Csurka 2017; Wang & Deng 2018; Wilson & Cook 2020), which allows a model to learn domain-invariant features shared between discrepant data distributions. The simulation-based ML model here is trained with synthetic data, which can be further advanced by transfer learning (Pan & Yang 2010), where a model pre-trained on one task is re-purposed on another related task. Pre-training the feature extraction layers on a large external dataset then fine-tuning on a smaller training set alleviates overfitting compared to simply training from scratch on the small dataset. We can thus fine-tune the simulation-based photo-z network with a limited amount of spectroscopic data by re-training the last layers on real datasets with spec-z information (Eriksen et al. 2020). This optimisation scheme in principle aids in correcting the gap between mock and observed training samples.
Our novel approach is to "extrapolate" training methods outside their initial redshift ranges from the viewpoint of the original template fits. Training with domain adaptation can be performed on high-z simulated data by capturing a realistic range of galaxy SED properties determined from reliable low-z data. In place of spectroscopic data we leverage the demonstrated accuracy of template fitting, overcoming the traditional redshift limitation of ML photo-z codes. In essence, the CNN-based hybrid model is thus designed to function as an efficient emulator of EAZY. The interpolative nature of supervised ML approaches could even infer photo-z point estimates more precisely and robustly than those provided by the original template-based method. Incorporating the flavour of template fitting into the ML framework potentially improves the quality of photo-z PDFs as well. Ultimately, we aim to improve photo-z estimation for JWST photometry, which will have coverage at redder wavelengths than previously available.
This paper is organised as follows. In §2, we present the photometric catalogues used in this work. In §3, we detail our method for producing mock photometric data (with a noise model) via simulations. §4 describes the development of our ML photo-z networks and the framework for evaluating their performance. §5 presents results on testing different photo-z models on the ZFOURGE catalogue data and comparing their performance in photo-z and PDF metrics commonly used for major photo-z studies. In §6 we discuss some of the issues raised by the work. Finally, in §7 we summarise the work and discuss future prospects. Throughout this paper, we assume a ΛCDM cosmology with $\Omega_m = 0.3$, $\Omega_\Lambda = 0.7$ and $H_0 = 70$ km s$^{-1}$ Mpc$^{-1}$.

CATALOGUE DATA
This work introduces a hybrid ML-based photo-z method that benefits from the template fitting algorithm of EAZY, aimed at deriving photo-z PDFs of galaxies extracted from the ZFOURGE photometric catalogues (Straatman et al. 2016). ZFOURGE data products comprise 45 nights of observations with the FourStar instrument (Persson et al. 2013) on the 6.5 m Magellan Baade Telescope at Las Campanas in Chile. The survey observed three fields, CDFS (Giacconi et al. 2002), COSMOS (Scoville et al. 2007) and UDS (Lawrence et al. 2007), with five near-IR medium-bandwidth filters, $J_1$, $J_2$, $J_3$, $H_s$ and $H_l$, along with broad-band $K_s$. Pushing to faint magnitude limits of 25-26 AB achieves the mass completeness limit of $\sim 10^8 M_\odot$ at $z \lesssim 1$, also advancing the study of intermediate to high redshift objects.
S16 includes data from publicly available surveys at 0.3-8 µm, constructing comprehensive photometric catalogues, each with a total of 39 (CDFS), 36 (COSMOS) and 27 (UDS) medium- and broad-band flux measurements. The individual objects were cross-matched with the compilation of publicly available spec-z catalogues provided by Skelton et al. (2014) as well as the first data release from the MOSDEF survey (Kriek et al. 2015) and the VIMOS Ultra-Deep Survey (Tasca et al. 2017). These samples have been used to demonstrate the benefit of including the FourStar medium bands in the input for improving the photo-z accuracy with a better sampling of galaxy SEDs (Straatman et al. 2016).
Throughout, the catalogue data utilised for this work are limited to objects with a use flag of 1, which represents reliable data with good photometry and a low likelihood of contamination with stars or blending with another source. These sources are obtained from regions of the images with sufficiently high S/N. We thus construct test catalogue samples with use = 1 and total $K_s$-band magnitude < 26, providing the galaxy population that can be used in large statistical studies. Our main target objects are high-z galaxies of $z \gtrsim 1.3$, whose photo-z estimations have not been well explored by ML methods. We set the lower limit to 1.3 as that is a typical bound above which spec-z's are incomplete, since the galaxy optical light is redshifted into the near-infrared. The model is nonetheless required to make predictions across the whole redshift range (including lower $z$), since we cannot exclusively select high-z objects a priori from real observations. Our spec-z samples are therefore limited only by an upper bound of 5, and are adopted as a test set for evaluating the model's performance on the broad redshift range of $0 < z_{\rm spec} < 5$.
Additionally, we incorporate ancillary spec-z data from the latest releases of several surveys into our original S16 catalogue, with a matching radius of $1''$. All the catalogues are supplemented by the final data releases from the MOSDEF (Kriek et al. 2015) and MOSEL (Tran et al. 2020; Gupta et al. 2020) surveys. The fourth data release from the VANDELS surveys (Garilli et al. 2021) provides auxiliary spec-z's for CDFS and UDS, while the ZFIRE survey (Nanayakkara et al. 2016) does so for COSMOS. We only extract reliable data with the best quality flag individually defined for each survey catalogue.
As a further step, two of the authors (KG and IL) visually inspected spectra where the spec-z and EAZY photo-z's differed significantly. We removed objects deemed likely misidentifications, providing sample sizes of 1100 (CDFS), 425 (COSMOS) and 127 (UDS) from the original S16 catalogue. The size of each supplemented sample ($z > 1.3$) is as follows: 1273 in CDFS, 741 in COSMOS and 314 in UDS, an increase of 173, 316 and 187, respectively.

TRAINING SET OF MOCK PHOTOMETRIC DATA
In this section, we discuss the generation of mock photometric data used for training the ML model. The entire process is divided into two major parts, both of which are important for creating a training sample that can sufficiently cover the colour space occupied by the test sources. §3.1 describes the method of producing mock SEDs from EAZY best-fits for a limited sample of low-z galaxies in S16. In §3.2, the noise model is introduced to apply realistic errors to simulated photometry, which allows for the construction of reliable mock photometric data.

Mock galaxy SEDs
We simulate galaxy SEDs up to $z = 5$ by redshifting the EAZY best-fit SEDs for low-z objects with $z_{\rm EAZY} < 1.3$ in S16. This enables us to produce SEDs of galaxies in the target redshift range of $1.3 < z < 5$ purely based on a galaxy population at lower redshifts. The selection criterion for the low-z sources also ensures that the generated sample fully covers typical SED types, since ZFOURGE is very complete to low masses at $z \lesssim 1.3$, where the 80% mass completeness limit reaches down to $\sim 10^{8} - 10^{8.5} M_\odot$ (Straatman et al. 2016). We thus first extract EAZY best-fits for objects with $z_{\rm EAZY} < 1.3$ that are included in the photometric catalogues of S16. The total number of selected low-z sources is 17,891. These empirical SEDs are technically unique, since EAZY fits an ensemble of nine representative spectral templates to each set of observed fluxes. The major part of our simulated sample thus consists of typical SED types empirically obtained from low-z observations but assumed to be present at much higher redshifts.
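We then artificially redshift these pre-defined SEDs from the limited redshift range of $z_{\rm EAZY} < 1.3$ to simulated redshifts ($z_{\rm sim}$'s) in a much broader range of $0 < z_{\rm sim} < 5$. For each mock SED, a set of simulated wavelength and flux density per unit wavelength ($\lambda_{\rm sim}$, $f_{\rm sim}$) measurements are derived from the EAZY output ($\lambda_{\rm EAZY}$, $f_{\rm EAZY}$) with the following equations:
$$\lambda_{\rm sim} = \lambda_{\rm EAZY}\,\frac{1 + z_{\rm sim}}{1 + z_{\rm EAZY}},$$
$$f_{\rm sim} = f_{\rm EAZY}\,\frac{1 + z_{\rm EAZY}}{1 + z_{\rm sim}} \left(\frac{d_{\rm EAZY}}{d_{\rm sim}}\right)^2,$$
where $d_{\rm EAZY}$ and $d_{\rm sim}$ are the luminosity distances of the observed and simulated galaxies.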
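The redshifting step can be illustrated with a minimal sketch. It assumes the standard scaling of a flux density per unit wavelength with redshift and luminosity distance, and the flat ΛCDM cosmology adopted in this paper; the function names `luminosity_distance` and `redshift_sed` are hypothetical and do not come from EAZY or from the authors' code.

```python
import numpy as np

C_KM_S = 299792.458  # speed of light [km/s]

def luminosity_distance(z, h0=70.0, om=0.3, ol=0.7, n=2048):
    """Luminosity distance [Mpc] in flat LCDM (illustrative numerical integral)."""
    zz = np.linspace(0.0, z, n)
    inv_e = 1.0 / np.sqrt(om * (1.0 + zz) ** 3 + ol)
    # Trapezoidal rule written out explicitly for the comoving distance integral
    integral = np.sum(0.5 * (inv_e[1:] + inv_e[:-1]) * np.diff(zz))
    return (1.0 + z) * (C_KM_S / h0) * integral

def redshift_sed(wave, flux_lam, z_src, z_sim):
    """Move an observed-frame SED (f_lambda) from z_src > 0 to z_sim."""
    stretch = (1.0 + z_sim) / (1.0 + z_src)
    d_src = luminosity_distance(z_src)
    d_sim = luminosity_distance(z_sim)
    wave_sim = wave * stretch                            # wavelengths stretch with (1+z)
    flux_sim = flux_lam / stretch * (d_src / d_sim) ** 2  # dimming of f_lambda
    return wave_sim, flux_sim

# Toy SED: a Gaussian bump, moved from z = 0.5 to z = 3
wave = np.linspace(3000.0, 9000.0, 200)
flux = np.exp(-0.5 * ((wave - 6000.0) / 500.0) ** 2)
wave_hi, flux_hi = redshift_sed(wave, flux, 0.5, 3.0)
```

With this scaling the wavelength-integrated flux multiplied by the squared luminosity distance is unchanged, which is a convenient sanity check on any implementation.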
The simulated data are generated with a uniform distribution with respect to $\zeta = \log(1 + z)$, which is adopted as our output variable instead of the simple redshift (Baldry 2018). This adapts to the evaluation scheme commonly used in most photo-z studies, where the redshift estimation error is defined as $d\zeta = dz/(1 + z)$. Using $d\zeta$ as a reasonable photo-z error is ascribed to different photometric uncertainties for a given set of broad-band filters, which typically have an approximately constant resolution of $R = \lambda_{\rm obs}/\Delta\lambda \sim$ const., where $\lambda_{\rm obs}$ is an observed wavelength. $d\zeta$ thus shows a constant error if an observational error of $d\lambda$ purely scales with the filter spacing $\Delta\lambda$ while $\lambda_{\rm obs}$ scales with $(1 + z)$.
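The link between a logarithmic redshift variable and the usual normalised redshift error can be made explicit with a short worked derivation. It assumes a natural logarithm (as in Baldry 2018) and uses $\zeta$ for $\log(1+z)$ and $\lambda_{\rm rest}$ for the rest-frame wavelength of a spectral feature; these are illustrative symbol choices, and the steps are a sketch rather than a reproduction of the paper's own equations.

```latex
\zeta \equiv \ln(1+z)
\;\Longrightarrow\;
d\zeta = \frac{dz}{1+z},
\qquad
\lambda_{\rm obs} = \lambda_{\rm rest}\,(1+z)
\;\Longrightarrow\;
\ln\lambda_{\rm obs} = \ln\lambda_{\rm rest} + \zeta
\;\Longrightarrow\;
d\zeta = \frac{d\lambda_{\rm obs}}{\lambda_{\rm obs}}
\sim \frac{\Delta\lambda}{\lambda_{\rm obs}} = \frac{1}{R}.
```

A filter set with roughly constant resolution $R$ therefore corresponds to a roughly constant uncertainty in the logarithmic variable, independent of redshift.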
The uniform distribution of simulated $\zeta$'s ensures that the number density of the training data is constant at any $\zeta$, which is required for developing a photo-z network whose error estimations are not biased in the entire redshift range. One of our goals is to build a model that produces reliable redshift PDFs as well as single-point estimates, which is implemented by outputting probabilities for 350 $\zeta$ class bins, as described in §4.1. We generate multiple mock SEDs from a given low-z source by randomly drawing $\zeta$ in each of 35 equally discretised bins, whose resolution is 10 times lower than the output probability vector. The sample size of our mock SEDs consequently results in $\sim 600,000$.
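The stratified draw described above can be sketched as follows. The sketch assumes a natural logarithm, bins that span $0 \le \log(1+z) \le \log(6)$, and one uniform draw per coarse bin; the function names and the exact bin edges are hypothetical and are not taken from the authors' code.

```python
import numpy as np

Z_MAX = 5.0
N_DRAW_BINS = 35    # coarse bins used to draw simulated redshifts
N_CLASS_BINS = 350  # output probability bins (10x finer)

def draw_zeta_per_bin(rng, z_max=Z_MAX, n_bins=N_DRAW_BINS):
    """Draw one value of log(1+z) uniformly inside each coarse bin."""
    edges = np.linspace(0.0, np.log1p(z_max), n_bins + 1)
    return rng.uniform(edges[:-1], edges[1:])

def zeta_to_class(zeta, z_max=Z_MAX, n_class=N_CLASS_BINS):
    """Map log(1+z) values to the index of the fine output bin."""
    width = np.log1p(z_max) / n_class
    idx = np.floor(np.asarray(zeta) / width).astype(int)
    return np.clip(idx, 0, n_class - 1)

rng = np.random.default_rng(0)
zeta = draw_zeta_per_bin(rng)   # 35 draws for one low-z source
z_sim = np.expm1(zeta)          # corresponding simulated redshifts
labels = zeta_to_class(zeta)    # class labels for training

n_mock = 17891 * N_DRAW_BINS    # 626,185 mock SEDs, i.e. ~600,000
```

Drawing once per bin keeps the training density flat in the logarithmic variable while still randomising the redshift within each bin.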
Our knowledge of the underlying galaxy SEDs is exclusively attributable to objects observed with the FourStar medium-band filters. The high number of filters in these photometric data ensures the individuality of each empirical template, which would be otherwise standardised into a small set of simplified representations. This allows us to efficiently generate realistic high-z SEDs even in the absence of large amounts of data about the distant universe. We note that the current framework does not take into consideration the difference in population between low-z and high-z galaxies due to their evolution. Handling this issue in a robust manner is beyond the scope of this paper, but our input fluxes are normalised to remove magnitude information, as described in §4.1, which should alleviate the impact on the model's performance.

Photometry simulations with noise application
The photometry for the mock SEDs is simulated using a transmission curve for each filter adopted in S16, producing a noiseless flux per unit wavelength $F_i$ for the band $i$. Establishing a realistic photometric sample then requires artificially applying an observational error to each noiseless flux. The fundamental concept of our fiducial noise model (which we call 'empirical') is to introduce actual observational noise for one test source $j$ into simulated photometry of each mock SED.
We explore the most appropriate noise realisation for a given simulated SED in comparison with the observed data. This requires a measure of similarity in SED shape $S_j$ between noiseless simulations $F_i$ and noised observations $(\tilde F_{i,j}, \tilde E_{i,j})$, where $(\tilde F_{i,j}, \tilde E_{i,j})$ is a set of flux and error observed for the band $i$ from the source $j$. An approximate SED shape is captured by normalising all the fluxes and errors of each object by its own $K_s$-band photometric measurement. Each pair of simulated and catalogue sources is then compared based on these normalised photometric data, $f_i$ and $(\tilde f_{i,j}, \tilde e_{i,j})$ (here we denote normalised data with lower case).
For each mock galaxy, the similarity between $f_i$ and $(\tilde f_{i,j}, \tilde e_{i,j})$ is measured by assuming each simulated flux $f'_{i,j}$ follows a Gaussian distribution given a standard deviation $\tilde e_{i,j}$. EAZY also adapts to template mismatch with a rest-frame template error function $\sigma_{\rm te}(\lambda)$. The total flux uncertainty $\sigma_{i,j}$ therefore combines $\tilde e_{i,j}$ with the template error evaluated at $\lambda_{i,\rm rest}$, the rest-frame central wavelength of the filter $i$, expressed with the observed wavelength $\lambda_i$ as $\lambda_{i,\rm rest} = \lambda_i/(1 + z_{\rm sim})$.
We thus assume $f'_{i,j} \sim \mathcal{N}(f_i, \sigma_{i,j}^2)$ to estimate a probability $p_{i,j}$ that the observed $\tilde f_{i,j}$ is realised, given by
$$p_{i,j} = \frac{1}{\sqrt{2\pi}\,\sigma_{i,j}} \exp\left[-\frac{(\tilde f_{i,j} - f_i)^2}{2\sigma_{i,j}^2}\right].$$
The product of these probabilities across the bands then measures the stochastic similarity of the mock galaxy to the catalogue source $j$:
$$P_j = \prod_{i} p_{i,j},$$
where $i$ covers the $N_j$ broad- and medium-band filters adopted in S16 which do not contain missing values. The similarity measure $S_j$ consequently needs to be defined in a form that is generally applicable to comparing any pairs, since the effective number of filters $N_j$ is not fixed for all the catalogue sources but depends on $j$.
One reasonable measurement is therefore a form of $P_j$ normalised for $N_j$, which can function as a probability of realisation for an object $j$.
We additionally adopt a magnitude prior $p(z|m)$ following Straatman et al. (2016) for computing the probability $p(z_{\rm sim}, j)$ of drawing a test source $j$, which combines $S_j$ with the prior evaluated at $m_j$, the $K_s$-band apparent magnitude of the source. One catalogue object is randomly picked with a probability $p(z_{\rm sim}, j)$, whose errors $\{\tilde e_{i,j}\}_i$ are applied to each simulated SED including its missing values. The noised flux $F'_{i,j}$ is then obtained by denormalising $f'_{i,j} \sim \mathcal{N}(f_i, \tilde e_{i,j}^2)$. We also establish simpler noise models to explore the benefit of our empirical one:
(i) Noiseless: all the noiseless simulated fluxes are fed to the photo-z network as inputs, given by $F'_i = F_i$.
(ii) Missing: for each mock SED, we randomly draw one test source from the spec-z catalogue whose missing values for some band filters are directly incorporated into the simulated photometric data.
(iii) Const: photometry for each mock SED is performed with a constant noise $\sigma_c$ over the entire wavelength range. $\sigma_c$ is obtained by assuming an arbitrarily selected S/N for $K_s$-band photometry, where S/N is a random variable ranging between 3 and 30. Each noiseless flux point then varies following a Gaussian distribution with $F'_i \sim \mathcal{N}(F_i, \sigma_c^2)$, which also reflects the missing values in the same way as the Missing model (ii).
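The matching step of the empirical noise model can be sketched in code. Several ingredients below are assumptions made for illustration only, since their exact form is not reproduced here: the total variance is taken as the quadrature sum of the observed error and the template error fraction times the mock flux, the similarity is taken as the geometric mean of the per-band Gaussian densities (so that it does not depend on the number of valid bands), and the drawing weight is taken as the product of the similarity and a user-supplied prior. All function names are hypothetical.

```python
import numpy as np

def similarity(f_mock, f_obs, e_obs, te_frac):
    """Per-source similarity between one mock SED and catalogue sources.

    f_mock : (n_band,) normalised noiseless mock fluxes
    f_obs, e_obs : (n_src, n_band) normalised fluxes/errors, NaN = missing
    te_frac : (n_band,) fractional template error at each rest wavelength
    """
    valid = np.isfinite(f_obs) & np.isfinite(e_obs)
    with np.errstate(invalid="ignore", divide="ignore"):
        # ASSUMPTION: observed and template errors add in quadrature
        var = e_obs ** 2 + (f_mock * te_frac) ** 2
        log_p = -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (f_obs - f_mock) ** 2 / var
    log_p = np.where(valid, log_p, 0.0)  # missing bands do not contribute
    n_valid = valid.sum(axis=1)
    # ASSUMPTION: geometric mean over valid bands; these are densities,
    # so they are used only as relative weights
    sim = np.exp(log_p.sum(axis=1) / np.maximum(n_valid, 1))
    return np.where(n_valid > 0, sim, 0.0)

def draw_source(rng, sim, prior):
    """Pick one catalogue source with probability proportional to sim * prior."""
    w = sim * prior
    return rng.choice(len(w), p=w / w.sum())

def apply_noise(rng, f_mock, e_src):
    """Perturb the mock fluxes with the drawn source's errors; keep its gaps."""
    ok = np.isfinite(e_src)
    noisy = rng.normal(f_mock, np.where(ok, e_src, 0.0))
    return np.where(ok, noisy, np.nan)

rng = np.random.default_rng(1)
f_mock = np.array([1.0, 2.0, 3.0])
f_obs = np.array([[1.0, 2.0, 3.0], [3.0, 0.5, np.nan]])
e_obs = np.array([[0.1, 0.1, 0.1], [0.1, 0.1, np.nan]])
sim = similarity(f_mock, f_obs, e_obs, te_frac=np.zeros(3))
j = draw_source(rng, sim, prior=np.ones(2))
noisy = apply_noise(rng, f_mock, e_obs[j])
```

In this toy case the first catalogue source has the same SED shape as the mock galaxy, so it receives essentially all of the drawing weight; the second source would pass its missing third band on to the simulated photometry if it were drawn.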
Fig. 1 shows the simulated photometry for an example mock SED, whose noised fluxes are generated with the four different noise models. The Missing model (ii) drops one flux value as missing, which is represented by the red cross, while the Const model (iii) further adds constant errors to the remaining fluxes. More realistic photometry can be simulated with the Empirical model (iv), where empirical noise extracted from the test sample is applied to the noiseless fluxes.
We then train the CNN models, whose architecture is introduced in §4.3, on the different simulated datasets for CDFS, each generated with one of the four noise models. Testing them on the same spec-z catalogue sample allows us to explore the most effective noise model. The performance of each CNN is evaluated with the accuracy $\sigma_{\rm NMAD}$ and the outlier rate $\eta_{0.2}$ of photo-z point estimates, as described in §4.4. The Const model (iii) shows further improvements by applying simple artificial noise to the noiseless fluxes, reducing $\sigma_{\rm NMAD}$ and $\eta_{0.2}$ to $\sim 0.013$ and $\sim 2.3\%$. Significantly better scores can be obtained as well by training the model on more realistic mock data generated with the Empirical model (iv), which result in $\sigma_{\rm NMAD} \sim 0.009$ and $\eta_{0.2} \sim 1.5\%$. These results indicate that the empirical noise application shows the smallest disparity between simulations and observations. We therefore conclude that the Empirical model (iv) can produce mock photometric data which best represent the test catalogue samples. The empirical treatment of noise in the training set further improves the precision of PDFs derived for the query set, which can translate into matching the error scales of the distinct samples (Wolf 2009). Effectively, the combination of our chosen noise model, our loss function, and the nonlinearity of the neural networks may allow the model to treat the error scale as a parameter and optimise it such that the smoothing scale of the combined error more effectively matches that of our target data.
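The two point-estimate metrics quoted above can be computed with a few lines of code. The paper's own definitions are given in §4.4 and are not reproduced here, so this sketch assumes conventions that are widely used in photo-z work: the normalised median absolute deviation with a factor of 1.48, and an outlier threshold of 0.2 in $|\Delta z|/(1 + z_{\rm spec})$. The function names are hypothetical.

```python
import numpy as np

def sigma_nmad(z_phot, z_spec):
    """Normalised median absolute deviation (assumed convention)."""
    z_phot = np.asarray(z_phot, dtype=float)
    z_spec = np.asarray(z_spec, dtype=float)
    dz = z_phot - z_spec
    return 1.48 * np.median(np.abs(dz - np.median(dz)) / (1.0 + z_spec))

def outlier_rate(z_phot, z_spec, threshold=0.2):
    """Fraction of objects with |dz| / (1 + z_spec) above the threshold."""
    z_phot = np.asarray(z_phot, dtype=float)
    z_spec = np.asarray(z_spec, dtype=float)
    return np.mean(np.abs(z_phot - z_spec) / (1.0 + z_spec) > threshold)

# Toy example: three perfect estimates and one catastrophic failure
z_spec = np.array([0.5, 1.0, 2.0, 3.0])
z_phot = np.array([0.5, 1.0, 2.0, 4.5])
nmad = sigma_nmad(z_phot, z_spec)
eta = outlier_rate(z_phot, z_spec)
```

Because the first metric is median based, the single catastrophic failure in this toy sample affects only the outlier rate, which is why the two numbers are usually reported together.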
We randomly generate five realisations of empirical noise based on the same mock SED sample for each field. This provides stochastically different photometric samples, each constructed by matching the given simulated galaxies with randomly selected catalogue data following the relative probability $p(z_{\rm sim}, j)$. They are independently used for training different networks, whose predictions are subsequently combined with the ensemble learning method, as discussed in §4.7. We note that missing values present in the test catalogue samples are incorporated into the photometry simulation. This allows our training set to intrinsically contain information on the corresponding missing data, so that evaluating the model's performance on real data does not require imputing the missing values in the test set.

ML PHOTO-Z MODEL
We can assess the performance of photo-z networks on the S16 test catalogue by first training them with the mock data. §4.1 describes the input and output, which are designed for yielding redshift PDFs from normalised photometric data. In §4.2 and §4.3, the architectures of two different photo-z networks are introduced: a fully connected neural network (FCNN) and a CNN-based model HAYATE. §4.4 discusses commonly-used evaluation metrics for photo-z point estimates and their PDFs. §4.5 describes the fiducial training configuration for each network, whose lower-level output PDFs are combined with the ensemble learning method, as discussed in §4.7. In §4.6, we discuss the benefit of transfer learning using spec-z's for further improvements.

Inputs and outputs
Our training set contains simulated high-z galaxies which mirror the population of low-z ZFOURGE sources; no evolution of the galaxy population is accounted for. We thus remove information on magnitudes from the input, which are critically influenced by the formation and evolution of galaxies and highly correlated with redshift. Each galaxy is consequently identified purely based on its SED shape. Our input variables are thus primarily flux ratios, which are obtained for each galaxy by normalising photometric measurements with its total $K_s$-band flux provided by S16. The photometry is a product of stacked FourStar/$K_s$-band and deep pre-existing $K$-band imaging. The super-deep image achieves a maximum limiting depth at 5σ significance of 26.2 − 26.6, 25.5 and 25.7 mag in CDFS, COSMOS and UDS, respectively.
Testing a trained model on the spec-z catalogue also requires handling missing values, which are inevitably present in real data. We adopt a standard approach of imputation by a constant value, replacing all missing values in the normalised input data with -1. The negative substitute value can exclusively represent a lack of effective data distinguished from the other flux measurements, which should be zero or more. As depicted in Fig. 1, our missing data replacement strategy, represented by the Missing model (ii) described in §3.2, markedly improves the model's performance compared to other imputation methods. We note that each data point can potentially represent no flux measurement as distinct from a missing value, an important distinction when mapping from photometry to redshifts. Using a zero value is therefore not appropriate as a placeholder for missing data. We could also employ a more complex method to substitute missing values, depending on the individual dataset, such as interpolation/extrapolation and k-nearest neighbours. As these approaches generate fake (though plausible) values for imputation, they could potentially degrade the precision of estimated photo-z's.
The input fluxes are also combined with their observational errors, which are used for weighting each residual between the template and observed fluxes in the EAZY fitting algorithm (Brammer et al. 2008). The supplementary information on the uncertainty of each photometric measurement can enhance the robustness of the colour-redshift mapping predicted by photo-z networks (Zhou et al. 2021). The number of input variables $N_{\rm input}$ is thus twice the number of observational filters $N_{\rm filter}$, with $N_{\rm input}$ = 76, 70 and 52 for CDFS, COSMOS and UDS, respectively.
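The input construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_input`, the choice to normalise the errors by the same total flux, and the choice to flag an error as missing whenever its flux is missing are all assumptions made for the example.

```python
import math

MISSING = -1.0  # constant placeholder for missing measurements, as in the text


def build_input(fluxes, errors, total_flux):
    """Return flux ratios followed by normalised errors (length 2 * N_filter).

    Assumption: errors are normalised by the same total flux as the fluxes,
    and a band is flagged as missing in both halves if either value is absent.
    """
    ratios, rel_errors = [], []
    for f, e in zip(fluxes, errors):
        if f is None or e is None or math.isnan(f) or math.isnan(e):
            ratios.append(MISSING)
            rel_errors.append(MISSING)
        else:
            ratios.append(f / total_flux)
            rel_errors.append(e / total_flux)
    return ratios + rel_errors


# Three hypothetical bands, the second one unobserved
example = build_input([2.0, float("nan"), 0.5], [0.2, 0.1, 0.05], total_flux=4.0)
```

With 38 filters, as for CDFS, the returned vector has 76 entries, matching the $N_{\rm input}$ quoted above.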
Our ML approach is to cast the photo-z estimation task into a classification problem by binning the target redshift range into discretised classes and returning a list of probabilities by which an example is found in a given target bin. Multiple-bin regression has been used with template fitting methods in the past, but the benefit of this approach has been demonstrated in recent ML photo-z studies (Pasquet-Itam & Pasquet 2018; Pasquet et al. 2019; Lee & Shin 2021), generally improving the photo-z accuracy. In the context of a model's development, the probabilistic scrutiny of the redshift PDF allows one to explore the causes of poor performance on some specific objects. Reproducing realistic redshift PDFs as well as single-point estimates could potentially contribute to improving cosmological analyses (e.g., Mandelbaum et al. 2008; Myers et al. 2009; Palmese et al. 2020).
Each PDF produced by our ML models is an output of the softmax function, which contains probabilities in $\zeta = \log(1 + z)$ classes with a uniform distribution within $0 < \zeta \lesssim 1.8$, corresponding to the redshift range $0 < z < 5$. The resolution of $\zeta$ bins approximates the PDF of $z$ provided by EAZY as the output vector. The configuration adopted by Straatman et al. (2016) lets the algorithm explore a grid of redshifts with a step of $\delta z = 0.005(1 + z)$. The constant $\zeta$ bin width can be thus expressed as $\delta\zeta \sim \delta z/(1 + z) = 0.005$, which leads the photo-z network to output a vector of 350 probabilities as a PDF of $\zeta$ in our target redshift range.
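The binning can be illustrated with a short sketch. It assumes a natural logarithm (so that $z = 5$ maps to $\ln 6 \approx 1.79$) and 350 equal-width bins spanning exactly $[0, \ln 6]$; the precise bin edges used by the authors are not stated, and the helper names are hypothetical.

```python
import math

N_BINS = 350
Z_MAX = 5.0
ZETA_MAX = math.log1p(Z_MAX)    # ln(6), about 1.79
BIN_WIDTH = ZETA_MAX / N_BINS   # about 0.0051, close to the EAZY step of 0.005


def redshift_to_class(z):
    """Index of the ln(1 + z) bin that contains redshift z."""
    zeta = math.log1p(z)
    return min(int(zeta / BIN_WIDTH), N_BINS - 1)


def class_to_redshift(k):
    """Redshift at the centre of bin k."""
    return math.expm1((k + 0.5) * BIN_WIDTH)
```

A constant width in this variable corresponds to a redshift step proportional to $1 + z$, which is why the grid mimics the EAZY redshift grid.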

Optimisation of a baseline FCNN model
We select a fully connected neural network (FCNN) as a baseline model, since it is commonly applied in photo-z estimation works. Tables B1 and B2 summarise some previous works which apply FCNNs to photo-z estimation, where the network was trained on spectroscopic samples in most cases. This requires a huge amount of observational data and consequently results in a limited target redshift range up to no more than $z$ ∼ 1 − 2. The number of filter bands used for photometric data is seldom as many as ∼ 10 as well, since cross-matching multiple catalogues tends to significantly reduce the sample size.
The updated S16 contains a much larger amount of photometric information with ∼ 40 filter bands, while our simulation method allows for training networks on sufficient mock data in a broader redshift range up to 5. The architecture of the baseline FCNN should thus reflect the larger-scale configuration with more trainable parameters. Other relevant works that have introduced photo-z based ML models trained with simulations typically adopt huge networks consisting of many layers and neurons: for example, {$N_{\rm input}$ : 600 : 400 : 250 × 13 : $N_{\rm output}$} in Eriksen et al. (2020) and {$N_{\rm input}$ : 512 : 1024 : 2048 : 1024 : 512 : 256 : 128 : 64 : 32 : $N_{\rm output}$} in Ramachandra et al. (2022). We perform k-fold cross validation to explore the most appropriate architecture and optimise its hyperparameters by training the models on the simulated data generated with the Empirical noise model (iv), as described in §3.2.
Our photo-z code is designed for classifying input photometric data into 350 $\zeta$ bins, providing the output vector that represents a PDF of $\zeta$. We thus employ the standard categorical cross-entropy (CCE) loss function (Baum & Wilczek 1987; Solla et al. 1988)
$$L_{\rm CCE} = -\sum_{i} y_i \log s_i, \qquad (8)$$
where $y_i$ and $s_i$ are the ground truth and the score returned by the softmax function for each class $i$. The redshift classifier is tuned so that the $\zeta$-prediction accuracy is maximised and the loss is minimised using one-hot encoding with $y_i = 1$ only for a true class.
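For a one-hot target, the CCE loss reduces to the negative log of the softmax score assigned to the true class. The following sketch shows this for a single object; the function names and the small `eps` guard against log(0) are illustrative choices, not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw network outputs."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cce_loss(scores, true_class, eps=1e-12):
    """Categorical cross-entropy for a one-hot target: -log(score of true class)."""
    return -math.log(scores[true_class] + eps)


# Toy output with four classes, the second one favoured by the network
scores = softmax([0.0, 2.0, 0.0, 0.0])
```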
For each FCNN we consider two types of hyperparameters relating to the architecture, the number of layers ($N_{\rm layer}$) and the number of neurons in each layer ($N_{\rm neuron}$), as well as those relating to the algorithm, namely the learning rate and the dropout rate. The latter (algorithmic) parameters are optimised for each set of the architectural ones. Fig. 2 shows the results of hyperparameter optimisation for the FCNN, presenting the validation accuracy and loss for each combination of $N_{\rm layer}$ and $N_{\rm neuron}$ within the ranges $N_{\rm layer} \in [1, 9]$ and $N_{\rm neuron} \in [1, 500]$. The accuracy is defined as the percentage of predicted redshift classes that match the true ones. Note that we do not expect the accuracy to reach 100% even when performing well, since we expect scatter into neighbouring redshift bins as photo-z's are intrinsically uncertain, and some redshifts will lie closer to the bin boundaries. Nevertheless, for a fixed validation sample, it is a good relative indicator. We explore other metrics below.
Each panel presents changes in accuracy scores with $N_{\rm neuron}$ for a given $N_{\rm layer}$. We find that the accuracy levels off with increasing $N_{\rm neuron}$ if the individual layers contain sufficient neurons. This is not affected by the number of layers in general, with the accuracy converging to ≳ 30%. The minimum loss can be attained by the model with ($N_{\rm neuron}$, $N_{\rm layer}$) = (500, 3), with no significant improvement from increasing the number of trainable parameters with larger $N_{\rm neuron}$ or $N_{\rm layer}$. The architecture of our FCNN model is therefore constructed from three layers with 500 neurons, since a smaller architecture is preferable to a larger one for the same performance. The number of weights to be trained is ∼ 700,000.
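The quoted number of weights can be checked with simple arithmetic. The sketch below counts only the connection weights of the CDFS configuration (76 inputs, three hidden layers of 500 neurons, 350 outputs); biases and batch-normalisation parameters are deliberately left out, which is an assumption about what the ∼ 700,000 figure refers to.

```python
def count_weights(layer_sizes):
    """Number of connection weights in a fully connected network (biases excluded)."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


# CDFS: 76 inputs, three hidden layers of 500 neurons, 350 output classes
cdfs_layers = [76, 500, 500, 500, 350]
n_weights = count_weights(cdfs_layers)  # 38000 + 250000 + 250000 + 175000
```

This gives 713,000 weights, consistent with the approximate value stated in the text.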
Fig. 3 visualises the overall architecture of the optimised baseline model with some details excluded. Each layer is followed by ReLU non-linearities, 5 per cent dropout and a batch normalisation layer.
The input flux ratios along with their observational errors are fed into the network with missing values included, which produces the softmax output of 350 $\zeta$ probabilities.
In the initial exploratory phase of this research other ML techniques were also tested, using a similar hyperparameter optimisation strategy. The performance of random forests (RFs) and support vector machines (SVMs) was examined with different sets of hyperparameters: the number of estimators and max depth for RFs and ($C$, $\gamma$) for SVMs, where $C$ controls the complexity of the decision surface while $\gamma$ sets the range influenced by a single data point. Each model was developed with its best hyperparameters, but underperformed the FCNN in that their validation accuracies only reached just under 30%. This indicates that neural networks are more appropriate for our photo-z estimation scheme than other major ML approaches. In particular, with neural networks we have the ability to optimise the loss function for PDF recovery (see §5.1.2).

Architecture of HAYATE
We further develop a CNN-based photo-z network and compare the performance of these different ML approaches. As before, the output is a probability vector on discretised redshift bins, which translates the regression problem into a classification task and provides redshift PDFs as well as their point estimates. The output PDF is produced by combining multiple networks independently trained with different configurations, representing an ensemble of stochastic variants for each test object.
We build HAYATE with the CNN architecture inspired by the VGG neural network (VGGNet; Simonyan & Zisserman 2015), one of the simplest CNN structures commonly used for image classification and object detection. The extended variant of the VGG model, VGG19, consists of 16 convolution layers with 5 max pooling layers followed by 3 fully connected layers and 1 softmax layer. It features an extremely small receptive field, a kernel of 3 × 3, which is the smallest size that can capture the neighbouring inputs. Stacking multiple 3 × 3 convolutions instead of using a larger receptive field leads to a deeper network, which is required for better performance (Emmert-Streib et al. 2020). VGG-based models have been successfully applied to astronomical images, e.g. for the identification of radio galaxies (Wu et al. 2019), classification of compact star clusters (Wei et al. 2020) and detection of major mergers (Wang et al. 2020).
The VGG network is fundamentally designed for handling higher-dimensional image data (with multiple colour channels) rather than 1D photometric data. It should thus be applied to photo-z estimation with a much smaller architecture, since the number of trainable parameters originally reaches up to ∼ 144 million. Zhou et al. (2021) have introduced a 1D CNN used for deriving spec-z's from spectral data, which can provide some insight into the application of CNNs to photo-z estimation. The input layer includes two channels of spectral data and errors, while the output layer contains multiple neurons representing the probability of each redshift interval. The spec-z analysis is thus performed as a classification task using the feature maps obtained through two convolutional layers, which are followed by two fully connected layers. The number of parameters is consequently far less than that of a CNN commonly used for image processing, totalling no more than ∼ 350,000.
The task of photo-z prediction can be treated in the same fashion but with input flux and output probability vectors of lower resolution than spec-z. We construct HAYATE as a simplified variant of the VGG network, whose architecture is illustrated in Fig. 4. The input 2 × $N_{\rm filter}$ matrix consists of 2 rows of flux ratios and normalised observational errors. It is convolved with a kernel of 2 × 3 using zero padding of size 1, which yields a 3 × $N_{\rm filter}$ matrix. Adopting zero padding in the column direction only, we then convolve it with a kernel of 3 × 3 to obtain a 1D vector of size $N_{\rm filter}$. The major components are a subsequent sequence of 6 convolutional layers with 32, 32, 64, 128, 256 and 512 kernels each. The fundamental concept of the VGG network is particularly reflected by a 1D kernel of size 3 used for a convolution operation in each layer. We connect the convolutional layers with batch normalisation, dropout and 1D max pooling layers. A fully connected layer with 350 neurons is placed at the end, each outputting a softmax probability of finding an object at a given $\zeta$.
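The shapes produced by the first two convolutions follow from the usual output-size formula. The sketch below assumes a stride of 1 (not stated in the text) and uses hypothetical helper names; it only tracks array shapes and does not implement the network.

```python
def conv_out(size, kernel, pad):
    """Output length of a stride-1 convolution along one axis."""
    return size + 2 * pad - kernel + 1


def hayate_front_end(n_filter):
    """Shapes after the two initial convolutions described in the text."""
    # 2 x N input, 2 x 3 kernel, zero padding of 1 on both axes -> 3 x N
    rows, cols = conv_out(2, 2, 1), conv_out(n_filter, 3, 1)
    step1 = (rows, cols)
    # 3 x 3 kernel, zero padding along the column (filter) axis only -> 1 x N
    step2 = (conv_out(rows, 3, 0), conv_out(cols, 3, 1))
    return step1, step2
```

For CDFS with 38 filters this gives (3, 38) followed by (1, 38), i.e. a 1D vector of size $N_{\rm filter}$ that feeds the subsequent 1D convolutional layers.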
We have explored more efficient architectures in supplementary experiments, only to conclude that the one mentioned above should perform best among several simple CNNs. We also note that the number of trainable weights for HAYATE is approximately the same as that of the baseline FCNN described in §4.2.

Evaluation metrics
All the ML photo-z's are estimated from the redshift PDFs with the same method as implemented in EAZY by Straatman et al. (2016). Each point estimate is obtained by marginalising exclusively over the peak of the redshift PDF which shows the largest integrated probability. This adapts to the degeneracy of template colours with redshift, which produces a PDF with multiple peaks.
The quality of photo-z estimates is evaluated based on the residuals with respect to their spec-z's, which are given by
$$\Delta z = \frac{z_{\rm phot} - z_{\rm spec}}{1 + z_{\rm spec}},$$
where $z_{\rm phot}$ and $z_{\rm spec}$ are photometric and spectroscopic redshifts.
Each ML photo-z is immediately recovered from a point estimate of $\zeta$, expressed as $z_{\rm phot} = e^{\zeta} - 1$. We employ the following commonly used indicators as statistical metrics to evaluate the model's performance in single-point estimations:
• $\sigma_{NMAD}$: normalised median absolute deviation of $\Delta z$, which is robust to $\Delta z$ outliers.
• $\eta_{0.2}$: outlier rate, i.e. the fraction of objects with $|\Delta z| > 0.2$.
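The point-estimate metrics can be sketched as below. The exact normalisation of $\sigma_{NMAD}$ is not recoverable from the text here, so the code assumes one common convention, $1.48 \times {\rm median}(|\Delta z - {\rm median}(\Delta z)|)$; other works omit the inner median. The outlier threshold of 0.2 is inferred from the subscript of $\eta_{0.2}$.

```python
from statistics import median


def delta_z(z_phot, z_spec):
    """Normalised residuals (z_phot - z_spec) / (1 + z_spec)."""
    return [(p - s) / (1.0 + s) for p, s in zip(z_phot, z_spec)]


def sigma_nmad(dz):
    """Assumed convention: 1.48 * median(|dz - median(dz)|)."""
    centre = median(dz)
    return 1.48 * median(abs(d - centre) for d in dz)


def outlier_rate(dz, threshold=0.2):
    """Fraction of objects with |dz| above the threshold."""
    return sum(abs(d) > threshold for d in dz) / len(dz)


# Four hypothetical objects, the last one a catastrophic failure
dz = delta_z([1.0, 2.1, 0.5, 3.0], [1.0, 2.0, 0.5, 1.0])
```

In this toy sample the single catastrophic failure raises the outlier rate to 25% but barely affects $\sigma_{NMAD}$, which is the sense in which the latter is robust to outliers.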
We also use the probability integral transform (PIT; Polsterer et al. 2016) to properly estimate the calibration of the redshift PDF $p(z)$ generated by different photo-z models, which is defined by
$${\rm PIT}_i = \int_{0}^{z_{{\rm spec},i}} p_i(z)\,dz,$$
where $z_{{\rm spec},i}$ corresponds to the true redshift of the test source $i$. If the predicted PDFs are well calibrated with respect to the spec-z's, the histogram of the PIT values, or its PDF $f(x)$, is equivalent to the uniform distribution $U(0, 1)$. The flat distribution indicates that the predicted PDFs are neither biased, too narrow nor too broad. Conversely, underdispersed and overdispersed PDFs exhibit U-shaped and centre-peaked distributions, respectively, while a systematic bias present in the PDFs is represented by a slope in the PIT distribution.
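For a binned PDF, the PIT is simply the cumulative probability up to the spectroscopic redshift. The sketch below assumes the probability is spread uniformly within each bin and that the bin edges are given in redshift; both are illustrative choices, as is the function name.

```python
def pit_value(pdf, edges, z_spec):
    """CDF of a binned PDF evaluated at z_spec (linear interpolation within a bin)."""
    total = sum(pdf)
    cdf = 0.0
    for p, lo, hi in zip(pdf, edges, edges[1:]):
        if z_spec >= hi:
            cdf += p                                  # whole bin lies below z_spec
        elif z_spec > lo:
            cdf += p * (z_spec - lo) / (hi - lo)      # partial bin
            break
        else:
            break
    return cdf / total


# Toy PDF with three bins covering 0 < z < 3
toy_pdf, toy_edges = [0.2, 0.5, 0.3], [0.0, 1.0, 2.0, 3.0]
```

Collecting `pit_value` over a test sample and plotting the histogram gives the diagnostic described above.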
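The following evaluation metrics are used for quantifying the global property of output PDFs:
• KL: Kullback-Leibler divergence between $f(x)$ and $U(0, 1)$.
• CvM: score of a Cramér-von Mises (Cramér 1928) test,
$${\rm CvM} = \int_{-\infty}^{\infty} \left[ F(x) - F_U(x) \right]^2 dF_U(x),$$
where $F(x)$ and $F_U(x)$ are cumulative distribution functions (CDFs) of $f(x)$ and $U(0, 1)$, respectively. This corresponds to the mean-squared difference between the CDFs of the empirical and true PDFs of PIT.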
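Both global metrics can be estimated directly from a list of PIT values. The sketch below is one possible discretisation: KL is computed from a 10-bin histogram relative to the uniform distribution (the direction of the divergence and the bin count are assumptions), and CvM is the mean-squared difference between the empirical CDF and the identity line evaluated on a regular grid.

```python
import math


def pit_histogram(pits, n_bins=10):
    """Fraction of PIT values falling in each of n_bins equal bins on [0, 1]."""
    counts = [0] * n_bins
    for x in pits:
        counts[min(int(x * n_bins), n_bins - 1)] += 1
    return [c / len(pits) for c in counts]


def kl_from_uniform(pits, n_bins=10):
    """Discrete KL divergence of the PIT histogram from the uniform histogram."""
    q = 1.0 / n_bins
    return sum(p * math.log(p / q) for p in pit_histogram(pits, n_bins) if p > 0)


def cvm_from_uniform(pits, n_grid=1000):
    """Mean-squared difference between the empirical PIT CDF and the uniform CDF."""
    ordered = sorted(pits)
    total, j = 0.0, 0
    for g in range(n_grid):
        x = (g + 0.5) / n_grid
        while j < len(ordered) and ordered[j] <= x:
            j += 1
        total += (j / len(ordered) - x) ** 2
    return total / n_grid


flat = [(i + 0.5) / 1000 for i in range(1000)]   # perfectly calibrated case
peaked = [0.5] * 1000                            # all PIT values at the centre
```

Both metrics vanish for the flat sample and grow for the centre-peaked one, which would correspond to overly broad PDFs.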
The reliability of individual PDFs with respect to spec-z's is represented by the Continuous Ranked Probability Score (CRPS; Hersbach 2000; Polsterer et al. 2016), which is given by
$${\rm CRPS}_i = \int_{-\infty}^{\infty} \left[ C_i(z) - C_{{\rm spec},i}(z) \right]^2 dz, \qquad (14)$$
where $C_i(z)$ and $C_{{\rm spec},i}(z)$ are CDFs of $p_i(z)$ and $z_{{\rm spec},i}$ for the source $i$, respectively. $C_{{\rm spec},i}(z)$ here corresponds to $H(z - z_{{\rm spec},i})$, where $H(z - z_{{\rm spec},i})$ is the Heaviside step function, which gives 0 for $z < z_{{\rm spec},i}$ and 1 for $z \geq z_{{\rm spec},i}$. It reflects the simplest form for the unknown true distribution of $z$, given by the Dirac delta function $\delta(z - z_{{\rm spec},i})$. The CRPS thus represents the distance between $C_i(z)$ and $C_{{\rm spec},i}(z)$, or the difference between the empirical and ideal redshift PDFs. We thus assess the reliability of individual output PDFs with the median value of ${\rm CRPS}_i$, which is more robust to outliers than their mean value:
• CRPS: median of all CRPS values obtained from a sample,
$${\rm CRPS} = {\rm median}({\rm CRPS}_i).$$
We introduced the CRPS metric primarily because it can be used as part of the ANN optimisation process. Table 1 summarises the characteristics of these indicators. In summary, $\sigma_{NMAD}$ and CRPS reveal the quality of individual outputs while $\eta_{0.2}$, KL and CvM represent the global property obtained from the distribution of their key attributes.
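A discrete version of the CRPS for a binned PDF can be written as a sum over bins. The sketch below is an approximation under stated assumptions: the CDF and the step function are both evaluated at the upper edge of each bin, and the integration is carried out on a redshift grid (the authors may instead integrate over their $\ln(1+z)$ bins).

```python
from statistics import median


def crps(pdf, edges, z_spec):
    """Rectangle-rule estimate of the integral of (CDF - step at z_spec)^2."""
    total = sum(pdf)
    cdf, score = 0.0, 0.0
    for p, lo, hi in zip(pdf, edges, edges[1:]):
        cdf += p / total                       # CDF at the upper edge of the bin
        step = 1.0 if hi >= z_spec else 0.0    # Heaviside step at the same edge
        score += (cdf - step) ** 2 * (hi - lo)
    return score


edges = [0.0, 1.0, 2.0, 3.0]
sharp_right = crps([0.0, 1.0, 0.0], edges, z_spec=1.5)   # peak on the true bin
sharp_wrong = crps([0.0, 1.0, 0.0], edges, z_spec=2.5)   # peak on the wrong bin
broad = crps([1 / 3, 1 / 3, 1 / 3], edges, z_spec=1.5)   # flat PDF
sample_crps = median([sharp_right, sharp_wrong, broad])  # sample-level metric
```

The sharp PDF centred on the true redshift scores lowest, the flat PDF scores in between, and the confidently wrong PDF scores highest, which is the behaviour the loss term is meant to reward.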

Training process
The mock photometric data are divided into three parts representing training, validation and test datasets. The test sample contains 20% of the whole set of simulated galaxies, while the rest is split into the training and validation sets with 70% and 30% of the remaining data randomly selected, respectively. The individual networks are trained with a joint loss that combines the CCE loss $L_{\rm CCE}$ and the CRPS loss $L_{\rm CRPS}$, given by equations (8) and (14), respectively. The CCE loss is frequently used for multi-class classification problems and is responsible for the accuracy of single point estimates. The CRPS loss can function as a penalty for failing to produce reliable PDFs, which would otherwise be neglected in a classification task with the single CCE loss.
The joint loss is optimised by an Adam optimiser (Kingma & Ba 2015) in the training process, which is given by
$$L = \alpha L_{\rm CCE} + \beta L_{\rm CRPS},$$
where $\alpha$ and $\beta$ are the weights to the linear combination of the CCE and CRPS losses. We first explore the appropriate values for $\alpha$ and $\beta$ in a pre-training process so that the CCE and CRPS terms contribute equally to the total at loss convergence, with $\alpha L_{\rm CCE} \simeq \beta L_{\rm CRPS}$. This is achieved by updating them after each training epoch $t$ to obtain the coefficients $\alpha_{t+1}$ and $\beta_{t+1}$ used for the next training step. The training terminates with an early stopping method after 10 epochs of no improvement in the model's performance on the held-out validation set. We then train the network from scratch using the fixed coefficients $\alpha$ and $\beta$ obtained at convergence of the pre-training.

Transfer learning
We apply transfer learning to HAYATE to build an empirically fine-tuned model (HAYATE-TL) and see if it can exploit the spec-z information. In transfer learning, typically the last layers of a pre-trained network are re-trained on a different dataset while the rest of the layers' weights remain frozen (fixed). A pre-trained model is a saved network that has been trained on a large dataset, then learns new features from a distinct training sample in another domain or regime.
Here we fine-tune the last two convolutional layers, which have been trained on the simulated datasets, using the observed samples with spec-z information. It should be noted that re-training more layers does not show a significant improvement in the model's performance (in this case, and in general; see Kirichenko et al. 2022) and we thus allow only the last two layers to be trainable with spectroscopic observations. The estimated photo-z's of a test sample are to some degree dependent on which particular SEDs are included in the training and test sets. We implement a method of building a robust estimator by combining multiple PDFs for each object, which are produced by distinct models whose training sets do not contain the same object. The entire spec-z catalogue is sliced into 90% and 10% for training and test sets, respectively, which is repeatedly performed to provide 50 different 90-10 splits. Each test set source is then included in 5 different test samples, whose corresponding training sets are used for optimising 15 lower-level networks in the framework of ensemble learning, as discussed in §4.7. The output PDF obtained with transfer learning thus results in a combination of 75 different PDFs provided for each object.
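One way to obtain 50 splits in which every object is held out exactly 5 times is to repeat a 10-fold partition 5 times with different shuffles. The text does not state how the splits were generated, so the construction below, including the function name and seed, is an assumption that merely reproduces the quoted numbers.

```python
import random


def repeated_splits(n_objects, n_repeats=5, n_folds=10, seed=0):
    """Return (train, test) index lists; 5 repeats of 10 folds give 50 90-10 splits."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        order = list(range(n_objects))
        rng.shuffle(order)
        for f in range(n_folds):
            test = order[f::n_folds]            # every 10th object of this shuffle
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            splits.append((train, test))
    return splits


splits = repeated_splits(1000)

# Count how many test sets each object appears in
appearances = [0] * 1000
for _, test in splits:
    for i in test:
        appearances[i] += 1

pdfs_per_object = 5 * 15   # 5 held-out appearances x 15 networks per split
```

Each object is then predicted only by networks that never saw it during fine-tuning, and the 75 resulting PDFs are combined into the final output.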

Ensemble learning
Randomness appears in many aspects of the training process, which makes the weights of the model converge to different local minima of the loss function even if the datasets used are exactly the same. Prior to the training, splitting a dataset into training, validation and test sets is often done randomly depending on each experiment. The initial values of the weights are also randomised so that the training processes start with different initial states. During the training, the shuffled batches also lead to different gradient values across runs, while a subset of the neurons are randomly ignored by the dropout layers.
The effect of local minima can be reduced by performing Bootstrap AGGregatING (Bagging; Breiman 1996; Dietterich 1997), which integrates multiple models trained on different datasets that are constructed by sampling the same training set with replacement. The main principle behind the bagging algorithm is to build a generic model by combining a collection of weak learners that are independently trained with the uncorrelated subsets from the original training set. The composite strong learner can outperform a single model established on the original sample (Rokach 2010).
A random forest ensemble (Breiman & Schapire 2001) is commonly adopted in the field of ensemble learning, which is characterised by a number of decision trees, each trained on a different subset of the entire training sample. The benefit obtained from these techniques has been demonstrated for a wide range of regression and classification tasks in astronomy (e.g., Way & Srivastava 2006; Carrasco Kind & Brunner 2014; Kim et al. 2015; Baron & Poznanski 2017; Green et al. 2019). Some ML photo-z studies have succeeded in applying the construction of prediction trees and the RF techniques to improve the redshift estimation accuracy (Carrasco Kind & Brunner 2013; Cavuoti et al. 2017b).
We rather use a smaller subset of the full simulated data for training each network, instead of generating a bootstrapped sample of full sample size $N_{\rm train}$. The training sub-samples are constructed by partitioning the full data into 3, which ensures the independence of each subset while the training is computationally less intensive due to the smaller sample size. We thus train each network on a sub-sample of size $N_{\rm train}/3$, obtained from the 5 individual training sets of different noise realisations. The ensemble of multiple PDFs $p_{j,k}(z)$ is thus given by
$$p(z) = \frac{1}{15} \sum_{j=1}^{5} \sum_{k=1}^{3} p_{j,k}(z),$$
where $j$ is the index of the simulated dataset of different noise realisation while $k$ discriminates the sub-samples. It follows that the output PDF of each sample galaxy is produced by averaging 15 lower-level predictions, a typical example of which is shown in Fig. 5. This allows for outputting more robust and reliable PDFs than those obtained with a single network (Sadeh et al. 2016; Eriksen et al. 2020).
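The ensemble step is a plain average of the lower-level PDFs. The following sketch averages any number of binned PDFs bin by bin and renormalises the result; the function name and the toy inputs are illustrative only.

```python
def ensemble_pdf(pdfs):
    """Average a list of binned PDFs (e.g. 5 noise realisations x 3 sub-samples)."""
    n_models = len(pdfs)
    n_bins = len(pdfs[0])
    mean = [sum(p[k] for p in pdfs) / n_models for k in range(n_bins)]
    norm = sum(mean)
    return [m / norm for m in mean]


# 15 toy predictions over three bins: 5 favour the first bin, 10 the second
toy_predictions = [[1.0, 0.0, 0.0]] * 5 + [[0.0, 1.0, 0.0]] * 10
combined = ensemble_pdf(toy_predictions)
```

Disagreement between the individual networks broadens the combined PDF, which is how the ensemble counteracts overly narrow single-network predictions.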

RESULTS
We evaluate the performance of HAYATE on the spec-z samples in S16, particularly for CDFS and COSMOS, each containing ∼ 1000 galaxies with photometric data of ≲ 40 filters. It is also tested on the smaller sample from UDS for a supplementary experiment, although no more than 312 objects are available with 26 input fluxes provided for each. Table 2 gives an overview of the results for EAZY, HAYATE and HAYATE-TL along with the baseline FCNN. We also probe the benefit of learning from simulated data by training the CNN of the same architecture as HAYATE purely with the spec-z data from scratch.
Their performance is evaluated with the metrics for measuring the quality of photo-z point estimates ($\sigma_{NMAD}$ and $\eta_{0.2}$) and output PDFs (KL, CvM and CRPS), as summarised in Table 1. Each of the metrics is depicted in Fig. 6, separated by field. We compare our ML models' performance with EAZY, the underlying template fitting algorithm, whose 1σ range of the individual metric is represented by the shaded region in each panel.
§5.1 and §5.2 describe the performance of HAYATE and HAYATE-TL. In §5.3, we discuss the benefit of our simulation-based CNN method, which outperforms the other ML approaches. §5.4 presents example archetypes of photo-z outliers useful in exploring the limitations of, and potential improvements to, HAYATE when dealing with catastrophic errors in photo-z estimation.

Photo-z statistics
Table 2 compares the performance of HAYATE with EAZY. We see that HAYATE's point estimates are comparable to, or better than, those of EAZY, with $\sigma_{NMAD}$ significantly reduced from 1.14 to 0.96 and 1.53 to 1.42 for CDFS and COSMOS, respectively. Probing the distribution of the test data on a $z_{\rm spec} - \Delta z$ plane further provides insights into how accurate photo-z's can be attained. The upper row in Fig. 7 shows the results for CDFS, where the photo-z's derived with HAYATE ($z_{\rm HAYATE}$) and EAZY ($z_{\rm EAZY}$) are plotted on the left and middle panels, respectively. Figs. A1 and A2 also present the outcomes for COSMOS and UDS (Appendix A) in the same manner. We can see from these figures that the distributions of errors for the two methods are comparable.
We also compare the residuals of HAYATE and EAZY and see a strong correlation. The bottom row of each figure shows plots of $\Delta z_{\rm EAZY}$ vs. $\Delta z_{\rm HAYATE}$ for the test spec-z sample, which represent photo-z errors for EAZY and HAYATE expressed as $\Delta z_{\rm EAZY} = (z_{\rm EAZY} - z_{\rm spec})/(1 + z_{\rm spec})$ and $\Delta z_{\rm HAYATE} = (z_{\rm HAYATE} - z_{\rm spec})/(1 + z_{\rm spec})$. The data points are generally aligned along the diagonal identity line, which demonstrates the ability of HAYATE to reproduce photo-z estimates that could be provided by EAZY. The trained network replicates the high accuracy of the template fitting code but with an execution time ∼ 100 times faster. This indicates HAYATE can learn to function as a reliable and efficient emulator of EAZY. The resulting mock data should thus coincide with likely EAZY template fits of the corresponding observed data.
The results for CDFS and COSMOS show slight improvements in photo-z point estimates with $\sigma_{NMAD}$ reduced by 16% and 7%, respectively. The right panel of each upper row in Figs. 7, A1 and A2, which presents $\sigma_{NMAD}$ separately for sub-samples obtained by splitting the test set at a redshift threshold of $z_{\rm spec} = 1.3$, shows this is mainly attributable to low-z galaxies. This can be expected, since the underlying SED templates used for training are constructed with reliable low-z data. The interpolative nature of ML approaches further surpasses the one-on-one matching of the original template fitting by circumventing the individual template mismatch.
Another significant benefit of our ML method is the ability to generalise to galaxies at higher redshifts in the absence of a large body of high-z data. The photo-z precision obtained with HAYATE shows no significant difference from that with EAZY even in the extrapolated high-z regime $1.3 < z_{\rm spec} < 5$. The extensibility of the target redshift range ensures that the simulations are sufficiently effective beyond the training domain of firm underlying knowledge.
Applying ML approaches to the template fitting algorithm also produces a more robust photo-z estimator than EAZY. The outlier rate $\eta_{0.2}$ of estimated photo-z's significantly drops from 1.26% to 0.94%, 1.90% to 1.22% and 1.28% to 0.32% for CDFS, COSMOS and UDS, respectively, as shown in Table 2. HAYATE is therefore less prone to catastrophic failures in photo-z predictions, performing well on 14 test sources whose photo-z errors would be outliers if derived with EAZY. We classify them as Class A, along with other sample groups defined based on a set of $\Delta z_{\rm HAYATE}$ and $\Delta z_{\rm EAZY}$ for each galaxy as presented in Table 3. Class B, on the other hand, only contains 2 galaxies, which are photo-z outliers for HAYATE but not for EAZY. The 20 catalogue objects classified as Class C in Table 3 show catastrophic solutions for photo-z computation with both models. Their wrong photo-z estimates are, however, quite similar between the two different methods, and are plotted on the diagonal in each bottom panel of Figs. 7, A1 and A2. From visual inspection, they are clearly not well represented by the SED template set of EAZY, and indeed include some rare objects such as 10 AGNs and 3 dusty star-forming galaxies. Improving photo-z estimations for Class C objects thus requires extending the population of galaxy templates used by EAZY and for simulating our high-redshift sources. Brescia et al. (2019) also note that to increase the accuracy of AGN photo-z's with template-fitting methods, the inclusion of morphological information (extended or point-like) likely provides the biggest improvement. Example plots of inputs and outputs for these sources are shown in Fig. 11, which are further discussed in §5.4.
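The four classes can be expressed as a simple decision rule on the pair of residuals. The sketch below assumes that an outlier is defined by $|\Delta z| > 0.2$, in line with $\eta_{0.2}$; the exact criteria listed in Table 3 are not reproduced here, so the threshold and the function name are assumptions.

```python
def classify(dz_hayate, dz_eazy, threshold=0.2):
    """Assign an object to Class A-D from its HAYATE and EAZY residuals."""
    out_h = abs(dz_hayate) > threshold
    out_e = abs(dz_eazy) > threshold
    if out_e and not out_h:
        return "A"   # outlier for EAZY only
    if out_h and not out_e:
        return "B"   # outlier for HAYATE only
    if out_h and out_e:
        return "C"   # outlier for both models
    return "D"       # 'normal' object for both models


# Hypothetical (dz_hayate, dz_eazy) pairs
labels = [classify(h, e) for h, e in [(0.01, 0.5), (0.4, 0.02), (0.6, 0.58), (0.0, 0.03)]]
```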

PDF statistics
The quality of output PDFs is generally improved with our ML method as measured by KL, CvM and CRPS, detailed in Table 2 and shown in Fig. 6. HAYATE particularly shows better PIT distributions for CDFS and COSMOS, with KL and CvM significantly lower than those derived with EAZY. Fig. 8 presents the PIT histograms of HAYATE and EAZY for these two fields, along with their CDFs used for quantifying deviations from uniformity. KL provides a dissimilarity measure between the predictive and uniform distributions of PIT, while CvM is a CDF-based metric intuitively represented by the area filled between the corresponding CDF and the identity line. We can see that the PIT distribution of HAYATE looks flatter than that of EAZY, which is reflected by the smaller KL and CvM, indicative of better-calibrated PDFs.
A major contributor to this is the application of ensemble learning to generating the combined PDFs. Fig. 9 shows the metrics estimated for different numbers of CNNs whose individual predictions are combined into the ensemble PDF. The single network is trained on the whole training sample, while the multiple models are built with ensemble learning, as discussed in §4.7. Increasing the number of networks remarkably improves KL and CvM, which converge as the number of individual PDFs increases to ∼ 15. Our fiducial configuration therefore uses 15 networks, as depicted by the vertical line in each panel. ML photo-z codes typically provide PIT distributions of convex shape, indicative of overly broad PDFs that are unlikely to include spec-z's in their tails (Schmidt et al. 2020). The broadening of PDFs suggests an intrinsic function of ML training approaches that adds implicit smoothing to the effective error scale (Wolf 2009). Conversely, HAYATE produces redshift PDFs whose PIT distribution is rather similar to that obtained with the template fitting code EAZY, as one can see in Fig. 8. An over-representation of extreme values is evidenced by a concave histogram, implying overly narrow PDFs. The ensemble learning technique is aimed at alleviating this tendency by combining multiple predictions. Eriksen et al. (2020) have demonstrated a similar effect whereby combining multiple networks reduces the number of objects with the lowest and highest PIT values. The improvement in PIT distribution proved to be caused by decreasing photo-z outliers. However, this does not apply to HAYATE, since the outlier rate does not significantly drop with multiple PDFs combined. Our ensemble approach samples the potential solution space from different local minima on the loss surface.
Table 3. Classification of the test objects based on their photo-z's estimated with HAYATE and EAZY. Objects of Class A are photo-z outliers for EAZY whose estimates are significantly improved with our ML method so that they are no longer outliers for HAYATE. Class B contains a few galaxies on which EAZY conversely outperforms, while photo-z outliers for both models are classified into Class C. Class D includes 'normal' galaxies whose photo-z's provided by both models are not catastrophically wrong with respect to the spec-z's.

We assess the overall form of PIT distributions to probe if the output PDFs are well calibrated on average. This requires that the CDF value at the spec-z should be random, rather than that each PDF is well constrained with respect to its true redshift. The quality of the individual PDFs thus has to be assessed in conjunction with CRPS, which represents a distance between the CDF of a given PDF and a step function with a step at its spec-z. In a derivative form, this can be translated into how different the PDF is from the corresponding delta function. We nonetheless often find a trade-off between KL/CvM and CRPS, or the uniformity of the PIT distribution and the sharpness of each PDF. A simple explanation for this is that a narrower PDF tends to be better constrained at its spec-z with lower CRPS, which is more likely to increase the number of PIT values at extreme edges.
We see in Fig. 9 that ensemble learning also improves CRPS slightly as the number of networks increases. At convergence, HAYATE's PDF statistics are each comparable to, or better than, those of EAZY. The validity of each output PDF is further supported by the sufficiently small CRPS. In Fig. 9, we see that the primary contribution to this is incorporating the CRPS term into the joint loss function. The analysis of the PDF statistics indicates that the output PDFs derived with HAYATE are more reliable than those obtained with EAZY with respect to spec-z's, while their overall population is statistically more self-consistent. We thus demonstrate that HAYATE attains good-quality photo-z PDFs by leveraging the benefits of multiple approaches, performing as an ensemble of lower-level networks optimised for the joint loss.

Improvements with transfer learning
The higher precision of photo-z point estimates indeed results from improved output PDFs yielded by HAYATE-TL. Table 2 shows significantly better CRPS for CDFS and COSMOS, which reduces from 0.83 to 0.70 and from 1.76 to 1.48, respectively, with the assistance of spec-z information for re-training. The fine-tuning of the pre-trained network thus better constrains each redshift PDF over the peak around the spec-z. The improved PDF consequently provides a more precise photo-z point estimate.
The HAYATE-TL photo-z outliers still comprise exactly the same objects as those of HAYATE. The outlier rate consequently remains the same despite the re-training with observations. This reflects the limitation of transfer learning from simulations, which exclusively benefits an 'ordinary' test object whose colour-redshift relation can be fine-tuned by training with the remaining sample. It cannot adapt to 'anomalous' sources whose photometric data are not sufficiently represented in the training set along with reliable spec-z's. We therefore conclude that the photo-z outliers found with HAYATE are intrinsically inconsistent with the input-output mapping derived from both the template-based training set and the observed data.
Table 2 also presents the result for the CNN model which is trained on the same spec-z samples as used for transfer learning but completely from scratch. HAYATE-TL significantly outperforms the most common method of training purely with spec-z data, although both fundamentally learn with the same observed samples. Training with simulations proves to supplement the insufficient spec-z sample for training. These results indicate that transfer learning is effective for making a minor adjustment both to output redshift PDFs and to their single-point estimates using spec-z information. Training with simulations lays important groundwork for the subsequent observational fine-tuning.
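The pre-train then fine-tune workflow can be sketched with a deliberately small stand-in model. The example below is not the HAYATE CNN: it assumes a linear softmax classifier over redshift bins, trained by full-batch gradient descent in NumPy, with toy "simulated" and "observed" samples that differ by a small systematic offset. All names, sample sizes and learning rates are hypothetical.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(W, X, y):
    p = softmax(X @ W)
    return -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))

def train(W, X, y, lr, n_steps):
    """Full-batch gradient descent on the cross-entropy; returns new weights."""
    W = W.copy()
    onehot = np.eye(W.shape[1])[y]
    for _ in range(n_steps):
        grad = X.T @ (softmax(X @ W) - onehot) / len(y)
        W -= lr * grad
    return W

rng = np.random.default_rng(0)
n_bands, n_bins = 8, 4

def mock_sample(n, shift):
    """Toy fluxes whose pattern depends on the redshift bin, plus an offset."""
    y = rng.integers(0, n_bins, size=n)
    centres = np.eye(n_bins, n_bands) * 3.0
    X = centres[y] + shift + rng.normal(0.0, 1.0, size=(n, n_bands))
    return X, y

X_sim, y_sim = mock_sample(5000, shift=0.0)   # large simulated training set
X_obs, y_obs = mock_sample(300, shift=0.3)    # small spec-z-like set, offset

# Pre-train on simulations, then fine-tune gently from the pre-trained weights.
W_pre = train(np.zeros((n_bands, n_bins)), X_sim, y_sim, lr=0.1, n_steps=300)
W_tl = train(W_pre, X_obs, y_obs, lr=0.01, n_steps=100)

loss_pre = cross_entropy(W_pre, X_obs, y_obs)
loss_tl = cross_entropy(W_tl, X_obs, y_obs)
```

The fine-tuning stage starts from the simulation-trained weights and uses a smaller learning rate, so it makes only a minor adjustment towards the observed sample, in the spirit of the procedure described above.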

CNN vs. FCNN
Evaluating the metrics for the two ML models, the baseline FCNN and HAYATE, reveals that the CNN-based architecture shows overall improvements on photo-z point estimates. Table 2 and Fig. 6 show significant drops in $\sigma_{NMAD}$, $\eta_{0.2}$ and CRPS for all the fields. This indicates that the CNN is more likely to yield reliable redshift PDFs with precise point estimates than an FCNN.
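For reference, the two point-estimate metrics can be computed as in the sketch below. It assumes the widely used normalised residual Δz = (z_phot − z_spec)/(1 + z_spec), the conventional factor 1.48 for the normalised median absolute deviation, and the |Δz| > 0.2 outlier threshold quoted in this paper; the exact definitions adopted here are those of §4.4, and the sample values are made up.

```python
import numpy as np

def photoz_point_metrics(z_phot, z_spec, threshold=0.2):
    """Return (sigma_NMAD, outlier rate) under the assumed definitions."""
    z_phot = np.asarray(z_phot, dtype=float)
    z_spec = np.asarray(z_spec, dtype=float)
    dz = (z_phot - z_spec) / (1.0 + z_spec)
    sigma_nmad = 1.48 * np.median(np.abs(dz - np.median(dz)))
    eta = np.mean(np.abs(dz) > threshold)
    return sigma_nmad, eta

z_spec = np.array([0.5, 1.0, 1.5, 2.0, 3.0])
z_phot = np.array([0.52, 0.97, 1.55, 2.02, 0.40])  # last object is an outlier
sigma_nmad, eta = photoz_point_metrics(z_phot, z_spec)
```

The median-based scatter is barely affected by the single catastrophic failure, which is why the outlier rate is reported alongside it.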
The superior performance of HAYATE compared to the FCNN model indicates the benefit of prioritising local feature extraction from the combined arrays of fluxes and photometric errors. CNNs are particularly suited to high-dimensional data such as images, since convolutional kernels require many fewer trainable parameters than fully connected layers. Convolution operations are performed primarily for extracting local information and preserving the spatial relationship between features; these features become more abstract from layer to layer through the network. CNN photo-z models have commonly been trained on galaxy images instead of photometry summary information, which allows for learning with supplementary information on the spatial flux distribution (Pasquet et al. 2019; Schuldt et al. 2021; Henghes et al. 2022; Lin et al. 2022; Zhou et al. 2022). Our ML approach instead leverages the demonstrated ability of CNNs to capture and interpret the local features of galaxy SEDs obtained from the flux distribution over a range of wavelengths.

Analysis of individual redshift PDFs
The robustness of our method can be explored by visual inspection of individual PDFs predicted by HAYATE and EAZY. We particularly focus on the test objects classified as Class A, defined in Table 3, which are responsible for the improved outlier rate relative to template fitting approaches. Fig. 10 shows example star-forming galaxies of Class A, whose input photometric data and output PDFs are presented in the top and bottom rows of each panel, respectively. HAYATE clearly performs better than EAZY on these objects, providing more reliable PDFs with respect to their spec-z's, which are represented by the black circles on the horizontal axes. This results in more accurate photo-z point estimates than the EAZY predictions, shown by the coloured and gray vertical lines.
We can gain further insight into the reason for the improvements by probing the best-fit SEDs derived with EAZY when fixed to the HAYATE photo-z's. They are represented by the coloured lines plotted with the input fluxes, which can be compared to the gray dotted lines for the corresponding SEDs at the EAZY photo-z's. One major failure of template fitting is to misinterpret the spectral features of the Lyman and Balmer breaks, or the Lyman-alpha and Balmer emission lines (Benítez 2000; Brammer et al. 2008). Some of the PDFs produced by HAYATE indeed show minor peaks around their corresponding EAZY photo-z's, showing the learned degeneracy inherited from the original template fitting algorithm.
We may glean further clues for improving HAYATE's output PDFs by investigating typical photo-z outliers, although the outlier rate is low with $\eta_{0.2} \lesssim 1\%$. One major group consists of rare objects whose photometric data are not well represented by the training samples. We can see some likely AGN data points (estimated from visual inspection of spectra) in the lower left panels of Fig. 7 and Fig. A1, depicted by triangles outside the region of $|\Delta z_{HAYATE}| < 0.2$ and $|\Delta z_{EAZY}| < 0.2$. These objects are included in Class C, whose photo-z's are outliers for both HAYATE and EAZY, often with similar point estimates. For the example AGNs presented in Fig. 11, both models provide PDFs with erroneous photo-z's, although their input photometric data are obtained with the brightest magnitudes and the highest SNRs. Most of them show no minor peaks at the spec-z's in the distributions. This reveals that the ensemble of standard EAZY templates cannot intrinsically cover the galaxy SEDs of some rare objects. The simulated training datasets consequently also lack objects of this class, and such anomalous objects result in catastrophically wrong solutions.
Class C also contains some star-forming galaxies which result in incorrect photo-z predictions. They vary in photometric SNR and derived spec-z, and their results are presented in Fig. 12. HAYATE and EAZY both predict quite similar photo-z's for each object, although they are significantly divergent from the spec-z. One can find that the best-fit SEDs derived at $z_{HAYATE}$ and $z_{EAZY}$ indeed look well fitted to the input fluxes. A deficiency in the template set, and thus in HAYATE's training data, means that neither can correctly classify these difficult objects.
Photo-z point estimates of HAYATE are generally correlated with those computed by EAZY, as discussed in §5.1.1. The network is clearly able to exploit the demonstrated ability of template fitting to predict precise photo-z's. Assessing the model's performance on the individual outliers further demonstrates how our hybrid approach has internalised the fundamental functions of EAZY, including even the failure to produce reliable PDFs for some difficult objects. One remarkable benefit of HAYATE is, however, the potential to remedy the vulnerability of EAZY to the misinterpretation of spectral features, particularly those characterised by the Lyman and Balmer breaks.
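The object classes of Table 3 used throughout this section follow directly from the outlier status of each object under the two codes. A minimal sketch, assuming the |Δz| > 0.2 outlier threshold and using a hypothetical function name and made-up residuals:

```python
import numpy as np

def classify_objects(dz_hayate, dz_eazy, threshold=0.2):
    """Assign Table 3 classes A-D from the two sets of photo-z residuals."""
    out_h = np.abs(np.asarray(dz_hayate, dtype=float)) > threshold
    out_e = np.abs(np.asarray(dz_eazy, dtype=float)) > threshold
    classes = np.full(out_h.shape, "D", dtype="<U1")  # neither is an outlier
    classes[out_e & ~out_h] = "A"                     # EAZY outlier only
    classes[out_h & ~out_e] = "B"                     # HAYATE outlier only
    classes[out_h & out_e] = "C"                      # outlier for both
    return classes

dz_hayate = np.array([0.01, 0.35, 0.50, -0.03])
dz_eazy = np.array([0.40, 0.02, 0.55, 0.05])
classes = classify_objects(dz_hayate, dz_eazy)
```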

DISCUSSION
We have demonstrated the potential for HAYATE to contribute to efficient data mining for future large surveys with the following benefits:
(i) Our method can be applied to a broad redshift range including high-z galaxies which are deficient in reliable observational data for training. The network trained with template SEDs from EAZY can function as a reliable emulator, with ∼ 100 times shorter running time.
(ii) The analysis of $\sigma_{NMAD}$ reveals that in the interpolative regions of the low-z colour space, the ML methodology results in higher accuracy in photo-z estimation than the original template fitting approach. It also performs comparably well even in the high-z extrapolative regime.
(iii) HAYATE is likewise more robust to photo-z outliers than EAZY. In particular, its output photo-z PDFs are less vulnerable to the degeneracy of redshift caused by misinterpretation of the Lyman and Balmer breaks from input photometric data.
(iv) Optimising the joint loss comprising the CCE and CRPS terms keeps the credibility of individual PDFs comparable to that for EAZY in terms of CRPS.
(v) Ensemble learning shows significant improvements in KL and CvM, which enables HAYATE to provide redshift PDFs better calibrated than EAZY with a flatter PIT distribution.
(vi) [...] ∼ 5 − 30% depending on the sample size of spec-z datasets. We expect to benefit from the fine-tuning with spec-z information for future photo-z studies since spectroscopy will be conducted along with imaging in many upcoming survey projects.
(vii) Training with simulations shows remarkable improvements in both photo-z and PDF statistics compared to the purely observation-based training. This enables us to utilise ML techniques for redshift estimations where only small spec-z samples are available: in this work, no more than 1274, 738 and 312 objects in CDFS, COSMOS and UDS, respectively.
(viii) Our empirical noise application method allows any missing values to be included in the input photometric data. This can enhance the photometric resolution and spectral coverage of the target photometric sample without reducing the sample size, when the sample is compiled from multiple sub-catalogues by cross-matching individual sources with different many-band photometry.
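Item (iv) above refers to a joint loss built from a categorical cross-entropy (CCE) term and a CRPS term. The sketch below illustrates why the two terms are complementary for a softmax output over redshift bins. It assumes a simple weighted sum with a hypothetical `weight` parameter and a discrete CRPS; the exact form and weighting used by HAYATE are those given in §4.5.

```python
import numpy as np

def joint_loss(pdf, true_bin, bin_width, weight=1.0):
    """CCE + weight * CRPS for one binned PDF (assumed form, see lead-in)."""
    pdf = np.asarray(pdf, dtype=float)
    cce = -np.log(pdf[true_bin] + 1e-12)                # depends on true bin only
    cdf = np.cumsum(pdf)
    step = (np.arange(pdf.size) >= true_bin).astype(float)
    crps = np.sum((cdf - step) ** 2) * bin_width        # depends on the whole shape
    return cce + weight * crps

# Same probability in the true bin (index 2), different placement of the rest.
near = np.array([0.0, 0.25, 0.5, 0.25, 0.0])
far = np.array([0.25, 0.0, 0.5, 0.0, 0.25])
loss_near = joint_loss(near, true_bin=2, bin_width=0.1)
loss_far = joint_loss(far, true_bin=2, bin_width=0.1)
```

The CCE term is identical for the two toy PDFs because it only sees the probability assigned to the true bin, whereas the CRPS term penalises the PDF that places its remaining probability further from the true redshift.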
It is worth pointing out that, although HAYATE is a hybrid method optimised to perform well in the high-redshift regime, there is no clear step-change in performance beyond a certain threshold. HAYATE performs well across the entire parameter range, exploiting the strengths of the two approaches previously considered somewhat disjoint.
Exploring billions of objects catalogued by the Stage IV dark energy surveys will require the exploitation of photo-z's at the expense of reliable spec-z information. A simplistic extrapolation from past campaigns would estimate over 10,000 years of continuous integration time for obtaining spectra of the LSST "gold sample" galaxies (Newman & Gruen 2022). We may need ≳ 30,000 spectra in training and calibration of photo-z's for a Stage IV survey, from ≳ 15 widely separated fields of ∼ 0.09 deg$^2$ each; for instance, this corresponds to an estimated survey time of over a few years even with the Extremely Large Telescope for the LSST survey depths (Newman et al. 2015).
The limited redshift range targeted by the Euclid survey could instead demand a smaller spec-z sample to meet the cosmology requirements for photo-z calibration, but one still exceeding ∼ 5000 objects (Stanford et al. 2021). Efficient and accurate estimation of photo-z's will be of fundamental importance in various fields of extragalactic astronomy and cosmology, as the pace of follow-up spectroscopy can never sufficiently meet the required volume of objects from such imaging surveys.
Template-fitting methods perform well on the current generation of surveys, though at the data volumes expected for the Stage III and IV surveys the compute power needed becomes a nontrivial issue. A photo-z estimator that is orders of magnitude faster while preserving excellent performance would assist in scaling these data pipelines.
This work builds on the performance of EAZY, applied to S16 (including the data products from the ZFOURGE survey), which covers 128 arcmin$^2$ to a limit of ∼ 26 in the $K_s$ band for CDFS, producing imaging of ∼ 30,000 galaxies. The ongoing and future surveys will probe much fainter objects in wider survey areas, producing observational catalogues of unprecedentedly large sample size.
Our hybrid method can be applied to any photometric catalogue by simulating photometry using the corresponding transmission curves for mock SEDs, with simulated noise based on the observational errors. The simulation-based catalogue construction also allows training methods to be extrapolated outside their initial redshift ranges, i.e. z < 1.3 in this work. We set the upper bound of the target redshift range to 5 considering the number of spec-z data available as the test samples. This can be reasonably extended to much higher redshifts depending on the target survey or photometric catalogue. Some recent works have proposed using simulated photometric data in training photo-z networks (e.g., Eriksen et al. 2020; Ramachandra et al. 2022), but the redshift range still covers up to z ∼ 1.2 at the highest.
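The construction of mock photometry from an artificially redshifted SED can be sketched as follows. This is a simplified illustration, not the pipeline used in this work: it assumes top-hat filters in place of real transmission curves, a plain mean of the flux density within each band, and it ignores the overall flux scaling with redshift since only the SED shape is of interest here. The filter names and wavelength ranges are hypothetical.

```python
import numpy as np

def redshift_sed(wave_rest, z):
    """Shift rest-frame wavelengths to the observed frame."""
    return wave_rest * (1.0 + z)

def band_flux(wave, flux, lo, hi):
    """Mean flux density through a top-hat filter spanning [lo, hi]."""
    inside = (wave >= lo) & (wave <= hi)
    return flux[inside].mean()

# Toy rest-frame SED with a break at 4000 Angstrom (flat on either side).
wave_rest = np.linspace(1000.0, 20000.0, 2000)
flux_rest = np.where(wave_rest < 4000.0, 0.2, 1.0)

# Hypothetical top-hat bands in Angstrom.
filters = {"blue": (4000.0, 5500.0), "red": (9000.0, 11000.0)}

def mock_photometry(z):
    wave_obs = redshift_sed(wave_rest, z)
    return {name: band_flux(wave_obs, flux_rest, lo, hi)
            for name, (lo, hi) in filters.items()}

phot_low = mock_photometry(0.0)    # break sits blueward of both bands
phot_high = mock_photometry(1.0)   # break has moved to 8000 Angstrom
```

Redshifting the same SED from z = 0 to z = 1 moves the break between the two bands, so the blue band drops to the faint side of the break while the red band is unchanged. This colour change with redshift is the signal that a network trained on such mock photometry learns.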
Traditional ML approaches have mainly involved training a photo-z algorithm exclusively with spec-z information (e.g., Firth et al. 2003; Brescia et al. 2013, 2014; Bonnett 2015; Sadeh et al. 2016; Jones & Singal 2017). The accuracy of predicted photo-z's essentially depends on the quality and completeness of the training dataset, which requires large spec-z samples. The target redshift range for ML has thus been limited to low-z regions of sufficient spectroscopic completeness. This accounts for the current prevalence of template-based methods for high-z galaxies, although ML approaches are rather common at z ≲ 1. The extensibility of the target redshift range is one critical functionality of our hybrid method that will enable us to infer accurate photo-z's of faint high-z galaxies obtained from the upcoming survey projects.
We have also demonstrated the potential of transfer learning to fine-tune the pre-trained model with spec-z information and improve the photo-z precision of normal sources, whose estimations are not outliers with respect to $\Delta z$. This will significantly benefit future ML photo-z studies, since spectroscopy will be conducted along with imaging in many upcoming survey projects.
Forthcoming programs from the JWST will provide opportunities both to improve the algorithm and to put it into practice. Deep spectroscopic data will bolster the training set available for transfer learning. For instance, the ongoing JWST Guaranteed Time Observations (GTO) program, the JWST Advanced Extragalactic Survey (JADES; Rieke 2020; Bunker et al. 2020), will provide Near InfraRed Spectrograph (NIRSpec) spectroscopy, covering in its 'DEEP' survey mode a smaller survey area of 46 arcmin$^2$ in HUDF/GOODS-S but to a much fainter limit of ∼ 30. Another survey mode, the 'MEDIUM' survey, will cover no less than 190 arcmin$^2$ in GOODS-S and GOODS-N to a limit of ∼ 29. JADES will observe ∼ 5000 galaxies at 1 < z < 5, ∼ 2000 − 4000 galaxies at z > 5 and ∼ 300 galaxies at z > 6.
On the other hand, deep imaging surveys, such as Public Release IMaging for Extragalactic Research (PRIMER; Dunlop et al. 2021), will probe an even larger area of ∼ 400 arcmin$^2$ than GTO in COSMOS and UDS to a limit of ∼ 28.5, revealing ∼ 100 − 200 galaxies out to z ∼ 12. COSMOS-Web (Kartaltepe et al. 2021) will also produce wide-area imaging covering 0.6 deg$^2$ in COSMOS to a limit of ∼ 28, and is expected to offer near-IR imaging of half a million galaxies, along with 32,000 in the mid-IR, and to identify hundreds of massive quiescent galaxies in the first 2 Gyr (z > 4). These imaging datasets will be excellent candidates for applying HAYATE, and via transfer learning will leverage the smaller spectroscopic surveys.
The non-Gaussianity of predicted PDFs also distinguishes HAYATE from other commonly used approaches, which tend to assume the underlying components to be Gaussian (D'Isanto & Polsterer 2018; Eriksen et al. 2020; Lima et al. 2022). HAYATE yields non-parametric PDFs as the outputs of the softmax activation. Realistic redshift PDFs should indeed contain non-Gaussian properties such as asymmetry and tails, reflecting an interplay of various features in target photometric data: for instance, filter functions, the set of filters used and their observational error distributions. They may also be vulnerable to colour-redshift degeneracies, which are represented by multiple peaks. These individual features could have a significant impact on cosmological measurements (Mandelbaum et al. 2008; Palmese et al. 2020).
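A softmax output over redshift bins places no constraint on the shape of the PDF, so multiple peaks arise naturally. The sketch below illustrates this with toy logits; it assumes bins uniform in log(1 + z) up to z = 5, and the helper `local_peaks` and its height threshold are hypothetical conveniences rather than part of HAYATE.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def local_peaks(pdf, min_height):
    """Indices of interior bins above both neighbours and above min_height."""
    mid = pdf[1:-1]
    is_peak = (mid > pdf[:-2]) & (mid > pdf[2:]) & (mid > min_height)
    return np.flatnonzero(is_peak) + 1

# Grid uniform in log(1 + z), covering 0 <= z <= 5 (assumed binning).
log1pz = np.linspace(0.0, np.log(6.0), 200)
z_grid = np.exp(log1pz) - 1.0

# Toy logits with two bumps, mimicking a colour-redshift degeneracy.
logits = (6.0 * np.exp(-0.5 * ((z_grid - 0.4) / 0.1) ** 2)
          + 5.0 * np.exp(-0.5 * ((z_grid - 3.0) / 0.2) ** 2))
pdf = softmax(logits)

peaks = local_peaks(pdf, min_height=2.0 / pdf.size)
z_peaks = z_grid[peaks]            # redshifts of the competing solutions
z_mode = z_grid[np.argmax(pdf)]    # one possible point estimate
```

A Gaussian or Gaussian-mixture output with a fixed number of components would need that number chosen in advance, whereas the binned output represents the secondary solution directly.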
Nevertheless, HAYATE still fails to return a uniform PIT distribution, which is required for applying the output PDFs to estimating the redshift distribution $N(z)$ of an ensemble of galaxies (Newman & Gruen 2022). EAZY is not vulnerable to systematic broadening or narrowing of output PDFs in general (Wittman et al. 2016; Schmidt et al. 2020). The better calibration of redshift PDFs offered by HAYATE thus provides insight into obtaining an even flatter PIT distribution that could meet the requirements for many high-precision cosmology measurements. Improving the ensemble learning approach, which has contributed to significantly reducing KL and CvM, combined with transfer learning depending on the science case, is a potential avenue of research.
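The flatness of a PIT distribution can be quantified as in the sketch below. It assumes a histogram-based KL divergence from the uniform distribution and the standard one-sample Cramér-von Mises statistic against U(0, 1); the definitions adopted in this work are those of §4.4 and may be normalised differently. The PIT samples are synthetic, with a beta distribution standing in for overly narrow PDFs.

```python
import numpy as np

def pit_uniformity(pit, n_bins=20):
    """Return (KL divergence from uniform, Cramer-von Mises statistic)."""
    pit = np.sort(np.asarray(pit, dtype=float))
    n = pit.size
    counts, _ = np.histogram(pit, bins=n_bins, range=(0.0, 1.0))
    p = counts / n
    nonzero = p > 0
    # KL(p || uniform), with the uniform probability 1 / n_bins per bin.
    kl = np.sum(p[nonzero] * np.log(p[nonzero] * n_bins))
    # One-sample CvM statistic comparing the empirical CDF to the identity.
    i = np.arange(1, n + 1)
    cvm = 1.0 / (12.0 * n) + np.sum((pit - (2.0 * i - 1.0) / (2.0 * n)) ** 2)
    return kl, cvm

rng = np.random.default_rng(1)
flat = rng.uniform(0.0, 1.0, 5000)    # well-calibrated PDFs
edgy = rng.beta(0.5, 0.5, 5000)       # excess of extreme PIT values
kl_flat, cvm_flat = pit_uniformity(flat)
kl_edgy, cvm_edgy = pit_uniformity(edgy)
```

Both statistics are close to zero for the uniform sample and grow for the sample with an over-representation of extreme values, which corresponds to the overly narrow PDFs discussed above.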
Another issue to be addressed is the fidelity of simulated training data. The success of transfer learning indicates that there remains an intrinsic disparity in data quality between simulated and observed datasets, and that the mock photometric data used for training can be further improved. In essence, the quality of the mock SEDs relies on the performance of EAZY, while the noise model affects simulated photometry for a given test sample. These aspects of our hybrid method simultaneously lead to HAYATE's ability to emulate EAZY and its limitations, while surpassing the performance of the original template fitting code.

CONCLUSION
We have developed a novel photo-z CNN, dubbed HAYATE, based on a hybrid method that incorporates the demonstrated ability of template fitting into the latest empirical modelling. It is primarily aimed at combining the benefits of ML- and template-based approaches by performing as an efficient ML emulator of EAZY beyond the limitations of spec-z completeness at low-z. This was achieved by extrapolating the SED coverage obtained from low-z photometric data to higher redshifts. Technically, we artificially redshifted EAZY best-fit SEDs for the S16 sources of z < 1.3 such that the training set of mock SEDs covers a broader redshift range up to z = 5. Further advancements were likewise explored via simultaneous optimisation of output photo-z PDFs and point estimates, aided by modern ML techniques: training with the joint loss function (§4.5), ensemble learning (§4.7), and transfer learning (§4.6). The photo-z networks of different configurations, as well as EAZY, were tested on the updated S16 spec-z samples and evaluated based on commonly used performance metrics for measuring the quality of photo-z point estimates and output PDFs: $\sigma_{NMAD}$, $\eta_{0.2}$, KL, CvM, and CRPS, as described in §4.4.

Considering the applicability of our methodology to a variety of catalogues, HAYATE should generalise to a flexible set of photometric bands. The current framework is a bespoke solution for a specific catalogue with a fixed combination of broad-band filters. We may develop an extended architecture where the input involves a broader range of photometric band filters, by treating the bands unavailable to a given catalogue as missing data. A single model could then adapt to different catalogues simultaneously by learning on a collection of individual training samples. An upgraded model is under development and will be the subject of a future work.
Further improvements require a strategy to extend the training sample beyond the scope of EAZY predictions. The simplest approach would be to incorporate a broader range of galaxy SEDs from external sources into the training set, enhancing the model's robustness to those photo-z outliers whose typical SEDs are not included in the EAZY outputs. This particularly applies to some of the example galaxies discussed in §5.4, including AGNs.
Blended spectra are a likely source of photo-z errors, and are difficult to eliminate in the preprocessing stage. All photo-z methodologies are vulnerable to this source of contamination, for which a correct redshift is not defined. It is possible that future methods, such as ML-based algorithms which directly consume the 2D spectra, could mitigate this further.
Our hybrid method may both benefit from and complement other recent developments. Wang et al. (2023b) have demonstrated promising results by using carefully chosen priors to break the age-mass-redshift degeneracy, and have recovered accurate photo-z's using the Prospector-$\alpha$ (Leja et al. 2017) stellar population properties inference code, simultaneously recovering redshift with other stellar properties using Bayesian inference. Wang et al. (2023a) exploit simulation-based inference (SBI; Cranmer et al. 2020), which allows efficient sampling of computationally expensive models, to massively accelerate this multi-parameter fitting compared to nested sampling by up to a factor of $10^4$. These methods, applied to simulated JWST data, efficiently recovered photo-z's with comparable outlier rates ($\sigma_{NMAD}$ ∼ 0.04) along with multi-modal PDFs.
Fig. 1 shows the simulated photometry for an example mock SED, whose noised fluxes are generated with the four different noise models. The Missing model (ii) drops one flux value as missing, which is represented by the red cross, while the Const model (iii) further adds constant errors to the remaining fluxes. More realistic photometry can be simulated with the Empirical model (iv), where empirical noise extracted from the test sample is applied to the noiseless fluxes. We then train the CNN models, whose architecture is introduced in §4.3, on the different simulated datasets for CDFS, each generated with one of the four noise models. Testing them on the same spec-z catalogue sample allows us to explore the most effective noise model. The performance of each CNN is evaluated with the accuracy $\sigma_{NMAD}$ and the outlier rate $\eta_{0.2}$ of photo-z point estimates, as described in §4.4. Fig. 1 presents the results, revealing that the Noiseless model (i) causes a catastrophic failure in photo-z estimations, since the training sample does not contain any errors or missing values in photometric measurements. This can be improved by incorporating missing values into the training set which reflect those of the test sample. The Missing model (ii) achieves much better results, with $\sigma_{NMAD}$ ∼ 0.03 and $\eta_{0.2}$ ∼ 20%, than those of the Noiseless model (i), with $\sigma_{NMAD}$ ∼ 0.4 and $\eta_{0.2}$ ∼ 60%.
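The four noise models can be summarised in a short sketch. The implementation details below are assumptions made for illustration: missing values are encoded as NaN, the Const model draws Gaussian noise with a single standard deviation, and the Empirical model reuses the per-band relative errors of one catalogue source (the step that selects a catalogue source of similar SED shape is omitted). All function names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

def noiseless_model(flux):
    """(i) Pure integrated photometry, no errors and no missing values."""
    return flux.copy(), np.zeros_like(flux)

def missing_model(flux, missing_mask):
    """(ii) Noiseless fluxes with the catalogue's missing bands dropped."""
    out = flux.copy()
    out[missing_mask] = np.nan
    return out, np.zeros_like(flux)

def const_model(flux, missing_mask, sigma):
    """(iii) Missing bands plus Gaussian noise of constant standard deviation."""
    err = np.full_like(flux, sigma)
    out = flux + rng.normal(0.0, sigma, flux.size)
    out[missing_mask] = np.nan
    err[missing_mask] = np.nan
    return out, err

def empirical_model(flux, cat_flux, cat_err):
    """(iv) Noise scaled by the relative errors of a real catalogue source."""
    rel_err = cat_err / cat_flux          # NaN where the catalogue band is missing
    err = np.abs(flux) * rel_err
    out = flux + rng.normal(0.0, 1.0, flux.size) * err
    return out, err

flux = np.array([1.0, 2.0, 3.0, 4.0])            # noiseless mock fluxes
cat_flux = np.array([2.0, np.nan, 6.0, 8.0])     # catalogue source, one band missing
cat_err = np.array([0.2, np.nan, 0.3, 0.8])
mask = np.isnan(cat_flux)

f_none, e_none = noiseless_model(flux)
f_miss, e_miss = missing_model(flux, mask)
f_const, e_const = const_model(flux, mask, sigma=0.1)
f_emp, e_emp = empirical_model(flux, cat_flux, cat_err)
```

In this sketch the Empirical model inherits both the missing bands and the band-to-band variation of the signal-to-noise ratio from the catalogue source.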

Figure 1. Simulated photometry with different noise models for the same mock SED. All the simulated fluxes are shown by the red circles with error bars, while the red crosses represent missing data. The top panels present purely integrated photometry without artificial noise, without (left) and with (right) some missing values included based on a randomly picked catalogue source. In the bottom left panel, the flux points are drawn from Gaussian distributions with a constant variance over all wavelengths. The bottom right panel exhibits the artificial noise generated from the photometric data of a catalogue source with a similar SED shape.

Figure 2. Optimisation of architecture and hyperparameters for the FCNN models using 4-fold cross-validation. Each panel presents changes in validation accuracy with the number of neurons as the blue circles for a given number of layers. The accuracy score along with its estimation error is given by the mean and standard deviation of the validation accuracy over all folds. The validation loss is also shown by the red circles. The dotted horizontal line represents the validation loss obtained from the baseline model, comprising 3 layers with 500 neurons each, which is presented by the red triangle.

Figure 5. Top: inputs of fluxes and observational errors for an example object in CDFS. The normalised fluxes and photometric errors are presented by the black circles with error bars. The gray line shows the corresponding best-fit SED derived with EAZY. Bottom: ensemble of output PDFs as a function of $\log(1 + z)$, shown by the shaded region coloured in purple. The solid lines in different colours are lower-level PDFs produced by 15 different networks for the source presented in the top panel, which are combined into the thick purple line as an ensemble.

Figure 6. Visualisation of the comparison in the photo-z and PDF statistics between different models presented in Table 2. The results for CDFS, COSMOS and UDS are provided in the left, middle and right columns, respectively, which are individually coloured in red, green and blue. Each row shows evaluation scores of one metric for the baseline FCNN, HAYATE, HAYATE-TL and EAZY. The shaded region in each panel represents the 1σ range of the individual metric obtained from EAZY for a given field; the error bars are 1σ.

Figure 7. Top: distributions of the spec-z catalogue sample for CDFS on the $z_{spec}$ − $\Delta z$ plane, which are obtained by testing HAYATE (left) and EAZY (middle). Each data point is presented in different markers and colours, which represent the galaxy type and $K_s$-band magnitude. The threshold of $\Delta z$ outliers is set to 0.2, which is shown by the vertical lines, and the outliers are represented by the filled markers. The right panel presents the comparison of $\sigma_{NMAD}$ between low-z (z < 1.3) and high-z (z > 1.3) samples, individually derived with HAYATE and EAZY. Bottom: comparison of photo-z errors ($|z_{phot} - z_{spec}|$) between HAYATE ($\Delta z_{HAYATE}$) and EAZY ($\Delta z_{EAZY}$) for the same sample presented in the top panel. Each panel contains residual plots of the individual objects, whose entire distribution is presented in the left panel while the zoom-in plot within the outlier threshold of $|\Delta z| = 0.2$ is in the right panel. The shaded region represents the area where $z_{HAYATE}$ is better than $z_{EAZY}$.

Figure 10. Example star-forming galaxies of Class A. The input fluxes with errors are plotted in the top row of each panel, along with the missing values represented by the black crosses. The gray dotted line shows the best-fit SED optimised with the photo-z by EAZY ($z_{EAZY}$), while the coloured solid line represents the result at the fixed photo-z derived with HAYATE ($z_{HAYATE}$). The bottom row compares the photo-z PDFs produced by HAYATE and EAZY, shown by the shaded region and the solid black line, respectively. For each object, the photo-z point estimates are given by the coloured and gray upside-down triangles, while the spec-z is given by the down arrow.

Figure 11. Example AGNs of Class C in CDFS and COSMOS.

Figure 12. Example star-forming galaxies of Class C.

Table 1. Summary of evaluation metrics. We employ $\sigma_{NMAD}$ and $\eta_{0.2}$ for measuring the accuracy of photo-z point estimates, while KL, CvM and CRPS are responsible for assessing the quality of photo-z PDFs.
$\eta_{0.2}$: rate of catastrophically wrong $z_{phot}$; outlier rate of $\Delta z$ with $|\Delta z| > 0.2$.
KL: calibration of the produced photo-z PDF via PIT [Eq. (11)]; divergence of the PIT distribution from uniformity.
CvM: calibration of the produced photo-z PDF via PIT [Eq. (11)]; dissimilarity between the CDF of the PIT distribution and the identity line.
CRPS: reliability of the photo-z PDF w.r.t. $z_{spec}$ [Eq. (14)]; median of [...]

Table 2. Performance of different photo-z models on the spec-z samples provided by S16. The number of inputs and the sample size of spec-z data are presented in the second and third columns, for CDFS, COSMOS and UDS from top to bottom. For each field, the photo-z and PDF statistics are shown for the baseline FCNN, HAYATE, HAYATE-TL and EAZY. We additionally train the CNN of the same architecture as HAYATE purely with the spec-z data from scratch to exhibit the benefit of training with simulations. All the uncertainties are the standard deviation derived from bootstrap resampling.