Imputation of Missing Photometric Data and Photometric Redshift Estimation for CSST

Accurate photometric redshift (photo-$z$) estimation requires support from multi-band observational data. However, in the actual process of astronomical observations and data processing, some sources may have missing observational data in certain bands for various reasons. This could greatly affect the accuracy and reliability of photo-$z$ estimation for these sources, and even render some estimation methods unusable. The same situation may exist for the upcoming Chinese Space Station Telescope (CSST). In this study, we employ a deep learning method called Generative Adversarial Imputation Networks (GAIN) to impute the missing photometric data in CSST, aiming to reduce the impact of data missing on photo-$z$ estimation and improve estimation accuracy. Our results demonstrate that using the GAIN technique can effectively fill in the missing photometric data in CSST. Particularly, when the data missing rate is below 30\%, the imputation of photometric data exhibits high accuracy, with higher accuracy in the $g$, $r$, $i$, $z$, and $y$ bands compared to the $NUV$ and $u$ bands. After filling in the missing values, the quality of photo-$z$ estimation obtained by the widely used Easy and Accurate Zphot from Yale (EAZY) software is notably enhanced. Evaluation metrics for assessing the quality of photo-$z$ estimation, including the catastrophic outlier fraction ($f_{out}$), the normalized median absolute deviation ($\rm {\sigma_{NMAD}}$), and the bias of photometric redshift ($bias$), all show some degree of improvement. Our research will help maximize the utilization of observational data and provide a new method for handling sample missing values for applications that require complete photometry data to produce results.


INTRODUCTION
The measurement of galaxy redshift plays a crucial role in understanding the large-scale structure of the universe, as well as the formation and evolution of galaxies (Tasca et al. 2009; Mo et al. 2010; Conselice 2014). Systematically measuring and analyzing galaxy redshifts allows their distances to be determined accurately, laying the foundation for studying various physical properties of galaxies, including mass, luminosity, and the scale of various extreme phenomena. Moreover, measuring galaxy redshift allows for a deeper investigation into, for example, the structure, origin, and evolution of the observable universe as a whole (Abdalla et al. 2011; Zhou et al. 2021).
The accuracy of photo-$z$ estimation is affected by various factors, including the quality of the photometric data, the wavelength coverage of the photometric bands, and the choice of redshift estimation method, among others (Newman & Gruen 2022). Among these factors, the coverage of the photometric bands is considered particularly critical. With the redshift estimation method and the quality of photometric data held constant, photometric data with broader band coverage can improve the precision of photo-$z$ estimation. This is because wider band coverage captures more spectral information and provides additional data points for analysis, leading to more accurate redshift determinations (Salvato et al. 2019).
For example, the photo-$z$s in the COSMOS2015 catalogue (Laigle et al. 2016) have been shown to be more precise and accurate than those from surveys such as the Dark Energy Survey (DES) and the Sloan Digital Sky Survey (SDSS) (Gunn et al. 2006). This improvement can be attributed to the fact that the COSMOS photo-$z$s were computed using more than 30 bands that cover a wide range of the electromagnetic spectrum, in contrast to the four or five optical bands used in the DES and SDSS surveys.
Although photometric data from more bands can enhance the accuracy of photo-$z$ estimation, in large-scale astronomical surveys it is inevitable that many sources will lack observations in one or more bands due to various limitations. These limitations include galaxies or regions that have not been observed in a specific band, galaxies or regions that are masked, photometric measurements that do not meet the detection threshold of the catalog, and photometric measurements with a notably low signal-to-noise ratio (Euclid Collaboration: Humphrey et al. 2023). The lack of such data not only greatly affects the accuracy and reliability of photo-$z$ estimation, but also makes some estimation methods unusable because the photometry is incomplete. For instance, many machine learning algorithms developed for photo-$z$ estimation need complete data from multiple bands as input (Lu et al. 2024; Fotopoulou & Paltani 2018; Zhou et al. 2021). Hence, it is crucial to tackle the problem of missing data in observed photometry to maximize the utility of survey data and to achieve precise photo-$z$ estimation for these sources.
Different photo-$z$ estimation methods handle missing band data differently. In template fitting methods, missing bands are typically disregarded during redshift estimation (Bolzonella et al. 2011; Arnouts et al. 2002; Ilbert et al. 2006; Brammer et al. 2008). For example, in EAZY (Brammer et al. 2008), if data are missing for a specific filter of an object, the flux value for that band is typically set to a value more negative than any expected negative flux, such as -99. This guarantees that the value lies below any genuinely measured negative flux, since EAZY inherently handles non-detections with negative fluxes. Furthermore, EAZY excludes objects in the catalog that have fewer filters than a specified number, to uphold data reliability and precision.
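The sentinel convention described above can be sketched in a few lines of NumPy. This is a minimal illustration, not EAZY's own code: the function name, the use of NaN to mark missing bands, the choice to write the sentinel into the error column as well, and the `min_bands` threshold are all assumptions made for the example.

```python
import numpy as np

MISSING_FLUX = -99.0  # sentinel quoted in the text; below any measured negative flux


def mark_missing_for_template_fit(flux, flux_err, min_bands=5):
    """Replace missing (NaN) fluxes by the sentinel and flag sparse objects.

    min_bands is a hypothetical threshold chosen for illustration only.
    """
    flux = np.asarray(flux, dtype=float)
    flux_err = np.asarray(flux_err, dtype=float)
    missing = np.isnan(flux)
    flux_out = np.where(missing, MISSING_FLUX, flux)
    # Assumption: the error column carries the same sentinel as the flux.
    err_out = np.where(missing, MISSING_FLUX, flux_err)
    # Keep only objects observed in at least min_bands filters.
    keep = (~missing).sum(axis=1) >= min_bands
    return flux_out, err_out, keep


flux = np.array([[1.2, np.nan, 3.4], [np.nan, np.nan, 0.8]])
err = np.array([[0.1, np.nan, 0.2], [np.nan, np.nan, 0.1]])
flux_out, err_out, keep = mark_missing_for_template_fit(flux, err, min_bands=2)
```

A genuinely measured negative flux is left untouched, which is the point of choosing a sentinel far below any plausible measurement.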
In machine learning algorithms, a prevalent way to address missing data is to impute values using techniques that are unrelated to the reasons for the data being absent. In tree-based learning algorithms, various strategies can be employed to handle missing values. One common approach is to assign a predefined constant value, such as -99.9, to signify the absence of data. Alternatively, missing values can be replaced with the mean, median, or minimum value of the corresponding feature, calculated across the galaxy sample using only the available data points. The choice of imputation technique is guided by the specific demands of the algorithm and the nature of the dataset (Fotopoulou & Paltani 2018; Mucesh et al. 2021; Schirmer et al. 2022). In neural network-based learning algorithms, missing values are generally not allowed, and complete data from multiple bands are required as input. Objects whose photometric SEDs are missing one or more bands are typically discarded from the analysis (Zhou et al. 2021, 2022). This highlights the importance of data completeness for neural network applications.
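As a point of reference for the simple strategies listed above, the following sketch fills missing entries with a constant or with a per-band statistic computed from the available values only. The function name and the use of NaN as the missing-value marker are assumptions for illustration; the magnitudes are made up.

```python
import numpy as np


def impute_simple(x, strategy="mean", fill_value=-99.9):
    """Fill NaN entries column by column (one column per band)."""
    x = np.asarray(x, dtype=float)
    if strategy == "constant":
        fill = np.full(x.shape[1], fill_value)
    elif strategy == "mean":
        fill = np.nanmean(x, axis=0)
    elif strategy == "median":
        fill = np.nanmedian(x, axis=0)
    elif strategy == "min":
        fill = np.nanmin(x, axis=0)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Broadcast the per-band fill value over all rows with a missing entry.
    return np.where(np.isnan(x), fill, x)


mags = np.array([[20.0, np.nan], [22.0, 24.0], [np.nan, 26.0]])
filled_mean = impute_simple(mags, "mean")
filled_const = impute_simple(mags, "constant")
```

None of these strategies uses the other bands of the same galaxy, which is why the filled values are generally far from the true ones.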
These methods for handling missing data work well in some photo-$z$ estimation algorithms, such as the tree-based CPz (Classification-aided photometric redshift estimation) algorithm, where substituting the mean produces good outcomes, and in certain galaxy classification applications, such as the selection of quiescent galaxies by Euclid Collaboration: Humphrey et al. (2023). However, these imputation methods do not provide predicted values that are close to the real ones. Therefore, for other methods that require complete multi-band photometry for accurate photo-$z$ estimation, galaxies with missing data remain of limited use when imputed in this way.
With the increasing popularity of machine learning, various machine learning methods for data imputation have emerged in the literature (Van Buuren & Groothuis-Oudshoorn 2011; Pereira et al. 2020; Yoon et al. 2018; Shang et al. 2017; Lee et al. 2019). These methods are also gradually being used to address missing data in astronomy (Ren et al. 2020; Pichara & Protopapas 2013; Keerin & Boongoen 2022; Luken et al. 2021). In this study, we employed a deep learning method based on Generative Adversarial Imputation Networks (GAIN) (Yoon et al. 2018) to impute missing data across one or multiple observed wavelength bands in a large sample of galaxies. This method has been tested on a small radio continuum catalogue and proven effective (Luken et al. 2021). To the best of our knowledge, apart from Luken et al. (2021), this is the first time that this method has been applied to photo-$z$ estimation. Our aim is to ensure that the imputed data align closely with the actual observational values, thus reducing the influence of missing data on photo-$z$ estimation and enhancing the precision of the estimates.
We implemented the GAIN method on simulated optical survey data for the Chinese Space Station Telescope (CSST). Expected to be launched within the next two years, the CSST will share its orbit with the China Manned Space Station (Zhan 2011; Cao et al. 2018; Gong et al. 2019). The CSST survey is designed to cover approximately 17,500 deg$^2$ over a period of approximately 10 years, encompassing the optical and near-infrared (NIR) bands from approximately 250 nm to 1000 nm. The 5$\sigma$ limit for point source magnitudes can reach around 26 AB mag for the $g$, $r$, and $i$ bands, and approximately 24.5 to 25.5 for the other bands. Figure 1 illustrates the transmissions (currently under test), including detector quantum efficiency, of the seven photometric filters employed by CSST. The primary scientific goals of CSST involve investigating the evolution of large-scale structure, the properties of dark matter and dark energy, as well as galaxy formation and evolution, among others (Gong et al. 2019; Cao et al. 2022; Zhan 2021). Thus, accurate photo-$z$ measurements are essential for achieving these objectives.
Figure 1. The solid curves illustrate the transmissions of the seven photometric bands used in CSST. These curves take into account the impact of detector quantum efficiency. For more information on the specific transmission parameters, please refer to Cao et al. (2018).
We rigorously assessed the accuracy of the imputed photometric values and then conducted photo-$z$ estimation using template fitting. By comparing the photo-$z$ estimation accuracy before and after imputation, we evaluated the effectiveness of the missing data imputation method we employed. It is important to note that after imputation with our proposed method, galaxies with missing values can be used in various photo-$z$ estimation methods or other applications in the same way as galaxies with complete data.
The organization of this paper is as follows. In Section 2, we explain the process used to generate the mock photometry catalogues utilized in this study. Section 3 provides a detailed description of our adopted missing value imputation method and presents the results obtained from applying the method to the CSST mock missing samples. In Section 4, we compare the accuracy of the photo-$z$s obtained using the EAZY method before and after imputation. Finally, we summarize the results and draw conclusions in Section 5.

MOCK DATA
In this section, we provide a brief overview of how the mock data are generated. More detailed information on the mock process can be found in Zhou et al. (2021). This catalog has been employed in several studies aligned with CSST objectives, with a focus on exploring machine learning and spectral energy distribution fitting techniques. The mock data are designed to have characteristics similar to those of the CSST survey observations, including redshift, magnitude distribution, and galaxy types. To ensure a high level of realism in simulating galaxy images for the CSST photometric survey, we use mock image generation techniques based on observations taken within the COSMOS field using the Advanced Camera for Surveys of the Hubble Space Telescope (HST-ACS), while incorporating CSST instrumental effects. The mock flux data of galaxies are measured from these images using aperture photometry. The COSMOS HST-ACS survey covers an area of approximately 2 deg$^2$ in the F814W band, which has a spatial resolution similar to that of the CSST, with an 80% energy concentration radius of $R_{80} \sim 0.15''$ (Cao et al. 2018; Gong et al. 2019; Koekemoer et al. 2007; Massey et al. 2010; Bohlin 2016). Moreover, the COSMOS HST-ACS F814W survey has significantly lower background noise than the CSST survey, expected to be approximately 1/3 of the CSST level. This provides a solid foundation for the simulation of CSST galaxy images. Here, we summarize the important points.
First, we select an area of 0.85 × 0.85 deg$^2$ from the HST-ACS survey, where ∼192,000 galaxies can be identified. Then we rescale the pixel size from $0.03''$ of the HST survey to $0.075''$ of the CSST survey. The identified galaxies are extracted as square stamp images with the galaxies at the image centers. The image sizes are 15 times the semi-major axis of the galaxies, which can be obtained from the COSMOS weak lensing source catalog (Leauthaud et al. 2007), so our galaxy images have different sizes. Other sources in the image are masked and replaced by background noise, and only the galaxy image in the center is preserved.
Next, we rescale the galaxy images from the HST-ACS F814W survey to the CSST flux level. This is done by using galaxy spectral energy distributions (SEDs) to obtain the CSST seven-band images. The galaxy SEDs are generated by fitting the fluxes and other photometric information provided in the COSMOS2015 catalog using the LePhare code (Arnouts et al. 1999; Ilbert et al. 2006; Laigle et al. 2016). During this fitting process, the photo-$z$s from the catalog are fixed. The SED templates used for fitting are also sourced from this catalog, and they are extended from ∼900 Å to ∼90 Å using the BC03 method (Bruzual & Charlot 2003). This extension allows for the inclusion of fluxes from high-redshift galaxies in all CSST photometric bands. Further details can be found in Cao et al. (2018).
We select around 100,000 high-quality galaxies with reliable photo-$z$ measurements for the SED fitting process. In addition to dust extinction, we also consider emission lines such as Ly$\alpha$, H$\alpha$, H$\beta$, [OII], and [OIII]. After fitting the galaxy SEDs, we can calculate the theoretical flux data by convolving them with the CSST filter transmission curves, as depicted in Figure 1. Simultaneously, we calculate the fluxes of the F814W images using an aperture size of 2 times the Kron radius (Kron 1980). The CSST seven-band images are then produced by rescaling the fluxes accordingly. To match the CSST observations, the background noise is also adjusted to the same level. Further details regarding the noise adjustment can be found in Zhou et al. (2022). As a result, we obtain the mock CSST galaxy images for the seven CSST photometric bands.
To measure the flux in our galaxy mock data, we employ aperture photometry. We first determine the Kron radius along the major and minor axes, which defines an elliptical aperture of size $1 \times R_{\rm Kron}$. Within this aperture, the flux and its corresponding error can be calculated for each band.
The final CSST mock catalog comprises measurements in the seven CSST bands, including flux, flux error, and photo-$z$, for nearly 100,000 galaxies. These CSST mock galaxies were selected from the COSMOS catalog, whose photo-$z$ estimates are computed from over 30 bands covering a wide range of the electromagnetic spectrum. Laigle et al. (2016) verified the photo-$z$ estimates in the COSMOS2015 catalog by comparing them with various spectroscopic survey samples. Additional information regarding the accuracy of the photo-$z$ estimates and the characteristics of the spectroscopic redshift (spec-$z$) samples can be found in Tables 4 and 5, as well as Figures 11 and 12, of Laigle et al. (2016). Based on the demonstrated precision and accuracy of the photo-$z$ estimates in the COSMOS catalog, we consider them reliable and have adopted them as the true redshift values, referred to as $z_{\rm true}$ hereafter.
From this CSST simulated catalog, we further selected a high-quality photometric sub-sample, in which the signal-to-noise ratio (SNR) is greater than 10 in either the $g$ or $i$ band, with valid observations in the other bands.
The sub-sample obtained in this way is referred to as the High-quality CSST sample (HCS), which includes 40,763 sources.
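The selection just described can be written as a boolean cut on a flux table. The sketch below is illustrative only: the band ordering, the default pair of SNR bands, and the definition of a "valid" observation as a finite flux with a positive, finite error are assumptions, not details taken from the catalog.

```python
import numpy as np

BANDS = ["NUV", "u", "g", "r", "i", "z", "y"]  # assumed column order


def select_high_quality(flux, flux_err, snr_bands=("g", "i"), snr_min=10.0):
    """Return a boolean mask of sources passing the SNR and validity cuts."""
    flux = np.asarray(flux, dtype=float)
    flux_err = np.asarray(flux_err, dtype=float)
    # Assumption: "valid" means finite flux and a positive, finite error.
    valid = np.isfinite(flux) & np.isfinite(flux_err) & (flux_err > 0)
    safe_err = np.where(valid, flux_err, 1.0)  # avoid division by zero or NaN
    snr = np.where(valid, flux / safe_err, 0.0)
    idx = [BANDS.index(b) for b in snr_bands]
    bright = (snr[:, idx] > snr_min).any(axis=1)  # SNR cut in either band
    return bright & valid.all(axis=1)


flux = np.array([
    [5.0, 5.0, 20.0, 5.0, 5.0, 5.0, 5.0],     # passes through the third band
    [5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0],      # too faint everywhere
    [np.nan, 5.0, 5.0, 5.0, 15.0, 5.0, 5.0],  # bright, but one band missing
])
err = np.ones_like(flux)
selected = select_high_quality(flux, err)
```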
Figure 2 illustrates the distribution of redshifts for the HCS dataset. The plot shows that the majority of sources have redshifts concentrated in the range $z = 0.8 - 1.0$. The redshift distribution extends from 0 to 5, implying the inclusion of sources spanning a wide range of cosmic distances. Figure 3 provides the distribution of AB magnitudes in the $g$ and $i$ bands for the HCS dataset. This plot gives an overview of the brightness distribution in these specific bands for the selected high-quality sub-sample.
Furthermore, we simulated different missing rates to create multiple sub-samples for imputation testing. The missing rate is defined as $n/7$, where $n$ is the number of missing bands. For each source in HCS, we randomly deleted the value in one band, resulting in a sub-sample with a missing rate of approximately 14.3%, known as the One-band Missing Sample (OMS). Similarly, we randomly removed 2, 3, or 4 different bands for each source in HCS, creating three additional sub-samples referred to as the Two-band Missing Sample (TMS), Three-band Missing Sample (TrMS), and Four-band Missing Sample (FMS). Each source in these sub-samples has 2, 3, or 4 missing values, with missing rates of approximately 28.6%, 42.9%, and 57.1%, respectively.
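A minimal sketch of how such samples could be generated is given below, assuming the photometry is held in a NumPy array with one row per source and one column per band. The function name, the random seed, and the synthetic magnitudes are illustrative; the mask convention (1 = observed, 0 = missing) is chosen to match the usual GAIN convention.

```python
import numpy as np


def drop_random_bands(x, n_missing, rng):
    """Delete n_missing distinct, randomly chosen bands for every source."""
    x = np.asarray(x, dtype=float)
    n_src, n_bands = x.shape
    # Sorting random numbers row by row yields an independent random
    # permutation of the band indices for each source.
    order = np.argsort(rng.random((n_src, n_bands)), axis=1)
    mask = np.ones((n_src, n_bands), dtype=int)  # 1 = observed, 0 = missing
    rows = np.arange(n_src)[:, None]
    mask[rows, order[:, :n_missing]] = 0
    x_missing = np.where(mask == 1, x, np.nan)
    return x_missing, mask


rng = np.random.default_rng(0)
hcs = rng.normal(24.0, 1.0, size=(1000, 7))  # synthetic stand-in magnitudes
samples = {name: drop_random_bands(hcs, k, rng)
           for name, k in [("OMS", 1), ("TMS", 2), ("TrMS", 3), ("FMS", 4)]}
missing_rates = {name: 1.0 - m.mean() for name, (_, m) in samples.items()}
```

With exactly $n$ bands removed per source, the missing rate of each sample equals $n/7$ by construction.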

Missing Data
Missing values are frequently encountered in astronomical datasets, with various underlying causes. These include: (1) incomplete observations, such as galaxies or regions not observed in specific wavelength bands, as well as galaxies or regions that are masked; (2) instrument constraints and adverse observing conditions, leading to photometric measurements that fall below catalog detection thresholds or have notably low signal-to-noise ratios; (3) recording errors and data corruption; and (4) missing values resulting from combining data from multiple surveys, for example, due to variations in depth between different surveys or the absence of certain objects in a particular survey.
Based on the missingness assumptions, missing data can be classified into three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) (Little & Rubin 2019; Graham 2009; Van Buuren 2018). Missing data in astronomical research can be classified in the same way. Specifically, MCAR indicates that the missingness is random and unrelated to any observed or unobserved factors; MAR indicates that the missingness is related to observed variables but not to unobserved variables; MNAR indicates that the missingness is related to unobserved factors. In the majority of astronomical research samples, most missing data are categorized as either MCAR or MAR, with MNAR being less common.
For data that are MCAR or MAR, many imputation algorithms can be used to estimate missing values from the existing observed or measured data. An effective imputation algorithm should not only predict missing values accurately but also maintain the consistency and accuracy of feature-label relationships after imputation.
State-of-the-art imputation methods can be classified into two categories: discriminative methods and generative methods. Discriminative methods model the relationship between observed and missing data directly, often using statistical techniques or machine learning algorithms to predict missing values from other variables. Generative methods, on the other hand, aim to model the joint distribution of observed and missing data, creating a complete representation of the data distribution. These methods use approaches such as Bayesian networks or latent variable models to impute missing values by considering the underlying structure of the data. Discriminative methods include MICE (White et al. 2011), MissForest (Stekhoven & Bühlmann 2012), and matrix completion techniques, while generative methods include expectation-maximization algorithms and deep learning models such as denoising autoencoders (DAE) and generative adversarial networks (GAN). Yoon et al. (2018) introduced a novel approach that leverages the well-known GAN framework for filling in missing data, which they termed Generative Adversarial Imputation Nets (GAIN). Evaluating GAIN on diverse datasets, they found that it significantly outperforms other state-of-the-art imputation methods in imputation accuracy for MCAR data. Further research has shown that the GAIN method also performs well on MAR data (Dong et al. 2021).
Next, we provide a detailed description of the GAIN method used in this study and present the results obtained by applying the GAIN method to mock missing samples from CSST.

GAIN Algorithm
The GAIN algorithm is based on the well-known GAN framework and is designed to handle the unique characteristics of the imputation problem. Various experiments conducted on real-world datasets demonstrate that GAIN outperforms current state-of-the-art imputation techniques. For a more comprehensive understanding, we recommend referring to Yoon et al. (2018).
Here, we will provide a general overview of the GAIN algorithm.
The architecture of GAIN is shown in Figure 4. It consists of two main components: a generator ($G$) and a discriminator ($D$). $G$ takes incomplete data as input and aims to accurately impute the missing values. It does so by generating plausible values for the missing data based on the observed data and any available contextual information. $D$'s role, on the other hand, is to distinguish between the observed data and the imputed data generated by $G$. It is trained to minimize the classification loss, which involves classifying whether each component was observed or imputed. $G$ and $D$ are trained simultaneously in an adversarial process. $G$'s training objective is to maximize $D$'s misclassification rate, while $D$'s objective is to correctly classify the observed and imputed components. This adversarial training process ensures that $G$ improves its imputation performance over time, gradually generating more accurate imputations. The architecture of GAIN builds upon and adapts the standard GAN architecture to handle the specific challenges and characteristics of the missing data imputation problem.
The detailed imputation procedure with GAIN can be summarized as follows, using the notation of Yoon et al. (2018). Consider a dataset $X$ that requires imputation. $M$ is a mask matrix of the same shape as $X$ that identifies the missing values (1 for observed entries, 0 for missing ones), and $Z$ is a matrix of the same shape as $X$ representing random noise that is independent of all other variables. To prevent differences in variable scale from affecting variable weights, we rescale all variables in $X$ to the range 0-1. Next, a fully connected multilayer neural network generator $G$ is constructed. $G$ takes $X$, $M$, and $Z$ as inputs. During the initial training phase, the output of $G$ can be represented as:
$$\bar{X} = G\left(X, M, (1 - M) \odot Z\right),$$
where $\odot$ denotes element-wise multiplication. The imputed matrix, denoted as $\hat{X}$, can be defined as:
$$\hat{X} = M \odot X + (1 - M) \odot \bar{X}.$$
A fully connected multilayer neural network discriminator $D$ is then constructed. $D$ takes the imputed dataset generated by the generator and a hint matrix $H$ as inputs. The hint matrix $H$ is created using a hint mechanism to enhance discrimination:
$$H = B \odot M + 0.5\,(1 - B).$$
Here, $B$ is a random variable:
$$B \in \{0, 1\}^{d},$$
where $d$ is the number of variables. The discriminator $D$ outputs the probability of a value being real or fake:
$$\hat{M} = D(\hat{X}, H).$$
Next, we construct the adversarial networks and start training. We randomly extract a mini-batch of $k$ samples from the dataset for the training process. Training begins by optimizing the loss function for $D$, followed by alternating training between $G$ and $D$, optimizing their respective loss functions:
$$\mathcal{L}_D = -\mathbb{E}\left[M \odot \log \hat{M} + (1 - M) \odot \log\left(1 - \hat{M}\right)\right]$$
and
$$\mathcal{L}_G = -\mathbb{E}\left[(1 - M) \odot \log \hat{M}\right] + \alpha \mathcal{L}_M.$$
Here, $\mathcal{L}_M$ represents the sum of mean squared errors (MSE) between the observed values and the predicted values at the corresponding locations, and $\alpha$ is a hyper-parameter. Finally, we apply the inverse of the rescaling to restore the imputed values to their original scale.
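The data flow of a single GAIN pass (rescaling, noise injection, combination of observed and generated values, hint construction, and inverse rescaling) can be sketched without any training loop. Everything below is a hedged illustration: the generator is a hypothetical stand-in callable rather than a neural network, the noise range and the Bernoulli draw for $B$ with probability `hint_rate` are assumptions, and the hint formula follows the form given by Yoon et al. (2018).

```python
import numpy as np


def minmax_normalize(x):
    """Rescale each column to [0, 1], ignoring NaN (missing) entries."""
    lo = np.nanmin(x, axis=0)
    hi = np.nanmax(x, axis=0)
    return (x - lo) / (hi - lo), lo, hi


def gain_forward(x_norm, mask, generator, hint_rate, rng):
    """One pass: build generator input, combine outputs, build the hint."""
    x_obs = np.nan_to_num(x_norm)                    # missing entries -> 0
    z = rng.uniform(0.0, 0.01, size=x_norm.shape)    # noise Z (assumed range)
    x_in = mask * x_obs + (1 - mask) * z             # noise only where missing
    x_bar = generator(x_in, mask)                    # generator output
    x_hat = mask * x_obs + (1 - mask) * x_bar        # keep observed values
    b = (rng.random(x_norm.shape) < hint_rate).astype(float)  # assumed Bernoulli B
    hint = b * mask + 0.5 * (1 - b)                  # H = B*M + 0.5*(1 - B)
    return x_hat, hint


def constant_generator(x_in, mask):
    """Hypothetical stand-in for a trained generator network."""
    return np.full_like(x_in, 0.5)


rng = np.random.default_rng(1)
x_full = rng.normal(24.0, 1.0, size=(50, 7))
mask = (rng.random(x_full.shape) > 0.2).astype(float)  # 1 = observed
x_missing = np.where(mask == 1, x_full, np.nan)

x_norm, lo, hi = minmax_normalize(x_missing)
x_hat, hint = gain_forward(x_norm, mask, constant_generator, 0.85, rng)
x_restored = x_hat * (hi - lo) + lo                    # inverse rescaling
```

In a full implementation the stand-in generator would be replaced by the trained network, and `x_hat` together with `hint` would be passed to the discriminator.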

Imputation Results on CSST Mock Data
In this section, we assess the imputation performance of GAIN using four datasets: OMS, TMS, TrMS and FMS.In all experiments, the GAIN model is configured with a 6-layer generator and discriminator.
The number of hidden nodes in each layer is 14, 14, 7, 7, 7, and 7, respectively. The activation function used for each layer, except for the output layer, is $tanh$. The output layer employs the $sigmoid$ activation function. The mini-batch size is set to 256 for both the generator and discriminator. The hyperparameter $\alpha$ is set to 100, and the hint rate is set to 0.85. The accuracy of imputation was evaluated by measuring the imputation error, which is defined as the difference between the imputed values and the actual values. In this study, the normalized root mean square error (NRMSE) was used as the assessment metric; it is computed over the $N$ missing values from the imputed values $\hat{x}_i$ and the corresponding original values $x_i$.
The GAIN method was applied for imputation 100 times for each of the simulated incomplete datasets, OMS, TMS, TrMS, and FMS, with each imputation involving a complete retraining. The NRMSE values obtained from these 100 imputations were analyzed, and the average NRMSE and standard deviation were calculated.
Figure 4 (caption). All the subscripts denote the CSST filter, and the superscripts represent the index of the galaxy sample. The symbols in the figure represent the true observed magnitude of the galaxies, the predicted magnitude after imputation, the random noise, and the probability distribution. Additionally, the symbol "X" in red represents the missing magnitudes in that particular filter.
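The evaluation loop described here can be sketched as follows. Because the normalization used in the NRMSE is not recoverable from the text, the sketch assumes one common choice, the RMSE over the missing entries divided by the range of the true values at those entries; the function names and the toy numbers are illustrative only.

```python
import numpy as np


def nrmse(x_true, x_imputed, mask):
    """RMSE over missing entries (mask == 0), divided by the true-value range.

    The range normalization is an assumption; other conventions divide by
    the standard deviation or the mean instead.
    """
    missing = np.asarray(mask) == 0
    truth = np.asarray(x_true, dtype=float)[missing]
    guess = np.asarray(x_imputed, dtype=float)[missing]
    rmse = np.sqrt(np.mean((guess - truth) ** 2))
    return rmse / (truth.max() - truth.min())


def summarize_runs(scores):
    """Mean and standard deviation of the NRMSE over repeated imputations."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std()


x_true = np.array([[0.0, 10.0], [5.0, 5.0]])
mask = np.array([[0, 1], [0, 0]])            # 0 marks the imputed entries
x_imputed = np.array([[1.0, 10.0], [5.0, 7.0]])
score = nrmse(x_true, x_imputed, mask)
mean_score, std_score = summarize_runs([score, score])
```

In practice `nrmse` would be evaluated once per retrained model and per band, and `summarize_runs` applied to the resulting 100 scores.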
Table 1 shows the imputation errors (NRMSE) and their standard deviations obtained when using the GAIN method to impute missing values in the OMS, TMS, TrMS, and FMS datasets. For the OMS and TMS datasets, which have relatively low missing rates (missing rate < 30%), the GAIN method performs well and yields small imputation errors. This suggests that when one or two bands are missing in the CSST photometry, the GAIN method adopted in this study can accurately restore the missing values. However, as the missing rate increases, the imputation error of the GAIN method also increases. For the TrMS and FMS datasets, with missing rates of 42.9% and 57.1% respectively, the imputation errors in each band increase significantly.
Furthermore, we found that among the seven CSST bands, the $NUV$ band has the largest imputation errors when missing values are imputed with the GAIN method, followed by the $u$ band, while the $g$, $r$, $i$, $z$, and $y$ bands show very good imputation performance, especially at lower missing rates. Apart from the model itself, we believe that the accuracy of the CSST photometry data may also affect the results. In the CSST photometry data, the $NUV$ band has the lowest accuracy, followed by the $u$ band, while the accuracy of the $g$, $r$, $i$, $z$, and $y$ bands is the highest. This suggests that larger photometric errors may introduce uncertainties into the model training, leading to larger imputation uncertainties when filling in missing values. Figure 5 displays the distribution of photometric errors for each band in our CSST simulated data (HCS). From the figure, it can be observed that both the mean and variance of the error distributions for the $NUV$ and $u$ bands are significantly higher than those for the other bands.
After imputing the missing data 100 times, we take the average of the imputed values as the predicted value for the missing data, and take the standard deviation as the error of the predicted value. Figure 6 shows the comparison between the true values and the predicted values for the OMS dataset. The solid line represents the one-to-one relation (predicted value equal to true value), and the dashed line represents a linear regression fit to the data points in each subplot. It can be clearly seen that the GAIN method is effective in imputing missing photometry data, especially at lower missing rates. Similarly, the GAIN method also performs well on the TMS dataset. However, at higher missing rates, as in the TrMS and FMS datasets, the imputation accuracy of the GAIN method decreases significantly. Figure 7 shows the comparison between the predicted values and true values in the FMS dataset, where the missing rate reaches 57.1%. Comparing Figure 6 and Figure 7 clearly shows the impact of the missing rate on imputation performance: the lower the missing rate, the better the GAIN method imputes the photometry data.
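The averaging step and the regression line used in the comparison figures can be sketched as below. The function names and toy numbers are illustrative; the standard deviation uses NumPy's default population form (`ddof=0`), which is an assumption about the convention.

```python
import numpy as np


def combine_imputations(runs, mask):
    """Per-entry mean and standard deviation over repeated imputations.

    Only the originally missing entries (mask == 0) are reported; observed
    entries are set to NaN in both outputs.
    """
    stack = np.stack([np.asarray(r, dtype=float) for r in runs])
    missing = np.asarray(mask) == 0
    predicted = np.where(missing, stack.mean(axis=0), np.nan)
    error = np.where(missing, stack.std(axis=0), np.nan)
    return predicted, error


def fit_line(true_values, predicted_values):
    """Least-squares straight line through (true, predicted) points."""
    slope, intercept = np.polyfit(true_values, predicted_values, deg=1)
    return slope, intercept


runs = [np.array([[1.0, 2.0]]), np.array([[3.0, 2.0]])]
mask = np.array([[0, 1]])
predicted, error = combine_imputations(runs, mask)
slope, intercept = fit_line([1.0, 2.0, 3.0], [1.1, 2.1, 3.1])
```

A slope close to 1 and an intercept close to 0 indicate that the predictions follow the one-to-one relation.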
To compare the differences between imputed values and true values across different bands, we used density plots to visualize the imputation quality of each band within the OMS dataset. In Figure 8, the distribution of differences between the values predicted by GAIN and the true values for the missing data in each band is shown by the black line, along with a Gaussian fit shown by the red line. The differences between the imputed values and the true values in each band are concentrated close to zero, indicating high imputation accuracy. However, the $NUV$ and $u$ bands exhibit a broader distribution of differences and a higher density of larger differences. Moreover, our results suggest that the differences between imputed values and true values become more pronounced in datasets with higher missing rates, such as TMS, TrMS, and FMS.
Finally, we compared the error distribution of the predicted values with that of the true values. This comparison allows us to evaluate the accuracy and reliability of the model, understand the degree of difference between predicted results and true values, and guide further model improvement. In Figure 9, we illustrate the error distribution of predicted values (black line) and true values (red line) in all datasets. In all bands, the errors of our model's predicted values are larger than those of the true values, especially in the $NUV$ and $u$ bands, which is an expected and consistent outcome. Nevertheless, as illustrated in the lower right subplot, the overall error distribution of predicted values closely corresponds with that of true values.

APPLICATIONS IN PHOTO-𝑍 ESTIMATION
In the previous section, missing values were filled in the simulated CSST datasets (OMS, TMS, TrMS, and FMS). In this section, the EAZY template fitting method is used to estimate the photo-$z$s of these datasets before and after imputation. The necessity of missing data imputation and the effectiveness of the GAIN method are validated by comparing the changes in the quality metrics of photo-$z$ estimation for the samples.

Photo-𝑧 Quality Metrics
To assess the quality of our photo-$z$ estimates on the sample, we introduce three metrics. The first metric is the normalized median absolute deviation (NMAD), which measures accuracy and is defined as (Brammer et al. 2008)
$$\sigma_{\rm NMAD} = 1.48 \times {\rm median}\left(\frac{\left|\Delta z - {\rm median}(\Delta z)\right|}{1 + z_{\rm true}}\right), \qquad \Delta z = z_{\rm phot} - z_{\rm true},$$
where $z_{\rm true}$ is the reference redshift used as the 'ground truth' and $z_{\rm phot}$ is the predicted photo-$z$. NMAD is preferred over the standard deviation because it is less sensitive to outliers, and its scaling factor of 1.48 allows NMAD to be interpreted as the standard deviation for normally distributed data. The second metric is the proportion of catastrophic outliers ($f_{out}$). Sources whose photo-$z$ estimates satisfy the following condition (Fotopoulou & Paltani 2018; Euclid Collaboration: Humphrey et al. 2023),
$$\frac{\left|z_{\rm phot} - z_{\rm true}\right|}{1 + z_{\rm true}} > 0.15,$$
are considered catastrophic outliers, meaning their photo-$z$s are incorrect.
The final metric is the bias of the photometric redshifts ($bias$), which examines whether, and to what extent, the redshifts of the galaxies are systematically overestimated or underestimated.
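As an illustration, the three metrics can be computed from arrays of predicted and reference redshifts as in the minimal sketch below. It assumes the Brammer et al. (2008) form of NMAD and the commonly used outlier threshold of 0.15; the bias is taken here as the mean of $\Delta z/(1+z_{\rm true})$, which is an assumption of this sketch rather than a definition taken from the text. The function name `photoz_metrics` is hypothetical.

```python
import numpy as np

def photoz_metrics(z_phot, z_true, outlier_threshold=0.15):
    """Return sigma_NMAD, outlier fraction and bias for a photo-z sample.

    Assumptions of this sketch: NMAD follows Brammer et al. (2008),
    outliers satisfy |dz|/(1+z_true) > outlier_threshold, and the bias
    is the mean of dz/(1+z_true).
    """
    z_phot = np.asarray(z_phot, dtype=float)
    z_true = np.asarray(z_true, dtype=float)
    dz = z_phot - z_true
    norm = 1.0 + z_true

    # Robust scatter: 1.48 * median absolute deviation of the residuals.
    sigma_nmad = 1.48 * np.median(np.abs(dz - np.median(dz)) / norm)
    # Fraction of sources whose normalized residual exceeds the threshold.
    f_out = np.mean(np.abs(dz) / norm > outlier_threshold)
    # Systematic offset (assumed definition: mean normalized residual).
    bias = np.mean(dz / norm)
    return {"sigma_nmad": sigma_nmad, "f_out": f_out, "bias": bias}
```

Calling `photoz_metrics(z_phot, z_true)` once on the photo-$z$s obtained before imputation and once on those obtained after imputation gives the pairs of values that are compared in this section.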

Photo-$z$ Results
There are many methods available for estimating redshift from photometric data, mainly divided into template fitting and machine learning methods (for details, see the review by Salvato et al. (2019)). Among template fitting methods, EAZY is a commonly used software. For example, Yang et al. (2014) used EAZY to estimate the redshifts of the Hawaii-Hubble Deep Field-North (H-HDF-N) survey catalog, while Chen et al. (2018) used EAZY to estimate the photo-$z$s of X-ray point sources in the XMM-Large Scale Structure (XMM-LSS) survey region. Both studies demonstrated the excellent performance of EAZY. In the following tests, we use EAZY to estimate the photo-$z$s of our samples. According to Desprez et al. (2020), different template fitting methods provide almost identical results when run with the same configuration. The observed differences are not due to differences in the performance of the methods, but rather to differences in their configurations. Therefore, we expect that our test results will also hold for other template fitting methods.
In our EAZY analysis, we used the standard CWW+KIN template set, which is based on the CWW empirical template set (Coleman et al. 2003) with the extension prescribed by Kinney et al. (1996). This template set includes six templates and is commonly employed in photo-$z$ estimation. Additionally, we incorporated the $r$-band apparent magnitude prior $p(z|m_r)$, which represents the redshift distribution of galaxies with apparent magnitude $m_r$. The parameter Z_STEP_TYPE was set to 0, indicating a uniformly spaced redshift grid. Other parameters were kept at their default settings. The input bands included the seven CSST bands.
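To make the redshift grid setting concrete, the sketch below builds a grid under two step conventions. It assumes the convention described in EAZY's default parameter file as we understand it: a step type of 0 gives a constant step in $z$, while a step type of 1 scales the step by $(1+z)$. The function `redshift_grid` and the grid limits used in the example are hypothetical and are not taken from the EAZY source code or from this paper's configuration.

```python
import numpy as np

def redshift_grid(z_min, z_max, z_step, step_type=0):
    """Build a redshift grid (illustrative re-implementation, not EAZY code).

    step_type 0: constant step z_step (uniformly spaced grid).
    step_type 1: step grows as z_step * (1 + z) (assumed convention).
    """
    if step_type == 0:
        # Small tolerance guards against floating-point round-off at z_max.
        n = int(np.floor((z_max - z_min) / z_step + 1e-9)) + 1
        return z_min + z_step * np.arange(n)

    grid = [z_min]
    while True:
        z_next = grid[-1] + z_step * (1.0 + grid[-1])
        if z_next > z_max:
            break
        grid.append(z_next)
    return np.array(grid)
```

With a uniform grid, the resolution in $z$ is the same at all redshifts, whereas the $(1+z)$-scaled grid becomes coarser towards high redshift.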
The photo-$z$ estimation results obtained with EAZY and the $r$-band apparent magnitude prior are presented in Table 2 for all samples in this work. "Before imputation" denotes the results for the samples prior to data imputation, whereas "After imputation" denotes the results after filling in the missing data. Table 2 shows that the overall quality of photo-$z$ estimation improves significantly after missing value imputation. As the level of missing data in the sample increases (OMS → TMS → TrMS → FMS), the gain in photo-$z$ accuracy after imputation becomes more noticeable. For OMS, the three quality metrics change only minimally. From TMS to TrMS and then to FMS, however, these metrics improve substantially. The catastrophic outlier fraction ($f_{out}$) shows relative improvements of 12.1% (TMS), 24.7% (TrMS), and 28.5% (FMS) after imputation. Both the normalized median absolute deviation ($\rm {\sigma_{NMAD}}$) and the bias of the photometric redshifts ($bias$) also improve significantly. Importantly, for higher rates of missing data (such as TrMS and FMS), photo-$z$ estimation with imputed data is more accurate and of greater practical value than the pre-imputation results.
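The relative improvements quoted here can be read as the fractional reduction of a metric after imputation. The sketch below assumes this definition, (before − after) / before, applied to absolute values so that it also works for a signed bias; the helper name `relative_improvement` is hypothetical and the numbers in the usage note are made up for illustration, not taken from Table 2.

```python
def relative_improvement(before, after):
    """Fractional reduction of |metric| after imputation.

    Assumed definition: (|before| - |after|) / |before|.
    A positive value means the metric moved closer to zero (improved).
    """
    before, after = abs(before), abs(after)
    if before == 0:
        raise ValueError("relative improvement is undefined when before == 0")
    return (before - after) / before
```

For example, a made-up outlier fraction that drops from 0.20 to 0.15 gives `relative_improvement(0.20, 0.15)`, a 25% relative improvement.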
The results of using EAZY without prior information are shown in Table 3. After imputing all missing values in the samples, the three photo-$z$ quality metrics again improve. However, unlike in Table 2, the improvement does not increase monotonically with the proportion of missing data. For example, from OMS to TMS, TrMS, and finally FMS, the improvement rate of the catastrophic outlier fraction $f_{out}$ increases monotonically in Table 2, whereas in Table 3 it does not. More specifically, without prior information, TMS shows the most significant improvement after filling in the missing values, with $f_{out}$ improving by approximately 17.1%, $\rm {\sigma_{NMAD}}$ improving by about 11.3%, and $bias$ improving by 29.8%. The OMS dataset follows, with $f_{out}$ improving by 11.6%, $\rm {\sigma_{NMAD}}$ improving by 23.5%, and $bias$ improving by 6%. Conversely, the TrMS and FMS datasets, which have higher rates of missing data, show smaller improvements after imputation, and their practical utility is limited by the lower photo-$z$ accuracy in these samples.
By considering the results in Section 3.3, we can understand the reasons for these differences. In the GAIN method, the imputation accuracy for the $r$ band remains relatively high even at a high data missing rate. Therefore, by incorporating $r$-band prior information in EAZY, the photo-$z$ results can be significantly improved, even for the TrMS and FMS datasets, as shown in Table 2.
Comparing Tables 2 and 3, we can also see that for samples containing missing data (OMS, TMS, TrMS, and FMS), either adding the $r$-band prior or imputing the missing values enhances the accuracy of photo-$z$ estimation. However, the improvement is most significant when the $r$-band prior and imputation are applied together. The enhancement from adding the $r$-band prior without imputing missing values can be seen by comparing the "Before Imputation" columns of Tables 2 and 3, while the improvement from imputing missing values without the prior can be seen by comparing the "Before Imputation" and "After Imputation" columns of Table 3. The gain from both imputing missing values and adding the $r$-band prior can be seen by comparing the "Before Imputation" and "After Imputation" columns of Table 2.
In addition, we tested the addition of prior information from other bands, such as the -band, and found that this can also improve the photo-$z$ results for each sample. However, compared with the $r$-band prior, the improvement in photo-$z$ from priors in other bands is less significant. This is because, within the redshift range of our samples, the importance of all bands other than $NUV$ in photo-$z$ estimation is lower than that of the $r$ band, as illustrated in Figure 13 of Lu et al. (2024). Additionally, the photometric accuracy of the $NUV$ band is the lowest, making it unsuitable for constructing prior information. Therefore, in the photo-$z$ tests presented in this paper, we considered prior information specifically for the $r$ band.

CONCLUSIONS
In this study, we employed a deep learning technique known as Generative Adversarial Imputation Networks (GAIN; Yoon et al. 2018) to impute missing photometric data in CSST. Our study will help improve the utilization of CSST observational data in the future and provide a new alternative method for handling missing data in large observational samples. Although our study focuses on CSST, this method can also be applied to ongoing or upcoming surveys such as LSST, Euclid, DES, and others. By following the imputation method outlined in this paper, datasets with missing values can be efficiently used with software applications that require complete photometry.
The CSST survey includes photometry in seven bands, namely $NUV$, $u$, $g$, $r$, $i$, $z$, and $y$, spanning optical and near-infrared wavelengths. Photometric observations for CSST were simulated using data from HST-ACS and the COSMOS catalog, taking into account the instrument effects specific to CSST. Mock galaxy images in the seven bands were generated, and fluxes along with their observational errors were calculated using photometric apertures. The initial sample for this study was selected by requiring a signal-to-noise ratio exceeding 10 in the  or  bands and valid observations in all bands. Subsequently, photometric data in the seven CSST bands of each galaxy in this sample were randomly removed to create sub-samples with varying data missing rates.
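The construction of sub-samples with missing data can be sketched as follows. This is a minimal illustration under a stated assumption: a fixed number of bands is removed at random from each galaxy, which is one possible way to obtain a given missing rate and is not necessarily the exact scheme used for the OMS, TMS, TrMS, and FMS samples. The function name `mask_bands` and the array layout (one row per galaxy, one column per band) are hypothetical.

```python
import numpy as np

def mask_bands(mags, n_missing, rng):
    """Randomly remove n_missing bands from every galaxy.

    mags: array of shape (n_galaxies, n_bands) with complete photometry.
    Returns the masked photometry (NaN = missing) and a binary mask
    (1 = observed, 0 = missing).
    """
    mags = np.asarray(mags, dtype=float)
    n_gal, n_band = mags.shape

    # Sorting random scores gives an independent random band order per galaxy;
    # the first n_missing bands in that order are dropped.
    scores = rng.random((n_gal, n_band))
    drop = np.argsort(scores, axis=1)[:, :n_missing]

    mask = np.ones((n_gal, n_band), dtype=int)
    np.put_along_axis(mask, drop, 0, axis=1)
    masked = np.where(mask == 1, mags, np.nan)
    return masked, mask
```

With seven bands, removing two bands per galaxy in this way corresponds to a missing rate of 2/7 (about 29%), and removing three bands to 3/7 (about 43%).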
Examining these sub-samples with missing data, we found that the GAIN method is effective in filling in the missing photometric data. Specifically, when the data missing rate is below 30%, the accuracy of the imputed photometric data is notably high. Among the seven CSST bands, the $g$, $r$, $i$, $z$, and $y$ bands display the highest imputation accuracy, followed by the $u$ band, with the $NUV$ band exhibiting the lowest imputation accuracy. We believe that, in addition to factors related to the model itself, the poor quality of the CSST photometric data in the $NUV$ band may also affect the training of the GAIN model, leading to a significant decrease in imputation accuracy for missing values in the $NUV$ band.
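For reference, the way GAIN combines observed and generated values can be sketched as below, following the formulation of Yoon et al. (2018) as we understand it: missing entries are replaced by random noise before being passed to the generator, and the generator output is kept only where data are missing. The generator here is a trivial stand-in (the column mean of the observed values), not a trained network, and the noise range is an assumption of this sketch; the names `gain_impute_step` and `column_mean_generator` are hypothetical.

```python
import numpy as np

def gain_impute_step(x, mask, generator, rng):
    """Combine observed data with generator output, GAIN-style.

    x: data with NaN at missing entries; mask: 1 = observed, 0 = missing.
    generator: callable (x_tilde, mask) -> array of the same shape as x.
    """
    mask = np.asarray(mask)
    x_obs = np.where(mask == 1, x, 0.0)              # drop NaNs safely
    noise = rng.uniform(0.0, 0.01, size=x_obs.shape) # assumed noise range
    x_tilde = mask * x_obs + (1 - mask) * noise      # generator input
    x_bar = generator(x_tilde, mask)                 # generator prediction
    # Observed values are kept; only missing entries take generated values.
    return mask * x_obs + (1 - mask) * x_bar

def column_mean_generator(x_tilde, mask):
    """Stand-in generator: per-band mean of the observed entries."""
    col_sum = (x_tilde * mask).sum(axis=0)
    col_n = np.maximum(mask.sum(axis=0), 1)
    return np.broadcast_to(col_sum / col_n, x_tilde.shape)
```

In the actual method, the stand-in would be replaced by a neural network trained adversarially against a discriminator that tries to identify which entries were imputed.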
After filling in all missing values using the GAIN method, we used the template fitting software EAZY to obtain the photo-$z$s of the samples before and after imputation, and calculated three indicators of photo-$z$ quality: the catastrophic outlier fraction ($f_{out}$), the normalized median absolute deviation ($\rm {\sigma_{NMAD}}$), and the photometric redshift bias ($bias$). By comparing the changes in these three indicators before and after imputation, we further verified the effectiveness of the GAIN method. The results show that, regardless of whether prior information is added in EAZY, imputing missing values significantly improves the quality of photo-$z$ estimation. In particular, with the inclusion of the $r$-band prior, the improvement in photo-$z$ quality increases with the proportion of missing data in the samples. Furthermore, for high data missing rates (>30%), the photo-$z$s obtained after imputation are significantly better than those obtained before imputation, making rough photo-$z$ estimation possible for such galaxies. At higher missing rates, such as those of the TrMS and FMS subsamples, the imputation errors tend to be more pronounced. Despite the increased errors, the photo-$z$ estimation shows a more substantial improvement in these cases. This suggests that, even with the challenges posed by high missing rates, the accuracy of photo-$z$ estimation can benefit significantly from effective imputation techniques.

Figure 1. The solid curves illustrate the transmissions of the seven photometric bands used in CSST. These curves take into account the impact of detector quantum efficiency. For more information on the specific transmission parameters, please refer to Cao et al. (2018).

Figure 2. The galaxy redshift distribution of the HCS dataset. The distribution peaks around $z = 0.8 \sim 1.0$ and extends to a maximum of $z \sim 5$.

Figure 3. The distribution of AB magnitudes in the  and  bands for the HCS dataset.

Figure 4. The architecture of GAIN. All the subscripts denote the CSST filter, and the superscripts represent the index of the galaxy sample. "" represents the true observed magnitude of the galaxies, " " represents the predicted magnitude after imputation, "" denotes the random noise, and " " represents the probability distribution. Additionally, the symbol "X" in red represents the missing magnitudes in that particular filter.

Figure 5. The apparent magnitude error distribution for each band in our simulated CSST data (HCS).

Figure 6. Imputed values vs. true values for the OMS dataset. The x-axis represents the true photometric values of the missing bands in the OMS dataset, while the y-axis represents the predicted values obtained using the GAIN method. The solid line represents the one-to-one relation between predicted and true values, and the dashed line represents the linear regression fit to all data points in the subplot.

Figure 7. Imputed values vs. True values for the TrMS dataset: The x-axis represents the true photometric values of the missing bands in the FMS dataset, while the y-axis represents the predicted values obtained using the GAIN method. Both solid and dashed lines have the same meaning as in Figure 6.

Figure 8. Density plots illustrating the distribution of discrepancies between the imputed values and true values for each band in the OMS dataset. The black line represents the distribution of differences between imputed and true values, while the red line indicates the Gaussian fit of this distribution.

Figure 9. The error distribution of predicted values (black line) and true values (red line) for each band in the OMS dataset. The last subplot (bottom right) represents the overall error distribution of all bands.

Table 2. Quality metrics of photo-$z$ estimation for the HCS, OMS, TMS, TrMS, and FMS datasets before and after imputation, using EAZY with the $r$-band prior.

Table 3. Quality metrics of photo-$z$ estimation for the HCS, OMS, TMS, TrMS, and FMS datasets before and after imputation, using EAZY without a prior.