AGImpute: imputation of scRNA-seq data based on a hybrid GAN with dropouts identification

Abstract Motivation Dropout events bring challenges in analyzing single-cell RNA sequencing data as they introduce noise and distort the true distributions of gene expression profiles. Recent studies focus on estimating dropout probability and imputing dropout events by leveraging information from similar cells or genes. However, the number of dropout events differs in different cells, due to the complex factors, such as different sequencing protocols, cell types, and batch effects. The dropout event differences are not fully considered in assessing the similarities between cells and genes, which compromises the reliability of downstream analysis. Results This work proposes a hybrid Generative Adversarial Network with dropouts identification to impute single-cell RNA sequencing data, named AGImpute. First, the numbers of dropout events in different cells in scRNA-seq data are differentially estimated by using a dynamic threshold estimation strategy. Next, the identified dropout events are imputed by a hybrid deep learning model, combining Autoencoder with a Generative Adversarial Network. To validate the efficiency of the AGImpute, it is compared with seven state-of-the-art dropout imputation methods on two simulated datasets and seven real single-cell RNA sequencing datasets. The results show that AGImpute imputes the least number of dropout events than other methods. Moreover, AGImpute enhances the performance of downstream analysis, including clustering performance, identifying cell-specific marker genes, and inferring trajectory in the time-course dataset. Availability and implementation The source code can be obtained from https://github.com/xszhu-lab/AGImpute.


Introduction
Single-cell RNA sequencing (scRNA-seq) data provide gene expression profiles at a single-cell resolution which can help biological researchers identify cell types and explore the process of cell development (Lee et al., 2022, Anderson et al., 2023, Song et al., 2023).However, scRNA-seq data are characterized by high sparsity and high noise, with 65%-90% zeros including true and false zero counts (Gan et al., 2022).True zeros indicate that the genes are not expressed, while false zeros are dropout events caused by technical noise (Gr€ un et al., 2014), such as low capture efficiency of mRNA, amplification bias, and library size differences.The numerous dropout events will compromise the reliability of the downstream analysis of scRNA-seq data.Therefore, identifying and imputing dropout events are important tasks for improving the quality of scRNA-seq data (Vallejos et al., 2017, Raevskiy et al., 2023).
In recent years, two groups of dropout event imputation methods for scRNA-seq data have been proposed.One class is to re-estimate all gene expression values while imputing dropout events, and the other is to identify the dropout events from the zeros first and then impute them.In the first category, several statistical model-based dropout imputation methods have been developed by borrowing information from similar cells or genes, which commonly involve clustering cells or genes and then imputing dropout events (Berrevoets et al., 2023, Mahmoudi, 2023).SAVER (Huang et al., 2018) borrowed information from similar genes, then used the average of output posterior distribution as the new predicted gene express values.MAGIC (van Dijk et al., 2018) utilized Markov transfer matrices to estimate all gene expression values in similar cells, while VIPER (Chen and Zhou, 2018) borrowed information from similar cells within cell subtypes identified by regression.ENHANCE (Wagner et al., 2019) borrowed information from neighboring cells and then estimated all expression values of low-noise principal components (PC) genes.These genes were determined based on the evaluation of technical noise captured by a single PC.In addition, some deep learning-based methods have been proposed to integrate gene expression distribution for imputation.For example, scIGANs (Xu et al., 2020) utilized a Generative Adversarial Network (GAN) to generate new values as imputations for dropout events but facing the challenge of losing biological information due to a potential mismatch between the generated data and the original scRNA-seq data.The above-mentioned methods change most gene expression values, which lead to excessive imputation and potentially introduce new noise into scRNA-seq data.
To address the excessive imputation, some statistical model-based methods have been developed that involve identifying dropout events before imputation.For example, scImpute (Li and Li, 2018) identified dropout events through estimating the dropout probability using a statistical model and imputed the identified dropouts subsequently.scLRTC (Pan et al., 2021) used cell similarity measurements and constructed a third-order tensor to estimate the gene expression values, in which some zero values were re-estimated and nonzero values were kept.These changed zeros were identified as dropout events, and their estimated values were taken as the dropout imputations.However, the number of true zeros varies among different cells due to cell heterogeneity and batch effects, which is ignored in these studies.Differentially estimating the number of dropout events for different cells is critical for effectively imputing dropout events.
Hence, we have proposed a method, named Autoencoder-GAN based imputation for scRNA-seq data (AGImpute).In AGImpute, a dynamic threshold estimation mechanism is designed to adaptively identify the number of dropout events for different cells.Then, an Autoencoder-GAN model is constructed to impute the identified dropout events by leveraging information from both similar cells and gene expression distributions.To evaluate its performance, AGImpute is compared to seven existing dropout events imputation methods on two simulated and seven scRNA-seq datasets, based on several criteria, including the number of dropout imputations, clustering performance (Buettner et al., 2015), cellspecific marker genes identification, and trajectory inference.

Materials and methods
In this article, we consider that different cells have different numbers of dropout events, and propose AGImpute method containing a dropout identification module and an Autoencoder-GAN module, shown in Fig. 1.In the dropout identification module, a dropout probability matrix is constructed by defining a probability mass function (PMF), which fits scRNA-seq data based on the combination of Zero-inflated Poisson (ZIP), Gaussian, and Zero-inflated Negative Binomial (ZINB) distributions.Based on the dropout probability matrix, a dropout probability difference matrix is constructed, and a dynamic threshold estimation is proposed to determine the number of dropout events for different cells.In the Autoencoder-GAN module, the identified dropout events are imputed by combing Autoencoder (Fanai and Abbasimehr, 2023) and GAN (Goodfellow et al., 2014) that incorporates preclustering labels from Leiden clustering.

Preclustering by Leiden
To incorporate information from similar cells, we perform Leiden clustering to group cells.The input gene expression matrix M contains n cells and m genes, where m i;j represents the expression of the j-th gene in the i-th cell.n cells are clustered into k clusters by Leiden, and the matrix M is divided into k submatrices:

Adaptive estimation of dropout events for different cells
A dropout identification module is used to identify dropout events in scRNA-seq data.In this module, a mixed distribution combining ZIP, Gaussian, and ZINB is used to fit scRNA-seq data and estimated dropout probability.Then, a dropout probability difference matrix is derived from the dropout probability matrix, and a dynamic threshold estimation is designed to adaptively predict the number of dropout events for different cells.

Prediction of dropout probability based on a mixed distribution
Previous studies have used Poisson and ZINB distribution to fit scRNA-seq data, since the former can fit technical noise well and the latter can solve large amounts of zeros (Qiu et al., 2023).To solve the problem of ZINB being unsuitable for high-noise data (Wang et al., 2018), we design a mixed distribution combining ZIP, Gaussian, and ZINB to fit scRNA-seq data.In this mixed distribution, ZIP is used to fit the distribution of zeros including true and false zeros, and Gaussian distribution is used to fit the distribution of nonzeros.The probability statistical model is extremely sensitive to the initialization parameters, and ZINB can accurately capture the true and false zero ratio.So ZINB is used to initialize the parameters of ZIP and the proportion of dropout events.The PMF of ZIP is defined as in formula (1).
where / is the proportion of true zeros for a gene that is zeroexpressed across all cells, and y is an indicator function that describes the number of expected false zeros.k is the mean of the Poisson distribution.
The PMF of ZINB is defined as in formula (2). (2 where h is the probability parameter of the Negative Binomial distribution. The PMF of the Gaussian distribution is denoted as in formula (3).
where l represents the mean of all nonzeros and r is the variance of all nonzeros.
Based on the three distributions, we define the final mixed distribution as in formula (4).To initialize the parameters, we perform the Expectation Maximization (EM) algorithm that would effectively estimate the parameters of a complex statistical model by iteratively optimizing the likelihood function.
where k is the preclusters label, and x k i is the expression of the j-th gene in the i-th cell belonging to the k-th cell cluster.h i is the proportion of zeros in all values./ k i , k k i , l k i , and r 2k i also denote the corresponding values in the k-th cell cluster, respectively.
Then, based on the final mixed distribution, the dropout probability of the j-th gene in the k-th cluster d k i is defined as in formula (5).Finally, a dropout probability matrix M p is constructed by merging the k dropout probability submatrices.

Adaptive estimation of the number of dropout events based on a dynamic threshold estimation mechanism
The number of dropout events is commonly set as a uniform threshold in all cells, failing to account for the differences between individual cells.To solve this problem, we develop a dynamic threshold estimation mechanism.Inspired by scRecover method (Miao et al., 2019), we assume that adjacent genes with considerable variation in their dropout probabilities would be an indication of dropout events and try to adaptively estimate the dropout events by capturing these signals.First, genes are sorted in ascending order for every cell based on the dropout probability in the matrix M p , resulting in a row-sorted matrix M 0 p .Then, for each row in matrix M 0 p , the differences between the dropout probabilities of adjacent elements (genes) are calculated, and a dropout probability difference matrix M d is defined as in formula (6).
Where M p i;j ð Þ denotes the dropout probability of the j-th gene in the i-th cell.
By considering the maximum dropout probability difference and the number of corresponding dropout events, we define the threshold of the number of dropout events for different cells.For the i-th cell in the matrix M d , the maximum dropout probability difference is the maximum value in the i-th row.For n cells, let where N i denote the vector of the number of elements in each row i with an index less than j, supposing the j-th element in the i-th row corresponds to max Then, the top t maximum values in V max values are retained, and the mean of the corresponding values in V num is calculated, denoted as m t .m t is the threshold for determining the number of dropout events.
To estimate the number of dropout events for each cell precisely, a regression strategy is developed to adjust V num i ½ � dynamically.First, the difference between V num i ½ � and m t is calculated.Then, V num i ½ � is adjusted to approach m t by increasing it if it is less than m t or decreasing it otherwise.So, an updated V 0 num i ½ � value is achieved, defined in formula (7).Based on the updated V 0 num i ½ �, the corresponding dropout probability is calculated for cell i in the dropout probability matrix M p , and the dropout events are also identified in the gene expression matrix M, separately.V where x is a regulation coefficient with an initial value of 1.For a scRNA-seq dataset, we select the top 4000 high variable genes by using Scanpy, and the identified dropout events of these genes in each cell will be imputed.

Imputation for dropout events based on a hybrid Autoencod-er-GAN
A hybrid Autoencoder-GAN module is constructed to impute the identified dropout events.To consider the gene expression distribution, Autoencoder is used to learn a lowdimensional embedding space and GAN is used to generate the dropout imputation.To consider the similar cells, the k Leidon are input respectively, and k Autoencoder-GAN is obtained.

Training strategy
A mean squared error (MSE) loss function is used to minimize the difference between the input data of the encoder and the reconstructed data of the decoder, as defined in formula (8).
where n is the number of samples.
In the GAN layer, the gene expression vector of each cell is transformed into a 100 � 100 matrix as input.Then, the generator generates synthetic data similar to the true data and the discriminator distinguishes between generated and true data.Adversarial training is performed using a mini-max objective function shown in formula ( 9), and stops after 200 epochs.Finally, the output matrix of the GAN is reshaped back into the gene expression matrix of cells.The k matrices reconstructed by the Autoencoder-GAN are combined into a new matrix.
where the random variable x follows the distribution P data of the true data, and the random variable z follows the distribution p z of the generated data.

Imputation for dropout events
By considering the more similar cells, the k-means method is used to split preclusters larger than 400 cells into subclusters M s .In each subcluster, a smooth interpolation is implemented to impute dropout events, that is, the dropout events are imputed by calculating the mean gene expression value across all cells.

Datasets
To test the ability of the AGImpute to impute dropout events, we applied AGImpute and other methods in two simulated datasets and seven scRNA-seq datasets.Two simulated datasets are generated by the Splatter R Package (Zappia et al., 2017), including dropout data with a 60% dropout rate and the corresponding true data with a 0% dropout rate, both of them contained 20 000 genes, 1000 cells, and five cell types.
The seven scRNA-seq datasets are downloaded from NCBI Gene Expression Omnibus (GEO) and summarized in Table 1.

Results
AGImpute is compared to seven state-of-the-art imputation methods, including ENHANCE, MAGIC, SAVER, scImpute, scIGANs, scLRTC, and VIPER.We first test the influence of selecting different numbers of highly variable genes in AGImpute, relating to the number of imputations.Then, we compare the performance of downstream analysis of the eight imputation methods, including clustering performance, identifying cell-specific marker genes, and inferring trajectory.

Analyzing the influence of selecting different numbers of highly variable genes
By selecting different numbers of highly variable genes in the Deng dataset, specifically 2000, 4000, 6000, 8000, and 10 000, we test the differences in the number of dropout imputations, coefficient of variation (CV), and clustering performance, shown in Fig. 2. From Fig. 2a, the number of dropout imputations increases linearly with the number of highly variable genes.The number of dropout imputations with 10 000 highly variable genes is 4 times higher than that with 2000 highly variable genes.Figure 2b provides the CV between the raw data and the AGImpute-imputed data with different numbers of highly variable genes.From Fig. 2b, there are minimal dispersion differences before and after imputation for the first 125 cells, and gradual differences for the latter half.When the number of highly variable genes is 2000 or 4000, CV is more stable.Figure 2c presents the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) of clustering results.It shows that there is a minimal influence on clustering performance for different numbers of dropout imputations.The best NMI and ARI are obtained when the number of highly variable genes is 4000.Therefore, we select the first 4000 highly variable genes to minimize the number of dropout imputations.

Comparing the similarity between the imputed data using different imputation methods and the raw data
To evaluate the ability to impute dropout events without excessive imputation, AGImpute is compared to seven imputation methods in terms of the number of dropout imputation, the difference between the raw data and the imputed data in terms of value and distribution, shown in Fig. 3.
Figure 3a shows the number of dropout imputations generated by the eight methods on five scRNA-seq datasets.AGImpute achieves significantly fewer dropout imputations than other methods, and only 3.4% of the dropout imputations are generated by scImpute.Figure 3b provides the gene expression density plots of eight imputation methods on the Adam dataset.The dots in the plots represent cells, the x-axis indicates the mean expression level of all genes in the cell before imputation, and the y-axis indicates the value after imputation.It visualizes the distribution of cells and describes the difference between the raw data and the imputed data.To quantify the differences, the Pearson correlation coefficient is calculated.In the gene expression density plots of AGImpute, cells distribute along the 45 � line showing a minimal difference from the raw data.In other methods, scattered distributions are exhibited which indicates a greater difference from the raw data.Moreover, the Pearson correlation coefficient also validates this finding.Therefore, AGImpute imputes the few dropout events while preserving more original information.

Comparing the clustering performance of different imputation methods
To evaluate the effect of dropout imputation on clustering performance, eight imputation methods are applied with hierarchical clustering on five scRNA-seq datasets.
Clustering performance is evaluated in terms of ARI, NMI, and Jaccard index.A higher value closer to 1 indicates better clustering performance.The results are shown in Fig. 4a, in  AGImpute which bar graphs of mean and variance are presented in the left column, while a heatmap of clustering metrics is shown in the right column.In the left columns, AGImpute, ENHANCE, and scLRTC show the best clustering performance, while scIGANs performs poorly.In the right column, AGImpute maintains stable clustering performance on all five datasets.However, we find that AGImpute has no obvious advantage in clustering performance compared with ENHANCE and scLRTC.The reason is that AGImpute has little impact on clustering performance due to its small amount of imputation and small impact on data distribution.Further, to observe the cell type identification, Uniform Manifold Approximation and Projection (UMAP) is performed on the Adam dataset, which is shown in Fig. 4b.The results of AGImpute, SAVER, and scLRTC are similar to the raw data, while the other four methods confuse distinguishing cell types.Therefore, it is found that AGImpute improved clustering performance without altering the cell types of the raw data.

Analyzing the cell-specific marker genes
The cell-specific marker genes of the raw data and AGImpute-imputed data on the Cell Type dataset are shown in Fig. 5, respectively.Figure 5a provides the visualization results based on UMAP.We observe the expression changes in three cellspecific marker genes including Ret, Aqp4, and Foxi1, before and after imputation.The mutations or rearrangements of the Ret gene promote tumor formation, while the Aqp4 gene plays an essential role in brain water balance, astrocyte migration, nerve signal transmission, and neuroinflammation.The specific function of Foxi1 is unclear, but it may play an important role in cochleo-vestibular development and embryonic development.In the raw data, the expression of the Ret gene is low, the expression of the Aqp4 gene is relatively high, and the expression of the Foxi1 gene is moderate.In contrast, AGImpute-imputed data enhance the expression of the three genes significantly for improved cell type identification.Figure 5b shows the gene expression for H1 and DEC

Figure 1 .
Figure 1.The workflow of AGImpute.(a) The input of gene expression matrix M. (b) Preclustering of matrix M into k submatrices using Leiden.(c) Design of dropout identification module with dropout probability matrix M p and dynamic threshold estimation.(d) Selection of final identified dropout events as an intersection of the top 4000 highly variable genes and identified dropout events.(e) Construction of Autoencoder-GAN module to reconstruct k submatrices and generate merged matrix M g .(f) Smoothing of identified dropout events by splitting preclusters with more than 400 cells into subclusters M s by k-means, and calculation of mean for each subcluster.(g) Generation of an imputed matrix as output.

Figure 2 .
Figure 2. Influence of highly variable gene on dropout imputation in AGImpute.(a) Histogram of the number of dropout imputations.(b) Coefficient of variation between the raw data and the AGImpute imputed data.(c) Boxplots of Adjusted Rand Index and Normalized Mutual Information of clustering results on AGImpute-imputed data.

Figure 3 .
Figure 3.Comparison of imputation performance between AGImpute and other methods.(a) Histogram of the number of dropout imputed by different imputation methods.(b) Boxplots of the proportion of zeros in true data, dropout data, and imputed data on the simulated dataset.(c) Boxplots of the proportion of zeros, mean, and standard deviation of gene expression values on the real Adam dataset.(d) Gene expression density plot using eight interpolation methods on the Adam dataset, demonstrating differences between imputed and raw data.

Table 1 .
The list of seven real scRNA-seq datasets.