scVIC: deep generative modeling of heterogeneity for scRNA-seq data

Abstract Motivation Single-cell RNA sequencing (scRNA-seq) has become a valuable tool for studying cellular heterogeneity. However, the analysis of scRNA-seq data is challenging because of inherent noise and technical variability. Existing methods often struggle to simultaneously explore heterogeneity across cells, handle dropout events, and account for batch effects. These drawbacks call for a robust and comprehensive method that can address these challenges and provide accurate insights into heterogeneity at the single-cell level. Results In this study, we introduce scVIC, an algorithm that leverages variational inference to simultaneously handle biological heterogeneity and batch effects at the single-cell level. scVIC explicitly models both biological heterogeneity and technical variability to learn cellular heterogeneity in a manner free from dropout events and the bias of batch effects. By leveraging variational inference, we provide a robust framework for inferring the parameters of scVIC. To test the performance of scVIC, we employed both simulated and biological scRNA-seq datasets, with and without batch effects. scVIC was found to outperform other approaches because of its superior clustering ability and circumvention of the batch effects problem. Availability and implementation The code of scVIC and replication materials for this study are available at https://github.com/HiBearME/scVIC/tree/v1.0.


Introduction
Cells are the fundamental units of multicellular organisms, and each cell plays a unique biological role in growth and development. In large part, the heterogeneity among cells arises from differences in the expression of genetic information, such as the transcriptome; therefore, the analysis of differential expression is an important step in determining the role of single cells. Recently, there has been an upswing in the development of tools aimed at the analysis of high-throughput single-cell RNA sequencing (scRNA-seq) data. scRNA-seq allows the collection of massive transcriptional expression at single-cell resolution, making it an ideal tool to study cell-cell heterogeneity in developmental biology, oncology, and immunology. However, technical difficulties arise from inefficient RNA capture and the deep sequencing required to capture low-abundance transcripts in single cells, which cause frequent dropout events, making transcriptional expression noisy and sparse (Hebenstreit 2012). Another source of technical variability is the bias of batch effects, which has received increased attention as transcriptome sources become more diverse; batch effects can confound biological heterogeneity (Hicks et al. 2018). Failure to eliminate these two technical influences inevitably complicates subsequent analysis and leads to misinterpretation of the resultant heterogeneity.
An essential step in exploring cellular heterogeneity is clustering cells into subpopulations. To date, numerous clustering methods for scRNA-seq data have been proposed. To handle dropout events, CIDR (clustering through imputation and dimensionality reduction) first imputes gene expression profiles and then performs hierarchical clustering on the main coordinates of its dissimilarity matrix through principal coordinate analysis (Lin et al. 2017). SIMLR (single-cell interpretation via multi-kernel learning) first learns a similarity measure from scRNA-seq data via multiple kernel learning, applying a designed diffusion-based step to reduce the effects of noise and dropout events; spectral clustering is then performed on the similarity measure (Wang et al. 2017). Rare cell type identification (RaceID) was developed to reveal rare cell types in complex populations of single cells, originally by k-means clustering and, more recently, in an improved version, by k-medoids clustering on the similarity matrix (Grün et al. 2016). Semisoft clustering with pure cells (SOUP) is an algorithm that reveals the clustering structure of both pure and transitional cells; it first identifies pure cells from an expression similarity matrix and then applies k-means clustering to the pure cells (Zhu et al. 2019). However, neither RaceID nor SOUP explicitly models the technical variability that results from dropout events; both learn a similarity representation from the original noisy data matrix, which is no longer accurate. Moreover, these four methods separate similarity representation from subsequent clustering, which often leads to spurious heterogeneity. Within the recently popular framework of neural networks (NNs), scDeepCluster provides a deep count autoencoder algorithm (Eraslan et al. 2019) that nonlinearly reduces the high dimensionality of the original data to a low-dimensional latent space while reconstructing the denoised data; this is combined with a deep embedding clustering algorithm (Xie et al. 2016) designed to simultaneously cluster the compressed data residing in the latent space (Tian et al. 2019). scziDesk improves upon scDeepCluster in that it learns a more cluster-friendly latent space by gradually strengthening the affinity within highly similar compressed data (Chen et al. 2020). Recently, graph neural networks have also been successfully applied to the unsupervised clustering of scRNA-seq data (Wang et al. 2021, Gu et al. 2022). Lei et al. (2023) specifically discussed the above methods and concluded that they all have room for improvement.
Apart from dropout events, scRNA-seq is challenged by data that may originate from multiple batches. Absent proper accounting, batch effects can be misinterpreted as true biological signals. The widespread application of scRNA-seq brings new computational challenges for integrating datasets from different batches and platforms as increasingly complex datasets are produced (Yu et al. 2023). The clustering algorithms mentioned above do not consider the bias caused by batch effects; indeed, for datasets with strong batch effects, the authors of these algorithms recommend that researchers first use other existing algorithms to remove batch effects. For example, to address batch effects, the single-cell data analysis tool Seurat (Stuart et al. 2019) first exploits canonical correlation analysis (CCA) (Butler et al. 2018) to analyze the correlation between two batches of single-cell transcriptome data, and then exploits mutual nearest neighbors (MNN) (Haghverdi et al. 2018) to match cells between the two batches. Based on neural networks, scVI (single-cell variational inference) (Lopez et al. 2018) integrates measured expression values to approximate the true posterior distribution of variables in the latent space, while simultaneously capturing the bias caused by dropout events and batch effects.
However, the algorithms mentioned above typically target either batch effects removal or clustering alone. For example, scVI does not model cell heterogeneity in a built-in manner; to further explore heterogeneity, scVI suggests applying classical clustering algorithms to the inferred latent variables. To model heterogeneity explicitly, the Gaussian mixture variational autoencoder (GMVAE) in scVAE incorporates a categorical latent random variable into the traditional probabilistic graph for single-cell gene expression data in the context of the variational autoencoder (VAE) (Grønbech et al. 2020). It uses two additional neural networks to deduce the assumed conditional prior distribution given different categories and the variational approximate conditional posterior distribution given measured expression values. However, GMVAE does not consider batch effects. For simultaneous batch effects removal and clustering of scRNA-seq data, specific clustering methods have been proposed. For instance, deep embedding for single-cell clustering (DESC) removes batch effects in single-cell analysis by employing neural networks to iteratively optimize a clustering objective function on scRNA-seq data; as long as the technical variability between batches is smaller than the true biological heterogeneity, DESC can gradually eliminate batch effects (Li et al. 2020). scDEC exploits generative adversarial networks (GANs) to simultaneously learn the latent representation and infer cell labels, and can be extended to the integrative analysis of multi-batch scRNA-seq data (Liu et al. 2021). Single-cell multi-omics deep clustering (scMDC) employs a multimodal autoencoder to jointly learn latent features of the deep embedding for clustering analysis across different data sources (Jovic et al. 2022).
Herein, we introduce deep generative modeling of heterogeneity for scRNA-seq data (scVIC), which incorporates a Gaussian mixture model (GMM) under the framework of the VAE. We also propose a specific coordinate descent algorithm to optimize all parameters, including those of the VAE and the GMM, to make training robust. scVIC explicitly models both biological heterogeneity and the technical variability derived from dropout events and the bias of batch effects; therefore, the inferred heterogeneity is free from technical variability.

scVIC
We assume that the differential expression of single cells originates from latent variables that exhibit heterogeneity, and with that assumption, we construct a neural network to learn the generative process from these latent variables to the posterior distribution of gene expression. The neural network architecture of scVIC within the framework of a VAE is schematically illustrated in Fig. 1A. The probabilistic decoder on the right corresponds to the generative process of single-cell gene expression, which is triggered by two latent variables through probabilistic decoding. These two unmeasured latent variables are the scaling factor and the intrinsic cell representation. The scaling factor represents technical variability caused, e.g. by capture efficiency and sequencing depth. The intrinsic cell representation, i.e. a vector representation of the inherent properties of single cells, is assumed to be the biological factor underlying differential expression among cells. Both factors are assumed to follow Gaussian-based distributions: the scaling factor is assumed to follow a one-dimensional normal distribution, whereas the representation is assumed to follow a mixture of Gaussian distributions, thus describing the inherent biological heterogeneity of single cells. Using the probabilistic decoder corresponding to these two factors, we obtain specific parametric values of the gene expression distribution, which we specify as the classical zero-inflated negative binomial (ZINB) distribution. The probabilistic encoder on the left corresponds to the inference of the conditional (variational) posterior distribution of the two latent variables given the measured gene expression. The measured gene expression is fed into the neural network corresponding to the probabilistic encoder to obtain the specific distribution parameters of the assumed variational posterior distribution, which is used to approximate the true posterior distribution of the two latent variables. If the expression comes from different experimental batches, batch effects removal is necessary; therefore, we also feed the batch annotation into the neural network to model the batch effects.
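To make this architecture concrete, the following is a minimal PyTorch sketch of the encoder/decoder pair described above. It is illustrative only: the layer sizes, activation choices, and all names are our assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Maps expression x and one-hot batch s to the variational posterior
    parameters of z (cell representation) and t (scaling factor)."""
    def __init__(self, n_genes, n_batches, n_latent=10, n_hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_genes + n_batches, n_hidden), nn.ReLU())
        self.z_mu, self.z_logvar = nn.Linear(n_hidden, n_latent), nn.Linear(n_hidden, n_latent)
        self.t_mu, self.t_logvar = nn.Linear(n_hidden, 1), nn.Linear(n_hidden, 1)

    def forward(self, x, s):
        h = self.body(torch.cat([torch.log1p(x), s], dim=-1))
        return self.z_mu(h), self.z_logvar(h), self.t_mu(h), self.t_logvar(h)

class Decoder(nn.Module):
    """Maps (z, s) to ZINB parameters: per-gene mean proportions rho
    (softmax over genes, summing to 1) and zero-inflation logits alpha."""
    def __init__(self, n_genes, n_batches, n_latent=10, n_hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_latent + n_batches, n_hidden), nn.ReLU())
        self.rho, self.alpha = nn.Linear(n_hidden, n_genes), nn.Linear(n_hidden, n_genes)

    def forward(self, z, s):
        h = self.body(torch.cat([z, s], dim=-1))
        return F.softmax(self.rho(h), dim=-1), self.alpha(h)
```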
The optimization of scVIC consists of two parts. One part is conducted under the framework of variational inference and stochastic optimization, while the other is conducted under the framework of expectation-maximization (EM). Together, these two parts form our final optimization algorithm based on coordinate descent.

Probabilistic model
First, we present the assumed probabilistic generative process of single-cell gene expression in the scVIC model, along with the corresponding probabilistic graphical model shown in Fig. 1B. Each observed expression value $x_{ng}$ is independently drawn from the following process:

$$z_n \sim \textstyle\sum_{c=1}^{C} \pi_c\, \mathrm{Normal}(\mu_c, I) \quad (1)$$
$$t_n \sim \mathrm{LogNormal}\big(l_\mu^{\top} s_n,\; l_\sigma^{\top} s_n\big) \quad (2)$$
$$\rho_n = f_{\rho|z,s}(z_n, s_n) \quad (3)$$
$$w_{ng} \sim \mathrm{Gamma}\big(\rho_n^{g}, \theta_g\big) \quad (4)$$
$$y_{ng} \sim \mathrm{Poisson}(t_n w_{ng}) \quad (5)$$
$$\alpha_n = f_{\alpha|z,s}(z_n, s_n) \quad (6)$$
$$h_{ng} \sim \mathrm{Bernoulli}\big(\alpha_n^{g}\big) \quad (7)$$
$$x_{ng} = \begin{cases} y_{ng}, & h_{ng} = 0 \\ 0, & h_{ng} = 1 \end{cases} \quad (8)$$

Equation (1) models the prior of the latent variable corresponding to the intrinsic biological properties of single cells. Equation (2) models the prior of the latent variable corresponding to the single-cell scaling factor, and Equations (3)-(8) represent the ZINB distribution involving the scaling factor, modeled through probabilistic neural networks that serve as the generative model in the VAE. By iteratively mapping and sampling through the neural networks, $x_{ng}$ can be generated from the two sampled latent variables, which is a probabilistic decoding process in the semantic context of variational autoencoders. In Equations (3) and (6), the variable $s_n$ refers to additional measured information about the $n$th cell, such as the batch annotation, which is also fed into the two neural networks. The variable $h_{ng}$ in Equation (7) is a binary variable indicating whether the $g$th gene expression of the $n$th cell drops out in the ZINB distribution. By incorporating these two variables, the probabilistic generative model assumed by scVIC effectively models the technical variability of dropout events and batch effects.
In a common VAE, the prior distribution of the latent variable $z$ is a standard multivariate Gaussian (Kingma and Welling 2014). To incorporate heterogeneity, we let $z$ follow a Gaussian mixture, while the covariance of each mixture component remains the identity matrix. Let $C$ denote the number of mixture components, and for the $c$th component let $\pi_c$ denote the mixture weight and $\mu_c$ the mean; the density function of the simplified GMM is then

$$p(z) = \sum_{c=1}^{C} \pi_c\, \mathrm{Normal}(z; \mu_c, I).$$

Our notation basically follows that of scVI (Lopez et al. 2018). $l_\mu, l_\sigma \in \mathbb{R}_+^B$ in Equation (2) parameterize the prior distribution of the scaling factor, where $B$ denotes the number of batches, and their entries $\bar{\mu}_t$ and $\bar{\sigma}_t$ are set to the empirical mean and variance of the log-library size per batch, respectively. The parameter $\theta_g \in \mathbb{R}_+$ in Equation (4) denotes the gene-specific inverse dispersion, and $\rho_n^g$ denotes the mean of the negative binomial component of the ZINB distribution. $f_{\rho|z,s}$ in Equation (3) and $f_{\alpha|z,s}$ in Equation (6) are neural networks mapping the latent space and batch annotation to the dimension of genes: $\mathbb{R}^d \times \{0,1\}^B \to \mathbb{R}^G$. To avoid ambiguity, the learnable parameters of the neural networks are omitted in the notation. Superscript annotation ($\rho_n^g$, $\alpha_n^g$) refers to the single entry corresponding to a specific gene $g$. The network $f_{\rho|z,s}$ decodes the mean proportional transcriptional expression across genes, whose sum over the gene dimension equals 1; the intermediate quantity $t_n w_{ng}$ can therefore be interpreted as the expected transcriptional expression of gene $g$. The network $f_{\alpha|z,s}$ decodes the frequency of technical dropout events occurring at a particular entry (Pierson and Yau 2015).
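As a concreteness check on Equations (1)-(8), the following NumPy sketch draws one cell from this generative process. The linear stand-ins for $f_{\rho|z,s}$ and $f_{\alpha|z,s}$, the omission of the batch covariate, and all parameter values are illustrative assumptions, not the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
C, d, G = 5, 10, 200                        # components, latent dim, genes
pi = np.full(C, 1.0 / C)                    # mixture weights, Equation (1)
mu = rng.normal(size=(C, d))                # component means, Equation (1)
l_mu, l_sigma = 8.0, 0.5                    # log-library prior, Equation (2)
theta = np.full(G, 5.0)                     # inverse dispersions, Equation (4)
W_rho, W_alpha = rng.normal(size=(d, G)), rng.normal(size=(d, G))

c = rng.choice(C, p=pi)                     # pick a mixture component
z = rng.normal(mu[c], 1.0)                  # z_n ~ Normal(mu_c, I), Equation (1)
t = np.exp(rng.normal(l_mu, l_sigma))       # scaling factor t_n, Equation (2)

logits = z @ W_rho                          # linear stand-in for f_{rho|z,s}, Eq. (3)
rho = np.exp(logits - logits.max())
rho /= rho.sum()                            # proportions sum to 1 over genes
alpha = 1.0 / (1.0 + np.exp(-(z @ W_alpha - 2.0)))  # stand-in for f_{alpha|z,s}, Eq. (6)

w = rng.gamma(shape=theta, scale=rho / theta)   # w_ng ~ Gamma with mean rho, Eq. (4)
y = rng.poisson(t * w)                      # y_ng ~ Poisson(t_n w_ng), Eq. (5)
h = rng.binomial(1, alpha)                  # dropout indicators h_ng, Eq. (7)
x = np.where(h == 1, 0, y)                  # observed counts x_ng, Eq. (8)
```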

Inference via coordinate descent
Given the prior generative process of $x_{ng}$ and the observed transcriptional expression matrix $X$, what we want is posterior inference. However, the true posterior density is intractable in our case because the denominator in Bayes' rule, $p(x_n \mid s_n)$ (the marginal distribution integrated over all latent variables), is intractable. Note that the latent variables $w_{ng}$, $h_{ng}$, and $y_{ng}$ can be integrated out because $p(x \mid z_n, t_n, s_n)$ has a closed-form density, which is in fact a ZINB distribution (Grün et al. 2014). We denote this zero-inflated negative binomial distribution based on a neural network with a scaling factor as the ZINB-NN distribution (see details in Supplementary Note 1). The ZINB-NN has three parameters: the mean of the scaled negative binomial distribution $\mu_n^g = t_n f^g_{\rho|z,s}(z_n, s_n)$, the probability of zero inflation $\alpha_n^g = f^g_{\alpha|z,s}(z_n, s_n)$, and the inverse dispersion parameter $\theta_g$ for each gene.
Then, variational inference (Blei et al. 2017) is exploited to approximate the posterior $p(z_n, t_n \mid x_n, s_n)$ (see details in Supplementary Note 2). Under the mean-field assumption, the variational lower bound is

$$\mathcal{L} = \mathbb{E}_{q(z,t \mid x_n, s_n)}\big[\log p(x_n \mid z, t, s_n)\big] - \mathrm{KL}\big(q(z \mid x_n, s_n)\,\|\,p(z)\big) - \mathrm{KL}\big(q(t \mid x_n, s_n)\,\|\,p(t)\big).$$

In this objective function, we divide the trainable parameters into two parts. The first part consists of the dispersion parameter $\theta_g$ for each gene and the parameters of the neural networks $f_{\rho|z,s}$, $f_{\alpha|z,s}$, $f_{[\mu_z,\sigma_z]|x,s}$, and $f_{[\mu_t,\sigma_t]|x,s}$. The second part consists of the parameters $\mu_c$ and $\pi_c$. When the parameters of the second part are fixed, the parameters of the first part are optimized directly in a variational Bayesian inference fashion, and the corresponding probabilistic graph is shown in Fig. 1C. Conversely, when the parameters of the first part are fixed, the parameters of the second part can be optimized within the framework of EM, and the corresponding probabilistic graph is shown in Fig. 1D. Therefore, coordinate descent on the two blocks is exploited to optimize all trainable parameters.
When the parameters of the second part are fixed, we further unfold the GMM by introducing an additional categorical latent random variable:

$$c_n \sim \mathrm{Categorical}(\pi) \quad (9)$$
$$z_n \sim \mathrm{Normal}(\mu_{c_n}, I) \quad (10)$$

Equation (9) denotes the categorical distribution of the component latent variable, and Equation (10) denotes the normal distribution corresponding to each category. We then obtain another variational lower bound:

$$\mathcal{L}' = \mathbb{E}_{q(z,t \mid x_n, s_n)}\big[\log p(x_n \mid z, t, s_n)\big] - \mathrm{KL}\big(q(t \mid x_n, s_n)\,\|\,p(t)\big) - \mathbb{E}_{q(z \mid x_n, s_n)}\Big[\mathrm{KL}\big(q(c \mid z)\,\|\,p(c)\big) + \log q(z \mid x_n, s_n) - \mathbb{E}_{q(c \mid z)}\big[\log p(z \mid c)\big]\Big].$$

When the variational distribution is sufficiently close to the true posterior, both bounds converge to the marginal likelihood; we therefore optimize the lower bound $\mathcal{L}'$. The specific form of $p(x_n \mid z, t, s_n)$ is given by Supplementary Equation (S1). The assumptions for $q(z, t \mid x_n, s_n)$, $q(z \mid x_n, s_n)$, and $q(t \mid x_n, s_n)$ are given by Supplementary Equations (S2), (S3), and (S4), respectively. $p(t)$ is given by Equation (2), and $p(c)$ and $p(z \mid c)$ are given by Equations (9) and (10). The closed form of $\mathrm{KL}(q(t \mid x_n, s_n)\,\|\,p(t))$ (see details in Supplementary Note 3) is given by Supplementary Equation (S5).
The distribution of the categorical variable $c$ is discrete, so $\mathrm{KL}(q(c \mid z_n)\,\|\,p(c))$ can be computed by directly summing over the categories, and the closed form of $q(c \mid z)$ is

$$q(c \mid z) = \frac{\pi_c\, \mathrm{Normal}(z; \mu_c, I)}{\sum_{c'=1}^{C} \pi_{c'}\, \mathrm{Normal}(z; \mu_{c'}, I)}.$$

The parameters of the first part can then be optimized in the framework of variational inference and stochastic optimization, using the reparameterization and mini-batch tricks employed in VAEs. After computing the gradient of the variational lower bound $\mathcal{L}'$ with respect to the first set of parameters via backpropagation, we use the gradient-based optimizer Adam to update them. This optimization is based on a neural network framework, which naturally allows parallel computation.
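The closed form above is cheap to evaluate; a minimal, numerically stable sketch (assuming identity covariances, with names of our choosing) is:

```python
import numpy as np

def q_c_given_z(z, pi, mu):
    """Closed-form q(c|z) for a GMM with identity covariances.
    z: (d,), pi: (C,), mu: (C, d). Returns responsibilities of shape (C,)."""
    log_w = np.log(pi) - 0.5 * ((z - mu) ** 2).sum(axis=1)  # up to a shared constant
    log_w -= log_w.max()                                    # log-sum-exp stabilization
    w = np.exp(log_w)
    return w / w.sum()
```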
When the parameters of the first part are fixed, maximization of the variational lower bound $\mathcal{L}'$ reduces to maximizing

$$\mathcal{L}'' = \sum_{n=1}^{N} \frac{1}{L} \sum_{l=1}^{L} \log \sum_{c=1}^{C} \pi_c\, \mathrm{Normal}(z_{nl}; \mu_c, I),$$

where $z_{nl}$ is the $l$th Monte Carlo sample of the latent variable $z$ for the $n$th cell. Since Monte Carlo sampling is completed upon optimization of the parameters of the first part, maximizing $\mathcal{L}''$ equals maximum likelihood estimation of a GMM on the Monte Carlo samples drawn from the variational approximate posterior of the latent variable $z$, which can be completed by classic EM. Specifically, in the $i$th optimization iteration, given the $l$th sampling point of the $n$th cell and the $c$th component, the expectation step computes the responsibility

$$\gamma_{nlc}^{(i)} = \frac{\pi_c^{(i-1)}\, \mathrm{Normal}\big(z_{nl}; \mu_c^{(i-1)}, I\big)}{\sum_{c'=1}^{C} \pi_{c'}^{(i-1)}\, \mathrm{Normal}\big(z_{nl}; \mu_{c'}^{(i-1)}, I\big)}.$$

Correspondingly, the maximization step of EM, which optimizes the parameters of the Gaussian mixture, is

$$\pi_c^{(i)} = \frac{1}{NL} \sum_{n=1}^{N} \sum_{l=1}^{L} \gamma_{nlc}^{(i)}, \qquad \mu_c^{(i)} = \frac{\sum_{n=1}^{N} \sum_{l=1}^{L} \gamma_{nlc}^{(i)} z_{nl}}{\sum_{n=1}^{N} \sum_{l=1}^{L} \gamma_{nlc}^{(i)}}.$$

EM is carried out in the low-dimensional space of the latent variable $z_n$, which in turn makes it easy to perform parallel computations.
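The E- and M-steps above are standard EM for a GMM with identity covariances, applied to the pooled Monte Carlo samples; a vectorized sketch of one iteration, under those assumptions, is:

```python
import numpy as np

def em_step(Z, pi, mu):
    """One EM iteration. Z: (N*L, d) pooled samples, pi: (C,), mu: (C, d)."""
    # E-step: responsibilities gamma_{nc} proportional to pi_c * Normal(z_n; mu_c, I)
    d2 = ((Z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # squared distances (N*L, C)
    log_r = np.log(pi)[None, :] - 0.5 * d2                 # log-responsibilities, up to a constant
    log_r -= log_r.max(axis=1, keepdims=True)              # stabilize before exponentiating
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: update mixture weights and component means
    Nc = r.sum(axis=0)
    pi_new = Nc / Z.shape[0]
    mu_new = (r.T @ Z) / Nc[:, None]
    return pi_new, mu_new
```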
We integrate the two optimization steps and use iterative coordinate descent to infer all parameters of scVIC. Algorithm 1 provides a detailed description.
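For orientation, the overall coordinate descent can be summarized in the following Python sketch. Here `elbo` and `sample_latents` are hypothetical helpers standing in for the Monte Carlo estimator of $\mathcal{L}'$ and for drawing samples from $q(z \mid x, s)$; the structure is our paraphrase of Algorithm 1, not the released code:

```python
import torch

def train_scvic(encoder, decoder, theta, loader, pi, mu, n_epochs=400, lr=1e-3):
    params = list(encoder.parameters()) + list(decoder.parameters()) + [theta]
    opt = torch.optim.Adam(params, lr=lr)
    for epoch in range(n_epochs):
        # Block 1: stochastic variational updates with (pi, mu) fixed.
        for x, s in loader:
            opt.zero_grad()
            loss = -elbo(x, s, encoder, decoder, theta, pi, mu)  # hypothetical helper
            loss.backward()
            opt.step()
        # Block 2: EM updates of (pi, mu) on Monte Carlo latent samples,
        # with the neural network parameters and theta fixed.
        Z = sample_latents(encoder, loader)  # hypothetical helper
        pi, mu = em_step(Z, pi, mu)          # from the EM sketch above
    return pi, mu
```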
Hyper-parameters of scVIC are set to the default values of scVI: the dimensionality of z is 10, the number of Monte Carlo samples L is 1, the training batch size N_b is 128, the maximum number of epochs N_e is 400, and the learning rate is 1e−3.

Clustering
After the parametric optimization based on coordinate descent is completed, the transcriptomic expression $x_n$ can be probabilistically encoded into the latent variables through the learned neural network. The distribution parameters of the latent variable $z_n$ are determined by the trained neural network $f_{[\mu_z,\sigma_z]|x,s}$; therefore, the expectation $\mu_{z_n}$ can be regarded as a deterministic encoding of $x_n$. Then, among the probabilities computed by $q(c \mid \mu_{z_n})$, the category with the maximum value is assigned as the label of $x_n$. This clustering process is a built-in part of scVIC and does not require other clustering algorithms. This, however, does not preclude the use of other clustering algorithms on the deterministic encoding $\mu_{z_n}$. Compared with other dimensionality reduction and clustering algorithms, clustering in the scVIC latent space can improve clustering performance essentially because the latent variables obtained by scVIC account for heterogeneity. We chose the popular community detection algorithm Louvain (Blondel et al. 2008) as the default clustering algorithm. Therefore, to distinguish the source of labels, we use "scVIC-Louvain" to denote the application of the Louvain algorithm to the encoding of scVIC.
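A sketch of both labeling routes, the built-in argmax of $q(c \mid z)$ at $\mu_z$ and scVIC-Louvain, might look as follows. The function names are our assumptions, `q_c_given_z` is the sketch from the inference section, and only the scanpy calls are a real API:

```python
import numpy as np
import scanpy as sc
import anndata as ad

def assign_builtin(Mu_z, pi, mu):
    """Built-in labels: argmax_c q(c | mu_z_n) for each encoded cell."""
    return np.array([np.argmax(q_c_given_z(m, pi, mu)) for m in Mu_z])

def assign_louvain(Mu_z):
    """scVIC-Louvain: Louvain community detection on the encoded cells."""
    adata = ad.AnnData(Mu_z)
    sc.pp.neighbors(adata, use_rep="X")  # KNN graph on the latent space
    sc.tl.louvain(adata)
    return adata.obs["louvain"].to_numpy()
```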

Datasets
We evaluated scVIC using five real biological scRNA-seq datasets and one simulated dataset. We selected datasets with >10^4 cells, including three datasets without batch effects (TRACHEA, TURTLE, and BACH) and two datasets with batch effects (PBMC and RETINA).
The TRACHEA dataset (Schaum et al. 2018) consists of gene expression data from 11 269 mouse tracheal tissue cells obtained using 10× Genomics sequencing technology. The dataset includes annotations for five biological cell types provided by the original authors.
The TURTLE dataset (Tosches et al. 2018) consists of gene expression data from 18 664 cells from the turtle dorsal cortex. The dataset includes biological annotations for 15 cell types provided by the original authors.
The BACH dataset (Bach et al. 2017) consists of gene expression data from 23 184 epithelial cells. The dataset includes annotations for eight major biological cell types provided by the original authors.
The PBMC dataset consists of scRNA-seq data from two batches of peripheral blood mononuclear cells (PBMCs) from a healthy donor (4K PBMCs and 8K PBMCs) (Zheng et al. 2017). Quality control metrics were derived using the R package cellrangerRkit (v.1.1.0) and extracted from the CellRanger molecule-specific information file. After filtering as in Cole et al. (2019), 12 039 cells with 10 310 sampled genes were extracted and used to obtain biologically meaningful clusters (Macosko et al. 2015). Genes that could not be matched with the bulk data used for differential expression were filtered out, after which we were left with G = 3346 genes.
The RETINA dataset consists of bipolar cells from Shekhar et al. (2016); it contains 27 499 cells and 13 166 genes from two batches after the authors' original filtering pipeline. Biological cluster annotations of 15 cell types from the authors were used.
For the simulations, we used the R package splatter (Zappia et al. 2017) to generate simulated datasets. Our experiments can be divided into two parts: one that does not consider the bias of batch effects and one that does. For the first type of simulated data, we compared scVIC, without its batch-effects component, against methods that are not equipped to remove batch effects. For this comparison, we designed two simulation scenarios, one involving cell clusters of the same size and the other involving cell clusters of different sizes, which we call "balanced experiments" and "imbalanced experiments," respectively. In the balanced experiments, we generated datasets with 3, 4, 5, 6, and 7 clusters. In the imbalanced experiments, we generated datasets fixed at five clusters, but with cluster sizes following a geometric sequence with ratio in {0.6, 0.7, 0.8, 0.9, 1.0}, where a smaller ratio means a larger difference in cluster size. For both, we set a dropout rate ranging from 5% to 25% (dropout.type = "experiment", dropout.shape = −1, de.facScale = 0.2, and dropout.mid ranging from −1.5 to 0.5, with other parameters set to default values) to simulate dropout events. Both the gene number and the cell number were fixed at 2500. For every scenario, we generated 10 datasets with different random seeds and inspected the average performance of the different methods by calculating their median ARI and NMI values. For the second type of simulated data, we considered that real transcriptome expression data may come from different batches; therefore, we included batch effects in the simulation. Similarly, we designed balanced and imbalanced scenarios with parameter settings identical to those without batch effects. For the simulation of batch effects, we fixed the dataset to three batches, whose sizes follow a geometric sequence with a ratio of 0.7.

Clustering metrics: adjusted rand index
Given two clustering results U and V, we calculate the following four quantities: (a) the number of object pairs placed in the same group in U and in the same group in V; (b) the number of object pairs placed in the same group in U and in different groups in V; (c) the number of object pairs placed in the same group in V and in different groups in U; and (d) the number of object pairs placed in different groups in U and in different groups in V. The adjusted Rand index (ARI) is then

$$\mathrm{ARI} = \frac{2(ad - bc)}{(a+b)(b+d) + (a+c)(c+d)}.$$
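In practice, ARI can be computed directly with scikit-learn; a quick example:

```python
from sklearn.metrics import adjusted_rand_score

U = [0, 0, 1, 1, 2, 2]            # one clustering
V = [1, 1, 0, 0, 2, 2]            # the same partition with labels permuted
print(adjusted_rand_score(U, V))  # 1.0: ARI is invariant to label permutations
```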

Clustering metrics: normalized mutual information
Given two clustering results U and V on a set of N data points, NMI is defined as $I(U, V)/\max(H(U), H(V))$, where $I(U, V)$ is the mutual information between U and V, and $H(U)$ is the entropy of the clustering result U:

$$I(U, V) = \sum_{p} \sum_{q} \frac{|U_p \cap V_q|}{N} \log \frac{N\,|U_p \cap V_q|}{|U_p|\,|V_q|},$$

where $|U_p|$ and $|V_q|$ denote the cardinality of the $p$th cluster in U and the $q$th cluster in V, respectively. The entropy of each cluster assignment is calculated as

$$H(U) = -\sum_{p} \frac{|U_p|}{N} \log \frac{|U_p|}{N}.$$
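scikit-learn's NMI with max normalization matches the definition above:

```python
from sklearn.metrics import normalized_mutual_info_score

U = [0, 0, 1, 1, 2, 2]
V = [0, 0, 1, 2, 2, 2]
print(normalized_mutual_info_score(U, V, average_method="max"))
```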

Divergence of batch mixing
We use KL divergence to evaluate the performance of various single-cell clustering algorithms for batch effects removal, i.e. the stochasticity with which cells from different batches mix within each cluster. The KL divergence of batch mixing for B different batches is calculated as

$$\mathrm{KL} = \sum_{b=1}^{B} q_b \log \frac{q_b}{p_b},$$

where $p_b$ is the proportion of cells from batch $b$ among all cells, and $q_b$ is the proportion of cells from batch $b$ in a given region based on the clustering algorithm's results. We calculate the KL divergence of batch mixing using the regional mixing KL divergence defined above at 50 randomly chosen cells from all batches. The regional proportion of cells from each batch is calculated based on the set of K nearest neighbor (KNN) cells of each randomly chosen cell (K = 50). The final KL divergence is then calculated as the average of the regional KL divergences. We repeated this procedure for 100 iterations with different randomly chosen cells and report the overall average.
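A sketch of this batch-mixing score, under our reading of the procedure above (scikit-learn's NearestNeighbors for the KNN step; the function name and defaults are ours):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def batch_mixing_kl(latent, batches, n_regions=50, k=50, n_iters=100, seed=0):
    """Average regional KL divergence of batch mixing (lower = better mixing)."""
    rng = np.random.default_rng(seed)
    labels = np.unique(batches)
    p = np.array([(batches == b).mean() for b in labels])  # global batch proportions
    nn = NearestNeighbors(n_neighbors=k).fit(latent)
    scores = []
    for _ in range(n_iters):
        idx = rng.choice(len(latent), size=n_regions, replace=False)
        _, neigh = nn.kneighbors(latent[idx])
        for row in neigh:
            q = np.array([(batches[row] == b).mean() for b in labels]) + 1e-12
            q /= q.sum()                                   # regional batch proportions
            scores.append(np.sum(q * np.log(q / p)))       # regional KL(q || p)
    return float(np.mean(scores))
```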

Heterogeneity of the simulation data
First, we compared the performance of all eight clustering algorithms on balanced and imbalanced simulated data without considering batch effects. As shown in Fig. 2A, scVIC achieved ARI scores of 0.93 and 0.89 on the balanced and imbalanced simulated data, respectively, while scVIC-Louvain further improved the ARI scores to 0.94 and 0.93, respectively, both significantly outperforming CIDR (0.21, 0.31), SOUP (0.56, 0.62), SIMLR (0.21, 0.23), and RaceID (0.17, 0.23). Among the other neural network-based algorithms, scziDesk, which also combines dimensionality reduction and clustering, achieved ARI scores of 0.85 and 0.66, respectively. Because scVI does not have built-in clustering, we similarly applied Louvain to its dimensionality reduction results, termed scVI-Louvain, and this simple two-step approach achieved ARI scores of 0.82 and 0.71. These scores are lower than those of scVIC, which is also a variational model but considers heterogeneity. As another neural network model that considers heterogeneity, the GMVAE in scVAE achieved ARI scores of only 0.18 and 0.14, owing to its instability, performing even worse than the two-step scVI-Louvain. Our algorithm outperforms the seven other algorithms on both balanced and imbalanced simulated data without batch effects, as well as on the NMI scores, as shown in Supplementary Fig. S1A.
As shown in Fig. 2C, the performance of the different algorithms under different simulation parameters can be further compared. Each subplot shows the performance of the different algorithms as the simulated dropout rate gradually increases. As the dropout rate increases, the ARI of all algorithms decreases. However, it is noteworthy that scVIC and scVIC-Louvain are superior to the other algorithms under almost all simulation parameter combinations. Furthermore, our algorithm is relatively robust, with almost all ARIs above 0.8 when the simulated data are relatively clean, for example, in balanced simulations where the number of simulated categories is ≤6 and the capture loss rate is ≤16.8%, and in imbalanced simulations where the imbalance ratio is ≤0.7 and the capture loss rate is ≤16.8%. Among the three other neural network-based algorithms, scziDesk and scVI-Louvain are in the second tier. The performance of scziDesk is better than that of the two-step algorithm based on the dimensionality reduction results of scVI, whereas scVAE is not comparable with the other algorithms owing to its instability. In the imbalanced simulated data, we noticed that SOUP, which is based on pure-cell clustering, can effectively alleviate the difficulties caused by data imbalance, and its results are relatively stable. Overall, however, it is not as good as our algorithm in terms of ARI; its performance is only comparable to ours on the simulated data with an imbalance ratio of 0.6. Similar results were obtained with the NMI evaluation metric, as shown in Supplementary Fig. S1C.
Next, we compared the clustering and batch effects removal performance of all four algorithms on both balanced and imbalanced simulated data with batch effects. As shown in Fig. 2B, scVIC-Louvain has the best clustering performance on both balanced and imbalanced simulated data, with ARI scores of 0.96 and 0.94, respectively. We compared scVIC-Louvain with Seurat-Louvain and scVI-Louvain, which apply Louvain to the dimensionality reduction results obtained after batch effects removal by Seurat or scVI, respectively. scVIC and Seurat-Louvain are comparable, whereas scVI-Louvain is inferior: on balanced and imbalanced simulated data, the ARI scores for scVIC are 0.94 and 0.89, respectively, those for Seurat-Louvain are 0.92 and 0.93, and those for scVI-Louvain are 0.89 and 0.85. However, it is worth noting that Seurat-Louvain performs much worse than the other algorithms in terms of batch effects removal, as shown in Supplementary Fig. S2. DESC, which is also based on neural networks and combines batch effects removal and clustering, performs poorly, with ARI scores of 0.32 and 0.21, respectively. Our algorithm outperforms the other three algorithms in terms of both batch effects removal and clustering on balanced and imbalanced simulated data with batch effects. This conclusion was also validated by the NMI scores, as shown in Supplementary Fig. S1B.
The performance of the different algorithms under different simulation parameters can be further compared, as shown in Fig. 2D. In most cases, scVIC and scVIC-Louvain outperform the other algorithms. Unlike the results on the simulated data without batch effects, when the simulated data are relatively clean, our algorithm is comparable to the other algorithms; however, it outperforms them when the dropout rate of the simulated data is relatively high, reaching up to 23%. When the simulated data are relatively clean, scVI-Louvain, as a more direct approach, can also achieve good results, while Seurat-Louvain trades poor performance in batch effects removal for good clustering results. scVIC simultaneously considers batch effects removal and clustering and, especially when the simulated data have a high dropout rate, achieves better overall results. It is worth noting that DESC, which also combines batch effects removal and clustering based on neural networks, consistently performs poorly across the various parameter combinations. Similar results were obtained with the NMI evaluation metric, as shown in Supplementary Fig. S1D.

Heterogeneity of the real biological data not involving batch effects
We compared the performance of all seven clustering algorithms on three biological datasets, TRACHEA, TURTLE, and BACH, which are free from batch effects. It is worth mentioning that the annotations of real biological datasets are mostly produced manually on the basis of the results of community detection algorithms similar to Louvain, combined with biological knowledge; therefore, we selected the results of scVIC-Louvain for comparison with those of the other algorithms. In addition, because of the instability of scVAE during optimization, it was not included in the comparison below. The performance of the different algorithms on both the ARI and NMI evaluation metrics is shown in Fig. 3A. On the TRACHEA dataset, the three neural network-based algorithms, scVIC-Louvain (with ARI and NMI values of 0.94 and 0.87, respectively), scVI-Louvain (0.93 and 0.86), and scziDesk (0.89 and 0.85), outperformed the other four algorithms: CIDR (0.34 and 0.50), RaceID (0.14 and 0.39), SIMLR (0.05 and 0.18), and SOUP (0.27 and 0.48). On the TURTLE dataset, scVIC-Louvain (0.83 and 0.86) and scVI-Louvain (0.82 and 0.85) outperformed scziDesk (0.56 and 0.70), CIDR (0.26 and 0.48), RaceID (0.35 and 0.57), SIMLR (0.54 and 0.63), and SOUP (0.10 and 0.52). On the largest dataset, BACH, scVIC-Louvain (0.88 and 0.83) and scziDesk (0.87 and 0.85) performed similarly and slightly outperformed scVI-Louvain (0.78 and 0.78). As the data size reaches a certain scale, the performance of CIDR (0.82 and 0.79), RaceID (0.62 and 0.70), SIMLR (0.77 and 0.77), and SOUP (0.46 and 0.64) improves. Overall, on biological datasets without batch effects, the results of scVIC-Louvain are generally better and more stable than those of scVI-Louvain and scziDesk, whose performance is in turn generally better than that of CIDR, RaceID, SIMLR, and SOUP; these four algorithms tend to perform better as the data size reaches a certain scale.
To compare the dimensionality reduction and clustering results, we further projected the latent variables of scVIC and scVI onto a 2D space using the UMAP visualization algorithm (Becht et al. 2019), as shown in Fig. 3B. For the TRACHEA and TURTLE datasets, the clustering evaluation metrics indicate that scVIC-Louvain performs slightly better than scVI-Louvain, but there is little difference between the visualization results of the two methods. This suggests that for biological datasets on which scVI yields a strongly heterogeneous latent space, scVIC can effectively preserve similar heterogeneity. For the BACH dataset, scVI-Louvain performed noticeably worse than scVIC-Louvain on the clustering evaluation metrics. Specific differences between the two methods are visible in the visualization of their latent spaces, mainly concentrated in the Hsd, Hsp, and Lp categories. When Louvain is used for further classification on the latent space generated by scVI, these three categories appear intertwined, whereas scVIC mitigates this phenomenon to some extent. Using Louvain on the latent space generated by scVIC accurately identifies Hsd, the class with the largest number of samples; however, an overlap still exists between the Hsp and Lp classes. This phenomenon could be explained by further subdivisions of subgroups within the BACH dataset that still exhibit some degree of heterogeneity: the original authors' annotation is detailed enough to distinguish up to 15 categories, with three possible subcategories within Hsd, two within Hsp, and two within Lp. This, to some extent, validates the rationality of the two algorithms. When only the default eight major cell types provided by the original authors are considered, scVIC outperforms scVI in terms of clustering.

Heterogeneity of the real biological data involving batch effects
We compared the three neural network-based algorithms on the biological datasets RETINA and PBMC, which have batch effects, as shown in Fig. 4A. For both datasets, scVIC-Louvain outperforms scVI-Louvain and DESC, where batch effects removal is evaluated by the KL divergence metric; a smaller KL divergence indicates better removal. For the RETINA dataset, scVIC-Louvain outperforms the other two algorithms in terms of the clustering evaluation metric ARI. From the visual results, we can see that DESC divides the largest category, RBC, into three subcategories, while scVI-Louvain divides it into two relatively even subcategories. scVIC-Louvain performed much better, with most samples clustered into one category, which significantly improved the ARI score. The performance of scVIC-Louvain is also superior to those of the other two algorithms on the NMI evaluation metric, which accounts for subset separation. At the same time, we can see that DESC is clearly inferior to the other two algorithms in removing batch effects, especially for the RBC category, in which the mixture of the two batches is quite uneven, and the unevenness is clearly visible.

Figure 1. Overview of scVIC. (A) Neural network schematic of scVIC. (B) Generative probabilistic model of scVIC. Shaded vertices represent observed random variables, and empty vertices represent unobserved latent random variables. Shaded diamonds represent constants given a priori, and empty diamonds represent global parameters that need to be estimated. Edges signify conditional dependency. (C) Probabilistic model when the parameters of the Gaussian mixture distribution are fixed during parametric optimization. (D) Probabilistic model when the parameters of the NNs are fixed during parametric optimization.

Figure 2. Comparison of clustering algorithms on simulated datasets. Bar charts of ARI for different clustering algorithms on simulated datasets (A) not involving batch effects and (B) involving batch effects. For different simulation parameters, line charts of ARI for different clustering algorithms on simulated datasets (C) not involving batch effects and (D) involving batch effects. The source data used in this figure are placed in the Supplementary Tables.

Figure 3. Comparison of clustering algorithms on biological datasets not involving batch effects. (A) Bar charts of clustering metrics for different algorithms. The source data used in this sub-figure are placed in the Supplementary Tables. (B) Visualization for different algorithms (scVIC-Louvain and scVI-Louvain).

Figure 4. Comparison of clustering algorithms that consider batch effects on biological datasets involving batch effects. (A) Comparison of visualization performance and metrics for different algorithms that consider batch effects. (B) Comparison of visualization performance and metrics for different algorithms that do not consider batch effects (scVIC-Louvain and scVI-Louvain).
Algorithm 1. Coordinate-descent inference for scVIC.
Require: Expression matrix [x_n]_N, batch annotation [s_n]_N, training batch size N_b, number of epochs N_e, four initialized neural networks, and initialized distribution parameters of the ZINB and GM.
Ensure: Four optimized neural networks and optimized distribution parameters of the ZINB and GM.
Initialize the neural networks. Initialize [θ_g]_G, [π_c]_C, and [μ_c]_C.
for epoch = 1 to N_e do
  for each mini-batch of size N_b do
    Compute the Monte Carlo estimator of L′. Maximize L′ over the first block of parameters.
  end for
  for c ∈ [1, C] do
    Update π_c and μ_c by EM on the sampled latent variables.
  end for
end for