scAce: an adaptive embedding and clustering method for single-cell gene expression data

Abstract

Motivation: Since the development of single-cell RNA sequencing (scRNA-seq) technologies, clustering analysis of single-cell gene expression data has been an essential tool for distinguishing known cell types and identifying novel ones. Although many methods are available for scRNA-seq clustering analysis, most of them are constrained by the requirement of a predetermined number of clusters or by a dependence on the initial cluster assignment.

Results: In this article, we propose an adaptive embedding and clustering method named scAce, which constructs a variational autoencoder to simultaneously learn cell embeddings and cluster assignments. In scAce, we develop an adaptive cluster merging approach that achieves improved clustering results without the need to estimate the number of clusters in advance. In addition, scAce provides an option to perform clustering enhancement, which can update and enhance cluster assignments based on previous clustering results from other methods. Through computational analysis of both simulated and real datasets, we demonstrate that scAce outperforms state-of-the-art clustering methods for scRNA-seq data and achieves better clustering accuracy and robustness.

Availability and implementation: The scAce package is implemented in Python 3.8 and is freely available from https://github.com/sldyns/scAce.


Supplementary Tables
Table: ARI values for different values of β during adaptive cluster merging on the eight datasets.

In the pre-training stage, the main goal of scAce is to obtain an initialized VAE network, so we set β to a relatively small value, 0.01 × m. At the adaptive cluster merging stage, the value of β directly affects the final clustering performance, so we focused on evaluating its impact in this stage. We evaluated β values of 0.01 × m, 0.007 × m, 0.004 × m, and 0.001 × m, while keeping the other parameters unchanged, and conducted experiments on the simulated dataset and seven real datasets.

Figure S4: Comparison of the clustering methods on three real scRNA-seq datasets (Human kidney, Mouse hypothalamus, and Turtle brain) by applying the ten clustering methods to randomly selected subsamples of the full datasets.

Figure S5: Comparison between results of scAce and cluster initialization. (A) ARI and NMI values of scAce's clustering results and initial clustering results obtained by setting the resolution parameter in the Leiden algorithm such that the initial cluster number was the same as the true cell type number. (B) UMAP plots of initial clusters obtained by setting the resolution parameter in the Leiden algorithm such that the initial cluster number was the same as the true cell type number.

Data pre-processing
For all scRNA-seq datasets used, scAce pre-processes the raw count matrix using Scanpy [?]. The pre-processing steps are as follows. First, genes expressed in fewer than 3 cells and cells with fewer than 200 expressed genes are filtered out. The remaining count matrix is denoted as $X = [x_{ij}] \in \mathbb{R}^{m \times n}$. Second, the read counts in each cell are normalized by the library size factor of that cell, defined as the total count in the cell divided by the median total count across all cells. Third, the normalized counts are log-transformed and then converted to z-scores, such that the expression level of every gene has zero mean and unit variance across all cells. The processed gene expression matrix is denoted as $\tilde{X} \in \mathbb{R}^{m \times n}$.
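These steps correspond to standard Scanpy operations; the snippet below is a minimal sketch (the input file name is a placeholder, and all functions shown are standard Scanpy API):

```python
import scanpy as sc

# Load the raw count matrix (placeholder file name).
adata = sc.read_h5ad("raw_counts.h5ad")

# Step 1: filter genes expressed in fewer than 3 cells and
# cells with fewer than 200 expressed genes.
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.filter_cells(adata, min_genes=200)

# Step 2: library size normalization; with target_sum=None, Scanpy
# scales each cell so that its total count equals the median of
# total counts across all cells.
sc.pp.normalize_total(adata, target_sum=None)

# Step 3: log-transform, then convert to per-gene z-scores.
sc.pp.log1p(adata)
sc.pp.scale(adata)
```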

Derivation of the VAE loss function
For a sample X, our goal is to use only X to learn the joint distribution of the data X and the hidden variable Z through an unsupervised deep generative model, such that the observed data X can be generated from Z through the model p(X|Z). Assuming that the model is parameterized by θ, our goal can be expressed as maximizing the expected log-likelihood of the observed data X over the distribution of the hidden variable Z [1]:

$$\max_{\theta}\;\mathbb{E}_{Z\sim p(Z)}\left[\log p_{\theta}(X\mid Z)\right].\tag{S1}$$

Since we do not know the exact form of the true distribution p(Z), we define the variational distribution $q_{\phi}(Z\mid X)$ to approximate it. The KL divergence is used to measure how close the two distributions are to one another; a smaller KL divergence indicates a greater similarity between the two distributions. This leads to the constrained optimization problem

$$\max_{\theta,\phi}\;\mathbb{E}_{Z\sim q_{\phi}(Z\mid X)}\left[\log p_{\theta}(X\mid Z)\right]\quad\text{subject to}\quad D_{\mathrm{KL}}\!\left(q_{\phi}(Z\mid X)\,\|\,p(Z)\right)<\epsilon.\tag{S2}$$
Converting this constrained problem to its Lagrangian form, we obtain the following objective function:

$$\mathcal{F}(\theta,\phi,\beta)=\mathbb{E}_{Z\sim q_{\phi}(Z\mid X)}\left[\log p_{\theta}(X\mid Z)\right]-\beta\left(D_{\mathrm{KL}}\!\left(q_{\phi}(Z\mid X)\,\|\,p(Z)\right)-\epsilon\right),\tag{S3}$$

where β ≥ 1 is the regularisation factor; the model reduces to the conventional VAE when β = 1. The coefficient β sets the weight of the KL divergence term and thus controls the trade-off between the reconstruction accuracy of the model and the regularity of the latent representation. Ignoring the fixed constants, the objective function can be written as:

$$\mathcal{L}(\theta,\phi,\beta)=\mathbb{E}_{Z\sim q_{\phi}(Z\mid X)}\left[\log p_{\theta}(X\mid Z)\right]-\beta\,D_{\mathrm{KL}}\!\left(q_{\phi}(Z\mid X)\,\|\,p(Z)\right).\tag{S4}$$

To better characterize scRNA-seq count data, we use the ZINB distribution as the generating distribution $p_{\theta}(X\mid Z)$. Therefore, $\mathbb{E}_{Z\sim q_{\phi}(Z\mid X)}[\log p_{\theta}(X\mid Z)]$ in equation (S4) can be approximated as follows:

$$\mathbb{E}_{Z\sim q_{\phi}(Z\mid X)}\left[\log p_{\theta}(X\mid Z)\right]\approx\sum_{i=1}^{m}\sum_{j=1}^{n}\log\mathrm{ZINB}\!\left(x_{ij};\,\mu_{ij},\theta_{ij},\pi_{ij}\right),\tag{S5}$$

where $\mathrm{ZINB}(x;\mu,\theta,\pi)=\pi\,\delta_{0}(x)+(1-\pi)\,\mathrm{NB}(x;\mu,\theta)$ and NB denotes the negative binomial distribution with mean µ and dispersion θ. Then, we assume that $q_{\phi}(Z\mid X)$ follows a Gaussian distribution with mean matrix $\mu_{x}$ and variance matrix $\sigma_{x}^{2}$, and that the prior p(Z) is standard normal. The second term in equation (S4), which in general can be calculated by Monte Carlo estimation, then admits the closed form

$$D_{\mathrm{KL}}\!\left(q_{\phi}(Z\mid X)\,\|\,p(Z)\right)=\frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{d}\left[(\mu_{x})_{ij}^{2}+(\sigma_{x}^{2})_{ij}-\log(\sigma_{x}^{2})_{ij}-1\right],\tag{S6}$$

where d is the dimension of the latent space. Summarizing the above results, the loss function of the β-VAE network can be obtained as:

$$\mathcal{L}_{\mathrm{VAE}}=-\sum_{i=1}^{m}\sum_{j=1}^{n}\log\mathrm{ZINB}\!\left(x_{ij};\,\mu_{ij},\theta_{ij},\pi_{ij}\right)+\frac{\beta}{2}\sum_{i=1}^{m}\sum_{j=1}^{d}\left[(\mu_{x})_{ij}^{2}+(\sigma_{x}^{2})_{ij}-\log(\sigma_{x}^{2})_{ij}-1\right],\tag{S7}$$

where $x_{ij}$ is the entry in the i-th row and j-th column of the matrix X; $\mu_{ij}$, $\theta_{ij}$, $\pi_{ij}$ are the entries in the i-th row and j-th column of the mean, dispersion, and dropout-probability matrices M, Θ, Π of the ZINB distribution, respectively; and $(\sigma_{x}^{2})_{ij}$ and $(\mu_{x})_{ij}$ are the entries in the i-th row and j-th column of the matrices $\sigma_{x}^{2}$ and $\mu_{x}$ of the hidden-layer Gaussian distribution, respectively.
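To make the loss concrete, the following is a minimal NumPy sketch of equation (S7); the function names and the use of NumPy/SciPy here are illustrative choices, not the scAce package's implementation:

```python
import numpy as np
from scipy.special import gammaln

def zinb_log_likelihood(x, mu, theta, pi, eps=1e-10):
    """Element-wise log-likelihood of the ZINB distribution.

    x: observed counts; mu: mean; theta: dispersion; pi: dropout probability.
    """
    # Negative binomial log-probability via log-gamma functions.
    log_nb = (gammaln(x + theta) - gammaln(theta) - gammaln(x + 1)
              + theta * np.log(theta / (theta + mu) + eps)
              + x * np.log(mu / (theta + mu) + eps))
    # For zero counts, mix the point mass at zero with the NB probability of zero.
    nb_zero = np.exp(theta * np.log(theta / (theta + mu) + eps))
    log_zero = np.log(pi + (1.0 - pi) * nb_zero + eps)
    log_nonzero = np.log(1.0 - pi + eps) + log_nb
    return np.where(x < eps, log_zero, log_nonzero)

def beta_vae_loss(x, mu, theta, pi, mu_x, var_x, beta):
    """Equation (S7): negative ZINB log-likelihood plus the beta-weighted
    KL divergence between N(mu_x, var_x) and the standard normal prior."""
    recon = -np.sum(zinb_log_likelihood(x, mu, theta, pi))
    kl = 0.5 * np.sum(mu_x**2 + var_x - np.log(var_x) - 1.0)
    return recon + beta * kl
```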

Cluster splitting method
In the cluster initialization step of scAce, we include the option of clustering enhancement, which allows the user to start with cluster labels produced by another clustering method and obtain enhanced clustering results by optimizing the VAE network and cluster centroids. Since the existing clustering results from the other method may contain incorrect clusters with cells from multiple cell types, in order to maximize the enhancement, it is necessary to first split the initial clusters into purer subclusters. Similar to the adaptive criterion used in the cluster merging step (see Methods), we split a previous cluster if its intra-cluster distance is greater than $\bar{d}/2$, where $\bar{d}$ is the average inter-cluster distance [2,3]. Specifically, the intra-cluster distances of all previous clusters are calculated, and the cluster with the largest intra-cluster distance is split into two clusters if this distance is greater than $\bar{d}/2$. Then, the average inter-cluster distance $\bar{d}$ is updated. The above splitting process is repeated until the intra-cluster distances of all clusters are smaller than $\bar{d}/2$.
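As an illustration of this loop, the sketch below assumes the intra-cluster distance is the mean pairwise distance within a cluster and the inter-cluster distance is the distance between cluster centroids (the exact definitions follow the Methods section); split_fn stands for the Kaufman splitting step described next:

```python
import numpy as np
from itertools import combinations
from scipy.spatial.distance import pdist

def intra_distance(points):
    # Intra-cluster distance: mean pairwise distance (assumed definition).
    return pdist(points).mean() if len(points) > 1 else 0.0

def mean_inter_distance(clusters):
    # Average inter-cluster distance d_bar: mean distance between
    # cluster centroids (assumed definition).
    centroids = [c.mean(axis=0) for c in clusters]
    return np.mean([np.linalg.norm(a - b) for a, b in combinations(centroids, 2)])

def split_until_pure(clusters, split_fn):
    """Repeatedly split the cluster with the largest intra-cluster
    distance while that distance exceeds d_bar / 2; split_fn maps
    one cluster's points to two new point sets."""
    while len(clusters) > 1:
        d_bar = mean_inter_distance(clusters)
        intra = [intra_distance(c) for c in clusters]
        worst = int(np.argmax(intra))
        if intra[worst] <= d_bar / 2:
            break
        a, b = split_fn(clusters[worst])  # e.g. the Kaufman split below
        clusters = clusters[:worst] + [a, b] + clusters[worst + 1:]
    return clusters
```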
When splitting a cluster into two smaller clusters, we use the Kaufman algorithm to identify the new centroids of the two clusters [4]. Suppose we would like to split cluster i; the algorithm works as follows (a code sketch is given after the steps).
1. For each cell in this cluster, we calculate the sum of its distances to all other cells in cluster i.
The cell with the smallest sum is selected as the first centroid, and its index and coordinates are denoted as $q_{i1}$ and $c_{i1}$, respectively.
2. For cells q and r in cluster i ($q, r \in N_i$ and $q, r \neq q_{i1}$), we calculate

$$C_{qr} = \max\left(D_q - d_{qr},\, 0\right),$$

where $D_q$ is the distance between cell q and the first centroid, and $d_{qr}$ is the distance between cells q and r.

3. The cell r with the largest total gain $\sum_{q} C_{qr}$ is selected as the second centroid, and its index and coordinates are denoted as $q_{i2}$ and $c_{i2}$, respectively.
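The following minimal sketch transcribes these steps in NumPy; Euclidean distance and the final nearest-centroid assignment of cells to the two new centroids are assumptions for illustration, and the function can serve as the split_fn of the earlier loop:

```python
import numpy as np
from scipy.spatial.distance import cdist

def kaufman_split(points):
    """Split one cluster into two: pick two centroids with the Kaufman
    initialization, then assign each cell to the nearer centroid (assumed)."""
    dist = cdist(points, points)      # pairwise distances d_qr
    # Step 1: the first centroid is the cell with the smallest sum of
    # distances to all other cells in the cluster.
    first = int(np.argmin(dist.sum(axis=1)))
    D = dist[:, first]                # D_q: distance of cell q to the first centroid
    # Step 2: gains C_qr = max(D_q - d_qr, 0) for q, r != first centroid.
    C = np.maximum(D[:, None] - dist, 0.0)
    C[first, :] = 0.0                 # exclude q = first centroid
    C[:, first] = 0.0                 # exclude r = first centroid
    np.fill_diagonal(C, 0.0)          # exclude q = r
    # Step 3: the second centroid is the cell r with the largest total gain.
    second = int(np.argmax(C.sum(axis=0)))
    # Assign every cell to the nearer of the two centroids.
    to_second = dist[:, second] < dist[:, first]
    return points[~to_second], points[to_second]

# Example: split 60 points drawn from two well-separated blobs.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
a, b = kaufman_split(pts)
print(len(a), len(b))  # roughly 30 and 30
```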