scTPC: a novel semisupervised deep clustering model for scRNA-seq data

Abstract

Motivation
Continuous advancements in single-cell RNA sequencing (scRNA-seq) technology have enabled researchers to further study cell heterogeneity, trajectory inference, the identification of rare cell types, and neurology. Accurate clustering of scRNA-seq data is crucial in single-cell sequencing data analysis. However, the high dimensionality, sparsity, and presence of “false” zero values in the data pose challenges to clustering. Furthermore, current unsupervised clustering algorithms do not effectively leverage prior biological knowledge, which makes cell clustering even more challenging.

Results
This study presents a semisupervised deep clustering model called scTPC, which integrates a triplet constraint, a pairwise constraint, and a cross-entropy constraint. Specifically, the model begins by pretraining a denoising autoencoder based on a zero-inflated negative binomial distribution. Deep clustering is then performed in the learned latent feature space using triplet constraints and pairwise constraints generated from partially labeled cells. Finally, to address datasets with imbalanced cell types, a weighted cross-entropy loss is introduced to optimize the model. Experimental results on 10 real scRNA-seq datasets and five simulated datasets demonstrate that scTPC achieves accurate clustering with a well-designed framework.

Availability and implementation
scTPC is a Python-based algorithm, and the code is available from https://github.com/LF-Yang/Code or https://zenodo.org/records/10951780.


Compared methods
We compared scTPC with several advanced clustering methods. Table 2 lists each method along with its programming language and link.

Performance evaluation
As in most clustering studies, we use four commonly used metrics as evaluation criteria: NMI, AMI, ARI, and ACC.

Normalized Mutual Information (NMI)
The mutual information (MI) between two random variables X and Y is defined as:

$$\mathrm{MI}(X, Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$

where p(x, y) is the joint distribution of X and Y, and p(x) and p(y) are the marginal distributions of X and Y, respectively. NMI is the mutual information between the predicted assignment U and the ground-truth labels V, normalized by the entropies of U and V, where H(U) and H(V) denote the entropy of the predicted categories and the ground-truth categories, respectively.
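As an illustration of the quantities described above, the following minimal sketch computes MI and NMI from two label vectors using empirical frequencies. The function names and the `normalizer` parameter are hypothetical, and the choice of normalizer is an assumption: the text only states that MI is divided by the entropies of U and V, so both the arithmetic mean and the maximum of H(U) and H(V) are offered as common options.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """MI between two labelings from empirical joint and marginal frequencies."""
    n = len(u)
    joint = Counter(zip(u, v))
    count_u, count_v = Counter(u), Counter(v)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), with the factors of n simplified
        mi += (c / n) * math.log(c * n / (count_u[x] * count_v[y]))
    return mi


def nmi(u, v, normalizer="arithmetic"):
    """NMI with an assumed normalizer: 'arithmetic' mean or 'max' of the entropies."""
    h_u, h_v = entropy(u), entropy(v)
    denom = max(h_u, h_v) if normalizer == "max" else (h_u + h_v) / 2
    if denom == 0:
        # Both labelings consist of a single cluster; treat them as identical.
        return 1.0
    return mutual_information(u, v) / denom


if __name__ == "__main__":
    predicted = [0, 0, 1, 1, 2, 2]
    truth = ["a", "a", "b", "b", "c", "c"]
    print(nmi(predicted, truth))
    print(nmi(predicted, truth, normalizer="max"))
```

NMI depends only on how cells are grouped, not on the label names, so a perfect clustering scores close to 1 regardless of how the clusters are numbered.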

Adjusted Mutual Information (AMI)
The AMI measures the similarity between the clustering result and the ground-truth labels while correcting the mutual information for chance agreement. It is computed from the mutual information MI, its expected value E[MI], and the entropies H(U) and H(V) of the predicted categories and the ground-truth categories, respectively.
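The sketch below shows one way the AMI could be computed from MI, E[MI], H(U), and H(V). It assumes the general form (MI - E[MI]) / (normalizer - E[MI]), with E[MI] taken under the usual hypergeometric model of random labelings with fixed cluster sizes. The normalizer (arithmetic mean or maximum of the entropies) is an assumption, as are all function names; the exact variant used in the paper is not recoverable from the text.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """MI between two labelings from empirical frequencies."""
    n = len(u)
    count_u, count_v = Counter(u), Counter(v)
    return sum(
        (c / n) * math.log(c * n / (count_u[x] * count_v[y]))
        for (x, y), c in Counter(zip(u, v)).items()
    )


def log_factorial(k):
    """log(k!) via the log-gamma function, to avoid huge integers."""
    return math.lgamma(k + 1)


def expected_mutual_information(u, v):
    """E[MI] over random labelings that keep both sets of cluster sizes fixed."""
    n = len(u)
    emi = 0.0
    for a in Counter(u).values():
        for b in Counter(v).values():
            # nij ranges over the feasible overlaps between the two clusters.
            for nij in range(max(1, a + b - n), min(a, b) + 1):
                log_prob = (
                    log_factorial(a) + log_factorial(b)
                    + log_factorial(n - a) + log_factorial(n - b)
                    - log_factorial(n) - log_factorial(nij)
                    - log_factorial(a - nij) - log_factorial(b - nij)
                    - log_factorial(n - a - b + nij)
                )
                emi += (nij / n) * math.log(n * nij / (a * b)) * math.exp(log_prob)
    return emi


def ami(u, v, normalizer="arithmetic"):
    """AMI with an assumed normalizer: 'arithmetic' mean or 'max' of the entropies."""
    h_u, h_v = entropy(u), entropy(v)
    norm = max(h_u, h_v) if normalizer == "max" else (h_u + h_v) / 2
    emi = expected_mutual_information(u, v)
    denom = norm - emi
    if abs(denom) < 1e-15:
        # Degenerate case, e.g. both labelings are a single cluster.
        return 1.0
    return (mutual_information(u, v) - emi) / denom


if __name__ == "__main__":
    print(ami([0, 0, 1, 1], [1, 1, 0, 0]))  # identical partitions
    print(ami([0, 0, 1, 1], [0, 1, 0, 1]))  # partitions that share no structure
```

Unlike NMI, the adjusted score can be negative when two labelings agree less than would be expected by chance.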

Adjusted Rand Index (ARI)
ARI evaluates the similarity between predicted cluster assignments and true labels by counting pairs of cells, and is defined as:

$$\mathrm{ARI} = \frac{\binom{n}{2}(a + d) - \left[(a + b)(a + c) + (c + d)(b + d)\right]}{\binom{n}{2}^{2} - \left[(a + b)(a + c) + (c + d)(b + d)\right]}$$

Here, n represents the total number of cells; a represents the number of pairs of cells that are assigned to the same type in both the true cell types and the clustering results; b represents the number of pairs that are assigned to different types in the true cell types but the same type in the clustering results; c represents the number of pairs that are assigned to the same type in the true cell types but different types in the clustering results; and d represents the number of pairs that are assigned to different types in both the true cell types and the clustering results.
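The counts a, b, c, and d described above can be tallied by enumerating all pairs of cells. The following minimal sketch does so and then applies the standard pair-counting form of the ARI. The function names are hypothetical, and the formula used here is an assumption that the paper follows the usual definition in terms of a, b, c, d, and n. The quadratic pair enumeration is intended for illustration on small inputs only.

```python
import math
from itertools import combinations


def pair_counts(true_labels, pred_labels):
    """Count cell pairs by agreement between true types and predicted clusters."""
    a = b = c = d = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            a += 1  # same type in both
        elif same_pred:
            b += 1  # different true types, same cluster
        elif same_true:
            c += 1  # same true type, different clusters
        else:
            d += 1  # different in both
    return a, b, c, d


def ari(true_labels, pred_labels):
    """Adjusted Rand index from the pair counts a, b, c, d."""
    n = len(true_labels)
    a, b, c, d = pair_counts(true_labels, pred_labels)
    total_pairs = math.comb(n, 2)
    chance_term = (a + b) * (a + c) + (c + d) * (b + d)
    denom = total_pairs ** 2 - chance_term
    if denom == 0:
        # Degenerate case: both labelings are one cluster or all singletons.
        return 1.0
    return (total_pairs * (a + d) - chance_term) / denom


if __name__ == "__main__":
    truth = [0, 0, 1, 2]
    predicted = [0, 0, 1, 1]
    print(pair_counts(truth, predicted))
    print(ari(truth, predicted))
```

Because only pairwise agreement is counted, renumbering the predicted clusters leaves the score unchanged.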

Fig. 4. UMAP visualization plots. Panels (a)-(f) show the original data plotted directly. The other panels show the data after processing with scTPC.

Table 1. Summary of the parameters.

Table 2. Summary of the compared methods.