scDAC: deep adaptive clustering of single-cell transcriptomic data with coupled autoencoder and Dirichlet process mixture model

Abstract

Motivation: Clustering analysis of single-cell RNA sequencing (scRNA-seq) data is an important step in revealing cellular heterogeneity. Many clustering methods have been proposed to discover heterogeneous cell types from scRNA-seq data. However, adaptive clustering that recovers an accurate number of clusters, reflecting the intrinsic biological nature of large-scale scRNA-seq data, remains challenging.

Results: Here, we propose a single-cell Deep Adaptive Clustering (scDAC) model that couples an Autoencoder (AE) with a Dirichlet Process Mixture Model (DPMM). By jointly optimizing the model parameters of the AE and the DPMM, scDAC achieves adaptive clustering with accurate cluster numbers on scRNA-seq data. We verify the performance of scDAC on five subsampled datasets with different numbers of cell types and compare it with 15 widely used clustering methods across nine scRNA-seq datasets. Our results demonstrate that scDAC can adaptively find accurate numbers of cell types or subtypes and outperforms the other methods. Moreover, the performance of scDAC is robust to hyperparameter changes.

Availability and implementation: scDAC is implemented in Python. The source code is available at https://github.com/labomics/scDAC.
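To give some intuition for why a Dirichlet process prior lets the number of clusters adapt to the data, the following minimal sketch draws mixture weights with the standard truncated stick-breaking construction. It is an illustration of the prior only, not the scDAC implementation: the function names, the truncation level of 50 and the 0.01 weight threshold are hypothetical choices made for this example, and no autoencoder or joint optimization is involved.

```python
import random


def stick_breaking_weights(alpha, truncation, seed=0):
    """Draw truncated stick-breaking mixture weights for a Dirichlet process.

    alpha: concentration parameter; larger values spread mass over more components.
    truncation: maximum number of components kept (a hypothetical cap for illustration).
    """
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of the stick that has not been assigned yet
    for _ in range(truncation - 1):
        frac = rng.betavariate(1.0, alpha)  # fraction broken off the remaining stick
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    weights.append(remaining)  # the last component takes what is left, so weights sum to 1
    return weights


def effective_components(weights, threshold=0.01):
    """Count components whose weight exceeds a (hypothetical) threshold."""
    return sum(w > threshold for w in weights)


weights = stick_breaking_weights(alpha=1.0, truncation=50)
print(effective_components(weights), "of", len(weights), "components carry noticeable weight")
```

Although 50 components are available, most of the probability mass typically concentrates on a few of them, which is the property a DPMM exploits to infer the number of clusters rather than requiring it as input.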


Fig. S1. Comparison of clustering performance between scDAC and other widely used methods on the Baron dataset. (a) UMAP visualization of the low-dimensional representations produced by each method. The upper panel is annotated with true labels and the lower panel with predicted labels. The first column shows the UMAP plot for scDAC; the remaining columns show the other methods. (b) NMI, ARI, SC and mean scores of scDAC and the other methods. Each row represents a clustering method. The three columns of circles, from left to right, represent the NMI, ARI and SC scores, respectively. The rectangles on the right represent the mean of these three scores. The size of each circle or rectangle corresponds to the score: larger means better performance. The color darkness of each circle or rectangle corresponds to the ranking: the lightest color marks the top rank. The clustering methods are sorted by mean score in descending order. (c) Bar plot of the Deviation Ratio (DR) between the labels predicted by each method and the true labels. The x axis shows the methods and the y axis shows the Deviation Ratio. Blue bars denote positive deviation and red bars denote negative deviation. Shorter bars indicate better results. AE+K-means, PCA+K-means, GLDC and SDCN were excluded from the DR comparison because they require the number of cell types as input. (d) Sankey plot for scDAC showing the correspondence between the predicted labels and the ground truth.
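For readers who want to reproduce the kind of scores summarized in panels (b) and (c), the sketch below computes ARI and NMI from two label vectors using only the Python standard library. Several points are assumptions rather than statements from the source: the NMI is normalized by the arithmetic mean of the two entropies (the normalization used in the paper is not stated here), and `deviation_ratio` is a hypothetical definition, (predicted cluster count minus true cluster count) divided by the true cluster count, chosen only because it yields signed values consistent with the positive and negative bars described in the caption. SC is omitted because it requires the low-dimensional embeddings, not just the labels.

```python
from collections import Counter
from math import comb, log


def ari(true, pred):
    """Adjusted Rand Index from the contingency table of two labelings (needs >= 2 cells)."""
    n = len(true)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. both labelings have one cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum(c / n * log(c / n) for c in Counter(labels).values())


def nmi(true, pred):
    """Mutual information normalized by the arithmetic mean of the entropies (assumed)."""
    n = len(true)
    rows, cols = Counter(true), Counter(pred)
    mi = sum(
        c / n * log(n * c / (rows[t] * cols[p]))
        for (t, p), c in Counter(zip(true, pred)).items()
    )
    denom = (entropy(true) + entropy(pred)) / 2
    return 1.0 if denom == 0 else mi / denom


def deviation_ratio(true, pred):
    """Hypothetical definition: signed relative error of the predicted cluster count."""
    k_true, k_pred = len(set(true)), len(set(pred))
    return (k_pred - k_true) / k_true


true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 0, 0, 0, 1, 1]  # two true cell types merged into one cluster
print(ari(true_labels, pred_labels), nmi(true_labels, pred_labels),
      deviation_ratio(true_labels, pred_labels))
```

Under the assumed definition, merging two cell types gives a negative deviation ratio (too few clusters), whereas over-splitting would give a positive one.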

Fig. S2. Comparison of clustering performance between scDAC and other widely used methods on the Slyper dataset. (a) UMAP visualization of the low-dimensional representations produced by each method. The upper panel is annotated with true labels and the lower panel with predicted labels. The first column shows the UMAP plot for scDAC; the remaining columns show the other methods. (b) NMI, ARI, SC and mean scores of scDAC and the other methods. Each row represents a clustering method. The three columns of circles, from left to right, represent the NMI, ARI and SC scores, respectively. The rectangles on the right represent the mean of these three scores. The size of each circle or rectangle corresponds to the score: larger means better performance. The color darkness of each circle or rectangle corresponds to the ranking: the lightest color marks the top rank. The clustering methods are sorted by mean score in descending order. (c) Bar plot of the Deviation Ratio (DR) between the labels predicted by each method and the true labels. The x axis shows the methods and the y axis shows the Deviation Ratio. Blue bars denote positive deviation and red bars denote negative deviation. Shorter bars indicate better results. AE+K-means, PCA+K-means, GLDC and SDCN were excluded from the DR comparison because they require the number of cell types as input. (d) Sankey plot for scDAC showing the correspondence between the predicted labels and the ground truth.

Fig. S3. Comparison of clustering performance between scDAC and other widely used methods on the Zilionis dataset. (a) UMAP visualization of the low-dimensional representations produced by each method. The upper panel is annotated with true labels and the lower panel with predicted labels. The first column shows the UMAP plot for scDAC; the remaining columns show the other methods. (b) NMI, ARI, SC and mean scores of scDAC and the other methods. Each row represents a clustering method. The three columns of circles, from left to right, represent the NMI, ARI and SC scores, respectively. The rectangles on the right represent the mean of these three scores. The size of each circle or rectangle corresponds to the score: larger means better performance. The color darkness of each circle or rectangle corresponds to the ranking: the lightest color marks the top rank. The clustering methods are sorted by mean score in descending order. NA denotes that the method failed to produce clustering results on this dataset. (c) Bar plot of the Deviation Ratio (DR) between the labels predicted by each method and the true labels. The x axis shows the methods and the y axis shows the Deviation Ratio. Blue bars denote positive deviation and red bars denote negative deviation. Shorter bars indicate better results. AE+K-means, PCA+K-means, GLDC and SDCN were excluded from the DR comparison because they require the number of cell types as input. (d) Sankey plot for scDAC showing the correspondence between the predicted labels and the ground truth. SC3 crashed and failed to produce results on this dataset; the corresponding results are therefore either not depicted or marked as NA in the figure.

Fig. S4. Comparison of clustering performance between scDAC and other widely used methods on the Muraro dataset. (a) UMAP visualization of the low-dimensional representations produced by each method. The upper panel is annotated with true labels and the lower panel with predicted labels. The first column shows the UMAP plot for scDAC; the remaining columns show the other methods. (b) NMI, ARI, SC and mean scores of scDAC and the other methods. Each row represents a clustering method. The three columns of circles, from left to right, represent the NMI, ARI and SC scores, respectively. The rectangles on the right represent the mean of these three scores. The size of each circle or rectangle corresponds to the score: larger means better performance. The color darkness of each circle or rectangle corresponds to the ranking: the lightest color marks the top rank. The clustering methods are sorted by mean score in descending order. (c) Bar plot of the Deviation Ratio (DR) between the labels predicted by each method and the true labels. The x axis shows the methods and the y axis shows the Deviation Ratio. Blue bars denote positive deviation and red bars denote negative deviation. Shorter bars indicate better results. AE+K-means, PCA+K-means, GLDC and SDCN were excluded from the DR comparison because they require the number of cell types as input. (d) Sankey plot for scDAC showing the correspondence between the predicted labels and the ground truth.

Fig. S5. Comparison of clustering performance between scDAC and other widely used methods on the Segerstolpe dataset. (a) UMAP visualization of the low-dimensional representations produced by each method. The upper panel is annotated with true labels and the lower panel with predicted labels. The first column shows the UMAP plot for scDAC; the remaining columns show the other methods. (b) NMI, ARI, SC and mean scores of scDAC and the other methods. Each row represents a clustering method. The three columns of circles, from left to right, represent the NMI, ARI and SC scores, respectively. The rectangles on the right represent the mean of these three scores. The size of each circle or rectangle corresponds to the score: larger means better performance. The color darkness of each circle or rectangle corresponds to the ranking: the lightest color marks the top rank. The clustering methods are sorted by mean score in descending order. (c) Bar plot of the Deviation Ratio (DR) between the labels predicted by each method and the true labels. The x axis shows the methods and the y axis shows the Deviation Ratio. Blue bars denote positive deviation and red bars denote negative deviation. Shorter bars indicate better results. AE+K-means, PCA+K-means, GLDC and SDCN were excluded from the DR comparison because they require the number of cell types as input. (d) Sankey plot for scDAC showing the correspondence between the predicted labels and the ground truth.

Fig. S6. Comparison of clustering performance between scDAC and other widely used methods on the Ghanem dataset. (a) UMAP visualization of the low-dimensional representations produced by each method. The upper panel is annotated with true labels and the lower panel with predicted labels. The first column shows the UMAP plot for scDAC; the remaining columns show the other methods. (b) NMI, ARI, SC and mean scores of scDAC and the other methods. Each row represents a clustering method. The three columns of circles, from left to right, represent the NMI, ARI and SC scores, respectively. The rectangles on the right represent the mean of these three scores. The size of each circle or rectangle corresponds to the score: larger means better performance. The color darkness of each circle or rectangle corresponds to the ranking: the lightest color marks the top rank. The clustering methods are sorted by mean score in descending order. (c) Bar plot of the Deviation Ratio (DR) between the labels predicted by each method and the true labels. The x axis shows the methods and the y axis shows the Deviation Ratio. Blue bars denote positive deviation and red bars denote negative deviation. Shorter bars indicate better results. AE+K-means, PCA+K-means, GLDC and SDCN were excluded from the DR comparison because they require the number of cell types as input. (d) Sankey plot for scDAC showing the correspondence between the predicted labels and the ground truth.

Fig. S7. Comparison of clustering performance between scDAC and other widely used methods on the Bhattacherjee dataset. (a) UMAP visualization of the low-dimensional representations produced by each method. The upper panel is annotated with true labels and the lower panel with predicted labels. The first column shows the UMAP plot for scDAC; the remaining columns show the other methods. (b) NMI, ARI, SC and mean scores of scDAC and the other methods. Each row represents a clustering method. The three columns of circles, from left to right, represent the NMI, ARI and SC scores, respectively. The rectangles on the right represent the mean of these three scores. The size of each circle or rectangle corresponds to the score: larger means better performance. The color darkness of each circle or rectangle corresponds to the ranking: the lightest color marks the top rank. The clustering methods are sorted by mean score in descending order. (c) Bar plot of the Deviation Ratio (DR) between the labels predicted by each method and the true labels. The x axis shows the methods and the y axis shows the Deviation Ratio. Blue bars denote positive deviation and red bars denote negative deviation. Shorter bars indicate better results. AE+K-means, PCA+K-means, GLDC and SDCN were excluded from the DR comparison because they require the number of cell types as input. (d) Sankey plot for scDAC showing the correspondence between the predicted labels and the ground truth.

Table S1. Description of the 10 scRNA-seq datasets used in the study

Table S2. Cell types and cell numbers included in the subsampled datasets