Deep single-cell RNA-seq data clustering with graph prototypical contrastive learning

Abstract

Motivation: Single-cell RNA sequencing (scRNA-seq) enables researchers to study cellular heterogeneity at the single-cell level. To this end, identifying the cell types of cells with clustering techniques becomes an important task for downstream analysis. However, challenges of scRNA-seq data, such as pervasive dropout phenomena, hinder obtaining robust clustering outputs. Although existing studies try to alleviate these problems, they fall short of fully leveraging the relational information and mainly rely on reconstruction-based losses that depend heavily on the data quality, which is sometimes noisy.

Results: This work proposes a graph-based prototypical contrastive learning method, named scGPCL. Specifically, scGPCL encodes cell representations using graph neural networks on a cell-gene graph that captures the relational information inherent in scRNA-seq data, and introduces prototypical contrastive learning to learn cell representations by pushing apart semantically dissimilar pairs and pulling together similar ones. Through extensive experiments on both simulated and real scRNA-seq data, we demonstrate the effectiveness and efficiency of scGPCL.

Availability and implementation: Code is available at https://github.com/Junseok0207/scGPCL.


Implementation Details
scGPCL is implemented in Python 3 (version 3.8.8) using PyTorch (https://pytorch.org/) with the PyTorch Geometric (https://github.com/pyg-team/pytorch_geometric) package. For all experiments, except for the learning rate, which is searched in {1e-2, 1e-3, 1e-4, 1e-5}, all hyper-parameters are fixed to default values and shared across all the datasets. Specifically, we use a two-layer GraphSAGE [12] encoder and a two-layer MLP decoder whose hidden dimensions are (256, 64) and (64, 256), respectively. To train our model, we use the AdamW [18] optimizer. The batch size (i.e., the number of anchor cell nodes) is set to 1024, and the numbers of cell nodes and gene nodes in the subgraph are set to 2048 and 1024, respectively, where these nodes are sampled using HGSampler in PyTorch Geometric. Throughout all experiments, the maximum pre-training and fine-tuning epochs are fixed to 200 and 100, respectively, and the thresholds of the convergence tests for the pre-training phase (i.e., r) and the fine-tuning phase (i.e., tol) are set to 0.99 and 0.1, respectively. The temperature hyper-parameter for the contrastive loss (i.e., τ) is set to 0.25, and the degree of freedom of the Student's t-distribution (i.e., α) is set to 1. For the prototypical contrastive learning, we set the cluster sizes to K, 1.5K, and 2K, where K is the ground-truth cluster number in each dataset. In addition, the balance coefficients λ1, λ2, and λ3 are set to 1.0, 0.05, and 1.0, respectively, and λ2 is set to 0 during the first 100 warm-up epochs. All the experiments are conducted on an Intel Xeon Silver 4210R CPU and an NVIDIA GeForce RTX 3090 (24GB).
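For quick reference, the default settings above can be collected into a single configuration sketch. This is a hedged illustration only: the key names below are ours, not the authors' actual variable names.

```python
# Default hyper-parameters of scGPCL as described above (illustrative names).
DEFAULT_CONFIG = {
    "lr_search_space": [1e-2, 1e-3, 1e-4, 1e-5],  # only the learning rate is tuned
    "encoder_dims": (256, 64),       # two-layer GraphSAGE encoder
    "decoder_dims": (64, 256),       # two-layer MLP decoder
    "batch_size": 1024,              # number of anchor cell nodes
    "subgraph_cells": 2048,          # cell nodes per sampled subgraph
    "subgraph_genes": 1024,          # gene nodes per sampled subgraph
    "pretrain_epochs": 200,
    "finetune_epochs": 100,
    "r": 0.99,                       # convergence threshold, pre-training
    "tol": 0.1,                      # convergence threshold, fine-tuning
    "tau": 0.25,                     # contrastive temperature
    "alpha": 1,                      # Student's t degrees of freedom
    # cluster sizes K, 1.5K, 2K, with K the ground-truth cluster number
    "cluster_sizes": lambda K: [K, int(1.5 * K), 2 * K],
    "lambda1": 1.0,
    "lambda2": 0.05,                 # set to 0 for the first 100 warm-up epochs
    "lambda3": 1.0,
}
```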

Evaluation metrics
To compare the clustering performance of scGPCL with the state-of-the-art baselines, we use four standard evaluation metrics, i.e., NMI, CA, ARI, and F1-score, defined as follows: 1) Normalized Mutual Information (NMI) evaluates the clustering quality by measuring the uncertainty of predicted class labels. Specifically, when the ground-truth cell types S and the cluster assignments of a model C are given, NMI is calculated as follows:

$$\mathrm{NMI}(S, C) = \frac{2\,I(S, C)}{H(S) + H(C)}$$

where I(·, ·) denotes the mutual information between two distributions, and H denotes the entropy function. NMI ranges between 0.0 and 1.0 and becomes higher when the predicted cluster assignment and the ground truth are well aligned.
2) Clustering Accuracy (CA) measures the clustering performance in a manner similar to that of supervised classification. More precisely, we find the best matching function, which maps the predicted cluster assignments to the ground-truth cell types, and then calculate the alignment between them. CA is computed as follows:

$$\mathrm{CA} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[s_i = m(c_i)\right]$$

where N is the number of instances, m is the matching function that maps the predicted cluster assignments to the ground-truth cell types, and s_i and c_i denote the ground-truth cell type and the predicted cluster assignment of the i-th cell, respectively.
3) Adjusted Rand Index (ARI) adjusts the Rand Index (RI), which is defined as

$$\mathrm{RI} = \frac{a + b}{\binom{N}{2}}$$

where a represents the number of pairs that correctly belong to the same cluster, while b represents the number of pairs correctly labeled as not belonging to the same cluster. ARI is computed as follows:

$$\mathrm{ARI} = \frac{\mathrm{RI} - E[\mathrm{RI}]}{\max(\mathrm{RI}) - E[\mathrm{RI}]}$$

where E[RI] denotes the expectation of RI. ARI ranges between -1 and 1 and becomes larger when the agreement between the ground-truth cell types and the predicted cluster assignments is higher. 4) F1-score is the harmonic mean of precision and recall, which is suitable for imbalanced data, where precision is the proportion of true positive instances among the positively predicted instances and recall is the proportion of true positive instances among all positive instances in the binary classification setting. F1-score is computed as follows:

$$\mathrm{F1} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

To measure the clustering performance, we also find the best matching function to map the predicted cluster assignments to the ground-truth cell types, and generalize to the multi-class setting by applying the micro and macro averaging schemes (i.e., Micro-F1 and Macro-F1, respectively).
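The metrics above can be sketched in a few lines with scikit-learn and SciPy. This is an illustrative implementation, not the authors' evaluation code; the best matching function m is found with the Hungarian algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import (adjusted_rand_score, f1_score,
                             normalized_mutual_info_score)

def best_matching(s, c):
    """Map predicted cluster ids to ground-truth labels (the function m)."""
    s, c = np.asarray(s), np.asarray(c)
    D = max(s.max(), c.max()) + 1
    cost = np.zeros((D, D), dtype=int)
    for si, ci in zip(s, c):
        cost[ci, si] += 1                      # co-occurrence counts
    row, col = linear_sum_assignment(-cost)    # Hungarian: maximize agreement
    mapping = dict(zip(row, col))
    return np.array([mapping[ci] for ci in c])

def clustering_accuracy(s, c):
    """CA = fraction of cells whose mapped cluster equals the true type."""
    return float(np.mean(np.asarray(s) == best_matching(s, c)))

s = [0, 0, 1, 1, 2, 2]   # toy ground-truth cell types
c = [1, 1, 0, 0, 2, 2]   # toy predicted clusters (a relabeling of s)
print(normalized_mutual_info_score(s, c))                 # NMI
print(clustering_accuracy(s, c))                          # CA
print(adjusted_rand_score(s, c))                          # ARI
print(f1_score(s, best_matching(s, c), average="macro"))  # Macro-F1
```

Since the toy prediction is a pure relabeling of the ground truth, all four metrics reach their maximum of 1.0.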

Comparison between different graph structures
To validate the impact of the input data types for the cell clustering task, in Supplementary Figure S1, we compare the clustering performance of encoders applied to three different input data types (i.e., a Multi-Layer Perceptron (MLP) on cell features, GNNs on a cell-cell graph, and GNNs on a cell-gene graph). We observe that GNNs on both the cell-cell graph and the cell-gene graph show good performance at relatively low dropout rates (i.e., < 40%) thanks to the reflection of the relational information between cells, while the performance of GNNs on the cell-cell graph significantly degrades when the dropout rate is higher than 40%. We argue that this is because the highly sparse gene expression matrix makes it hard to accurately compute the similarities between cells, which leads to a drop in the quality of the cell-cell graph. In consequence, this incurs a negative effect on the neighborhood aggregation scheme of GNNs, which eventually results in poor performance of GNNs on the cell-cell graph.
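Constructing the cell-gene graph is straightforward: a cell node is connected to a gene node whenever that gene has a nonzero count in that cell. A hedged sketch with a SciPy sparse biadjacency matrix (the function name is ours):

```python
import numpy as np
from scipy.sparse import coo_matrix

def build_cell_gene_graph(counts):
    """counts: (n_cells, n_genes) matrix; returns a sparse biadjacency matrix
    with an edge wherever a gene is expressed (count > 0) in a cell."""
    counts = np.asarray(counts)
    cells, genes = np.nonzero(counts)          # edges where expression > 0
    data = np.ones(len(cells), dtype=np.int8)
    return coo_matrix((data, (cells, genes)), shape=counts.shape)

# Toy count matrix: 2 cells x 3 genes.
X = np.array([[0, 3, 1],
              [2, 0, 0]])
A = build_cell_gene_graph(X)
print(A.nnz)  # 3 edges: cell0-gene1, cell0-gene2, cell1-gene0
```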

Ablation study regarding different reconstruction losses
We compare three reconstruction-based losses (i.e., negative ZINB likelihood, negative NB likelihood, and MSE losses). In Supplementary Figure S15, we observe that when applied to scGPCL, all of them achieve similar clustering performance, in contrast to the case when they are applied to existing methods (i.e., DCA, scDeepCluster, and scDSC), as shown in [9,22,11]. We attribute this to the fact that, unlike existing methods, scGPCL learns the cell representations using both the instance-wise and the prototypical contrastive losses, and thus the reconstruction loss is not the main component of scGPCL in terms of learning the cell representations. Through these results, we verify that the MSE loss can be an option for learning cell representations, which enjoys a faster training time than the ZINB loss, as reported in Supplementary Figure S27. In addition, we show the benefit of the fine-tuning phase of scGPCL and provide the sensitivity of the associated balance coefficient (i.e., λ3) in Supplementary Figures S16 and S19, respectively.
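For concreteness, the ZINB negative log-likelihood compared above can be sketched in NumPy (the actual loss in scGPCL is a PyTorch module; the parameterization below, with mean mu, dispersion theta, and dropout probability pi, is the standard one and the function name is ours):

```python
import numpy as np
from scipy.special import gammaln

def zinb_nll(x, mu, theta, pi, eps=1e-10):
    """Mean negative log-likelihood of a zero-inflated negative binomial.

    x: observed counts, mu: NB mean, theta: dispersion, pi: dropout probability.
    """
    x, mu, theta, pi = map(np.asarray, (x, mu, theta, pi))
    # log NB(x; mu, theta)
    log_nb = (gammaln(x + theta) - gammaln(theta) - gammaln(x + 1)
              + theta * np.log(theta / (theta + mu) + eps)
              + x * np.log(mu / (theta + mu) + eps))
    # zero-inflated mixture: pi * 1[x == 0] + (1 - pi) * NB(x; mu, theta)
    log_mix = np.where(
        x < eps,
        np.log(pi + (1 - pi) * np.exp(log_nb) + eps),
        np.log(1 - pi + eps) + log_nb,
    )
    return -float(log_mix.mean())

loss = zinb_nll(x=[0, 2, 5], mu=[1.0, 2.0, 4.0], theta=2.0, pi=0.1)
```

Dropping the zero-inflation term (pi = 0) recovers the plain NB loss, which is one reason the two behave similarly when the reconstruction loss is not the main training signal.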

Ablation study regarding different augmentation strategies
Recall that the augmentation scheme is composed of two steps: sampling two different subgraphs and feature masking. Hence, to verify the effectiveness of the augmentation scheme, we conduct ablation studies on each step in Supplementary Figure S20. First, to verify the effectiveness of the feature masking step, we run scGPCL without masking the features of the bipartite cell-gene graph (we refer to this as w/o feature masking). Second, to verify the effectiveness of sampling different subgraphs, we sample a single subgraph and use it in both views (we refer to this as same subgraph). Experimental results show that removing either of the two steps generally degrades the performance, and especially sampling the same subgraph for both views incurs a large performance drop. Through these results, we argue that both the feature masking and subgraph sampling steps are important, and especially that subgraph sampling is a key augmentation strategy in scGPCL. It is worth noting that the subgraph sampling step not only plays a role as an augmentation strategy, but also alleviates the memory issue incurred by the large bipartite cell-gene graph.
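The two-step augmentation can be mimicked in NumPy as a sketch. This only illustrates the idea; the real implementation samples bipartite subgraphs with PyG's HGSampler, and the function names and mask rate below are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_subgraph(n_cells, n_genes, n_cell_sub, n_gene_sub):
    """Step 1: pick a random subset of cell and gene nodes for one view."""
    cells = rng.choice(n_cells, size=n_cell_sub, replace=False)
    genes = rng.choice(n_genes, size=n_gene_sub, replace=False)
    return cells, genes

def mask_features(X, mask_rate=0.2):
    """Step 2: randomly zero out a fraction of the feature entries."""
    mask = rng.random(X.shape) < mask_rate
    return np.where(mask, 0.0, X)

X = rng.random((100, 50))  # toy cell-feature matrix
# Two views: different subgraphs, independently masked features.
view1 = mask_features(X[sample_subgraph(100, 50, 64, 32)[0]])
view2 = mask_features(X[sample_subgraph(100, 50, 64, 32)[0]])
```

The "same subgraph" ablation corresponds to reusing one sampled subgraph for both views, which removes most of the diversity between them.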

Hyper-parameter sensitivity analysis
We provide several sensitivity analysis results to check the effect of each hyper-parameter in scGPCL. First of all, we report the performance of scGPCL with different numbers of anchor cell nodes (i.e., batch sizes) in Supplementary Figure S21. The performance does not change significantly as the number of anchor cell nodes varies, which shows that scGPCL is robust to the number of anchor cell nodes.
In Supplementary Figure S22, we check whether the number of clustering processes (i.e., T) in the prototypical contrastive loss affects the performance of scGPCL. We observe that there is no significant change when the number of clustering processes is varied, but there is a trend that conducting clustering many times helps to increase model performance. We argue that this verifies that finding semantically similar groups across various granularities can help the training process of scGPCL. In addition, in Supplementary Figure S23, we also report the sensitivity analysis regarding the cluster sizes in the prototypical contrastive loss. It shows that it is better to select numbers of clusters similar to the number of ground-truth clusters, but the performance is robust within a reasonable range.
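Clustering at multiple granularities, as discussed above, can be sketched by running k-means with cluster sizes K, 1.5K, and 2K, each run supplying its own set of prototypes. This is an illustrative sketch (the function name is ours; scGPCL clusters learned cell embeddings, not raw features):

```python
import numpy as np
from sklearn.cluster import KMeans

def multi_granularity_prototypes(Z, K, seed=0):
    """Return (assignments, prototypes) for cluster sizes K, 1.5K, and 2K."""
    results = []
    for k in (K, int(1.5 * K), 2 * K):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(Z)
        results.append((km.labels_, km.cluster_centers_))
    return results

Z = np.random.default_rng(0).random((60, 8))   # toy cell embeddings
runs = multi_granularity_prototypes(Z, K=4)
print([centers.shape[0] for _, centers in runs])  # [4, 6, 8]
```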
Finally, in Supplementary Figure S24, we generate more views to sample more negative samples from differently augmented graphs. Through this experiment, we find that using more views does not improve the performance of scGPCL. We argue that this is because using two views in scGPCL already provides sufficient negative samples for training.

Scalability of the scGPCL
In practice, the size of single-cell data increases as sequencing technology advances, and accordingly, scalability with respect to the number of cells is one of the important evaluation factors when comparing various models. To verify the scalability of scGPCL, we generate simulated datasets by varying the number of cells from 1K to 100K (i.e., [1K, 5K, 10K, 30K, 50K, 70K, 100K]) with 3000 genes, while the other hyper-parameters are set to default values. In Supplementary Figure S26a, we observe that the running time of scGPCL in the pre-training and fine-tuning phases does not increase exponentially as the number of cells increases, and remains reasonable even when the number of cells is extremely large (i.e., the total running time of scGPCL for 100K cells is 8,887 seconds). Additionally, we report the performance of scGPCL as the number of cells increases in Supplementary Figure S28 to show that it can obtain accurate clustering performance regardless of the number of cells. Through these results, we verify the scalability of scGPCL from both the efficacy and efficiency perspectives. This implies that scGPCL can be applied to large-scale scRNA-seq datasets and is thus practical in reality.

Marker gene identification
To provide biological interpretability of the clustering results obtained from scGPCL, we report the marker gene identification result [24] on the Zeisel dataset. Specifically, we find the top 10 differentially expressed genes (DEGs) using the Wilcoxon rank-sum test based on the clusters detected by each method and compute the overlap with the gold-standard cell types. In Supplementary Figure S29b, the DEGs computed based on the clustering results obtained from scGPCL highly focus on one cell type, except for the 'ependymal' cell type that is hard to detect with the ground-truth cell type (upper left). Note that scGPCL succeeds in learning a cluster with the 'mural' cell type, whereas the other baseline methods fail to do so. We argue that this is because, as shown in Supplementary Figure S29a, the 'mural' cell type belongs to a minority class with a small number of cells. Through this result, we can annotate clusters 1, 2, 4, 5, 7, 8, and 9 as the oligodendrocytes, microglia, interneurons, mural, astrocytes, ca1pyramidal, endothelial, and s1pyramidal cell types, respectively, and once again demonstrate that scGPCL can learn cells that belong to minority cell types from a biological perspective. We also report the ranking of the DEGs in different clusters in Supplementary Figure S30.
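The DEG procedure above can be sketched with SciPy's rank-sum test: for a given cluster, each gene's expression inside the cluster is compared against the rest, and the top-scoring genes are reported as markers. A hedged sketch on toy data (the function name and ranking-by-statistic choice are ours; dedicated tools such as scanpy's `rank_genes_groups` do this with proper multiple-testing correction):

```python
import numpy as np
from scipy.stats import ranksums

def top_degs(X, labels, cluster, n_top=10):
    """Rank genes by the Wilcoxon rank-sum statistic for `cluster` vs the rest."""
    labels = np.asarray(labels)
    in_c, out_c = X[labels == cluster], X[labels != cluster]
    stats = np.array([ranksums(in_c[:, g], out_c[:, g]).statistic
                      for g in range(X.shape[1])])
    return np.argsort(-stats)[:n_top]          # most up-regulated genes first

rng = np.random.default_rng(0)
X = rng.random((50, 20))
X[:25, 3] += 5.0                               # gene 3 marks cluster 0
labels = np.array([0] * 25 + [1] * 25)
print(top_degs(X, labels, cluster=0, n_top=1))  # [3]
```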

Gene set enrichment analysis
To further investigate the biological meaning of the clusters obtained from scGPCL, we perform Gene Set Enrichment Analysis (GSEA) on the mural cluster (i.e., cluster 5) in the Zeisel dataset, which is specifically identified by scGPCL compared with the baseline methods. Specifically, we select the top 100 DEGs from cluster 5, annotated as the mural cell type, and perform KEGG enrichment analysis using these genes. As shown in Supplementary Figure S31, the focal adhesion pathway, which plays a crucial role in the development and maintenance of cell-matrix adhesions, obtained the highest score. This result is biologically relevant, as ephrin-B2 in mural cells plays a role in regulating the formation and turnover of focal adhesions [10]. Additionally, we also report the results of gene ontology (GO) enrichment analysis for biological process, cellular component, and molecular function in Supplementary Figure S32.

Cell-Cell Communication analysis
We conduct cell-cell communication (CCC) inference to validate whether scGPCL accurately captures intercellular relationships. To assess the performance of CCC, we use the Mouse Brain Atlas data [21], which has a gold-standard interaction result, following the evaluation protocol suggested by the Open Problems in Single-Cell Analysis group (https://openproblems.bio/results/cell_cell_communication_source_target/). Specifically, we employ the liana [7] framework and use CellPhoneDB [8] as the CCC method, using the different clustering results from each clustering method. In Supplementary Figure S33, we compare the performance of CCC inference with the odds ratio [2], calculated as the ratio of true and false positives among the top 10% ranked interactions versus the same ratio for the remainder of the interactions. The result demonstrates that scGPCL shows performance comparable to that obtained when using the ground-truth cell types in detecting interactions between source and target cell types, whereas the other baseline methods do not perform well. Based on this finding, we verify that scGPCL effectively captures cell-cell relationships.
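The odds-ratio metric described above can be sketched directly: compare the true/false-positive ratio among the top 10% ranked interactions with the same ratio over the remainder. This is an illustrative sketch under that definition (the function name, the epsilon guard, and the toy data are ours):

```python
import numpy as np

def odds_ratio(scores, is_true, top_frac=0.1, eps=1e-10):
    """scores: interaction rankings (higher = more confident);
    is_true: gold-standard 0/1 labels per interaction."""
    order = np.argsort(-np.asarray(scores))
    k = max(1, int(len(order) * top_frac))
    top = np.asarray(is_true)[order[:k]]       # top 10% ranked interactions
    rest = np.asarray(is_true)[order[k:]]      # remainder
    top_odds = top.sum() / (len(top) - top.sum() + eps)
    rest_odds = rest.sum() / (len(rest) - rest.sum() + eps)
    return top_odds / (rest_odds + eps)

# Toy example: a perfect ranking puts all 10 true interactions on top.
scores = np.linspace(1.0, 0.0, 100)
is_true = np.array([1] * 10 + [0] * 90)
ratio = odds_ratio(scores, is_true)
```

A ratio well above 1 means true interactions concentrate at the top of the ranking, which is the behavior reported for scGPCL-derived clusters.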

Baseline methods
The clustering performance of scGPCL is compared with eight state-of-the-art baseline methods, including graph-based methods, an instance-wise contrastive learning method, ZINB-based autoencoder methods, graph-based deep learning methods, and non-deep-learning-based methods.
• Clustering through imputation and dimensionality reduction (CIDR) [17] implicitly imputes gene expression data to alleviate the impact of dropout and then calculates a dissimilarity matrix. After that, it performs PCoA and clustering using the first few principal coordinates.
• Deep embedded clustering (DEC) [26] jointly optimizes the feature representation and cluster assignments using the self-training strategy.
• Single-cell model-based deep embedded clustering (scDeepCluster) [22] simultaneously learns the cell representation and clustering assignment following DEC, and replaces the MSE loss with negative likelihood of ZINB distribution.
• Single-cell graph neural networks (scGNN) [25] aggregates the relationship between cells in cell-cell graph using GNNs and adopts left-truncated mixture Gaussian (LTMG) model to reflect the heterogeneous gene expression patterns. In addition, it infuses cell-type specific information using cluster autoencoder.
• Contrastive self-supervised clustering of scRNA-seq data (contrastive-sc) [6] adopts instance-wise contrastive learning scheme for the scRNA-seq data by randomly generating two differently augmented views in terms of cell features.
• Structural deep clustering for single-cell RNA-seq data (scDSC) [11] enhances scDeepCluster by jointly optimizing ZINB-based autoencoder and the GNNs on cell-cell graph to aggregate cell-cell relationship.
• scNAME [24] utilizes an auxiliary mask estimation task to capture gene relevance by distinguishing the uncorrupted structure, and achieves intra-cluster compactness and inter-cluster separation using a neighborhood contrastive loss that enhances the similarity between k-nearest neighbors.
For scDeepCluster [22] and scDCC [23], we conduct different pre-processing steps on the raw read count data, following the steps reported in each paper to satisfy their own hypotheses. For all deep-learning-based baseline methods, we search for the best learning rate among {1e-2, 1e-3, 1e-4, 1e-5} and their default learning rate, and the other hyper-parameters are set to their default values. We utilize the official code for each baseline, summarized in Supplementary Table S2.

Data Availability
• The Camp dataset [3] was sequenced to understand the liver bud (LB) by analyzing its component lineages. We downloaded the dataset from GSE81252.
• The Mouse Embryonic Stem cells (Mouse ES cells) dataset [16] was obtained via a high-throughput droplet-microfluidic approach for parallel barcoding, and the authors reveal the heterogeneous onset of differentiation after leukemia inhibitory factor (LIF) withdrawal. We obtained the dataset from GSE65525 and concatenated the data separated by the number of days from LIF withdrawal (i.e., 0 days, 2 days, 4 days, and 7 days).
• The Mouse bladder cells dataset [13], obtained from the Mouse Cell Atlas (MCA) project, was sequenced by Microwell-seq, a high-throughput scRNA-seq platform. The batch-gene-removed digital gene expression (DGE) matrix was downloaded from https://figshare.com/s/865e694ad06d5857db4b, and we define the labels following the authors' annotation of cells.
• The Zeisel dataset [28], which reveals the cell types of the mouse cortex and hippocampus using a biclustering method, was counted using unique molecular identifier (UMI) assays. The dataset was obtained from GSE60361, and the corresponding cell types can be found at http://linnarssonlab.org/cortex/.
• The Worm neuron cells dataset [4], which profiles nearly 50,000 single-cell transcriptomes in the nematode Caenorhabditis elegans at the L2 stage, was sequenced by the sci-RNA-seq (single-cell combinatorial indexing RNA sequencing) platform. We follow the instructions (https://atlas.gs.washington.edu/worm-rna/docs/) to access the raw data and use the subset of neuronal cells. To evaluate alignment with the predicted cluster assignments, we exclude the cells labeled 'Unclassified neurons'.
• The 10X PBMC dataset [29], which profiles the peripheral blood mononuclear cells from a healthy donor, was collected by the 10X scRNA-seq platform. We obtained the filtered gene barcode matrix from https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/pbmc4k and use the graph-based clustering result as labels, following https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/analysis, to evaluate the clustering performance.
• The Human kidney cells dataset [27], which identifies the cellular identity of renal tumors, was profiled by the 10X scRNA-seq platform. We obtained the dataset released by the authors (https://github.com/xuebaliang/scziDesk/tree/master/dataset/Young).
• The Baron dataset [1], which determines individual pancreatic cells from human and mouse, was generated by a droplet-based method. We obtained the dataset from GSE84133 and conduct experiments using the human cells.
• The Shekhar mouse retina cells dataset [20], which identifies cell types of the mouse retinal bipolar cells (BCs), was profiled by the Drop-seq platform. The dataset was obtained from GSE81904, and the corresponding cell types can be found at https://scrnaseq-public-datasets.s3.amazonaws.com/index.html?prefix=manual-data/shekhar/. We remove the cells that contain more than 10% mitochondrially derived transcripts and then filter out cells in which ≤ 500 genes are detected. After that, we also remove genes that are detected in ≤ 30 cells or have ≤ 60 transcripts, following the preprocessing steps from the Hemberg group (https://github.com/hemberg-lab/scRNA.seq.datasets/blob/master/R/shekhar.R).
The statistics of the datasets are summarized in Table S1. Note that except for the Shekhar mouse retina cells dataset, which already filters out some cells and genes, we filter out the cells and genes with no count and keep the top 20% highly variable genes, following the best practice [19], which recommends selecting between 1000 and 5000 highly variable genes.
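The filtering described above can be sketched in NumPy: drop empty cells and unexpressed genes, then keep the top 20% of genes by a simple dispersion statistic. This is a hedged illustration only; the paper follows the standard highly-variable-gene selection practice (e.g., as implemented in scanpy), and the dispersion ranking below is our simplification.

```python
import numpy as np

def preprocess(X, hvg_frac=0.2):
    """Drop zero-count cells/genes, then keep the top `hvg_frac` variable genes."""
    X = np.asarray(X, dtype=float)
    X = X[X.sum(axis=1) > 0]                   # remove cells with no counts
    X = X[:, X.sum(axis=0) > 0]                # remove unexpressed genes
    dispersion = X.var(axis=0) / (X.mean(axis=0) + 1e-10)
    n_keep = max(1, int(X.shape[1] * hvg_frac))
    hvg = np.argsort(-dispersion)[:n_keep]     # top 20% most variable genes
    return X[:, hvg]

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(30, 50)).astype(float)  # toy count matrix
print(preprocess(X).shape)
```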

Future Works
Although scGPCL alleviates the sampling bias of the instance-wise contrastive loss using the prototypical contrastive loss, there still remain misaligned cells that may cause inefficiency in the contrastive learning framework due to the unsupervised nature of the clustering task. Recently, scDCC [23] extended the task of unsupervised scRNA-seq clustering to a semi-supervised setting by using available prior knowledge. Since existing studies [15,5,14] in the computer vision domain show that cancelling false negative pairs [5,14] or adding true positive pairs [15] can help eliminate the bias of the contrastive learning framework, we expect that scGPCL can be enhanced by guiding unlabeled cells using a few labeled cells under the semi-supervised setting.

Supplementary Figures
Supplementary Figure S1: Performance comparison of different encoder structures on the simulated dataset over various dropout rates. We construct autoencoder-style models that have the same two-layer MLP decoders, which output the three parameters of the ZINB distribution (i.e., mean, dispersion, and dropout probability), and three different encoders: an MLP, GNNs on a cell-cell graph, and GNNs on a cell-gene graph, respectively. More precisely, the Pearson correlation is used to calculate the similarity between cells to construct the cell-cell graphs, following scGNN [25] and scDSC [11], and we conduct experiments by varying the number of nearest neighbors (e.g., Cell-Cell Graph (10) represents the 10-nearest-neighbors graph based on the Pearson correlation). The cell-gene graph is constructed by connecting a cell node and a gene node if the gene is expressed in that cell, as proposed in scGPCL.

Supplementary Figure S15: Ablation study on scGPCL with different reconstruction-based losses.
Supplementary Figure S16: Ablation study regarding the benefit of the fine-tuning phase in scGPCL.
Supplementary Figure S20: Ablation study regarding different augmentation strategies in scGPCL.
Supplementary Figure S21: Sensitivity analysis regarding the number of anchor nodes (i.e., batch size) in scGPCL. Note that since there exist only 777 cells in Camp dataset, we report the result of 512 and 777 anchor nodes for Camp dataset.
Supplementary Figure S23: Sensitivity analysis regarding the cluster sizes in the prototypical contrastive loss of scGPCL. In this experiment, K is the ground-truth cluster number in each dataset, and [a, b, c] means that we conduct clustering three times with cluster sizes a, b, and c.

Supplementary Algorithms
Algorithm S1 Algorithm for the pre-training phase of scGPCL.

17: Calculate the clustering-task-oriented loss L_Cluster using Q and P (Eqn. 11)
18: end for