CLAIRE: contrastive learning-based batch correction framework for better balance between batch mixing and preservation of cellular heterogeneity

Abstract Motivation Integration of growing single-cell RNA sequencing datasets helps better understand cellular identity and function. The major challenge for integration is removing batch effects while preserving biological heterogeneities. Advances in contrastive learning have inspired several contrastive learning-based batch correction methods. However, existing contrastive-learning-based methods exhibit noticeable ad hoc trade-off between batch mixing and preservation of cellular heterogeneities (mix-heterogeneity trade-off). Therefore, a deliberate mix-heterogeneity trade-off is expected to yield considerable improvements in scRNA-seq dataset integration. Results We develop a novel contrastive learning-based batch correction framework, CIAIRE, which achieves superior mix-heterogeneity trade-off. The key contributions of CLAIRE are proposal of two complementary strategies: construction strategy and refinement strategy, to improve the appropriateness of positive pairs. Construction strategy dynamically generates positive pairs by augmenting inter-batch mutual nearest neighbors (MNN) with intra-batch k-nearest neighbors (KNN), which improves the coverage of positive pairs for the whole distribution of shared cell types between batches. Refinement strategy aims to automatically reduce the potential false positive pairs from the construction strategy, which resorts to the memory effect of deep neural networks. We demonstrate that CLAIRE possesses superior mix-heterogeneity trade-off over existing contrastive learning-based methods. Benchmark results on six real datasets also show that CLAIRE achieves the best integration performance against eight state-of-the-art methods. Finally, comprehensive experiments are conducted to validate the effectiveness of CLAIRE. Availability and implementation The source code and data used in this study can be found in https://github.com/CSUBioGroup/CLAIRE-release. Supplementary information Supplementary data are available at Bioinformatics online.


Cells
Step 1 Step 2 Step 3 Step 4 Step • MAT 2 : Python package MAT 2 was used for all datasets. All datasets were preprocessed as described in our manuscript. MAT 2 needs precomputed anchors between batches for integration. We used the anchors exported from Seurat::FindIntegrationAnchors function for MAT 2 . Preprocessed dataset and inter-batch anchors were used as input for MAT 2 . MAT2.BuildMAT2.train function and MAT2.BuildMAT2.evaluate function were used to integrate preprocessed datasets, which produced batch corrected low-dimensional embeddings. Training and inference both used default parameters.
• SMILE: Python package SMILE was used for all datasets. All datasets were preprocessed as described in our manuscript. SMILE further standardized features of each dataset by removing the mean and scaling to unit variance. SMILE.SMILE.SMILE_trainer function was used to integrate the preprocessed datasets with default parameters. Then, the preprocessed datasets were input to the encoder network in mini batch (due to batch normalization layer in the network) to obtain batch corrected low-dimensional embeddings.
• CLEAR: Python package CLEAR was used for all datasets. CLEAR script provided many options for preprocessing, such as filtering, log-normalization, selection of highly variable genes, and so on. Through our experiments, we found that using log-normalization and selection of HVGs produced consistently good results. Preprocessed dataset was saved in .h5ad format and its path was used as the parameter '-input_h5ad_path' when running CLEAR.py in terminal. Other parameters were as default.
• iSMNN: R package iSMNN was used for all datasets. iSMNN exploits cell type annotations and performs iterative mutual nearest neighbors refinement to improve MNNCorrect [2]. If cell type annotations unknown, iSMNN clusters each batch separately and annotates clusters using known marker genes. In our experiment, the ground truth cell type annotations were used as input of iSMNN. Each dataset was preprocessed sequentially using Seurat::SplitObject, Seurat::NormalizeData, Seurat::FindVariableFeatures (top 2000 by default, top 5000 for Muris dataset). Then, preprocessed datasets and cell type annotations were used as input for iSMNN function with 'Short.run' strategy and 5 iterations. Other parameters were as default. Note that iSMNN requires at least one shared cell type among all batches and the minimum number of cells with a shared cell type in any batch should be greater than 'k.anchor' (default=5). In Lung dataset, there exists multiple shared cell types among all batches but the number of cells with a shared cell type in some batches is lower than 5, which will cause running errors. We tried to turn down the 'k.anchor' parameter but found it would cause other issues.

Supplementary Note 2 A. Label transfer between scRNA-seq datasets
Label transfer means leveraging information from annotated datasets (reference datasets) to automatically annotate cells in those datasets without cell type annotations (query datasets). Based on CLAIRE's integrated low-dimensional embeddings, k-nearest neighbors (kNN) classifier can be built between reference datasets and query datasets to infer the types of query cells. Other competing methods can also use kNN classifier to transfer labels. We included seven competing methods including Scanorama, Seurat, Harmony, INSCT, MAT 2 , SMILE, and CLEAR for a comparison. The evaluation metric is accuracy which is computed on the common cell types between reference datasets and query datasets. We set the k of kNN classifier to 10 for all methods.
MouseCellAtlas, PBMC, Immune Human, and Lung datasets were used for experiments. Details of four datasets can be found in Table 1 from our manuscript. For MouseCellAtlas and PBMC datasets, label transfer is performed from 'Batch 1' to 'Batch 2' and from 'Batch 2' to 'Batch 1'. Average accuracy of two transfer tasks is used as the results for these two datasets. For Immune Human datasets, batch 'Oetjen_A' contains the most diverse cell types (16, 16 in total), which is used as reference datasets and the other batches are used as query datasets. For Lung dataset, batch '5' contains the most diverse cell types (15, 17 in total), which is used as reference dataset and the other batches are used as query datasets. Supplementary Table S4 showed the benchmarking results. It can be observed that CLAIRE achieved accurate label transfer on the first three datasets and reached the highest accuracy in two out of four datasets.

B. Cross-omics label transfer
CLAIRE can transfer cell type information between scRNA-seq and scATAC-seq. PBMC multiome dataset 1 was used for experiment. In this dataset, scRNA-seq and scATAC-seq profiles were simultaneously collected in the same cells. We treated the datasets as originating from two different experiments. As CLAIRE needs matched input features between datasets, we generated a rough estimate of the transcriptional activity of each gene by quantifying ATAC-seq counts in the 2 kbupstream region and gene body, using the GeneActivity function in the Signac (v.1.7.0) [3] package. There are 19 cell types in total. The number of total cells is 20824 (10412+10412) and the number of common genes between two modalities is 18353. Two label transfer tasks are defined on this dataset: transfer labels from RNA to ATAC and transfer labels from ATAC to RNA. Accuracy is used as the evaluation metric.
We performed the same preprocessing steps as described in our manuscript and fed the preprocessed dataset to CLAIRE for training and inference. On the integrated low-dimensional embeddings, kNN classifier was used to infer cell types of query cells. For a comparison, MAT 2 was also tested on this dataset and its integrated data matrix was used to transfer labels using kNN classifier. We set the k of kNN classifier to 10 for both methods. For RNA->ATAC transfer task, MAT 2 achieved 76% accuracy and CLAIRE achieved 85% accuracy. For ATAC->RNA transfer task, MAT 2 achieved 77% accuracy and CLAIRE achieved 86% accuracy. We visualized the integration results via UMAP in Supplementary Figure S12. It can be observed that CLAIRE achieved more sufficient omics mixing than MAT 2 and cellular heterogeneity in CLAIRE's results was clearer, which well explained CLAIRE's higher accuracy. Overall, these results demonstrated that CLAIRE can accurately transfer cell type annotations across scRNA-seq and scATAC-seq.

C. Trajectory analysis
We collected two mouse neocortex datasets from Du et al. [4]. The first dataset (dataset A) contains 10261 cortical cells from E10.5, E12.5, E14.5, E16.5 and E18.5 mouse embryos and the second dataset (dataset B) contains 6390 cortical cells from E11.5, E13.5, E15.5 and E17.5 mouse embryos. Biological variations are confounded by batch effect in raw data as shown in Supplementary Figure S13. The inferred diffusion pseudotime is ambiguous to understand. We applied CLAIRE to integrate these two datasets. Supplementary Figure S13 showed the integration results. It can be observed that CLAIRE successfully removed batch effect while preserving the heterogeneity between different cell states. Further, we evaluated whether CLAIRE preserved the contiguous structure of cells. Specifically, Scanpy's diffusion pseudotime [5] function was used for the integrated data, raw dataset A, and raw dataset B to infer their pseudotime, respectively. Following Ruan et al. [6], the NEC cells were specified as root cells for pseudotime inference. Then, we computed spearman correlation between the pseudotime of raw dataset A and its corresponding part in the integrated data. Correlation for dataset B was also computed. CLAIRE achieved correlation coefficients of 0.91 and 0.93 on dataset A and dataset B respectively, which suggested that CLAIRE can well preserves the contiguous structure of original data after batch correction.           Ablation study for learning rates on Pancreas dataset. 'lr=1e-4' is the default setting. When learning rate is smaller (1e-5), longer training epochs are needed for batch correction metrics to converge. Larger learning rate (lr=1e-3) performs similarly with lr=1e-4.   S13. UMAP visualization of mouse neocortex datasets using raw data and integrated data from CLAIRE. Cells are colored by batch labels in the first column, colored by cell types in the second column, and colored by diffusion pseudotime in the third column.