-
PDF
- Split View
-
Views
-
Cite
Cite
Wenbo Guo, Dongfang Wang, Shicheng Wang, Yiran Shan, Changyi Liu, Jin Gu, scCancer: a package for automated processing of single-cell RNA-seq data in cancer, Briefings in Bioinformatics, Volume 22, Issue 3, May 2021, bbaa127, https://doi.org/10.1093/bib/bbaa127
- Share Icon Share
Abstract
Molecular heterogeneities and complex microenvironments bring great challenges for cancer diagnosis and treatment. Recent advances in single-cell RNA-sequencing (scRNA-seq) technology make it possible to study cancer cell heterogeneities and microenvironments at single-cell transcriptomic level. Here, we develop an R package named scCancer, which focuses on processing and analyzing scRNA-seq data for cancer research. Except basic data processing steps, this package takes several special considerations for cancer-specific features. Firstly, the package introduced comprehensive quality control metrics. Secondly, it used a data-driven machine learning algorithm to accurately identify major cancer microenvironment cell populations. Thirdly, it estimated a malignancy score to classify malignant (cancerous) and non-malignant cells. Then, it analyzed intra-tumor heterogeneities by key cellular phenotypes (such as cell cycle and stemness), gene signatures and cell–cell interactions. Besides, it provided multi-sample data integration analysis with different batch-effect correction strategies. Finally, user-friendly graphic reports were generated for all the analyses. By testing on 56 samples with 433 405 cells in total, we demonstrated its good performance. The package is available at: http://lifeome.net/software/sccancer/.
Introduction
Cancer is one of the most lethal diseases worldwide. Cellular heterogeneities and complex microenvironments are major challenges for current cancer research. Recently, various single-cell RNA-sequencing (scRNA-seq) techniques have been developed and were widely applied to study cancer heterogeneities and microenvironments at single-cell level [1–6]. Alongside the technological breakthrough and application, several bioinformatics packages (such as Seurat [7], scater [8] and scanpy [9]) of scRNA-seq data analysis have been developed, which greatly facilitate the research in biological and medical fields. However, these tools are designed for general applications and mainly contain basic analyses, lacking consideration on cancer-specific features. Therefore, there is a great need for developing cancer-specific tools beyond basic and routine analyses.
Here, we developed a user-friendly and automated R package scCancer for cancer scRNA-seq data analysis. In the package, we introduced more comprehensive and adaptive quality control (QC) metrics for cancer research and also encapsulated basic scRNA-seq analyses steps, such as clustering and differential expression analysis. Furthermore, we integrated several specific computational analyses for cancer data, including major cell-type classifications of cancer microenvironment, cell malignancy estimation, intra-tumor heterogeneities analyses based on cellular phenotypes (cell cycle and stemness) and gene signatures, and cell–cell interaction estimation. The package can also implement multi-sample data integration analysis, including different strategies to correct batch effect. In current version, the package is mainly designed for droplet-based platforms, which can profile thousands of cells in a single run and are very popular for cancer research application scenarios [10], but it can also be adapted to analyze scRNA-seq data by other platforms.
To test the compatibility, scalability and accuracy of our package, we applied scCancer for systematically analyzing 433 405 cells in 56 samples from 6 data sets (Table 1) [3, 11]. Results show that the package can achieve good performance on cell-type classification, cell malignancy estimation, etc.
Data set . | Cancer type . | Number of sample . | Number of cell . | Sourcea . | Technologyb . |
---|---|---|---|---|---|
KC | Kidney cancer | 3 | 27 844 | Unpublished | 10X |
HCC | Hepatocellular carcinoma | 4 | 30 205 | Unpublished | 10X |
PDAC-CR | Pancreatic ductal adenocarcinoma | 20 | 102 703 | CRA001160 [3] | 10X |
HCC-NC | Hepatocellular carcinoma | 7 | 118 301 | GSE112271 [11] | 10X |
LUAD-TJ | Lung adenocarcinoma | 9 | 119 486 | CRA001963, CRA001477 | 10X |
LUAD-CC | Lung adenocarcinoma | 13 | 34 866 | GSE123902 | 10X |
Data set . | Cancer type . | Number of sample . | Number of cell . | Sourcea . | Technologyb . |
---|---|---|---|---|---|
KC | Kidney cancer | 3 | 27 844 | Unpublished | 10X |
HCC | Hepatocellular carcinoma | 4 | 30 205 | Unpublished | 10X |
PDAC-CR | Pancreatic ductal adenocarcinoma | 20 | 102 703 | CRA001160 [3] | 10X |
HCC-NC | Hepatocellular carcinoma | 7 | 118 301 | GSE112271 [11] | 10X |
LUAD-TJ | Lung adenocarcinoma | 9 | 119 486 | CRA001963, CRA001477 | 10X |
LUAD-CC | Lung adenocarcinoma | 13 | 34 866 | GSE123902 | 10X |
aIn the ‘Source’ column, starting with ‘CRA’ means the accession number of Genome Sequence Archive databases, and starting with ‘GSE’ means the accession number of NCBI GEO databases.
bThe scRNA-seq technologies for test data sets. In this column, ‘10X’ means ‘10X Genomics’.
Data set . | Cancer type . | Number of sample . | Number of cell . | Sourcea . | Technologyb . |
---|---|---|---|---|---|
KC | Kidney cancer | 3 | 27 844 | Unpublished | 10X |
HCC | Hepatocellular carcinoma | 4 | 30 205 | Unpublished | 10X |
PDAC-CR | Pancreatic ductal adenocarcinoma | 20 | 102 703 | CRA001160 [3] | 10X |
HCC-NC | Hepatocellular carcinoma | 7 | 118 301 | GSE112271 [11] | 10X |
LUAD-TJ | Lung adenocarcinoma | 9 | 119 486 | CRA001963, CRA001477 | 10X |
LUAD-CC | Lung adenocarcinoma | 13 | 34 866 | GSE123902 | 10X |
Data set . | Cancer type . | Number of sample . | Number of cell . | Sourcea . | Technologyb . |
---|---|---|---|---|---|
KC | Kidney cancer | 3 | 27 844 | Unpublished | 10X |
HCC | Hepatocellular carcinoma | 4 | 30 205 | Unpublished | 10X |
PDAC-CR | Pancreatic ductal adenocarcinoma | 20 | 102 703 | CRA001160 [3] | 10X |
HCC-NC | Hepatocellular carcinoma | 7 | 118 301 | GSE112271 [11] | 10X |
LUAD-TJ | Lung adenocarcinoma | 9 | 119 486 | CRA001963, CRA001477 | 10X |
LUAD-CC | Lung adenocarcinoma | 13 | 34 866 | GSE123902 | 10X |
aIn the ‘Source’ column, starting with ‘CRA’ means the accession number of Genome Sequence Archive databases, and starting with ‘GSE’ means the accession number of NCBI GEO databases.
bThe scRNA-seq technologies for test data sets. In this column, ‘10X’ means ‘10X Genomics’.
Methods
Workflow overview
The scCancer package has three modules. The first, named scStatistics, performed basic statistical analyses of raw data and QC. The second, named scAnnotation, performed functional data analyses and visualizations of single sample, including low-dimensional representation, clustering, cell-type classification, malignancy estimation, cellular phenotype scoring, gene signatures analysis, cell–cell interaction analysis, etc. The third, named scCombination, performed multi-sample integration, batch-effect correction and visualization. After all the computational analyses finished, detailed and graphical reports were automatically generated in user-friendly HTML format (Figure 1). The information of packages and methods used in scCancer is summarized in Table 2.
Quality control
For a droplet-based scRNA-seq expression matrix, the first step was to identify the droplets more likely to include real cells (Supplementary Figure S1, the supplementary materials section 1) [12]. However, the droplets annotated with cells (identified by preprocessing steps) frequently captured low-quality cells or more than one cell (doublets or multiplets) [13]. To avoid misleading results caused by these artifacts, the number of total unique molecular identifiers (UMIs), the number of expressed genes and the percentage of transcripts from mitochondrial or ribosomal genes were widely used to perform cell (or droplet) quality control [7–9, 14] (Supplementary Figure S2A and B, the supplementary materials section 2). Generally, extremely low gene detection number or UMI number means very few transcripts are successfully captured; abnormally high UMI or gene number may due to that two or more cells (usually called as doublets or multiplets) are included in one droplet; the vast majority of UMIs from mitochondrial frequently suggest that cells are lysed during the preparation of single-cell suspensions. Using these metrics, the droplets having low transcripts diversity, capturing lysed cells or doublets were filtered. Moreover, to detect doublets better, we also integrated an R package scds [15] to estimate doublet scores (Supplementary Figure S2C). This package contained two doublets identification strategies: (i) binary classification-based approach (bcds), which created artificial doublets and trained binary classifiers to discriminate them; and (ii) co-expression-based approach (cxds), which detected the co-expression profile of gene pairs that should not be expressed in one cell simultaneously. Except these common metrics, considering that solid tumor cells were easier to produce stress response during sample dissociation process, we took advantage of the results of van den Brink et al. [16] and set a dissociation-associated gene list. By calculating these genes expression percentage, we defined an extra metric to remove cells seriously affected by dissociation process. Besides, considering that different sample sources and experimental conditions may lead to different distributions of the QC metrics, we adaptively determined the filtering thresholds by detecting outliers from the distribution of metrics, instead of routine fixed cutoffs (Supplementary Figure S2A and B, the supplementary materials section 2). By analyzing all the 56 testing samples and identifying the upper thresholds of five cell-QC metrics, we found that different samples had similar thresholds distributions, but variances also existed due to sample-specific characteristics (Figure 2). So, we highly recommended that users should choose a proper cutoff by carefully examining the metrics distributions.

Steps . | Packages and references . | Methods . |
---|---|---|
Cell calling | DropletUtils_1.6.1 [12] | Detecting deviations from the ambient profile |
Ambient RNA | SoupX_1.0.1 [17] | Estimating background (soup) fraction |
Basic analyses | Seurat_3.1.4.9903 [7, 22] | Highly variable gene identification, dimension reduction (PCA, t-SNE, UMAP), graph-based clustering, differential expression analysis, etc. |
Doublet score | scds_1.2.0 [15] | Binary classification-based and co-expression-based algorithms |
Cell-type annotation | gelnet_1.2.1 [24] | OCLR |
Malignancy estimation | infercnv_0.8.2 [26] | Inferring CNV based on gene expression data |
Cell cycle estimation | Seurat_3.1.4.9903 [22] | Relative average expression levels |
Cell stemness estimation | gelnet_1.2.1 [24] | OCLR |
Gene set signatures | GSVA_1.34.0 [32]; Seurat_3.1.4.9903 [22] | GSVA; relative average expression levels |
Expression programs | NNLM_0.4.3 | NMF |
Survival analysis | survival_3.1–11; survminer_0.4.6 | Significance test |
Cell–cell interaction | [33] | The product of the average expression of the ligand and receptor genes |
Batch effect correction | Seurat_3.1.4.9903; harmony_1.0; liger_0.5.0.9000 [22, 40, 41] | Regression; SeuratMNN; Normal-MNN; Harmony; Liger |
Plotting | ggplot2_3.3.0; cowplot_1.0.0; ggExtra_0.9; grid_3.6.1; gridExtra_2.3; pheatmap_1.0.12 | - |
Report generating | markdown_1.1; knitr_1.28 | - |
Steps . | Packages and references . | Methods . |
---|---|---|
Cell calling | DropletUtils_1.6.1 [12] | Detecting deviations from the ambient profile |
Ambient RNA | SoupX_1.0.1 [17] | Estimating background (soup) fraction |
Basic analyses | Seurat_3.1.4.9903 [7, 22] | Highly variable gene identification, dimension reduction (PCA, t-SNE, UMAP), graph-based clustering, differential expression analysis, etc. |
Doublet score | scds_1.2.0 [15] | Binary classification-based and co-expression-based algorithms |
Cell-type annotation | gelnet_1.2.1 [24] | OCLR |
Malignancy estimation | infercnv_0.8.2 [26] | Inferring CNV based on gene expression data |
Cell cycle estimation | Seurat_3.1.4.9903 [22] | Relative average expression levels |
Cell stemness estimation | gelnet_1.2.1 [24] | OCLR |
Gene set signatures | GSVA_1.34.0 [32]; Seurat_3.1.4.9903 [22] | GSVA; relative average expression levels |
Expression programs | NNLM_0.4.3 | NMF |
Survival analysis | survival_3.1–11; survminer_0.4.6 | Significance test |
Cell–cell interaction | [33] | The product of the average expression of the ligand and receptor genes |
Batch effect correction | Seurat_3.1.4.9903; harmony_1.0; liger_0.5.0.9000 [22, 40, 41] | Regression; SeuratMNN; Normal-MNN; Harmony; Liger |
Plotting | ggplot2_3.3.0; cowplot_1.0.0; ggExtra_0.9; grid_3.6.1; gridExtra_2.3; pheatmap_1.0.12 | - |
Report generating | markdown_1.1; knitr_1.28 | - |
Steps . | Packages and references . | Methods . |
---|---|---|
Cell calling | DropletUtils_1.6.1 [12] | Detecting deviations from the ambient profile |
Ambient RNA | SoupX_1.0.1 [17] | Estimating background (soup) fraction |
Basic analyses | Seurat_3.1.4.9903 [7, 22] | Highly variable gene identification, dimension reduction (PCA, t-SNE, UMAP), graph-based clustering, differential expression analysis, etc. |
Doublet score | scds_1.2.0 [15] | Binary classification-based and co-expression-based algorithms |
Cell-type annotation | gelnet_1.2.1 [24] | OCLR |
Malignancy estimation | infercnv_0.8.2 [26] | Inferring CNV based on gene expression data |
Cell cycle estimation | Seurat_3.1.4.9903 [22] | Relative average expression levels |
Cell stemness estimation | gelnet_1.2.1 [24] | OCLR |
Gene set signatures | GSVA_1.34.0 [32]; Seurat_3.1.4.9903 [22] | GSVA; relative average expression levels |
Expression programs | NNLM_0.4.3 | NMF |
Survival analysis | survival_3.1–11; survminer_0.4.6 | Significance test |
Cell–cell interaction | [33] | The product of the average expression of the ligand and receptor genes |
Batch effect correction | Seurat_3.1.4.9903; harmony_1.0; liger_0.5.0.9000 [22, 40, 41] | Regression; SeuratMNN; Normal-MNN; Harmony; Liger |
Plotting | ggplot2_3.3.0; cowplot_1.0.0; ggExtra_0.9; grid_3.6.1; gridExtra_2.3; pheatmap_1.0.12 | - |
Report generating | markdown_1.1; knitr_1.28 | - |
Steps . | Packages and references . | Methods . |
---|---|---|
Cell calling | DropletUtils_1.6.1 [12] | Detecting deviations from the ambient profile |
Ambient RNA | SoupX_1.0.1 [17] | Estimating background (soup) fraction |
Basic analyses | Seurat_3.1.4.9903 [7, 22] | Highly variable gene identification, dimension reduction (PCA, t-SNE, UMAP), graph-based clustering, differential expression analysis, etc. |
Doublet score | scds_1.2.0 [15] | Binary classification-based and co-expression-based algorithms |
Cell-type annotation | gelnet_1.2.1 [24] | OCLR |
Malignancy estimation | infercnv_0.8.2 [26] | Inferring CNV based on gene expression data |
Cell cycle estimation | Seurat_3.1.4.9903 [22] | Relative average expression levels |
Cell stemness estimation | gelnet_1.2.1 [24] | OCLR |
Gene set signatures | GSVA_1.34.0 [32]; Seurat_3.1.4.9903 [22] | GSVA; relative average expression levels |
Expression programs | NNLM_0.4.3 | NMF |
Survival analysis | survival_3.1–11; survminer_0.4.6 | Significance test |
Cell–cell interaction | [33] | The product of the average expression of the ligand and receptor genes |
Batch effect correction | Seurat_3.1.4.9903; harmony_1.0; liger_0.5.0.9000 [22, 40, 41] | Regression; SeuratMNN; Normal-MNN; Harmony; Liger |
Plotting | ggplot2_3.3.0; cowplot_1.0.0; ggExtra_0.9; grid_3.6.1; gridExtra_2.3; pheatmap_1.0.12 | - |
Report generating | markdown_1.1; knitr_1.28 | - |

The distributions of identified QC upper thresholds of five metrics across 56 test samples. Five subplots correspond to the number of total UMI (nUMI), the number of expressed genes (nGene), the percentages of UMIs from mitochondrial genes (mito.percent), ribosomal genes (ribo.percent) and dissociation-associated genes (diss.percent), respectively. The points in boxes indicate the identified QC upper thresholds of each sample to filter low-quality cells. PDAC, pancreatic ductal adenocarcinoma; LUAD, lung adenocarcinoma.
In addition to QC on cells, we also performed QC on genes. Firstly, we filtered genes that expressed in less than three cells. Then, due to the existence of some ambinet RNAs leaked by necrotic or lysed cells, some other droplets were contaminated by them. In order to analyze the influence of contamination, we tested on some samples and found that mitochondrial genes, ribosomal genes and some other genes (such as MALAT1, FTH1, B2M) had high expression proportion in both cell and background (nearly empty droplets) in nearly all samples (Supplementary Figure S3A and B), which meant that ambient RNAs contamination may have more significant impact on their expressions. Meanwhile, these genes were mainly related to cells basic structures, such as membrane and ATP synthase complex (Supplementary Figure S3C), and had little relation to downstream analyses. Therefore, the package listed these genes by boxplots (Supplementary Figure S3A), and it is strongly suggested that researchers should carefully consider the effects of these genes on the final results. Besides, we also integrated an R package SoupX [17] to estimate the contamination fraction from ambient RNAs. It estimated a soup (background) expression profile from nearly empty droplets firstly. Then, it defined or identified soup-specific genes according to prior knowledge or expression bimodality, which meant they should not be expressed in some of cells but were detected due to the existence of ambient RNAs. Using these genes as a bridge, a contamination fraction can be estimated and the expression profile can be corrected further to reduce ambient RNAs contamination (Supplementary Figure S3, the supplementary materials section 3).
Basic downstream analyses
After QC, clustering single cells, representing them on low-dimension and identifiying cluster-specific marker genes are some basic analyses for scRNA-seq data, and several methods and tools have been developed [18–22]. Here, we performed these downstream analyses based on Seurat [7, 22] and redesigned some visualization functions. In detail, these analyses mainly included normalization, log-transformation, highly variable genes identification, unwanted variance removing, scaling, centering, dimension reduction (PCA/t-SNE/UMAP), clustering and differential expression analysis (Supplementary Figure S4, the supplementary materials section 4).
Microenvironment cell-type classification
Cancer microenvironment plays an important role in tumor progression. So, accurate identification of different microenvironment cell types is essential to scRNA-seq downstream biological function analyses [23]. Here, we developed a data-driven method to annotate major microenvironment cell types, including endothelial cells, fibroblasts and immune cells (CD4+ T cells, CD8+ T cells, B cells, nature killer cells and myeloid cells). We curated a high-quality data set by combining multiple cancer scRNA-seq data and trained one-class logistic regression (OCLR) machine learning models for each cell type [24]. In detail, given a training expression matrix |${X}^{(k)}$| of cell type |$k$|, we trained a weight vector |${w}_{(k)}$| of OCLR model to maximize the log-likelihood |$l({w}_{(k)}|{X}^{(k)})={\sum}_{i=1}^n\log p({X}_i^{(k)}|{w}_{(k)})$|, where the likelihood was set as logistic function format |$p({x}_i^{(k)}|{w}_{(k)})=\frac{\exp ({w}_{(k)}^T{x}_i^{(k)})}{1+\exp ({w}_{(k)}^T{x}_i^{(k)})}$|. Meanwhile, a regularizer |$\frac{1}{2}({w}_{(k)}^T{w}_{(k)})$| was added to limit the complexity of model. Then, we calculated Spearman correlation between the trained weight template |${w}_{(k)}$| and expression vector |${X}_j^{(\mathrm{test})}$| of each test cell |$j$| to measure the possibility of this cell belonging to cell type |$k$|. By comparing these possibilities of different cell types, we assigned type label with the largest possibility (including an extra ‘Unknown’ type when all possibilities were especially small) to each cell (Supplementary Figure S5, the supplementary materials section 5). In scCancer, we presented the predicted labels low-dimension space distribution (Figure 3A) and enrichment degree in all clusters (Figure 3B).

The results of cancer microenvironment cell-type classification by scCancer. (A) Colors in the t-SNE plot shows the predicted cell types of the sample KC1 of KC data set. (B) Bar plot shows the proportion of different cell types in each cluster of KC1 sample. (C) Heat map shows the cell-type percentages (the first row of each block) and marker gene detection rates (the second row of each block) in each cluster (column) of KC1 sample. (D) Left: boxplot shows the Pearson correlations (y-axis) between cell-types percentages and marker gene detection rates of each clusters of all test samples across six data sets (x-axis). Right: density plot shows the distribution of correlation coefficients (y-axis) in left plot. The red dash line indicates correlation of 0.5 and 91.74% of them are >0.5. The colors indicate different cell types.
To evaluate the classification performance of OCLR models, we measured the consistency between the predicted cell types and their marker genes expression profile of all clusters. We thought the classifier had good performance, if each clusters cells were mainly predicted as one cell type (large cell-type percentage), and these cell-types marker genes also had high detection rate in the cluster. Taking one kidney cancer (KC) sample KC1 for example, we plotted the cell-type fractions and markers detection rates of different clusters and found they were highly correlated (Figure 3C). Although cluster 15 has low PTPRC detection rate, its top differential expressed genes (IGLL5, IGJ, MZB1, DERL3, etc. Supplementary Figure S4D) were all related to B cells function, which was consistent with scCancer-predicted labels. And for cluster 17, which has significantly high NKG7 detection rates, but was mainly predicted as epithelial cells, we thought it expressed both epithelial cell markers (such as KRT8) and immune cell markers (such as PTPRC, NKG7) and was more likely to be doublets, which was consistent with its high doublet scores estimated before (Supplementary Figures S2C and S4B).
By applying to the 56 test samples in Table 1, we proved our method presented high correlations between the predicted cell-type percentages and their corresponding maker genes detection rates across all clusters. In spite of the impact of doublets, ambinet RNAs contamination, and dropout, over 90% of these correlation coefficients were still >0.5 (Figure 3D). As a comprasion, we also tested two other algorithms: linear discriminant analysis (LDA), one of the most commonly used multi-class classification methods, and Garnett [25], a novel method designed for single-cell data based on hierarchical cell-type-specific genes. The resuls showed that three methods all had high correlations between cell-type percentages and makers detection rates (Figure 3D and Supplementary Figure S5A and B). However, Garnett depended on accurate prior knowledge of cell-type marker genes. If the given markers were improper, it may predict mixed cell types in one cluster (Supplementary Figure S5C and D). Besides, comparing to multi-class classification methods, such as LDA, one-class model can more easily add more cell types without retraining the existing classifiers and can also assign ‘Unknown’ labels for unsure or unseen types. In conclusion, the one-class model OCLR had stable performance, flexible application scenarios and simple parameter settings. Therefore, we used it to annotate common cancer microenviroment cell types in scCancer.
Cell malignancy estimation
Then, we defined the malignancy score as the mean of the squares of final CNV values across chromosome. By comparing the distribution of malignancy scores with reference and detecting its bimodality, we assigned malignancy labels for cells (Figure 4A and the supplementary materials section 6).

The results of cell malignancy estimation by scCancer. (A) Left: histogram shows the distribution of the estimated malignancy scores of observation (blue) of reference (gray) cells in KC1 sample. The red dash line marks the identified threshold to distinguish malignant and nonmalignant cells. Middle: t-SNE plot shows the cells’ malignancy scores. The deeper color indicates the cell is more malignant. Right: t-SNE plot shows the classified malignant types (red means malignant; green means non-malignant). (B) Boxplots show the estimated malignancy score (x-axis) of different cell types (y-axis) in all test samples. Each subplot indicates a data set. The colors of boxes are consistent with the legend colors of Figure 3A and B.
By testing on 56 samples in Table 1, we found the estimated malignancy scores of epithelial cells were significantly higher than other cell types (Figure 4B), which were consistent with these samples epithelial cancer origin. In addition, we found that the neighbor-smoothing process we proposed reduced the impact of dropout effectively and was more helpful to distinguish malignant and non-malignant cells (Supplementary Figure S6 and Figure 4A).
Intra-tumor cell phenotype heterogeneity analyses
Intra-tumor heterogeneity is widespread in cancer samples and highly associated with cancers poor prognosis. The single-cell sequencing technologies provide an avenue to explore the heterogeneity at cellular resolution. In scCancer, we mainly considered two key cellular phenotypes, cell cycle and stemness, to characterize cancer samples cellular heterogeneity. For cell cycle, we used the Seurat ‘AddModuleScore’ function to calculate the relative average expression of a list of G2/M and S phase markers as cell cycle scores (Supplementary Figure S7A) [22]. For cell stemness, we trained a stemness signature based on a stem/progenitor cells data set using OCLR model [27]. Then, we defined the stemness score as the Spearman correlation coefficient between this signature and cells expression vectors (Supplementary Figures S7B and the supplementary materials section 8).
Intra-tumor cell signature heterogeneity analyses
In addition to cell phenotypes, analyzing at pathways or other gene sets level may reveal more information on cell–cell heterogeneities [1, 5, 6]. In scCancer, we provided two complementary approaches to perform signature analyses.
The first was to calculate known gene set signature scores, such as pathways. Here, we integrated gene set variation analysis (GSVA) [28] to estimate variation of the known gene set activities over the cells, which was widely used to analyze signatures on various cancer types [29–31]. As a faster alternative, the method based on the relative average expression level across gene sets was also provided by using the Seurat ‘AddModuleScore’ function [22]. By default, scCancer calculated signature scores of 50 hallmark gene sets from MSigDB [32] and users can also input their own interested gene sets to the function. The estimtaed gene signature scores of cancer cells were presented by a heat map with cell clusters annotation (Supplementary Figures S8 and the supplementary materials section 9).
The second was to identify potential expression program signatures in unsupervised ways and assign scores for them. Many studies show that non-negative matrix factorization (NMF) can unsupervised identify transcriptional programs or gene signatures [1, 3]. In scCancer, we applied NMF on the centralized and non-negative changed expression matrix to uncover the common expression patterns of malignant cell subpopulations. As a result, the left decomposed matrix, gene by program, presented the related genes of each program. And the right one, program by cell, revealed the cell subsets diversity on these programs and were presented by heat map in scCancer (Supplementary Figures S9 and the supplementary materials section 10).
Survival analysis based on genes or signatures
Based on the above analyses, some cell-type-specific marker genes and other gene signatures (such as known gene sets and NMF expression programs) can be used for survival analysis. So, we provided an extra function ‘runSurvival’, which read the user-given expression and survival data to assess the prognostic values of these signatures (Supplementary Figures S10 and the supplementary materials section 11).
Cell–Cell interaction analyses
Cell–cell interaction is crucial to understand some tumor behaviors, such as growth, progression, drug response, therapeutic effect [33–35]. Here, we referred to the methods of Kumar et al. [33] to characterize ligand–receptor interactions extent over cell clusters. Firstly, we extracted single cells expressions of the approximately 2500 known ligand–receptor pairs from FANTOM5 data set [36], and defined the ligand–receptor interaction score between two cell clusters as the product of the average expression of the ligand and receptor gene in each cluster, respectively. To identify significant interactions, we filtered weak gene pairs according to their interaction scores and detection rates in cell clusters. Finally, we estimated the global interaction strength between any two cell clusters by the number of the remaining gene pairs and the sum of their scores. We visualized these global interaction strengths between clusters by a scatter plot and annotated clusters with their cell-type fractions for convenience of comparsion in scCancer (Supplementary Figures S11 and the supplementary materials section 12).
Multi-sample data integration analysis
Integrating mulitple scRNA-seq samples can uncover common or specific biological signals better. However, the batch effect existing across samples brings a great challenge on it [37, 38]. In scCancer, we provided six approaches to perform multi-sample integration analysis, which covered two of the most basic combination strategies (‘Raw’, ‘Regression’), three best-performing algorithms after systematic evaluation [‘Seurat mutual nearest neighbor (MNN)’, ‘Harmony’, ‘LIGER’] [39], and a modified MNN version considering the inter-tumor heterogeneity (‘NormalMNN’).
The first one, named ‘Raw’, was to combine the raw expression matrixes of all samples directly. This method was appropriate to the situation when we believed the batch effect had little impact or when we used it to compare with other batch correction methods.
The second one, named ‘Regression’, was to treat the sample sources as unwanted technical variance and regress out it to remove the batch effects. This method required the samples had no differences in sample biological design, such as control and experimental groups. If not, these designed differences may also be regressed out.
The third one, named ‘SeuratMNN’, was to take advantage of the algorithm in Seurat v3 [22]. It performed canonical correlation analysis to reduce dimension and capture correlated gene modules among samples. Then, it identified MNNs to determine shared cell subsets among different samples. Using these neighbor pairs on shared cell subsets as ‘anchors’, we can project multiple samples together. This method was under the assumption that all samples shared at least one type of cell subpopulation.
Besides, considering the widespread inter-tumor heterogeneity across cancer samples, we thought the ‘SeuratMNN’ anchors among malignant cell subpopulations may eliminate the specific features of tumor cells in different samples. So, we aslo proposed a modified MNN version based on Seurat v3 method, named ‘NormalMNN’ (recommended option), which used the cell-type annotation results in the previous step and only kept the anchors between normal cells. In this way, the method not only took advantage of the shared normal cell populations across samples, but also had less influence on inter-tumor heterogeneity.
The fifth one, named ‘Harmony’, was to integrate R package ‘harmony’ [40]. It corrected the cells low-dimension embedding coordinates to make cells of same types from different samples mixed. This method had superior performance, faster running speed and fewer computational cost.
The final one, named ‘LIGER’, was to integrate R package ‘liger’ [41]. It proposed integrative NMF to find shared and specific factors among different samples and then performed joint clustering based on cells neighbor relationships among shared factors.
To test the performance of these six approaches, we applied them to our three KC samples and four hepatocellular carcinoma (HCC) samples. According to the results (Figure 5), we found that the cells were mainly clustered by sample sources under the method combining the raw expression matrixes directly, though the same type cells were also close. Using regression to correct batch had mild effect and can push the same types cells closer. Using the ‘SeuratMNN’, ‘Harmony’ and ‘LIGER’ can make cells from different samples mixed uniformly, but it led to hardly any inter-tumor heterogeneity. The ‘NormalMNN’ method we proposed can mix the normal cells and keep the inter-tumor heterogeneity at the same time.

The results of multi-sample data integration analysis of KC (A) and HCC (B) data sets by scCancer. Six columns correspond to six strategies of multi-sample integration. Colors of points in t-SNE plots indicate sample sources.
After the batch correction, we reperformed dimension reduction, clustering, differential expression analysis. Then, for the following cell annotations, we combined the results of single sample analyses directly, or reanalyzed the integrated data based on whether the cells mutual relationship influenced the results. And all results were recorded in a new HTML report (the supplementary materials section 13).
Discussion
scCancer was a user-friendly R package for processing and analyzing scRNA-seq for cancer research. It not only included the common process steps for scRNA-seq data, but also paid more attention to the characteristics of cancer samples, such as cancer microenvironments and heterogeneities. After the nearly automated analyses, it can generate detailed graphical HTML reports to make users have a quick overview for the data. At the same time, users can also modify some arugments to realize some personalized setting, such as adjusting QC strength, analyzing mouse samples, analyzing patient-drived tumor xenograft samples, retraining cell-type OCLR templates to achieve more abundatnt cell-types classification, using own reference data to estimate malignancy, etc. By testing on 433 405 cells in 56 samples from 6 datasets, we demonstrated that scCancer had good performance on each steps analysis and was applicable to a variety of cancer types. And the total single sample analyses of general 10X samples can be ran on a personal computer within acceptable times (Supplementary Figures S12 and the supplementary materials section 14).
Among all the analysis steps, we integrated some widely used packages or algorithms and proposed a few specific methods for cancer applications. For the selection of these methods, we mainly followed three principles by carefully assessing them on all the 56 testing samples: (i) the performance should be stable and insensitive to parameter settings, (ii) the methods are useful or fine-adapted for cancer research and (iii) the implementation is simple and time-efficient. The specifically designed methods included: OCLR for annotating cancer microenvironment cell types, a modified cell malignancy estimation by neighbor smoothing and a modified batch-effect correction method (‘NormalMNN’) considering the inter-tumor heterogeneity. The advantages of OCLR are its performance stability and flexibility. As the quick accumulation of more single-cell data of different cell types, OCLR will be updated to identify more cell types and cellular states in the future. The modified cell malignancy estimation smoothed the original CNV profiles by neighbors, which makes the estimation more stable against the noisy single-cell data. And, it also makes the malignancy scores tend to form bi-model distributions, which are very helpful for selecting a proper threshold. The ‘NormalMNN’ method considers the high inter-tumor heterogeneities between tumor cells from different patients.
Currently, the package mainly focused on solid tumor samples. As for blood cancers, some of the steps may not be very applicable, such as the default cancer microenvironment cell-type setting. In the future, we will consider the characteristics of blood cancers and make scCancer compatible to these situations. Besides, we will provide more optional methods for each step and try to solve more cancer-specific problems.
An R package, named scCancer, is developed for automated processing of single-cell RNA-seq data in cancer.
scCancer integrates basic scRNA-seq data processing steps and considers more cancer-specific characteristics and concerns, including comprehensive quality control, cancer microenvironment cell-types identification, cell malignancy estimation, intra-tumor heterogeneities analyses at phenotype and signature levels, cell–cell interaction analyses, etc.
scCancer can perform multi-sample integration analysis by using common batch-effect correction methods and new proposed ‘NormalMNN’ algorithm.
User-friendly graphic reports can be generated automatically to overview all the analyses results.
Funding
National Natural Science Foundation of China (61922047, 81890993 and 61721003).
Wenbo Guo is a PhD student of bioinformatics at MOE Key Laboratory of Bioinformatics, BNRIST Bioinformatics Division, Department of Automation, Tsinghua University.
Dongfang Wang, PhD, is a postdoctoral fellow of BIOPIC and School of Life Sciences, Peking University.
Shicheng Wang is a Master student of bioinformatics at MOE Key Laboratory of Bioinformatics, BNRIST Bioinformatics Division, Department of Automation, Tsinghua University.
Yiran Shan is a PhD student of bioinformatics at MOE Key Laboratory of Bioinformatics, BNRIST Bioinformatics Division, Department of Automation, Tsinghua University.
Changyi Liu is a Master student of bioinformatics at MOE Key Laboratory of Bioinformatics, BNRIST Bioinformatics Division, Department of Automation, Tsinghua University.
Jin Gu, PhD, is an Associate Professor of bioinformatics at MOE Key Laboratory of Bioinformatics, BNRIST Bioinformatics Division, Department of Automation, Tsinghua University.