Abstract

Molecular heterogeneities and complex microenvironments bring great challenges for cancer diagnosis and treatment. Recent advances in single-cell RNA-sequencing (scRNA-seq) technology make it possible to study cancer cell heterogeneities and microenvironments at single-cell transcriptomic level. Here, we develop an R package named scCancer, which focuses on processing and analyzing scRNA-seq data for cancer research. Except basic data processing steps, this package takes several special considerations for cancer-specific features. Firstly, the package introduced comprehensive quality control metrics. Secondly, it used a data-driven machine learning algorithm to accurately identify major cancer microenvironment cell populations. Thirdly, it estimated a malignancy score to classify malignant (cancerous) and non-malignant cells. Then, it analyzed intra-tumor heterogeneities by key cellular phenotypes (such as cell cycle and stemness), gene signatures and cell–cell interactions. Besides, it provided multi-sample data integration analysis with different batch-effect correction strategies. Finally, user-friendly graphic reports were generated for all the analyses. By testing on 56 samples with 433 405 cells in total, we demonstrated its good performance. The package is available at: http://lifeome.net/software/sccancer/.

Introduction

Cancer is one of the most lethal diseases worldwide. Cellular heterogeneities and complex microenvironments are major challenges for current cancer research. Recently, various single-cell RNA-sequencing (scRNA-seq) techniques have been developed and were widely applied to study cancer heterogeneities and microenvironments at single-cell level [1–6]. Alongside the technological breakthrough and application, several bioinformatics packages (such as Seurat [7], scater [8] and scanpy [9]) of scRNA-seq data analysis have been developed, which greatly facilitate the research in biological and medical fields. However, these tools are designed for general applications and mainly contain basic analyses, lacking consideration on cancer-specific features. Therefore, there is a great need for developing cancer-specific tools beyond basic and routine analyses.

Here, we developed a user-friendly and automated R package scCancer for cancer scRNA-seq data analysis. In the package, we introduced more comprehensive and adaptive quality control (QC) metrics for cancer research and also encapsulated basic scRNA-seq analyses steps, such as clustering and differential expression analysis. Furthermore, we integrated several specific computational analyses for cancer data, including major cell-type classifications of cancer microenvironment, cell malignancy estimation, intra-tumor heterogeneities analyses based on cellular phenotypes (cell cycle and stemness) and gene signatures, and cell–cell interaction estimation. The package can also implement multi-sample data integration analysis, including different strategies to correct batch effect. In current version, the package is mainly designed for droplet-based platforms, which can profile thousands of cells in a single run and are very popular for cancer research application scenarios [10], but it can also be adapted to analyze scRNA-seq data by other platforms.

To test the compatibility, scalability and accuracy of our package, we applied scCancer for systematically analyzing 433 405 cells in 56 samples from 6 data sets (Table 1) [3, 11]. Results show that the package can achieve good performance on cell-type classification, cell malignancy estimation, etc.

Table 1

Test samples information (56 samples, 433 405 cells)

Data setCancer typeNumber of sampleNumber of cellSourceaTechnologyb
KCKidney cancer327 844Unpublished10X
HCCHepatocellular carcinoma430 205Unpublished10X
PDAC-CRPancreatic ductal adenocarcinoma20102 703CRA001160 [3]10X
HCC-NCHepatocellular carcinoma7118 301GSE112271 [11]10X
LUAD-TJLung adenocarcinoma9119 486CRA001963, CRA00147710X
LUAD-CCLung adenocarcinoma1334 866GSE12390210X
Data setCancer typeNumber of sampleNumber of cellSourceaTechnologyb
KCKidney cancer327 844Unpublished10X
HCCHepatocellular carcinoma430 205Unpublished10X
PDAC-CRPancreatic ductal adenocarcinoma20102 703CRA001160 [3]10X
HCC-NCHepatocellular carcinoma7118 301GSE112271 [11]10X
LUAD-TJLung adenocarcinoma9119 486CRA001963, CRA00147710X
LUAD-CCLung adenocarcinoma1334 866GSE12390210X

aIn the ‘Source’ column, starting with ‘CRA’ means the accession number of Genome Sequence Archive databases, and starting with ‘GSE’ means the accession number of NCBI GEO databases.

bThe scRNA-seq technologies for test data sets. In this column, ‘10X’ means ‘10X Genomics’.

Table 1

Test samples information (56 samples, 433 405 cells)

Data setCancer typeNumber of sampleNumber of cellSourceaTechnologyb
KCKidney cancer327 844Unpublished10X
HCCHepatocellular carcinoma430 205Unpublished10X
PDAC-CRPancreatic ductal adenocarcinoma20102 703CRA001160 [3]10X
HCC-NCHepatocellular carcinoma7118 301GSE112271 [11]10X
LUAD-TJLung adenocarcinoma9119 486CRA001963, CRA00147710X
LUAD-CCLung adenocarcinoma1334 866GSE12390210X
Data setCancer typeNumber of sampleNumber of cellSourceaTechnologyb
KCKidney cancer327 844Unpublished10X
HCCHepatocellular carcinoma430 205Unpublished10X
PDAC-CRPancreatic ductal adenocarcinoma20102 703CRA001160 [3]10X
HCC-NCHepatocellular carcinoma7118 301GSE112271 [11]10X
LUAD-TJLung adenocarcinoma9119 486CRA001963, CRA00147710X
LUAD-CCLung adenocarcinoma1334 866GSE12390210X

aIn the ‘Source’ column, starting with ‘CRA’ means the accession number of Genome Sequence Archive databases, and starting with ‘GSE’ means the accession number of NCBI GEO databases.

bThe scRNA-seq technologies for test data sets. In this column, ‘10X’ means ‘10X Genomics’.

Methods

Workflow overview

The scCancer package has three modules. The first, named scStatistics, performed basic statistical analyses of raw data and QC. The second, named scAnnotation, performed functional data analyses and visualizations of single sample, including low-dimensional representation, clustering, cell-type classification, malignancy estimation, cellular phenotype scoring, gene signatures analysis, cell–cell interaction analysis, etc. The third, named scCombination, performed multi-sample integration, batch-effect correction and visualization. After all the computational analyses finished, detailed and graphical reports were automatically generated in user-friendly HTML format (Figure 1). The information of packages and methods used in scCancer is summarized in Table 2.

Quality control

For a droplet-based scRNA-seq expression matrix, the first step was to identify the droplets more likely to include real cells (Supplementary Figure S1, the supplementary materials section 1) [12]. However, the droplets annotated with cells (identified by preprocessing steps) frequently captured low-quality cells or more than one cell (doublets or multiplets) [13]. To avoid misleading results caused by these artifacts, the number of total unique molecular identifiers (UMIs), the number of expressed genes and the percentage of transcripts from mitochondrial or ribosomal genes were widely used to perform cell (or droplet) quality control [7–9, 14] (Supplementary Figure S2A and B, the supplementary materials section 2). Generally, extremely low gene detection number or UMI number means very few transcripts are successfully captured; abnormally high UMI or gene number may due to that two or more cells (usually called as doublets or multiplets) are included in one droplet; the vast majority of UMIs from mitochondrial frequently suggest that cells are lysed during the preparation of single-cell suspensions. Using these metrics, the droplets having low transcripts diversity, capturing lysed cells or doublets were filtered. Moreover, to detect doublets better, we also integrated an R package scds [15] to estimate doublet scores (Supplementary Figure S2C). This package contained two doublets identification strategies: (i) binary classification-based approach (bcds), which created artificial doublets and trained binary classifiers to discriminate them; and (ii) co-expression-based approach (cxds), which detected the co-expression profile of gene pairs that should not be expressed in one cell simultaneously. Except these common metrics, considering that solid tumor cells were easier to produce stress response during sample dissociation process, we took advantage of the results of van den Brink et al. [16] and set a dissociation-associated gene list. By calculating these genes expression percentage, we defined an extra metric to remove cells seriously affected by dissociation process. Besides, considering that different sample sources and experimental conditions may lead to different distributions of the QC metrics, we adaptively determined the filtering thresholds by detecting outliers from the distribution of metrics, instead of routine fixed cutoffs (Supplementary Figure S2A and B, the supplementary materials section 2). By analyzing all the 56 testing samples and identifying the upper thresholds of five cell-QC metrics, we found that different samples had similar thresholds distributions, but variances also existed due to sample-specific characteristics (Figure 2). So, we highly recommended that users should choose a proper cutoff by carefully examining the metrics distributions.

The workflow of scCancer.
Figure 1

The workflow of scCancer.

Table 2

The information of packages and methods used in scCancer

StepsPackages and referencesMethods
Cell callingDropletUtils_1.6.1 [12]Detecting deviations from the ambient profile
Ambient RNASoupX_1.0.1 [17]Estimating background (soup) fraction
Basic analysesSeurat_3.1.4.9903 [7, 22]Highly variable gene identification, dimension reduction (PCA, t-SNE, UMAP), graph-based clustering, differential expression analysis, etc.
Doublet scorescds_1.2.0 [15]Binary classification-based and co-expression-based algorithms
Cell-type annotationgelnet_1.2.1 [24]OCLR
Malignancy estimationinfercnv_0.8.2 [26]Inferring CNV based on gene expression data
Cell cycle estimationSeurat_3.1.4.9903 [22]Relative average expression levels
Cell stemness estimationgelnet_1.2.1 [24]OCLR
Gene set signaturesGSVA_1.34.0 [32]; Seurat_3.1.4.9903 [22]GSVA; relative average expression levels
Expression programsNNLM_0.4.3NMF
Survival analysissurvival_3.1–11; survminer_0.4.6Significance test
Cell–cell interaction[33]The product of the average expression of the ligand and receptor genes
Batch effect correctionSeurat_3.1.4.9903; harmony_1.0; liger_0.5.0.9000 [22, 40, 41]Regression; SeuratMNN; Normal-MNN; Harmony; Liger
Plottingggplot2_3.3.0; cowplot_1.0.0; ggExtra_0.9; grid_3.6.1; gridExtra_2.3; pheatmap_1.0.12-
Report generatingmarkdown_1.1; knitr_1.28-
StepsPackages and referencesMethods
Cell callingDropletUtils_1.6.1 [12]Detecting deviations from the ambient profile
Ambient RNASoupX_1.0.1 [17]Estimating background (soup) fraction
Basic analysesSeurat_3.1.4.9903 [7, 22]Highly variable gene identification, dimension reduction (PCA, t-SNE, UMAP), graph-based clustering, differential expression analysis, etc.
Doublet scorescds_1.2.0 [15]Binary classification-based and co-expression-based algorithms
Cell-type annotationgelnet_1.2.1 [24]OCLR
Malignancy estimationinfercnv_0.8.2 [26]Inferring CNV based on gene expression data
Cell cycle estimationSeurat_3.1.4.9903 [22]Relative average expression levels
Cell stemness estimationgelnet_1.2.1 [24]OCLR
Gene set signaturesGSVA_1.34.0 [32]; Seurat_3.1.4.9903 [22]GSVA; relative average expression levels
Expression programsNNLM_0.4.3NMF
Survival analysissurvival_3.1–11; survminer_0.4.6Significance test
Cell–cell interaction[33]The product of the average expression of the ligand and receptor genes
Batch effect correctionSeurat_3.1.4.9903; harmony_1.0; liger_0.5.0.9000 [22, 40, 41]Regression; SeuratMNN; Normal-MNN; Harmony; Liger
Plottingggplot2_3.3.0; cowplot_1.0.0; ggExtra_0.9; grid_3.6.1; gridExtra_2.3; pheatmap_1.0.12-
Report generatingmarkdown_1.1; knitr_1.28-
Table 2

The information of packages and methods used in scCancer

StepsPackages and referencesMethods
Cell callingDropletUtils_1.6.1 [12]Detecting deviations from the ambient profile
Ambient RNASoupX_1.0.1 [17]Estimating background (soup) fraction
Basic analysesSeurat_3.1.4.9903 [7, 22]Highly variable gene identification, dimension reduction (PCA, t-SNE, UMAP), graph-based clustering, differential expression analysis, etc.
Doublet scorescds_1.2.0 [15]Binary classification-based and co-expression-based algorithms
Cell-type annotationgelnet_1.2.1 [24]OCLR
Malignancy estimationinfercnv_0.8.2 [26]Inferring CNV based on gene expression data
Cell cycle estimationSeurat_3.1.4.9903 [22]Relative average expression levels
Cell stemness estimationgelnet_1.2.1 [24]OCLR
Gene set signaturesGSVA_1.34.0 [32]; Seurat_3.1.4.9903 [22]GSVA; relative average expression levels
Expression programsNNLM_0.4.3NMF
Survival analysissurvival_3.1–11; survminer_0.4.6Significance test
Cell–cell interaction[33]The product of the average expression of the ligand and receptor genes
Batch effect correctionSeurat_3.1.4.9903; harmony_1.0; liger_0.5.0.9000 [22, 40, 41]Regression; SeuratMNN; Normal-MNN; Harmony; Liger
Plottingggplot2_3.3.0; cowplot_1.0.0; ggExtra_0.9; grid_3.6.1; gridExtra_2.3; pheatmap_1.0.12-
Report generatingmarkdown_1.1; knitr_1.28-
StepsPackages and referencesMethods
Cell callingDropletUtils_1.6.1 [12]Detecting deviations from the ambient profile
Ambient RNASoupX_1.0.1 [17]Estimating background (soup) fraction
Basic analysesSeurat_3.1.4.9903 [7, 22]Highly variable gene identification, dimension reduction (PCA, t-SNE, UMAP), graph-based clustering, differential expression analysis, etc.
Doublet scorescds_1.2.0 [15]Binary classification-based and co-expression-based algorithms
Cell-type annotationgelnet_1.2.1 [24]OCLR
Malignancy estimationinfercnv_0.8.2 [26]Inferring CNV based on gene expression data
Cell cycle estimationSeurat_3.1.4.9903 [22]Relative average expression levels
Cell stemness estimationgelnet_1.2.1 [24]OCLR
Gene set signaturesGSVA_1.34.0 [32]; Seurat_3.1.4.9903 [22]GSVA; relative average expression levels
Expression programsNNLM_0.4.3NMF
Survival analysissurvival_3.1–11; survminer_0.4.6Significance test
Cell–cell interaction[33]The product of the average expression of the ligand and receptor genes
Batch effect correctionSeurat_3.1.4.9903; harmony_1.0; liger_0.5.0.9000 [22, 40, 41]Regression; SeuratMNN; Normal-MNN; Harmony; Liger
Plottingggplot2_3.3.0; cowplot_1.0.0; ggExtra_0.9; grid_3.6.1; gridExtra_2.3; pheatmap_1.0.12-
Report generatingmarkdown_1.1; knitr_1.28-
The distributions of identified QC upper thresholds of five metrics across 56 test samples. Five subplots correspond to the number of total UMI (nUMI), the number of expressed genes (nGene), the percentages of UMIs from mitochondrial genes (mito.percent), ribosomal genes (ribo.percent) and dissociation-associated genes (diss.percent), respectively. The points in boxes indicate the identified QC upper thresholds of each sample to filter low-quality cells. PDAC, pancreatic ductal adenocarcinoma; LUAD, lung adenocarcinoma.
Figure 2

The distributions of identified QC upper thresholds of five metrics across 56 test samples. Five subplots correspond to the number of total UMI (nUMI), the number of expressed genes (nGene), the percentages of UMIs from mitochondrial genes (mito.percent), ribosomal genes (ribo.percent) and dissociation-associated genes (diss.percent), respectively. The points in boxes indicate the identified QC upper thresholds of each sample to filter low-quality cells. PDAC, pancreatic ductal adenocarcinoma; LUAD, lung adenocarcinoma.

In addition to QC on cells, we also performed QC on genes. Firstly, we filtered genes that expressed in less than three cells. Then, due to the existence of some ambinet RNAs leaked by necrotic or lysed cells, some other droplets were contaminated by them. In order to analyze the influence of contamination, we tested on some samples and found that mitochondrial genes, ribosomal genes and some other genes (such as MALAT1, FTH1, B2M) had high expression proportion in both cell and background (nearly empty droplets) in nearly all samples (Supplementary Figure S3A and B), which meant that ambient RNAs contamination may have more significant impact on their expressions. Meanwhile, these genes were mainly related to cells basic structures, such as membrane and ATP synthase complex (Supplementary Figure S3C), and had little relation to downstream analyses. Therefore, the package listed these genes by boxplots (Supplementary Figure S3A), and it is strongly suggested that researchers should carefully consider the effects of these genes on the final results. Besides, we also integrated an R package SoupX [17] to estimate the contamination fraction from ambient RNAs. It estimated a soup (background) expression profile from nearly empty droplets firstly. Then, it defined or identified soup-specific genes according to prior knowledge or expression bimodality, which meant they should not be expressed in some of cells but were detected due to the existence of ambient RNAs. Using these genes as a bridge, a contamination fraction can be estimated and the expression profile can be corrected further to reduce ambient RNAs contamination (Supplementary Figure S3, the supplementary materials section 3).

Basic downstream analyses

After QC, clustering single cells, representing them on low-dimension and identifiying cluster-specific marker genes are some basic analyses for scRNA-seq data, and several methods and tools have been developed [18–22]. Here, we performed these downstream analyses based on Seurat [7, 22] and redesigned some visualization functions. In detail, these analyses mainly included normalization, log-transformation, highly variable genes identification, unwanted variance removing, scaling, centering, dimension reduction (PCA/t-SNE/UMAP), clustering and differential expression analysis (Supplementary Figure S4, the supplementary materials section 4).

Microenvironment cell-type classification

Cancer microenvironment plays an important role in tumor progression. So, accurate identification of different microenvironment cell types is essential to scRNA-seq downstream biological function analyses [23]. Here, we developed a data-driven method to annotate major microenvironment cell types, including endothelial cells, fibroblasts and immune cells (CD4+ T cells, CD8+ T cells, B cells, nature killer cells and myeloid cells). We curated a high-quality data set by combining multiple cancer scRNA-seq data and trained one-class logistic regression (OCLR) machine learning models for each cell type [24]. In detail, given a training expression matrix |${X}^{(k)}$| of cell type |$k$|⁠, we trained a weight vector |${w}_{(k)}$| of OCLR model to maximize the log-likelihood |$l({w}_{(k)}|{X}^{(k)})={\sum}_{i=1}^n\log p({X}_i^{(k)}|{w}_{(k)})$|⁠, where the likelihood was set as logistic function format |$p({x}_i^{(k)}|{w}_{(k)})=\frac{\exp ({w}_{(k)}^T{x}_i^{(k)})}{1+\exp ({w}_{(k)}^T{x}_i^{(k)})}$|⁠. Meanwhile, a regularizer |$\frac{1}{2}({w}_{(k)}^T{w}_{(k)})$| was added to limit the complexity of model. Then, we calculated Spearman correlation between the trained weight template |${w}_{(k)}$| and expression vector |${X}_j^{(\mathrm{test})}$| of each test cell |$j$| to measure the possibility of this cell belonging to cell type |$k$|⁠. By comparing these possibilities of different cell types, we assigned type label with the largest possibility (including an extra ‘Unknown’ type when all possibilities were especially small) to each cell (Supplementary Figure S5, the supplementary materials section 5). In scCancer, we presented the predicted labels low-dimension space distribution (Figure 3A) and enrichment degree in all clusters (Figure 3B).

The results of cancer microenvironment cell-type classification by scCancer. (A) Colors in the t-SNE plot shows the predicted cell types of the sample KC1 of KC data set. (B) Bar plot shows the proportion of different cell types in each cluster of KC1 sample. (C) Heat map shows the cell-type percentages (the first row of each block) and marker gene detection rates (the second row of each block) in each cluster (column) of KC1 sample. (D) Left: boxplot shows the Pearson correlations (y-axis) between cell-types percentages and marker gene detection rates of each clusters of all test samples across six data sets (x-axis). Right: density plot shows the distribution of correlation coefficients (y-axis) in left plot. The red dash line indicates correlation of 0.5 and 91.74% of them are >0.5. The colors indicate different cell types.
Figure 3

The results of cancer microenvironment cell-type classification by scCancer. (A) Colors in the t-SNE plot shows the predicted cell types of the sample KC1 of KC data set. (B) Bar plot shows the proportion of different cell types in each cluster of KC1 sample. (C) Heat map shows the cell-type percentages (the first row of each block) and marker gene detection rates (the second row of each block) in each cluster (column) of KC1 sample. (D) Left: boxplot shows the Pearson correlations (y-axis) between cell-types percentages and marker gene detection rates of each clusters of all test samples across six data sets (x-axis). Right: density plot shows the distribution of correlation coefficients (y-axis) in left plot. The red dash line indicates correlation of 0.5 and 91.74% of them are >0.5. The colors indicate different cell types.

To evaluate the classification performance of OCLR models, we measured the consistency between the predicted cell types and their marker genes expression profile of all clusters. We thought the classifier had good performance, if each clusters cells were mainly predicted as one cell type (large cell-type percentage), and these cell-types marker genes also had high detection rate in the cluster. Taking one kidney cancer (KC) sample KC1 for example, we plotted the cell-type fractions and markers detection rates of different clusters and found they were highly correlated (Figure 3C). Although cluster 15 has low PTPRC detection rate, its top differential expressed genes (IGLL5, IGJ, MZB1, DERL3, etc. Supplementary Figure S4D) were all related to B cells function, which was consistent with scCancer-predicted labels. And for cluster 17, which has significantly high NKG7 detection rates, but was mainly predicted as epithelial cells, we thought it expressed both epithelial cell markers (such as KRT8) and immune cell markers (such as PTPRC, NKG7) and was more likely to be doublets, which was consistent with its high doublet scores estimated before (Supplementary Figures S2C and S4B).

By applying to the 56 test samples in Table 1, we proved our method presented high correlations between the predicted cell-type percentages and their corresponding maker genes detection rates across all clusters. In spite of the impact of doublets, ambinet RNAs contamination, and dropout, over 90% of these correlation coefficients were still >0.5 (Figure 3D). As a comprasion, we also tested two other algorithms: linear discriminant analysis (LDA), one of the most commonly used multi-class classification methods, and Garnett [25], a novel method designed for single-cell data based on hierarchical cell-type-specific genes. The resuls showed that three methods all had high correlations between cell-type percentages and makers detection rates (Figure 3D and Supplementary Figure S5A and B). However, Garnett depended on accurate prior knowledge of cell-type marker genes. If the given markers were improper, it may predict mixed cell types in one cluster (Supplementary Figure S5C and D). Besides, comparing to multi-class classification methods, such as LDA, one-class model can more easily add more cell types without retraining the existing classifiers and can also assign ‘Unknown’ labels for unsure or unseen types. In conclusion, the one-class model OCLR had stable performance, flexible application scenarios and simple parameter settings. Therefore, we used it to annotate common cancer microenviroment cell types in scCancer.

Cell malignancy estimation

In cancer research, estimating cell malignancy and distinguishing malignant and non-malignant cells is also a critical issue. Generally, copy number alterations (CNV) inferred from scRNA-seq data are potential to identify malignant tumor cells, which was widely used in the research of various cancer types [1, 3, 5, 6]. Here, we used the algorithm of R package infercnv [26] to estimate single-cell CNVs. The algorithm calculated the moving average of expression values across each chromosome and then compared them with some normal reference cells to estimate the CNVs. For the convenience of users, we curated 3366 precompiled normal microenvironmental cells (including immune cells, fibroblasts and endothelial cells) from a few high-quality samples as the default reference data set. Users can also set their own reference cells. Furthermore, considering the impact of dropout, we proposed to use cells neighbor information to smooth CNV values to estimate malignancy better. In detail, we took advantage of the similarity matrix of clustering to identify the neighbors |$j\in Nb(i)$| of each cell |$i$|⁠, and used the normalized similarities as weights |${w}_{ij}\ ({\sum}_{j\in Nb(i)}{w}_{ij}=1)$|⁠. By calculating the weighted average of each cell |$i$| and its neighbors initial CNV values |${v}_i^{(\mathrm{init})}$| and |${v}_j^{(\mathrm{init})}$|⁠, we got the final CNV profiles |${v}_i^{(\mathrm{final})}$|⁠:

Then, we defined the malignancy score as the mean of the squares of final CNV values across chromosome. By comparing the distribution of malignancy scores with reference and detecting its bimodality, we assigned malignancy labels for cells (Figure 4A and the supplementary materials section 6).

The results of cell malignancy estimation by scCancer. (A) Left: histogram shows the distribution of the estimated malignancy scores of observation (blue) of reference (gray) cells in KC1 sample. The red dash line marks the identified threshold to distinguish malignant and nonmalignant cells. Middle: t-SNE plot shows the cells’ malignancy scores. The deeper color indicates the cell is more malignant. Right: t-SNE plot shows the classified malignant types (red means malignant; green means non-malignant). (B) Boxplots show the estimated malignancy score (x-axis) of different cell types (y-axis) in all test samples. Each subplot indicates a data set. The colors of boxes are consistent with the legend colors of Figure 3A and B.
Figure 4

The results of cell malignancy estimation by scCancer. (A) Left: histogram shows the distribution of the estimated malignancy scores of observation (blue) of reference (gray) cells in KC1 sample. The red dash line marks the identified threshold to distinguish malignant and nonmalignant cells. Middle: t-SNE plot shows the cells’ malignancy scores. The deeper color indicates the cell is more malignant. Right: t-SNE plot shows the classified malignant types (red means malignant; green means non-malignant). (B) Boxplots show the estimated malignancy score (x-axis) of different cell types (y-axis) in all test samples. Each subplot indicates a data set. The colors of boxes are consistent with the legend colors of Figure 3A and B.

By testing on 56 samples in Table 1, we found the estimated malignancy scores of epithelial cells were significantly higher than other cell types (Figure 4B), which were consistent with these samples epithelial cancer origin. In addition, we found that the neighbor-smoothing process we proposed reduced the impact of dropout effectively and was more helpful to distinguish malignant and non-malignant cells (Supplementary Figure S6 and Figure 4A).

Intra-tumor cell phenotype heterogeneity analyses

Intra-tumor heterogeneity is widespread in cancer samples and highly associated with cancers poor prognosis. The single-cell sequencing technologies provide an avenue to explore the heterogeneity at cellular resolution. In scCancer, we mainly considered two key cellular phenotypes, cell cycle and stemness, to characterize cancer samples cellular heterogeneity. For cell cycle, we used the Seurat ‘AddModuleScore’ function to calculate the relative average expression of a list of G2/M and S phase markers as cell cycle scores (Supplementary Figure S7A) [22]. For cell stemness, we trained a stemness signature based on a stem/progenitor cells data set using OCLR model [27]. Then, we defined the stemness score as the Spearman correlation coefficient between this signature and cells expression vectors (Supplementary Figures S7B and the supplementary materials section 8).

Intra-tumor cell signature heterogeneity analyses

In addition to cell phenotypes, analyzing at pathways or other gene sets level may reveal more information on cell–cell heterogeneities [1, 5, 6]. In scCancer, we provided two complementary approaches to perform signature analyses.

The first was to calculate known gene set signature scores, such as pathways. Here, we integrated gene set variation analysis (GSVA) [28] to estimate variation of the known gene set activities over the cells, which was widely used to analyze signatures on various cancer types [29–31]. As a faster alternative, the method based on the relative average expression level across gene sets was also provided by using the Seurat ‘AddModuleScore’ function [22]. By default, scCancer calculated signature scores of 50 hallmark gene sets from MSigDB [32] and users can also input their own interested gene sets to the function. The estimtaed gene signature scores of cancer cells were presented by a heat map with cell clusters annotation (Supplementary Figures S8 and the supplementary materials section 9).

The second was to identify potential expression program signatures in unsupervised ways and assign scores for them. Many studies show that non-negative matrix factorization (NMF) can unsupervised identify transcriptional programs or gene signatures [1, 3]. In scCancer, we applied NMF on the centralized and non-negative changed expression matrix to uncover the common expression patterns of malignant cell subpopulations. As a result, the left decomposed matrix, gene by program, presented the related genes of each program. And the right one, program by cell, revealed the cell subsets diversity on these programs and were presented by heat map in scCancer (Supplementary Figures S9 and the supplementary materials section 10).

Survival analysis based on genes or signatures

Based on the above analyses, some cell-type-specific marker genes and other gene signatures (such as known gene sets and NMF expression programs) can be used for survival analysis. So, we provided an extra function ‘runSurvival’, which read the user-given expression and survival data to assess the prognostic values of these signatures (Supplementary Figures S10 and the supplementary materials section 11).

Cell–Cell interaction analyses

Cell–cell interaction is crucial to understand some tumor behaviors, such as growth, progression, drug response, therapeutic effect [33–35]. Here, we referred to the methods of Kumar et al. [33] to characterize ligand–receptor interactions extent over cell clusters. Firstly, we extracted single cells expressions of the approximately 2500 known ligand–receptor pairs from FANTOM5 data set [36], and defined the ligand–receptor interaction score between two cell clusters as the product of the average expression of the ligand and receptor gene in each cluster, respectively. To identify significant interactions, we filtered weak gene pairs according to their interaction scores and detection rates in cell clusters. Finally, we estimated the global interaction strength between any two cell clusters by the number of the remaining gene pairs and the sum of their scores. We visualized these global interaction strengths between clusters by a scatter plot and annotated clusters with their cell-type fractions for convenience of comparsion in scCancer (Supplementary Figures S11 and the supplementary materials section 12).

Multi-sample data integration analysis

Integrating mulitple scRNA-seq samples can uncover common or specific biological signals better. However, the batch effect existing across samples brings a great challenge on it [37, 38]. In scCancer, we provided six approaches to perform multi-sample integration analysis, which covered two of the most basic combination strategies (‘Raw’, ‘Regression’), three best-performing algorithms after systematic evaluation [‘Seurat mutual nearest neighbor (MNN)’, ‘Harmony’, ‘LIGER’] [39], and a modified MNN version considering the inter-tumor heterogeneity (‘NormalMNN’).

The first one, named ‘Raw’, was to combine the raw expression matrixes of all samples directly. This method was appropriate to the situation when we believed the batch effect had little impact or when we used it to compare with other batch correction methods.

The second one, named ‘Regression’, was to treat the sample sources as unwanted technical variance and regress out it to remove the batch effects. This method required the samples had no differences in sample biological design, such as control and experimental groups. If not, these designed differences may also be regressed out.

The third one, named ‘SeuratMNN’, was to take advantage of the algorithm in Seurat v3 [22]. It performed canonical correlation analysis to reduce dimension and capture correlated gene modules among samples. Then, it identified MNNs to determine shared cell subsets among different samples. Using these neighbor pairs on shared cell subsets as ‘anchors’, we can project multiple samples together. This method was under the assumption that all samples shared at least one type of cell subpopulation.

Besides, considering the widespread inter-tumor heterogeneity across cancer samples, we thought the ‘SeuratMNN’ anchors among malignant cell subpopulations may eliminate the specific features of tumor cells in different samples. So, we aslo proposed a modified MNN version based on Seurat v3 method, named ‘NormalMNN’ (recommended option), which used the cell-type annotation results in the previous step and only kept the anchors between normal cells. In this way, the method not only took advantage of the shared normal cell populations across samples, but also had less influence on inter-tumor heterogeneity.

The fifth one, named ‘Harmony’, was to integrate R package ‘harmony’ [40]. It corrected the cells low-dimension embedding coordinates to make cells of same types from different samples mixed. This method had superior performance, faster running speed and fewer computational cost.

The final one, named ‘LIGER’, was to integrate R package ‘liger’ [41]. It proposed integrative NMF to find shared and specific factors among different samples and then performed joint clustering based on cells neighbor relationships among shared factors.

To test the performance of these six approaches, we applied them to our three KC samples and four hepatocellular carcinoma (HCC) samples. According to the results (Figure 5), we found that the cells were mainly clustered by sample sources under the method combining the raw expression matrixes directly, though the same type cells were also close. Using regression to correct batch had mild effect and can push the same types cells closer. Using the ‘SeuratMNN’, ‘Harmony’ and ‘LIGER’ can make cells from different samples mixed uniformly, but it led to hardly any inter-tumor heterogeneity. The ‘NormalMNN’ method we proposed can mix the normal cells and keep the inter-tumor heterogeneity at the same time.

The results of multi-sample data integration analysis of KC (A) and HCC (B) data sets by scCancer. Six columns correspond to six strategies of multi-sample integration. Colors of points in t-SNE plots indicate sample sources.
Figure 5

The results of multi-sample data integration analysis of KC (A) and HCC (B) data sets by scCancer. Six columns correspond to six strategies of multi-sample integration. Colors of points in t-SNE plots indicate sample sources.

After the batch correction, we reperformed dimension reduction, clustering, differential expression analysis. Then, for the following cell annotations, we combined the results of single sample analyses directly, or reanalyzed the integrated data based on whether the cells mutual relationship influenced the results. And all results were recorded in a new HTML report (the supplementary materials section 13).

Discussion

scCancer was a user-friendly R package for processing and analyzing scRNA-seq for cancer research. It not only included the common process steps for scRNA-seq data, but also paid more attention to the characteristics of cancer samples, such as cancer microenvironments and heterogeneities. After the nearly automated analyses, it can generate detailed graphical HTML reports to make users have a quick overview for the data. At the same time, users can also modify some arugments to realize some personalized setting, such as adjusting QC strength, analyzing mouse samples, analyzing patient-drived tumor xenograft samples, retraining cell-type OCLR templates to achieve more abundatnt cell-types classification, using own reference data to estimate malignancy, etc. By testing on 433 405 cells in 56 samples from 6 datasets, we demonstrated that scCancer had good performance on each steps analysis and was applicable to a variety of cancer types. And the total single sample analyses of general 10X samples can be ran on a personal computer within acceptable times (Supplementary Figures S12 and the supplementary materials section 14).

Among all the analysis steps, we integrated some widely used packages or algorithms and proposed a few specific methods for cancer applications. For the selection of these methods, we mainly followed three principles by carefully assessing them on all the 56 testing samples: (i) the performance should be stable and insensitive to parameter settings, (ii) the methods are useful or fine-adapted for cancer research and (iii) the implementation is simple and time-efficient. The specifically designed methods included: OCLR for annotating cancer microenvironment cell types, a modified cell malignancy estimation by neighbor smoothing and a modified batch-effect correction method (‘NormalMNN’) considering the inter-tumor heterogeneity. The advantages of OCLR are its performance stability and flexibility. As the quick accumulation of more single-cell data of different cell types, OCLR will be updated to identify more cell types and cellular states in the future. The modified cell malignancy estimation smoothed the original CNV profiles by neighbors, which makes the estimation more stable against the noisy single-cell data. And, it also makes the malignancy scores tend to form bi-model distributions, which are very helpful for selecting a proper threshold. The ‘NormalMNN’ method considers the high inter-tumor heterogeneities between tumor cells from different patients.

Currently, the package mainly focused on solid tumor samples. As for blood cancers, some of the steps may not be very applicable, such as the default cancer microenvironment cell-type setting. In the future, we will consider the characteristics of blood cancers and make scCancer compatible to these situations. Besides, we will provide more optional methods for each step and try to solve more cancer-specific problems.

Key points
  • An R package, named scCancer, is developed for automated processing of single-cell RNA-seq data in cancer.

  • scCancer integrates basic scRNA-seq data processing steps and considers more cancer-specific characteristics and concerns, including comprehensive quality control, cancer microenvironment cell-types identification, cell malignancy estimation, intra-tumor heterogeneities analyses at phenotype and signature levels, cell–cell interaction analyses, etc.

  • scCancer can perform multi-sample integration analysis by using common batch-effect correction methods and new proposed ‘NormalMNN’ algorithm.

  • User-friendly graphic reports can be generated automatically to overview all the analyses results.

Funding

National Natural Science Foundation of China (61922047, 81890993 and 61721003).

Wenbo Guo is a PhD student of bioinformatics at MOE Key Laboratory of Bioinformatics, BNRIST Bioinformatics Division, Department of Automation, Tsinghua University.

Dongfang Wang, PhD, is a postdoctoral fellow of BIOPIC and School of Life Sciences, Peking University.

Shicheng Wang is a Master student of bioinformatics at MOE Key Laboratory of Bioinformatics, BNRIST Bioinformatics Division, Department of Automation, Tsinghua University.

Yiran Shan is a PhD student of bioinformatics at MOE Key Laboratory of Bioinformatics, BNRIST Bioinformatics Division, Department of Automation, Tsinghua University.

Changyi Liu is a Master student of bioinformatics at MOE Key Laboratory of Bioinformatics, BNRIST Bioinformatics Division, Department of Automation, Tsinghua University.

Jin Gu, PhD, is an Associate Professor of bioinformatics at MOE Key Laboratory of Bioinformatics, BNRIST Bioinformatics Division, Department of Automation, Tsinghua University.

References

1.

Puram
SV
,
Tirosh
I
,
Parikh
AS
, et al.
Single-cell transcriptomic analysis of primary and metastatic tumor ecosystems in head and neck cancer
.
Cell
2017
;
171
:
1611
24.e24
.

2.

Collord
G
,
Farndon
SJ
,
Ferdinand
JR
, et al.
Single-cell transcriptomes from human kidneys reveal the cellular identity of renal tumors
.
Science
2018
;
361
:
594
9
.

3.

Peng
J
,
Sun
BF
,
Chen
CY
, et al.
Single-cell RNA-seq highlights intra-tumoral heterogeneity and malignant progression in pancreatic ductal adenocarcinoma
.
Cell Res
2019
;
29
:
725
38
.

4.

Azizi
E
,
Carr
AJ
,
Plitas
G
, et al.
Single-cell map of diverse immune phenotypes in the breast tumor microenvironment
.
Cell
2018
;
174
:
1293
308.e36
.

5.

Chung
W
,
Eum
HH
,
Lee
HO
, et al.
Single-cell RNA-seq enables comprehensive tumour and immune cell profiling in primary breast cancer
.
Nat Commun
2017
;
8
:
1
12
.

6.

Neftel
C
,
Laffy
J
,
Filbin
MG
, et al.
An integrative model of cellular states, plasticity, and genetics for glioblastoma
.
2019
;
Cell
,
178
:
835
49.e21
.

7.

Butler
A
,
Hoffman
P
,
Smibert
P
, et al.
Integrating single-cell transcriptomic data across different conditions, technologies, and species
.
Nat Biotechnol
2018
;
36
:
411
20
.

8.

McCarthy
DJ
,
Campbell
KR
,
Lun
ATL
, et al.
Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R
.
Bioinformatics
2017
;
33
:
1179
86
.

9.

Wolf
FA
,
Angerer
P
,
Theis
FJ
.
SCANPY: large-scale single-cell gene expression data analysis
.
Genome Biol
2018
;
19
:
15
.

10.

Zheng
GXY
,
Terry
JM
,
Belgrader
P
, et al.
Massively parallel digital transcriptional profiling of single cells
.
Nat Commun
2017
;
8
:
1
12
.

11.

Losic
B
,
Craig
AJ
,
Villacorta-Martin
C
, et al.
Intratumoral heterogeneity and clonal evolution in liver cancer
.
Nat Commun
2020
;
11
:
1
15
.

12.

Lun
A
,
Riesenfeld
S
,
Andrews
T
, et al.
EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data
.
Genome Biol
2019
;
20
:
63
.

13.

Ilicic
T
,
Kim
JK
,
Kolodziejczyk
AA
, et al.
Classification of low quality cells from single-cell RNA-seq data
.
Genome Biol
2016
;
17
:
1
15
.

14.

Lun
ATL
,
McCarthy
DJ
,
Marioni
JC
.
A step-by-step workflow for low-level analysis of single-cell RNA-seq data
.
F1000Res
2016
;
5
:
2122
.

15.

Bais
AS
,
Kostka
D
.
Scds: computational annotation of doublets in single-cell RNA sequencing data
.
Bioinformatics
2020
;
36
:
1150
58
.

16.

van den Brink
SC
,
Sage
F
,
Vértesy
Á
, et al.
Single-cell sequencing reveals dissociation-induced gene expression in tissue subpopulations
.
Nat Methods
2017
;
14
:
935
6
.

17.

Young
MD
,
Behjati
S
.
SoupX removes ambient RNA contamination from droplet based single cell RNA sequencing data
.
bioRxiv
2018
;
doi: https://doi.org/10.1101/303727
.

18.

Wang
D
,
Gu
J
.
VASC: dimension reduction and visualization of single-cell RNA-seq data by deep Variational autoencoder
.
Genomics, Proteomics Bioinf
2018
;
16
:
320
31
.

19.

Li
X
,
Chen
W
,
Chen
Y
, et al.
Network embedding-based representation learning for single cell RNA-seq data
.
Nucleic Acids Res
2017
;
45
:
e166
.

20.

Sun
Z
,
Chen
L
,
Xin
H
, et al.
A Bayesian mixture model for clustering droplet-based single-cell transcriptomic data from population studies
.
Nat Commun
2019
;
10
:
1
10
.

21.

Zheng
J
,
Wang
K
.
Emerging deep learning methods for single-cell RNA-seq data analysis
.
Quant Biol
2019
;
7
:
247
54
.

22.

Stuart
T
,
Butler
A
,
Hoffman
P
, et al.
Comprehensive integration of single-cell data
.
Cell
2019
;
177
:
1888
902.e21
.

23.

Zhao
X
,
Wu
S
,
Fang
N
, et al.
Evaluation of single-cell classifiers for single-cell RNA sequencing data sets
.
Brief Bioinform
2019
;
00
:
1
15
.

24.

Sokolov
A
,
Paull
EO
,
Stuart
JM
.
One-class detection of cell states in tumor subtypes
.
Pac Symp Biocomput
2016
;
21
:
405
16
.

25.

Pliner
HA
,
Shendure
J
,
Trapnell
C
.
Supervised classification enables rapid annotation of cell atlases
.
Nat Methods
2019
;
16
:
983
6
.

26.

Patel
AP
,
Tirosh
I
,
Trombetta
JJ
, et al.
Single-cell RNA-seq highlights intratumoral heterogeneity in primary glioblastoma
.
Science
2014
;
344
:
1396
401
.

27.

Malta
TM
,
Sokolov
A
,
Gentles
AJ
, et al.
Machine learning identifies stemness features associated with oncogenic dedifferentiation
.
Cell
2018
;
173
:
338
54.e15
.

28.

Hänzelmann
S
,
Castelo
R
,
Guinney
J
.
GSVA: gene set variation analysis for microarray and RNA-Seq data
.
BMC Bioinf
2013
;
14
:
7
.

29.

Kim
C
,
Gao
R
,
Sei
E
, et al.
Chemoresistance evolution in triple-negative breast cancer delineated by single-cell sequencing
.
Cell
2018
;
173
:
879
93.e13
.

30.

Zhang
L
,
Li
Z
,
Skrzypczynska
KM
, et al.
Single-cell analyses inform mechanisms of myeloid-targeted therapies in colon cancer
.
Cell
2020
;
181
:
442
59.e29
.

31.

Lambrechts
D
,
Wauters
E
,
Boeckx
B
, et al.
Phenotype molding of stromal cells in the lung tumor microenvironment
.
Nat Med
2018
;
24
:
1277
89
.

32.

Subramanian
A
,
Tamayo
P
,
Mootha
VK
, et al.
Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles
.
Proc Natl Acad Sci U S A
2005
;
102
:
15545
50
.

33.

Kumar
MP
,
Du
J
,
Lagoudas
G
, et al.
Analysis of single-cell RNA-Seq identifies cell-cell communication associated with tumor characteristics
.
Cell Rep
2018
;
25
:
1458
68.e4
.

34.

Raredon
MSB
,
Adams
TS
,
Suhail
Y
, et al.
Single-cell connectomic analysis of adult mammalian lungs
.
Sci Adv
2019
;
5
:
eaaw3851
.

35.

Rozenblatt-Rosen
O
,
Regev
A
,
Oberdoerffer
P
, et al.
The human tumor atlas network: charting tumor transitions across space and time at single-cell resolution
.
Cell
2020
;
181
:
236
49
.

36.

Ramilowski
JA
,
Goldberg
T
,
Harshbarger
J
, et al.
A draft network of ligand-receptor-mediated multicellular signalling in human
.
Nat Commun
2015
;
6
:
7866
.

37.

Büttner
M
,
Miao
Z
,
Wolf
FA
, et al.
A test metric for assessing single-cell RNA-seq batch correction
.
Nat Methods
2019
;
16
:
43
9
.

38.

Stuart
T
,
Satija
R
.
Integrative single-cell analysis
.
Nat Rev Genet
2019
;
20
:
257
72
.

39.

Tran
HTN
,
Ang
KS
,
Chevrier
M
, et al.
A benchmark of batch-effect correction methods for single-cell RNA sequencing data
.
Genome Biol
2020
;
21
:
12
.

40.

Korsunsky
I
,
Millard
N
,
Fan
J
, et al.
Fast, sensitive and accurate integration of single-cell data with harmony
.
Nat Methods
2019
;
16
:
1289
96
.

41.

Welch
JD
,
Kozareva
V
,
Ferreira
A
, et al.
Single-cell multi-omic integration compares and contrasts features of brain cell identity
.
Cell
2019
;
177
:
1873
87.e17
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)