- Split View
-
Views
-
Cite
Cite
Ruiqing Zheng, Min Li, Zhenlan Liang, Fang-Xiang Wu, Yi Pan, Jianxin Wang, SinNLRR: a robust subspace clustering method for cell type detection by non-negative and low-rank representation, Bioinformatics, Volume 35, Issue 19, October 2019, Pages 3642–3650, https://doi.org/10.1093/bioinformatics/btz139
- Share Icon Share
Abstract
The development of single-cell RNA-sequencing (scRNA-seq) provides a new perspective to study biological problems at the single-cell level. One of the key issues in scRNA-seq analysis is to resolve the heterogeneity and diversity of cells, which is to cluster the cells into several groups. However, many existing clustering methods are designed to analyze bulk RNA-seq data, it is urgent to develop the new scRNA-seq clustering methods. Moreover, the high noise in scRNA-seq data also brings a lot of challenges to computational methods.
In this study, we propose a novel scRNA-seq cell type detection method based on similarity learning, called SinNLRR. The method is motivated by the self-expression of the cells with the same group. Specifically, we impose the non-negative and low rank structure on the similarity matrix. We apply alternating direction method of multipliers to solve the optimization problem and propose an adaptive penalty selection method to avoid the sensitivity to the parameters. The learned similarity matrix could be incorporated with spectral clustering, t-distributed stochastic neighbor embedding for visualization and Laplace score for prioritizing gene markers. In contrast to other scRNA-seq clustering methods, our method achieves more robust and accurate results on different datasets.
Our MATLAB implementation of SinNLRR is available at, https://github.com/zrq0123/SinNLRR.
Supplementary data are available at Bioinformatics online.
1 Introduction
Analysis of transcriptomic profiling has become a powerful approach to mine biological functions, regulatory relationships and biomarkers of diseases. However, the traditional transcriptomic analysis can only provide the bulk expression of cells, which is insufficient to reveal the states or differences of cells. Recently, the development of single-cell RNA-sequencing (scRNA-seq) techniques has provided a new perspective to study the biological mechanism at the cellular level. One of the major and popular applications of scRNA-seq is to analyze the cellular heterogeneity and identify subtypes of cells from a bunch of cells. The identification of cell types from scRNA-seq is an unsupervised clustering problem. However, the high level of technical noise and notorious dropouts in scRNA-seq would lead to the failure of existing clustering methods (Elowitz et al., 2002; Stegle et al., 2015), it is urgent and challenging to develop new statistical and computational methods (Stegle et al., 2015).
Up to now, a number of computational methods have been proposed to identify cell types based on scRNA-seq profiles. Most of these methods focus on learning better cell–cell similarities. Xu and Su (2015) proposed a clustering method by calculating the similarities between cells based on shared nearest neighbors. Seurat 2.0 (Butler et al., 2018) applied the canonical correlation to construct the weighted K-nearest neighbors graphs. As the single type of similarity cannot characterize all information of scRNA-seq, Wang et al. (2017) designed a multi-kernel based clustering method, called SIMLR, which learned the final similarities from 55 different Gaussian kernels. MPSSC (Park and Zhao, 2018) improved SIMLR by the additional doubly stochastic similarity learning and pairwise sparse structure of the similarity matrix. Zhu et al. (2019) proposed a method to detect the cell type from structural entropy of graphs. Consensus clustering methods enhanced the accuracy by assembling different results of clustering, which avoided the sensitivity of single clustering method. SC3 (Kiselev et al., 2017) obtained different clustering results based on Euclidean distance, Pearson correlation and Spearman correlation, then constructed the consensus matrix by counting the number of each pair of cells in the same cluster and clustered on it again. Tsoucas and Yuan (2018) proposed GiniClust2, a weighted ensemble clustering method based on Gini index-based and factor-based gene selection, to detect rare and common cell types simultaneously. A series of methods, such as CIDR (Lin et al., 2017), scImpute (Li and Li, 2018), netSmooth (Ronen and Akalin, 2018), improved the performance of clustering methods by imputing the dropouts of scRNA-seq. The imputation of dropouts depended on the local similarities of cells or certain biological knowledge. Jiang et al. (2018) defined differentiability correlations between two cells to avoid the bias brought by dropouts. ZIFA (Pierson and Yau, 2015) and ZINB-WaVE (Risso et al., 2018) learned the special low-dimensional representation from the noisy scRNA-seq. SCENIC (Aibar et al., 2017) defined the regulons’ activities based on reconstructed gene regulatory networks to analyze the states of cells, which gave a biological insight into the cellular heterogeneity. In addition to similarity learning, non-negative matrix factorization (NMF) has been successfully applied in the scRNA-seq profiles by regarding the latent dimension as types of metacells (Shao and Höfer, 2017). For large scale scRNA-seq, Sinha et al. (2018) proposed dropClust, a computationally efficient method, which clustered thousands of cells in several minutes.
However, most of the above methods just considered the similarities between pairwise of cells, which were hard to capture the complex relationships among cells. In order to learn more accurate similarity matrix, we proposed a self-expression of data driven clustering method with non-negative and low-rank constraints, called SinNLRR. In SinNLRR, we assumed the cells with the same type were in the same subspace, so the expression of one cell can be described as the combination of the same type of cells’ expressions. SinNLRR found the low-rank and non-negative representation of the expression matrix from all candidate subspaces. It is an optimization problem to learn the similarities among cells. Naturally, an alternating direction method of multipliers (ADMM) (Boyd et al., 2011) is applied to solve the optimization problem. In practice, the learned similarities are really sensitive to the penalty coefficient of low rank. We further designed a criterion to select the proper penalty factor automatically. The criterion takes the minimal number of neighbors of the localized similarity graph into account. Finally, spectral clustering is applied on the learned similarity graph to obtain the clusters. SinNLRR captures the better global structure of the similarity graph from the scRNA-seq profiles, and is effective to get more accurate and robust clustering results. In addition, the similarity matrix can be also used to visualize or prioritize gene markers.
2 Materials and methods
2.1 Non-negative and low-rank representation
Constructing the similarity or distance matrix is a key step in most of the computational methods for identifying cell types. Several pairwise evaluation criterions of similarity or distance, such as Euclidean distance, Pearson and Spearman, have been used. However, these criteria can only capture the local similarities of cells. Recently, a kind of clustering method, called subspace clustering (Liu et al., 2010; Vidal and Favaro, 2014), have been successfully applied to subspace segmentation of images and can characterize the similarity more globally. In this paper, we introduce a typical subspace clustering with low-rank representation and present a modified version to make it applicable to scRNA-seq.
where denotes the Frobenius norm, which is square root of the sum squares of all elements while represents the nuclear norm, which is the sum of all singular values of C, is a penalty factor.
2.2 Selection of penalty coefficient of low rank
According to the schema of the optimization algorithm, there are two user-defined parameters, λ and. In the experiments, we find the structure of learned similarity matrix is really sensitive to the selection of parameter λ. Taking the Pollen’s dataset (Pollen et al., 2014) as an example, the effect of different values of λ is shown in Figure 2.
Figure 2 shows that a proper value of λ would lead to a better similarity matrix corresponding to the real cell types. However, the optimal λ is distinct for different datasets. The parameter λ controls the learned similarity matrix S as follows: (i) when, the diagonal element will be close to 1 and will be close to 1, because the expression of cell can represent itself without the low-rank constraint. The form of S would be similar to Figure 2A. (ii) when, matrix S can be divided into one or a few blocks. For each block, the similarities in each column or row are approximately the same. That is because a very large leads to a lower rank, which is similar to Figure 2D. (iii) when is proper, the value of similarity in each column will look like Figure 2B.
where ‘Count’ denotes the number of similarities satisfying the Boolean function. When we raise parameter λ gradually, the NoMN will drastically jump to a value larger than one. The parameter λ around the tipping point is selected to obtain the final similarity matrix. The detail of the analysis could be found in the Section 3.2. The final similarity is .
2.3 Spectral clustering
Spectral clustering is a popular and efficient method to cluster the points based on the similarity matrix (Von Luxburg, 2007). Spectral clustering has been applied to identify cell types successfully (Park and Zhao, 2018; Wang et al., 2017). In the proposed method, we also adopt the spectral clustering on the learned similarity matrix. The process of spectral clustering used in our method is shown in Figure 4. The details of the spectral clustering could be found in Von Luxburg (2007).
2.4 Framework of SinNLRR
3 Results
3.1 Datasets
We collect 10 datasets of human and mouse scRNA-seq that involve in various tissues and different biological process such as cell development and cell differentiation. These datasets contain different scales of cells from dozens to thousands. Moreover, the datasets are derived from various single-cell RNA-seq techniques (Wu et al., 2013), such as SMARTer, Drop-seq and use different unit count, e.g. RPKM (reads per kilobase of transcript per million mapped reads), FPKM (fragments per kilobase of transcript per million mapped reads). Especially, the Lin dataset (Lin et al., 2017) is a mixed dataset, including GSE41265, GSE42268 and GSE45719 from GEO database. All the original expressions are applied with the log transformation. The detailed descriptions of the datasets are shown in Table 1.
Dataset . | Cells . | Genes . | Cell types . | Protocol . | Units . | Species . |
---|---|---|---|---|---|---|
Darmanis (Darmanis et al., 2015) | 420 | 22085 | 8 | SMARTer | CPM | Homo sapiens |
Goolam (Goolam et al., 2016) | 124 | 40315 | 5 | Smart-seq | CPM | Mus musculus |
Lin (Lin et al., 2017) | 402 | 9437 | 16 | Fusion | TPM | Mus musculus |
Pollen (Pollen et al., 2014) | 249 | 14805 | 11 | SMARTer | TPM | Homo sapiens |
Usoskin (Usoskin, et al., 2015) | 622 | 17772 | 4 | Usoskin(2010) | RPM | Mus musculus |
Treutlein (Treutlein et al., 2014) | 80 | 959 | 5 | SMARTer | FPKM | Mus musculus |
Engel (Engel et al., 2016) | 203 | 23337 | 4 | Smart-seq2 | TPM | Homo sapiens |
Tasic (Tasic et al., 2016) | 1727 | 5832 | 48 | SMARTer | TPM | Mus musculus |
Zeisel (Zeisel et al., 2015) | 3005 | 4412 | 48 | – | UMI | Mus musculus |
Macosko (Macosko et al., 2015) | 6418 | 12822 | 39 | Drop-seq | UMI | Mus musculus |
Dataset . | Cells . | Genes . | Cell types . | Protocol . | Units . | Species . |
---|---|---|---|---|---|---|
Darmanis (Darmanis et al., 2015) | 420 | 22085 | 8 | SMARTer | CPM | Homo sapiens |
Goolam (Goolam et al., 2016) | 124 | 40315 | 5 | Smart-seq | CPM | Mus musculus |
Lin (Lin et al., 2017) | 402 | 9437 | 16 | Fusion | TPM | Mus musculus |
Pollen (Pollen et al., 2014) | 249 | 14805 | 11 | SMARTer | TPM | Homo sapiens |
Usoskin (Usoskin, et al., 2015) | 622 | 17772 | 4 | Usoskin(2010) | RPM | Mus musculus |
Treutlein (Treutlein et al., 2014) | 80 | 959 | 5 | SMARTer | FPKM | Mus musculus |
Engel (Engel et al., 2016) | 203 | 23337 | 4 | Smart-seq2 | TPM | Homo sapiens |
Tasic (Tasic et al., 2016) | 1727 | 5832 | 48 | SMARTer | TPM | Mus musculus |
Zeisel (Zeisel et al., 2015) | 3005 | 4412 | 48 | – | UMI | Mus musculus |
Macosko (Macosko et al., 2015) | 6418 | 12822 | 39 | Drop-seq | UMI | Mus musculus |
Dataset . | Cells . | Genes . | Cell types . | Protocol . | Units . | Species . |
---|---|---|---|---|---|---|
Darmanis (Darmanis et al., 2015) | 420 | 22085 | 8 | SMARTer | CPM | Homo sapiens |
Goolam (Goolam et al., 2016) | 124 | 40315 | 5 | Smart-seq | CPM | Mus musculus |
Lin (Lin et al., 2017) | 402 | 9437 | 16 | Fusion | TPM | Mus musculus |
Pollen (Pollen et al., 2014) | 249 | 14805 | 11 | SMARTer | TPM | Homo sapiens |
Usoskin (Usoskin, et al., 2015) | 622 | 17772 | 4 | Usoskin(2010) | RPM | Mus musculus |
Treutlein (Treutlein et al., 2014) | 80 | 959 | 5 | SMARTer | FPKM | Mus musculus |
Engel (Engel et al., 2016) | 203 | 23337 | 4 | Smart-seq2 | TPM | Homo sapiens |
Tasic (Tasic et al., 2016) | 1727 | 5832 | 48 | SMARTer | TPM | Mus musculus |
Zeisel (Zeisel et al., 2015) | 3005 | 4412 | 48 | – | UMI | Mus musculus |
Macosko (Macosko et al., 2015) | 6418 | 12822 | 39 | Drop-seq | UMI | Mus musculus |
Dataset . | Cells . | Genes . | Cell types . | Protocol . | Units . | Species . |
---|---|---|---|---|---|---|
Darmanis (Darmanis et al., 2015) | 420 | 22085 | 8 | SMARTer | CPM | Homo sapiens |
Goolam (Goolam et al., 2016) | 124 | 40315 | 5 | Smart-seq | CPM | Mus musculus |
Lin (Lin et al., 2017) | 402 | 9437 | 16 | Fusion | TPM | Mus musculus |
Pollen (Pollen et al., 2014) | 249 | 14805 | 11 | SMARTer | TPM | Homo sapiens |
Usoskin (Usoskin, et al., 2015) | 622 | 17772 | 4 | Usoskin(2010) | RPM | Mus musculus |
Treutlein (Treutlein et al., 2014) | 80 | 959 | 5 | SMARTer | FPKM | Mus musculus |
Engel (Engel et al., 2016) | 203 | 23337 | 4 | Smart-seq2 | TPM | Homo sapiens |
Tasic (Tasic et al., 2016) | 1727 | 5832 | 48 | SMARTer | TPM | Mus musculus |
Zeisel (Zeisel et al., 2015) | 3005 | 4412 | 48 | – | UMI | Mus musculus |
Macosko (Macosko et al., 2015) | 6418 | 12822 | 39 | Drop-seq | UMI | Mus musculus |
3.2 Performance metrics
3.3 Parameter selection by NoMN
To solve the sensitivity of SinNLRR with the parameter λ, we propose the NoMN to select λ automatically. The description of NoMN can be found in Section 2.2. We increase the λ from 0 to 2 or 2.5 with interval 0.1 for datasets whose number of cells is smaller than 1000 or otherwise, respectively. The responding change of NoMN, NMI and ARI in datasets of Darmanis, Pollen and Macosko is shown in Figure 6. The NoMN jumps from 1 to a bigger value and increase quickly when λ reaches a certain point. Figure 6 shows that SinNLRR achieves the better performance when λ is around the tipping point. In practice, we determine the proper value of λ when the NoMN is larger than three for the first time for small dataset (the number of cells is smaller than 1000), and larger than one for large scale datasets. The search for the proper λ requires multi-runs of NLRR. However, NLRR can be in parallel computed, and we increase the value of λ with the increment of 0.2 to further speed the calculation up. The analysis of λ in remaining datasets and the effect of parameter could be found in Supplementary Figures S1 and S2.
3.4 Comparative analysis of clustering
In this section, we apply SinNLRR on 10 scRNA-seq datasets described in Table 1. These datasets contain different scales of cells and the subtype numbers. We select five state-of-art methods, SNN-Cliq (Xu and Su, 2015), SIMLR (Wang et al., 2017), NMF (Shao and Höfer, 2017), Corr (Jiang et al., 2018) and MPSSC (Park and Zhao, 2018). In these methods, SNN-Cliq, Corr, SIMLR and MPSSC focus on calculating pairwise similarities between cells or learning the similarity from multi-kernels, and NMF identifies the cell types based on the values in the latent dimension. For fairness, we provide the true number of clusters to Corr, SIMLR, MPSSC and NMF while SNN-Cliq cannot be set to a certain number of clusters, and other parameters are set to default. We use the native spectral clustering (SC) (Von Luxburg, 2007) with Pearson similarity as a baseline method. As the algorithm Corr is time-consuming for big datasets (more than 3 days for cells larger than 1000), we abandon the results of Corr on the datasets Tasic, Zeisel and Macosko. Figure 7 summarizes the NMI and ARI of these methods on the 10 datasets. The proposed method SinNLRR gets the best performance in seven datasets based on NMI and ARI, and gets top two performances in nine datasets. Although the identification of cell types is an unsupervised problem and is complex according to different conditions, the results show the better robustness and ability of generalization of SinNLRR. Moreover, we also analyze the time complex of SinNLRR and the comparison of running times with other methods. The details can be found in Supplementary Material Section D.
In the real biological experiment, the number of clusters is usually inaccessible, so evaluating the number of clusters is another important aspect in clustering methods. Based on the normalized Laplacian matrix L, we apply eigengap (Von Luxburg, 2007) to determine the number of cluster k by maximizing the eigenvalues gap , where is the eigenvalues of the Laplacian matrix L. This approach is also applied in SIMLR and MPSSC. SNN-Cliq and Corr also provided the methods to estimate the number of clusters. The comparison results on 10 datasets are shown in the Supplementary Table S1. Although these methods are weak to estimate the number of clusters accurately, SinNLRR could be a better selection which is closest to the true numbers.
3.5 Visualization and gene markers
Visualization of the scRNA-seq data in the lower dimensional is a powerful approach for biologists to pre-identify the subgroups of cells (Dong et al., 2018; Zhong et al., 2018). The t-distributed stochastic neighbor embedding (t-SNE) is one of the most popular tools for visualization (Maaten and Hinton, 2008). We use the similarity matrix learned by SinNLRR as the input of the modified t-SNE, which is the same with previous studies (Wang et al., 2017), to distinguish the subgroups of cells intuitively. We focus on two datasets Darmanis and Goolam described in Table 1. Darmanis dataset (Darmanis et al., 2015) is a crowd of 420 cells from the adult and fetal human brain, consisting of 62 astrocytes, 20 endothelial, 110 fetal quiescent neurons, 25 fetal_replicating neurons, 16 microglia, 131 neurons, 38 oligodendrocytes and 18 oligodendrocyte precursor cells. The second dataset is from Goolam et al. (2016). The cells in this dataset are derived from mouse embryos, including five stages of development: 2-cell (16 cells), 4-cell (64 cells), 8-cell (32 cells), 16-cell (6 cells) and 32-cell (6 cells). We select the native t-SNE, and similarity matrix based on SIMLR and MPSSC as comparison methods. It should be noted that SIMLR and MPSSC need the true cluster number to learn the similarity matrix, while native t-SNE and SinNLRR don’t, so we use the estimated cluster number instead. The two-dimensional t-SNE plots of two datasets are shown in Figure 8. In Figure 8A, SinNLRR groups the same type of cells better overall. The groups of SIMLR are more compact because of its block structure, but it divides the fetal quiescent neurons and neurons into a few subgroups. All the visualizations based on the learned similarity matrices are better than native t-SNE. As seen in Figure 8B, SinNLRR performs better in Goolam dataset.
The learned cell-to-cell similarity matrix can be further applied to identify the gene markers in each type of cells. We perform the bootstrapped Laplacian score proposed by Wang et al. (2017) on the similarity matrices of Darmanis and Goolam datasets. We select top 10 gene markers of the two datasets, and present the average of log transformed counts and sparsity (the proportion of non-zero expressed cells) in each cell type. The results are shown in Figure 9. In Darmanis dataset, the selected top 10 genes are in agreement with previous studies. The genes AGXT2L1, AQP4, FGFR3 and GJA1 are highly expressed in astrocytes and had been recognized as gene markers, while Opalin and ERMN are oligodendrocytes-specific genes related to the novel transmembrane proteins (Cahoy et al., 2008; Darmanis et al., 2015; Oldham et al., 2008). Especially, FGFR3 and AQP4 act as the important receptors for early astrocyte development. Popson et al. (2014) had identified IFITM1 as a pan-endothelial marker of endothelial cells in the bladder, brain and stomach. STMN2 is a neuron-specific gene both in adult and fetal neurons, and SOX11 is the enriched gene in fetal neurons (Darmanis et al., 2015). In Goolam dataset, OMT2A, OMT2B and OOG1 had been reported as the potential stage-specific genes (Tang et al., 2010). KLF17 was validated to highly expresses in earlier stages of development and was absent in blastocysts (Blakeley et al., 2015) and BTG4 showed the declining trend of expression in early mouse embryos (Yu et al., 2016). OBOX8 was found to express highly around 4-cell stage. APELA, B020004C17rik, EXOC3L2 and PPP1R16B are the novel potential gene markers.
4 Discussion
Identification of the cell types based on scRNA-seq data is one of the basic issues in Human Cell Atlas project (Rozenblatt-Rosen et al., 2017). However, the scRNA-seq data contains high noise and dropouts, which bring great challenges for clustering. In this paper, we have proposed a novel similarity based clustering method, called SinNLRR. SinNLRR is motivated by the self-expression among cells in the same type and assumes the non-negative and low-rank characteristics of the similarity matrix.
SinNLRR exposes global and robust similarity structures than the traditional pairwise similarity metric, such as Pearson correlation or Gaussian kernel. We apply the ADMM to solve the corresponding convex problem and define the NoMN to select the proper penalty factor. The learned similarity matrix could be incorporated with spectral clustering for grouping cells, t-SNE for visualization and Laplacian score for selecting gene markers. We evaluate the performance of SinNLRR on scRNA-seq datasets derived by different single-cell techniques and scales and find SinNLRR achieves more robust and accurate results than other state-of-the-art methods. In additional, SinNLRR could be useful in other applications of scRNA-seq analysis, such as pseudo-time reconstruction (Ji and Ji et al., 2016) and potency measure of cells (Guo et al., 2017; Shi et al., 2018), which require the clustering results or the cell-to-cell networks as a preliminary process.
Besides, several available biological information, such as protein–protein interaction networks and subcellular localization (Li et al., 2019), provides a lot of auxiliary information, which is helpful in gene selection and data imputation while gene regulatory networks (Li et al., 2017; Zheng et al., 2018) present a biological interpretation of cell states. It is promising to incorporate SinNLRR with these data to further enhance the performance. Currently, SinNLRR can handle the datasets with thousand cells in a reasonable time. However, designing the version for really large scale scRNA-seq would be one the of directions in the future researches.
Funding
This work was supported in part by the National Natural Science Foundation of China [61832019, 61622213], the 111 Project (No. B18059); the Hunan Provincial Science and Technology Program [2018WK4001]; and the Fundamental Research Funds for the Central Universities of Central South University [No.2018zzts028].
Conflict of Interest: none declared.
References