Imputing single-cell RNA-seq data by considering cell heterogeneity and prior expression of dropouts

Abstract Single-cell RNA sequencing (scRNA-seq) provides a powerful tool to determine expression patterns of thousands of individual cells. However, the analysis of scRNA-seq data remains a computational challenge due to the high technical noise such as the presence of dropout events that lead to a large proportion of zeros for expressed genes. Taking into account the cell heterogeneity and the relationship between dropout rate and expected expression level, we present a cell sub-population based bounded low-rank (PBLR) method to impute the dropouts of scRNA-seq data. Through application to both simulated and real scRNA-seq datasets, PBLR is shown to be effective in recovering dropout events, and it can dramatically improve the low-dimensional representation and the recovery of gene‒gene relationships masked by dropout events compared to several state-of-the-art methods. Moreover, PBLR also detects accurate and robust cell sub-populations automatically, shedding light on its flexibility and generality for scRNA-seq data analysis.

and the parameters used are shown in Supplemental Material Table S1. The four real datasets include two small datasets collected from human and mouse embryonic development, and two large-scale datasets (~20k cells). HEE dataset (Yan et al., 2013) consists of 88 cells from seven stages (from oocytes to blastocyst) during human early embryo (HEE) development. We obtained a data matrix with 16658 genes across 88 cells after filtering out genes that were expressed in less than 5 cells. MEF dataset (Treutlein et al., 2016) describes the reprogramming from mouse embryonic fibroblasts (MEFs) to induce neuronal (iN) cells. To reconstruct the reprogramming path from MEFs to iN cells, similar to the original study (Treutlein et al., 2016), we used 221 cells collected at multiple time points (0, 2, 5, 22 days) after removing cells that appeared stalled in reprogramming due to Ascl1 silencing or cells converging on the alternative myogenic fate.
In mouse retinal dataset, ∼28,000 cells were profiled from a transgenic mouse line that marks bipolar cells (BCs) of the mouse retina (Shekhar et al., 2016). We obtained the digital expression matrix of 13,166 appreciably expressed genes across 27,499 cells using the R Markdown Code (https://github.com/broadinstitute/BipolarCell2016). 18 cell clusters, including 13 cone BC clusters, 1 rod bipolar cell cluster and 4 non-BC clusters, were obtained in the original study. In Campbell dataset, 20,921 cells were profiled from acutely dissociated Arc-ME cells of adult mice under six feeding conditions: ad libitum access to standard mouse 2 chow, low-fat diet or high-fat diet, as well as overnight fasting, with or without subsequent refeeding (Campbell et al., 2017). 20 distinct clusters (including neuron and non-neuron cell types) was identified in the original study. To visualize cells in the low dimensional space, we used Seurat package and performed batch effect correction (using Combat), following the same procedure as the original study. As batch effect correction will totally change the proportion of zeros in the raw digital expression matrix, we thus used the raw expression matrix as an input of various imputation methods. We performed batch correction on the imputed data by scImpute and SAVER, and then run Seurat pipeline to project cells onto t-SNE space. We did not perform batch correction on the PBLR-imputed data because we did not observe explicit batch effects after projecting cells onto t-SNE space.
To evaluate the performance of PBLR in identifying cell subpopulations, we adopted five real datasets (Supplemental Material Table S5). Deng dataset (Deng et al., 2014) consists of 22431 genes across 268 cells, which was taken from the mouse embryo development process from zygote to blastocyst. Pollen dataset (Pollen et al., 2014) contains 301 single cells across diverse tissues, including neural cells and blood cells. This dataset was used to test the utility of low-coverage scRNA-seq to identify cell subpopulations. Darmanis dataset (Darmanis et al., 2015) was used to capture the cellular complexity of the adult and fetal human brain, including 20214 genes across 90 cells. These cells were divided into six groups, including astrocytes, endothelial, microglia, neurons, fetal quiescent and fetal replicating.
Zeisel dataset (Zeisel et al., 2015) contains 3005 single cells came from mouse cortex and hippocampus. The cells were collected by unique molecule identifier (UMI) and divided into nine clusters. Treutlein dataset (Treutlein et al., 2014) was taken from distal mouse lung epithelial cells at different developmental stages. We used 80 single cells at E18.5 stage, which were clustered into five groups including BP, AT1, AT2, Clara and Ciliated.

Symmetric non-negative matrix factorization (SymNMF)
SymNMF decomposes a non-negative affinity matrix into two symmetric non-negative lowrank matrices as follows, where A is the affinity matrix and H is the non-negative low-rank matrix, which can be used to indicate clustering assignment. As SymNMF is a non-convex problem that may lead to the assignment being not unique, we repeat it 20 times with random initial values.

Incomplete non-negative matrix factorization (INMF)
Let Ms represent the raw expression matrix with selected genes as its rows and cells as its , tol = 10 -6 and set the iteration step t=0.
repeat Steps 2-4 until the following convergence criterion is satisfied:

PBLR algorithm
Algorithm 2: PBLR  Step 1: Input raw data M, cluster number K, outer iterations N, threshold c.


Step 2: Data filtering and normalization.   (Table S1). (B) The Spearman correlation coefficient (SCC) of differential genes across all cells between full data and raw or imputed data. between the full data and the raw data as well the imputed ones. All of these three methods decrease the SSE values and improves PCC values relative to that of the raw data. However, PBLR gives more accurate imputed data than other two methods in terms of SEE and PCC values. (D) PCA plot on raw data, full data, and imputed data matrices by scImpute, SAVER, PBLR, respectively. The three cell clusters are distinguishable on the full data although the red and green clusters are close to each other, but they become less well separated in the raw data with dropout events. However, the relationships among these three clusters are separated after we applied PBLR. For other two methods, scImpute cannot separate the red and blue clusters. Although SAVER can distinguish these three clusters, it changes the data distribution as the relative distances between the three clusters are very different to that of the full data.           dropout.type "group" "group" Null Null Null "group" "group" "group" Method "groups" "groups" "groups" "groups" "path" "path" "path" "groups"   Table S4. Marker genes of 20 cell subpopulations in Campbell scRNA-seq dataset. These marker genes were obtained from the figure 1d in the original study (Campbell et.al, 2017