TsImpute: an accurate two-step imputation method for single-cell RNA-seq data

Abstract Motivation Single-cell RNA sequencing (scRNA-seq) technology has enabled discovering gene expression patterns at single cell resolution. However, due to technical limitations, there are usually excessive zeros, called “dropouts,” in scRNA-seq data, which may mislead the downstream analysis. Therefore, it is crucial to impute these dropouts to recover the biological information. Results We propose a two-step imputation method called tsImpute to impute scRNA-seq data. At the first step, tsImpute adopts zero-inflated negative binomial distribution to discriminate dropouts from true zeros and performs initial imputation by calculating the expected expression level. At the second step, it conducts clustering with this modified expression matrix, based on which the final distance weighted imputation is performed. Numerical results based on both simulated and real data show that tsImpute achieves favorable performance in terms of gene expression recovery, cell clustering, and differential expression analysis. Availability and implementation The R package of tsImpute is available at https://github.com/ZhengWeihuaYNU/tsImpute.


Data preprocessing
TsImpute takes as input a raw count matrix X m×n with m genes and n cells.To find highly variable genes, we first generate the normalized expression matrix Y = (y ij ) m×n , which is defined as × 1000000 + 1), i = 1, ..., m, j = 1, ..., n. (1) Then we calculate the coefficient of variance (CV) of each normalized gene: in which sd(Y i ) means the standard deviation of Y i .Then, the 2000 genes with highest CV are passed to next step and others genes will not be imputed.Note that in the subsequent step tsImpute still uses the raw counts of these highly variable genes as ZINB model requires count data as input.

Description of datasets
In this article, nine real datasets are used for evaluation, i.e.Darmanis (Darmanis et al., 2015), Ting (Ting et al., 2014), Pollen (Pollen et al., 2014), Huarte (Uriarte Huarte et al., 2021), PBMC (Zheng et al., 2017), Klein (Klein et al., 2015), Baron (Baron et al., 2016), Domingo (Domingo-Gonzalez et al., 2020) and Chu (Chu et al., 2016).The description and availability of these data are displayed as below: 3 Generation of simulated data The simulated datasets used in the article are generated using the R package Splatter (Zappia et al., 2017).More specifically, the datasets can be generated with the following R codes: BiocManager : : i n s t a l l ( ' s p l a t t e r ' ) l i b r a r y ( s p l a t t e r ) r a t e<− −0.5 #f i x t h i s parameter mid<− 5 #s e t t h i s as 3 , 4 or 5 params = newSplatParams ( ) params = setParams ( params , l i s t ( b a t c h C e l l s = 5 0 0 , nGenes =2000 , group .prob = rep ( 0 . 2 , 5 ) , de .prob = c ( 0 .0 5 , 0 .0 8 , 0 .0 1 , 0 . 1 , 0 . 1 ) , de .f a c L o c = 0 .5 , de .f a c S c a l e = 0 .8 , s e e d= 1 ) ) # Generate t h e s i m u l a t i o n d a t a u s i n g S p l a t t e r p a c k a g e sim = s p l a t S i m u l a t e G r o u p s ( params , dropout .shape =rep ( r a t e , 5 ) , dropout .mid = rep ( mid , 5 ) , dropout .type = " group " ) c o u n t s <− as .data .frame ( c o u n t s ( sim ) ) #o b s e r v e d c ou n t m a t r i x t r u e c o u n t s <− as .data .frame ( a s s a y s ( sim ) $TrueCounts ) #t r u e c ou n t m a t r i x dropout <− as .matrix ( a s s a y s ( sim ) $Dropout ) #d r o p o u t m a t r i x

Evaluation metrics
In this article, several evaluation metrics are used to assess the performance of different imputation methods, in this section we list the formulae of these metrics, including rooted mean squared error (RMSE), mean absolute error (MAE), Spearman correlation coefficient, Pearson correlation coefficient, sensitivity, specificity, adjusted Rand Index (ARI) (Hubert and Arabie, 1985) and normalized mutual information (NMI) (Strehl and Ghosh, 2002).

RMSE and MAE
Suppose X = (x 1 , ..., x n ) is the true value and X = ( x1 , ..., xn ) is the predicted value, RMSE and MAE can be calculated by and

Pearson correlation and Spearman correlation
Suppose X = (x 1 , ..., x n ) is the true value and Y = (y 1 , ..., y n ) is the predicted value, Pearson correlation coefficient can be calculated by As for Spearman correlation, it is defined as where d i is the rank difference between x i and y i .

Sensitivity, specificity and balanced accuracy
Consider the confusion matrix of imputation: Imputed Not imputed Dropouts TP FN True zeros FP TN Sensitivity, specificity and balanced accuracy can be calculated as follows: Specif icity = Balanced accuracy = sensitivity + specif icity 2 . (9)

ARI and NMI
ARI and NMI are two popular evaluation metrics which assess the performance of clustering analysis, ARI is defined as where N is the number of cells, N ij is the number of cells of the real cell type C * j ∈ A * assgined to cluster C i in partition A, and N j is the number of cells of cell type C * j .The value range of ARI is [−1, 1], a higher ARI indicates better clustering results, and ARI equals to 1 only when the clustering result is identical to the real cell type partition.NMI is defined as follows: in which ).The value of NMI falls between [0, 1] and a larger NMI means better clustering performance.

Silhouette coefficient
Silhouette coefficient is a metric that can be used to evaluate clustering results without knowing the ground truth type labels.For a given cell i, silhouette coefficient can be calculated as follows: where a(i) is the average distance between cell i and other cells of the same cluster, and b(i) denotes the average distance between cell i and its nearest cluster.The overall result is generated by averaging silhouette coefficients of all cells.The value of silhouette coefficient is between -1 and 1, and a larger value indicates better clustering performance.
6 Pseudo-codes of EM algorithm for ZINB paramters estimation In this article, the parameters of ZINB distribution are estimated with expectation maximization (EM) algorithm (Dempster et al., 1977).EM algorithm optimizes the parameters in a iterative manner: Algorithm 1 EM algorithm for ZINB parameters estimation Input: a gene vector g = (x 1 , ..., x m ) Process: Initialize the ZINB parameters π = π 0 , r = r 0 and p = p 0 , maximum number of iterations t max , minimum difference , t = 1, lik = 0, dif f = 100 while t ≤ t max and dif f ≥ do M step: estimate p t and r t by maximizing end while return π, r, p 7 Pseudo-codes of tsImpute Algorithm 2 Pseudo-codes of tsImpute Input: expression matrix X m•n , number of top genes n top and dropout threshold thres Process: Binarize each cell vector x i (i = 1, ..., n) by converting the n top highest expressed genes to 1 and others to 0 Decide the optimal number of clusters k according silhouette coefficient Divide cells into k clusters with Jaccard distance, in which cluster i includes n i cells Step 1: ZINB imputation: for i in 1:k do for j in 1:m do Estimate ZINB parameters π i j , r i j , p i j for gene j in cells of cluster i end for for l in 1:n i do Calculate scale factor

Figure S1 :
Figure S1: Cell-wise and gene-wise Pearson correlation between the real and imputed values within different cell types of the simulated data.Higher correlation coefficients indicate better imputation performance.(A)Cell-wise correlation.(B) Gene-wise correlation.

Table S1 :
Description and availability of datasets used in the article.

Table S2 :
Given the ZINB imputed expression matrix X init , normalize the expression levels to generate Y Divide cells into C clusters, where cluster c includes n c cells for i in 1:C do Calculate the Euclidean distance matrix D i according to Y i Calculate the inverse distance matrix W i for j = 1 : m do for k = 1 : n c do In this section, we display the time consumption of different methods on eight real data used in data masking experiment.All methods are implemented on Apple MacBook Pro M1 2020 (3.2 GHz Apple M1 8-core CPU and 16-GB RAM) for ten times, and the time cost table is shown as below (average time cost ± standard deviation): Time consumption of different methods on real datasets.

Table S3 :
Sensitivity, specificity and balanced accuracy of all methods on simulated data.

Table S4 :
Results of ablation test measured by NMI.Best results are marked in bold.Jaccard clustering ZINB imputation IDW imputation Pollen Ting Darmanis Huarte Klein Baron PBMC Domingo

Table S5 :
Robustness test of Jaccard clustering.ARI and NMI are used as evaluation metrics.