More accurate estimation of cell composition in bulk expression through robust integration of single-cell information

Abstract Motivation The rapid single-cell transcriptomic technology developments have led to an increasing interest in cellular heterogeneity within cell populations. Although cell-type proportions can be obtained directly from single-cell RNA sequencing (scRNA-seq), it is costly and not feasible in every study. Alternatively, with fewer experimental complications, cell-type compositions are characterized from bulk RNA-seq data. Many computational tools have been developed and reported in the literature. However, they fail to appropriately incorporate the covariance structures in both scRNA-seq and bulk RNA-seq datasets in use. Results We present a covariance-based single-cell decomposition (CSCD) method that estimates cell-type proportions in bulk data through building a reference expression profile based on a single-cell data, and learning gene-specific bulk expression transformations using a constrained linear inverse model. The approach is similar to Bisque, a cell-type decomposition method that was recently developed. Bisque is limited to a univariate model, thus unable to incorporate gene-gene correlations into the analysis. We introduce a more advanced model that successfully incorporates the covariance structures in both scRNA-seq and bulk RNA-seq datasets into the analysis, and fixes the collinearity issue by utilizing a linear shrinkage estimation of the corresponding covariance matrices. We applied CSCD to several publicly available datasets and measured the performance of CSCD, Bisque and six other common methods in the literature. Our results indicate that CSCD is more accurate and comprehensive than most of the existing methods. Availability and implementation The R package is available on https://github.com/empiricalbayes/CSCDRNA.


Introduction
Gene expression profiling measures mRNA levels, showing the pattern of genes expressed by a cell at the transcription level (Fielden and Zacharewski, 2001). It is widely practised by a variety of biomedical researchers, molecular biologists and environmental toxicologists, to characterize cellular or disease states (Tsoucas et al., 2019).
RNA-sequencing of the individual cell or single-cell RNAsequencing (scRNA-seq) has enabled dissecting transcriptomic heterogeneity, and led to the discovery of novel cell types (Zheng et al., 2017). The rapid single-cell transcriptomic technology developments have led to an increasing interest in cellular heterogeneity within cell populations. Cell-type compositions may be characterized from bulk RNA-seq data, which typically represent total gene expression levels of heterogeneous cell types within tissues.
Many computational tools have been developed in the literature for dissecting cellular content from bulk samples. Perhaps the nonnegative least-squares (NNLS) and ordinary least squares (OLS) methods are the most simple approaches utilized in the cell-type decomposition literature (Abbas et al., 2009;Avila Cobos et al., 2020). However, as we show in our data analysis, they may either fail or lead to poor performance. CIBERSORT (Newman et al., 2015) and its next generation, CIBERSORTx (Newman et al., 2019), extract cell-type-specific gene expression signatures from a reference scRNA-seq dataset. They learn cell-type decomposition in bulk expression data by applying a linear support vector regression (SVR) method. These machine learning-based tools were originally designed to analyze microarray data and are expected to perform well for purified cell populations. BSEQ-sc (Baron et al., 2016) extends CIBERSORT and allows for building a reference profile based on scRNA-seq gene expression. Similar to the above methods, the dampened weighted least squares (DWLS) approach of Tsoucas et al. (2019) uses scRNA-seq data to extract cell-type-specific gene expression signatures, and estimates cell-type proportions by improving the OLS approach. The corresponding tool also outputs cell-type proportions estimated by the OLS and SVR approaches based on the generated signature matrix. Note that because the signature matrices created by DWLS and CIBERSORTx are different, the SVR approach used in both tools leads to distinct estimates for the same cell-type proportions.
MuSiC (Wang et al., 2019) performs cell-type decomposition without the need to build a gene expression signature matrix. It utilizes gene expression levels in the scRNA-seq reference data and applies a constrained NNLS regression model to estimate cell-type proportions in the bulk expression data. The corresponding tool also outputs estimated cell-type proportions using the ordinary NNLS method. MuSiC was shown to outperform NNLS, CIBERSORT and BSEQ-sc on two real datasets.
The above-mentioned approaches do not apply any transformation to either scRNA-seq or bulk RNA expression datasets, and assume that there is a direct proportional relationship between both datasets. Bisque (Jew et al., 2020) highlights that different technologies used to generate bulk and scRNA-seq data may cause an issue for decomposition models and the direct proportional relationship assumption may not be true. It was shown in the corresponding publication that these differences may introduce gene-specific biases that negatively impact the correlation between cell-type-specific and bulk tissue measurements. It was concluded that expression profiles of cell types in heterogeneous tissues may not follow the direct proportionality assumptions of regression-based methods. By accounting for such biases, Bisque estimates cell-type proportions in bulk expression data through robust integration of single-cell information, and uses a normalizing transformation on both bulk and single-cell expressions. Bisque was applied on two datasets in which both bulk and single-cell expressions were collected from the same individuals. It was shown that Bisque outperforms CIBERSORT, CIBERSORTx, MuSiC and BSEQ-sc. However, Bisque does not account for gene-gene correlations. The normalization in Bisque is done for each gene separately, while multivariate normalization that accounts for gene-gene correlations seems to be more appropriate. Indeed, to the best of our knowledge, none of the existing methods in the literature have incorporated a multivariate normalization technique to provide such useful information into the analysis, perhaps due to the existence of high gene-gene correlations which in turn leads to the collinearity issue. Here we introduce covariance-based single-cell decomposition (CSCD), an advanced model for analyzing bulk RNAseq datasets that improves on the Bisque method. It estimates cell-type proportions more accurately and successfully utilizes gene-gene correlations into the analysis. We fix the collinearity issue by utilizing a linear shrinkage estimation of the corresponding covariance matrix. We apply CSCD to several publicly available datasets in which both bulk and scRNA-seq were collected from the same individuals. Using gene expression levels in both bulk RNAseq and scRNA-seq data for the same individuals maximizes measurement accuracy, as the real cell-type proportions are known in advance. We compare the performance of CSCD with Bisque, CIBERSORTx, MuSiC, DWLS, OLS, NNLS and SVR. Although our primary aim is to improve on Bisque, our results indicate that CSCD is more accurate and comprehensive than most of the other existing methods used in this article, as well.
The structure of this article is as follows. In Section 1, we introduce the notations used in this article, and explain the method in detail. As our proposed method is built on the model used in Bisque, we quickly review its workflow. In Section 2, we evaluate the performance of CSCD, Bisque, CIBERSORTx, MuSiC, DWLS, OLS, NNLS and SVR on simulated data as well as four publicly available datasets. In Section 3, we compare runtimes of these methods. Discussion and concluding remarks are provided in Section 4.

Notation
Let C, J and I represent cell types, subjects in the single-cell data and subjects in the bulk expression data, respectively. Also let G represent the total number of genes common in both bulk and single-cell data. For each gene g 2 f1; . . . ; Gg and cell type c 2 f1; . . . ; Cg, we denote average relative abundances across all subjects in the singlecell dataset by Z gc . For each cell type c and subject j 2 f1; . . . ; Jg, let p cj represent the fraction of cell type c. For each gene g and subject i 2 f1; . . . ; Ig, we denote counts per million converted relative abundances in the bulk data by X gi .
Given p cj and Z gc , the pseudo-bulk for each gene g and subject j is given by Y gj ¼ P C c¼1 p cj Z gc , which is a weighted sum of average relative abundances. Let Y ¼ ½Y gj be the matrix containing all Y gj with genes in rows and subjects in columns. Also, let X ¼ ½X gi represent the bulk expression data matrix. The corresponding sample mean vectors are given by Y ¼ ½Y 1: . . . Y G: T and X ¼ ½X 1: . . . X G: T , with Y g: ¼ 1 J P J j¼1 Y gj and X g: ¼ 1 The sample covariance matrices associated with the bulk and single-cell expression datasets are then respectively given by where for g; k 2 1; . . . ; G f g ; s Y gk ¼ 1 J P J j¼1 ðY gj À Y g: ÞðY kj À Y k: Þ and s X gk ¼ 1 I P I i¼1 ðX gi À X g: ÞðX ki À X k: Þ.

A novel method
Bisque estimates cell-type proportions p ci in bulk RNA-seq data through the integration of single-cell information. In a preprocessing step, Bisque excludes genes with zero variance in the single-cell dataset, unexpressed genes in the bulk expression and mitochondrial genes. Since the distribution of bulk expression (X gi ) may differ from the distribution of the single-cell expression (Y gj ), Bisque re-scales the bulk expression data so that the two distributions match more closely. This maximizes the linear relationship across all genes for improved decomposition. Bisque replaces X gi by the gene-specific transformation X u gi so that it satisfies where J Jþ1 is a variance shrinkage factor, due to the small number of subjects in the single-cell dataset. Then, for each subject i in the bulk expression data, cell-type proportion p ci is estimated using the NNLS regression with a sum-to-one constraint, i.e. by minimizing the norm . . such that P C c¼1 p ci ¼ 1. We apply the same pre-processing steps used in Bisque. Although the univariate transformation X u i in Bisque accounts for changes across individuals in both the bulk and single-cell datasets, it is unable to account for gene-gene correlations. Consequently, this may affect the estimation accuracy. Instead of the transformation X u i , we suggest replacing X by X m so that it satisfiesð ffiffiffiffiffiffi J Jþ1 Y and S À 1 2 X stand for the root of the sample covariance matrix S Y and the inverse of the root of the sample covariance matrix S X , respectively. The above transformation leads to inclusion of gene-gene correlations into the model. However, calculating the inverse and/or the root of the covariance matrices appeared in the transformation may be a challenge. Gene expression covariance matrices have a very large dimension, compared to the number of observations/subjects. It is well known that such covariance matrices perform poorly, are expected to suffer from ill-conditioning, and might not even be invertible. To overcome such challenges, there are ways to arrive at modified covariance matrix estimators. Incorporating some additional knowledge in the estimation process, such as sparseness, a graph model ora factor model may be ideal, see for example Bickel and Levina (2008), Khare and Rajaratnam (2011) and Cai and Zhou (2012) among many others. However, such additional knowledge is not always available. Alternatively, it may be reasonable to look for rotation-equivariant covariance matrix estimators, i.e. matrices with the same eigenvectors. This implies that a sample covariance matrix can be differentiated by its eigenvalues. On the other hand, according to Ledoit and Wolf (2004), the largest sample eigenvalues are biased upwards, and the smallest ones downwards. Therefore, it is possible to modify the sample covariance matrix by shrinking its eigenvalues towards their grand mean. Indeed, shrinkage estimation of the covariance matrix has become one of the most successful approaches. Ledoit and Wolf (2004) introduce a linear shrinkage estimation of the covariance matrix, and later, Ledoit and Wolf (2012) present a non-linear shrinkage estimation method. The latter clarifies that if the number of variables (genes in the current study) are significantly higher than the number of observations, the linear shrinkage estimation is more appropriate than the non-linear approach. To fix the above-mentioned issue, we apply the linear shrinkage estimation of Ledoit and Wolf (2004) on the sample covariance matrices S Y and S X , and denote the resulting estimates by S Y;s and S X;s , respectively. The final transformed bulk expression data matrix is then given by Cell-type proportions can now be obtained by minimizing the norm (1), with X u i being replaced by the i-th column of X m s . The covariance-based bulk expression transformation X m s in (2) is ideal due to the fact that it is able to extract and utilize all the information (univariate and multivariate changes) from both bulk and single-cell datasets into analysis. Nevertheless, the inclusion of all genes to the analysis may lead to imprecise results. Obviously, if one categorizes all genes to several cell-type clusters, some genes may appear in all or most of them. This may negatively influence the cell-type proportion estimation accuracy. To overcome this issue, we identify cluster marker genes (marker genes that are recognized to be differentially expressed within the cell-type clusters) using the FindAllMarkers function of the Seurat package (Hao et al., 2021). In this approach, the expression for a given gene g among cells in a given cell-type cluster c is compared against the expression of that gene among cells in all other cell-type clusters, say c. We look at the percentage of cells in the cluster c where the gene is detected (say p gc ) and the percentage of cells of other types (say p gc ). Ideally, a cluster marker gene is expected to be expressed exclusively in that cluster and silenced in all other clusters and thus for each gene g in cluster c, p gc is expected to be high (towards one) and p gc is expected to be low (towards zero). It may be reasonable to consider a 0.5 threshold for the difference between p gc and p gc . Thus, we excluded marker genes for which jp gc À p gc j < 0:5. The 0.5 threshold may be replaced by some arbitrary values. The smaller this threshold, the more genes to be included in the final list of gene markers. We further remove common marker genes that appear in multiple clusters. This leads to a list of final marker genes that each falls into only one cell-type cluster and reduces the number of common genes between cell types to zero. The final gene list is then used to compute the transformed expression levels X m s in (2). The final cell-type proportions p i are given by minimizing the norm (1) with X u i being replaced by the i-th column of X m s .

Results
The reference-based decomposition approach presented in this article requires both bulk and scRNA-seq datasets that contain expression levels of multiple subjects. ScRNA-seq data may be collected from individuals involved in the same study or similar studies conducted by other labs. It is assumed that scRNA-seq data includes cell-type as well as subject labels. A reference profile (Z) is constructed by read count abundance averaging within each cell-type category. Given the generated single-cell profile and cell-type proportions observed in the single-cell data, the method applies a multivariate transformation to the original bulk data. Given the single-cell reference profile and transformed bulk expression data, CSCD learns cell-type composition using a constrained linear inverse model (Wang et al., 2017).

Performance evaluation
In this section, we evaluate the performance of CSCD and seven other common tools (Bisque, CIBERSORTx, MuSiC, DWLS, OLS, NNLS and SVR) on several publicly available datasets. Table 1 represents cell counts in the single-cell datasets used in this article. Accuracy was maximized by using datasets for which both bulk and single-cell expression levels were available from the same individuals. Thus, real cell-type proportions for all samples in the bulk expression data are known. The Spearman correlation (R) has become a common tool in measuring cell-type estimation accuracy in the literature. In addition to reporting R, we measure the difference between the real and estimated cell-type proportions by mean squared error (MSE) and mean absolute error (MAE) to evaluate the performance of different methods. The higher the correlation R, the better the model performance. Similarly, the smaller the MSE (or MAE), the more accurate the corresponding method.

Evaluation using simulated data
To evaluate the performance of CSCD, we first performed a simulation study. We used ESCO (Tian et al., 2021) to simulate realistic single-cell expression data by incorporating gene co-expression. The simulator relies on parameters such as number of genes ðn g Þ, number of cells ðn c Þ, number of cell types ðN c Þ, probability that a gene is differentially expressed in a group (de:prob), location parameter for the differential expression factor log-normal distribution (de:facLoc), scale parameter for the differential expression factor log-normal distribution (de:facScale) and percentage of different cell types. Using ESCO, we simulated single-cell datasets in 100 iterations with parameters n g ¼ 10 000, n c ¼ 1000, N c ¼ 5, de:prob ¼ 0:7; de:facLoc ¼ 3; de:facScale ¼ 1. These parameter values where chosen randomly. To ensure that the simulation allows for different cell-type proportions, in each iteration, we generated random percentages of different cell types, say, p c1 ; p c2 ; p c3 ; p c4 ; p c5 so that P 5 i¼1 p ci ¼ 1. We also generated pseudo-bulk data based on the single-cell data simulated in each iteration. We then used all the eight methods to estimate cell-type proportions in each iteration and calculated averages of the corresponding R, MSE and MAE values, as presented in Table 2

Evaluation on human embryonic stem cells
We first benchmarked all the eight decomposition methods using expression data collected from 1018 purified human embryonic stem cell samples of seven cell types generated from various lineagespecific progenitors (Chu et al., 2016). See Table 1 for the distribution of the cells in the corresponding single-cell reference data. A total of 19 bulk samples from these cell types were sequenced. This includes duplicates for definitive endoderm cell, neuronal progenitor cell and trophoblast-like, triplicates for H9, EC and HFF, and four replicates for H1. Both bulk and scRNA-seq data were collected from the same individuals. For more accuracy, we merged replicates of different cell types which reduced the bulk data to seven samples. Here, with a different purpose than Chu et al. (2016), we analyzed the bulk and single-cell expression data to assess the accuracy of the different cell-type decomposition methods. Figure 1 represents a jitter plot of the estimated cell-type proportions using the different methods for the seven purified bulk samples. Each panel represents real and estimated values for the labeled cell type. Because cell samples are purified, only one individual in each panel is expected to have a cell-type proportion of one and the remaining cell-type proportions should be zero. The figure represents proportions estimated by CSCD, Bisque, CIBERSORTx, SVR, OLS and DWLS. MuSiC and NNLS failed due to a 'not enough valid cell type' error. Based on the real and estimated cell-type proportions, it is obvious from the jitter plot that CIBERSORTx and SVR performed very well. This was expected, as CIBERSORTx was originally developed for data that utilizes a reference generated from purified cell populations, see Jew et al. (2020). The SVR approach implemented in DWLS also uses a similar approach to estimate celltype proportions. Based on the figure, DWLS seems to be the third top method. The performance of these tools is compared in Table 3, which summarizes the R, MSE and MAE measurements. The results in the table also confirm CIBERSORTx and SVR methods significantly outperformed the other methods. DWLS also outperformed the rest of the methods. Although Bisque led to a bit higher correlation than CSCD, it led to higher MSE and MAE values than CSCD. In two of the three measurements, CSCD outperformed Bisque.

Evaluation on melanoma tumor data
We also benchmarked all the eight decomposition methods using expression data collected from 19 freshly procured melanoma tumors spanning a range of clinical and therapeutic backgrounds (Tirosh et al., 2016). The corresponding scRNA-seq data includes cells isolated from 19 patients, profiling malignant, immune, stromal and endothelial cells. The cells have been grouped as either malignant (1668 cells) or non-malignant (2755 cells) on the basis of their expression profiles and using the t-distributed stochastic neighbor embedding (t-SNE) method (Van der Maaten and Hinton, 2008), see Table 1. Bulk gene expression levels for the same individuals were retrieved from Cibersortx's tutorial webpage. Figure 2 represents a jitter plot of the estimated cell-type proportions using the different methods for the 19 bulk samples. Comparing the real and estimated cell-type proportions, one may conclude that CSCD, Bisque and Cibersortx outperformed the other methods. However, for a more quantitative comparison, Table 4 summarizes the R, MSE and MAE measurements. The table illustrates that all methods led to high correlation, with CIBERSORTx gaining the highest correlation, followed by CSCD and Bisque. Looking at the MSE and MAE values, we observe that CSCD outperformed all the other methods. Bisque and CIBERSORTx were the second and third top methods leading to minimum MSEs and MAEs.

Evaluation on pancreatic islets data
We also benchmarked these decomposition methods using expression data collected from human pancreatic islets (Segerstolpe et al., 2016). The scRNA-seq data were collected from six healthy and four type-2 diabetic (T2D) donors. The bulk data were collected from three of the six healthy individuals in the scRNA-seq data and four different T2D donors.
To maximize measurement precision, we evaluated the performance of the different methods based on the three healthy donors for which both bulk and scRNA-seq data are available. See Table 1 for the cell counts. We excluded the unclassified cell from our analysis. Figure 3 represents a jitter plot of the estimated cell-type proportions using the different methods for the three common donors. Different methods performed differently. It is not easy to make an overall judgment based on this figure alone. However, the figure is informative in the beta estimates, which was reported to be very important (Wang et al., 2019). The real beta proportions in the scRNA-seq data are around 20 percent. Also, CSCD was the only method that correctly detected real beta proportions in the bulk samples. For a more quantitative comparison, Table 5 summarizes the R, MSE and MAE values. The table illustrates that all methods led to positive correlation, with CSCD gaining the highest correlation. Looking at the MSE and MAE measurements, we observe that CSCD outperformed all the other methods. Bisque and SVR were the second and third top methods leading to minimum MSEs and MAEs. Cell-specific decomposition accuracy was also evaluated using two other well-characterized datasets. Fadista et al. (2014) provide a comprehensive catalog of novel genetic variants influencing gene expression and metabolic phenotypes in human pancreatic islets. Their study includes expression levels of 89 bulk donors. The data also includes hemoglobin A1c (HbA1c) levels, which is an important biomarker for T2D and is expected to have a negative relationship with beta proportions (Wang et al., 2019). Indeed, T2D is characterized by the inability of the beta cells to secrete enough insulin to overcome insulin resistance in peripheral tissues (Xin et al., 2016). Of the 89 bulk samples, the HbA1c level information is available for 77 individuals, which includes 51 healthy and 26 diabetic individuals. Xin et al. (2016) used single-cell RNA-seq to measure the transcriptomes in 1492 human islet cells from 12 healthy and 6 T2D organ donors. They focused on four major endocrine cell types (alpha, beta, delta and gamma). The scRNA-seq data for healthy samples was composed of 354 alpha, 194 beta, 28 delta and 33 gamma cells, while the scRNA-seq data for T2D samples was composed of 532 alpha, 278 beta, 21 delta and 52 gamma cells.
We used the scRNA-seq datasets to capture cell-type proportions in the bulk samples. Figure 4 represents estimated beta cell-type proportions versus HbA1c levels (after outlier removal) as well as P-values for a negative Spearman correlation test (a one-sided Spearman test for a negative correlation) between the beta cell-type proportions and HbA1c levels. Based on the P-values, DWLS, CSCD, SVR and CIBERSORTx were successfully able to detect a strong negative relationship between beta proportions and HbA1c levels. It is remarkable that SVR and OLS reported lower estimated values than the other methods.

Evaluation on cortex tissue data
As another application, we benchmarked all these decomposition methods using expression data collected from dorsolateral prefrontal cortex (DLPFC). This dataset was generated by the Rush Alzheimer's Disease (AD) Center (Mostafavi et al., 2018) and includes 636 postmortem bulk RNA-seq samples. The scRNA-seq data were collected from 24 of the 636 individuals. The scRNA-seq data include 162 767 cells that have already been clustered into eight clusters, as provided in Table 1. To manage the memory limit available on our system, we randomly selected eight individuals and this reduced the number of cells to 60 388 cells, see Table 1. We remark that most scRNA-seq data consist of only a few samples. Similarly, Bisque (Jew et al., 2020) developers also used eight individuals to study the performance of their method. We further, similar to Bisque, sampled 25% of the nuclei in the scRNA-seq data to accommodate CIBERSORTx's uploading limitation, which currently is 1 GB. Figure 5 represents a jitter plot of the estimated cell-type proportions using the different methods for the eight individuals common in both bulk and scRNA-seq samples. Each panel represents real and estimated values for the labeled cell type. From the figure, it is obvious that all methods except Bisque and CSCD either underestimated or over-estimated the proportions. CIBERSORTx, MuSiC and NNLS under-estimated Peri cell-type proportions while SVR, DWLS and OLS over-estimated them. All the methods except Bisque and CSCD under-estimated Exc proportions. Quantitative comparison of data is summarized in Table 6 and shows the corresponding R, MSE and MAE values. From this table, we observe that all methods led to poor correlation results. CIBERSORTx led to a negative correlation, and CSCD, DWLS and SVR led to a bit higher correlation. Looking at MSE and MAE measurements in Table 6, it  is observed that CSCD outperformed all the other tools. Bisque and DWLS were the second and third top methods, respectively. We further applied these decomposition methods to estimate celltype proportions in the remaining 628 individuals based on expression levels of the eight distinct individuals in the scRNA-seq data. Figure 6 represents the observed scRNA-seq proportions as well as estimated proportions using the eight methods. The figure illustrates that both CSCD and Bisque were successfully able to preserve the distribution of the scRNA-seq data, and produced estimates in bulk data that were significantly close to the scRNA-seq observations.
To determine cell-specific decomposition accuracy, we measured association between neuron proportions (the sum of Exc and Inh proportions) and each individual physician cognitive diagnostic category at time of death (cogdx). Neural death is a hallmark symptom of AD (Yankner, 1996), and a negative association between cogdx and neuron proportion is expected (Jew et al., 2020). The cogdx provides a semiquantitative measure of dementia severity ranging from 1 (no cognitive impairment) to 5 (a confident diagnosis of AD by physicians). Figure 7 represents estimated neuron proportion according to different cogdx levels. Based on this figure alone, it is difficult to make any inference on association between cogdx and proportions. We further validated the negative relationship between codex levels and neuron proportions. Because the cogdx variable has ordered discrete levels from 1 to 5, we used the Jonckheere-Terpstra test (Jonckheere, 1954), after outlier removal. We included the resulting P-values in Figure 7. Based on the P-values and the significance level 0.01, we observe that all methods except NNLS were able to identify the expected negative association between neuron proportions and cogdx levels. It is notable that SVR, OLS, NNLS and MuSiC reported lower estimated proportion values than the other methods.

Runtime comparison
In this section, we benchmarked these decomposition methods in terms of their runtime on a shared computer cluster with 40 CPUs and 500 GB of RAM. trates that CSCD is a bit slower than Bisque, which is due to the implementation of the covariance-based transformation, but it is still faster than most of the other methods.

Discussion and concluding remarks
In this article, we presented a reference-based decomposition method. We introduced the CSCD approach and illustrated the method performance by analyzing several datasets. Our primary goal was to improve on Bisque. However, CSCD outperformed most of the existing methods evaluated in this study. By using both bulk and scRNA-seq datasets that were collected from the same individuals, we observed that CSCD is able to capture the expected cell-type proportions. The method is executable by the inclusion of all genes in the bulk and/or scRNA-seq data in the analysis. However, it may lead to imprecise results, as some genes may appear in most (or all) of celltypes categories. To overcome this issue, one may select genes to be included in the analysis based on a preferred marker gene list. This is indeed the way scRNA-seq data are often prepared. Clusters in scRNA-seq data are constructed based on a pre-collected list of identified marker genes. Genes not on the original list are then assigned to clusters using some probabilistic approaches such as the nearest neighbor or t-SNE clustering, which could in turn be a source of noise. If such an original marker gene list was available in addition to a given scRNA-seq dataset, one could just base the analysis on the makers on the given list. Otherwise, we suggest identifying marker genes using one of the existing approaches in the literature. We used the MAST  (Fadista et al., 2014). The P-values were obtained using a negative Spearman correlation test between beta proportions and HbA1c levels    approach (Finak et al., 2015) implemented in the FindAllMarkers function of the Seurat package (Hao et al., 2021), and our results indicated that this approach works well. However, the fact that the CSCD algorithm removes all genes that are expressed in more than one cluster may be problematic for cell types with few marker genes or closely related cell types that share many marker genes. CSCD is similar to Bisque in the sense that it is a transformationbased approach and standardizes gene expression levels but its strength is the ability to successfully incorporate the covariance structures in both scRNA-seq and bulk RNA-seq datasets into the analysis. This leads to a more accurate transformation and more accurate cell-type estimates. Also, CSCD applies the linear shrinkage estimation of the covariance matrices, accounting for the collinearity issue, which is common in every multivariate large-scale data analysis.
CSCD and Bisque were robust to changes in data sources. For example, looking at the cortex tissue data, CSCD was the top method, and Bisque and DWLS were the second and third top methods, respectively. Looking at the pancreatic islets data, CSCD and Bisque were the first and second top methods, with SVR being the third. Both Bisque and CSCD rely on transformations to match the distribution of both single-cell expression and bulk expression data. This requires enough samples in both datasets. Having a small number of individuals may lead to unreliable results. Also, both Bisque and CSCD preserve the distribution of data in scRNA-seq. Both methods assume that the resulting single-cell-based estimates of celltype proportions accurately reflect the true cell-type proportions of interest. If cell-type proportions captured by single-cell experiments differ significantly from the true physiological distributions, the accuracy of Bisque and CSCD may decrease. Users are advised to be cautious in case of existence of a known significant bias in the single-cell measurements of a tissue of interest. For more discussion, readers may refer to Jew et al. (2020).  1  2  3  4  5  1  2  3  4  5  1  2  3  4  5  1  2  3  4  5  1  2  3  4  5  1  2  3  4  5  1  2  3  4  5  1  2  3   Note: Dashes stand for failed methods.