Jose Alquicira-Hernandez, Joseph E Powell, Nebulosa recovers single-cell gene expression signals by kernel density estimation, Bioinformatics, Volume 37, Issue 16, August 2021, Pages 2485–2487, https://doi.org/10.1093/bioinformatics/btab003
Abstract
Data sparsity in single-cell experiments prevents an accurate assessment of gene expression when visualized in a low-dimensional space. Here, we introduce Nebulosa, an R package that uses weighted kernel density estimation to recover signals lost through drop-out or low expression.
Nebulosa can be easily installed from www.github.com/powellgenomicslab/Nebulosa.
Supplementary data are available at Bioinformatics online.
Current single-cell sequencing technologies allow transcriptional profiling of thousands to millions of individual cells in a single experiment. However, single-cell gene expression data commonly exhibit dropout events derived from stochastic transcription (Ochiai et al., 2020), low abundance of transcripts (Kharchenko et al., 2014) or shallow sequencing depth (Haque et al., 2017). All of these factors affect the visualization of gene expression in low-dimensional representations such as UMAP (Becht et al., 2019), PCA or t-SNE (van der Maaten and Hinton, 2008). Such visualization is crucial in any single-cell data analysis and is commonly used for cell annotation (based on canonical markers), the discovery of new cell sub-types, and the evaluation of confounding or batch effects.
Here, we introduce Nebulosa, a novel visualization approach that resolves sparsity using gene-weighted kernel density estimation. We show that by incorporating the cell density information from a low-dimensional space, the gene expression signal from dropped-out genes can be rescued. We applied Nebulosa to three datasets: 68k peripheral blood mononuclear cells (PBMCs) (Zheng et al., 2017), 4k PBMCs and keratinocytes from a mouse transgenic for the E7/E6 genes of the Human Papilloma Virus 16 (HPV16) (Lukowski et al., 2018).
We first highlight the performance of Nebulosa using CD4 expression in PBMCs. CD4 exhibits a high dropout rate and is predominantly restricted to non-cytotoxic CD4 T cells and myeloid cells. A traditional UMAP plot of CD4 expression shows an apparent absence of expression, which can be attributed to the large number of cells with dropped-out expression combined with over-plotting (Fig. 1a). When cells are plotted in ascending order of their CD4 expression rank, some of the cells expressing CD4 become slightly more observable (Supplementary Fig. S1). However, this expression lacks clear specificity across the UMAP space, resulting in a noisy signal that is difficult to associate with a given sub-population of cells. Applying Nebulosa to CD4 expression, we observe a clear expression signal in two well-defined clusters corresponding to CD4 T cells and myeloid cells (Fig. 1b). To confirm this, we independently classified cells using scPred (Alquicira-Hernandez et al., 2019) in a supervised manner and observed a high concordance between the expression density of CD4 and the annotation of CD4 T cells, classical and non-classical monocytes, and conventional and plasmacytoid dendritic cells (Fig. 1c). Furthermore, we identified a similar concordance in t-SNE and PCA spaces (Supplementary Fig. S1D–F). Next, we sought to compare the density visualization of Nebulosa with smoothing imputation methods.
We imputed the gene expression of CD4 using MAGIC (van Dijk et al., 2018) and kNN smoothing (Wagner et al., 2018) following their default specifications. We found that the density estimates of CD4 obtained with Nebulosa result in a better visual characterization of both myeloid and CD4 T cells (Supplementary Fig. S2A). While the results produced by MAGIC also improved the characterization of myeloid cells, the signal of CD4 from CD4 T cells was difficult to visualize (Supplementary Fig. S2B). Likewise, kNN smoothing failed to resolve the sparsity of CD4, particularly in T cells (Supplementary Fig. S2C). Moreover, we argue that while these methods are reliable for downstream analysis, Nebulosa is more suitable for exploratory data analysis of large datasets due to its computational speed. Using the 68 579 PBMCs, Nebulosa was 1500 times faster than MAGIC and 11 000 times faster than kNN smoothing.
To verify that Nebulosa provides reliable density estimates, we compared the RNA expression density of CD4 determined with Nebulosa with the protein expression measured with CITE-seq in 4k PBMCs. We observed that Nebulosa recapitulates the protein expression of CD4 based on the RNA counts in these data (Supplementary Fig. S3). This is achieved by the kernel function, which smooths cell density weighted by the gene expression. Nebulosa also removes the localized expression of CD4 from areas where this gene is expressed in very limited cell numbers. These instances are caused by stochastic transcription of other genes, or ambient RNA present in droplets, and make biological interpretation difficult. This feature produces a better representation of the gene expression while recovering the signal from cells that are more likely to express a gene based on their neighbouring cells.
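The idea of smoothing cell density weighted by gene expression can be illustrated with a minimal NumPy sketch. This is not Nebulosa's implementation: the function name `weighted_kde_2d`, the fixed isotropic bandwidth and the simulated two-cluster data are all assumptions made for illustration, and the dense pairwise computation is only practical for small toy inputs.

```python
import numpy as np


def weighted_kde_2d(embedding, weights, bandwidth=1.0):
    """Gaussian kernel density at each cell, with every cell weighted by its expression.

    embedding: (n_cells, 2) low-dimensional coordinates (e.g. UMAP).
    weights:   (n_cells,) non-negative expression values for one gene.
    bandwidth: isotropic kernel width (a hypothetical fixed value for this sketch).
    """
    embedding = np.asarray(embedding, dtype=float)
    weights = np.asarray(weights, dtype=float)
    total = weights.sum()
    if total <= 0:
        # No cell expresses the gene, so there is no signal to smooth.
        return np.zeros(len(embedding))
    # Pairwise squared distances between all cells: O(n^2) memory, toy sizes only.
    diff = embedding[:, None, :] - embedding[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    kernel = np.exp(-0.5 * sq_dist / bandwidth ** 2) / (2.0 * np.pi * bandwidth ** 2)
    # Density at a cell = expression-weighted average of kernels centred on all cells.
    return kernel @ weights / total


# Simulated data (not from the paper): two clusters of 200 cells each.
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(200, 2))
cluster_b = rng.normal(loc=(10.0, 10.0), scale=0.5, size=(200, 2))
embedding = np.vstack([cluster_a, cluster_b])

# Cluster A truly expresses the gene, but roughly 70% of its cells drop out.
expression = np.zeros(400)
expression[:200] = (rng.random(200) < 0.3).astype(float)
# Cluster B contains a single stray count, e.g. from ambient RNA.
expression[200] = 1.0

density = weighted_kde_2d(embedding, expression, bandwidth=1.0)
in_a = np.arange(400) < 200
dropped_out_in_a = in_a & (expression == 0)
```

In this toy setting, the cells in cluster A with zero counts still receive a high density because many of their neighbours express the gene, whereas the isolated count in cluster B contributes very little. Comparing `density[dropped_out_in_a].mean()` with `density[~in_a].mean()` makes the difference visible.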

Fig. 1. Nebulosa recovers cell gene expression signals that are lost through drop-out or low expression. Cells expressing one or more transcripts of CD4 are shown among 68 000 peripheral blood mononuclear cells using standard UMAP plotting (A) and using Nebulosa’s kernel function, which utilizes low-dimensional (UMAP) cellular density features (B). Cell-type classification obtained with scPred shows the correspondence between CD4 T cells and myeloid cells, and Nebulosa estimates for CD4 (C). We also highlight the ability to identify cell populations based on joint density estimation of multiple gene markers, using CD3D, CD4 and CCR7 to identify CD4+ cells in peripheral blood mononuclear cells (D).
Nebulosa can also create a joint density estimate to visualize the expression overlap of multiple genes by multiplying their estimated densities. As a demonstration, we used Nebulosa to identify naive CD4 T cells based on the joint expression of CD3D, CD4 and CCR7 (Fig. 1d). This feature allows the direct and more precise identification of cell types based on the combined expression of various markers. To further demonstrate the utility of this function, we applied Nebulosa to detect the expression of the viral oncogenes E6 and E7 in HPV-transgenic keratinocytes. The combined expression of E6 and E7 was observable in two clusters characterized by the expression of Krt5 (see Supplementary Fig. S4), which are difficult to identify with traditional UMAP plotting.
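The joint estimate described above, a product of per-gene densities, can be sketched as follows. Again this is an illustrative assumption rather than Nebulosa's code: `weighted_kde_2d` is a hypothetical helper, and `marker_a` and `marker_b` are simulated placeholder genes, not CD3D, CD4 or CCR7.

```python
import numpy as np


def weighted_kde_2d(embedding, weights, bandwidth=1.0):
    """Illustrative expression-weighted Gaussian KDE evaluated at every cell."""
    embedding = np.asarray(embedding, dtype=float)
    weights = np.asarray(weights, dtype=float)
    total = weights.sum()
    if total <= 0:
        return np.zeros(len(embedding))
    diff = embedding[:, None, :] - embedding[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    kernel = np.exp(-0.5 * sq_dist / bandwidth ** 2) / (2.0 * np.pi * bandwidth ** 2)
    return kernel @ weights / total


# Simulated data (not from the paper): three clusters of 100 cells each.
rng = np.random.default_rng(1)
centres = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
embedding = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in centres])
cluster = np.repeat([0, 1, 2], 100)

# marker_a is expressed in clusters 0 and 1, marker_b in clusters 1 and 2;
# in both cases only about 40% of the expressing cells are detected.
marker_a = (np.isin(cluster, [0, 1]) & (rng.random(300) < 0.4)).astype(float)
marker_b = (np.isin(cluster, [1, 2]) & (rng.random(300) < 0.4)).astype(float)

# Joint density: element-wise product of the per-gene densities.
densities = [weighted_kde_2d(embedding, m) for m in (marker_a, marker_b)]
joint = np.prod(densities, axis=0)
```

Because a product is large only where every factor is large, `joint` peaks in cluster 1, the only cluster where both placeholder markers are expressed. Clusters 0 and 2 are suppressed even though each expresses one marker strongly.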
Overall, Nebulosa is a powerful approach for visualizing single-cell data as it resolves data sparsity by using the information from neighbouring cells while successfully dealing with over-plotting. Nebulosa is easy to use and can be integrated into current standard single-cell workflows, such as Seurat (Stuart et al., 2019) and Bioconductor (Huber et al., 2015) workflows.
Funding
J.A.-H. was supported by the University of Queensland under a Research Training Program scholarship and a UQ Research Training scholarship. J.E.P. is supported by National Health and Medical Research Council Investigator [1175781]. This work was supported by National Health and Medical Research Council project grant [APP1143163] and Australian Research Council Discovery project [DP180101405].
Conflict of Interest: none declared.
Acknowledgements
The authors thank Drew Neavin for her suggestions and discussions on data visualization.