Abstract

Summary

Data sparsity in single-cell experiments prevents an accurate assessment of gene expression when visualized in a low-dimensional space. Here, we introduce Nebulosa, an R package that uses weighted kernel density estimation to recover signals lost through drop-out or low expression.

Availability and implementation

Nebulosa can be easily installed from www.github.com/powellgenomicslab/Nebulosa.

Supplementary information

Supplementary data are available at Bioinformatics online.

Current single-cell sequencing technologies allow transcriptional profiling from thousands to millions of individual cells in a single experiment. However, single-cell gene expression data commonly exhibit dropout events derived from stochastic transcription (Ochiai et al., 2020), low abundance of transcripts (Kharchenko et al., 2014) or shallow sequence depth (Haque et al., 2017). All of these factors affect the visualization of gene expression in low-dimensional representations such as UMAP (Becht et al., 2019), PCA or t-SNE (van der Maaten and Hinton, 2008). Such visualization is crucial in any single-cell data analysis, and is a commonly used approach for cell annotation (based on canonical markers), the discovery of new cell sub-types, and the evaluation of confounding or batch effects.

Here, we introduce Nebulosa, a novel visualization approach to resolve sparsity based on gene-weighted kernel density estimation. We show that by incorporating the cell density information from a low-dimensional space, the gene expression signal from dropped-out genes can be rescued. We applied Nebulosa in three datasets: 68k peripheral blood mononuclear cells (PBMCs) (Zheng et al., 2017), 4k PBMCs and keratinocytes from a mouse transgenic for the E7/E6 genes of the Human Papilloma Virus 16 (HPV16) (Lukowski et al., 2018). We first highlight the performance of Nebulosa using CD4 expression in PBMCs—a gene that exhibits a high dropout rate, and is predominately restricted to non-cytotoxic CD4 T cells and myeloid cells. Visualizing the expression of CD4 in cells using a traditional UMAP, shows an absence of expression, which can be attributed to the large number of cells with dropped-out expression combined with over-plotting (Fig. 1a). When plotting cells in ascending order based on their rank of CD4 expression, some of the cells expressing CD4 become slightly more observable (Supplementary Fig. S1). However, this expression lacks a clear specificity across the UMAP space, resulting in a noisy signal difficult to associate to a given sub-population of cells. Applying Nebulosa to CD4 expression, we observe the clear expression signal in two well-defined clusters corresponding to CD4 T cells and myeloid cells (Fig. 1b). To demonstrate this, we independently classified cells using scPred (Alquicira-Hernandez et al., 2019) in a supervised manner and observed a high concordance between the expression density of CD4 and the annotation of CD4 T cells, classical and non-classical monocytes, and conventional and plasmacytoid dendritic cells (Fig. 1c). Furthermore, we identified a similar concordance in t-SNE and PCA spaces (Supplementary Fig. S1D–F). Next, we sought to compare the density visualization of Nebulosa with smoothing imputation methods. We imputed the gene expression of CD4 using MAGIC (van Dijk et al., 2018) and kNN smoothing (Wagner et al., 2018) following their default specifications. We found that the density estimates of CD4 with Nebulosa results in a better visual characterization to identify both myeloid and CD4 T cells (Supplementary Fig. S2A). While the results produced by MAGIC also improved the characterization of myeloid cells, the signal of CD4 from CD4 T cells was difficult to visualize (Supplementary Fig. S2B). Likewise, kNN smoothing failed to resolve the sparsity of CD4, particularly in T cells (Supplementary Fig. S2C). Moreover, we argue that while these methods are reliable for downstream analysis, Nebulosa is more suitable for exploratory data analysis of large datasets due to its computational speed. Using the 68 579 PBMC cells, Nebulosa was 1500 times faster than MAGIC, and 11 000 times faster than kNN smoothing. To verify that Nebulosa provides reliable density estimates, we compared the RNA expression density of CD4 determined with Nebulosa with the protein expression measured with CITE-seq in 4k PBMCs. We observed that Nebulosa recapitulates the protein expression of CD4 based on the RNA counts in this data (Supplementary Fig. S3). This is achieved by the kernel function smoothing cell density weighted by the gene expression. Nebulosa also removes the localized expression of CD4 from areas where this gene is expressed in very limited cell numbers. These instances are caused by stochastic transcription of other genes, or ambient RNA present in droplets, making biological interpretation difficult. This feature produces a better representation of the gene expression while recovering the signal from cells that are more likely to express a gene based on their neighbouring cells.

Nebulosa recovers cell gene expression signals that are lost through drop-out or low expression. Cells expressing one or more transcripts of CD4 are plotted for each of 68 000 peripheral blood mono-nucleated cells using the standard UMAP plotting (A); and with Nebulosa’s kernel function utilizing low-dimension (UMAP) cellular density features (B). Cell-type classification obtained with scPred shows the correspondence between CD4 T cells and myeloid cells, and Nebulosa estimates for CD4 (C). We also highlight the ability to identify cell populations based on joint density estimation of multiple gene markers, using CD3D, CD4 and CCR7 to identify CD4+ cells in peripheral blood mono-nucleated cells (D)
Fig. 1.

Nebulosa recovers cell gene expression signals that are lost through drop-out or low expression. Cells expressing one or more transcripts of CD4 are plotted for each of 68 000 peripheral blood mono-nucleated cells using the standard UMAP plotting (A); and with Nebulosa’s kernel function utilizing low-dimension (UMAP) cellular density features (B). Cell-type classification obtained with scPred shows the correspondence between CD4 T cells and myeloid cells, and Nebulosa estimates for CD4 (C). We also highlight the ability to identify cell populations based on joint density estimation of multiple gene markers, using CD3D, CD4 and CCR7 to identify CD4+ cells in peripheral blood mono-nucleated cells (D)

Nebulosa can also create a joint density estimate to visualize the expression overlap of multiple genes by multiplying their estimated densities. As a demonstration, we used Nebulosa to identify naive CD4 T cells based on the joint expression of CD3D, CD4 and CCR7 (Fig. 1d). This feature allows the direct and more precise identification of cell types based on the combined expression of various markers. To further demonstrate the utility of this function, we applied Nebulosa to detect the expression of the viral oncogenes E6 and E7 from HPV-transgenic keratynocytes. The combined expression of E6 and E7 was observable in two clusters characterized by the expression of Krt5 (see Supplementary Fig. S4), that are difficult to identify with traditional UMAP plotting.

Nebulosa makes use of weighted kernel density estimation methods (Duong, 2007; Venables and Ripley, 2002) to represent the expression of gene features from cell neighbours. Similarly, Nebulosa can also be applied to handle missing values from metadata variables (such as batch, condition, time point etc.) to improve their visualization via kernel density estimation, and visualize feature information from other assays [such as CITE-seq (Stoeckius et al., 2017) protein markers, ATAC-seq (Buenrostro et al., 2015) signals]. Nebulosa can take Seurat (Stuart et al., 2019) or SingleCellExperiment objects as input, enabling easy integration into current single-cell analysis workflows. The weighted kernel density estimation is calculated as following:
(1)
where n is the total number of cells, wi is the scaled gene expression (from ln(exp + 1)) and Xi the embeddings of the ith cell, K(x) is a Gaussian kernel function. H is a smoothing parameter corresponding to the bandwidth matrix determined with an unconstrained plug-in selector implemented in the ks package (Duong, 2007) and x is a reference point in the embedding space defined by the grid size used for the computation to weight the distances of nearby cells. KH (x—Xi) therefore works as a weight for wi to smooth the gene expression based on neighbouring cells in a two-dimensional space (although density estimation can potentially be expanded to more). One potential caveat of smoothing gene expression from two-dimensional spaces which preserve better local than global distances between cells (such as UMAP and t-SNE) is the choice of the bandwidth, which can lead to sub-optimal results. We encourage users to use low bandwidths if the default value requires changing for calibration.

Overall, Nebulosa is a powerful approach for visualizing single-cell data as it resolves data-sparsity by using the information from neighbouring cells while successfully dealing with over-plotting. Nebulosa is easy to use and can be implemented into current standard single-cell workflows, such as Seurat (Stuart et al., 2019) and Bioconductor (Huber et al., 2015) workflows.

Funding

J.A.-H. was supported by the University of Queensland under a Research Training Program and a UQ Research Training scholarships. J.E.P. is supported by National Health and Medical Research Council Investigator [1175781]. This work was supported by National Health and Medical Research Council project grant [APP1143163] and Australian Research Council Discovery project [DP180101405].

Conflict of Interest: none declared.

Acknowledgements

The authors thank Drew Neavin for her suggestions and discussions on data visualization

References

Alquicira-Hernandez
J.
 et al. (
2019
)
scpred: accurate supervised method for cell-type classification from single-cell rna-seq data
.
Genome Biol
.,
20
,
264
264
.

Becht
E.
 et al. (
2019
)
Dimensionality reduction for visualizing single-cell data using UMAP
.
Nat. Biotechnol
.,
37
,
38
44
.

Buenrostro
J.D.
 et al. (
2015
)
Atacseq: a method for assaying chromatin accessibility genomewide
.
Curr. Protoc. Mol. Biol
.,
109
,
21.29.1
21.29.9
.

Duong
T.
(
2007
)
ks: kernel density estimation and kernel discriminant analysis for multivariate data in R
.
J. Stat. Softw
.,
21
,
1
16
.

Haque
A.
 et al. (
2017
)
A practical guide to single-cell rna-sequencing for biomedical research and clinical applications.(report)
.
Genome Med
.,
9
,
75
.

Huber
W.
 et al. (
2015
)
Orchestrating high-throughput genomic analysis with bioconductor
.
Nat. Methods
,
12
,
115
121
.

Kharchenko
P.V.
 et al. (
2014
)
Bayesian approach to single-cell differential expression analysis
.
Nat. Methods
,
11
,
740
742
.

Lukowski
S.W.
 et al. (
2018
)
Detection of HPV E7 transcription at single-cell resolution in epidermis
.
J. Investig. Dermatol
.,
138
,
2558
2567
.

Ochiai
H.
 et al. (
2020
)
Genome-wide kinetic properties of transcriptional bursting in mouse embryonic stem cells
.
Sci. Adv
.,
6
,
eaaz6699
.

Stoeckius
M.
 et al. (
2017
)
Simultaneous epitope and transcriptome measurement in single cells
.
Nat. Methods
,
14
,
865
868
.

Stuart
T.
 et al. (
2019
)
Comprehensive integration of single-cell data.(report)
.
Cell
,
177
,
1888
1902.e21
.

van der Maaten
L.
,
Hinton
G.
(
2008
)
Visualizing high-dimensional data using t-SNE
.
J. Mach. Learn. Res
.,
9
,
2579
2605
.

van Dijk
D.
 et al. (
2018
)
Recovering gene interactions from single-cell data using data diffusion
.
Cell (Cambridge)
,
174
,
716
729.e27
.

Venables
W.N.
,
Ripley
B.D.
(
2002
).
Modern Applied Statistics with S
, 4th edn.  
Springer
,
New York
.

Wagner
F.
 et al. (
2018
) K-nearest neighbor smoothing for high-throughput single-cell RNA-seq data. BioRxiv.

Zheng
G.X.Y.
 et al. (
2017
)
Massively parallel digital transcriptional profiling of single cells
.
Nat. Commun
.,
8
,
14049
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Anthony Mathelier
Anthony Mathelier
Associate Editor
Search for other works by this author on:

Supplementary data