Normalization of single-cell RNA-seq counts by log(x + 1) or log(1 + x)

 
Single-cell RNA-seq technologies have been successfully employed over the past decade to generate many high resolution cell atlases. These have proved invaluable in recent efforts aimed at understanding the cell type specificity of host genes involved in SARS-CoV-2 infections. While single-cell atlases are based on well-sampled highly-expressed genes, many of the genes of interest for understanding SARS-CoV-2 can be expressed at very low levels. Common assumptions underlying standard single-cell analyses don't hold when examining low-expressed genes, with the result that standard workflows can produce misleading results.


SUPPLEMENTARY INFORMATION
Supplementary data and all of the code to reproduce Figure 1 are available here: https://github.com/pachterlab/BP_2020_2/.


Results
The ACE2 receptor, which facilitates entry of SARS-Cov-2 into cells (Zhang et al., 2020), has become one of the most studied genes in the history of genomics over the past two months. There are already hundreds of preprints about the gene (Google Scholar), and it is currently the default gene displayed on the UCSC genome browser (Lee et al., 2020). Several studies have reported on the expression of ACE2 at single-cell resolution, and papers have been rife with speculation about implications of differential ACE2 mRNA abundance for severity of disease. As is common in single-cell RNA-seq, the expression estimates of ACE2 are derived from counts that are filtered and normalized. Figure 1a shows an analysis of ACE2 mRNA in mice lungs (data from (Angelidis et al., 2019)). The expression is computed from cells containing at least one copy of the gene. While single-cell RNA-seq expression data has been modeled with many different distributions (Dadaneh et al., 2020;Van den Berge et al., 2018), for simplicity in illustrating our points we model this count data with a simple Poisson random variable X with parameter k in order to demonstrate the implications of this restriction. Application of the filter amounts to computing While this is approximately k when k is large, it is close to 1 when k is small (de L'Hospital, 1768). Figure 1b shows the fraction of cells containing at least one copy of ACE2 (Booeshaghi and Pachter, 2020). Evidently, Figure 1a creates a misleading impression. While it may appear that average ACE2 expression is similar between young and old mice, when comparing the fraction of cells with non-zero expression of ACE2 it is clear that ACE2 has significantly lower mRNA expression in the lungs of aged mice than young mice.
The fraction f of cells with non-zero expression of a gene has a useful statistical interpretation. We leave it as an exercise for the reader to show that the following estimator for the Poisson rate is consistent: Since f is approximately equal to this expression when f is small, this provides an interpretation of the fraction of cells with at least one copy of a low-abundance gene as an estimate of the rate parameter k in a Poisson distribution.
Another mistake that we've found to be common in reporting ACE2 expression has to do with the log transformation, frequently used as part of a normalization of counts. Counts are log transformed for two reasons: the first is to stabilize the variance, as the log transform has the property that it stabilizes the variance for random variables whose variance is quadratic in the mean (Bartlett, 1947;Yeo and Johnson, 2000). There are two main considerations for this step: first when performing PCA on the gene expression matrix to find a reduced-dimensional representation that captures the variance, it is desirable that all genes contribute equally. The second consideration for the log transform is that it converts multiplicative relative changes to additive differences. In the context of PCA, this allows for interpreting the projection axes in terms of relative, rather than absolute, abundances of genes.
A seemingly minor technical issue in log transforming counts is that zero counts cannot be 'logged', as logð0Þ is undefined. To circumvent this problem, it is customary to add a 'pseudocount', e.g. þ1, to each gene count prior to log transforming the data (Innes and Bader, 2018). We denote logð1 þ xÞ by log1p in accordance with nomenclature standard in scientific computing (Liu, 1988). For a gene with an average of k counts where k is large, it is intuitive that the average of the log1p transformed counts is approximately logðkÞ. However, this is not true for small k. An understanding of the result of applying the log1p transform begins with the observation that for a random variable X, E½f ðXÞ is not, in general, equal to f ðE½XÞ. For example, if X is a Poisson random variable with parameter k, it is not true that E½logð1 þ XÞ ¼ logðk þ 1Þ. By Taylor approximation, This shows that for low-expressed genes, the average log1p expression can differ considerably from logðkÞ, with the maximum difference according to the Taylor approximation at k % 1. (see Fig. 1c). Thus, while a 2-fold change for large k translates to a logð2Þ difference after log1p, that is not the case for small k.
In summary, while single-cell RNA-seq atlases offer detailed information about the transcriptomic profiles of distinct cell types, their use to examine specific genes, as has been done recently in the study of SARS-CoV-2 infection related genes, requires care. Methods should not be used unless their limitations are understood. For example, while it does not matter whether one uses logðx þ 1Þ or logð1 þ xÞ, the filtering and normalization applied to counts can affect comparative estimates in non-intuitive ways. For example, the SCnorm normalization (Bacher et al., 2017) is based on a preliminary filter for all cells with at least one count, thus subjecting the method to the problem seen in Figure 1a and b. Indeed, there have been reports of problems with SCnorm when applying the method to sparse datasets with many zeroes (Tian et al., 2019), leading to the development of methods that account for this (Hafemeister and Satija, 2019). Moreover, there are subtle problems that arise when working with small counts that transcend the elementary issues we have raised (Lun, 2018;Warton, 2018). These matters are not theoretical; we leave the identification of published preprints and papers that have ignored the issues we've raised, and hence reported misleading results, as another exercise for the reader.