demuxmix: demultiplexing oligonucleotide-barcoded single-cell RNA sequencing data with regression mixture models

Abstract Motivation Droplet-based single-cell RNA sequencing (scRNA-seq) is widely used in biomedical research for interrogating the transcriptomes of single cells on a large scale. Pooling and processing cells from different samples together can reduce costs and batch effects. To pool cells, they are often first labeled with hashtag oligonucleotides (HTOs). These HTOs are sequenced alongside the cells’ RNA in the droplets and subsequently used to computationally assign each droplet to its sample of origin, a process referred to as demultiplexing. Accurate demultiplexing is crucial but can be challenging due to background HTOs, low-quality cells/cell debris, and multiplets. Results A new demultiplexing method based on negative binomial regression mixture models is introduced. The method, called demuxmix, implements two significant improvements. First, demuxmix’s probabilistic classification framework provides error probabilities for droplet assignments that can be used to discard uncertain droplets and inform about the quality of the HTO data and the success of the demultiplexing process. Second, demuxmix utilizes the positive association between detected genes in the RNA library and HTO counts to explain parts of the variance in the HTO data resulting in improved droplet assignments. The improved performance of demuxmix compared with existing demultiplexing methods is assessed using real and simulated data. Finally, the feasibility of accurately demultiplexing experimental designs where non-labeled cells are pooled with labeled cells is demonstrated. Availability and implementation R/Bioconductor package demuxmix (https://doi.org/doi:10.18129/B9.bioc.demuxmix)


Introduction
Single-cell transcriptomic profiling is fundamental to studying complex biological systems. New droplet-based techniques enable the parallel processing of thousands of cells in a single run (Klein et al. 2015, Macosko et al. 2015, Zheng et al. 2017. Pooling cells from different samples before dropletbased single-cell RNA sequencing (scRNA-seq) can reduce costs and batch effects (Cheng et al. 2021). If genetically unrelated samples are pooled, common genetic variants can be used to assign droplets to the respective donor after sequencing (Kang et al. 2018, Huang et al. 2019. Alternatively, if genetically related or identical samples are pooled, cells can be labeled by hashtag oligonucleotides (HTOs) using oligonucleotide-conjugated antibodies (Stoeckius et al. 2018) or chemical labeling (McGinnis et al. 2019, Gehring et al. 2020. The HTOs are sequenced together with the RNA in the droplet and are subsequently used to assign each droplet to its sample of origin, which is termed demultiplexing. Demultiplexing is complicated by the presence of multiplets, i.e. droplets with more than one cell. If a multiplet contains cells from different samples, it can be detected based on its HTO profile during demultiplexing. Further, depending on the sample quality, a substantial amount of background HTOs can be observed in some experiments. Background HTOs are HTOs that were not used to tag the cell in the considered droplet. Several methods have been proposed for HTO-based demultiplexing. HTODemux implemented in the R package Seurat clusters cells into groups of low and high counts for each HTO, models the distribution of the cluster with low counts, and then classifies outliers above the 0.99 quantile as tagged (positive) cells (Stoeckius et al. 2018). hashedDrops in the R/Biocondutor package DropletUtils uses thresholds based on the ratio of the most abundant HTO to the second most abundant HTO to assign droplets to samples. MULTI-seq detects two local maxima in a kernel density estimation for each HTO and defines a threshold between the two maxima such that the number of singlets passing only one of the thresholds across all HTOs is maximized (McGinnis et al. 2019). GMM-Demux applies a Gaussian mixture model to normalized HTO counts to classify droplets (Xin et al. 2020). DemuxEM differs from the other methods by explicitly modeling the distribution of background HTO counts using empty droplets (Gaublomme et al. 2019). All these methods can identify multiplets and, if default parameters are used, may reject to classify some droplets in case of high classification uncertainty.
The new method, demuxmix, described here uses twocomponent mixture models like GMM-Demux but differs substantially in two important aspects. First, instead of transforming HTO counts to approximate a normal distribution, the negative binomial distribution is used to model raw HTO counts. A good model fit depends on the selected probability distributions and is crucial for accurate class probabilities. Second, demuxmix leverages the association between detected genes and HTO counts in the droplets using regression mixture models to explain some of the variance observed in HTO counts, thereby improving the demultiplexing results. demuxmix is described in detail in Section 2, and its performance is assessed and compared with the existing methods in Section 3 using simulated and real data.

Methods
For each droplet i ¼ 1; . . . ; m, a vector y i;Á 2 N n of the counts from the n HTOs used in the experiment is observed. Let ðy i;j Þ 2 N mÂn denote the matrix of HTO counts from all m droplets in the experiment. The vector with the numbers of genes detected in the droplets is denoted by x 2 N m . Any gene with at least one mapped read from the RNA library is considered detected. The task of demultiplexing is to infer the unobserved classes C i 2 f0; 1g n , where the jth entry equals 1 if droplet i contains a cell tagged with the jth HTO. If all entries of C i are 0, the droplet is empty. If exactly one entry equals 1, the droplet is a single-sample droplet (SSD). It can still contain more than one cell from the same sample, which is difficult to detect based on HTO data and not modeled here. If more than one entry equals 1, the droplet is a multisample multiplet (MSM).

demuxmix algorithm
demuxmix consists of three steps: preprocessing, mixture model fitting, and classification. The first two steps are executed independently for each of the j ¼ 1; . . . ; n HTOs. Empty droplets and low-quality cells should be identified based on their transcriptomic profile and removed before running demuxmix (Lun et al. 2019).
2.1.1 Preprocessing k-means clustering with k ¼ 2 is applied to the counts y Á;j from the jth HTO after log-transformation to obtain initial class labels. The cluster with the larger mean HTO count is assumed to contain mainly droplets with cells from the sample tagged by HTO j and is referred to as the positive cluster. Droplets in the positive cluster with y i;j greater than 1.5 interquartile range (IQR) þ third quartile are flagged as outliers. Similarly, droplets from both clusters with x i greater than 1.5 IQR þ third quartile are flagged. Outliers are not removed and are still classified by demuxmix but not used for subsequent model fitting.

Regression mixture model
For each HTO j, a separate negative binomial regression mixture model with two components representing negative droplets (not containing cells tagged by HTO j) and positive droplets (containing cells tagged by HTO j) is used to model the distribution of the HTO counts y i;j ; i ¼ 1; . . . ; m (Wedel and DeSarbo 2002). Thus, the index j denoting the HTO is constant in Equations (1)-(3) and only annotated for consistency. Let h denote the probability mass function of the negative binomial distribution. The mixture distribution for the jth HTO can be written as where the vector / j ¼ ðp j;1 ; h j;1 ; p j;2 ; h j;2 Þ denotes all model parameters. p j;1 ; p j;2 > 0, p j;1 þ p j;2 ¼ 1, describe the unknown proportions of negative and positive droplets. In a regression mixture model, the means of the observations in each component are predicted from explanatory variables. Here, the number of detected genes in the droplet is used to predict the HTO counts using a negative binomial regression model with canonical link function g: The vector h j;k ¼ ða j;k ; b j;k ; j;k Þ contains the regression coefficients and the dispersion parameter of the kth component. The model parameters / j are estimated using the expectation-maximization (EM) algorithm, which iteratively estimates the regression model parameters h j;k in Equation (2) and then updates the class memberships of the droplets until convergence. Within each EM iteration, maximum likelihood estimates for h j;k are obtained through a nested alternating iteration process. This process alternates between estimating the regression coefficients and the dispersion parameter j;k using Fisher's scoring algorithm, as implemented in the R package MASS. The EM algorithm is initialized using the clustering results from the preprocessing step: droplets from the positive cluster with the larger mean HTO count are assigned to component k ¼ 2. Thus, component k ¼ 2 represents positive droplets. After the parameters have been estimated, the posterior probability that droplet i contains a cell from the sample tagged with HTO j can be calculated: If droplet i is more likely to contain a cell from sample j than not, then estimated classĉ i;j is set to 1 andĉ i;j :¼ 0 otherwise. As shown in Section 3, many datasets demonstrate a positive association between x and y Á;j , which can be leveraged to improve the demultiplexing results. However, the association can be absent in the negative component if the HTO counts are very low. Moreover, if different cell types with different RNA contents and cell surface properties are pooled, the regression can be driven by the differences between these heterogeneous cell clusters rather than by the association between x and y Á;j within the clusters. Therefore, demuxmix fits two simpler mixture models in addition to model (1) and selects the best model for each HTO. The first model does not include a regression model for the negative component k ¼ 1, meaning b j;1 is set to 0. Consequently, the mean l i;j;1 in Equation (1) simplifies to l j;1 and is independent of the number of detected genes. The second model removes the regression models from both components and is referred to as naïve mixture model. For each HTO, the model that minimizes the expected classification errors is calculated using X m i¼1 Pðĉ i;j 6 ¼ C i;j Þ and is selected for use in the subsequent droplet classification.

Probabilistic classification
The final classification of a droplet is obtained from the classifications of the n separate mixture modelsĉ i ¼ ðĉ i;1 ; . . . ;ĉ i;n Þ. An advantage of mixture models for classification is that the posterior probabilities for each class can be calculated using Equation (3). Assuming that the classification results from the different HTOs are independent, the probability of droplet i being from classĉ i is Although the independence assumption is a simplification of the actual data structure, the approximate probabilities from Equation (4) are helpful during demultiplexing. In most settings, a droplet with a high classification uncertainty should be discarded and not assigned to a perhaps incorrect sample, which can then negatively affect downstream analyses. demuxmix uses a default threshold of p acpt ¼ 0:9 n and classifies droplets as "uncertain" if Pðĉ i ¼ C i Þ < p acpt . The threshold can be adapted to account for different acceptable demultiplexing error rates in different experimental settings.

Quality metric for HTO data
An intuitive quality metric that quantifies the separability of positive and negative droplets by HTO j can be obtained by calculating the shared probability mass of the negative and positive component of the mixture model. The metric, termed overlap score (ovs), is defined as x 2 denotes the weighted mean number of detected genes, where the weights are the posterior probabilities of the droplets being positive from Equation (3). Similarly, x 1 denotes the respective weighted mean of detected genes in negative droplets. Prior to calculating the weighted averages, outliers are removed based on the definition in Section 2.1.1. In the case of naïve mixture models, x 1 and x 2 in Equation (5) are omitted since the mean HTO count is independent of the number of detected genes. The metric is standardized within (0, 1], does not depend on the proportion of positive droplets, and can therefore be used to compare the quality of HTO data across experiments.

Datasets
Two publically available datasets and one new dataset were used to demonstrate demuxmix's usability. All datasets were generated by the droplet-based 10Â Genomics Chromium scRNA-seq platform (Zheng et al. 2017) using oligonucleotide-labeled antibodies to tag cells before pooling (Stoeckius et al. 2018).

Human brain single-nucleus dataset
The human brain single-nucleus RNA sequencing (snRNAseq) dataset consists of nuclei from the dorsolateral prefrontal cortices from eight genetically unrelated donors. Each sample was tagged with a different HTO. The dataset has been described previously (Gaublomme et al. 2019). demuxlet was used as an alternative demultiplexing approach based on genetic variation and used as ground truth in the benchmark (Kang et al. 2018). demuxlet was applied as in the original work: Droplets with a doublet probability >.99 were classified as doublets. The remaining droplets were assigned to the most likely singlet class unless they were flagged as ambiguous by demuxlet. The preprocessed dataset of 2509 high-quality nuclei and the demuxlet results were obtained from the Docker image provided by Gaublomme et al. .

Cell line mixture dataset
The cell line mixture dataset consists of 12 samples from 4 different cell lines with a total of 7956 cells. Each sample was tagged with a different HTO, as described previously (Stoeckius et al. 2018). Droplets were demultiplexed based on the distinct transcription profiles of the four cell lines, independent of the HTOs, using a standard single-cell clustering workflow as described by Amezquita et al. (2020). Further details on the clustering approach and the corresponding R code can be found in the Supplementary Information. The dataset was downloaded from Gene Expression Omnibus (GSM3501446 and GSM3501447).

Cerebrospinal fluid dataset
The cerebrospinal fluid (CSF) dataset consists of CSF cells pooled with peripheral blood mononuclear cells (PBMCs).
Only the PBMCs but not the CSF cells were stained using oligonucleotide-labeled antibodies (BioLegend, 394613) following the manufacturer's protocol. Cell capture, amplification, and library construction on the 10Â Genomics Chromium platform (v3 Chemistry) were performed according to the 10Â protocol. Libraries were sequenced on the Illumina NovaSeq instrument and aligned and preprocessed using the Cell Ranger software, including the removal of empty and low-quality droplets. A total of 2510 droplets with !200 detected genes were obtained. To allow for genetic demultiplexing using the freemuxlet software, CSF and PBMCs were obtained from two genetically unrelated donors. A droplet was assigned to the respective class "CSF" or "PBMC" if freemuxlet achieved a posterior probability !.99 and labeled "uncertain" otherwise. The study was approved by an Institutional Review Board of Columbia University Irving Medical Center. The dataset is included in the R/Bioconductor package demuxmix.

Simulation
Simulated data to benchmark the demultiplexing methods were generated based on real datasets using only highconfidence SSDs as starting set. For the human brain dataset, high-confidence SSDs were defined as droplets identified as SSDs by the genetic demultiplexing (demuxlet) with a doublet probability < .1 and a probability of the most likely singlet class > .999. This stringent filter resulted in 2123 SSDs with known sample source. To simulate low-quality datasets, a scaling factor s ¼ 1, 0.95, . . ., 0.1 was used to attenuate the true HTO signal:ỹ i;j ðiÞ ¼ s y i;j ðiÞ , where j ðiÞ refers to the HTO used to tag the cell in droplet i. Background HTO counts were not altered. After attenuating the HTO signal, 200 doublets were generated by randomly merging two SSDs from different donors. The HTO and RNA read counts of the demuxmix artificial doublets were set to the average values of the two merged SSDs. Similarly, data were simulated based on the cell line mixture dataset. All 7047 droplets assigned to one of the four cell lines based on their RNA profile as described in Section 2.2.2 were considered high-confidence SSDs and used for the simulation. The scaling factor ranged from 1 to 0.25 since all methods performed poorly on this dataset with s < 0.25. The clustering-based ground truth only identified the cell line but not which of the three HTOs used for each cell line was used to tag the cell in the droplet. Thus, to decide which HTO should be multiplied by the scaling factor s, majority voting was applied after running all six methods included in the benchmark on the unmodified dataset. In case of a tie (observed for 66 out of 7047 droplets), one of the top-voted HTOs was randomly drawn. A total of 350 doublets were generated and added to the dataset in the same way as in the human brain dataset.

Results
Three different HTO datasets from three laboratories were used to assess demuxmix's performance. The first dataset was the human brain single-nucleus dataset, which consisted of pooled nuclei isolated from eight different donors. A different HTO was used for each donor (Fig. 1A). The median number of HTO reads per droplet was 1454. Genetic demultiplexing was applied and used as ground truth.
3.1 HTO counts are positively correlated with the number of detected genes in the droplets To study the association between the number of detected genes and the HTO count in a droplet, the number of genes was plotted against the number of reads of the HTO used to tag the respective nucleus (Fig. 1B). All samples demonstrated a positive correlation ranging from 0.67 to 0.86, indicating that the number of detected genes explains parts of the variance in HTO counts. This association is likely driven by biological factors, e.g. nucleus size, and technical factors like the quality of the gel beads. A positive correlation was also observed for background HTO reads (HTOs not used to tag the respective nucleus), albeit with a much lower correlation ranging from 0.07 to 0.13, likely due to the low background HTOs counts (median of 25 reads).

demuxmix's classification is highly consistent with genetic demultiplexing
For six out of eight HTOs, demuxmix selected the mixture model with a regression model in the positive but not in the negative component, reflecting the observed weak correlations between background HTO counts and the number of detected genes. The naïve mixture model and the mixture model with two regression components were each selected once. Figure 1C exemplarily shows the histogram for one HTO (S8HuM) overlaid with the probability mass function of the mixture model. The dataset was originally generated to demonstrate the feasibility of tagging single nuclei and is of exceptional quality. The negative component (blue) shows low background HTO counts and little overlap with the positive (red) component (ovs S8HuM ¼ 0.006). Figure 1D shows the decision boundary for the HTO S4HuM. Since a regression mixture model was selected for this HTO, the decision boundary is a curve depending on both HTO counts and the number of genes. If a naïve mixture model were used, the decision boundary would be a vertical line. demuxmix's results were highly concordant with those from genetic demultiplexing (Fig. 1E).

demuxmix demonstrates superior performance on simulated benchmark data
On the original high-quality dataset, previously published demultiplexing methods performed similarly well ( Supplementary Fig. S1). To systematically assess and compare the performance of demuxmix to other demultiplexing algorithms, HTO datasets of different quality were simulated based on droplets identified by demuxlet as high-confidence SSDs. The HTO signal in the data was then gradually attenuated by a scaling factor s to simulate low-quality data. MSMs were artificially generated by merging two SSDs. To assess the methods' performances, the probability that the predicted classĉ i is correct given the method predicted an SSD (Pðĉ i ¼ C i jĉ i 2 SSDÞ, referred to as precision SSD ) was calculated. This measure is essential in the context of demultiplexing, because falsely predicted SSDs remain in the dataset and can negatively affect downstream analyses. Since a high precision SSD can be achieved by generously classifying ambiguous droplets as "uncertain" or as MSMs, precision SSD was considered together with sensitivity SSD , which was defined as the probability that a true SSD was identified as SSD, i.e. Pðĉ i 2 SSDjC i 2 SSDÞ. A low sensitivity SSD means that many usable droplets were removed from the dataset. The F-score (F SSD ), defined as the harmonic mean of precision SSD and sensitivity SSD , was used to rank the methods. Supplementary  Fig. S2 shows the two performance measures for different scaling factors s ranging from 1 to 0.1. Table 1 shows the average measures observed across all values of s. All methods were applied using their default settings as described in the Supplementary Information. Notably, all methods achieved a high precision SSD for larger values of s, reflecting the high quality of the dataset. While differences in the average precision SSD were relatively small between methods, larger differences were observed for the average sensitivity SSD (Table 1). demuxmix achieved the highest average precision SSD and the highest average sensitivity SSD . Only 6.3% of true SSDs were discarded as "uncertain," "negative," or MSMs by demuxmix, whereas other methods, on average, discarded up to 12.8% of true SSDs.
While high precision and sensitivity for SSDs are most important, identifying MSMs reliably can be helpful for two reasons. First, MSMs can be used to detect single-sample multiplets with similar transcription profiles in downstream clustering analyses. Second, accurately estimating the dataset's MSM proportion can help plan future experiments. Therefore, two additional measures, precision MSM and MSM proportion, were calculated as defined in the Supplementary Information. As shown in Supplementary Table S1, all methods performed worse in identifying MSMs and a substantial fraction of predicted MSMs (13%-40%) was actual SSDs. Further, most methods slightly overestimated the proportion of MSMs in the dataset. Detailed results for different scaling factors are shown in Supplementary Fig. S2.

demuxmix performs better than naïve mixture models
A modified version of demuxmix (demuxmix naïve) was run on the benchmark data using exclusively naïve mixture models. The goal was to investigate whether regression mixture models outperform naïve mixture models. While demuxmix naïve still performed better than some of the other methods, it yielded inferior results compared with demuxmix, indicating that the use of regression mixture models improved the demultiplexing of this dataset (Table 1 and Supplementary  Fig. S2). In particular, the application of regression mixture models improved the detection of MSMs (Supplementary  Table S1). Moreover, the superior SSD classification of demuxmix naïve compared with GMM-Demux, which employs normal mixture models after data transformation, implies that the negative binomial mixture models may provide a better data fit.

Validation on an independent dataset
The cell line mixture dataset (Stoeckius et al. 2018) consisted of a pool of 12 samples from 4 different cell lines as shown in Fig. 2A, and was used to validate demuxmix's performance. Instead of genetic demultiplexing, the cell lines' distinct transcription profiles were used to assign droplets to cell lines independent of the HTO data. Small clusters potentially consisting of doublets or apoptotic cells were removed, leaving four large distinct clusters representing the four cell lines ( Fig. 2A). The median HTO count in this dataset was 185 reads per droplet.
The correlation between HTO counts and the number of detected genes was weaker than in the human brain dataset (Fig. 2B). This was likely caused by the homogeneity of the cells within each cell line, whereas the brain dataset consisted of a mix of neuronal and glial nuclei of different sizes. Thus, the observed weaker correlation is probably mainly driven by technical factors rather than biological differences between the cells. Consequently, demuxmix selected a regression mixture model for only 2 out of the 12 HTOs indicating that modeling the relation between HTO counts and the number of genes improves the demultiplexing only marginally in this dataset. As exemplarily displayed in Fig. 2C and D, demuxmix's models fit the data well. One HTO ("KG1_C") showed a larger amount of background HTOs, causing some overlap between the positive and the negative component (Fig. 2D). Still, demuxmix's classification was overall highly consistent with the results from clustering the RNA profiles. The lower quality of the HTO "KG1_C" (ovs KG1_C ¼ 0.137) led to a slightly higher number of KG1 cells being classified as "negative" (Fig. 2E). In addition, considering Equation (4), the low quality of KG1_C likely increased the number of Table 1. Average scores from the simulation based on the human brain dataset using scaling factors ranging from 1 to 0.1. "uncertain" cells but did not lead to incorrect SSD assignments (Fig. 2E). All methods performed similarly on the original dataset ( Supplementary Fig. S3). To assess the methods in more detail, the cell line mixture dataset was used to simulate benchmark datasets of various quality, with the RNA cluster labels as the ground truth. A total of 350 MSMs were generated and the HTO signal was attenuated by a scaling factor s between 1 and 0.25. The performance of all methods declined considerably for s < 0.25. Supplementary Fig. S4 shows the results for each value of s; Table 2 and Supplementary Table S2 show the average performances. The results mainly replicate the findings from the human brain dataset. demuxmix achieved the highest precision SSD and the highest sensitivity SSD . Whereas all methods demonstrated a high precision SSD , larger differences were observed in the sensitivity SSD , indicating that other methods, compared with demuxmix, discard more SSDs resulting in a loss of data. Similarly, the observed MSM-related metrics were lower than in the human brain dataset, reflecting this dataset's slightly lower quality and sequencing depth as quantified by an average overlap score of 0.019 compared with 0.002 for the human brain dataset. As expected, the difference between demuxmix and demuxmix naïve was small since demuxmix mainly relied on naïve mixture models for this dataset.

Method
The simulation revealed that the differences between methods became more pronounced as the quality of the datasets was reduced. To evaluate the quality of typical real datasets, the overlap score was computed for 10 recently published HTO datasets available at the Gene Expression Omnibus and compared with the two benchmark datasets ( Supplementary   Fig. S5A). While some datasets demonstrated a comparable quality to the original benchmark datasets, others exhibited significantly lower quality, akin to the simulated datasets with a small scaling factor s ( Supplementary Fig. S5B). Consequently, the choice of method is likely to have an impact on the demultiplexing results of these datasets, as suggested by the simulation study.

Experimental design with non-tagged cells
Staining cells with oligonucleotide-labeled antibodies requires an additional experimental step, which may alter cell states or lead to cell loss. Thus, when rare and precious cells are pooled with highly abundant cells, staining only the abundant cells is desirable if the demultiplexing algorithm can reliably separate the negative from the positive (tagged) cells. For example, Fig. 3A illustrates an experiment where rare untreated CSF cells were pooled with oligonucleotide-labeled PBMCs. To assess demuxmix's performance in this setting, CSF and blood  were obtained from two unrelated donors so that genetic demultiplexing could be used as ground truth. A positive correlation of 0.48 was observed between the number of genes and HTO counts in the PBMCs (Fig. 3B). In the non-tagged CSF cells, which harbored some background HTOs, the correlation was weaker (r ¼ 0.18). In this dataset, demuxmix selected a mixture model with regression models in both components. Figure 3C shows the mixture probability mass function for a droplet with the average number of 1796.5 detected genes. For such an average droplet, the expected HTO count was 4977.8 for PBMCs and 897.4 for CSF cells. Figure 3D shows the decision boundary of the regression mixture model. For this special design with one nontagged sample, the acceptance probability was set to p acpt ¼ 0:99 to increase the precision at the cost of a larger amount of droplets being discarded as "uncertain." Ignoring droplets identified as "multiplets" or "uncertain" by the genetic demultipexing algorithm, demuxmix achieved a specificity of 0.985, a sensitivity of 0.980, and removed 261 cells (11.1%) as "uncertain" (Fig. 3E). In line with the previous observations, results from the naïve mixture models were worse, indicating the benefit of modeling the number of detected genes as explanatory variable (Fig. 3F, specificity of 0.983, sensitivity of 0.972, 308 cells [13.1%] removed as "uncertain"). Overall, these results demonstrate the feasibility of this experimental design where sensitive unlabeled cells are pooled with tagged cells.

Discussion
Successful demultiplexing is essential for downstream data analyses. The overall good performance of the new method presented here can be attributed to two major improvements: (i) an accurate probabilistic model of HTO counts based on the negative binomial distribution and (ii) the addition of the number of detected features in the RNA library to the model as an explanatory variable that explains parts of the variance of the HTO counts.
The performance of demuxmix was assessed in a comparative study based on two published benchmark datasets with an available ground truth derived from two different approaches utilizing genetic diversity or distinct transcription profiles, respectively. The characteristics of the benchmark datasets have implications for the interpretation of the results. First, the human brain and the cell line mixture datasets were carefully optimized and generated to feature antibody-based nucleus and cell labeling and are of exceptional quality (Stoeckius et al. 2018, Gaublomme et al. 2019. Thus, all methods performed well on these data and the HTO signal had to be attenuated to detect larger differences between the methods. However, as shown in a small sample of 10 published HTO datasets, the quality of real HTO data varies depending on factors like tissue quality and may even fall below the quality of the simulated data ( Supplementary Fig. S5), so that the method selection likely has a larger effect on the results for these datasets. Second, cells in the cell line mixture dataset were distinct between samples but very homogenous within samples (same cell line). This design facilitated demultiplexing based on the transcription profiles but is uncommon in real experimental designs. Cellular heterogeneity is a driving factor underlying the association between HTO counts and the number of detected genes, which could not be utilized by demuxmix in this dataset. Third, the reported error rates in the benchmark study are an upper boundary since the (B) Scatter plot shows the correlation between the log number of detected genes and the log counts of the HTO used to tag the respective cells as determined by genetic demultiplexing. Multiplets and uncertain droplets were removed. Regression lines are plotted and Pearson correlation coefficients are shown in the legend. (C) Histogram depicts the distribution of HTO counts overlaid by the probability mass function of the mixture model and its two components. Since the mixture probability mass function depends on the number of detected genes, the probability mass function for a droplet with the average number of genes observed in the data is shown, and the histogram shows adjusted HTO counts after regressing out the effect of the number of genes. (D) Scatter plot shows the relation between HTO counts (x-axis), the number of detected genes (y-axis), and the posterior probability for the droplet containing a PBMC (color) obtained from the model. The black dashed curve indicates the decision boundary where the posterior probability equals 0.5. (E) Matrix displays the concordance between the droplet classifications obtained from genetic demultiplexing by freemuxlet (x-axis) and demuxmix (y-axis). (F) As in (E) but only naïve mixture models were used by demuxmix, ignoring the number of detected genes in the droplets. demuxmix ground truth cannot be assumed to be free of any error despite the stringent filters. Compared with typical experimental designs, the unusually large number of pooled samples (8 and 12) further complicated the demultiplexing. However, these factors are likely offset by the high quality of the benchmark datasets.
The benchmark study provided valuable insights, and despite the differences between the two benchmark datasets, the rankings based on average performance measures were highly similar. Consistent with the results presented here, a recently published independent benchmark study confirmed the superior performance of demuxmix (Howitt et al. 2022). Not all demultiplexing errors are equally important. This study primarily focused on (i) whether predicted SSDs were classified correctly (precision SSD ) since only these droplets remain in the dataset and (ii) on the recovery of SSDs (sensitivity SSD ) since undetected SSDs result in a loss of data. Notably, the observed precision SSD never fell under 0.9 in the benchmark study, indicating that none of the methods produced many incorrect SSD assignments even when the HTO signal was artificially weakened. However, some methods discarded a significant number of true SSDs as uncertain or multiplets (low sensitivity SSD ) to achieve a high precision SSD . demuxmix demonstrated the highest precision SSD in both datasets and classified on average 94% of SSDs (other methods 87%-93%) in the human brain dataset and 85% of SSDs (other methods 67%-81%) in the cell line mixture dataset.
The results obtained from both benchmark datasets were consistent, irrespective of whether antibodies that target cell surface proteins or those targeting the nuclear core complex were employed. This suggests that the methods can be successfully applied for both cell and nucleus demultiplexing. Moreover, the comprehensive study conducted by Howitt et al. (2022) assessed demuxmix and other methods using various datasets, including lipid-based cell tagging data. Based on their results, the study concluded that the methods were agnostic to different tagging protocols.
Depending on the tagging protocol and cell characteristics, certain cells may display lower HTO counts compared with others. For instance, when using antibodies, cells with lower antigen expression may exhibit fewer HTO counts in comparison to cells with high antigen expression within the same cell pool (Mylka et al. 2022). demuxmix uses the negative binomial distribution to account for additional biological variance. However, further studies are needed to thoroughly evaluate the performance of demuxmix and other methods in such complex scenarios.
Finally, the CSF dataset was used to demonstrate that demuxmix can successfully demultiplex datasets where only one of two pooled samples was tagged with HTOs. This novel experimental design is useful when pooling rare and sensitive cells together with highly abundant cells. Although some of the other methods could theoretically be used to demultiplex the CSF dataset, the algorithms' implementations required a minimum of two HTOs, which hindered their application to this dataset.
In summary, demuxmix is a new versatile demultiplexing method that showed superior performance in both the benchmark presented in this study and an independent benchmark published recently (Howitt et al. 2022). The method is implemented as an R/Bioconductor package and provides methods for estimating error rates, calculating quality metrics, and generating diagnostic plots.