Benchmarking of a Bayesian single cell RNAseq differential gene expression test for dose–response study designs

Abstract The application of single-cell RNA sequencing (scRNAseq) for the evaluation of chemicals, drugs, and food contaminants presents the opportunity to consider cellular heterogeneity in pharmacological and toxicological responses. Current differential gene expression analysis (DGEA) methods focus primarily on two group comparisons, not multi-group dose–response study designs used in safety assessments. To benchmark DGEA methods for dose–response scRNAseq experiments, we proposed a multiplicity corrected Bayesian testing approach and compare it against 8 other methods including two frequentist fit-for-purpose tests using simulated and experimental data. Our Bayesian test method outperformed all other tests for a broad range of accuracy metrics including control of false positive error rates. Most notable, the fit-for-purpose and standard multiple group DGEA methods were superior to the two group scRNAseq methods for dose–response study designs. Collectively, our benchmarking of DGEA methods demonstrates the importance in considering study design when determining the most appropriate test methods.


INTRODUCTION
Single-cell transcriptomics enables researchers to investigate homeostasis, development and disease at unprecedented cellular resolution (1)(2)(3)(4)(5). As with any new innovative technology, diverse tools soon follow to address spe-cific applications and unique challenges. Currently, there are dozens of differential gene expression analysis (DGEA) approaches for single-cell RNAseq (scRNAseq) data; developed based on differences in assumptions, statistical methodologies, and study designs (6)(7)(8)(9)(10)(11). A recent comparison of 36 approaches demonstrated acceptable performance for common bulk RNAseq tools such as edgeR and limmatrend, and MAST for snRNAseq, as well as common statistical tests such as the Wilcoxon rank sum (WRS) and the t-test (9). However, most methods have been developed primarily for two group comparisons whereas study designs typical of pharmacology and toxicology experiments such as as dose-responses consist of multiple groups. The use of two sample tests for multiple group study designs elevate the type I error rate warranting further investigation of these methods for multiple group dose-response study designs (12).
Dose-response studies are used to derive the efficacy and/or safety margins such as effective dose and the point of departure (POD). Significant efforts by the toxicology and regulatory communities have suggested that acute (<14 days) and sub-acute (14-28 days) transcriptomic studies as viable alternative to the current standard 2-year rodent bioassay that significantly reduces the time and resources needed to assess risk (13)(14)(15). Gene expression profiling at single-cell resolution could further support such evaluations by identifying cell-specific dose-dependent responses indicative of an adverse event. The U.S. National Toxicology Program (NTP) recently reported a robust DGEA approach is essential to deriving biologically relevant PODs (15). However, concerns regarding the inclusion of false positives that produce less conservative POD estimates potentially leads to incorrect classification of mode-of-action PAGE 2 OF 15 (MoA), thus highlighting the importance of controlling type I error rates (16,17).
Unlike microarray and bulk RNAseq datasets, single-cell RNAseq (scRNAseq) data exhibits excess zero values due to the low per cell RNA input, biases in capture and amplification, transcriptional bursts, and other technical factors (18). This no expression (zero values), due to a conflation of both biological and technical factors, results in an excessive number of zeroes in an otherwise continuous measure (19). Therefore, traditional tests of differential gene expression, based on the assumption of a normal distribution, fail to correctly model the bimodality of single cell gene expression (20). Consequently, scRNAseq test methods usually consider the gene expression distribution as a mixture of a unexpressed (zero) and a positively (non-zero) expressed population (19)(20)(21). For example, the Seurat Bimod approach tests for differential gene expression using a likelihood ratio test designed for the said mixture population. MAST extends the Seurat Bimod test to a two-part generalized linear model structure capable of incorporating covariates (19,20). Given the improved performance of MAST (9,(19)(20), we hypothesized that multiple group tests developed assuming the same distributional framework would be most favorable for dose-response study designs. Furthermore, a Bayesian approach which considers prior knowledge is anticipated to minimize type I error rates (22,23).
The aim of the presented study is to evaluate the performance of existing and novel DGEA test methods on dose-response scRNAseq datasets. To reduce the rate of false positives we propose a novel, multiplicity corrected, Bayesian multiple group test (scBT) designed exclusively for DGEA of dose-response scRNAseq data. Two other fitfor-purpose frequentist multiple group tests are also examined: (i) a multiple group extension of the Seurat Bimod test and (ii) a simple extension of test (i) to a generalized linear model framework. Existing and proposed methods are benchmarked on simulated and real experimental doseresponse datasets. Using simulated datasets we were able to investigate the influence of various parameters such as number of cells, and illustrate how using different test methods can aid in gaining biological insight on the role of individual cell types on the pathophysiological consequences of exposure.

Animal handling and treatment
Male C57BL/6 mice aged postnatal day (PND) 25 were obtained from Charles Rivers Laboratories (Kingston, NY) were housed and treated as previously described (24). Mice were housed in Innovive cages (San Diego, CA) with ALPHA-dri bedding (Shepherd Specialty Papers, IL) at 23 • C, 30-40% relative humidity, and a 12:12 h light:dark cycle. Aquavive water (Innovive) and Harlan Teklad 22/5 Rodent Diet 8940 (Madison, WI) was provided ad libitum. On PND 29, randomly assigned mice were gavaged at Zeitgeber time (ZT) 0 with 0.1 mL sesame oil vehicle (Sigma-Aldrich, St. Louis, MO), 0.01, 0.03, 0.1, 0.3, 1, 3, 10 or 30 g/kg TCDD every 4 days for 28 days (7 total administered doses). At day 28 mice were euthanized by CO 2 asphyxiation and livers were immediately flash frozen in liquid nitrogen and stored at −80 • C. All animal procedures were approved by the Michigan State University (MSU) Institutional Animal Care and Use Committee (IACUC) and reporting of in vivo experiments follow the Animal Research: Reporting of In Vivo Experiments (ARRIVE) (25) and Minimum Information about Animal Toxicology Experiments (MIATE) guidelines (https://fairsharing.org/FAIRsharing.wYScsE).

Real scRNAseq and snRNAseq datasets
Hepatic single-nuclei RNA-sequencing (snRNAseq) was performed as previously described using the 10× Genomics Chromium Single Cell 3 v3.1 kit (10X Genomics, Pleasanton, CA) (26). Briefly, nuclei were isolated using EZ Lysis Buffer (Sigma-Aldrich), homogenized by disposable Dounce homogenizer, washed, filtered using a 70-m cell strainer. The nuclei pellet was resuspended in buffer containing DAPI (10 g/ml), filtered using a 40-m strainer, and immediately sorted using a BD FACSAria IIu (BD Biosciences, San Jose, CA) at the MSU Pharmacology and Toxicology Flow Cytometry Core (facs.iq.msu.edu/). Sequencing (150-bp paired end) was performed at a depth of 50 000 reads/cell using a NovaSeq6000 at Novogene (Beijing, China). CellRanger v3.0.2 (10× Genomics) was used to align reads to mouse gene models (mm10, release 93) including introns and exons to consider both pre-mRNA and mature mRNA gene models. Seurat was used to integrate and log-normalize expression data (27). The data is available on the Gene Expression Omnibus (GEO) at accession ID GSE184506 and R package versions are listed in Supplementary Table S1. Additional real datasets were publicly available. Hepatic whole-cell generated using the 10X Genomics platform was obtained from GEO (GSE129516) (5). Hepatic single-nuclei processed as the dose-response data for control and high dose TCDD treatment (0 and 30 g/kg) was obtained from GEO (GSE148339). Peripheral blood mononuclear cell (PBMC) data also generated using the 10× Genomics platform and Seurat was obtained from the SeuratData R package (27).
Gene set enrichment analysis of experimental data was performed using the fgsea v1.14 R package on gene lists sorted by significance values (e.g. P-value). Gene sets from BIOCARTA, KEGG, PANTHER and WIKIPATHWAYS were obtained from the Gene Set Knowledgebase (GSKB; http://ge-lab.org/gskb/) and filtered for gene sets containing 15-250 genes. Gene sets were agglomerated based on overlap of gene membership and only those showing ≥50% overlap were considered similar for subsequent network analyses. Visualization and calculation of measures of centrality were determined using igraph v1.2.7. Gene sets were considered enriched when adjusted P-value ≤0.05.

Dose-response data simulation
To simulate dose-response scRNAseq datasets we developed a wrapper for the Splatter R package (28). Splatter simulates counts using parameters estimated from real data to set the mean expressions, variance and outlier probability. Other parameters such as the number of cells, genes, probability of being differentially expressed, mean foldchange of DE genes (location) and standard deviation of fold-change of DE genes (scale) were manually assigned to best reflect real data. The wrapper (SplattDR) leverages the group simulation feature of Splatter by applying a multiplicative factor estimated using dose-response models in Table 1 based on the US EPA Benchmark Dose Software (29). SplattDR R package is available at (https://github. com/zacharewskilab/splattdr) and R session information is listed in Supplementary Table S1.

Single cell RNA-seq hurdle model
We model the log-normalized gene expression matrix using a hurdle distribution wherein the rate of gene expression is assumed to follow a Bernoulli distribution and conditional on a cell expressing the gene, the log-normalized expression level is assumed to follow a Gaussian distribution (19). We denote Y i, j to be the log-normalized expression value of gene j in cell i, for i = 1, . . . n and j = 1, . . . p. To characterize the bimodal properties of single cell data, for a given cell, a gene is defined to be either positively expressed or undetected. Define R ij = I[Y ij > 0] to be the indicator variable denoting the presence or absence of expression for gene j in cell i. Following (19), the log-normalized gene expression values are modeled as follows: where j and σ 2 j denote the mean and variance of the gene expression level, conditional on the gene being expressed and j denotes the rate of gene expression of gene j across all cells. Since the binary variable R ij denotes the absence/presence of gene expression Y ij , the Bernoulli distribution provides a natural representation as it can be thought of as a model for the set of possible outcomes of any single experiment that asks a yes-no question.
Hypothesis formulation. We now assume that our data has been collected under K conditions (doses), and denote the data by D k, o ≡ {(Y k, i, j , R k, i, j ), i = 1, . . . , n k } . The underlying populations for the sample data D k, o for k = 1, 2, . . . , K, dose groups are assumed to be identified by the parameters (μ k, j , σ 2 j , ω k, j ). The aim of this study is to test for difference in gene expression patterns between the different dose groups. Traditionally one would perform an ANOVA test to detect changes in mean across groups for samples with continuous measurements. However, to account for the bimodality in single cell gene expression distribution, the test should detect for changes in j and j simultaneously, as both could drive differential gene expression. Therefore we define, H 0 : μ 1, j = μ 2, j = . . . μ K, j = μ j and versus the alternative H a : μ k, j is different for at least one k and ω k, j is different for at least one k, k = 1, . . . K Single cell Bayesian hurdle model analysis (scBT). Given the single cell RNA-seq hurdle model structure, we assume that a priori, given σ 2 where IG is the inverse gamma distribution with shape a and scale b and m k, 0 , k, , a , b , a k, , b k, are the hyperparameters. Given the large number of gene-wise model fits arising from a single cell experiment, there is a pressing need to allow for a parallel structure whereby the same model is fitted to each gene. The prior distributions on the parameters describe how the unknown coefficients k, j k, j and σ 2 j vary across the genes and the dose groups while allowing for information borrowing between the genes. Now, based on the model assumptions, we propose a Bayesian test for simultaneously testing the differences in mean gene expression and dropout proportions as formulated in Equation (2). Under the null hypothesis the marginal likelihood is written as Under the alternative hypothesis we compute the marginal likelihood without any restriction on the K means k, j and the dropout parameter k, j ; k = 1, 2, . . . K. Particularly, we assume that μ k, j ∼ Normal(m k,0 , τ k,μ σ 2 j ), and σ 2 j ∼ IG(a σ , b σ ), k, j ∼ Beta(a k, , b k, ); k = 1, 2, . . . K. Now, the marginal likelihood under the alternative hypothesis is given by The Bayes factor is then defined as where (H a ) and (H 0 ) are the prior probabilities for the alternative and null model, respectively. The hyperparameters are obtained by maximising the marginal likelihood un-  (2), scBT conducts a test of DE for each gene independently. To control for multiplicity we adopt the FDR correction approach discussed in (30). The rejection threshold is estimated in terms of the posterior probabilities of the null hypothesis, p(H 0, j |D j ). For a target FDR ␣, the procedure rejects all hypotheses with

Multiple group likelihood ratio test (LRT).
To carry out a direct performance comparison with scBT, we extend the Seurat Bimod (19) for multiple groups. Assuming that all the K groups have the same variance σ 2 j and omitting the index j for clarity, the likelihood ratio test can be defined as; where the likelihood can be written as: with Y and R representing the gene observation vector and the gene indicator vector across K dose groups, is the number of cells expressing the gene in group k and f is the density function of the normal distribution with parameters k and 2 . Following (19), it can be shown that the combined LRT is the product of a binomial and a normal LRT statistic, both of which can easily be derived using classical statistical theory. Applying classical asymptotic results about LRTs , −2log (Y, R) converges to a 2 distribution with (2K − 2) degrees of freedom under H 0 . Detailed derivations of the test statistic can be found in the Supplementary Material.

Linear model-based likelihood ratio test (LRT linear)
. The generalized linear model approach MAST was identified as one of the top performing tests for pairwise differential expression testing (9,20). Deriving from their approach, the LRT multiple test is extended to a generalized linear model framework, where the mean and the dropout proportions are modelled as a linear function of the dose groups (assumed to be a continuous covariate). Using the same distributional assumptions defined in Equation (1) we fit a logistic regression model for the discrete variable R and a Gaussian linear model for the continuous variable Y conditional on (R = 1) independently, as follows: resents the continuous dose groups. Under this modelling approach, the null hypothesis described in Equation (2) can be rewritten as H 0 : E(Y ij |R ij = 1) = m 0j and logit{P(R i, j = 1)} = 0j . The regression models are fit using the lm and brglm functions in the stats and brglm R packages. The likelihood ratio test statistic is computed using the same statistical theory discussed for the LRT multiple test and it asymptotically follows a 2 distribution under H 0 .

Benchmarking method selection
Our fit-for-purpose tests were benchmarked to existing differential expression testing methods or their multiple group equivalent based on previously reported performance, ability to consider multiple groups, or whether they served as foundation for the scBT and multiple group LRT (LRT multiple) tests developed here. Seurat Bimod served as foundation for the scBT and LRT multiple tests as previously outlined, and MAST was identified as one of the top performing test for two group comparisons (9). Similarly, limma-trend performed well for two sample comparisons and can consider multiple groups. The WRS test was identified as providing excellent balance between its ability to identify DE genes and speed, and is the default test for the Seurat R package for scRNAseq analysis. It was also reported that the t-test performed well and therefore we included the ANOVA and Kruskal-Wallis (KW) tests, a parametric and non-parametric alternative of the t-test for multiple group comparisons. All tests were run without correction for batch effects or other nuisance covariates. Multiplicity for each test was controlled using FDR correction. All tests, including scBT, LRT multiple and LRT Linear are available in our scBT R package (https://github.com/ satabdisaha1288/scBT). R session information is listed Supplementary Table S1. A flow diagram outlines our benchmarking approach (Figure 1). (19) is a pairwise differential gene expression testing approach developed assuming the single cell RNA-seq hurdle model framework. The test is formulated as H 0 : the mean and the dropout parameters of the gene vector under two dose groups are equal versus H a : the mean and the dropout parameters differ over the two groups. The LRT based test statistic −2log (y, r) converges to a 2 distribution with 2 degrees of freedom under H 0 . The computations are carried out using the R Package Seurat.

Seurat bimod. Seurat bimod test
MAST. MAST (20) proposes a two-part generalized linear model for differential expression analysis of scRNAseq data. The first part models the rate of gene expression using logistic regression logit(ω i j ) = X i β ω j and the second part uses a linear model to express the positive gene-expression j are the coefficients of the covariates used in the logistic and linear regression model respectively. A test with an asympotic 2 null distribution is employed for identifying DEGs and multiplicity is controlled using FDR correction (31). Despite the fact that LRT-linear and MAST have the same hurdle regression framework, the estimation process for the two methods has some significant differences. First, to achieve shrinkage of the continuous variance, MAST assumes a gamma prior distribution on the precision (inverse of variance) parameter and estimates its posterior maximum likelihood estimator (MLE) and uses that in place of the regular MLE of the precision parameter. Second, it fits a Bayesian logistic regression model for the discrete component by assuming Cauchy distribution priors centered at zero for the regression parameters. This is done to deal with cases of 'linear separation' where the parameter estimates diverge to Figure 1. Flow diagram of the simulation, benchmarking, and experimental data evaluation strategy presented in the manuscript. Briefly, SplattDR was developed to simulate dose-response scRNAseq data and validated based on experimental dose-response data. Simulated datasets were generated varying diverse parameters 10 times and then used to assess the performance of each test method. Each test method was also assessed using experimental data from the hepatic snRNAseq dose response dataset obtained from male mice gavage every 4 days for 28 days with 0.01, 0.03, 0.1, 0.3, 1, 3, 10 or 30 g/kg TCDD. Related figures for each analysis from the main body are noted.
±∞ and the Fisher information matrix becomes singular. And finally, it considers the cellular detection rate defined as CDR i = 1 p P j =1 R i j to be a covariate in both the logistic and linear regression models. LRT linear on the other hand simply fits the non-Bayesian linear and the logistic regression models without considering variance shrinkage or adjustment for additional covariates.
limma-trend. Limma-trend (32) proposes a linear model based differential expression approach for modelling RNAseq experiments of arbitrary complexity. Their framework models the mean gene expression as a function of several continuous and categorical covariates. A separate linear model is fitted for each gene, but the gene-wise models are linked by global parameters using the parametric empirical Bayes approaches (33). The global variance estimated by the empirical Bayes procedure also incorporates a mean variance trend, allowing better modelling of low abundance genes. Finally, test of differential gene expression is carried out by testing the significance of one or more coefficients of the fitted linear model. (34) test is a nonparametric test commonly used for pairwise DGE testing. The test is formulated as H 0 : the distributions of the gene vector under two dose groups are equal versus H a : the dis-tributions are not equal.The test involves the calculation of the U statistic, which for large samples is approximately normally distributed. Since this is a pairwise test, a union is taken over all the genes found to be DE in each of the pairwise tests. The computations are carried out using the wilcox.test function in R package stats and multiplicity is controlled using FDR correction.

Wilcoxon rank sum (WRS) test. WRS
ANOVA. Analysis of variance (ANOVA) (35) is very commonly used for testing the differences among means in multiple groups. For a fixed gene j, it is assumed that the observed gene vector y k, i, j for cell i is grouped by dose. Assuming that Y k,i, j ∼ Nor mal(μ k, j , σ 2 j ), ANOVA aims to test the null hypothesis H 0 : 1, j = 2, j = . . . K, j = j versus H a : k, j , i = 1, . . . , n; j = 1, . . . , p; k = 1, . . . , K is different for at least one k. The test statistic is computed using the aov function in R package stats and it follows a F-distribution with (K − 1) and (n − K) degrees of freedom. Multiplicity is controlled by applying FDR correction on the obtained P-values. Kruskal putation of the KW test statistic is carried out using the kruskal.test function in R-package stats and it asymptotically follows a 2 distribution with K − 1 degrees of freedom. Multiplicity is controlled by applying FDR correction on the obtained P-values.

Benchmarking and sensitivity analyses
Benchmarking of DE test methods was performed on simulated datasets based on initial parameters derived from real dose-response snRNAseq data. The probability of differential expression was set to 10% with a 50% probability of being down-regulated, equally distributed among the dose-response models in Table 1. Batch parameters were used to include sample variation associated with data obtained from 3 individuals in each dose group. A total of 5000 genes were simulated for 4500 cells (500 per dose group) using the same doses as the real dataset. Sensitivity analyses varied each of the following parameters according to values in Supplementary Tables S2 and S3: cell abundance equally distributed among dose groups, varying cell numbers in each dose group, percent DE genes, proportion of downregulated DE genes, fold-change location or scale, and dropout rate. Each simulation was replicated 10 times using a different initial seed. Seeds and parameters listed in Supplementary Table S2 can be used to reproduce datasets under each condition. Method concordance was determined as area under the concordance curve (AUCC) for the top 100-or 500-ranked genes in simulated and real datasets, respectively, as previously described (37).

Dose-response single-cell data simulations
For benchmarking of DGEA methods, a ground truth is required. Existing simulation tools such as PowSimR, SymSim, SPsimSeq and Splatter are commonly used for power analyses, evaluating DE analysis methods, and testing cell clustering strategies (28,(38)(39)(40). Tools such as Sym-Sim and Splatter are also capable of simulating cell trajectories and model differentiation processes. Trajectories which exhibit non-linear changes over time or across different developmental stages are not unlike dose-response effects which change over a continuum of doses. However, dose-responsive changes commonly follow defined trajectories such as Hill, exponential, power, and linear models (29). To simulate dose-response scRNAseq data we developed a wrapper for the Splatter scRNAseq data simulation tool named SplattDR. SplattDR modified the Splatter grouped data simulation strategy by adjusting counts from means defined by one of the dose-response functions outlined in the Materials and Methods.
To demonstrate the modeling capability of SplattDR, 10 000 gene expression responses were simulated with a 10% probability of being differentially expressed, equally distributed across the dose-response models. Parameters used in Splatter were initially estimated from our experimental single nuclei RNAseq (snRNAseq) dose-response dataset. The simulated data compared to the experimental data showed the relationship between the mean expression, percentage of zeroes, and mean variance were consistent (Fig-ures 2A, B). Estimation of the normalized root mean square deviation (NRMSD) from a curve fit to the experimental data indicated excellent concordance. This strong concordance was also maintained within distinct dose groups ( Supplementary Figures S1 and S2). The distribution of log(fold-changes) between vehicle (dose 0) and the highest simulated dose (dose 9; 30 g/kg) showed a more even distribution within a similar range compared to experimental data which was skewed towards induction ( Figure 2C). However, the gene induction skew was captured by modulating the parameters affecting the probability of differential expression and the proportion of differentially repressed genes (Supplementary Figure S3). Principal components analysis (PCA) of the simulated data clearly showed the dose-dependent characteristics of scRNAseq data with distinct clusters increasing in separation with increasing dose ( Figure 2D) which was also resolved by PCA within the experimental data (Supplementary Figure S4).
To our knowledge, no other published in-vivo doseresponse scRNAseq datasets are available limiting the number of datasets to estimate initial parameters for simulation to date. To investigate whether existing datasets generated using a different study design (e.g. whole cells or different tissue source) could be used to derive initial parameters, we also simulated 10 000 genes starting with parameters estimated from (i) a two-dose liver snRNAseq (GSE148339), (ii) whole cell liver scRNAseq (GSE129516) and (iii) peripheral blood mononuclear cells (PBMC; GSE108313) datasets. When compared to a model fit for experimental data to determine the relation between mean expression and percent zeroes or mean variance, the NRMSD for data simulated from these datasets were between 1 and 10% with data simulated from whole cell data differing the most from the model fit ( Figure 2E). We then explored whether parameters estimated from distinct cell types could replicate the characteristics of that same cell type ( Figure 2F). Not surprisingly, using initial parameters derived from individual cell types in the experimental dose-response data had lower NRMSD than those derived from the whole cell dataset. Notably, when data derived from a lower abundant cell subtype was used to estimate starting parameters, the doseresponse characteristics for that cell subtype was also poorly modeled ( Figures 2E, F, S1-2).

Performance accuracy of DE test methods
We evaluated the performance of several differential gene expression analysis methods on simulated datasets consisting of nine dose groups of 500 cells each (4500 total) and 5000 genes with a 10% probability of being differentially expressed (500 differentially expressed genes). Selection criteria for test inclusion are outlined in the Materials and Methods section and included 9 test methods; ANOVA (35), single-cell Bayes hurdle model test (scBT), Kruskall-Wallis (KW) (36), limma-trend (32,33), likelihood-ratio test (LRT) linear and multiple, MAST (20), Seurat bimod (19) and WRS (34). With ground truth from simulated data, the sensitivity, specificity, and precision for each test method was computed. Area under the receiver-operating characteristic curve (AUROC) was used to measure test performance for correctly classified differentially expressed genes.

PAGE 7 OF 15
Nucleic Acids Research, 2022, Vol. 50, No. 8 e48 In unfiltered data, AUROC scores showed similar performance for most tests except scBT which had the largest AUROC among all test methods ( Figure 3A). To account for the inherent class imbalance between differentially expressed and non-differentially expressed classes the area under the precision-recall curves (AUPRC) was also calculated. Similar to AUROCs, AUPRCs identified scBT as the best performing test ( Figure 3C). In most standard differential expression testing pipelines genes expressed at low levels are removed to minimize false detection rates. Following filtering of genes expressed in ≥5% of cells in any dose group, scBT was consistently ranked as the best test based on AUROC and AUPRC scores. The performance of LRT linear test also improved, with comparable AUROC and AUPRC scores relative to scBT, suggesting LRT linear is poorly suited for genes expressed at low levels ( Figures  3B-D).
AUROC and AUPRC reflect the performance of each test method with varying significance (i.e. P-value) thresholds. In the standard pipeline a fixed threshold is used, typically a P-value ≥0.05 after adjustment for multiple hypothesis testing (i.e. Bonferroni correction). For each method except scBT, the performance at an adjusted P-value ≥0.05 significance criteria was evaluated. In scBT analysis, a gene was considered differentially expressed when the estimated posterior probabilities of the null hypothesis, p(H 0, j |D j ), was less than , where the value was chosen to achieve a target FDR of 0.05. scBT significantly outperformed all other tests in precision rates irrespective of low expression filtering ( Figures 3E, S5). However, scBT was less effective in identifying true positives (Figures 3F, S5). Applying the filtering criteria improved the recall rates, but the precision rates remain largely unchanged ( Figure 3E, F). Test method classification performance scores were estimated as the Matthews correlation coefficient (MCC) which is well suited for unbalanced data (41). We see that the scBT and LRT linear tests performed best for this metric on both unfiltered and filtered data ( Figure 3G).

Type I error control and power
To investigate test performance in controlling type I errors (false positives), DGEA methods on simulated datasets were examined with 0% DE genes (i.e. negative control). Using the threshold for the computed posterior null probabilities, scBT identified only one false positive gene in 2 of 10 simulations ( Figure 4A). ANOVA, scBT, KW, limma-trend and LRT linear had false positive rates (FPRs) below 3% indicating better performance compared to two group tests. After filtering for genes with low expression levels, scBT still correctly identified all the non-differentially expressed genes and was the best performing test. These are the same tests that had a better FPR control in initial simulations ( Figure  3). To explore whether mean expression or percentage of zeroes influenced type I error rates, a logistic regression model was fit to negative control data. We predicted the probability for each gene to be identified as differentially expressed in the negative control data. While the curve for scBT is missing since few false positives were identified, the predicted FPR for all the other tests except LRT linear were also high for highly expressed genes with few zeroes (Figures 4B, C). Next, a positive control dataset with 100% differentially expressed genes was simulated to evaluate test performance for detecting true positives. All tests except scBT exhibited a false negative rate (FNR) ≥40% ( Figure 4D). The best performing tests for FNR also had high FPR. Logistic model regression fitting for false negative classification of genes shows that the false negative rates were highest when the mean expression was either too high or too low for all tests (Figures 4E, F).

Parameter Sensitivity Analysis
Experimental scRNAseq datasets will vary between cell types, cell composition, and responses depending on the target tissue, treatment, number of cells sequenced, and more. For example, some distinct cell types are very abundant (e.g. hepatocytes), with others present at lower levels (e.g. portal fibroblasts) in hepatic scRNAseq datasets. Moreover, treatments such as exposure to a xenobiotic, can elicit dosedependent changes in relative proportions of cell types such as the infiltration of immune cells (26). We investigated the impact by changing cell abundance from 25 to 2000 cells per dose group and observed an increase in the false positive rate (FPR) when increasing the number of cells (Supplementary Figure S6). The scBT and LRT linear tests were less sensitive to an increase in the FPR as cell abundance increased while the total positive rates (TPR + FPR) increased with cell abundance for all methods. Although all tests exhibited comparable performance at low cell numbers (≥500), as cell numbers increased scBT outperformed all other tests in both precision and MCC score ( Figures  5A, S6). Comparison of AUROCs and AUPRCs across cell numbers showed that ANOVA, KW, limma-trend, and LRT linear tests performed best for a small number of cells, but the increase in AUROC was steeper for scBT (Supplementary Figures S7 and S8).
It was also evident from the experimental snRNAseq dataset that the number of cells per dose group was not fixed. We evaluated the performance of the test methods when the number of cells dose-dependently increased or decreased, and when the number of cells per dose group were taken from experimental data. Notably, while scBT had the best MCC for increasing number of cells per dose, LRT linear performed better than scBT when the number of cells decreased before and after filtering for genes expressed at low levels ( Figure 5B). The shift in MCC between increasing and decreasing cell numbers for scBT appears to be driven rate (FPR) of 9 differential expression test methods estimated from negative control (0% DE genes) simulated dose-response scRNAseq data including all genes expressed in at least 1 cell (unfiltered) and genes expressed in only ≥ 5% of cells in at least one dose group (filtered). (B, C) Logistic regression models were fitted to negative control data to predict the probability of false positive identification using percent zeroes and mean expression as covariates. Lines represent the predicted probability of false positive classification with the shaded region representing the 95% confidence interval. (D) False negative rate (FNR) of nine differential expression test methods estimated from positive control (100% DE genes) simulated dose-response scRNAseq data including unfiltered and filtered datasets. (E, F) Logistic regression models were fit to positive control data. Lines represent predicted probability of false negative classification with shaded region representing the 95% confidence interval. by a concomitant decrease in TPR and increase in FNR (Supplementary Figures S9 and S10).
Unique chemical, drug, environmental contaminant, and natural product classes elicit distinct differential gene expression profiles defined by the MoA as well as by their metabolism, potency (sensitivity) and efficacy (maximal response). Differences between compound classes are reflected in the gene expression profile in (a) the proportion of differentially expressed genes, (b) the number of induced/repressed genes, (c) the mean fold-change for differentially expressed genes and (d) the distribution of foldchange for differentially expressed genes. These four parameters were modulated in simulated data to determine the effect of the percentage of differentially expressed genes (Supplementary Figures S11-S16), the fold-change distribution (aka scale; Supplementary Figures S17-S19), and the mean fold-change (aka location; Supplementary Figures S20-S22) on test performance. Among these scenarios, changing the proportion of repressed genes had little to no impact on test method performance ( Figures 5C-F, S14).
Increasing the proportion of differentially expressed genes led to an improvement in MCC except for scBT and LRT linear, though these tests maintained the top MCC scores as well as AUROC and AUPRC (Figures 5C, S11-S13). As the magnitude of the effect increased, LRT linear performed best at the low end while scBT exhibited the greatest improvement in MCC ( Figure 5D). Conversely, while the MCC decreased for most tests when modulating the fold-change scale of differentially expressed genes, scBT improved and was more stable (Figures 5E, S17-S19). As the proportion of unexpressed genes increased, the FPR increased with precision decreasing for all tests (Supplementary Figure S23). However, scBT was least affected, and maintained the highest MCC among all tests ( Figure 5F). AUROC and AUPRC values also indicated that scBT consistently outperformed other test methods ( Supplementary  Figures S24-S25).

Test method agreement
To assess agreement between tests, the area under the concordance curve (AUCC) for each pair of tests for the top 100 genes ranked by adjusted P-value was calculated as previously described (9,37). All methods showed excellent concordance (AUCC ≥ 0.77) with LRT linear showing the poorest consistency compared to all other tests while the limma-trend and ANOVA tests showed perfect agreement with an AUCC of 1 (Supplementary Figure S5). Pairwise differential gene expression comparisons between Seurat Bimod, MAST and WRS had AUCC >0.95 AUCCs while the multiple group tests ANOVA, LRT multiple, KW, and scBT clustered together with AUCC ranging between 0.9  Supplementary Table S3. Points represent median and error bars represent minimum to maximum values. Boxplots represent median, 25th to 75th percentile, and minimum to maximum values. Each analysis consisted of 10 replicate datasets including all genes expressed in at least one cell (unfiltered) and genes expressed in ≥5% of cells in at least one dose group (filtered). and 1. In the absence of nuisance covariates, MAST and Seurat Bimod provided similar results, as expected given their similar mixture normal model structure. Likewise for ANOVA and limma-trend, both of which rely on normality assumptions for testing differential gene expression.

Real dose-response dataset DE analysis
Without ground truth for experimental data, the performance of the differential expression test methods was examined by first evaluating the agreement for each identified cell type (Figures 6, S26). Genes in the experimental dataset were considered differentially expressed when expressed in ≥5% of cells in at least one dose group and had a |fold-change| ≥ 1.5. In hepatocytes, the most abundant cell type, fewer than 5 genes were not detected in all test methods, with the majority missed by the WRS test (Figure 6A). Upon closer examination, those genes were not expressed in control hepatocytes. Not surprisingly, for all cell types, the largest intersection was between all tests indicating strong agreement within all test methods. Only a few tests identified a subset of unique genes as differentially expressed, which accounted for a very small fraction. For example, LRT linear identified 12 unique differentially expressed genes in portal fibroblasts, one of the least abundant cell types ( Figure 6B). LRT linear was the best performing test for low cell numbers indicating that the 12 unique differentially expressed genes may in fact be true positives. Consistent with simulations of varying cell numbers (Figure 5A), 24 genes were not identified as differentially expressed by the scBT method for stellate cells which exhibit a dose-dependent decrease in numbers ( Figures 6C, D). Although scBT outperformed other tests in most scenarios, it under performed in this scenario. Nevertheless, when ranking genes by significance level (i.e. P-values), AUCC were high for all pairwise comparisons.
To explore the biological insight gained by using the test methods, gene set enrichment analysis was performed by ranking genes following significance values (adjusted P-value or Bayes factor) on gene sets from BIOCARTA, KEGG, PANTHER and WIKIPATHWAYS. Gene sets were grouped based on their similarity in gene membership into a network for which centrality measures can be estimated. An examination of portal fibroblasts, which exhibited the most disagreement among test methods ( Figure  6B), showed that multiple group test methods, particularly scBT had improved centrality metrics (centrality -number of edges; closeness -steps required to access other nodes; and betweenness -number of paths that go through a node) ( Figure 6E). Visualization of significantly enriched terms identified enriched functions associated with growth factor and immune cell signaling in addition to expected terms such as xenobiotic metabolism and nuclear receptors involved in lipid metabolism ( Figure 6F). Alternatively, WRS which did not find as many connected groups of functions, was largely limited to those identified by scBT except for the hormone signaling and tryptophan metabolism clusters (Supplementary Figure S27). While there is no ground truth from real data, greater agreement between similar gene sets from disparate sources (Supplementary Figure S28-S30) suggests that multiple group tests such as scBT provide more reliable findings. However, all the test methods produce comparable gene set enrichment results as expected since the most robust changes were identified by all the test methods.

DISCUSSION
The goal of this study was to compare the performance of newly developed DGEA test methods for dose-response experiments to existing analysis methods. Using simulated data to generate ground truth, we evaluated the performance of nine differential expression testing methods which were broadly classified as either fit-for-purpose, multiple group, or two group tests. Criteria for test method selection was based on previous benchmarking efforts for two group study designs identifying MAST, limma-trend, WRS, and t-test as the best performers (9,42). ANOVA and KW tests were also included for evaluating multiple group comparisons, and Seurat Bimod, for having the same modelling framework as scBT, LRT multiple and LRT linear tests. The test methods were ranked from best to worse (1-9) based on type I error rate, type II error rate, MCC, AUROC and AUPRC (Figure 7, Supplementary Table S4).
While several scRNAseq tools have been developed (28,(38)(39)(40), none are developed to simulate dose-response models commonly identified in toxicological and pharmacological datasets (29,43). Our SplattDR wrapper for the Splatter package (28) was able to show that simulated data can effectively emulate key experimental scRNAseq data characteristics when simulation parameters were estimated from various Unique Molecular Identifier (UMI)-based datasets. In agreement with a previous report, technical and biological factors, such as cell type, does appear to influence gene dropout rates (18). We primarily focused on 10× Genomics UMI data given the unavailability of real experimental dose-response data generated using other platforms.
Overall, test method performance was consistent with their intended application. For example, fit-for-purpose tests scBT and LRT linear consistently ranked higher followed by multiple groups tests such as KW and LRT multiple. scBT exhibited the best overall performance with excellent FPR control and top ranked MCC while LRT linear struck a balance between type I and type II error rates. The scBT results are not surprising as Bayes factor-based tests have proven to be conservative and consequently more appropriate when false positives are of concern (22,23). In the context of investigating chemical or drug MoAs, false positives have the potential to lead to wasted effort and resources in attempts to validation and support findings (44). Moreover, when assessing a large number of genes, a 5% FP rate (P-value ≥ 0.05) can result in hundreds of FPs that skew MoA classifications (17).
A single test method was not expected to outperform all other tests under all conditions as previously demonstrated when comparing pairwise testing (6,9,42). Therefore, we assessed the strengths and limitations of each test method by varying parameters likely to change within and across various experimental datasets. The number and relative abundance of cell types is known to be affected by disease or treatment, and the distribution of differential expression influenced by the chemical, drug, or food contam-  . Median ranking of differential expression test methods across all simulations. The median rank of each test method was calculated for AUPRC, AUROC, MCC, FNR and FPR. Tests were grouped according to intended application including fit-for-purpose tests developed for the analysis of doseresponse datasets, multiple group tests, and two group tests. The overall rank represents the median value for the five key metrics presented here. inant being evaluated (5,26). scBT consistently ranked at the top under most scenarios, particularly when the mean and standard deviation of the fold-change for differentially expressed genes varied. However, scBT under performed in MCC when the number of cells decrease in a dosedependent manner which would be expected in treatments which alter cell population sizes (e.g. inflammation). Under these circumstances LRT linear outperformed all other tests with scBT performing similar to the other test methods as evident when 24 differentially expressed genes were not identified by scBT within experimental data for stellate cells which experienced a dose-dependent decrease in relative abundance following TCDD treatment. Although excluding genes expressed at low levels generally improved the performance of all test methods, the comparative performance of test methods did not significantly change in most cases. We did not have access to experimental scRNAseq doseresponse data, however, we expect that the scBT would perform equally well as with experimental snRNAseq data as the elevated number of zeroes are common to both types of data. Major differences between these types of data are (i) biases in gene detection and (ii) overall counts (26). Given the higher overall counts in scRNAseq data, test method such as scBT may even perform better.
DGEA provides biological information regarding the effects of exposure to chemicals, drugs, and food contaminants. As expected, gene set enrichment analyses did not dramatically differ in the enriched pathways which are driven by the most robust responses such as xenobiotic metabolism. However, when integrating gene sets from disparate sources we found gene sets that partially overlap in gene membership were consistently identified by multiple group test methods. For example, several gene sets related to growth factors and cell proliferation were identified by scBT but not WRS. Portal fibroblasts are implicated in proliferation of cholangiocytes and the secretion of growth factors during development. Enrichment of these terms suggests a functional role consistent with the induction of bile duct proliferation by TCDD (45,46). In contrast, WRS identified enrichment associated with tryptophan as well as oxytocin/thyrotropin-releasing-hormone pathways which has not been linked to the effects of TCDD on portal fibroblasts. Although ground truth for the complete experimental dataset is not available, the use of test methods such as scBT reduce experimental noise to identify leads warranting further analysis.

CONCLUSION
Collectively, our findings suggest that scBT and LRT linear fit-for-purpose tests are better suited for the differential expression analysis of dose-response studies and when false positives are of greater concern than false negatives. Moreover, consistent with previous benchmarking efforts, we show that common non-parametric tests such as KW outperform test methods developed for scRNAseq data when the study involves comparisons between multiple groups. Ultimately, each test method performs optimally under diverse scenarios. While the importance of controlling type I error rates is acknowledged, a balance must be struck with type II error rates. The tradeoff should be determined based on the individual research question being investigated. It may even be reasonable to apply different test methods to distinct cell types based on dropout rates, cell abundance, and changes in relative cell proportions given the strengths and weaknesses of each test method.

DATA AVAILABILITY
New dose-response single-nuclei RNA sequencing data has been deposited in the Gene Expression Omnibus (GEO) under the accession ID GSE184506. Publicly available singlenuclei and single-cell transcriptomic datasets were obtained from GEO under the accession IDs GSE108313 (PBMC), GSE129516 (hepatic whole cell), and GSE148339 (two dose hepatic single-nuclei). Simulated data was produced using our SplattDR R package can be reproduced using parameters from Table S2.