Enhancing biological signals and detection rates in single-cell RNA-seq experiments with cDNA library equalization

Abstract Considerable effort has been devoted to refining experimental protocols to reduce levels of technical variability and artifacts in single-cell RNA-sequencing data (scRNA-seq). We here present evidence that equalizing the concentration of cDNA libraries prior to pooling, a step not consistently performed in single-cell experiments, improves gene detection rates, enhances biological signals, and reduces technical artifacts in scRNA-seq data. To evaluate the effect of equalization on various protocols, we developed Scaffold, a simulation framework that models each step of an scRNA-seq experiment. Numerical experiments demonstrate that equalization reduces variation in sequencing depth and gene-specific expression variability. We then performed a set of experiments in vitro with and without the equalization step and found that equalization increases the number of genes that are detected in every cell by 17–31%, improves discovery of biologically relevant genes, and reduces nuisance signals associated with cell cycle. Further support is provided in an analysis of publicly available data.


INTRODUCTION
Single-cell RNA-sequencing (scRNA-seq) protocols have evolved rapidly over the last 10 years, with increased throughput and sensitivity allowing for unprecedented insights into cell type heterogeneity across tissues (1). In spite of the advances, substantial technical variability and biases remain, which present challenges in data analysis and can obscure biological signals (2)(3)(4)(5). From mRNA capture, re-verse transcription, and PCR amplification, to additional single-cell library preparation and multiplex sequencing, there are numerous opportunities for technical noise to arise in scRNA-seq experiments. Inefficiencies or biases at any of the steps in the protocol may lead to increased technical artifacts and noise affecting expression variability and increase the proportion of dropout (6,7).
Numerous computational approaches including data smoothing and imputation have been developed to address excess variability and zeros in scRNA-seq data (8,9). However, they do so with the risk of introducing or perpetuating bias (8,10), thus making it preferable to optimize experimental protocols when feasible. A few studies have evaluated the downstream effects of various amplification techniques (11) or reverse transcriptases (12) on scRNAseq data. However, to our knowledge no study has assessed the effect of equalizing cDNA concentrations in single-cell protocols. In bulk RNA-seq experiments, equalization of cDNA concentrations across libraries is a standard procedure that has been shown to reduce sequencing coverage variability and increase transcriptome diversity (13)(14)(15) by providing more even sequencing coverage of all samples. Equalization also leads to decreased sequencing of highly abundant transcripts and increases the efficiency at which low and moderately expressed genes are sequenced in bulk experiments (14).
For single-cell RNA-seq we hypothesized that equalization may improve sensitivity by increasing gene detection and thus began our investigation into the technical artifacts in scRNA-seq data by developing a simulation framework that generates counts by modeling each step of the experimental protocol. Simulation frameworks offer a significant advantage to studying sources of variability compared to experimental approaches as they allow an investigator to quickly assess a large number of scenarios at considerably low cost. While a number of good methods are available for simulating scRNA-seq data (16)(17)(18), most do not model each step in the experimental protocol, and therefore are not useful for assessing how each step of the process affects the final counts. Two frameworks have attempted to study the data generation process but are limited in scope, either relying on spike-ins (19) or combining all sources of variation into a single parameter (20). Scaffold models the scRNAseq data generating process by representing each step of the protocol mathematically, from the initial cell-to-cell heterogeneity to the final sequencing (Methods). Here we mainly focus on the SMART-SEQ (21) protocol as it uses oligo-dT priming and template switching as the backbone chemistry to generate cDNA from single cells which is used in multiple major scRNA-seq platforms, including Fluidigm C1 and 10X Chromium. The simulation framework is implemented in the R package, R/Scaffold and freely available at https://github.com/rhondabacher/scaffold/.
Based on our simulation results, which suggest that equalization is a critical step in the scRNA-seq protocol, we designed a set of scRNA-seq experiments in which we varied the extent at which cDNA libraries were equalized. The experiments demonstrate that equalization results in more consistent detection of genes, reduced expression variability, and reduced variability in the count-depth rate (3), the relationship between a gene's observed expression and sequencing depth. Finally, we confirm the effect of equalization in a survey of publicly available scRNA-seq datasets.

EC and TB cell experiments
We focused on a subset of 96 single cells, from hESC-derived endothelial cells (EC) or trophoblast-like cells (TB) generated using the Fluidigm C1 system. The original data is considered to be unequalized (unEQ), where the single-cell cDNA libraries were first diluted to a range of 0.125-0.375 ng for subsequent library preparation protocols. The unEQ data was published in a previous study (GEO: GSE75748) (22). In the subsequent EQ experiments performed here, including EQ, EQ-Vary and EQ-75%, we retrieved the harvested cDNA, which are amplified full-length single-cell cDNAs identical to those used for the unEQ experiments (Supplementary Figure S1), but further diluted and adjusted so only 0.1 ng of cDNA were used as input across all the cells for subsequent library preparation protocols. In all the experiments, 1.25 l of indicated input cDNA were used in a 5.0 l Tagmentation reaction (Nextera XT DNA Sample Preparation Kit, Illumina) followed with a 12.5 l dual-indexing PCR amplification reaction (Nextera XT DNA Sample Preparation Index Kit, Illumina). In the unEQ, EQ and EQ-75% experiments, 2.0 l of the amplified/tagmented cDNA were used for pooling. In the EQ-Vary experiment, a single scaling factor was applied to generate variable amounts of the pooling volume. These pooled single-cell libraries were used in an AMpure XP Bead-based Dual Bead Cleanup and Size Selection reaction (Agencourt AMPure XP PCR Purification modified Instructions for Use, Beckman Coulter). In both bead cleanup reactions, 90% of AMPure XP beads were added to the amplified single-cell libraries to select for an approximate size range of 150-700 bp and incubated for 15 min at room temperature. Libraries bound to beads were then placed on a magnet for 5 min, washed twice with 70% ethanol, eluted with Suspension Buffer (Nextera XT DNA Sample Preparation Index Kit, Illumina), and transferred to a new tube. Final amplified and pooled single-cell libraries were quantified with the Qubit dsDNA HS Assay Kit (Q32854, Thermofisher) and Bioanalyzer High Sensitivity DNA Analysis Kit (5067-4626, Agilent). The unEQ libraries were multiplexed with 18-20 samples per lane and sequenced on an Illumina HiSeq2500 with single-end 51 bp reads while the EQ, EQ-75% and EQ-Vary were all pooled with 96 samples per lane and sequenced on an Illumina HiSeq3000 with paired-end 65 or 78 bp reads.

Processing and quality control on cells across equalization experiments
Reads were mapped against the GRCh38 Ensembl reference of protein-coding genes via Bowtie 1.2.3 (22), allowing up to two mismatches. The expected counts were estimated via RSEM 1.2.31 (23). To control for any differences due to differing read lengths all reads were first trimmed to have a length of 51 bp. In the initial unEQ experiment, cells that had fewer than 5000 genes with TPM > 1 or that upon inspection of cell images displayed doublets or appeared dead were removed in quality control.
Using the scater v1.18.6 R package (24) we removed cells from any experiments in which the log10 sequencing depth was <5.4 or the percent of counts in the top 50 genes was >31%, the thresholds corresponding to being two standard deviations away from the median (Supplementary Figure S2). The expected counts in all experiments were rounded to the nearest whole number for all subsequent analyses.

Comparison of cell-specific and gene-specific detection rates
The cell-specific detection rate was calculated as the proportion of genes with nonzero expression within each cell. Similarly, the gene-specific detection rate was calculated as the proportion of cells with nonzero expression for each gene. When comparing differences in gene-specific detection rates between the experimental datasets, we accounted for differences in the sequencing depth since more sequencing typically results in more genes detected. For each comparison we subset the cells such that the average difference in sequencing depths was zero.

Analysis of highly variable genes
For the analysis of highly variable genes, gene expression estimates were first normalized using SCnorm v1.6.0 (3). We then fit a mean-dependent trend across each gene's meanvariance relationship. The trend represents technical variability and a gene's biological variability was calculated from the residuals using the functions trendVar and decom-poseVar in the scran v1.12.1 R package (25). The decom-poseVar function tests for nonzero biological variability using an F-test of total variability to technical variability. We considered genes significantly highly variable if they had an PAGE 3 OF 12 Nucleic Acids Research, 2022, Vol. 50, No. 2 e12 FDR < . 10. In order to compare gene variability across datasets, we ranked a gene's relative variability to all other genes in the dataset and calculated the difference in the two ranks.

Estimating the count-depth rate
The gene-specific count-depth rate was estimated within EC and TB separately using a median quantile regression on the log nonzero gene expression versus log sequencing depth using the getSlopes function in the SCnorm v1.6.0 R package. For each condition, we filtered out genes that had fewer than 10 nonzero expression counts across all cells and genes with median nonzero expression less than two. Visualization of the count-depth rate distributions is shown using smoothed density plots of the slopes within gene groups, where genes were split into 10 equally sized groups based on their nonzero median expression. The variability of the count-depth rate is quantified using the median absolute deviation statistic (MAD). First, the mode of the slope distribution was estimated for each gene group, then the MAD was calculated as the median of the absolute differences between the slope modes and one, where one is the expected value of the count-depth rate. All density plots of the slope distribution are done with smoothing parameters adjust = 1, and estimated over the grid (-3, 3) using the density function in R. All analyses were carried out using R version 3.6.3.

Analysis of publicly available datasets
We obtained processed counts from the conquer scRNAseq database (26) for four single-cell RNA-seq datasets pro- For the non-UMI datasets, cells with fewer than 10 000 total counts were removed and counts were rounded to the nearest whole number. For estimating the count-depth rate, again we filtered out genes that had fewer than 10 nonzero expression counts across all cells and genes with median nonzero expression less than two. In Figure 4, the representative datasets displayed from each study are: EF cells from Islam, Earlyblast-Embryo2 in Deng, M11W-Embryo2 in Guo, Unstim-Rep1 in Shalek and TB2 in Chu. The Picelli and H1-Bulk each only had one dataset in the study. The comparison of properties in Table 1 for the equalized versus unequalized datasets in publicly available studies was done using a two-sided t-test.
For the Grün UMI dataset, the isOutlier function in the scater v1.18.6 R package was used to remove cells having to-tal detected genes greater than five median absolute deviations from the median. The isOutlier function was similarly applied to the Smart-seq3 datasets to remove outlier cells based on total counts and total genes detected per cell. For the 10X dataset, we first used the emptyDrops function in the DropletUtils v1.10.3 R package (35) to remove empty droplets containing ambient RNA and kept cells with an FDR <0.01. Cells were further filtered using the isOutlier function using three quality control metrics: total counts, genes detected per cell, and the percent of mitochondrial counts; outliers were considered as those above three median absolute deviations. Count-depth relationships were not estimated for the 10× or the Smart-Seq3 HCA dataset due to the large number of zeros in the data. For estimating the count-depth relationship in the UMI and Smart-seq3 Fibroblast datasets, all genes having a nonzero mean were included (the median in these datasets is often zero).

Simulation Framework
Here we first describe the data-generating process in Scaffold and the following section contains details on the estimation procedures. Let M g, j be the true number of mR-NAs present for gene g in cell j with distribution, M g, j ∼ Poisson(ω j μ g ), where g = 1, . . . , G, j = 1, . . . , N, and μ g is the latent level of gene-specific expression. As not all cells in a population are identical, the parameter ω j is a cell-specific population heterogeneity parameter ω j ∼ Uni f orm(ω .05 , ω .95 ); scaling factors are applied to each cell to represent the range of cellular heterogeneity.
In an scRNA-seq experiment, a cell is first isolated and its mRNA is captured following cell lysis. A reverse transcription step occurs immediately after to convert the mRNA into cDNA. It is currently not possible to naturally estimate these two steps separately. Thus, here we model both of these events together as a single process. The number of molecules successfully captured for genes in cell j is represented as: where λ j is the efficiency of conversion, referred to as the capture efficiency. Following this step, the cDNA molecules are exponentially amplified using PCR. The number of successfully amplified cDNA molecules for gene g in cell j is: where C is the number of amplification cycles and ρ j is the efficiency. When ρ j = 1, all molecules double each cycle. We expect ρ j to vary across reactions and to be independent across cells. For droplet/10X protocols, the capture step occurs for each cell independently and all cDNA is then combined for further library preparation (skipping the cellindependent pre-amplification step). For plate-based methods like Smart-seq, the next steps involve re-plating the cells for further library preparation where cells are still processed independently. At this point, cDNA concentrations are typically quantified in part to ensure that quality is high. An optional next step is to equalize the cDNA concentrations to make them as similar as possible. This is first done by determining an acceptable range of concentrations -one may Table 1. Summary of publicly available datasets. The first column contains the dataset name. Column 2 shows the organism. Column 3 shows the sequencing protocol used. Column 4 shows the number of cells per dataset included in the study. Column 5 is average sequencing depth across all cells. Column 6 is the average cell-specific detection rate across all cells. Column 7 is the average MAD and Column 8 indicates whether cDNA equalization was performed. Datasets above the black line are non-UMI and shown in Figure 4 Dataset Organism and l j is the cDNA concentration for cell j ; q * is the upper limit of the acceptable concentration range. For cells having concentrations within the acceptable range, there is no dilution. In this case, Scaffold sets the dilution factor to 0.95 indicating that, on average, 95% of the cDNA molecules will be retained in the next step (100% is not used as some loss of material may occur in the next step when transferring liquids). To mimic the situation in which all concentrations are diluted to the smallest observed, q * is set to be the concentration of the smallest cell. If a range is chosen, then q* is set to the midpoint between the lowest concentration and the concentration at a user-specified quantile; all concentrations larger than q* are then diluted as described above.
The number of cDNA molecules in cell j after equalizing cDNA concentrations is: Following the protocols for C1 Fluidigm (Smart-seq and Smart-seq2), the cDNA is fragmented into shorter pieces and sequencing adapters and cell-specific indexes are added. We model this similarly to capture efficiency since the failure of any particular cDNA removes it from further consideration in sequencing. This is commonly referred to as 'tagmentation'. We denote the tagmentation efficiency here as γ j . The number of cDNA molecules successfully tagmented for genes in cell j is represented as: Next, the cDNA molecules go through a second round of PCR amplification, where for gene g in cell j the number of amplified molecules is represented as: where C 2 is the number of amplification cycles and ρ 2, j is the efficiency per cell. Finally, the observed gene counts per cell, Y g, j , are obtained by: where π = (π 1,1 , . . . , π G,1 , . . . , π G,1 , . . . , π G,N ) , π g, j = B g, j g j B g, j , and R is the total number of sequences obtained. To simulate data from UMI protocols, the same steps above are followed, with Scaffold tracking the unique molecules throughout the simulation framework. For 10X/droplet based protocols, there are a few differences in the procedure. Specifically, there is no pre-amplification step, the transcripts from all cells are combined immediately following the capture step, and the tagmentation and PCR amplification steps do not having cell-specific parameters since tagmentation and PCR amplification are not cellspecific.

Estimation of simulation parameters
For the simulation framework described above, a number of parameters must be set or estimated. All parameters that are estimated can also be input by the user if desired so that no input dataset is actually needed to simulate data. The number of genes and cells are estimated from the input dataset. Since the expression means observed in sequence data do not necessarily reflect the total transcripts in the cells, Scaffold first scales all cells to have a total of 300 000 counts to estimate the mean for each gene as human cells have been previously reported to have approximately 300k total transcripts (36). This parameter can be changed in Scaffold, if desired. The estimated means are then used in the Poisson distribution to generate the starting number of mRNA per cell. To estimate the cell heterogeneity parameters, ω .05 and ω .95 , Scaffold calculates the ratio of total counts for a random sample of 100 pairs of cells in the observed data, then the 5th and 95th percentile of ratios are used as the lower and upper bound in a Uniform distribution to draw cellspecific scaling factors.

PAGE 5 OF 12
Nucleic Acids Research, 2022, Vol. 50, No. 2 e12 The majority of zeros are thought to occur during the capture step (cell lysis and reverse transcription), thus the capture efficiency has the largest impact on the detection rates. It is not possible, even with spike-ins, to differentiate these two steps; and consequently, default settings in Scaffold treat them as a single step. However, we note that it is possible to simulate these two steps separately. We estimate the capture efficiencies from a Normal(μ λ , σ λ ). The mean capture efficiency, μ λ , is estimated in Scaffold as follows: (i) A weighted probability of observing each gene is calculated as p g = μ g /m, where μ g is the gene mean in the initial simulated mRNA counts and m is the total number of genes. (ii) For a given mean capture efficiency, we calculate P( X g = 0), where X g ∼ Bi nomi al(μ λ * m, p g ). To avoid heavy computation, this is estimated for a random representative subset of 100 genes. (iii) μ λ is chosen as that which minimizes the difference between P(X g > 0) and the average detection rate per cell in the observed data. This search is done using the optimize function in R. For UMI and 10× datasets, we found better estimation accuracy using P(X g > 1), which corresponds to an increased probability of initial counts of one being observed as zeros in the sequenced data. The standard deviation, σ λ , is calculated from the observed data as the median absolute deviation of the cell detection rates.
The default number of cycles for each PCR step is set to the number used by the Smart-seq protocol as detailed in the Fluidigm user manual (first PCR is 18 cycles and the second PCR is 12 cycles), and these can be adjusted by the user. In our testing we did not identify the PCR or tagmentation steps to have a major influence, and in the literature these are typically regarded as highly efficient procedures (37). Thus, the default distribution of the efficiency parameter for these steps is Normal(0.95, 0.02), but can be adjusted by the user as desired. Finally, the total sequencing depth is set to the sum of all the counts in the observed data. Details on the specific parameters for the simulated datasets is given in Supplementary Methods.

Simulating multiple populations
Following the initial generation of mRNA counts, multiple populations can be simulated by specifying additional parameters. The number of cells per population must be provided, and the first population serves as the reference from which each additional population differs by a proportion of genes having distinct expression. The expression differences are sampled from a Normal distribution with a mean and standard deviation of fold-changes given by the user; and the direction of expression differences for a given gene is chosen at random. For the comparison of cluster visualization between unEQ and EQ simulated datasets, we simulated two populations having 50 and 40 cells using Scaffold. The proportion of DE genes simulated was 10% with fold-changes drawn from a Normal(1.5, .5). All other Scaffold parameters were estimated from the unEQ EC dataset. For generating the UMAP (38), TSNE (39) and EDGE (40) embeddings and plots, 250 simulations were conducted. Within each simulation and for each dimension reduction algorithm, the mean silhouette distances were averaged over 25 iterations using different random seeds. UMAP and TSNE plots were obtained using the scater R package v1.18.6 using the first ten principle components, n neighbors = 10 for UMAP, and perplexity = 25 for TSNE. EDGE plots were obtained with the EDGE R package v1.0 with the number of weak learners n wl = 5000, nearest neighbors n neighs = 10, n dm = 10, hash table size H = 1000, and optimization opt = TRUE.

Simulating dynamic populations
To simulate datasets from a continuous or dynamic population, Scaffold simulates gene expression via a B-spline for a user-specified proportion of genes. The default spline generation is degree two, with two knots and coefficients sampled from a Normal(5,5); mRNA counts are then generated from a Poisson distribution with latent mean for each cell equal to the value from the B-spline. The spline generation parameters can be user-specified. For the simulation of trajectory analysis, we used the default settings and all other scaffold parameters were estimated from the unEQ EC dataset. The SCORPIUS R package v1.0.8 was used to infer the trajectory using default settings (41). A two-degree polynomial was used to identify genes having a significant dynamic along pseudotime. The pROC R package v1.17.0.1 was used to estimate the area under the receiver operator curve (AUC) (42).

In silico investigation of cDNA equalization using Scaffold
As detailed in Methods and Figure 1A, Scaffold allows for assessment of how each step of the single-cell protocol (cell lysis, amplification, equalization, library preparation, and sequencing depth) affects scRNA-seq measurements. Using an scRNA-seq dataset of unequalized endothelial cells (unEQ EC) as a reference, Scaffold estimated starting parameters and simulated data that reproduced the features of the unEQ EC dataset including gene-specific means, variances, and proportions of zeros ( Figure 1B-E). We also simulated data using unequalized trophoblast cells (unEQ TB) as a reference with similar results (Supplementary Figure  S3). Systematic variability in the count-depth rate, a feature shown to be unique to scRNA-seq data (3), was also reproduced ( Figure 1F and Supplementary Figure S4).
Holding all other parameters constant, we simulated data while varying parameters for equalization and sequencing depth and found that cDNA equalization has the largest effect on the average variability in the count-depth rate (Supplementary Figure S4C, D), while the total sequencing depth (Supplementary Figure S4E) had little effect.
To examine the effect of equalization on other properties of the data, we simulated additional datasets with and without equalization holding all other parameters constant. Specifically, we simulated pairs of unequalized and equalized datasets by adjusting only the equalization parameter. In simulated datasets, gene-specific variation decreased by an average of 16.2% due to equalization alone and the variability in the sequencing depths was reduced by 74.2% despite the simulations having the same average depth (Figure 1G).

Experiments to assess the effect of cDNA equalization
Given results from the simulation study, we hypothesized that a lack of equalization during the preparation of singlecell libraries would increase variation in the amount of input cDNA which in turn could contribute to reduced gene detection and increased variability in expression estimates observed in scRNA-seq data. To test this hypothesis, we applied alternative protocols to full-length single-cell cDNA libraries of identical cells to generate matched scRNA-seq data sets (Figure 2A). The original data includes single endothelial cells (EC) and trophoblast-like cells (TB) derived from human embryonic stem cells (hESC) (31) which were unequalized (unEQ). For these experiments, the cDNA input ranged from 0.125 to 0.375 ng (Materials and Methods). In the next series of experiments, we equalized the same set of single-cell cDNA to a fixed input (0.1 ng) across all the cells. Prior to sequencing, cells were pooled at an equal volume (EQ) or pooled by a scaling factor to produce highly variable sequencing depths (EQ-Vary) (Figure 2A). Finally, we replicated the entire EQ experiment, including equalized cDNA input and pooling, but we sequenced at approximately three-quarters the depth of the previous experiments (EQ-75%). Because these four conditions all derive from identical cells, these experiments provide the most robust investigation to date on how input cDNA variations impact scRNA-seq data.

Equalization increases cell-specific and gene-specific detection rates
A common challenge in scRNA-seq experiments is the high proportions of zeros, or dropouts. Dropouts are due to an incomplete sampling process, stochastic gene expression, and inefficient capture of mRNA, with the probability of dropping out inversely related to a gene's underlying expression level (43). Equalizing cDNA libraries would not recover dropouts that occur upstream in a protocol, but it may recover dropouts that are due to inefficiencies in later preparation steps (e.g. second PCR amplification) or due to underrepresentation in the pooled library. Thus, we first investigated the effect of cDNA equalization on cell-specific detection rates, defined as the proportion of nonzero genes within a cell. Across both EC and TB cells, we observed an increase in the efficiency of gene detection in the equalized experiments ( Figure 2B). An average of 745 (8.6%) more genes per cell were detected with expression greater than zero in the EQ versus the unEQ experiments. EQ-vary, which was pooled in a way to reflect possible inefficiencies that might occur after equalization such as during pooling or amplification, reduced the detection efficiency slightly to 534 (6.2%) more genes detected on average. Comparatively, the effect of equalization on gene detection is stronger than the effect of solely increasing total sequencing depth. Between EQ and EQ-75%, in which both experiments were equalized but the latter had three-quarters the sequencing depth, we observed only 470 (5.0%) fewer genes detected per cell in EQ-75%. We next investigated the gene-level detection rate across experiments, defined as the proportion of cells with nonzero expression for each gene ( Figure 3A, B). Here we calculated the difference in gene-level detection rates between EQ and unEQ while accounting for differences in sequencing depth (Methods). The overall increase in detection efficiency due to equalization translates to a 31.1% increase in genes having consistent detection in all EC cells and a 17.9% increase in TB cells (1002 and 622 genes, respectively). We also observed a 10.4% decrease in the number of genes not detected in any cells for EC and an 8.1% decrease in TB (382 and 276 genes, respectively). Since a gene's detection rate is related to its expression level, we further analyzed detection differences by splitting genes into four equally sized gene groups based on their nonzero median expression. We assessed what differences would appear due to random chance by randomly splitting the EC or TB cells in the unEQ dataset into two groups and examining the detection rate differences between them. Approximately equal proportions of genes had increased/decreased detection rates across all expression groups for both experimental conditions (Supplementary Figure S5).
Between the EQ data and unEQ datasets, we consistently saw a higher proportion of genes having a higher detection rate in the equalized dataset especially among the moderately expressed genes (62% and 64% for EC gene groups 2 and 3; 56% and 59% for TB gene groups 2 and 3) ( Figure  3A&B). The average increase in detection rate in the equalized experiments for the genes in Groups 2-4 is 13.6% in EC2 and 7.9% for TB2. In comparison, we performed the same analysis between the EQ and EQ-Vary datasets which underwent the same equalization procedure and found the ratio of genes with increasing versus decreasing detection rate was stable across expression groups; the increased variability in sequencing depth did not compromise the detection rate in the equalized dataset (Supplementary Figure  S6).
To identify any functional relevance of genes with increased or decreased detection rates in the EQ experiment we performed gene-set enrichment using MSigDB's list of GO biological processes on the top 200 genes sorted by their magnitude change in detection. Genes with increased detection rate in the EQ experiment were enriched for important developmental processes including morphogenesis, and tube and epithelium development in both EC and TB (Supplementary Table S1). Genes with decreased detection rates after equalization tended to be among the most lowly expressed genes. Of the 200 genes with the most decreased detection, 142 were in the lowest expression group in EC and 162 such genes in TB. Taken together, these results suggest that equalization improves the detection of biologically relevant genes without compromising signal.

Equalization reduces nuisance variation
Next, we investigated the effect of equalization on gene expression variability. A common first step in single-cell clustering or trajectory inference analysis is to reduce the data to the most informative set of genes, often defined as the most highly variable genes (HVG). However, in the presence of excess nuisance variation, the top ranked HVG may not reflect the most relevant set of genes. Here, we detected HVG by decomposing the total variance of each gene into technical and biological components. To do so, we estimated a mean-dependent trend for the mean-variance relationship across all genes to represent technical variability (Methods). A gene's biological variability was calculated as the differ- ence between a gene's total variability and its fitted trend value. An HVG classification was assigned to genes having biological variability significantly larger than zero (FDR < 0.10). HVG genes in the unequalized experiment were enriched in GO biological processes involving the cell cycle. This is likely due to the fact that cellular mRNA content is directly related to cell cycle stage and, consequently, if cDNA content is not equalized across cells, variability in cell cycle genes is prominent in the resulting data. Following equalization, genes classified as HVG were enriched for biological processes specific to EC cells including gastrulation and cell fate/differentiation ( Figure 3C and Supplementary  Table S2).

Equalization reduces technical artifacts in the count-depth rate
Previously, we reported that scRNA-seq data display systematic variation in the relationship between a gene's observed expression and sequencing depth (which we termed the count-depth rate), whereby a gene's expected increase in expression with increased sequencing depth fails to materialize (3). Variability in the count-depth rate affects downstream analysis as popular scale-factor based normalization methods assume that the count-depth rate is common across genes and equal to one on the log-log scale (3,25). As shown in Bacher et al., much of the variability in the count-depth rate arises from under-detection of genes despite increasing sequencing depth since highly expressed genes are over-represented during sequencing. Since equalizing cDNA increases detection rates, we hypothesized that it may also reduce variability in the count-depth rate. To investigate, we quantified the count-depth rate for every gene using median quantile regression, where a slope of one indicates a proportional increase of gene expression with sequencing depth (Supplementary Figure S7). Next, we binned genes into ten equally sized groups based on their median nonzero expression. In the unEQ dataset, we found only highly expressed genes had slopes near one and slopes gradually decreased with gene expression level (Figure 4A). The extent of variability in the count-depth rate was measured using the MAD of the ten groups slope mode from their expected value of one. The EQ experiments had a lower MAD and displayed less variability in the countdepth rates for both EC and TB ( Figure 4A, B). EQ-75% was similar to the EQ datasets, indicating the count-depth rate is not affected by total sequencing depth. The EQ-Vary experiment had the most reduction in count-depth variability, with the majority of slopes close to 1 (Supplemental Figure S8), due to its increased dissociation of cell size with sequencing depth.
As more single-cell datasets have become public and identically processed in databases such as conquer (26), we were able to inquire whether systematic variability in the count-depth rate was reduced across scRNA-seq data in published studies. Across seven different studies, we found large heterogeneity in the experiment-specific count-depth rates with the MAD ranging from 0.045 to 1.176 ( Figure  4C, D). We found no revealing association between the average MAD within study and various properties of the scRNA-seq data, including the average sequencing depth, cell-specific detection rate, organism, or number of cells (Table 1). However, consistent with our simulated and experimental datasets, the publicly available studies in which equalization was performed had significantly lower MAD values (P-value < 0.001), higher cell-specific detection rates (P-value < 0.001), and higher gene-specific detection rates (P-value = 0.039) (Supplementary Figure S9). On average the equalized datasets contain 2215 additional genes detected consistently in every cell compared to the unequalized datasets (P-value < 0.001 and Supplementary Figure  S9).

Equalization improves downstream analyses
To further examine how equalization might affect common downstream analyses, we simulated data for two scenarios -clustering analysis and trajectory analysis. For clustering, we used Scaffold to simulate datasets from multiple populations (Supplementary Figure S10). Here we consider two cell type populations with slight separation--only 10% of genes have distinct expression with an average fold change of 1.5. We simulated pairs of unequalized and equalized datasets and evaluated two-dimensional embeddings of cells using the silhouette distance. On average, equalization had a higher median silhouette distance and improved visible separation of cell populations in UMAP (38) plots ( Figure 5). TSNE (39) and EDGE (40) reduced dimension embeddings showed similar trends (Supplementary Figure  S11). We also used Scaffold to simulate cells coming from a continuous population, in which we assumed a proportion of genes have dynamic expression across the cells (Supplementary Figure S12). We simulated pairs of unequalized and equalized datasets and inferred a trajectory on the sim-ulated cells (41). We then fit a polynomial regression of each gene's expression to the trajectory to determine the significantly dynamic genes (adjusted P-value < 0.05). Equalization had a slight improvement in the ability to detect dynamically expressed genes, with an AUC of 83.0 versus 81.7 for the unequalized simulations.

Using Scaffold to simulate data from UMI and 10× protocols
Although equalization cannot be applied to 10× protocols, or most UMI protocols, due to the vast number of cells these protocols produce, other aspects of the data generation process can be systematically explored. We applied Scaffold to a 10X dataset and three additional UMI datasets and observed that the simulated data was highly representative of cell-and gene-specific properties of the data (Supplementary Figures S13-S18).

DISCUSSION
Obtaining the highest quality data with minimal technical variability remains a goal for scRNA-seq experiments. Given the competitive nature of the sequencing process, highly expressed transcripts are often overrepresented in the final library and will consume a large proportion of the total reads leading to low detection rates for the majority of genes. Here, we showed that equalizing single-cell cDNA libraries prior to pooling improves detection rates and PAGE 11 OF 12 Nucleic Acids Research, 2022, Vol. 50, No. 2 e12 decreases nuisance variation such as that attributable to cell cycle.
Our finding of reduced variability in expression for cell cycle genes in equalized experiments is novel, yet not unexpected since cell cycle signals are often the largest drivers of differences in total mRNA. Note that if cell cycle signals are of marked interest, then equalization may not be appropriate. However, reduction of cell-cycle signals has been implemented in most scRNA-seq analysis pipelines as it is considered a hindrance in most downstream analyses (45,46). While different cell types often have different cell sizes, they are also distinguished by relative differences in key marker genes. Equalization preserves these relative differences as the dilution is performed on the entire cell's cDNA and thus, would not remove cell-type specific differences.
In many cases, identified sources of technical variability in downstream analyses have proven to be excellent targets for protocol improvement (47)(48)(49)(50). Scaffold, our simulation framework, offers an opportunity to directly and efficiently explore how different steps in a protocol affect scRNA-seq data. Here, we focused the effect of equalizing cDNA concentration across cells. However, Scaffold provides a framework to study other parameters, or to simulate data that recapitulates characteristics of scRNA-seq data (e.g. detection rates and count-depth rate).
In practice, the process of equalizing cDNA concentrations is non-trivial and time-consuming, leading it to be one of the critical limiting points of the library preparation process (51). Automation has alleviated this to some extent, and has been used in large single-cell sequencing projects such as the Tabula Muris (52). However, some state-of-the-art protocols, such as 10×, profile scRNA-seq measurements from thousands to millions of cells using massively parallel sequencing systems with high levels of multiplexing (51) and equalization is not possible since cDNA is pooled early in the experiment. We expect that single-cell protocols will continue to advance and improve with technology. Our study offers insight into one mechanism worth further exploration in protocol design and development.

DATA AVAILABILITY
All R code used for analysis and simulations is available at https://github.com/rhondabacher/scEqualization-Paper. The simulation package Scaffold is available at https:// github.com/rhondabacher/scaffold. The unEQ, EQ, EQ-Vary, and EQ-75% datasets are available at the NCBI Gene Expression Omnibus: GSE156494.
For the publicly available datasets, we obtained processed counts from the conquer scRNA-seq database for four single-cell RNA-seq datasets processed identically: Deng et al.