simCAS: an embedding-based method for simulating single-cell chromatin accessibility sequencing data

Abstract

Motivation: Single-cell chromatin accessibility sequencing (scCAS) technology provides an epigenomic perspective to characterize gene regulatory mechanisms at single-cell resolution. With an increasing number of computational methods proposed for analyzing scCAS data, a powerful simulation framework is desirable for the evaluation and validation of these methods. However, existing simulators generate synthetic data by sampling reads from real data or mimicking existing cell states, which is inadequate to provide credible ground-truth labels for method evaluation.

Results: We present simCAS, an embedding-based simulator, for generating high-fidelity scCAS data from both cell- and peak-wise embeddings. We demonstrate that simCAS outperforms existing simulators in resembling real data and show that simCAS can generate cells of different states with user-defined cell populations and differentiation trajectories. Additionally, simCAS can simulate data from different batches and encode user-specified interactions of chromatin regions in the synthetic data, providing ground-truth labels beyond cell states. We systematically demonstrate that simCAS facilitates the benchmarking of four core tasks in downstream analysis: cell clustering, trajectory inference, data integration, and cis-regulatory interaction inference. We anticipate simCAS will be a reliable and flexible simulator for evaluating ongoing computational methods applied to scCAS data.

Availability and implementation: simCAS is freely available at https://github.com/Chen-Li-17/simCAS.


Supplementary Notes

Supplementary Note 1: Adapted simCAS framework with a Bernoulli assumption
Adapted simCAS with a Bernoulli assumption (referred to as simCAS_Bernoulli) differs from the original simCAS (referred to as simCAS_Poisson) in the following steps: estimation of the distributions of statistics, parameter matrix correction, and synthetic peak-by-cell matrix generation.
In the step of estimation of the distributions of statistics, simCAS_Bernoulli only estimates the distributions of library size (the number of aligned reads per cell) and peak summation (the sum of aligned reads per peak), but not the cell non-zero proportion (the proportion of non-zero values per cell), which is redundant with the library size under a Bernoulli assumption. The estimation of library size and peak summation in simCAS_Bernoulli is consistent with that in simCAS_Poisson. The estimated distributions of the log-transformed library size and peak summation are denoted as Ω_lib and Ω_peak, as in simCAS_Poisson.
In the step of parameter matrix correction, simCAS_Bernoulli performs the corrections with Ω_lib and Ω_peak, the estimated distributions of the library size and the peak summation. For simplicity, we reuse the symbols of simCAS_Poisson. Suppose n, m and n′ denote the number of simulated cells, the number of simulated peaks and the number of reference cells in the real data, respectively. After the low-dimensional embedding generation and activation transformation as in simCAS_Poisson, we obtain the activated parameter matrix P̃ ∈ ℝ^(m×n). The library size correction is performed as follows: (1) Multiply the CEV and the CEM to obtain l̂ ∈ ℝ^(1×n). Each value of l̂ represents the weight of the potential library size of the corresponding column vector (representing a cell) of P̃.
(2) Randomly sample n values from Ω_lib to form the synthetic library size set L ∈ ℝ^n. Then assign the values of L to the column vectors of P̃ in the same order as the potential library size weights. Denote the value of L assigned to the i-th cell of P̃ as l_i.
(3) For each column of P̃, perform a uniform correction. Sort the i-th column vector p̃_{·,i} of P̃ in ascending order to obtain p̂_{·,i}. Then search from the largest value to the smallest value of p̂_{·,i} for an index k and a factor α that satisfy

α · Σ_{j=1}^{k} p̂_{j,i} + (m − k) = l_i,  with  α · p̂_{k,i} ≤ 1 < α · p̂_{k+1,i},

where p̂_{j,i} is the j-th value of p̂_{·,i}.
(4) Multiply p̃_{·,i} by α and set the values exceeding 1 to 1 to obtain the corrected parameters.
(5) Conduct the uniform correction for each column of P̃ to acquire the library-size-corrected parameter matrix P̌.
The peak mean correction is conducted with a similar operation: (1) Randomly sample m values from Ω_peak to form the synthetic peak summation set S ∈ ℝ^m. Then assign the values of S to the row vectors of P̌ (each row representing a peak) in the same order as the row-wise summations of P̌. Denote the value of S assigned to the j-th peak of P̌ as s_j.
(2) For each row of P̌, perform a uniform correction. Sort the j-th row vector p̌_{j,·} of P̌ in ascending order. Then search from the largest value to the smallest value of the sorted row for an index k and a factor α that satisfy

α · Σ_{i=1}^{k} p̌_{j,i} + (n − k) = s_j,  with  α · p̌_{j,k} ≤ 1 < α · p̌_{j,k+1},

where p̌_{j,i} is the i-th value of the sorted row.
(3) Multiply p̌_{j,·} by α and set the values exceeding 1 to 1 to obtain the corrected parameters.
(4) Conduct the uniform correction for each row of P̌ to acquire the peak-mean-corrected parameter matrix P.
In the step of synthetic peak-by-cell matrix generation, we replace the Poisson distribution with the Bernoulli distribution, whose probability parameter is the corresponding entry of the corrected parameter matrix P.
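The final generation step can be sketched as follows. This is a minimal illustration, not the simCAS implementation: the matrix `P` here is a random stand-in for the corrected parameter matrix produced by the steps above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the corrected peak-by-cell parameter matrix (m peaks x n cells);
# in simCAS this comes from the embedding and correction steps described above.
P = rng.uniform(0.0, 0.3, size=(1000, 50))

# simCAS_Poisson draws counts from Poisson(P); simCAS_Bernoulli draws binary
# accessibility from Bernoulli(P), so the synthetic matrix is 0/1-valued.
X_pois = rng.poisson(P)
X_bern = rng.binomial(1, P)
```

The Bernoulli variant guarantees a binary matrix, whereas the Poisson variant can produce counts greater than one for the same parameters.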

Supplementary Note 2: Distributions for modeling peak summation
For simplicity of notation, we use X to denote a generic random variable in the following discussion. With specific modeling of the counts of zero and one, the probability mass function of the Log-variant distribution is

f(x) = π_0 δ_0(x) + π_1 δ_1(x) + (1 − π_0 − π_1) · (−p^x) / (x ln(1 − p)),  x = 1, 2, …,

where δ_0(·) and δ_1(·) indicate the point masses at zero and one, respectively, and π_0 and π_1 are the corresponding probabilities. p is a parameter ranging from 0 to 1 in the Logarithmic distribution.
We also provide five alternative discrete distributions for modeling peak summation, namely the Log-variantB (another variant of the Logarithmic distribution, different from Log-variant), NB (Negative Binomial), ZINB (Zero-Inflated Negative Binomial), NB-variant (a variant of the Negative Binomial distribution) and ZIP (Zero-Inflated Poisson) distributions.
The probability mass function (PMF) of the Log-variantB distribution is

f(x) = π_0 δ_0(x) + (1 − π_0) · (−p^x) / (x ln(1 − p)),  x = 1, 2, …,

where δ_0(·) indicates the point mass at zero, and π_0 is the corresponding probability. p is a parameter ranging from 0 to 1 in the Logarithmic distribution.
The PMF of the NB distribution is

f(x) = C(x + r − 1, x) · p^r (1 − p)^x,  x = 0, 1, 2, …,

where r denotes the number of successes, and p denotes the probability of success on each trial.
The PMF of the ZINB distribution is

f(x) = π_0 δ_0(x) + (1 − π_0) · C(x + r − 1, x) · p^r (1 − p)^x.

The NB-variant distribution is a modified form of the NB distribution. The PMF of the ZIP distribution is

f(x) = π_0 δ_0(x) + (1 − π_0) · e^(−λ) λ^x / x!,

where λ is the rate parameter of the Poisson component.
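The alternative distributions above can be sampled with standard library routines. The sketch below is illustrative only; the parameter values (`p`, `pi0`, `lam`, `r`, `q`) are arbitrary, not fitted values from simCAS.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10_000

# Logarithmic (series) distribution: scipy calls it logser; support is 1, 2, ...
p = 0.8
log_samples = stats.logser.rvs(p, size=n, random_state=rng)

# Zero-inflated Poisson: with probability pi0 emit 0, otherwise Poisson(lam).
pi0, lam = 0.4, 3.0
zip_samples = np.where(rng.random(n) < pi0, 0, rng.poisson(lam, size=n))

# Zero-inflated NB: with probability pi0 emit 0, otherwise NB(r, q),
# where q is the per-trial success probability.
r, q = 2, 0.3
zinb_samples = np.where(rng.random(n) < pi0, 0,
                        stats.nbinom.rvs(r, q, size=n, random_state=rng))
```

The zero-inflated variants are simple two-component mixtures, which is why their PMFs above are written as a point mass at zero plus a scaled base distribution.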

Supplementary Note 3: Two operations for parameter matrix correction
First, we developed two activation functions adapted to different modes. For the discrete or continuous mode, we provide a piecewise function f1(·), which combines an exponential function and a linear function, as the activation function, where m1 is a parameter that controls the variance of the simulated data; a higher m1 brings higher variance of peak accessibility. In this study m1 is fixed to 2. For the pseudo-cell-type mode, we expect the diversity within a cell type to be smaller than that between different cell states, so a Sigmoid-format function f2(·) with a smaller slope is provided as the activation function, where m2 is a parameter that adjusts the steepness of the activation curve and is fixed to 2 in this study. After this operation, the parameter matrix P is transformed into the activated parameter matrix P̃.
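A sketch of two such activations is given below. The exact functional forms used by simCAS did not survive extraction here, so both definitions are assumptions: `f1` is exponential below the threshold `m1` and continues linearly (with matching slope) above it, and `f2` is a sigmoid whose steepness is scaled by `m2`.

```python
import numpy as np

def f1(x, m1=2.0):
    """Assumed piecewise activation: exponential below m1, linear above,
    continuous with matching slope at the junction x = m1."""
    x = np.asarray(x, dtype=float)
    return np.where(x < m1, np.exp(x), np.exp(m1) * (1.0 + (x - m1)))

def f2(x, m2=2.0):
    """Assumed sigmoid-format activation; larger m2 gives a gentler slope,
    compressing within-cell-type diversity."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float) / m2))
```

Both functions map arbitrary real-valued embeddings to non-negative (for f1) or bounded (for f2) parameters, which is the property the correction steps below rely on.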
Second, simCAS conducts corrections with Ω_lib, Ω_peak and Ω_spar (the estimated distributions of library size, peak summation and cell sparsity) in turn, so that the simulated data preserve the cell-wise and peak-wise properties of the real data. simCAS performs the library size correction as follows: (1) Multiply the CEV and the CEM to obtain l̂ ∈ ℝ^(1×n).
(2) Randomly sample n values from Ω_lib to form the synthetic library size set L ∈ ℝ^n.
(3) Sort the elements of l̂ and the values in L to obtain the sorting indices.
(4) Obtain the vector l̂′ ∈ ℝ^(1×n) of the library sizes of the synthetic cells by replacing the elements of l̂ with the sampled values from L one by one according to the sorting indices.
(5) For each column vector (representing a cell) of P̃, divide each element by the sum of the vector, multiply all elements by the corresponding element (representing the associated cell) of l̂′, and obtain the corrected parameter matrix P̌.
After the correction of library size, simCAS obtains the corrected parameter matrix P̌. simCAS then performs the peak summation correction as follows: (1) Randomly sample m values from Ω_peak to form the synthetic peak summation set S ∈ ℝ^m.
(2) Sum the elements of each row vector of P̌ to obtain s̃ ∈ ℝ^(m×1).
(3) Sort the elements of s̃ and the values in S to obtain the sorting indices.
(4) Obtain the vector s̃′ ∈ ℝ^(m×1) of the peak summations of the synthetic data by replacing the elements of s̃ with the sampled values from S one by one according to the sorting indices.
(5) For each row vector (representing a peak) of P̌, divide each element by the sum of the vector, multiply all elements by the corresponding element (representing the associated peak) of s̃′, and obtain the corrected parameter matrix Ṕ.
After the correction of peak summation, simCAS obtains the corrected parameter matrix Ṕ. simCAS finally performs the cell sparsity correction as follows: (1) Randomly sample n values from Ω_spar to form the synthetic cell sparsity set Σ ∈ ℝ^n.
(2) For the i-th synthetic cell, randomly select a value σ_i from Σ without replacement as the sparsity of the cell.
(3) For the i-th synthetic cell, obtain a_i and q_i by solving the two nonlinear equations

q_i + (1 − q_i) · (1/m) Σ_{j=1}^{m} exp(−a_i ṕ_{j,i}) = σ_i,  a_i (1 − q_i) = 1,

where a_i and q_i denote the scaling factor and the zero-setting probability of the i-th synthetic cell, respectively, and ṕ_{j,i} is the element of Ṕ corresponding to the j-th peak and the i-th cell. The first equation matches the expected sparsity under the Poisson model, and the second preserves the expected library size.
(4) For the i-th synthetic cell, randomly set elements of the corresponding column vector to zero with probability q_i, and then multiply the remaining non-zero elements by a_i.
(5) Perform steps (2)-(4) for each cell to obtain the final parameter matrix P.
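Because the second equation pins a_i to 1/(1 − q_i), the pair reduces to a single root-finding problem in q_i. A sketch for one cell, assuming the equation forms given above (the column of Poisson rates `p` and the target sparsity `s_target` are stand-ins):

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
p = rng.gamma(2.0, 0.2, size=2000)  # one cell's corrected Poisson rates (stand-in)
s_target = 0.9                      # sampled target sparsity for this cell (stand-in)

# Assumed equations: q + (1 - q) * mean(exp(-a * p)) = s_target  (sparsity)
#                    a * (1 - q) = 1                             (library size)
# Substituting a = 1 / (1 - q) leaves one equation in q.
def gap(q):
    return q + (1 - q) * np.exp(-p / (1 - q)).mean() - s_target

q = brentq(gap, 0.0, 1.0 - 1e-9)  # zero-setting probability
a = 1.0 / (1.0 - q)               # scaling factor

# Step (4): zero out entries with probability q, scale the survivors by a.
mask = rng.random(p.size) < q
col = np.where(mask, 0.0, a * p)
```

A root exists whenever the target sparsity exceeds the sparsity the uncorrected rates already imply, since the left-hand side increases to 1 as q approaches 1.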

Supplementary Note 4: Details of the optional steps in simCAS
simCAS can generate multi-batch data by incorporating batch effects in the discrete simulation mode. To simulate data with a technical batch effect, we add Gaussian noise to the corrected parameter matrix P: P′ = P + ε_1, where ε_1 is Gaussian noise sampled from N(μ_1, σ_1²). ε_1 can be regarded as technical variation introduced when sequencing reads and results in an evident variation of library size values; μ_1 and σ_1 control the variance between batches and the degree of technical noise, respectively. Since P′ and P are generated from a common CEM, the ground truth of the same cells across different batches is guaranteed. With different levels of Gaussian noise added to P, data with multiple technical batch effects can be generated. Note that because the technical variance may differ between cell types with different library sizes, we add Gaussian noise proportional to the average cell-wise summation of P within each cell population.
The biological batch effect can be directly modeled by adding Gaussian noise to the PEM: R′ = R + ε_2, where ε_2 is Gaussian noise sampled from N(μ_2, σ_2²). Synthetic data of different batches are then generated with a common CEM, which provides the ground truth of cells belonging to the same population.
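Both batch-effect mechanisms can be sketched in a few lines. Everything below is illustrative: the matrices, the population labels, and the noise scales (`mu_t`, `sigma_t`, the population-proportional scaling) are stand-ins, not simCAS defaults.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 1000, 100, 30
P = rng.gamma(2.0, 0.5, size=(m, n))     # corrected parameter matrix (stand-in)
pop = rng.integers(0, 3, size=n)         # cell population labels (stand-in)

# Technical batch effect: Gaussian noise added to P, scaled per population by
# the average cell-wise summation so cell types with larger library sizes
# receive proportionally larger noise.
mu_t, sigma_t = 0.0, 0.05
pop_scale = np.array([P[:, pop == k].sum(axis=0).mean() for k in range(3)])[pop]
P_tech = P + rng.normal(mu_t, sigma_t, size=(m, n)) * pop_scale / m

# Biological batch effect: Gaussian noise added to the PEM instead, while the
# CEM is shared across batches to keep cell-population ground truth.
PEM = rng.normal(size=(m, d))
PEM_bio = PEM + rng.normal(0.0, 0.1, size=(m, d))
```

Repeating either perturbation with independent noise draws yields as many synthetic batches as desired, all sharing the same underlying cells.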
The interactive peaks are constructed by assigning similar vectors in the PEM generation step. First, we pick regions of high chromatin accessibility to define hubs of interactive peaks, and correlation is then added to the peaks of each hub. Within a defined peak hub, some peaks are randomly selected as interactive peaks and the remaining peaks are non-interactive. For the interactive peaks in a hub, we first generate a general peak embedding vector r̃ ∈ ℝ^(1×d); the embedding vector of each interactive peak is then generated as r = r̃ + ε_3, where ε_3 is sampled from N(0, σ_3²) and σ_3 is a parameter controlling the degree of co-accessibility among the interactive peaks.
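The hub construction can be sketched as follows (hub size, embedding dimension, and `sigma3` are illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 30        # embedding dimension
n_inter = 20  # number of interactive peaks in one hub

# Shared hub embedding plus small per-peak Gaussian perturbations: a smaller
# sigma3 makes the rows more similar, hence stronger co-accessibility.
r_hub = rng.normal(size=(1, d))
sigma3 = 0.1
R = r_hub + rng.normal(0.0, sigma3, size=(n_inter, d))

# Pairwise correlation of the peak embedding vectors within the hub.
corr = np.corrcoef(R)
```

Because every interactive peak is a noisy copy of the same hub vector, the parameter rows derived from these embeddings covary across cells, which is the ground-truth signal that cis-regulatory interaction inference methods are benchmarked against.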

Supplementary Note 5: Details for baseline methods implementations, downstream analysis methods for benchmarking, and the procedure for data visualization
For simATAC simulation, we used the simATAC (Navidi, et al., 2021) R package to directly simulate the peak-by-cell matrix by regarding the peaks as bins in the input matrix. Since the scMultisim R package lacks an interface for simulating data that resembles real cell types, we implemented scMultisim with the same settings (Li, et al., 2022) as our framework.
For benchmarking clustering methods, we tested Leiden clustering, k-means clustering and hierarchical clustering (HC) methods (Chen, et al., 2019) using the simulated data generated by simCAS in the discrete mode. For benchmarking trajectory inference methods, we tested Monocle3 (Cao, et al., 2019) with different parameters. For benchmarking methods of data integration and cis-regulatory interaction inference, we tested Harmony (Korsunsky, et al., 2019) and Cicero (Pliner, et al., 2018) on synthetic data with multiple batches and cis-regulatory interactions, respectively.
For data visualization, we first select 50,000 highly accessible peaks, then perform term frequency-inverse document frequency (TF-IDF) and principal component analysis (PCA) transformations to reduce the dimensions to 50, and finally apply uniform manifold approximation and projection (UMAP) to project the cells into a 2-dimensional space. Unless otherwise stated, we perform the above pipeline for visualization in this study.
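The pipeline can be sketched as below. This is a simplified stand-in (random binary matrix, fewer peaks than the 50,000 used in the study, and TruncatedSVD in place of dense PCA, a common choice for sparse scATAC matrices); the final UMAP projection would be done with the umap-learn package and is omitted here.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.05, size=(200, 5000)).astype(float)  # cells x peaks (stand-in)

# Keep the most accessible peaks (50,000 in the study; fewer in this sketch).
top = np.argsort(X.sum(axis=0))[::-1][:2000]
X = X[:, top]

# TF-IDF: term frequency per cell times inverse document frequency per peak.
tf = X / X.sum(axis=1, keepdims=True)
idf = np.log(1 + X.shape[0] / (1 + X.sum(axis=0)))
tfidf = tf * idf

# Reduce to 50 dimensions; umap-learn would then project Z to 2D for plotting.
Z = TruncatedSVD(n_components=50, random_state=0).fit_transform(tfidf)
```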

Supplementary Note 6: Evaluation metrics
Suppose x is the vector of sorted real statistic values and y is the vector of sorted simulated statistic values. MAD, MAE and RMSE are calculated as

MAD = median_i |x_i − y_i|,  MAE = (1/n) Σ_i |x_i − y_i|,  RMSE = √((1/n) Σ_i (x_i − y_i)²).

PCC calculates the linear correlation between x and y: PCC = cov(x, y) / (σ_x σ_y), where cov(·,·) is the covariance and σ is the standard deviation.
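These four statistic-comparison metrics are straightforward to compute; the helper below is a hypothetical convenience wrapper, not part of the simCAS package.

```python
import numpy as np

def distribution_metrics(real, sim):
    """Compare sorted real vs. sorted simulated statistic values
    (hypothetical helper; MAD here is the median absolute difference)."""
    x = np.sort(np.asarray(real, dtype=float))
    y = np.sort(np.asarray(sim, dtype=float))
    diff = np.abs(x - y)
    return {
        "MAD": float(np.median(diff)),
        "MAE": float(diff.mean()),
        "RMSE": float(np.sqrt(((x - y) ** 2).mean())),
        "PCC": float(np.corrcoef(x, y)[0, 1]),
    }
```

Sorting both vectors first means the metrics compare the marginal distributions of the statistic rather than any per-cell pairing, matching the definition above.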
JSD is a measurement of the similarity between two probability distributions. Because the integral over a continuous variable cannot be calculated directly, JSD is only calculated to compare the discrete statistic peak mean:

JSD(P ‖ Q) = ½ KL(P ‖ M) + ½ KL(Q ‖ M),  M = ½ (P + Q),

where KL(· ‖ ·) is the Kullback-Leibler divergence, and P and Q are the two probability distributions.
The KS statistic quantifies the distance between the empirical distributions of two samples, and is derived from the KS test, a nonparametric test of the equality of two continuous distributions. We apply it to continuous statistics such as the log-transformed library size and the cell sparsity.
Median integration local inverse Simpson's index (miLISI) is a score first proposed to evaluate the integration of data from different batches. Gaussian-kernel-based distributions of the neighborhoods of the mixed batches are built in a low-dimensional embedding space, and iLISI is then computed for each neighborhood:

iLISI = 1 / Σ_{b=1}^{B} p(b)²,

where p(b) refers to the probability that a sampled neighbor belongs to batch b and B is the number of batches. By treating the original data and the synthetic data as different batches, this score can be directly adapted to quantify the similarity between synthetic cells and real cells. The score ranges from 1 to 2, and a larger value indicates greater similarity: the closer miLISI is to 2, the more evenly the local neighborhoods mix synthetic and real cells. We calculate the miLISI value on the 2D UMAP embedding containing real and synthetic cells using the R package LISI.
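A simplified Python sketch of the score is given below. It is an approximation, not the published metric: it uses unweighted k-nearest neighbors in place of the Gaussian-kernel neighborhood weights of the R package LISI, and the function name is hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def milisi_knn(emb, batch, k=30):
    """Simplified miLISI: median over cells of the inverse Simpson's index
    of batch proportions among each cell's k nearest neighbors."""
    batch = np.asarray(batch)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(emb)
    _, idx = nn.kneighbors(emb)
    scores = []
    for neigh in idx[:, 1:]:  # drop the query cell itself
        labels = batch[neigh]
        p = np.array([(labels == b).mean() for b in np.unique(batch)])
        scores.append(1.0 / np.sum(p ** 2))
    return float(np.median(scores))
```

With two batches the score is near 1 when neighborhoods are batch-pure (poor mixing) and near 2 when each neighborhood is an even mix.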
Denote the ground-truth labels by y and the clustering labels by ŷ. Homo is calculated as

Homo = 1 − H(y | ŷ) / H(y),

where H(·) is the entropy and H(y | ŷ) is the conditional entropy of the ground-truth clusters given the unsupervised predictions.
AMI is calculated as

AMI = (MI(y, ŷ) − E[MI(y, ŷ)]) / (mean(H(y), H(ŷ)) − E[MI(y, ŷ)]),

where MI is the mutual information and E(·) is the expectation over random label assignments.
ARI is calculated as

ARI = (RI − E[RI]) / (max(RI) − E[RI]),

where the Rand index (RI) is a similarity measurement between ground-truth labels and predicted labels.
The F1 score is calculated as

F1 = 2 · Precision · Recall / (Precision + Recall).
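All four label-comparison metrics are available in scikit-learn, which can serve as a reference implementation of the definitions above (the toy label vectors are illustrative):

```python
from sklearn.metrics import (adjusted_mutual_info_score, adjusted_rand_score,
                             f1_score, homogeneity_score)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]  # same partition as y_true, labels permuted

# Homo, AMI and ARI are invariant to label permutation, so a relabeled but
# identical partition scores perfectly on all three.
homo = homogeneity_score(y_true, y_pred)
ami = adjusted_mutual_info_score(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)

# F1 compares binary predictions directly (e.g. for interaction inference),
# so it is not permutation-invariant like the clustering scores above.
f1 = f1_score([1, 0, 1, 1], [1, 0, 0, 1])  # precision 1, recall 2/3 -> 0.8
```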

Supplementary Note 7: Analysis of the computing resources
To measure the CPU time and memory usage of simCAS, we performed simulations with varying cell numbers (500, 1,000, 2,000, and 5,000) and peak numbers, and observed that CPU time increases with the cell number for each simulation mode. Interestingly, certain isolated scenarios exhibit shorter CPU times with larger peak numbers; for instance, the CPU time with 100,000 peaks is occasionally observed to be shorter than that with 50,000 peaks. It is worth noting that the cell-wise correction within this step tends to be more time-consuming than the peak-wise correction. Consequently, due to random sampling variations, simulations may require less time with a higher peak number at the same cell number. Supplementary Fig. S7b showcases the linear growth of memory usage with both the cell number and the peak number. This can be attributed to the predominant occupant of storage space in simCAS, namely the generated parameter matrix.
Each element of this matrix is a non-zero floating-point value. Notably, generating a peak-by-cell matrix with 100,000 peaks and 5,000 cells in the pseudo-cell-type mode requires approximately 120 seconds and 8 GB of memory, suggesting that simCAS is compatible with personal laptops. To summarize, considering the linear increase of CPU time and memory usage with both cell number and peak number, simCAS can serve as a valuable simulator for generating scCAS data with a substantial number of cells and peaks.
We further conducted a comparative analysis of simCAS, simATAC and scMultisim, focusing on their CPU time and memory usage. Using the peak-by-cell matrix of the Buenrostro2018 dataset as training data, we first fixed the peak number of the simulated data to 169,221 (as in the real Buenrostro2018 dataset) and varied the cell number from 500 to 5,000 to examine the impact of cell number on CPU time and memory usage. Supplementary Fig. S7c illustrates the relationship between cell number and the computing resources required for simCAS, scMultisim, and simATAC. Both simCAS and scMultisim exhibit a linear proportionality with respect to the cell number, indicating increased computing resources as the number of simulated cells grows. Conversely, simATAC displays a less pronounced sensitivity to changes in the number of simulated cells. When simulating a large number of cells, scMultisim exhibits the longest computation time, while simCAS requires the largest memory allocation. We then fixed the cell number of the simulated data to 2,000 and varied the peak number from 10,000 to 100,000. As shown in Supplementary Fig. S7d, the memory usage of simCAS is linearly proportional to the peak number, while the CPU time is not as sensitive to increments of the peak number. For scMultisim, both the CPU time and memory usage demonstrate a linear correlation with the peak number. In contrast, simATAC displays relatively stable computing resource utilization across different peak number settings. These results align with the underlying principles of each simulator. In the framework of simCAS, the parameter matrix correction step consumes the most time and occupies the most storage space, and the cell-wise correction is significantly more time-consuming than the peak-wise correction. In the framework of scMultisim, the key step that impacts computing resources is the value sorting of the parameter matrix, with CPU time and memory usage governed by both the peak number and the cell number.
Conversely, simATAC primarily consumes computing resources during the estimation of the input peak-by-cell matrix, suggesting that the CPU time and memory usage of simATAC are primarily influenced by the shape of the input peak-by-cell matrix rather than the simulated matrix itself.