scHi-CSim: a flexible simulator that generates high-fidelity single-cell Hi-C data for benchmarking

Abstract Single-cell Hi-C technology provides an unprecedented opportunity to reveal chromatin structure in individual cells. However, high sequencing cost impedes the generation of biological Hi-C data with high sequencing depths and multiple replicates for downstream analysis. Here, we developed a single-cell Hi-C simulator (scHi-CSim) that generates high-fidelity data for benchmarking. scHi-CSim merges neighboring cells to overcome the sparseness of data, samples interactions in distance-stratified chromosomes to maintain the heterogeneity of single cells, and estimates the empirical distribution of restriction fragments to generate simulated data. We demonstrated that scHi-CSim can generate high-fidelity data by comparing the performance of single-cell clustering and detection of chromosomal high-order structures with raw data. Furthermore, scHi-CSim is flexible to change sequencing depth and the number of simulated replicates. We showed that increasing sequencing depth could improve the accuracy of detecting topologically associating domains. We also used scHi-CSim to generate a series of simulated datasets with different sequencing depths to benchmark scHi-C clustering methods.


Simulation
Then, scHi-CSim simulates the raw cells by running the following script: python simulating.py -p parameters.txt scHi-CSim is flexible in specifying the number of interactions and replicates of simulated cells by tuning them in the "parameters.txt" file.

Post-processing
Finally, the data generated by the simulation can be combined and converted into bin pairs data by running two scripts: The resolution is assigned by , and the generated sparse bin pairs data can be used for downstream analysis, such as detection of loops and TADs and clustering.

Complexity and running time
The complexity of Step 1 is , where represents the total number of fragment interactions in raw data. The complexity of Step 2 and Step 3 are , where represents the total number of fragment interactions in simulated data. Besides, and are performed independently for each cell. Therefore, they are highly scalable for parallel computation. In summary, the complexity of scHi-CSim is , where and the usage of a multi-kernel CPU will significantly accelerate the simulation process.

Jaccard index
Jaccard index is used to gauge the similarity of two sets. We define , where and are two sets, such as TAD sets. The higher the Jaccard index, the more similarity between and .

Pearson correlation coefficient
Pearson correlation coefficient is defined as , where and are random variables, is the covariance, and are the standard deviation of and respectively.

Compartments detection
Compartments are detected by CscoreTool [4]. CscoreTool assume that is the chance of each genomic window belonging to be in the A-compartment and define C-score as , which ranges between -1 and 1. Then perform the maximum-likelihood estimation for the following log-likelihood function: , where represents the observed number of contacts, represents the genomic distance, represents the scaling factor accounting for the decay profile of longer interactions, and PCR biases or genome mappability of Hi-C experiments is described as and . Finally, Cscore vectors will represent the division of genomic.

TADs detection
TADs are detected by Insulation Score [1]. Hi-C data is first converted to a bin-separated matrix according to a fixed resolution, such as 10kb, 50kb, and so on. A fixed width square is slid along the Hi-C matrix's main diagonal that generates insulation square, the mean contacts of the bins belonging to the square. Next, insulation delta calculation is denoted as for each bin. The bin where the insulation score is a local minimum and the insulation delta is equal to is defined as the TAD boundaries. Besides, for the boundaries' quantitative analysis, the difference between the local maximum and minimum of the insulation delta per boundary is defined as insulation strength. The higher the insulation strength, the greater the difference between the two sides of the boundary.

Percentage
In order to compare the loops between the raw and simulated data, we define percentage as , where .

Hypergeometric test
The probability mass function for the hypergeometric test is given by , where represents the number of background genes (all genes of the mouse), represents the number of genes overlapping with top differentials loops, and represent the total number and the observed number of cell-cycle annotated genes or DEGs.

Circular normal distribution
The circular normal probability density function for the angle is defined as: where is the modified Bessel function of order 0, and are the location and concentration, respectively.

CROC and ACROC
To evaluate the cell-cycle data, the embedding of each stage is assumed to obey circular normal distribution [2]. Then we calculated CROC score and ACROC after fitting each stage to a circular normal distribution. More formally, the embedding results assigned angles along the cell cycle for cells as , and the labels of cells is denoted as . For each cell, naturally and in which represent the number of labels types. In order to calculate the CROC score, we first set one label as positive class and the others as negative class, and denote the mean angle as by fitting the cells of one stage to a circular normal distribution.
Then we design the absolute difference between one cells' angle and the mean angle as: .

ARI
Adjusted Rand index (ARI) is used to measure the similarity between predicted classes and true classes. We define where ARI is based on confusion matrix , represents the cells number that belongs to th cell type labeled by biological marks and th cluster predicted by the algorithm, and are corresponding to the total number of the th row and the th column of , respectively, and is the total number of cells.