Identifying TAD-like domains on single-cell Hi-C data by graph embedding and changepoint detection

Abstract Motivation Topologically associating domains (TADs) are fundamental building blocks of 3D genome. TAD-like domains in single cells are regarded as the underlying genesis of TADs discovered in bulk cells. Understanding the organization of TAD-like domains helps to get deeper insights into their regulatory functions. Unfortunately, it remains a challenge to identify TAD-like domains on single-cell Hi-C data due to its ultra-sparsity. Results We propose scKTLD, an in silico tool for the identification of TAD-like domains on single-cell Hi-C data. It takes Hi-C contact matrix as the adjacency matrix for a graph, embeds the graph structures into a low-dimensional space with the help of sparse matrix factorization followed by spectral propagation, and the TAD-like domains can be identified using a kernel-based changepoint detection in the embedding space. The results tell that our scKTLD is superior to the other methods on the sparse contact matrices, including downsampled bulk Hi-C data as well as simulated and experimental single-cell Hi-C data. Besides, we demonstrated the conservation of TAD-like domain boundaries at single-cell level apart from heterogeneity within and across cell types, and found that the boundaries with higher frequency across single cells are more enriched for architectural proteins and chromatin marks, and they preferentially occur at TAD boundaries in bulk cells, especially at those with higher hierarchical levels. Availability and implementation scKTLD is freely available at https://github.com/lhqxinghun/scKTLD.


Simulation of single-cell and reference Hi-C data
We used the following procedures to simulate single-cell and reference Hi-C contact matrices from the threedimensional physical models of chromosomes.First, we selected two distant fragments (25 Mb -30 Mb and 100 Mb -105 Mb) on chromosome 8 of K562 cell line at random in Rao's bulk Hi-C dataset to mimic two types of cells, respectively, and 100 three-dimensional physical models were generated from each segmented contact matrix of bulk Hi-C with the help of Integrative Modeling Platform (Bau and Marti-Renom, 2012;Serra, et al., 2017).Then for each model, the Euclidean distances , ij Dist between every loci pair (i, j) are calculated, and four contact matrices are generated by weighted sampling the genomic loci, including three single-cell contact matrices corresponding to three minimal contacting distances (500, 750 and 1000) and one reference contact matrix to define the ground-truth TADs (Supplementary Table S2).
For single-cell Hi-C contact matrix, the weight between loci pair (i, j) is

W
and can be calculated by setting the total number of contacts to 1000 per 5 Mb genomic region, close to the experimental single-cell Hi-C data (Tan, et al., 2018).Then the contacts between every loci pair can be simulated by randomly sampling from Binomial distributions with the expected number of contacts without replacement.For reference Hi-C contact matrix, the weight between loci (i, j) is , ,  and the expected number of contacts between them is regarded proportion to , ij W , which can be calculated given that the total number of contacts is about 0.35M per 5 Mb genomic region, close to the experimental bulk Hi-C (Rao, et al., 2014).Similarly, the contacts between every loci pair can be simulated by randomly sampling from Poisson distributions with the expected number of contacts without replacement.To get Hi-C contact matrices at resolution of 50 kb, the simulated Hi-C contact matrices were further binned into 50 kb for TAD-like domain calling.

Chebyshev expansion for the modulated Laplacian
To avoid the expensive computational cost in explicit eigen-decomposition of the Laplacian matrix for large graphs, we used truncated Chebyshev expansion to speed up the calculation of the modulated Laplacian, which has been proven to have the ability to approximate the modulated Laplacian well in a fast way (Kipf and Welling, 2016;Zhang, et al., 2019).For Chebyshev polynomials of the first kind, we have 0 ( ) 1 Tx , 1 () T x x  and the higher order can be obtained by iterating the formula 11 ( ) 2 ( ) ( )

  
, then the modulator function will be     and the modulated Laplacian can be approximated by Chebyshev polynomials with the following forms: where f  , which can be easily calculated according to the orthogonality of ()  ， .k is the truncated order, which is set to 10 in this study.After obtaining these coefficients, L can be approximated as: where   i B  is the modified Bessel function of the first kind.The equation ( 2) enables us to calculate L and propagate the embeddings in the spectrally modulated network very efficiently.Finally, to maintain the orthogonality of the original embedding space, SVD is performed again on the propagated embeddings.

PELT optimization
PELT considers each observation t y sequentially and use an explicit pruning rule to determine whether or not to discard it from the set of potential changepoints (Killick, et al., 2012).Let   Then PELT can be specified as follows in Supplementary Algorithm 1.

Supplementary Algorithm 1 PELT
// Find the optimal from the candidate changepoints up to index t Configurations of scKTLD and competing methods scKTLD.Our scKTLD treats symmetric single-cell Hi-C contact matrix as an adjacency matrix for a weighted graph, embeds the graph into a low-dimensional space with the help of sparse matrix factorization followed by spectral propagation, and identifies the TAD-like domains in the embedding space using a kernel-based changepoint detection.
To implement this method, the source code can be downloaded from https://github.com/lhqxinghun/scKTLD,and there are two parameters need to be tuned.During the implementation of TADfit in this study, we set the embedding dimension to 128 and the penalty constant to 1.42 by default.
deTOKI.deTOKI uses a sliding window to segment the entire chromosome into smaller sub-matrices spanning 8 Mb genomic regions.Then a consensus map is constructed by multiple non-negative matrix factorizations (NMF), and the TAD-like domain boundaries can be determined based on the cluster rate (CR) calculated on the consensus map.To implement this method, the program TOKI was downloaded from https://github.com/lixiaoms/TOKI,and two main parameters need to be specified.We set the resolution to the bin size of input contact matrices, and set the TAD mean size (measured in kb) to (600, 800) by default.Besides, deTOKI supports multi-threaded execution, and we set the number of threads to 4, 8, and 12 to evaluate its computational efficiency with multiple threads.deDoc.deDoc is a TAD detection algorithm based on the theories of structural information and graph.It seeks to extract a structure that minimizes the global uncertainty of the Hi-C graph.To implement this method, the program deDoc was download from https://github.com/yinxc/structural-information-minimisation,and no parameters need to be specified to run it.deDoc integrates two algorithms, denoted as deDoc(E) and deDoc(M), where deDoc(M) reapplies deDoc(E) to its identified modules iteratively according to the proposer.In our test, deDoc(M) typically produce extremely small TADs on single-cell Hi-C data, occupying only about 2 bins.Therefore, we used deDoc(E) for comparison.
TopDom.TopDom is a TAD detection algorithm based on statistics.It calculates a binSignal for each bin by averaging the interaction frequencies upstream and downstream, and determines the TAD boundaries by local minima calling and statistic filtering.In our comparison, TopDom was performed using the R package TopDom (0.10.1), which can be downloaded from CRAN repository, and two parameters need to be specified to execute it.window.sizewas set to 10 and statFilter was set to true according to the instructions for the package.
scHiCluster.scHiCluster uses convolution and random walk with restart to impute the sparse single-cell Hi-C contact matrix, and only top-ranked interactions preserved after imputation.Then it employs TopDom to call the TAD-like domains on the imputed contact matrices.To implement this method, the python package scHiCluster was downloaded from https://github.com/zhoujt1994/scHiCluster,and there are five parameters need to be tuned.We configured these parameters by default as follows: pad=1, rp=0.5, prct=20, window.size=10, and statFilter=true.
SpectralTAD.SpectralTAD identifies hierarchical TADs by a modified version of the multiclass spectral clustering algorithm.The initial TADs are obtained by maximizing an average silhouette score, and the hierarchical structures are then determined by iteratively portioning the initial TADs.This method was implemented using R package SpectralTAD (v.1.2.0) which can be downloaded from BiocManager.During the implementation, the levels of hierarchical structures was set to 1 since we only consider basic TAD-like domains in single cells in our study.The parameter qual_filter was set to false, and min_size was set to 5 by default.
GRiNCH.GRiNCH employs graph regularized NMF to obtain the lower-dimension representation of a highdimensional Hi-C contact matrix while capturing the distance dependence of interaction frequencies, and chainconstrained k-medoids clustering is performed to find TADs.The program of GRiNCH was downloaded from https://roy-lab.github.io/grinch/,and there are four parameters need to be tuned: the number of clusters, expected size of a cluster, neighborhood radius, and regularization strength.During the implementation of GRiNCH, we set these parameters to the default values recommended by the software.
Higashi.Higashi transforms the input single-cell Hi-C data into a hypergraph, where each hyperedge connects one cell node and two bin nodes, and imputes the single-cell Hi-C contact matrices by predicting missing hyperedges within the hypergraph.The TAD-like domains are then called by finding the local minima of the insulation score on the imputed contact matrices.To implement Higashi, the source code was downloaded from https://github.com/macompbio/Higashi,and there are 14 parameters required to be tuned.According to the configure file given by the original proposer, we set these parameters as follows: resolution=50 kb or 25 kb (according to the bin size), minimum_distance Discard the impossible changepoints from set for searching end for Output: set L[n] of estimated changepoint indexes.
The bolded indicate methods specifically developed for single-cell Hi-C data, and the numbers inside the parentheses behind scHiCluster and deTOKI indicate the numbers of threads.