PHi-C2: interpreting Hi-C data as the dynamic 3D genome state

Abstract Summary High-throughput chromosome conformation capture (Hi-C) is a widely used assay for studying the three-dimensional (3D) genome organization across the whole genome. Here, we present PHi-C2, a Python package supported by mathematical and biophysical polymer modeling that converts input Hi-C matrix data into the polymer model’s dynamics, structural conformations and rheological features. The updated optimization algorithm for regenerating a highly similar Hi-C matrix provides a fast and accurate optimal solution compared to the previous version by eliminating the factors underlying the inefficiency of the optimization algorithm in the iterative optimization process. In addition, we have enabled a Google Colab workflow to run the algorithm, wherein users can easily change the parameters and check the results in the notebook. Overall, PHi-C2 represents a valuable tool for mining the dynamic 3D genome state embedded in Hi-C data. Availability and implementation PHi-C2 as the phic Python package is freely available under the GPL license and can be installed from the Python package index. The source code is available from GitHub at https://github.com/soyashinkai/PHi-C2. Moreover, users do not have to prepare a Python environment because PHi-C2 can run on Google Colab (https://bit.ly/3rlptGI). Supplementary information Supplementary data are available at Bioinformatics online.

. (iv)Σ 2 is transformed into the contact matrix C = 1 +Σ 2 −3/2 . Here, we briefly express this procedure as C = φ(K). If we can find the inverse function φ −1 , we can solve the inverse problem K = φ −1 (C) for any input contact matrix. The first step of the inverse function isΣ 2 = C −2/3 − 1 . As we previously noted 1 , there is an inverse transformation fromΣ 2 toM because the matrixM satisfies the conditionM1 1 1 = 0 0 0, where 1 1 1 is a vector (1, 1, · · · , 1) T . Here, we explicitly provide the inverse transformation. Let m m m be a vector that consists of the diagonal elements of the matrix M, M 00 ,M 11 , · · · ,M N−1,N−1 T , and we define a square matrix B as (m m m, m m m, · · · , m m m). Then, the relationship between Σ 2 andM can be written as 3Σ 2 = B + B T − 2M. Multiplying 1 1 1 from the right, we obtain 3Σ 2 1 1 1 = Am m m, where A is a square matrix defined by NI + 1 1 11 1 1 T . Since the matrix A is non-singular, there exists the inverse matrix given by A −1 = 2NI−1 1 11 1 1 T 2N 2 . Therefore, we can determine the diagonal elements of the matrixM by m m m = 3A −1Σ2 1 1 1 and calculate all elements byM i j =M ii +M j j −3Σ 2 i j 2 .
Next, we considered a transformation fromM toL. Theoretically, the matrixM should be positive semidefinite according to the definition. However,M calculated through the above transformation did not necessarily satisfy the positive semidefiniteness requirement, which is mainly due to the experimental noise of Hi-C data. Therefore, the Laplacian matrixL could not be found in a straightforward manner by solving the eigenvalue problem. Instead, a different strategy was used to construct a pseudo-Laplacian matrixL using the Moore-Penrose (MP) inverse matrixM + of the matrixM 2 . MatricesM andL theoretically satisfy the relationshipML = I − 1 1 11 1 1 T N ; therefore, the MP inverse matrix minimizes the Frobenius norm ML − I + 1 1 11 1 1 T N F and provides the solution byL =M + I − 1 1 11 1 1 T N . The construction of the MP inverse matrix is based on the singular value decomposition of the matrixM, and we can easily implement it on numerical scripts. Finally, we can obtain a pseudo-interaction matrixK byK = −L and zero-diagonals (K ii = 0). Based on the above algorithm, we reconstructed a contact matrixC = φ(K) from an input matrix C (Supplementary Figure S1A). Note that although the pseudo-Laplacian matrixL is no longer positive semidefinite (Supplementary Figure S1B), the numerical calculation of the eigenvalues and eigenvectors proceeds normally. We estimated the closeness of all elements between C andC. Surprisingly, the allclose function in NumPy returned True, which means that the two matrices are element-wise equal within the default numerical tolerance. Taken together, we were able to find the inverse transformation φ −1 . The optimization procedure is a central part of PHi-C. We define the optimization process as an iterative finding of the normalized interaction matrixK that decreases a cost function f (C, φ(K)) for an input contact matrix C. To clarify the mathematical concept, we introduce the following sets (Supplementary Figure S2): For a small ϵ, a matrixK ∈ U ϵ is an optimal candidate. In Supplementary Note 1, we found the minimal solution φ −1 (C) =K ∈ U ϵ=0 by the inverse transformation. However, since the induced Laplacian matrix does not necessarily satisfy the positive semidefiniteness, the polymer system becomes unstable and the normalized interaction matrix K is unrealistic. Thus, the set A ensures the physically acceptable matrixK with connections along the polymer backboneK i,i+1 > 0 as a stable polymer system.
LetK n be the normalized interaction matrix at the n-th iteration step in the optimization procedure. An initial matrixK 0 should be in the set A, and an optimization path in K N would go pass into set U ϵ for a small ϵ (Supplementary Figure S2). Then, paths that output an optimalK ∈ U ϵ ∩ A are acceptable, while paths that pass into U ϵ ∩ A c should be forbidden.

Physically acceptable Physically unrealistic
Supplementary Figure S2. Mathematical concept of the PHi-C optimization. The PHi-C optimization iteratively decreases the cost function in K N . The set U ϵ has optimal candidates of the matrixK. As the inverse transformation induces the pseudo-interaction matrixK, the matrixK is just an element of the set U ϵ=0 . Optimization paths can pass into the set U ϵ ∩ A c , where the polymer systems are physically unrealistic. Optimization paths should move in the physically acceptable set A.

Updated algorithm
We mainly updated the optimization algorithm regarding the definition of the cost function and the iterative operation at each step (Supplementary Table S1). Accordingly, both factors provided faster and more accurate optimization results than the previous version. First, we re-defined the cost function as

PHi-C2
Cost function: Optimization operation at every iteration step For a randomly selected pair (i, j), For all the elements, This value represents the standard deviation between the contact probabilities C and φ(K) per element. Moreover, we calculated the standard deviation between the logarithmic contact probabilities in the previous version (Supplementary  Table S1). Since the logarithmic function is not defined for a zero value, an interpolate operation is needed for the zero-valued elements to ensure that the shape of the contact probability curve remains unaltered 1 . Therefore, the newly defined cost function does not require the interpolation process. Every iterative optimization step updates all the elements of the matrixK n and obeys the following rule: where η is the learning rate parameter. In the previous version of PHi-C, the random choice of a matrix element (i, j) and its slight change of the element value at every iteration step were a bottleneck (Supplementary Table S1). The new iterative rule overcomes the bottleneck by updating the values of all the elements at one time. Finally, we set a stop condition due to a monotonical decrease of the cost function, 0 < f (C, φ(K n−1 )) − f (C, φ(K n )) < δ =⇒ outputK n as an optimal solutionK opt , where δ = η α and α is a parameter for the stop condition. Overall, the above three points are significant updates of the PHi-C2 algorithm. Thus far, we have not been able to mathematically ensure that every optimization path decreases the cost function monotonically in the set A (Supplementary Figure S2). The update coefficient φ(K n−1 ) − C does not exactly correspond to the gradient , although the gradient-based algorithm has already been formulated 4 and shows a good performance 5 . However, we can operationally obtain an optimal solution in the set A by tuning the initial pointK 0 ∈ A, learning rate η, and stop condition parameter α.

Supplementary Note S3: Benchmarks
After the optimization process, we can estimate the closeness of the optimal contact matrix C opt = φ(K opt ) to the input contact matrix C using the Pearson correlation coefficient (PCC), r, and distance-corrected PCC 6 , r ′ , for the non-diagonal elements of both matrices. Here, we show benchmarks for Hi-C data of mESCs 3 [chr1: 50-60 Mb (25-kb bins)].
For the 400 × 400-sized input matrix of mESCs and a default set of optimization parameters ((K 0 ) i,i+1 = 0.5, η = 10 −4 , α = 10 −4 ), PHi-C2 output an optimal solution with r = 0.998 and r ′ = 0.970 in 26,307 iteration steps (Supplementary Figure S3A and Supplementary Table S2). The differences between the PHi-C1 and PHi-C2 optimization algorithms are summarized in Supplementary Table S1, and they indicate that the calculation of the cost function f (C, ϕ(K n )) for all matrix elements in PHi-C1 seems relatively inefficient for a slight change of a randomly selected matrix element at each iterative step. In contrast, all matrix elements are updated together using the feature of NumPy for matrix calculations in PHi-C2. Therefore, the optimization process should be more efficient. As expected, for the same iteration steps and η = 4 × 10 −3 , an optimal solution of the previous PHi-C (PHi-C1) had a value of r = 0.966 (Supplementary Figure S3B), although the meaning of the cost function and the learning rate differs between PHi-C1 and PHi-C2. We can confirm that the updated algorithm improves the speed and accuracy of the optimization procedure compared to that of PHi-C1.
Next, we show benchmarks for different optimization parameters: initial values of the polymer backbone (K 0 ) i,i+1 , learning rate η, and stop condition parameter α (Supplementary Table S2). We carried out all calculations using an Intel ® Xeon ® Gold 6154 processor (24.75 M Cache, 3.00 GHz) with Intel ® distribution for the Python environment and Google Colab environment. The procedure stopped within less than 100,000 iteration steps and 30 min and 90 min for these parameters in our workstation and the Google Colab environment, respectively. Interestingly, although the optimal cost function values differed, the correlation values reached 0.998. As our stop condition relates to the cost function curve saturation, the optimal solutions would be close to local minimums in the set U ε=0.00223 . Also, we could confirm that the optimal solutions are located in the set A due to the positive semidefiniteness (Supplementary Figure S3C).
The speed of the PHi-C2 optimization depends on the matrix calculations at every iterative step. To clarify the speed and accuracy features, we benchmarked the PHi-C2 optimization by changing the different bin sizes from 25 kb to 50, 100, and 250 kb for the same genomic region [chr1: 50-60 Mb] (Supplementary Table S3). The optimized accuracy and number of iterative steps differed due to the different input Hi-C matrices. At the same time, the calculation time dramatically decreased depending on the reduction of the matrix size. A rough estimation of the ratio (9m54s)/(2m41s) = 3.69 · · · ≃ 4 for 25-and 50-kb bin sizes implies that the optimization time is in O(N 2 ), where N is the number of the diagonal bins or beads of our polymer model.
As we stated, the update algorithm does not always decrease the cost function monotonically. We observed such counter-examples for Hi-C data of the mouse erythroid cells 7 [chr1: 3-195.5 Mb (500-kb bins)] in the prometaphase. Here, we altered only the initial values of the polymer backbone (K 0 ) i,i+1 = 0.1 ∼ 1.0 (fixed η = 10 −4 and α = 10 −4 ). Some curves showed an increase in the cost function within the initial 100 steps (Supplementary Figure S4A). Nevertheless, every curve eventually converged and output an optimal solution. Moreover, for Hi-C data in the late G1 phase, we confirmed a monotonical decrease of the cost function (Supplementary Figure S4B).   Rheology spectra of the loss tangent tan δ(ω) as a measure of the liquid-like (> 1) and solid-like (< 1) states of three cases.