EVRC: reconstruction of chromosome 3D structure models using error-vector resultant algorithm with clustering coefficient

Abstract Motivation Reconstruction of 3D structure models is of great importance for the study of chromosome function. Software tools for this task are highly needed. Results We present a novel reconstruction algorithm, called EVRC, which utilizes co-clustering coefficients and error-vector resultant for chromosome 3D structure reconstruction. As an update of our previous EVR algorithm, EVRC now can deal with both single and multiple chromosomes in structure modeling. To evaluate the effectiveness and accuracy of the EVRC algorithm, we applied it to simulation datasets and real Hi-C datasets. The results show that the reconstructed structures have high similarity to the original/real structures, indicating the effectiveness and robustness of the EVRC algorithm. Furthermore, we applied the algorithm to the 3D conformation reconstruction of the wild-type and mutant Arabidopsis thaliana chromosomes and demonstrated the differences in structural characteristics between different chromosomes. We also accurately showed the conformational change in the centromere region of the mutant compared with the wild-type of Arabidopsis chromosome 1. Our EVRC algorithm is a valuable software tool for the field of chromatin structure reconstruction, and holds great promise for advancing our understanding on the chromosome functions. Availability and implementation The software is available at https://github.com/mbglab/EVRC.


Introduction
Since the implementation of ENCODE, scientists have discovered that the folding and winding of chromatin in the nucleus is not random.Regulatory elements require physical contact with target genes to perform their functions (Handoko et al. 2011, Li et al. 2012, Dekker et al. 2013).The high-level structure of chromatin is increasingly recognized as a crucial component of many nuclear activities (Misteli 2007, Dekker et al. 2013).While traditional fluorescence microscopic imaging techniques can directly observe the position and relative distance of different chromosomes in the cell, they suffer from low flux and low resolution.However, the birth of chromosome conformation capture (3C) technology and its derivative technologies has provided researchers with a way to obtain a large number of chromatin interaction data (Dekker et al. 2002, Rao et al. 2014).Researchers have discovered that the 3D structure of chromatin plays an important role in many life activities, such as gene expression regulation (Li et al. 2012, Slager andVeening 2016), cell development (Koohy et al. 2018), and the occurrence of genetic diseases (Lupia ´ñez et al. 2015, Dixon et al. 2018).By using chromatin interaction data to reconstruct chromosome 3D structure models, researchers can gain a deeper understanding of the fine structure and dynamic changes of chromosomes.This will enable researchers to explore the chromatin formation mechanism of the genome, discover the relationship between regulatory function and spatial conformation, and provide new ideas for the study of heterosis, genetic diseases, and cancer cell therapy.
The current methods for modeling the 3D structure of chromatin can be broadly categorized into two types: thermodynamics-based (TB) and restraint-based (RB) modeling (Serra et al. 2015).TB modeling uses polymer physical properties of chromatin fibers to construct the spatial arrangement of chromatin, while RB modeling employs spatial constraints between chromatin fragments obtained by Hi-C technology to build 3D models of chromatin (Marti-Renom and Mirny 2011).One of the advancements in this field is the miniMDS method, which uses genome partitioning and parallelization to reduce memory requirements and achieve faster speed.This technique can calculate the 3D structure of the human genome at kilobase resolution in less than 5 h (Rieber and Mahony 2017).ClusterTAD is another technique that uses unsupervised machine learning models and image analysis technology to identify topologically associating domains (TADs) in the diagonal of the interaction heat map; the interaction matrix is treated as pixel points in the image to identify the TADs (Oluwadare and Cheng 2017).HiCExplorer, on the other hand, utilizes machine learning methods to identify TAD boundaries accurately; it distinguishes TAD boundaries from non-boundaries and detects the ones missed when using Hi-C data alone (Ram ırez et al. 2018).Researchers have also incorporated information obtained from other experimental methods in structure modeling, such as FISH experiments (Abbas et al. 2019), which detect relative positions and spatial distances between different DNA fragments.These integrations can help improve and evaluate the reliability and stability of chromosome 3D structure reconstruction.Recently, more and more reconstruction tools have be developed (Li et al. 2020, Wang et al. 2022, Varoquaux et al. 2023).For example, ShNeigh combines the classical multidimensional scaling (MDS) technique with local dependence of neighboring loci to infer the 3D structure from noisy and incomplete contact frequency matrices (Li et al. 2020).
In this article, we present a novel EVRC algorithm that utilizes Hi-C experiments data to reconstruct the 3D structure of chromatin.Our approach relies on the co-clustering coefficient (CC) and error-vector resultant (Hua and Ma 2019).Specifically, the algorithm begins by calculating the co-CC between chromatin fragments.It then adds the error vector of each fragment weighted by the co-CC together to iteratively optimize the reconstruction of the 3D chromatin structure.As an update of our previous EVR algorithm (Hua and Ma 2019), EVRC now can deal with both single and multiple chromosomes in structure modeling.We applied the EVRC algorithm to six simulated structures and real Hi-C datasets to demonstrate its validity and accuracy.Our results indicate that the algorithm is an effective tool for reconstructing the chromatin structure and has the potential to significantly advance the study of chromatin structure and function.

Simulation data
To assess the effectiveness of EVRC algorithm for 3D structure reconstruction, we generated multiple simulation datasets with six 3D structure models of varying complexities: (A) circular curve (circle), (B) open spiral curve (spiral), (C) closed spiral curve (circular spiral), (D) replication fork, (E) double helix, and (F) double spherical helix.(A)-(C) are single-chain structures, and (D)-(F) are double-chain structures, simulating different chromatin morphology of prokaryotes/eukaryotes with one or multiple chromosomes.For the double-chain model structures, the intra-chain and inter-chain contacts should be considered, which simulates the intra-and interchromosome DNA interactions.Each structure is composed of 500 points.In the 3D structure, the reciprocal of the space distance between each two points is taken as the interaction frequency between the two points, and the interaction matrix of the structure is obtained in this way.The single-chain structure generates only the interaction matrix within a single curve (simulating single chromosome), while the double-chain structure generates the interaction matrix within each of the two curves and the interaction matrix between them (simulating multiple chromosomes).

Hi-C data
We used three published chromosome interaction datasets: (i) the raw chromosome interaction frequency matrix with 50 kb resolution of human IMR90 cell line (GEO accession number: GSE63525) (Rao et al. 2014); (ii) the Dixon2012 dataset (Dixon et al. 2012), which is from the GSDB database (Oluwadare et al. 2020) and includes the normalized chromosome interaction frequency matrix with 40 kb resolution of four cell types: mouse embryonic stem cell (mESC), mouse cortex cell (mCortex), human embryonic stem cell (hESC), and human IMR90 fibroblasts (hIMR90); (iii) the normalized chromosome interaction frequency matrix with 50 and 200 kb resolution of Arabidopsis (GEO accession number: GSE37644, where GSM1078404 was the wild-type Hi-C data and GSM1078405 was the AtMORC6 mutant Hi-C data) (Moissiard et al. 2012).

FISH data
The FISH data for human cell line IMR90 used in this study was adopted from a previous work (Abbas et al. 2019), which is originally from https://www.sciencemag.org/content/353/6299/598/suppl/DC1 (Wang et al. 2016).

Conversion of interaction frequency to spatial distance
The Hi-C experimental data generates a chromatin interaction matrix that reflects the intensity of interactions between DNA fragments (bins).The closer the space distance between bins, the higher the interaction intensity.In order to transform the interaction intensity into spatial distance, the commonly used negative exponential function is adopted in this article: Here, D ij represents the spatial distance between bin i and bin j , F ij represents the interaction frequency between bin i and bin j , and a is the exponential parameter.In this article, the a value defaults to 0.5 and can be adjusted according to the data characteristics.

Co-clustering coefficient
In graph theory, the CC is used to describe the degree of clustering between nodes in a graph (Watts and Strogatz 1998).Specifically, it measures the closeness of the connections between the adjacent nodes of a node.In an undirected network, if node i has k i neighbor nodes, and if all the k i nodes are connected in pairs, the total number of edges E i would be: If the actual number of edges between the k i nodes is e i , then the CC of node i is From this formula, it can be seen that the higher the number of edges actually present between the neighbor nodes, the greater the CC of node i (0 CC i 1), indicating a closer group.
The chromatin interaction matrix reflects the interaction relationships between the DNA fragments (bins) and can be regarded as an undirected network among bins, in which each bin is a node.After conversion to a distance matrix, D ij represents the spatial distance between bin i and bin j .If D ij is not 1, there is a connection between bin i and bin j .Based on Equation (2), we can define the co-CC between bin i and bin j .Let the set of neighbor nodes of node bin i be N i , and the set of neighbor nodes of node bin j be N j .The union of the neighbor sets of node bin i and node bin j is U ij ¼ N i [ N j , which contains k ij nodes.The co-CC of the nodes bin i and bin j is defined as In this formula, coCC ij represents the co-clustering coefficient of nodes bin i and bin j (Fig. 1); e ij is the actual number of edges connecting the nodes in U ij , and E ij is the number of edges if all nodes in U ij are connected in pairs.The closer the coCC ij to 1, the greater the tendency of co-clustering between bin i and bin j in the 3D structure of chromatin.For example, bins in the same TAD tend to have higher co-CCs.We consider the co-CC as an important parameter for 3D structure reconstruction.

Reconstruction of 3D chromosome structure model
The objective of RB chromosome 3D structure modeling algorithm is to generate a 3D structure that fits experimental data as closely as possible.In previous work, we proposed an algorithm based on the error vector resultant (EVR) to model the 3D structure of chromosomes (Hua and Ma 2019).In this article, we have developed the EVR algorithm by incorporating the co-CC between bins, resulting in the EVRC algorithm.
For each bin, its initial 3D coordinates are randomly generated.Assuming the position relationship between bin i and bin j (i 6 ¼ j) as shown in Fig. 2A, P i and P j represent the corresponding position vectors, e ij represents the unit vector of p i À p j , and D ij represents the spatial distance between bin i and bin j (D ij 6 ¼ 1).The error vector E ij between bin i and bin j is defined as: The sum of the error vectors from bin j to other bins can be obtained using the parallelogram rule.For example, the sum of the error vectors from bin j to bin 1 and bin 2 is shown in Fig. 2B.Summing over the genome-wide range, the total error vector of bin j is where w ij ¼ coCC ij /(N À 1) b is the weight in sum, coCC ij is the co-CC between bin i and bin j , N is the total number of bins, and b is a parameter for adjusting the convergence speed (called convergence factor; the larger its value, the slower the convergence speed) that is set to 0.1 by default.
According to the flow of EVR algorithm (Hua and Ma 2019), the structure model of chromosome is obtained through an iterative process.In each iteration step, bin j 's resultant error vector E j is added to its current 3D coordinates to obtain the new coordinates and then the iteration proceeds to the next step.For all bins in the genome, the global optimization objective is defined as where F is the sum of resultant error vectors of all bins, and F t and F t À 1 are the F values of two successive iterative steps t and t À 1.With continuous iterative optimization, the resultant error vector of each bin and the F value become smaller and smaller, making the generated 3D structure more and more consistent with the interaction data.According to the flow of EVR algorithm (Hua and Ma 2019), when DF is less than a certain value or the maximum number of iteration steps is reached, the iterative process stops, resulting in the final chromosome 3D structure model.To better visualize the structure model, the 3D coordinates of the structure can be smoothed using a linear Gaussian filtering function, which contains a parameter (smoothing factor) that is set to "auto" by default, meaning no smoothing.Users can adjust the value of this parameter according to their needs.Gaussian smoothing is only used to optimize visualization, not for algorithm evaluation, that is, algorithm evaluation is always based on unsmoothed coordinates.

Model evaluation indexes
To assess the accuracy of the reconstruction algorithm, simulation data generated based on model structures are often employed.In this article, the 3D structures are reconstructed from the simulation data of six models mentioned above and then compared with the original model structures to evaluate In the subplot (A), there are more connecting edges between the neighbor nodes of node i and node j, so the co-CC is large (coCC ij ¼ 10/21 ¼ 0.476).In the subplot (B), the number of connecting edges between the neighbor nodes of nodes i and j is small, so the co-CC is small (coCC ij ¼ 6/21 ¼ 0.286).
Chromosome reconstruction using error-vector resultant algorithm with clustering coefficient the effectiveness of the reconstruction algorithm.Two indicators are used to measure the similarity between the reconstructed 3D structure and the original structure: root meansquared deviation (RMSD) of 3D coordinates (Arun et al. 1987) and Pearson correlation coefficient (PCC) based on spatial distance (Hua and Ma 2019).

Iteration and convergence
The EVRC algorithm initializes the chromosome structure by randomly assigning 3D coordinates to each bin.Then, under the guidance of the resultant error vector, the 3D structure is gradually optimized through an iterative process until it converges or reaches the maximum number of iterations.In each iteration step, the values of F and DF are calculated according to Equations ( 6) and ( 7), respectively.Under default parameter settings, the curve of F value changing with the number of iterations for the six simulating structures is shown in Supplementary Fig. S1.As can be seen, the F value of the randomly generated initial structure is high, but decreases rapidly after several iterative steps, and tends to 0 steadily after some intermediate fluctuations.The iterative process terminates when the convergence criterion (DF < 1E-6) is met, and the optimized structure is obtained (Fig. 3).The reconstructed 3D structure is highly consistent with the original structure, and the RMSD values are small.The results demonstrate that the EVRC algorithm is effective for reconstructing both openloop and closed-loop structures, as well as single-chain and double-chain structures.Therefore, the EVRC algorithm is generally effective for 3D chromatin reconstruction.

Noise and smoothing
To demonstrate the effectiveness of EVRC algorithm for reconstructing 3D structures under different levels of noise, we added noise to the simulation data and reconstructed the six simulating structures based on the noisy data.Briefly, noise is added before the transformation from a spatial distance D to an interaction frequency by using this formula: , where D is the distance between two bins in the simulating structure and r is a random number in the interval [À1, 1] and s is the noise level taken value from [0.0, 0.1, 0.2, . .., 1.0].Supplementary Figure S2 displays the changes in RMSD and PCC with increasing noise level.The figure shows that the RMSD value gradually increases with increasing noise level, while the PCC decreases.At a noise level of 0.1, the PCCs of all the six simulating structures are greater than 0.99, indicating the high similarity with the real structures.At the maximum noise level of 1, the corresponding PCCs are still all greater than 0.96.The reconstructed 3D structure at noise level 1 (Fig. 4) shows that although the curves are fluctuated due to the high noise, they still closely surround the real structure without significant deviation.Our modeling scheme assumes that each bin corresponds to a point in the 3D structure of a chromosome, which results in a relatively sharp curve formed by directly connecting the points with straight lines.For a better visualization of the reconstructed structure, Gaussian filtering function might be used for smoothing as shown in Supplementary Fig. S3, where the noise level is 1 and the smoothing factor is 2.

Simulation data
We compared the performance of EVRC with five representative algorithms, namely EVR (Hua and Ma 2019), miniMDS (Rieber and Mahony 2017), ShRec3D (Lesne et al. 2014), ShNeigh (Li et al. 2020), andMOGEN (Trieu andCheng 2016) on the six simulating structures with varying noise levels.All the algorithms were run with default parameters.To reduce the impact of randomness, each algorithm was run ten times at each noise level to obtain the average RMSD and PCC.The results indicate that when the noise level is near 0, the RMSD values of EVRC, EVR, ShRec3D, and ShNeigh are also close to 0 (Fig. 5), while the PCCs of these algorithms are close to 1 (Fig. 6) for the single-chain structures.This indicates that these algorithms can effectively reconstruct the original structure at very low noise levels.However, as the noise level increases, the RMSD values of most algorithms increase gradually, while the PCC values decrease.Notably, EVRC displays better stability than other algorithms.Thus, the EVRC algorithm exhibits superior stability compared to other algorithms as the noise level increases.

Real Hi-C data
In addition to the simulating structures, we also evaluated the performance of each algorithm using datasets from the real Hi-C experiments of human cell lines.Given that the real structure of human chromosomes is unknown, we cannot calculate the RMSD value between the reconstructed structure and the real structure of human chromosomes.However, we can use the FISH experimental data to evaluate the accuracy   The results, as shown in Table 1, revealed that, the PCC of EVRC is 0.7576 for chromosome No. 20 and 0.8206 for chromosome No.21, the highest among the compared algorithms.Meanwhile, for chromosome No. 20, the miniMDS and ShRec3D methods failed to generate 3D structure due to the missing interaction data for the middle region of this chromosome in the Hi-C experiment.On the other dataset (Dixon et al. 2012), subject to the capabilities of different algorithms in dealing with input data (Supplementary Table S1), only EVRC, MOGEN, and ShNeigh can build all the models in this dataset.Meanwhile, since there is no corresponding FISH data for this dataset, we followed a previous  work (Li et al. 2020) to compute Spearman correlation coefficient (SCC) between the input frequency matrix and the frequency matrix recovered from reconstructed structure.As shown in Fig. 7, for all the chromosomes of four cell types, the performance (SCC) of EVRC far exceeds ShNeigh and MOGEN.Overall, these results demonstrate that the EVRC algorithm is accurate.

Application of EVRC to plant 3D genomes
3C technologies have led to the publication of chromatin interaction data for more and more species.In this study, we applied the EVRC algorithm to the chromatin interaction datasets of Arabidopsis to reconstruct the 3D chromosome structures.The Arabidopsis thaliana chromatin interaction dataset used in this study was derived from two different samples, GSM1078404 (wild-type) and GSM1078405 (AtMORC6 mutant), both available in the GEO database (accession number: GSE37644).
The datasets included five intra-and inter-chromosome interaction matrices.AtMORC6 is a member of the conservative Microrchidia (MRC) family that catalyzes changes in chromosome superstructure (Moissiard et al. 2012).AtMORC6 mutants in A.thaliana show depolymerization of heterochromatin in centromeres, leading to an increased interaction between the centromeric region and the rest of the genome (Moissiard et al. 2012).
The input data for the EVRC algorithm was the normalized interaction matrixes with a resolution of 50 and 200 kb.The 3D structure of chromatin was obtained by running 20 000 iteration steps under default parameters.Figure 8 shows the reconstructed 3D conformation of chromosomes for wildtype A.thaliana.The colors green, cyan, magenta, yellow and brown represent chromosomes 1-5, respectively.Figure 8A has a resolution of 200 kb, and Fig. 8B has a resolution of 50 kb.The overall morphology of the chromosomes is consistent between different resolutions (Supplementary Fig. S4), with the structure in Fig. 8A being smoother and the structure in Fig. 8B being more detailed.These results indicate that the EVRC algorithm accurately reconstructs chromosome conformations at different resolutions with strong stability.The shapes of the five chromosomes are different, with chromosomes 1, 3, and 5 showing a narrow "U" shape, and the centromere region having varying degrees of curvature.Chromosomes 2 and 4 are hooked, with the ends of the chromosomes being closer.The common feature for all the five chromosomes is the loose centromere region.The 3D reconstruction results of the five chromosomes in A.thaliana demonstrate the effectiveness of the EVRC algorithm in reconstructing conformation of multiple chromosomes, suggesting that EVRC can serve as a valuable chromatin 3D reconstruction tool in eukaryotes.
Figure 9 depicts the 3D conformation of chromosome 1 at a resolution of 50 kb, with the wild-type shown in red and the mutant in blue.The non-overlapping part represents the centromere region.As seen in the figure, the structures of the Figure 7. Evaluation based on the Dixon2012 dataset.The evaluation index is the SCC between the input interaction frequency matrix and the recovered interaction frequency matrix from the generated models.mESC, mouse embryonic stem cell; mCortex, mouse cortex cell; hESC, human embryonic stem cell; hIMR90, human IMR90 fibroblasts.
Chromosome reconstruction using error-vector resultant algorithm with clustering coefficient wild-type and the mutant show a high degree of similarity except for the centromere region, which is separated by 180 .In the mutant, the centromere region is closer to the surrounding chromatin, resulting in enhanced interaction and giving the chromosome a "C" shape.In the wild-type, the centromere region folds upward, giving the chromosome an "S" shape.
The reconstruction results are consistent with the functional characteristics (Moissiard et al. 2012), indicating the validity and accuracy of the EVRC algorithm in plant chromosome structure reconstruction.

Discussion
In this article, we have presented an algorithm called EVRC, which can be used to reconstruct the 3D structure of chromosomes based on spatial distance constraints that are related to chromatin condensation characteristics.The algorithm takes the co-CC between two bins as the weight and sums the error vectors to determine the optimal coordinates of bin position.Through continuous iterative optimization, the error vector becomes smaller and smaller, and the reconstructed structure approaches the actual structure.We applied the EVRC algorithm to six simulating structure datasets of varying complexity and to real Hi-C experimental datasets.The reconstruction process of the algorithm was demonstrated, and its performance was evaluated using the RMSD, PCC, and SCC.We used UCSF chimera (Pettersen et al. 2004) to visualize the reconstructed chromosome models by various algorithms on the human dataset IMR90 (Supplementary Figs S5 and S6).According to the research of Rao et al. (2014), we annotated subcompartments A1, A2, B1, B2 by different colors in the reconstructed models of human IMR90 dataset (Supplementary Figs S5 and S6).The reconstructed models by EVRC nicely demonstrate not only the reasonable spherical shapes of the chromosome conformations (Abbas et al. 2019) but also the spatial co-localization of bins in subcompartments.
As an update of our EVR algorithm (Hua and Ma 2019), EVRC now can deal with both single and multiple chromosomes in structure modeling, which makes it applicable to multichromosome eukaryotes.We applied the EVRC algorithm to the Hi-C datasets of A.thaliana, and the reconstruction results of the five Arabidopsis chromosomes at different resolutions showed that the overall morphology of chromosomes was consistent.We also reconstructed chromosome No. 1 of wild-type  and mutant A.thaliana, and found that the centromere regions are separated by 180 .In the mutant, the centromere region was closer to the surrounding chromatin, and the interaction was enhanced, resulting in the whole chromosome exhibiting a "C" shape.In contrast, in the wild-type, the centromere region folds upward, resulting in the chromosome exhibiting an "S" shape.These results confirmed the effect of AtMORC6 protein on the centromere region of chromosomes (Moissiard et al. 2012).
In summary, our EVRC algorithm has wide applicability and high accuracy, and has a great potential for multichromosome 3D structure reconstruction.

Figure 1 .
Figure1.Definition of co-CC.Red circles represent nodes, and green lines represent interactions between nodes.In the subplot (A), there are more connecting edges between the neighbor nodes of node i and node j, so the co-CC is large (coCC ij ¼ 10/21 ¼ 0.476).In the subplot (B), the number of connecting edges between the neighbor nodes of nodes i and j is small, so the co-CC is small (coCC ij ¼ 6/21 ¼ 0.286).

Figure 2 .
Figure 2. Diagram of error-vector resultant.(A) The error-vector E ij between bin i and bin j .(B) The resultant of error-vectors of E 1j and E 2j according to the parallelogram rule.
of the reconstruction algorithm.Specifically, we found the Hi-C data(Rao et al. 2014) and the corresponding FISH data(Wang et al. 2016) on chromosomes No. 20 and No. 21 of the human IMR90 cell line through a literature search.We mapped the fluorescent-labeled sites onto the reconstructed 3D structure, calculated the spatial distances between these sites in the reconstructed structure, and analyzed the correlation with the distances measured in the FISH experiments.

Figure 5 .
Figure 5. Based on the noisy data of six simulating structures, the RMSD values of the reconstructed structures by each algorithm compared with the original structures under different noise levels.Note that only EVRC, miniMDS, and MOGEN have reconstruction results for the double-chain structures (namely, replication fork, double helix, and double spherical helix).

Figure 6 .
Figure 6.Based on the noisy data of six simulating structures, the PCC values between the spatial distances in the reconstructed structures and in the original structures under different noise levels.Note that only EVRC, miniMDS, and MOGEN have reconstruction results for the double-chain structures (namely, replication fork, double helix, and double spherical helix).

Figure 8 .
Figure 8.The 3D conformation of wild-type A.thaliana chromosomes reconstructed by EVRC algorithm.Chromosomes 1-5 are shown in green, cyan, magenta, yellow, and brown, respectively.(A) At a resolution of 200 kb; (B) at a resolution of 50 kb.

Figure 9 .
Figure 9.Comparison of the chromosome 1 structure between wild-type and AtMORC6 mutant of A.thaliana.The red structure represents the wild-type chromosome and the blue structure represents the AtMORC6 mutant chromosome.A clear separation in the centromere region can be observed.(A) and (B) show the same structure in different orientations (viewpoints).

Table 1 .
Pearson correlation coefficient between the spatial distance in reconstructed structure and the distance measured by FISH experiment in human IMR90 cells.