Structural heterogeneity and functional diversity of topologically associating domains in mammalian genomes

Recent chromosome conformation capture (3C) derived techniques have revealed that topologically associating domain (TAD) is a pervasive element in chromatin three-dimensional (3D) organization. However, there is currently no parameter to quantitatively measure the structural characteristics of TADs, thus obscuring our understanding on the structural and functional differences among TADs. Based on our finding that there exist intrinsic chromatin interaction patterns in TADs, we define a theoretical parameter, called aggregation preference (AP), to characterize TAD structures by capturing the interaction aggregation degree. Applying this defined parameter to 11 Hi-C data sets generated by both traditional and in situ Hi-C experimental pipelines, our analyses reveal that heterogeneous structures exist among TADs, and this structural heterogeneity is significantly correlated to DNA sequences, epigenomic signals and gene expressions. Although TADs can be stable in genomic positions across cell lines, structural comparisons show that a considerable number of stable TADs undergo significantly structural rearrangements during cell changes. Moreover, the structural change of TAD is tightly associated with its transcription remodeling. Altogether, the theoretical parameter defined in this work provides a quantitative method to link structural characteristics and biological functions of TADs, and this linkage implies that chromatin interaction pattern has the potential to mark transcription activity in TADs.


INTRODUCTION
Recent chromosome conformation capture (3C) derived techniques (1,2), especially Hi-C (3), have revealed that chromatins are partitioned into topologically associating domains (TADs), also called topological domains, in which the intra-domain chromatin interactions are significantly stronger than inter-domain interactions. Remarkably, TAD acts as a pervasive, or at least in part, structural element of chromatin three-dimensional (3D) organization across species, including Caulobacter crescentus (4), Plasmodium falciparum (5), Arabidopsis (6,7), yeast (8), drosophila (9), mouse (10,11) and human (11). Comparative analysis further shows that the genomic positions of TADs appear to be stable across cell types and conserved across species in mammals (11). Besides chromatin organization, TAD provides structural basis for chromatin regulation. It was found that most identified enhancer-promoter interactions were located in the same TADs (12). The study on mouse revealed that the limb development was regulated by the switch of two neighbor TADs (13). A recent work on bacteria C. crescentus provided the direct evidence that disrupting TADs leads to the change of gene expressions (4).
Given the importance of TADs, it is critical to interpret the structural characteristics of TADs and their association with biological functions. Previous works (4,(9)(10)(11) annotated the TADs by integrating transcription factor (TF) binding and epigenomic signals, revealing that TADs could be different in biological functions. However, integrative annotation cannot reveal the structural characteristics of TADs. Thus, it is hard to build the linkage between structure and function in this way. To bridge the gap, the interaction frequency together with principle component analysis were previously used to identify open and close structures of chromatin regions at the genome-wide scale (3,6). Nevertheless, using this calculation to distinguish domain structures in the same class is difficult since specific chromatin interaction patterns are not considered. To our knowledge, there is currently no parameter to quantitatively measure the chromatin interaction patterns of TADs.
Inspired by our finding that TADs exhibit specific chromatin interaction patterns, we define a theoretical parameter, called aggregation preference (AP), to represent the structural characteristics of TAD and then build its association with biological functions. First, the locally high-frequency chromatin interactions violating distancedependence decay are selected from TAD. Then a densitybased algorithm DBSCAN (14) is used to cluster the selected chromatin interactions into groups according to their spatial aggregation. Finally, the weighted density of clustered groups is defined as AP to represent the interaction aggregation degree. This defined parameter is next used to investigate 11 chromatin interaction data sets on human and mouse cell types generated by traditional Hi-C and in situ Hi-C, a so-called improved pipeline published very recently (15). The Hi-C data on other species are excluded from this work because their domain sizes are too small at current resolution. Statistical analyses show that TADs are quite different in chromatin interaction patterns and these structural characteristics are significantly correlated to DNA sequence, epigenomic signals and transcription activity. Intercell-line comparison further reveals that the detailed structures of TADs can be significantly rearranged across cell lines although these domains are stable in genomic positions. Furthermore, the structural rearrangement is tightly associated with epigenomic remodeling and transcriptional regulation. Altogether, we define a theoretical parameter to build linkage between structural characteristics and biological functions of TADs.

Hi-C data sources and processing
For traditional Hi-C experiments, the data sets of human cell line hESC and two mouse cell lines (mESC and Cortex) were downloaded from NCBI with accession number GSE35156 (11), and the data sets on human cell lines IMR90 and GM12878 were downloaded from NCBI with accession numbers GSE43070 (16) and GSE48592 (17), respectively. The in situ Hi-C data sets of human cell lines GM12878, IMR90, K562, HMEC, HUVEC and NHEK were downloaded from NCBI with accession number GSE63525 (15). Only the data sets generated by MboI restriction enzyme in the original work are used in this work, and the data sets on human (KBM7) and mouse (CH12-LX) cell lines are eliminated due to lack of epigenomic data.
The pipeline of Hi-C data processing follows a previous procedure (18) by using hg19 and mm10 as human and mouse reference genomes. Briefly, pair-end reads are mapped to the genome using an iterative mapping scheme, in which each read is truncated to 25 bp first and extended by 5 bp iteratively until the mapping with maximum read length is achieved. The uniquely mapped pair-end reads are used for the next analysis. To maintain data reliability, different types of noisy reads are removed from data sets, including the reads originated from polymerase-chainreaction duplication, the reads started within 5 bp of the restriction site or located in the same restriction fragments, the reads located on very large (100 kb) or small (100 bp) fragments, the so-called random-break reads, and the reads that face towards each other and are separated by less than the library length. Fragments with the top 0.5% number of reads are also eliminated from data sets. The summary of Hi-C data is provided in Supplementary Table S1.

Chromatin representation and TAD identification
Chromatin is partitioned into bins (or beads) at given resolution, and the kept reads are assigned to these beads to generate the observed interaction matrix (O i j ) N×N for each Hi-C data set, where N is the number of beads in the chromatin and O i j is the observed interaction frequency between beads i and j. The gap regions arising from DNA repeat elements or the absence of enzyme restriction sites are eliminated from the observed interaction matrix to facilitate next calculations (19). To remove experimental biases, interaction matrix is corrected by using hiclib package (18) with default relative error (10%). The corrected matrix ( f i j ) N×N is used as input to perform TAD identification by using directionality index (DI) based hidden Markov model (HMM) (11).

Chromatin interaction parameter calculation
The statistically significant chromatin interactions are selected by considering both distance-dependent decay and local interaction background (15). Specifically, the averaged interaction frequency at given genomic distance k in a n-beads TAD is calculated by f Then B-Spline is used to approximate the curve between f (k) and k to obtain the smoothed interaction frequency S(k). Generally, the smoothed curve will fluctuate after the first turning point k r due to the shortage of super long-range chromatin interactions in the TAD.
To solve this problem, the averaged interaction fre- where (i, j ) represents given chromatin interaction in the TAD. Besides genomic distance, the expected interaction frequency of chromatin interaction (i, j ) is calculated by using square windows to further take local interaction background into consideration: where 2 p + 1 and 2w + 1 specify the widths of two square windows centered at (i, j ) with p < w. Then the P-value for observing interaction frequency f loor( f i j ) is calculated based on the Poisson process with expected value To further reduce noisy chromatin interactions, the chromatin interactions with low windowed interaction interaction selection (around 30% in this work to account for low sequencing depths in some data sets). Then ∼5% of chromatin interactions with highest statistical significance in the TAD are selected for next analysis. By following previous work (15), the window parameters p and w are set to 4 and 7 respectively at 5 kb resolution, 2 and 5 at 10 kb resolution, and 1 and 3 at 20 kb resolution.
The selected chromatin interactions are clustered into spatially neighbored groups G i , called interaction blocks in this work, by using the density-based algorithm DB-SCAN (14). This algorithm utilizes the fact that the clustered points tend to have high local density and the isolated Nucleic Acids Research, 2015, Vol. 43, No. 15 7239 points tend to have low local density. In the calculation, the point density is defined to be the number of points located in the spatial sphere at a given radius. Two parameters, i.e. the minimum number of points and the spatial radius, are involved in this calculation. Previous work (20) has discussed how to set these two parameters, and our calculation follows this procedure.
The convex hull H i is calculated from interaction block G i by using QuickHull algorithm (21). The so-called corepoint p j in DBSCAN is defined as the point surrounded by at least n p points (aforementioned minimum number of points) in convex hull H i , and its local density is calculated by dp j = n p n p m=1 dist( p j , p m ), where p m is the m th closest point to target point p j , and dist() denotes the squared distance between two points. The interaction density of convex hull H i is calculated by The density of isolated convex hull is set to zero. Finally, chromatin interaction aggregation preference of this domain is defined as the weighted density: where pts(H i ) and pts(T AD) mean the numbers of selected interactions in the convex hull H i and TAD, respectively. It is worth noting that other parameters may also be developed to characterize TAD structures by using these significant interactions.

TAD annotation and comparison
The DNA sequence features, such as transcription start sites (TSSs), short interspersed nuclear elements (SINEs) and long interspersed nuclear elements (LINEs), were downloaded from ENCODE (22). LaminB1-associated binding signals in mESC were obtained from the reference (23), in which the nucleotides have already been processed to be binding or unbinding ones. The ChIP-Seq, including TF binding and epigenomic signals, and processed RNA-Seq reads were downloaded from ENCODE (22). The ChIP-Seq signals were uniformly normalized by using the align2rawsignal software (A. Kundaje, http://code.google. com/p/align2rawsignal/). When calculating Pearson correlation between AP and given annotation signal, the processed signal in the domain are summed and then averaged by domain length to represent signal strength of TAD. To reduce the negative impact of confounding factors, such as GC-composition of the genome, in computing P-value for correlation coefficient, the Pearson correlation between AP and randomly shuffled annotation signal (less than 1 Mb independently for different chromosomes) is calculated. This shuffled correlation is repeated 500 times independently, and the final P-value is computed by testing the difference between original correlation coefficient and shuffled correlation average under Fisher's z transformation (24). TAD stability is measured by using the boundary overlap in genomic positions between two cell lines or between two experimental pipelines on the same cell line. The number of stable TADs decreases as stricter criterion is used (Supplementary Figure S1). To compromise between TAD number and stability, TADs sharing similar boundaries (less than 5 bins in both left and right ones) in genomic positions are considered to be stable. The overlapped block regions of two stable TADs are divided by total blocks in each cell line, and the averaged overlap percentage is used to calculate Pearson correlation between AP change and block overlap. By using the biological replicates sharing similar intra-chromosomal interaction reads in different cell lines, statistical analysis shows that the stable TADs consistently exhibit higher structural variations arising from cell lines than those arising from biological replicates (Supplementary Figure S2). To simplify the TAD selection when performing structural comparisons across cell lines in diverse cases, quantile regression is used to select stable TADs with top AP change by accounting for the difference in structural variations (25,26). Finally, around 20% of stable TADs are selected to perform structural comparisons in detail (Supplementary Figure S3). When investigating the association between parameter AP and annotation signals in TADs, the Wilcoxon signed rank test is used to calculate the P-value for each investigated signal.

Parameter calculation from Hi-C chromatin interactions
Five traditional Hi-C maps are denoted as hESC-T, GM12878-T, IMR90-T, mESC-T and Cortex-T to represent three human cell lines and two mouse cell lines, while six in situ Hi-C maps are denoted as GM12878-I, IMR90-I, K562-I, HMEC-I, HUVEC-I and NHEK-I to represent corresponding human cell lines. To compromise between higher resolution and limited sequencing depth, TADs were identified at 20 kb resolution for traditional Hi-C data sets and at 10 kb resolution for in situ Hi-C data sets by using DI based HMM proposed by a previously work (11). For three deepest sequencing data sets (GM12878-I, IMR90-I and K562-I), additional TAD identifications were performed at 5 kb resolution. The validity of TAD identification at given resolutions has already been evaluated in both traditional and in situ Hi-C data sets elsewhere (11,15). Finally thousands of TADs were generated in each cell line (Supplementary Figure S4). Since different kinds of structural characteristics can be obtained at different resolutions due to hierarchical chromatin architecture (27), the TADs identified from different resolutions are not directly compared in this work.
Thorough inspection revealed the nonrandom distribution of those high-frequency chromatin interactions, which tend to form different types of spatial clusters (Figure 1). Inspired by this finding, we defined a novel parameter, called aggregation preference (AP), to represent the structural characteristics of TADs by calculating the aggregation degree of high-frequency chromatin interactions. First, the statistical significance of observing each chromatin interaction was calculated by considering both the genomic distance and local interaction background. Second, the chromatin interactions were sorted by calculated P-values, and ∼5% of top significant ones with locally high interaction frequencies were selected for next analysis. Third, the selected chromatin interactions were clustered into spatial groups, called interaction blocks in this work, by using a density-based algorithm DBSCAN (14). Finally, the parameter AP was defined as the weighted density of interaction blocks in each TAD (Materials and Methods). It can be expected that the parameter AP defined from selected interactions captures the structural characteristics of TAD based on the fluorescence in situ hybridization (FISH) observations that statistically significant interactions are often physically preferred contacts (6,(8)(9)(10)12,15,28), despite essential difference between these two concepts due to the complicated nature of chromatin interactions (29).
Pearson correlations between biological replicates show that the defined parameter AP is highly reproducible in both traditional and in situ Hi-C data sets ( Supplementary Fig-ure S5). The cell lines hESC-T, HUVEC-I and NHEK-I are eliminated from reproducibility evaluation, either because significant difference exists in sequencing depth between two replicates or because there are no biological replicates. AP values calculated from merged replicates are used for next analysis in all cell lines. We next performed TAD identification and AP calculation for GM12878-I and IMR90-I at 20 kb resolution to further evaluate the robustness of parameter AP between two experimental pipelines in the same human cell lines. To reduce the impact of differences in sequencing depth and data quality, only the stable TADs shared by both Hi-C pipelines were used to calculate the reproducibility of parameter AP. Our result shows that the parameter AP is quite reproducible between two experimental pipelines (Supplementary Figure S6). Finally, these two kinds of reproducibility analyses together indicate the robustness of our defined parameter.

Structural heterogeneity and functional implications of TADs
We explored the structural characteristics of TADs by using interaction blocks and corresponding parameter AP. Meanwhile, domain structure was functionally annotated by using DNA sequences, epigenomic signals and transcription activity. Among these signals, SINEs are often found in gene-rich regions (30) and LINEs are generally located in heterochromatin (31). Lamina-associated bindings are generally heterochromatin signals. H3K4me3 and H3K4me1 are promoter and enhancer markers, respectively. H3K27ac generally reflects that the enhancer and promoter are active or not. RNAPII binding and H3K36me3 reflect the transcription activity, and RNA sequence (RNASeq) directly represents transcription level (32).
The interaction blocks generated from selected chromatin interactions are the structural and functional bases to distinguish TADs. Though previous works (16,33) have identified functional chromatin interactions on the same cell lines, our interaction selection is different from those works in terms of scientific purpose. Moreover, our method further utilizes the clustering patterns of locally highfrequency chromatin interactions to distinguish TADs. Figure 2A illustrates that TADs exhibit different types of interaction blocks. The chromatin interactions are dispersedly distributed to form low-density interaction blocks in the left domain, but the selected chromatin interactions are greatly aggregated to form high-density interaction blocks in the right domain. This structural difference is well captured by the defined parameter AP. The annotation signals show that the domain with high-density blocks is more enriched in active signals, such as SINEs, active epigenomic signals and RNA expression, implying that the structural characteristics of TAD is correlated to biological function. The same situation can also be observed in traditional Hi-C data sets (Supplementary Figure S7).
Statistical analysis was performed to systematically investigate the relationship between structure and function in TADs. The wide distributions of AP values suggest that TAD structures are heterogeneous in all cell lines (Supplementary Figure S8). Some domains have high AP values, indicating that the selected chromatin interactions are aggregated to form high-density and small-size interaction blocks. By contrast, the domains exhibiting low AP values own the dispersedly distributed chromatin interactions. Nevertheless, the widely distributed values suggest the mixed situations in most domains. Consistent with Figure  2A, the parameter AP also shows significant correlation to active signals and anti-correlation to inactive signals (LINE and LaminB1-association binding) in both traditional and in situ Hi-C data sets ( Figure 2B). These results suggest that TADs with highly aggregated interactions are generally located in active regions but TADs with disperse interactions are generally located in inactive regions. However, most domains contain both active and inactive regions with diverse biological functions.

Structural rearrangements of TADs across cell lines
We next investigated the structural rearrangements of TADs across cell lines by using the parameter AP. Our analysis showed that TADs could be stable in genomic positions across cell lines, consistent with previous study (11). We then calculated the Pearson correlation between AP change and block overlap on these stable TADs, obtaining considerably negative coefficients (Supplementary Table S2). This result shows that AP change can indicate the overlap degree of interaction blocks between stable TADs, but the relationship is a little complicated. Therefore, quantile regression was used to select different types of stable TADs with significant AP change to perform structural comparisons in detail (Supplementary Figure S3).
There exist different kinds of structural changes across cell lines, which can be captured by interaction blocks and corresponding AP values. Figure 3 illustrates that TAD can undergo block split when comparing human cell lines GM12878-I with IMR90-I. Intuitively, the statistically significant chromatin interactions in cell line GM12878-I are more dispersedly distributed compared with those in cell line IMR90-I, which is accurately captured by interaction blocks and the parameter AP. Together with aforementioned functional annotation, the structural change indicates that this TAD is activated in cell line IMR90. Consistently, the increase of active epigenomic and transcription signals (H3K4me1, H3K4me3, H3K27ac, H3K36mes, RNAPII and RNASeq) validates the increased transcription activity in cell line IMR90-I (Figure 3). In addition, the reverse direction of structural change, block merging, is also captured by our defined parameter ( Supplementary Figure S9). Generally, the structural rearrangements of TADs are rather complicated, such as concurrent appearance, disappearance, split and merging of interaction blocks (Supplementary Figure S10), and thus the relationship between individual block change and functional change should be complex. However, the parameter AP can capture the overall structural and functional changes in various cases. The similar situations can be observed in traditional Hi-C data sets ( Supplementary Figures S11 and S12).
Genome-wide comparisons on the parameter AP revealed that a considerable number of TADs undergo structural changes in both human and mouse cell lines (Supplementary Figure S3). Some domains become more aggregated in chromatin interactions, and some domains transit to disperse chromatin interaction patterns. To statistically analyze the association between structural change and transcription remodeling, the selected TADs were paired by AP values, and Wilcoxon signed rank test was performed on epigenomic and transcription signals (Materials and Methods). Figure 4 shows that structural change is significantly associated with the remodeling of transcription activity in both traditional and in situ Hi-C cell lines, indicating that most TADs becoming interaction-aggregated are activated and those becoming interaction-dispersed are repressed in regulatory activity. The observed exceptions are partly caused by the complicated nature of structural variations and biological functions of TADs. These results suggest that chromatin interaction pattern captured by param-  eter AP has the potential to mark regulatory functions at genome-wide scale in mammalian genomes.

DISCUSSION
Previous methods only focused on identifying functional chromatin interactions based on different purposes (16,33), but neglected the aggregation patterns of these chromatin interactions. In this paper, we utilize the intrinsic chromatin interaction pattern of TAD to define a novel parameter. By using this chromatin interaction parameter, we systematically investigate the structural characteristics of TADs and its association with biological functions in human and mouse cell lines, thus providing insights into chromatin 3D structure and functions in mammalian genomes. Heterogeneous structures exist among TADs and this structural heterogeneity is probably encoded by DNA sequences and corresponding regulatory activities. Previous studies have shown that TAD is a pervasive element in chromatin 3D organization (4)(5)(6)(7)(8)(9)(10)(11), but the structural difference among domains is not well understood. Statistical analysis on the defined feature reveals that the aggregation degree of chromatin interactions can be widely distributed among TADs, indicating the existence of diverse domain structures at high resolution. Integrative analysis further shows that euchromatin TADs have statistically higher chromatin interaction aggregation than heterochromatin TADs. This result can be partly explained by the fact that euchromatin regions are rich in genes and regulatory elements, which provide greater potential to form promoter-enhancer interaction clusters (34). Thus, the DNA sequence and regulatory activity may have resulted in the structural heterogeneity and functional diversity of TADs.
Chromatin interaction pattern has the potential to mark gene regulation and transcription activity, at least in TADs. Several recent works explored the relationship between chromatin interactions and biological functions. McCord et al. showed that the progress of Hutchinson-Gilford progeria syndrome was related to the reorganization of chromatin compartments (35). Rousseau et al. showed that chromatin conformation data in HOXA locus could be used to classify leukemia types, with better performance than RNA-Seq data (36). By combining polymer model with RNA-FISH, Giorgetti et al. revealed that transcription activities were coupled with chromatin conformation fluctuations in X-inactivation center (28). In this work, our systematical comparisons further show that structural change of TAD is significantly associated with its transcription remodeling in genome-wide scale. These results imply that chromatin interaction pattern has the potential to mark biological functions in mammalian genomes.
In summary, we define a parameter, the aggregation preference of statistically significant chromatin interactions, to represent the structural characteristics of TADs in this work, allowing us to systematically investigate domain structures and their association with biological functions. Statistical analysis shows that TADs are different in chromatin 3D structure, and this structural difference may be encoded in DNA sequences and corresponding regulatory activities. TADs can also undergo significant structural rearrangements across cell lines though they may preserve the domain positions in the genome. Moreover, this structural change is tightly associated with transcription remodeling. With the rapid expansion of Hi-C data, the chromatin interaction parameter defined in this work can provide a useful tool to explore the biological functions underlying the structural characteristics of TADs, especially when the epigenomic and TF binding signals are poor.