Analysis of the structural variability of topologically associated domains as revealed by Hi-C

Abstract Three-dimensional chromosome structure plays an integral role in gene expression and regulation, replication timing, and other cellular processes. Topologically associated domains (TADs), building blocks of chromosome structure, are genomic regions with higher contact frequencies within the region than outside the region. A central question is the degree to which TADs are conserved or vary between conditions. We analyze 137 Hi-C samples from 9 studies under 3 measures to quantify the effects of various sources of biological and experimental variation. We observe significant variation in TAD sets between both non-replicate and replicate samples, and provide initial evidence that this variability does not come from genetic sequence differences. The effects of experimental protocol differences are also measured, demonstrating that samples can have protocol-specific structural changes, but that TADs are generally robust to lab-specific differences. This study represents a systematic quantification of key factors influencing comparisons of chromosome structure, suggesting significant variability and the potential for cell-type-specific structural features, which has previously not been systematically explored. The lack of observed influence of heredity and genetic differences on chromosome structure suggests that factors other than the genetic sequence are driving this structure, which plays an important role in human disease and cellular functioning.


Hanging TADs
TADsim identifies structurally similar genomic regions by breaking the genome up into every possible interval that starts and ends at a TAD boundary in either TAD set, and testing each interval for statistically significant similarity of the TAD sets within them. One of the artifacts of this method is that the boundaries of these intervals may fall within a TAD, creating a false TAD boundary at the start of end of the interval, and the algorithm is unable to distinguish between true TAD boundaries and those imposed by the genome segmentation. We therefore define a "hanging TAD" as a TAD at the edge of an interval that has been truncated to less than 50% of its original length. Hanging TADs can appear to be a perfect match to a true TAD and will therefore be included in the interval identified by the method (an example of this can be seen in Figure S1, where the hanging TADs have been circled).
In order to avoid this, a preprocessing step has been added to the algorithm. Sub-intervals that include hanging TADs (as defined above) are removed from consideration before testing for statistical significance, guaranteeing that they will not be included in the output. Figure S2 shows the output for the same input set as Figure S1, after removing hanging TADs. Though there are fewer total intervals identified after this modification, the area covered by both outputs is essentially the same.

Adaptive shuffling
After computing the VI between TAD sets on every sub-interval, statistically significant subintervals are chosen by an adapted permutation test. For each sub-interval, one TAD set is fixed and TADs from the other set are randomly permuted n times, calculating the VI at each permutation. The appropriate number of permutations n depends on the number of TADs being shuffled. Using a fixed number n = 1000 meant that in some cases, results would differ on different runs with the exact same input due to under-sampling artifacts.
In order to improve the robustness of our results, we have modified the method to continue randomly permuting TADs until the p-values converge. We define convergence as the point where the p-values for 5 consecutive iterations are equal up to at least 5 decimal places. Though this approach appears more computationally intensive, in practice, it does not take more than twice as much time as the fixed 1000 shuffles. This is because, for regions with very few TADs, convergence is reached very quickly and requires significantly fewer than 1000 permutations, somewhat balancing out the instances in which more than 1000 permutations are required.

Parallelism, concurrency, and memory optimization
One of the challenges with the original implementation of TADsim was efficiency. Among all of the subroutines, the permutation testing is the most time intensive, taking up to several hours Figure S3: Speedup resulting from various code optimizations for tests on chromosome 1 of A549 and Caki2. The speedup is initially directly proportional to the number of processing cores used, but the speedup plateaus once the number of processes exceeds the number of cores, potentially due to processor saturation.
depending on the number of TADs in the input sets. Thus, we modified this method to make it both concurrent and memory optimized. In this implementation, each execution can spawn one or more processes with each process having 1 or more threads. The independence of each permutation in the p-value calculation sub-routine allows for high concurrency that is maximized when each process has only one thread. The speedup is then directly proportional to the number of processing cores. However, once the process count becomes more than the number of cores there is only a minor improvement in the speedup which may be attributed to processor saturation (see Figure  S3). The tests were done on a 24-core/48-thread 2.6 GHz Intel Xeon E5-2690 machine.

Execution Time
Each of the method changes described above affect the overall execution time of TADsim. For example, while adaptive shuffling often leads to an increase in the runtime, removing hanging TADs greatly reduces it, as shown in Figure S4a. To identify the effects of each of the functions we ran 22 random tests, 1 for each chromosome with random cell types on a 24-core/48-thread 2.6 GHz Intel Xeon E5-2690 machine. Figure S4b shows the runtime comparison of the original program against the modified program with different thread counts.   Table S1: Hi-C samples downloaded and processed that could not be analyzed at 100kb resolution  Figure S5: Reproducibility versus Hi-C coverage. Reproducibility is quantified by the similarity value of each replicate pair. We compute a contact count for each replicate sample by adding all non-normalized contacts on all intra-chromosomal matrices, and Hi-C coverage is defined as the smaller of the two contact counts for the replicate samples being compared. Across all three measures, especially the two quantifying TAD reproducibility, there is a low correlation with coverage though very high coverage experiments tend to have high reproducibility. Under all measures, the pairs containing samples from the two different protocols were statistically significantly lower than pairs from the same protocol (p < 0.01), demonstrating that samples produced through the different protocols have lower similarity due simply to this technical variation rather than meaningful biological differences.   Figure S9: Robustness study for the results on tissue samples and parent-parent-child trios. The boxes outline the first to third quartiles of data, the midline represents the median, and outliers are shown as diamonds. These boxplots represent the distributions of JI (a, c) and TADsim (b, d) values for various median TAD sizes (500kb, 700kb, 880kb, and 1Mb). The 880kb distributions are also shown in Figure 2, and are included here for comparison. (a,b) The "Background" represents all pairs from our data with at least one cell line (as opposed to a tissue sample), and "Replicates" distributions are the values from all replicate pairs. The "Same tissue, different donor" values come from pairs of samples of the same tissue type collected from different individuals, and "Different tissue" represents all pairs of tissue comparisons from two different tissue types. (c,d) The "Background" distributions include all pairs with at least one non-blood lymphocyte sample. "Blood lymphocyte pairs" include all comparisons of two different blood lymphocyte samples that do not come from the trio data. "Trio replicates" are the replicate comparisons from each of the trio samples (n = 9). "Within trio pairs" represent comparisons of samples from the same trio (either two parents, or a parent and their child, n = 9). a b c d Figure S10: Robustness study for the results on comparing in situ and dilution Hi-C. The boxes outline the first to third quartiles of data, the midline represents the median, and outliers are shown as diamonds. These boxplots represent the distributions of JI (a, c) and TADsim (b, d) values for various median TAD sizes. The 880kb distributions are also shown in Figure 3, and are included here for comparison. (a,b) "In situ replicates" represents all replicate comparisons of samples collected by an in situ protocol, and "Dilution replicates" shows all replicate comparisons collected with a dilution protocol. (c,d) "Different cell types" represents all comparisons where the two samples compared are not of the same cell or tissue type. "Same cell type, different protocol" shows the similarity values of pairs in which both samples are the same cell type but one was generated with an in situ protocol, and one came from a dilution protocol. "Same cell type, same protocol" shows the similarities of pairs in which both samples come from the same cell type and both were generated by the same protocol (either both in situ or both dilution). a b Figure S11: Robustness study for the results on restriction enzymes. The boxes outline the first to third quartiles of data, the midline represents the median, and outliers are shown as diamonds. These boxplots represent the distributions of JI (a) and TADsim (b) values for various median TAD sizes. The 880kb distributions are also shown in Figure 4, and are included here for comparison. The "Same cell type, different restriction enzyme" values represent similarities of pairs of samples from the same cell line (either HFF-hTERT or hESC), where different restriction enzymes were used during the Hi-C protocol (n = 13). The "Background" here represents all other non-replicate pairs. a b Figure S12: Robustness study for the results on lab-specific variation. The boxes outline the first to third quartiles of data, the midline represents the median, and outliers are shown as diamonds. These boxplots represent the distributions of JI (a) and TADsim (b) values for various median TAD sizes. The 880kb distributions are also shown in Figure 5, and are included here for comparison."Same cell type, different study" distributions show the similarities of pairs of samples of the same cell line (either IMR90 or hESC), which came from different studies (n = 16). "All replicates" distributions show the values from all replicate pairs in the data collected, and "All non-replicates" shows all non-replicate pairs which are not in the "Same cell type, different study" category.