HiC4D: forecasting spatiotemporal Hi-C data with residual ConvLSTM

Abstract The Hi-C experiments have been extensively used for the studies of genomic structures. In the last few years, spatiotemporal Hi-C has largely contributed to the investigation of genome dynamic reorganization. However, computationally modeling and forecasting spatiotemporal Hi-C data still have not been seen in the literature. We present HiC4D for dealing with the problem of forecasting spatiotemporal Hi-C data. We designed and benchmarked a novel network and named it residual ConvLSTM (ResConvLSTM), which is a combination of residual network and convolutional long short-term memory (ConvLSTM). We evaluated our new ResConvLSTM networks and compared them with the other five methods, including a naïve network (NaiveNet) that we designed as a baseline method and four outstanding video-prediction methods from the literature: ConvLSTM, spatiotemporal LSTM (ST-LSTM), self-attention LSTM (SA-LSTM) and simple video prediction (SimVP). We used eight different spatiotemporal Hi-C datasets for the blind test, including two from mouse embryogenesis, one from somatic cell nuclear transfer (SCNT) embryos, three embryogenesis datasets from different species and two non-embryogenesis datasets. Our evaluation results indicate that our ResConvLSTM networks almost always outperform the other methods on the eight blind-test datasets in terms of accurately predicting the Hi-C contact matrices at future time-steps. Our benchmarks also indicate that all of the methods that we benchmarked can successfully recover the boundaries of topologically associating domains called on the experimental Hi-C contact matrices. Taken together, our benchmarks suggest that HiC4D is an effective tool for predicting spatiotemporal Hi-C data. HiC4D is publicly available at both http://dna.cs.miami.edu/HiC4D/ and https://github.com/zwang-bioinformatics/HiC4D/.


Insulation score
The insulation score (IS) for the -th 40-kb bin is calculated as following: where  and  are the bin sets of [ − 5,  − 1] and [ + 1,  + 5], respectively, (, ) is the Hi-C contacts at bin indexes  and ,  is equal to 5, and   is the average IS of all bins on the current chromosome.
The final IS (  ′ ) is the log transformation of log 2 ( + 1), where  is the ratio of each bin's IS (  ) and the average IS (  ).

Different lengths of input time steps
We used the four methods (two next-frame methods ConvLSTM and ResConvLSTM, two n-step ahead methods NaiveNet and SimVP) for testing different lengths of input time steps.It was easier for the two next-frame methods to change the lengths of input time steps by resetting from which time step we start to use the reconstructed matrix as the input of next time.For the two n-step ahead methods, we usually need to revise the network by appending an output layer or resetting the last out channels to obtain predicted matrices with the desired lengths of predicted time steps.For example, if we want to use NaiveNet to predict five time-steps with only one time step as input (1 → 5), we just set the output channel of the last layer to five.

Fig. S1 .
Fig. S1.The architecture of ResConvLSTM2.The outputs of every five ResConvLSTM blocks are concatenated as input for the final output layer.

Fig. S2 .
Fig. S2.The architecture of the 3-layer NaiveNet.The number of groups for each of the two group normalizations (GroupNorm) is set to 2. The negative slope for each of the two LeakyReLUs is set to 0.2.The padding tuple for each of the three Conv3Ds is (1, 3, 3) for keeping the shape of 3 × 50 × 50.The three input temporal channels are  1 ,  2 , and  3 , while the three output temporal channels are corresponding to  4 ,  5 , and  6 .

Fig. S3 .
Fig. S3.Performances of three ConvLSTM networks on validation data from chromosome 19 at times  4 ,  5 , and  6 .Pearson correlations between experimental Hi-C and predicted Hi-C from each of the three ConvLSTMs at each genomic distance.The three ConvLSTMs were trained with the same hyperparameters: the batch size of 32, the kernel size of 5, and the number of layers equal to 4.

Fig. S4 .
Fig. S4.Deeper ResConvLSTM performs better on validation data from chromosome 19.Pearson correlations between experimental Hi-C and predicted Hi-C from each of the three ResConvLSTMs at each genomic distance.The three ResConvLSTM networks were trained with the same kernel size of 7 and the same batch size of 32, but with different hidden dimensions and number of layers because of the limitation of GPU memory.The hidden dimensions of the 8-layer, 28-layer, and 50-layer networks (together with two more convolutional layers) are 128, 64, and 32, respectively.

Fig. S5 .
Fig. S5.ST-LSTM without decouple loss performs better on validation data from chromosome 19 at time  6 .Pearson correlations between experimental Hi-C and predicted Hi-C from each of the three ST-LSTMs at each genomic distance.The three ST-LSTMs were trained with different configurations.The first one was equipped with the batch size of 32, the kernel size of 5, and the number of layers equal to 4 together with only MSE loss.The second one was equipped with the same hyperparameters as the first one but used MSE plus decouple loss.The last one was equipped with the same batch size (32) and the number of layers (4), but with the kernel size of 7 together with MSE plus 0.1 times decouple loss.

Fig. S6 .
Fig. S6.Pearson correlations between insulation scores calculated on experimental (ground truth) and predicted Hi-C contact matrices from nine methods on dataset 1.

Fig. S8 .
Fig. S8.Performances of four methods on chromosome 2 from dataset 1 for using different lengths of input time steps.(A) Boxplots of Pearson correlations between ground truth and predicted Hi-C contact matrices at each genomic distance.P-values were computed using the paired Wilcoxon test.(B) SCC scores between ground truth and predicted Hi-C contact matrices from the four methods.

Fig. S9 .
Fig. S9.Performances of the four methods on chromosome 6 from dataset 1 for using different lengths of input time steps.(A) Boxplots of Pearson correlations between ground truth and predicted Hi-C contact matrices at each genomic distance.P-values were computed using the paired Wilcoxon test.(B) SCC scores between ground truth and predicted Hi-C contact matrices from the four methods.

Fig. S10 .
Fig. S10.Pearson correlations between insulation scores calculated on experimental (ground truth) and predicted Hi-C contact matrices on dataset 2.

Fig. S11 .
Fig. S11.TAD-recovering results on two testing chromosomes 2 and 6 for dataset 2. Valleys/minima of insulation scores from predicted Hi-C contact matrices around strong TAD boundaries called on experimental Hi-C.

Fig. S12 .
Fig. S12.TAD-recovering results on chromosome 2 for dataset 2. Hi-C heat maps and their corresponding insulation-score curves for ground truth and ResConvLSTM predictions.

Fig. S13 .
Fig. S13.Reproducibility results for dataset 3. (A) Boxplots of Pearson correlations between ground truth and predicted Hi-C contact matrices at each genomic distance for two testing chromosomes (2 and 6).For clarity, only P-values between ResConvLSTM and each of the other four methods (NaiveNet, SimVP, SA-LSTM, and ConvLSTM) were shown for all plots with statistical tests.P-values were computed using the paired Wilcoxon test.(B) SCC scores between ground truth and predicted Hi-C contact matrices from the nine methods.(C) SCC scores between ground-truth Hi-C matrices from each pair of the six time-steps for two testing chromosomes.(D) SCC scores between predicted Hi-C matrices from each pair of the three predicted time-steps.

Fig. S14 .
Fig. S14.Genome-wide average SCC scores between ground-truth Hi-C matrices from each pair of the five time-steps on dataset 4.

Fig. S15 .
Fig. S15.Running time summary for predicting all chromosomes from dataset 4 when an NIVIDIA A100 equipped with 40GB memory was used and the batch size for all testing data is set to one.A minimum of 2.5 GB of GPU memory is required for all nine models.

Fig. S16 .
Fig. S16.Genome-wide average SCC scores between ground-truth Hi-C matrices from each pair of the six time-steps on dataset 5.

Fig. S17 .
Fig. S17.Genome-wide average SCC scores between ground-truth Hi-C matrices from each pair of the six time-steps on dataset 6.

Fig. S18 .
Fig. S18.Reproducibility results for dataset 6. (A) Boxplots of Pearson correlations at each genomic distance.P-values were computed using the paired Wilcoxon test.(B) SCC scores between ground truth and predicted Hi-C contact matrices.P-values were computed using the paired t-test.Both correlation and SCC values were collected from all chromosomes.

Fig. S19 .
Fig. S19.Genome-wide average SCC scores between ground-truth Hi-C matrices from each pair of the five time-steps on dataset 7.

Fig. S20 .
Fig. S20.Genome-wide average SCC scores between ground-truth Hi-C matrices from each pair of the six time-steps on dataset 8.

Fig. S21 .
Fig. S21.Reproducibility results for dataset 8. (A) Boxplots of Pearson correlations at each genomic distance.P-values were computed using the paired Wilcoxon test.(B) SCC scores between ground truth and predicted Hi-C contact matrices.P-values were computed using the paired t-test.Both correlation and SCC values were collected from all chromosomes.

Table S1 .
Details of dataset 1 about numbers of valid read pairs at each time step.

Table S2 .
Details of dataset 2 about numbers of contact pairs at each time step.

Table S3 .
Details of dataset 3 about numbers of valid read pairs at each time step.

Table S4 .
Details of dataset 4 about numbers of contact pairs at each time step.

Table S5 .
Details of dataset 5 for medaka.

Table S6 .
Details of dataset 6 for Xenopus tropicalis.

Table S7 .
Details of dataset 7 for human cardiogenesis.

Table S8 .
Details of dataset 8 for mouse cell reprogramming.

Table S9 .
Validation results of hyperparameter tuning for three next-frame methods (ConvLSTM, ResConvLSTM, and ST-LSTM) and two 3-step ahead methods (SimVP and NaiveNet).The further evaluation results for the highlighted models are shown in Results section.Three ResConvLSTM variants (ResConvGRU, ResConvMUT, and ResConvLSTM2) are trained with hyperparameters same as the final ResCOnvLSTM we used.The hidden dimensions in parentheses for SA-LSTM are for self-attention module.

Table S10 .
The top 3 methods that having the largest sum of all SCC scores for each of the eight datasets.