Removing unwanted variation between samples in Hi-C experiments

Abstract Hi-C data are commonly normalized using single sample processing methods, with focus on comparisons between regions within a given contact map. Here, we aim to compare contact maps across different samples. We demonstrate that unwanted variation, of likely technical origin, is present in Hi-C data with replicates from different individuals, and that properties of this unwanted variation change across the contact map. We present band-wise normalization and batch correction, a method for normalization and batch correction of Hi-C data and show that it substantially improves comparisons across samples, including in a quantitative trait loci analysis as well as differential enrichment across cell types.


Introduction
The Hi-C assay allows for genome-wide measurements of chromatin interactions between different genomic regions (Lieberman-Aiden et al., 2009;Wit and Laat, 2012;Dekker, Marti-Renom, and Mirny, 2013;Schmitt, Hu, and Ren, 2016;Davies et al., 2017). Hi-C has predominately been used to comprehensively study differences in 3D genome structure between loci within a cell type. Partly because of the high cost of the assay, the role of interpersonal variation in 3D genome structure is largely unexplored.
When comparing genomic data between samples, variation can arise from numerous sources that do not reflect the biology of interest including sample procurement, sample storage, library preparation, and sequencing. We refer to these sources of variation as "unwanted" here, because they obscure the underlying biology that is of interest when performing a between-sample comparison. It is critical to correct for this unwanted variation in analysis (Leek, Scharpf, et al., 2010). A number of tools and extensions have been successful at this, particularly for analysis of gene expression data (Leek and Storey, 2007;Leek and Storey, 2008;Gagnon-Bartsch and Speed, 2012;Johnson, Li, and Rabinovic, 2007;Stegle et al., 2010;Leek, 2014;Risso et al., 2014). Existing normalization methods for Hi-C data are single sample methods, focused on comparisons between different loci in the genome. To facilitate this, some methods explicitly model sources of unwanted variation, such as GC content of interaction loci, fragment length, mappability and copy number (Yaffe and Tanay, 2011;Hu et al., 2012;Vidal et al., 2017). Other methods are agnostic to sources of bias and attempts to balance the marginal distribution of contacts (Imakaev et al., 2012;Knight and Ruiz, 2013;Rao et al., 2014;Yan et al., 2017). A comparison of some of these methods found extremely high correlation between their correction factors (Rao et al., 2014); we will use HiCNorm as an exemplar of these within-sample normalization methods (Hu et al., 2012).
By contrast, there has been less work on between-sample normalization. Two existing methods have considered between-sample normalization in the context of a differential comparison, both based on the idea of loess normalization from gene expression microarrays (Yang, Dudoit, et al., 2002). In these methods, the estimated fold-change between conditions are modeled using a loess smoother as a function of either average contact strength (Lun and Smyth, 2015) or distance between loci (Stansfield and Dozmorov, 2017). Using the estimated model, the data are corrected so there is no effect of the covariate on the fold-change. These approaches require a specific comparison of interest and only stabilizes the mean fold-change.
To address the pressing need for cross-sample normalization methods for Hi-C, we developed a BNBC (Bandwise Normalization and Batch Correction), a method for normalization and batch correction of Hi-C data. The method is focused on making individual entries in the contact maps comparable across samples. Our approach is inspired by the observation that patterns of variation between replicates of Hi-C data are different depending on the distance between interacting loci. Therefore, our approach conditions on 1D genomic distance, aggregating all contacts between loci separated by a specific distance across all samples into a matrix where the columns are each sample's Hi-C interactions for a specific inter-locus interaction distance. Normalization and removal of unwanted variation proceeds on these separate matrices, which we term band matrices. We show that important biological and statistical features of the Hi-C contact maps are preserved using this approach, while stabilizing the marginal distributions across samples and substantially reducing unwanted variation.

Unwanted variation in Hi-C data varies between distance stratum
It is well described that a Hi-C contact map exhibits an exponential decay in signal as the distance between loci increases (Lieberman-Aiden et al., 2009). When we quantify this behavior across biological replicates (lymphoblastoid cell lines generated from 3 individuals from each of 3 trios from the HapMap project, Table 1), each with 2 growth replicates, we observe substantial variation in the decay rate from sample to sample (Figure 1a). We use the term "biological replicate" here, as it is widely used in the context of a populationbased study where each biological replicate is a sample from a different individual. Our samples are lymphoblastoid cell lines from the HapMap project (International HapMap Consortium, 2003), because these cell lines have been a widely used model system to study inter-individual variation and genetic mechanisms in numerous molecular phenotypes including gene expression, chromatin accessibility, histone modification, and DNA methylation (Stranger et al., 2007;Pickrell et al., 2010;Montgomery et al., 2010;Degner et al., 2012;Kasowski et al., 2013;McVicker et al., 2013;Kilpinen et al., 2013;Bell et al., 2011). The Hi-C data from 8 lymphoblastoid cell lines were normalized within growth replicate using HiCNorm (Hu et al., 2012). Following application of HiCNorm, contact maps were corrected for library size using the log counts per million transformation and smoothed using the HiCRep approach (Yang, Zhang, et al., 2017); a bandwidth of 5 was selected using this approach (Methods). Smoothing of the contact map has been found beneficial (Yang, Zhang, et al., 2017;Ursu et al., 2017;Yaffe and Tanay, 2011;Imakaev et al., 2012); recent work, which we confirm, has found that the correlation between technical replicates are increased by smoothing (Yang, Zhang, et al., 2017).
The library preparation of these samples were done at 3 different time points, and we use these different time points to define a batch factor (Table 1). This batch factor encompasses other potential differences between the samples (aside from library perpetration batch), because the trios sampled here come from 3 different human populations (Yoruba, Han Chinese and Puerto Rico), and all of the Yoruba libraries (from Ibadan, Nigeria) were prepared on the same date. The other two populations have one growth replicate in each of two batches. It has been established that phenotypic differences, which are unlikely to be explained by genetics, exists between lymphoblastoid cell lines from different HapMap populations (Stark et al., 2010;Choy et al., 2008;Stranger et al., 2007). These differences might be related to cell line creation and division (Stark et al., 2010). For this reason, it is hard to separate out the effect of Hi-C experimental batch from cell line creation and division in our data. Nonetheless, both types of effects represents unwanted variation insofar as they confound attempts to study the inter-individual variation in 3D genome organization.
To assess unwanted variation beyond changes in the mean, we represented our data as a set of matrices indexed by genomic distance (Figure 1b). Each matrix contains all contacts between loci at a fixed genomic distance for all samples (Methods). We call this a band transformation, since these contacts form diagonal bands in the original Hi-C contact matrices. For each band, we observe substantial variation in the distribution of contacts between samples (Figure 1c-e). These marginal distributions suggests the presence of a unwanted variation (Leek, Scharpf, et al., 2010). We argue that this variation is unwanted across biological replicates, since our data reveal it is at least partly technical. Note that not all contact distances are treated equally when interpreting Hi-C data: one goal of Hi-C experiments is to identify enhancer-promoter contacts, which are thought to occur primarily with 1 Mb (Vernimmen and Bickmore, 2015).
To assess the impact of unwanted variation on our Hi-C data, we first asked, for each contact, how much variation is explained by the batch factor? We measured the amount of explained variation using R 2 from a linear mixed effects model with a random effect to model the increased correlation between growth replicates (Methods). We observe an association between explained variation and distance between loci ( Figure 2a), with an average R 2 value of 0.667. This suggests that the effect of the batch factor is substantial and changes with distance. We note again that the "batch factor" here is not simply a Hi-C experiment batch, because the variation between these batches has several potential sources as explained above. To further explore the effect of batch, we performed PCA on each of the band matrices and computed Spearman correlation between each of the first four principal components and the batch indicator. Since our batch factor has 3 levels there are 3 possible orderings of the factor (Figure 2b for one ordering, Supplementary Figure S1 for the other two). This again shows substantial unwanted variation associated with the batch factor, and furthermore shows the dynamic nature of the unwanted variation as distance changes.
To ensure that these data characteristics were not introduced by HiCNorm combined with smoothing, we performed the same measurements on the raw Hi-C contact maps found similar characteristics (Supplementary Figure S2), with one exception. Specifically, we found that the smoothing with a larger bandwidth greatly increased the variation explained by the batch factor (Supplementary Figure S3).
Together, our results highlight the need for between-sample normalization and removal of unwanted variation for Hi-C data, and demonstrates that the effect of unwanted vari-ation depends on genomic distance between loci.

Band-wise normalization and batch correction
To normalize the data and remove unwanted variation we used the band transformation framework. Prior to band transformation we use a 2D smoother on the contact maps. Following smoothing we perform quantile normalization separately on each band matrix. Finally, we apply ComBat (Johnson, Li, and Rabinovic, 2007) separately on each band matrix using the batch factor variable. As said above, we refer to our method as band-wise normalization and batch correction (BNBC) (see Methods). This approach is not critically dependent on the choice of smoother or bandwidth nor on the usage of HiCNorm; we observe similar performance across these choices ( Figure S3c).
To assess the effect of BNBC, we again measured the variation explained by the batch factor and observed a remarkable decrease of this quantity, which no longer changes as a function of distance ( Figure 2c). Specifically mean R 2 decreases from 0.667 to 0.09. Comparing R 2 between HiCNorm and BNBC shows a decrease of essentially every individual contact ( Figure 2e); this pattern depends on distance (Supplementary Figure S4). Likewise, the correlation between each of the first 4 principal components and the batch factor was close to zero ( Figure 2d). In addition, the marginal distributions are stabilized, which is expected since we performed quantile normalization (Supplementary Figure S5). This shows that BNBC removes substantial unwanted variation associated with batch.
We next investigated the impact of BNBC on features of the contact map. The BNBCcorrected data exhibits the standard decay pattern of Hi-C data, without variation across replicates (Supplementary Figure S5a). More interestingly, we observe a contact map very similar to HiCNorm (Supplementary Figure S6). The same is true for its associated first eigenvector, which is commonly used to identify A/B compartments (Supplementary Figure S6). We conclude that the application of BNBC does not distort gross features of the contact map.
Above we show that increasing the bandwidth of the smoother increases the variation explained by the batch factor, and also increases the correlation between technical replicates. When we examine the impact of increased smoothing bandwidth following application of BNBC, we found little effect of bandwidth or put differently, that BNBC was able to correct for the increase. Since increasing the bandwidth does increase the correlation between technical replicates, we use the bandwidth recommended by the HiC-Rep approach (a bandwidth of 5). We note that the HiC-Rep criteria does not include consideration of biological signal and we caution that such signal could be diminished. For example, in work on normalization of DNA methylation arrays, we found that methods which performs best at reducing technical variation do not necessarily perform best when the assessment is replication of biological signal (Fortin, Labbe, et al., 2014).
Popular alternatives to ComBat include SVA (Leek and Storey, 2007;Leek and Storey, 2008;Leek, 2014), RUV (Gagnon- Bartsch and Speed, 2012;Risso et al., 2014) and PEER (Stegle et al., 2010) which are all variation of factor models. These methods construct surrogate variables which represents unmeasured sources of unwanted variation. We experimented with the use of PEER instead of ComBat and observed dramatically reduced performance compared to ComBat (Figures S7, S1). Note that the choice of R 2 for evaluation metric can be considered unfair since ComBat uses the batch factor as input; the correlation plots should not be affected by this. We ran PEER using both 1 and 4 factors; results were very similar. The performance of PEER raises the question of how to best correct for unwanted variation when an explicit batch factor is unavailable. This is an important open question because (1) ComBat requires two samples for each level of the batch factor and (2) unwanted variation may be mediated through other factors than library preparation batch.

Discussion
To analyze Hi-C across samples, including biological replicates, it is clear that betweensample normalization methods are necessary. Here, we have characterized unwanted variation present in Hi-C contact maps and have developed a correction method named BNBC. We show unwanted variation exhibits a distance-dependent effect, in addition to known distance-based features of Hi-C contact maps. We present BNBC, a modular approach where we combine band transformation with existing tools for normalization and removal of unwanted variation. We show that BNBC performs well in reducing the impact of unwanted variation while still preserving important 3D features, such as the structure of the contact map and A/B compartments. Our focus in this work has been the normalization of individual entries in the contact map, but we note that proper normalization of such entries are not a requirement for normalization of higher-order structures. For example, we have previously observed that A/B compartments reproduce well between samples from the same cell type (Fortin and Hansen, 2015). Note that the batch factor we have analyzed here could be driven by either or both of cell line construction and Hi-C library preparation. Proper normalization and correction for unwanted variation will be critical for comparing Hi-C contact maps between different samples.

Data Generation
Hi-C experiments: Lymphoblast Hi-C data analyzed were generated by the dilution Hi-C method using HindIII (Lieberman-Aiden et al., 2009) on 9 lymphoblastoid cell lines derived from the 1000 Genomes project (Table 1). Data are publicly available through 1000 genomes (Chaisson et al., 2017) as well as through the 4D Nucleome data portal (https://data.4dnucleome.org; accessions 4DNESYUYFD6H, 4DNESVKLYDOH, 4DNESHGL976U, 4DNESJ1VX52C, 4DNESI2UKI7P, 4DNESTAPSPUC, 4DNES4GSP9S4, 4DNESJIYRA44, 4DNESE3ICNE1). Hi-C contact matrices were generated by tiling the genome into 40kb bins and counting the number of interactions between bins. We refer to these as raw contact matrices.
Hi-C read alignment and contact matrices: Reads were aligned to hg19 reference genome using bwa-mem (Li, 2013). Read ends were aligned independently as paired-end model in BWA cannot handle the long insert size of Hi-C reads. Aligned reads were further filtered to keep only the 5' alignment. Read pairs were then manually paired. Read pairs with low mapping quality (MAPQ¡10) were discarded, and PCR duplicates were removed using Picard tools 1.131 http://broadinstitute.github.io/picard. To construct the contact matrices, Hi-C read pairs were assigned to predefined 40Kb genomic bins. Bins with low mapping quality (< 0.8), low GC content (< 0.3), and low fragment length (< 10% of the bin size) were discarded.

Band Matrices
To make comparisons across individuals, we form band matrices, which are matrices whose columns are all matrix band i from each sample. A matrix band is a collection of entries in a contact matrix between two loci at a fixed distance. Formally, band i is the collection of j, k entries with |j − k| + 1 = i.

Log counts per million transformation
We use the logCPM (log counts per million) transformation previous described (Law et al., 2014). Specifically, for a contact matrix X we estimate library size L by the sum of the upper triangular matrix of each of the chromosome specific contact matrices. This discards inter-chromosomal contacts as well as the diagonal of the contact matrix. The logCPM matrix Y is defined as where X ij refers to element i, j from the contact matrix X and L is the estimated library size for that matrix. For data normalized using HiCNorm both X and L are not integers.

HiCNorm
We normalized data using HiCNorm (Hu et al., 2012) with an updated implementation (https://github.com/ren-lab/HiCNorm). Following HiCNorm normalization, we applied the log counts per million transformation (see above). We then smooth the contact matrices with a box smoother with a bandwidth of 5 bins; we use HiCRep to choose the bandwidth based on the correlation between technical replicates (Yang, Zhang, et al., 2017). The bandwidth we select is the same as the bandwidth selected for 40kb resolution Hi-C data in Yang, Zhang, et al. (2017). Smoothing was performed using the EBImage package (Pau et al., 2010); this is a separate but equivalent implementation to HiCRep.

BNBC
BNBC has the following components: separate smoothing of each contact matrix, application of the band transformation, quantile normalization on each band matrix and finally application of ComBat on each band matrix.
Following the log counts per million transformation of the raw contact matrices, we smooth individual chromosome matrices using a box smoother with a bandwidth of 5, as selected by the HiCRep approach (Yang, Zhang, et al., 2017). Each contact matrix and each chromosome is smoothed separately. We next apply the band transformation (see above) and quantile normalize each band matrix separately (Bolstad et al., 2003).
Following quantile normalization we apply ComBat (Johnson, Li, and Rabinovic, 2007) to each band matrix separately. We apply the parametric prior described in Johnson, Li, and Rabinovic (2007). Prior to applying ComBat, we filter out matrix cells for which the intra-batch variance is zero for all batches. After applying ComBat we set filtered matrix cells to zero.

Explained variation and smoothed boxplot
To assess unwanted variation for each matrix cell in a contact matrix, we employ a linear mixed model approach. Specifically, we fit a mixed effect model regressing HiC contact strength on batch indicator, with a random effect at the subject level to capture the increased correlation between technical replicates. This model is fit using the R package varComp (Qu, Guennel, and Marshall, 2013) and R 2 for this model is calculated using the method of Edwards et al. (2008).
To display R 2 as a function of distance, we first compute a series of box plots of R 2 , one for each band matrix. We extract the summary measures for the box plots (median, 1st and 3rd quantile and 1.5 times the inter-quartile range). We then display these 5 curves, with color fills. Medians are black, 1st and 3rd quartiles are pink and 1.5 times the interquartile range are blue.

A/B compartments from smoothed contact matrices
A/B compartments were originally proposed to be estimated using the first eigenvector of a suitable transform of the contact matrix Lieberman-Aiden et al., 2009. Specifically, the contact matrix was transformed using the observed-expected transformation where each matrix band was divided by its mean. Our contact matrices following application of the log counts per million transform and smoothing are on the log scale. To get A/B compartments from the output of BNBC (Supplementary Figure S6), we exponentiate every entry in the matrix, multiply by 10 6 , apply the observed-expected transformation and compute the first eigenvector. Finally, we standardize the first eigenvectors to be in (−1, 1) and then smooth the standardized eigenvectors using a moving-average as done by Fortin and Hansen (2015).   Supplementary Figure S1. The performance assessment of all methods using correlation of batch with principal components. As in Figure 2b we assess the influence of batch using Spearman correlation between the batch factor and the 1st-4th principal components of each band matrix, for various methods. Column 1 uses the ordering batch 1, batch 2, batch 3. Column 2 uses the ordering batch 2, batch 1, batch 3. Column 3 uses the ordering batch 2, batch 3, batch 1.   Figure S7. The performance assessment of all methods using R 2 . As in Figure 2a we assess the influence of batch using the percent variation explained by the batch factor (R 2 ), as a function of distance, for various methods.