HiChIP-Peaks: a HiChIP peak calling algorithm

Abstract Motivation HiChIP is a powerful tool to interrogate 3D chromatin organization. Current tools to analyse chromatin looping mechanisms using HiChIP data require the identification of loop anchors to work properly. However, current approaches to discover these anchors from HiChIP data are not satisfactory, having either a very high false discovery rate or strong dependence on sequencing depth. Moreover, these tools do not allow quantitative comparison of peaks across different samples, failing to fully exploit the information available from HiChIP datasets. Results We develop a new tool based on a representation of HiChIP data centred on the re-ligation sites to identify peaks from HiChIP datasets, which can subsequently be used in other tools for loop discovery. This increases the reliability of these tools and improves recall rate as sequencing depth is reduced. We also provide a method to count reads mapping to peaks across samples, which can be used for differential peak analysis using HiChIP data. Availability and implementation HiChIP-Peaks is freely available at https://github.com/ChenfuShi/HiChIP_peaks. Supplementary information Supplementary data are available at Bioinformatics online.


Table S2
Motifs identified as significantly enriched from differentially bound peaks. Results from HOMER using differentially bound peaks from the specified contrasts.

Figure S1
Reads are located around the restriction site, as expected from a Hi-C library. A) Distribution of reads' distance from closest restriction site. Dataset from Naïve T cells, Biological replicate 1, technical replicate 1. B-C) Pileup of short-range reads from HiChIP dataset Naïve T cells combined (using FitHiChIP's utility). The signal is more intense around the restriction sites, creating sparsity elsewhere which can bias other peak calling methods resulting in many small peaks. Hichipper attempts to compensate this by extending the peaks to the nearest fragment. HiChIP-peaks is based on the religation site and doing so ignores the sparsity by design and maximises usable information. Data shown is from GM12878.
A Figure S2

Proportion of read pairs identified as dangling ends, self-circle and re-ligation from the Naïve T cells dataset and a Hi-C library generated with the Arima-Hi-C kit (unpublished).
Libraries generated with the Arima-Hi-C kit have almost no dangling ends and self-circle reads that can be used by Hichipper in SELF mode.

Figure S3
Read count distribution per restriction site as used for HiChIP-Peaks. The noise signal from the HiChIP datasets resembles a negative binomial distribution. Data from the combined Naïve T cells dataset.  Figure S4 Fragment size bias fitted using a LOWESS fit. Fragment size is the sum of the fragments within the tested re-ligation sites. We use this fit to correct the expected background by fragment size in the negative binomial test. Data from the combined Naïve T cells dataset.

Figure S5
P-value distribution from negative binomial test. The null distribution is uniform, showing an appropriate fit of the background model. Data from the combined Naïve T cells dataset.

Peaks recalled from reference vs peaks called from Naïve T cells (A) and GM12878 (B).
Our algorithm can identify more peaks from the reference while calling fewer peaks.
A B Figure S7 Peaks recalled from reference vs genome covered from Naïve T cells (A) and GM12878 (B). Even though the peaks identified by our algorithm can be larger than Hichipper's we can still identify more peaks from the reference at the same amount of genome covered when FDR is set at less than 0.01. FitHiChIP fares unfairly well in this metric because it produces a lot of small peaks that are dispersed along the genome.
A B Figure S8

Precision-recall values for peak calling in subsampling analysis (vs full dataset). Our peak calling method is significantly more consistent compared to hichipper when read depth is reduced. (A) CD4+ Naïve T cells. (B) GM12878 cells.
A B Figure S9 Overlap analysis between loops called using full depth datasets and subsampled datasets for CD4+ Naïve T cells. Supplying our peaks to Hichipper increases the recall of the loops identified from the full dataset without significant degradation in precision. (A) recall vs number of loops called. (B) precisionrecall plot.
A B Figure S10 Overlap analysis between loops called using full depth datasets and subsampled datasets for GM12878 cells. Supplying our peaks to Hichipper increases the recall of the loops identified from the full dataset without significant degradation in precision. (A) recall vs number of loops called. (B) precision-recall plot.
A B Figure S11 Overlap analysis between reference promoter capture Hi-C and Hichipper with default peaks and with our peaks in CD4+ Naïve T cells. Using our peaks to run Hichipper provides a higher recall at the same number of loops called (A) and with a higher precision (B) than using the default peaks.
A B Figure S12 Overlap analysis between reference promoter capture Hi-C and Hichipper with default peaks and with our peaks in GM12878 cells. Using our peaks to run Hichipper provides a higher recall at the same number of loops called (A) and with a higher precision (B) than using the default peaks.
A B Figure S13 Overlap analysis between reference H3K27ac ChIA-PET and Hichipper with default peaks and with our peaks in K562 cells. Using our peaks to run Hichipper provides a higher recall at the same number of loops called (A) and with a higher precision (B) than using the default peaks.
A B Figure S14

Comparing genome covered by loops with recall between reference promoter capture Hi-C and Hichipper with default peaks and with our peaks in CD4+ T cells (A) and GM12878 cells (B).
The anchors generate using peaks from our software are larger due to the larger size of the anchors. We show that even comparing the total basepairs covered by anchors with recall we still have a better ratio.
A B Figure S15 Comparing genome covered by loops with recall between H3K27ac ChIA-PET and Hichipper with default peaks and with our peaks in K562 cells. The anchors generate using peaks from our software are larger due to the larger size of the anchors. We show that even comparing the total basepairs covered by anchors with recall we still have a better ratio. These differences a larger due to the fact that H3K27ac ChIA-PET is a technique focused on H3K27ac.

Figure S16
Overlap analysis between FitHiChIP loops called using the two different peaks sets. Fithichip shows a very high congruency of the loops regardless of the peaks used. (A) recall vs loops called compared to the other setting. (B) precision-recall plot called compared to the other setting.
A B Figure S17 Overlap analysis between Hichipper loops called using the two different peaks sets.  Figure S18 Overlap analysis between reference promoter capture Hi-C and FitHiChIP with default peaks and with our peaks in CD4+ Naïve T cells. Fithichip shows a very high congruency of the loops regardless of the peaks used. (A) recall vs loops called. (B) precision-recall plot.
A B Figure S19 Overlap analysis between reference promoter capture Hi-C and FitHiChIP with default peaks and with our peaks in GM12878 cells. Fithichip shows a very high congruency of the loops regardless of the peaks used. (A) recall vs loops called. (B) precision-recall plot.
A B Figure S20 P-value distribution from DESeq2 for differential analysis. Naïve T cells, biological replicate B2 vs B3. The p-value distribution is as expected in the test from DESeq2, which is a U shape when lfc is set to anything different than 0. (A) lfc = 0.5 (used for the results presented in this paper). (B) lfc = 0.
A B