Exhaustive identification of genome-wide binding events of transcriptional regulators

Abstract Genome-wide binding assays aspire to map the complete binding pattern of gene regulators. Common practice relies on replication—duplicates or triplicates—and high stringency statistics to favor false negatives over false positives. Here we show that duplicates and triplicates of CUT&RUN are not sufficient to discover the entire activity of transcriptional regulators. We introduce ICEBERG (Increased Capture of Enrichment By Exhaustive Replicate aGgregation), a pipeline that harnesses large numbers of CUT&RUN replicates to discover the full set of binding events and chart the line between false positives and false negatives. We employed ICEBERG to map the full set of H3K4me3-marked regions, the targets of the co-factor β-catenin, and those of the transcription factor TBX3, in human colorectal cancer cells. The ICEBERG datasets allow benchmarking of individual replicates, comparing the performance of peak calling and replication approaches, and expose the arbitrary nature of strategies to identify reproducible peaks. Instead of a static view of genomic targets, ICEBERG establishes a spectrum of detection probabilities across the genome for a given factor, underlying the intrinsic dynamicity of its mechanism of action, and permitting to distinguish frequent from rare regulation events. Finally, ICEBERG discovered instances, undetectable with other approaches, that underlie novel mechanisms of colorectal cancer progression.


Introduction
Gene regulation is executed by a combinatorial action of DNA-binding proteins that physically associate with the DNA at regulatory sites across the genome to tether chromatin modifying complexes and RNA Polymerase II and start RNA transcription ( 1 ,2 ).Characterizing the genome-wide binding profile and positioning of all transcriptional regulators is therefore crucial to understand the dynamics of gene expression across development and disease ( 3 ).
Several technologies have been historically employed to map the binding behavior of gene regulators, among which the popular ChIP-seq (chromatin immunoprecipitation followed by sequencing) ( 4 ) and more recently CUT&RUN (Cleavage Under Targets and Release Using Nuclease) ( 5 ).CUT&RUN in particular is emerging as method of choice, for it does not require cross-linking, only necessitates low cell input ( 6 ), can be adapted to target all types of gene regulators ( 7 ), and its variants-such as CUT&Tag-are suitable for single-cell analyses ( 8 ).
These technologies rely on next generation DNA sequencing (NGS) of the genomic sites associated with the targeted protein, and are therefore intrinsically noisy ( 9 ).Peak-calling algorithms have been developed as necessary tools to track real signal over background: among these, commonly used are MACS2 ( 10 ) and the recently adapted for the CUT&RUN more sparse reads SEACR ( 11 ) and GoPeaks ( 12 ).Yetunsolved genomics complexities combined with technical artifacts, which include excessive read amplification and mapping biases, can produce spurious signal that is computationally removed by subtracting pre-packaged 'suspect lists' of problematic regions ( 13 ,14 ).
Common practice and working standards developed for ChIP-seq demand the use of experimental duplicates to assess biological reproducibility ( 15 ).Yet, widespread recent application of these technologies generated the consensus that more replicates may improve the reliability of peak identification ( 16 ).The focus remains on adopting stringent statistics and peak reproducibility-often employing the majority rule of 'trusting' a peak when called in 2 out of 3 experiments ( 16)-as rationale for peak calling.Many pipelines exist to attempt to account for replicates during ChIP-seq peak calling, but these still rely on the initial high stringency statistics ( 17 ,18 ).These practices are meant to exclude noisy background (true negatives) to make sure that no spurious signal is 'called' (false positives) often at the cost of excluding real biological events (false negatives) ( 19 ).
Similar to what many before us have noticed with ChIPseq, we have measured that across our several CUT&RUN experiments the overlap between replicates was invariably poor .Moreover , almost paradoxically, the attempt of increasing the rate of peak discovery by increasing the number of replicates led to a dramatic drop in size of the overlapping set across replicates, thus forcing us to rely only on a very small number of reproduced binding instances.These, we submit, must be the tip of an iceberg of biologically relevant events of interest, most of which are discarded.
To solve this issue, we developed ICEBERG (Increased Capture of Enrichment By Exhaustive Replicate aGgregation), a procedure that utilizes numerous CUT&RUN replicates to discover the entire set of binding events and consistently exclude false positives.We applied ICEBERG to profile H3K4me3-marked regulatory regions, β-catenin and TBX3 targets in human colorectal cancer (CRC) cells, a context in which Wnt / β-catenin constitutes the main oncogenic driver and is subject of intense investigation in search of druggable mechanisms to treat disease (20)(21)(22)(23)(24).We showed that ICE-BERG approaches the discovery of the entire binding profile of H3K4me3 and β-catenin, uncovering previously undetected cancer relevant binding events and revealing that transcriptional modulators, instead of a fixed number of target sites, are associated to a spectrum of detection probabilities across the genome.

CUT&RUN LoV-U
CUT&RUN LoV-U was performed as described in Zambanini et al. ( 7 ).250 000 cells / sample were harvested from culture flasks by incubation with Trypsin-EDTA (Cat.# 25200056, Gibco) for 5 min.The trypsin was quenched with culture media and the cells were washed twice with DPBS (Cat.#14190094, Thermo Fisher Scientific).Nuclear extraction was performed by three washes Nuclear Extraction (NE) buffer (HEPES-KOH pH-8.2 [20 mM], KCl [10 mM], Spermidine [0.5 mM], IGEPAL [0.05%], Glycerol [20%], Roche Complete Protease Inhibitor EDTA-Free).The nuclei were resuspended after the final wash in 40 μl NE per sample and bound to 10 μl Magnetic ConA Agarose beads equilibrated in binding buffer (HEPES pH 7.5 [20 mM], KCl [10 mM], CaCl 2 [1 mM], MnCl 2 [1 mM]) as described in Meers et al. ( 25 ).Bead binding proceeded on a rotator for 15 min, after which nuclei and beads were resuspended in 200 μl wash buffer per sample and split into PCR tubes.Beads were collected on the magnet and then resuspended in 200 μl EDTA wash buffer (wash buffer with EDTA [0.2 mM]) and incubated RT for 5 min.Samples were then directly resuspended in antibody buffer (wash buffer with antibody 1:100) and incubated ON at 4 • C on a rotator.Antibodies used included antiβ-catenin ABIN2855042, anti-H3K4me3 ABIN6971977 and anti-rabbit IgG ABIN101961.After ON incubation samples were washed 5 times and resuspended in 200 μl of pAG-MNase buffer (wash buffer with pAG-MNase 0.6 μg / ml) and incubated for 45 min at 4 • C on a rotator.pAG / MNase was a gift from Steven Henikoff (Addgene plasmid #123461; http:// n2t.net/ addgene:123461; RRID: Addgene_123461) and was expressed and purified in house as previously described ( 25 ).Next, samples were washed five times and during the last wash equilibrated 5 min in wet ice.Samples were resuspended in 200 μl ice cold wash buffer containing 2 mM CaCl 2 , and digestion was allowed for proceed for exactly 30 min in wet ice.The digestion buffer was removed and transferred to tubes containing 1.5 μl of 0.5 M EDTA and 1.5 μl of 0.5 M EGTA and mixed well.The reaction was stopped on the beads with 47 μl of 1 × urea STOP buffer (NaCl [100 mM], EDTA [2 mM], EGTA [2 mM], IGEPAL [0.5%], urea [8.8 M]) and the samples were incubated 1 h at 4 • C for elution.Beads were collected on the magnet and liquid transferred to the PCR tubes containing the digestion buffer.DNA purification was done with two rounds of bead purifications according to manufacturer's directions, using Mag-Bind TotalPure NGS beads (Cat.#M1327, Omega Bio-Tek) at 2 × (200 μl first round and 40 μl second round), and then finally resuspended in 20 μl Tris-HCl pH 7.5.

Library preparation and sequencing
Library preparation was done using the KAPA Hyper Prep Kit for Illumina sequencing (Cat.#KK8504, KAPA Biosystems) following the manufacturer's guidelines with modifications.0.4 × volume reactions were used for End repair and Atailing steps, starting with 20 μl CUT&RUN DNA.The thermocycler conditions were 12 • C for 15 min, 37 • C for 15 min and 58 • C for 25 min.Adapter ligation was also performed in 0.4 × volume reactions.KAPA Dual Indexed adapters were used at 0.15 μM.A DNA cleanup was performed after ligation with Mag-Bind TotalPure NGS beads at 1.2 × the sample volume, and DNA was resuspended in 10 mM Tris-HCl pH 7.5.Library amplification steps were performed in 0.5 × volume reactions.The cycling conditions were: 45 sec initial denaturation at 98 • C, 15 s denaturation at 98 • C, 10 s annealing / elongation at 60 • C, 1 min final extension at 72 • C, hold at 4 • C, for a total of 13 cycles.A post-amplification DNA purification was performed with beads at 1.2 × sample volume.Libraries were then size-selected using the E-Gel EX 2% agarose gel (Cat.# G402022, Invitrogen) and the E-Gel Power Snap Electrophoresis System (Invitrogen).Run time was set to 10 min and DNA between 150 and 500 bp was excised from the gel and purified using the QIAquick Gel Extraction Kit (Cat.#28706 × 4, Qiagen) according to the manufacturer's instructions.Libraries were quantified using Qubit (Thermo Scientific) and the high sensitivity DNA kit (Cat #Q32854, Thermo Scientific).Libraries were pooled according to nM and sequenced with 36 bp pair-end reads on the NextSeq 550 (Illumina) using the Illumina NextSeq 500 / 550 High Output Kit v2.5 (75 cycles) (Cat.#20024906, Illumina).Target depth was 5-10 million reads per sample.

Replicate quality control and alignment
Fastqc (version 0.11.9) ( 26 ) was used to assess initial read quality and duplication rates.Reads were trimmed using bbmap bbduk (version 38.18) ( 27 ) to remove adapters, known artifacts, poly A T and T A repeats, and poly G and C repeats.The trimmed reads were aligned to the hg38 genome with bowtie2 (version 2.4.5) ( 28 ) with the options -local -verysensitive-local -no-unal -no-mixed -no-discordant -phred33 -dovetail -I 0 -X 500.Samtools (version 1.11) ( 29 ) was used to convert sam to bam files, fix improperly paired mates, and remove duplicates.BEDTools (version 2.30.0) ( 30 ) was used to remove reads mapped to the CUT&RUN hg38 blacklist ( 14 ).Final read counts of usable fragments were determined from filtered and deduplicated bam files.Bedgraphs were created with BEDTools (version 2.23.0) ( 30 ) genomecov on pairend mode.Quality control correlation plots and PCAs were performed with deepTools (version 3.5.1-0)( 31 ).Bedgraphs were visualized in IGV ( 32 ).

Peak calling and decay curves
The 10 IgG negative controls were merged to create the aggregate IgG (approximately 70 million reads).The aggregate IgG was downsampled to 5 million reads for peak calling of individual replicate datasets.Initial peak calling was performed with MACS2 (version 2.2.6) ( 10 ) with the optionsf BAMPE -keep-dup all and -q 5e-2 with the downsampled IgG as the control.Intervene (version 0.6.4) ( 33 ) was used to create Venn diagrams.BedSect ( 34 ) was used to overlap all replicates, and the resulting analysis matrix files were used to create decay curves.The peaks were determined using awk to extract the exact regions called in at least x datasets, and then merging neighboring regions with BEDTools and calculating the number of unique regions.Peaks called in exactly x datasets were calculated by subtracting the x + 1 peaks from the x peaks.Curves were plotted in R. Pearson correlation between sequencing depth (reads) and peaks number or FrIP score was calculated with the ggscatter package (R; https: // github.com/kassambara/ ggpubr/ blob/ master/ R/ ggscatter.R ).

ICEBERG
The maximum number of usable fragments was determined by the smallest replicate to be 4 million for β-catenin and 5 million for H3K4me3, and each replicate was downsampled to this depth three different times; reads were shuffled before each downsampling.The downsampled replicates were merged one by one for each of the three splits to create three different aggregate datasets, containing 100 million reads for β-catenin and 50 million reads for H3K4me3, coming from 25 replicates and 10 replicates, respectively.Each of the three aggregates were separately peak called with MACS2 as described above, against the aggregate IgG dataset.Between each replicate addition, peaks were also called in the same way and the number of peaks were used to generate the ICEBERG build curves.This was done 5 times with the replicates in different random orders.The relationship between the number of aggregated replicates and called peaks was modeled by fitting a series of polynomial regressions of increasing order (up to 5th) in R (version 4.2.2, lm function).The function with the highest R 2 value was selected.Peaks called in at least two of the three aggregates were determined with BEDTools, and then further filtered to remove peaks not called in at least one individual replicate when called with MACS2 -p 1e-2, to finally generate the final set of ICE-BERG peaks.The ICEBERG peak sets were overlapped with the peak data for the decay curve of the replicates in order to create the ICEBERG peak decay curve and define the probability groups as explained in the results.

Benchmarking of replicates and peak callers
I-FrIPs were calculated as (reads within ICEBERG peaks) / (total mapped reads).For the benchmarking, MACS2 was used as described above, and also with -p 1e-2 and -p 1e-3 as thresholds.SEACR (version 1.3) ( 11 ) was used with the settings norm relaxed against IgG, norm stringent against IgG, and 0.001 stringent.For the 0.001 peaks, the same settings were applied to the IgG downsampled negative control, and peaks called in the negative control were subtracted from each replicate peak set with BEDTools subtract downstream.Precision was determined by (peaks within ICEBERG set) / (total called peaks), and recall as (peaks within ICEBERG / total ICEBERG peaks).F1 scores were calculated by the equation F1 = 2 × (Precision × Recall) / (Precision + Recall).Duplicate peak sets were obtained by using BEDtools intersect to overlap the peak sets of all the possible replicate combinations.For triplicates, random datasets were overlapped, and both the two of three and three of three peak sets were generated with BEDtools.Redundant results (those containing the same dataset multiple times) were removed.

Downstream analyses and graphs
Signal intensity plots and average profiles were created using ngs.plot(version 2.63) ( 35 ) with options -G hg38 -R bed -N 2 -SC global -IN 0 for peak regions, and with the options -G hg38 -R tss -L 1500 -SC global for the whole genome TSS H3K4me3 plot.Genomic region annotation was done with HOMER (version 4.11) ( 36 ) annotatePeaks on default settings, using -annStats to extract annotation statistics.Genes assigned to promoters were extracted from HOMER results.Motif analysis was done with HOMER with the -size given parameter to find de novo motifs.Peak annotation to genes (not only those in promoters) was done using GREAT (version 4.0.4) ( 37 ) on default settings.Gene ontology was performed for GO biological processes using ShinyGO ( 38 ), results were cutoff at FDR of 0.05.
Baseline RNA-seq data in the form of transcripts per million (TPM) for HCT116 were downloaded from the studies in reference here ( 46 ,47 ).Transcript IDs were converted to gene names, and the lists were filtered for expressed genes as defined by a TPM of > 5.Only genes expressed in both datasets were kept.The resulting list of expressed genes was compared with peak annotated gene sets generated as described above.Hypergeometric tests were performed in R with phyper.
A T AC-seq data for HCT116 from the ENCODE project was used ( 48 ).IDR thresholded peaks were overlapped using BEDTools with the ICEBERG peaks from each group to define the peak regions as open or closed chromatin.
ChIP-seq peaks of β-catenin in HCT116 were downloaded from Bottomly et al. ( 49 ) and lifted over to the hg38 genome using UCSC liftover ( 50 ).Fold-change over control bigwig files mapped to hg38 were downloaded for H3K27ac conducted in HCT116 ( 51 ), and for CTCF conducted in HCT116 ( 52 ), performed by the ENCODE consortium.For Supplementary Figure S1 , ENCODE ChIP-seq peak files (ENCODE4v1.6.1) were downloaded from the accessions ENCSR048CVK, ENCSR000DLN, ENCSR000EVE, ENCSR000EGE, ENCSR000AVB and ENCSR568PGX.For inter-experimental comparison, the preferred (starred, either pseudoreplicated peaks for histone marks, or IDR thresholded peaks for TFs) peak bed files from the union of the two replicates were compared between the experiments using Intervene.For replicate comparisons, the provided peak files for each replicate (histone marks: pseudoreplicated peaks, TFs: IDR thresholded peaks) were compared with Intervene.
HiChIP of H3K27ac in HCT116 data was performed by Chen et al. ( 53 ).bedpe files containing unfiltered loops downstream of HiChipper were downloaded from GEO GSE173699.Loops were filtered, keeping those present in both biological replicates, with a size greater than 5 kb, and a count average of 2 or greater.UCSC liftover was used to convert hg19 to hg38 coordinates, and then loops were over-lapped with β-catenin ICEBERG peaks using BEDTools pairtobed, retaining only loops with one or more anchors overlapping a β-catenin peak.β-Catenin anchored loops were then separated into groups based on peak probability, and the number of loops was divided by number of peaks to obtain average loops / peak.Loop strength was determined for each loop by the average loop count of the two replicates, and loop size was determined by the linear genomic distance between the start of the first anchor and the end of the second anchor.Log 10 of the loop size was used for the plots.Statistical testing was done using non-parametric Kruskal-Wallis test with pairwise comparisons and multiple testing corrections, using GraphPad Prism.

CUT&RUN assays are characterized by low concordance across replicates
Across the many CUT&RUN experiments we have performed, we have noticed a typically low overlap between replicates that persists regardless of protein target, peak caller choice, or statistical stringency (Figure 1 A).For example, of the total peaks called in our datasets targeting the histone modification H3K4me2 in HEK293T, 62.45% of peaks were called in both replicates.When targeting the transcription factor SOX2 in neural stem cells, the overlap was 48.92%, while in CUT&RUN against the co-factor β-catenin the overlap was only 8.85% (Figure 1 B, left column).We refer to the overlap as 'concordance'.The low concordance leaves the question open whether the remaining 31.55%,51.08% and 91.15% of called peaks are real peaks or not.We found this to be a persisting problem also when looking across other published datasets for different gene regulators.In three examples of histone modifications, H3K4me3, H3K9me3 and H3K27ac, the overlaps were 48.86%, 56.47% and 32.51%, respectively (Figure 1 B, top row).For the transcription factors Ikaros, SALL4, and GA T A1 the overlaps were 37.93%, 1.99% and 42.30% (Figure 1 B, middle row).For the co-factors CBP, HD AC1 and HD AC2, 8.40%, 26.24% and 17.03% of peaks were found concordant (Figure 1 B, bottom row).These examples were performed independently by different groups, on different models and analyzed with different pipelines, which reinforces that low concordance is a common outcome in the field.This problem is not only relevant for CUT&RUN, as ChIP-seq data historically also suffer from a lack of concordance.When analyzing public datasets from ENCODE, we found a striking lack of concordance both between replicates of individual experiments, and between experiments done on the same target and cell line by different labs.Similarly, as with CUT&RUN, we noticed that the concordance of datasets targeting histone modifications was typically better than those targeting transcription factors and other gene regulators ( Supplementary Figure S1 A-C).
Many algorithms have been developed in an attempt to computationally address the issue ( 16 ,17 ).Among these, the so-called majority approach ( 16 ), in which the large number of peaks called in the non-overlapping sets are assessed by a third replicate to recover more peaks.However, while using a two of three majority approach can increase the number of retrieved peaks, the number of concordant peaks (i.e.those called in three of three datasets) further decreases.

Increasing the number of replicates reduces the number of reliable events
To assess how the detection of concordant events changes with increasing replicates, we focused on two targets that, according to analysis above, are characterized by higher and lower concordance both when assayed via CUT&RUN (this study) or ChIP-seq ( 51 ,54 ).We performed high numbers of replicates of CUT&RUN LoV-U against H3K4me3, β-catenin and the negative control IgG in the human colorectal cancer of male origin, HCT116 cells, over two independent rounds of experiments (Figure 2 A), sequencing to a depth ranging from 5 to 10 million reads per sample.We called peaks in each of the replicates against the IgG negative control with MACS2, using standard threshold settings ( q < 0.05), producing individual datasets all of which appeared reliable based on the two standard internal controls, namely they display (i) high signal to noise ratio and (ii) enrichment on positive control loci (promoters for H3K4me3, Wnt responsive elements for β-catenin; Figure 2 B).Each dataset, based on (i) and (ii), would be considered as a reliable genome-wide binding pattern.However, when replicates are compared, they ranged in efficiency and number of peaks called.There is no significant correlation between sequencing depth ( R = 0.091, P = 0.66) and number of peaks called, or between depth and FrIP score ( R = -0.27,P = 0.19) ( Supplementary Figure S2 ).These datasets are not fully concordant, and the low overlap is exacerbated by increasing the number of replicates from 2 to 5 (Figure 2 C, Venn diagrams).The strictly concordant peaks (present in all replicates) for both H3K4me3 and β-catenin steadily decrease with increasing replicates.Using the majority approach (2 of 3, 3 of 4, 3 of 5) for H3K4me3 mildly increases the number of considered peaks, while the difference is greater for βcatenin (Figure 2 C, bottom graphs).When comparing the concordance percentage of these β-catenin replicates in HCT116, where β-catenin is constitutively active, it is interesting to note that they are similar to previous CUT&RUN of β-catenin in where > 90% of statistically significant e v ents ( q < 0.05) occur in five or less replicates.

PAGE 7 OF 18
HEK293T (Figure 1 , 8.85%) and of ChIP-seq in the same cell line ( Supplementary Figure S1 C, 7.2%), where the Wnt activator CHIR was used to induce activity of the pathway.This analysis raises the question of whether the decreasing concordance would ultimately stop and converge on a small set of the highest confidence binding events.To address this, we leveraged the large number of replicates (25 for β-catenin, 10 for H3K4me3) to build a 'decay curve' plotting all peaks called in at least 1, at least 2, at least 3, and so on to finally identify the peaks concordant among all replicates (Figure 2 D, left graphs).The H3K4me3 replicates display higher concordance, and this is reflected in a slower decay measured as the lower slope of the curve.On the other hand, the lower concordance of β-catenin replicates is reflected by a rapid initial decay: while 7314 unique peaks are called in at least one replicate, only 13 of these remain concordant in all 25 datasets.This underscores that a relatively large number of events only occur in subsets of replicates and might be therefore rare in their nature.To explore this, we estimated the frequency of their detection by ranking them based on the number of replicates in which they are called, namely plotting those occurring in only 1, only 2, only 3 and so on (Figure 2 D, right graphs).For H3K4me3, almost all of the peaks were concordant and called in all 10 replicates, and thus likely common events in the cell population, consistent with the known stability of chromatin features ( 55 ).On the other hand, this behavior was different for β-catenin, as > 85% of statistically significant events ( q < 0.05) occur in five or fewer replicates.
This analysis exposes several analytical problems.For instance, if only two or three replicates are performed, most binding events of transcriptional regulators risk being missed.Moreover, our analysis imposes the rejection of the majority rule (13 out of 25), which would result in losing more than 90% of detected events for β-catenin.Overall, our data clearly exposes the arbitrary nature of any choice as to what is considered 'reproducible'.

ICEBERG retrieves the complete set of genome-wide binding events
We set out to identify the full set of CUT&RUN peaks for our two targets, H3K4me3 and β-catenin.We constructed an analytical pipeline that would include the information retrieved in all replicates performed: Increased Capture of Enrichment By Exhaustive Replicate aGgregation (ICEBERG).In the ICEBERG approach, we pool equal numbers of random reads from each replicate to build a unique track that would uniformly represent all contributing experiments, where nonspecific signals would average into a homogenous background, abrogating false positives, and at the same time allowing the capture of all true signal enrichment, thereby not tolerating false negatives (Figure 3 A).ICEBERG entails the repetitive construction of aggregate datasets a total of three times with different read splits from the replicates to minimize the bias due to random read assignment.ICEBERG then employs MACS2 ( q < 0.05) to call peaks based on the majority rule (two out of three computational aggregates) against the compilation of the IgG controls ( n = 10) and filters the peaks to retain only those present in at least one of the individual experimental replicates at lower stringency ( P < 0.01) (Figure 3 B), to reduce risk of calling artifactual signal as peaks.Some arbitrary cutoffs that we applied, such as using three aggregates and q -value chosen for MACS2 achieved the best outcomes on our data; yet the ICEBERG approach could be fine-tuned ad hoc for other datasets, for example peak calling on an ICEBERG of a chromatin modification with broad peaks, such as H3K27me3, would require different settings.Additionally, in the presence of higher variability in sequencing depth and thus more possibility for bias, more aggregates might be needed, whereas performing sampling from datasets with extremely homogenous sequencing depth might be unnecessary.
In ICEBERG, subsequent aggregation of each individual dataset increases the number of discovered binding events.We fit a polynomial regression function to model the rate of peak discovery with increasing replicates for both H3K4me3 and β-catenin ( r 2 > 0.98) (Figure 3 C, top graphs).The shape of the curve is indicative of the rate of discovery.The slope is initially steeper, indicating that each individual replicate adds substantial new information.At higher numbers of replicates, the slope decreases, and the curve begins to plateau, as evidenced by the derivative approaching zero (Figure 3 C, bottom graphs).We hypothesize that ICEBERG would allow us to estimate how close we are to the discovery of the full set of binding events of a given factor.For H3K4me3, the regression function reaches a derivative of 0 between 9 and 10 replicates, suggesting ICEBERG may have been performed to exhaustion, identifying a total of 17 641 peaks.For β-catenin, the slope significantly decreases beyond 20 replicates, and the curve is predicted to plateau between 27 and 30 replicates with a total of 13585 to 13542 peaks, respectively, which would mean that our 25 replicates with 12 672 peaks have discovered over 93% of the entire β-catenin binding profile.
Several observations support the reliability of the ICEBERG datasets.First, both are characterized by high signal enrichment over the control, which in the case of H3K4me3 conforms to the expected shape of signal around the transcription start sites (Figure 3  Signal enrichment can be seen on known targets, as well as on many newly discovered regions (Figure 3 E).
To explore how ICEBERG would perform on another target, we applied it to map the binding pattern of the developmental transcription factor TBX3 in CRC cells ( 59 ).Here we also performed 25 replicates, all of which showed high signal to noise ratio and peaks at positive control loci that matched our previous ChIP-seq performed with a different antibody ( 60 ) ( Supplementary Figure S3 A).Similarly to β-catenin, and differently from H3K4m3, TBX3 showed a high rate of peak decay, with the majority of peaks being called in < 5 replicates (MACS2 q < 0.05, Supplementary Figure S3 B).This is in line with the hypothesis that DNA-binding transcription factors are relatively unstable on the chromatin ( 61 ) and present a dynamic, sliding behavior ( 62 ).For TBX3, ICEBERG was able to recover 9755 high confidence peaks ( Supplementary Figure S3 C), and as with β-catenin the ICEBERG aggregate dataset recovered peaks that would be undetectable at high stringency in single replicates ( Supplementary Figure S3 D).This data indicates that an ICEBERG approach is not only applicable but also beneficial when employed to assess a variety of transcriptional modulators.

ICEBERG allows comparing of replicates and benchmarking of peak calling strategies
We reasoned that the ICEBERG datasets can be used as a goldstandard against which to benchmark individual CUT&RUN datasets.We decided to focus on two parameters: the percentage of fragments within peaks (FrIP), a typical measurement to determine the efficiency of an individual experiment, and the signal profile around the peak summit, that exemplifies the signal to noise ratio.The ICEBERG datasets allow us to calculate the percentage of fragments of each replicate that fall within the ICEBERG peaks (I-FrIP).The I-FrIP is therefore a better measurement for the efficiency of each replicate.The average I-FrIP for H3K4me3 is 25.06% ( ±4.34%) and for βcatenin is 1.50% ( ±0.46%) as represented in the violin plots (Figure 4 A, left plots for each factor).The difference in I-FrIPs between the two targets is in line with the previously reported varying difficulty in detecting them ( 7 ).Moreover, the violins appear to be inverted: while for H3K4me3 most of the replicates fall within the higher end of the spectrum, indicating higher efficiency is more common, for β-catenin the outlier is instead the most efficient dataset while the majority are less efficient, confirming the intrinsic difficulty of its detection.Similarly, the average signal per million reads varies greatly for both factors ( > 2-fold for H3K4me3, > 3-fold for β-catenin), though the profile shape is consistent between replicates (Figure 4 A, right plots for each factor).The variability measured by these two metrics is striking as most of the experiments performed could be considered technical replicates and were processed in parallel.This indicates that very small differences between samples can result in drastically different datasets, and that this is likely not reflective of biologically meaningful differences.Therefore, ICEBERG implies that changes in signal intensity from one experimental group to another should be rigorously validated before being considered biologically driven.
We leveraged ICEBERG to benchmark the performance of commonly used peak calling strategies done on individual replicates by comparing their precision (fraction of called peaks that fall within the ICEBERG peaks) and recall (fraction of total ICEBERG peaks identified).A high precision indicates a low rate of false positives, whereas a high recall indicates a low rate of false negatives.We called peaks on each replicate using SEACR and MACS2 in various modalities: (i) SEACR on relaxed mode against IgG, (ii) SEACR stringent mode against IgG, (iii) SEACR 0.001 signal threshold with IgG peaks removed after peak calling, (iv) MACS2 against IgG P < 0.01, (v) MACS2 P < 0.001, (vi) MACS2 q < 0.05, and plotted their precision and recall.The ICEBERG is by definition situated in the top right corner (precision = 1, recall = 1), whereas the bottom left corner would represent high false positive and high false negative rates, the least desirable outcomes (precision = 0, recall = 0).For single replicates of H3K4me3 the replicates were relatively consistent within each peak calling strategy, all of which except for MACS2 P < 0.01 resulted in a > 0.8 precision.MACS2 P < 0.001 scored the highest recall, closely followed by MACS2 q < 0.05 (Figure 4 B, top left).For β-catenin the only strategy with a precision of > 0.8 was MACS2 q < 0.05, though the recall was extremely low at less than 0.2 (Figure 4 B, bottom left).Independently of the peak calling strategy used, β-catenin showed high differences between individual replicates, confirming the larger inter-replicate variability when compared with H3K4me3.
How to approach the ICEBERG with two or three replicates We have shown that to capture more than 90% of β-catenin binding events, > 20 datasets are needed.As most CUT&RUN studies are based on a limited number of replicates, we set out to test how well conducting 3 or < 3 replicates could capture the totality of binding events.We could test this by taking 2 or 3 replicates, overlapping them, and comparing how well in terms of precision and recall they identified the ICE-BERG peaks.Due to the variability between (Figure 4 A and B), the outcome would change based on which combination of samples was chosen.Therefore, we randomly paired our individual datasets in all possible combinations to create duplicates, and in > 50 random combinations to generate triplicates.We overlapped the peaks (2 of 2, 2 of 3, 3 of 3) and plotted the precision and recall scores.Depending on the combination of replicates, the precision and recall varied, but clear distributions could be seen based on the peak calling strategy used.For H3K4me3, precision was high across the board even with the lowest stringency thresholds (Figure 4 B, top row).More conservative approaches, like 2 of 2 or 3 of 3, resulted in lower recall than for 2 of 3, where with P < 0.01 it was possible to achieve > 0.9 precision and recall.For βcatenin, higher precision was achieved with replicates versus single datasets, but the recall remained generally low (Figure 4 B, bottom row).The least stringent MACS2 P < 0.01 in 2 of 3 could recall up to 60-80% of ICEBERG peaks, but only with up to 0.6 precision.Common practice that favors false negatives over false positives would lead to choosing more stringent approaches at the expense of recall.For an unbiased metric, we calculated F1 scores, which relies on the harmonic mean of precision and recall and provides a balance of false positives and false negatives.According to the highest mean F1 score, the best approach for H3K4me3 would be duplicates with MACS2 P < 0.01, and 2 of 3 MACS2 P < 0.01 for β-catenin (Figure 4 C).We suggest that scientists take this information into account when analyzing their data, and choose a strategy based on the biological questions being asked.

ICEBERG discovers that peaks are distributed in a spectrum of probability of detection
We have shown that ICEBERG approaches the full set of genome-wide binding events, and that most single replicates have an incomplete recall and / or precision.This could be due to two explanations: (i) each replicate identifies a random fraction of the complete set of binding events, and no peaks are preferentially detected across replicates, (ii) there could be different types of binding events, each associated with a probability of detection, reflecting their frequency of occurrence in a cell population.To distinguish between these two possibilities, we plotted how many times each ICEBERG peak is identified across replicates (MACS2 q < 0.05).For H3K4me3, each replicate does well in identifying the ICEBERG peaks, as shown by a low slope of the decay curve (Figure 5 A, left).The slope of the β-catenin decay curve on the other hand is steep, indicating most ICEBERG peaks are only identified in a few replicates (Figure 5 A, right).These patterns can be translated into probabilities of detection by ranking each ICE-BERG peak by the exact number of individual replicates in which it is detected.Over 90% of the H3K4me3 ICEBERG peaks are called in 9 or 10 of the single replicates (Figure 5 B, left), further supporting that H3K4me3 is likely easy to detect, homogenous among cells, and a stable feature of the chromatin in HCT116.Almost opposite to this behavior, the majority of β-catenin ICEBERG peaks are rarely detected in single replicates (Figure 5 B, right).
To explore this, we grouped the β-catenin ICEBERG peaks depending on their probability of detection: undetectable in single replicates at q < 0.05 and only detectable at lower stringency P < 0.01 (USR), called 1-5 times (up to 20% probability) at q < 0.05, called 6-10 times (21-40%), called 11-15 times (41-60%), called 16-20 times (61-80%), and those called 21-25 times (81-100%).Motif analysis performed individually on each group showed TCF / LEF as a top enriched motif ( de novo discovery with P < 1e −10 for all), confirming that each group, even the ones containing the rarest peaks, consisted of bona fide β-catenin binding events, while a comparison of the enrichment of the top known motifs reveals that the percentage of peaks containing TCF / LEF motifs increases with increasing detection probability ( Supplementary Figure S4 A).The probability groups were associated with a remarkably different signal to noise profile: the most probable peaks had the highest and most narrow peak of signal, while less probable peaks had progressively lower and broader signal domains (Figure 5 C).The signal profile of a peak is thus an indicator of its probability of detection.
We integrated RNA-seq of HCT116 ( 46 ,47 ) to determine correlation between peak detection probability and gene transcription.Hypergeometric tests revealed that overall, ICE-BERG peaks were associated to transcribed genes (59% of peak associated genes expressed: 1.27-fold enrichment, pvalue 9.54 e −242 , transcripts per million (TPM) > 5, peaks annotated to genes using GREAT) (Figure 5 D, left).This association increased when considering only peaks occurring in promoters, where peak-gene annotation is more precise (70.5% of peak associated genes expressed, 1.51-fold enriched, Pvalue 3.95 e −186 ) (Figure 5 D, center).The rare peaks were equally associated to occurring transcription as shown by both percentage of expressed peak annotated genes, and their level of expression (Figure 5 D, right), indicating that they are part of a biological pathway active in HCT116.
Genomic annotations of peak positions revealed that the high probability peaks were typically associated with intergenic regions and poorly associated with promoters (51% intergenic, 4% promoters) confirming what has been previously shown for TCF / LEF and β-catenin ( 57 ).However, the rarest peaks were statistically more enriched in promoters, and less so for intergenic regions (29% intergenic, 28% promoters: 4.6-fold enrichment, log P −4484 ) (Figure 5 E).Accordingly, the rarest peaks were more often marked by H3K4me3 than the high probability peaks (51.6% versus 39.2%) (Figure 5 F).The rarest peaks also appeared to be more often associated with non-Tn5-accessible chromatin within a cell population (ENCODE A T AC-seq for HCT116, IDR threshold peaks) ( 48 ) (Figure 5 G).This could reflect that these rare β-catenin binding instances are associated with a smaller fraction of cells in which the chromatin might be accessible, but not detectable by bulk level A T AC-sequencing.
In light of the identification of probability groups of βcatenin peaks, we wondered how traditional approached would fair in identifying peaks in the different subsets.Using the most permissive 'majority rule' from our datasets (13 of 25), we reach an overall precision of 0.94 and a recall of 0.023 ( Supplementary Figure S4 B).Expectedly, the majority of these peaks come from the higher probability groups, and thus we obtain a higher recall of these groups specifically.Next, we performed the same analysis for the peaks identified in Pagella et.al. ( 63 ), where CUT&RUN LoV-U was independently performed against β-catenin in HCT116, using two biological replicates and a MACS2 followed by IDR peak calling approach ( Supplementary Figure S4 B).Here, the stringent peak calling had a precision of 0.82 and recall of 0.080 when compared to ICEBERG, and similarly to a majority approach managed a much better recall of the high probability groups versus the lower probability groups (0.92 for 81-100% versus 0.02 for USR peaks), indicating that only ICE-BERG can confidently identify the full set of more dynamic peaks ( Supplementary Figure S4 B).
We also compared ICEBERG to a simpler approach, consisting of summing the reads deriving from different replicates into 'supersamples' and determined how these perform-in terms of peaks calling.For β-catenin we had a total of 156 million reads deriving from 25 samples sequenced homogenously.All reads were subsampled to produce supersamples with sequencing depths corresponding to that of 1, 5, 10, 15, 20 or 25 replicates.The emerging curve of peak discovery recapitulates the average build curve of ICEBERG ( Supplementary Figure S4 C, left), indicating that ICEBERG aggregates do not introduce bias into peak calling procedures and reflect the 'richness' of reads / observations captured across all samples.When this procedure was done for the TBX3 dataset, however, ICEBERG outperformed the supersamples ( Supplementary Figure S4 C, right).Based on our experimental design, the coverage of TBX3 samples ranged from 5-15 million on average but 1 sample alone had ca.40 million reads.ICEBERG equalizes the contribution of each replicate by using equal amounts of reads per sample, thereby preventing skewing the outcome towards experimentally derived idiosyncrasies of individual samples.While this might not be necessary when samples are relatively homogenous and sequenced with similar depth as in the case for our β-catenin experiments, it becomes of crucial importance for groups of samples that present higher inter-replicate variability, for example in sequencing depth, as was the case for our TBX3 tests.

Peak probability identified by ICEBERG predicts cis-regulatory element interactions
We have shown that β-catenin peaks can be divided into probability groups that are characterized by different features such as signal intensity, promoter association and chromatin accessibility.This raises the prediction that β-catenin binding event probability could be associated with differential functionally relevant regulatory events.To test our prediction, we focused on active cis-regulatory three-dimensional chromatin interactions by employing published HiChIP of H3K27ac in HCT116 ( 53 ).This assay uncovers distant regions that are engaging in gene regulation events and that are marked by the open chromatin mark often found in active enhancer regions ( 53 ).We focused on H3K27ac mediated loops that overlap with a β-catenin ICEBERG peak in at least one anchor and assessed the correlation between β-catenin peak detection frequency and (i) number of loops associated to each probability group, (ii) percentage of peaks involved in at least one loop and (iii) loop size in terms of linear genomic distance between the two anchors.We observed a strong correlation between peak probability and number of associated loops: high probability peaks were connected on average to ∼6 loops, while the PAGE 13 OF 18 lowest probability peaks were connected to ∼1 (Figure 6 A, left).Moreover, both percentage of peaks involved in loops and loop size showed a similar pattern, with higher probability peaks associated more frequently and with larger loops (Figure 6 A, center and right).
We turned our attention to the highly dynamic, cancer relevant and well-studied MYC locus, and its super enhancer ( 64 ).ICEBERG unearthed a plethora of β-catenin binding events within the MYC topologically associated domain, ranging in probability of detection (Figure 6 B, upper panel).The vast majority of peaks are associated to a H3K27ac loop, hence are more likely to mark functionally active regulatory elements.Within this locus, even low probability and undetectable in single replicate peaks at q < 0.05 (USR) are connected to loops.Our analysis highlights that while most βcatenin and H3K27ac loops are within the colorectal cancer MYC super-enhancer region and MYC gene body (Figure 6 B, bottom panel), others can be found further upstream, connecting the super-enhancer over 2 Mb distance with a high probability β-catenin peak near the TAD boundary, implying that different β-catenin bound loci may have different functions.

ICEBERG identifies cancer relevant, rare β-catenin direct targets
We set out to discover which are the rarest β-catenin ICE-BERG binding events.To do this, we started from the 6708 USR ICEBERG peaks (undetectable in single replicates, q < 0.05 as per Figure 5 B), and then progressively relaxed our stringency thresholds.1897 ICEBERG peaks remained undetectable in single replicates at lower stringency ( P < 0.001), and of these, 1713 ICEBERG peaks were called < 5 times at the lowest stringency ( P < 0.01).Finally, we defined the 524 ICEBERG peaks called in only one individual replicate with the lowest stringency ( P < 0.01) as the rarest β-catenin binding events (Figure 7 A).This group of rare events constitutes a new discovery made possible by ICEBERG.In fact, they can be identified in single replicates only by loosening the stringency to a point where false positives become overwhelming in number (precision < 0.2, Figure 4 B).Consistently, even a published ChIP-seq replicate for β-catenin in HCT116, which makes use of crosslinking, large number of cells, and high sequencing depth, only identified 7 / 524 ( 49 ).
Despite being rare events, they displayed up to 2.6-fold signal enrichment over the control and were characterized by the presence of AP1 and TCF / LEF family motifs as the most enriched (Figure 7 B).They displayed one of the highest enrichments for promoters that we have seen across groups of βcatenin targets (28% in promoters) (Figure 7 C, left).We annotated the 524 peaks to 840 genes using GREAT and performed gene ontology analysis to determine whether these rare β-catenin binding events could be associated to specific biological processes.Terms related to cell motility and migration were the most enriched (FDR < 0.05), indicating that they could be highly relevant for cancer invasiveness and metastasis (Figure 7 C, center).64.6% of the genes annotated to peaks were expressed in HCT116 ( > 5 TPM) (Figure 7 C, right).Among the highly expressed genes we found notable, cancer relevant players which have never been previously identified as physical targets of β-catenin (Figure 7 D).CD58 , for example, is a surface marker of CRC tumor-initiating stem cells that has previously been found to be expressed by only ∼ 4% of cells in CRC cell lines, correlating well with the observed rarity of the β-catenin peak, yet is capable of enhancing self-renewal and promoting epithelial to mesenchymal transition ( 65 ).CLK2 , CDC-like kinase 2, is known to be upregulated in CRC and is associated with poor survival.It is involved in splicing of Wnt pathway genes, and Wnt has been shown to be responsible for the CLK2 induced cancer progression ( 66 ,67 ).CLK2 inhibitors are now being tested in clinical trials to treat Wntdriven CRC ( 68 ).ASPM has been identified as a biomarker for CRC and is upregulated in CRC tissues, functioning to promote proliferation and inhibit apoptosis, affecting the cell cycle ( 69 ).ASPM levels are positively correlated with β-catenin levels ( 70 ).CDK9 , Cyclin-dependent kinase 9 plays a role in transcriptional elongation and has been shown to be a potential drug target in CRC ( 71 ).TET2 , ten-eleven translocation 2, acts as a tumor-suppressor in many types of cancer via its enzymatic DNA demethylation activity ( 72 ).This warrants screening of all the ICEBERG hits for their developmental or disease relevance.

Discussion
Current technologies such as the popular ChIP-seq and CUT&RUN provide experimental means to detect protein-DNA interactions genome-wide.However, as they rely on NGS, they are intrinsically noisy and pose the challenge of drawing arbitrary thresholds, which when relaxed can increase discovery with the cost of including false signal, and when stringent, become more precise at the expense of excluding real signal ( 16 ,19 ).Here, we attempt to address this problem using CUT&RUN, though this issue is central to the signal detection theory and is not limited to these types of technologies.
As scientists we typically prefer not to be wrong; thus, a common philosophical stance in molecular biology is to prefer false negatives over false positives, which increases our confidence in the reliability of the findings but limits the breadth of the experimental outcome.Here, we propose that this might not always be the best strategy.In the clinics, for example, it is arguably preferable to sustain the cost of investigating minor suspicions, rather than leaving a disease undiagnosed.This is not only an intellectual exercise, as the discoveries concerning binding of transcription factors are now employed for CRISPR-based mutagenesis strategies in clinical trials ( 73 ).
ICEBERG attempts to prevent the need for compromise between false positives and false negatives.By increasing the capture of signal enrichment through replicate aggregation until saturation of peak discovery, ICEBERG is able to average out background and expose reproducible yet typically undetectable signal enrichment.When applied to β-catenin, a notoriously difficult target, ICEBERG built with 25 replicates uncovers more than 93% of all targets according to the fitted function.Moreover, our calculations indicate that each additional replicate after 20 contributes to a proportionally small group of additional peaks.ICEBERG can therefore be adapted to understand how many replicates are needed for a given factor.We believe that this will depend on biophysical properties and mechanism of action of each factor.
We leveraged our datasets to benchmark common experimental strategies and peak calling pipelines and provide estimates of how these perform for different targets.Our results indicate that duplicates of a stable chromatin mark like H3K4me3 are sufficient to identify a significant portion of the ICEBERG with high precision.This is reinforced by the concordance of ENCODE ChIP-seq data for H3K4me3, which also shows high overlap between replicates and experiments ( Supplementary Figure S1 A).However, β-catenin, presumably as other dynamic gene regulators, requires more datasets to approach the totality of binding events.While we cannot rule out that antibody specificity or efficiency could contribute to the differences between the number of replicates require to achieve saturation, the concordance between replicates of ChIP-seq done with three different antibodies (Doumpas et al. ( 74 ), Supplementary Figure S1 C) is similar to the concordance of CUT&RUN done with multiple replicates of the same antibody in HEK293T (Figure 1 B) and in HCT116 (Figure 2 C), further suggesting that concordance is a result of the dynamicity of the factor itself, rather than technical aspects.
While ICEBERG might seem demanding, note that we used a total of 6250000 cells (250000 per replicate), 3.5 μg of antibody (0.14 μg per replicate), and 125 million reads (5 million per replicate) to produce it.These numbers are grossly comparable to the requirements of a single ChIP-seq setup, and with ICEBERG we gained the power of 25 independent replicates, versus one to two for ChIP-seq.Simply performing 2 or 3 CUT&RUN experiments and sequencing them to much higher depths, by this reasoning, would also be underpowered compared to ICEBERG.Working with HCT116, we did not need to scale down in terms of cell number, but CUT&RUN is routinely successfully performed on < 10000 cells ( 25 ), rendering ICEBERG applicable virtually to every tissue or cell type, and friendly to scalability and automatization.One could consider that a limitation of the ICEBERG is the difficulty in identifying an orthogonal technology to validate it.Historically, CUT&RUN has been benchmarked against a single or two replicates of ChIP-seq.However, ChIP-seq itself suffers from the same downfalls as CUT&RUN and has a similarly low overlap between replicates ( Supplementary Figure S1 ) ( 16 ,74 ).Limited replicates of ChIP-seq would thus constitute an underpowered comparison to a CUT&RUN ICEBERG; on the other hand, we posit that ChIP-seq experiments would require their own ICE-BERG.Similarly, this reasoning could be applicable to other high throughput NGS-based approaches.
ICEBERG does more than just identify all binding eventsit also associates them to a spectrum of detection probabilities.Instead of a binary (yes or no) view of binding events, each chromosomal position is associated with a probability of being bound by a given factor.Regions with the highest signal in the ICEBERG are associated with a high probability of detection as they are identified in all or nearly all individual replicates.On the other side of the spectrum, regions with significant yet lower signal in the ICEBERG datasets are rarely detected in single replicates.We suggest two non-mutually exclusive explanations for this: (i) detection probability reflects the fraction of cells in which the event occurs and (ii) the detection probability is proportional to the actual time a factor remains associated to that locus.Single-cell approaches may provide the evidence for the first explanation ( 8 ), while biophysical readouts could be used to assess the second ( 75 ).
Finally, applying ICEBERG to β-catenin in CRC cells provided important biological insights.First, it produced an exhaustive map of its binding pattern, which is all the more critical as β-catenin has been a historically challenging protein to target with these types of assays.Second, ICEBERG revealed > 500 cancer relevant binding instances never previously identified, underlying so far neglected regulatory events of the Wnt / β-catenin pathway.Among these new target genes, are some that are now undergoing clinical trials as drug targets to treat CRC and for which evidence was presented indicating their involvement in the regulation of Wnt signaling ( 66 ,68 ).Individual enhancers identified as binding sites of transcriptional modulators by CUT&RUN are now being targeted by CRISPR / Cas9-mediated deletion in clinical trials ( 73 ,76 ).Hence, our identification of rare and cancerrelevant regulatory sites demands urgent investigation as, provided their functional validation, each of them might constitute a target for therapeutic modulation of aberrant genetic circuits.

Figure 1 .
Figure 1.CUT&RUN replicates display low concordance.( A ) Schematic representation of concordance between CUT&RUN replicates, showing peaks present in only one replicate (left), present and called (abo v e red threshold) in both replicates (center) and present in both replicates but only called in one (right).Concordant peaks are represented by the green center of the Venn diagram.( B ) Proportional Venn diagrams showing peak concordance (green sections) across published CUT&RUN datasets for histone modifications (top row), transcription factors (middle row) and other gene regulators (bottom row).Concordant peaks are shown as the percentage of total peaks called.

PAGE 6 OF 18 NucleicFigure 2 .
Figure 2. Replication decreases the number of concordant peaks.( A ) Schematic representation of the experimental design.Two rounds of CUT&RUN L oV-U w ere perf ormed targeting β-catenin ( n = 25), H3K4me3 ( n = 1 0) and IgG ( n = 1 0).( B ) Tracks of all performed β-catenin and H3K4me3 replicates at the positive control AXIN2 and MYC loci.All replicates show enrichment and high signal to noise ratio.( C ) Top: Venn diagrams representing concordant peaks (MACS2 q < 0.05).Bottom: line graphs of the number of concordant and majority approach peaks.( D ) L eft: deca y curv es based on the number of peaks called in at least x replicates.The curve of β-catenin displays faster decay than H3K4me3.Right: decay curves based on the number of peaks called in exactly x datasets, showing that there are large numbers of peaks only called in subsets of replicates, especially for β-catenin D, left).Second, genomic annotation of peak positions is comparable with several existing datasets for β-catenin and other Wnt regulators ( 56 ,57 ) (Figure 3 D, right).Third, motif analysis displays known transcriptional coregulators of β-catenin as primary consensus signatures ( 58 ) (Figure 3 D, bottom).

PAGE 8 OF 18 NucleicFigure 3 .
Figure 3. Increased Capture of Enrichment By Exhaustive Replicate aGgregation (ICEBERG).( A ) Schematic of the ICEBERG concept, where the summation of replicates leads to the abrogation of both false negatives and false positives.( B ) Schematic of the ICEBERG pipeline.Reads are sampled randomly three times from all replicates e v enly and pooled to form aggregate datasets, which are then peak called and overlapped.Peaks called in at least two of three aggregates, and also called in at least one individual replicate at P < 0.01 are retained.( C ) Top: Regression curve describing the number of called peaks per each added replicate during the ICEBERG aggregate generation.Bottom: Plots of the first derivates of the ICEBERG regression curves.Dashed lines indicate where the derivative equals 0. ( D ) Left: Signal intensity plots of the ICEBERG datasets around the transcriptional start sites (TSS) (H3K4me3, left) and peak regions ( β-catenin, right), showing high signal to noise ratio.Right: Genomic region annotation of ICEBERG peaks.Bottom: de novo motif enrichment in the β-catenin ICEBERG peaks, which displays strong enrichment for the known β-catenin co-regulators AP1 and TCF / LEF. ( E ) Example comparisons of the ICEBERG datasets versus individual replicates on the common target loci AXIN2 and MYC , and the previously unknown as targets ASCL5 and RPS6KC1.

PAGE 10 OF 18 NucleicCFigure 4 .PAGEFigure 5 .
Figure 4. peak calling and replicate strategies using ICEBERG.( A ) Violin plots of I-FrIP scores (fragments within ICEBERG peaks) and signal intensity profiles within ICEBERG peaks of all replicates of H3K4me3 (left) and β-catenin (right).( B ) Graphs showing precision (y-axis, defined as number of peaks called in ICEBERG / total peaks called) and recall (x-axis, defined as number of ICEBERG peaks called / total ICEBERG peaks) of peaks called for each replicate or replicate combination.Different peak calling strategies are shown as different colored dots, and different replicate strategies in different graphs (from left to right: single replicates, duplicates 2 of 2, triplicates 3 of 3, triplicates 2 of 3).H3K4me3 is shown in the top row and β-catenin in the bottom row.In general, H3K4me3 datasets show higher precision and recall, and lower inter-replicate variability than β-catenin datasets, which suffer from lack of recall and ha v e large distributions due to higher variability.( C ) Bar graphs depicting mean F1 scores (harmonic mean of precision and recall) for each replicate and peak calling strategy for H3K4me3 (left) and β-catenin (right).Error bars represent ± standard deviation.

PAGE 14 OF 18 NucleicBFigure 6 .PAGEFigure 7 .
Figure 6.ICEBERG peak probability is correlated with cis-regulatory element activity.( A ) Left: bar graph representing average number of H3K27ac mediated chromatin loops as seen by HiChIP, per β-catenin peak in the different probability groups.Increasing probability of detection correlates with increasing number of chromatin loops with at least one anchor in the β-catenin peak.Middle: bar graph representing percentage of peaks within groups that are connected and not connected to H3K27ac mediated chromatin loops.∼50% of USR peaks are not in v olv ed in loops, while increasing probability peaks show higher percentage of peaks involved in loops.Right: graphs of loop size distribution (measured by linear distance in kb, log transformed) for loops associated with β-catenin peaks.Higher probability peaks are associated with larger loops.( B ) Top: representation of the MYC topologically associating domain (TAD), showing CTCF ChIP-seq, H3K27ac ChIP-seq, the β-catenin ICEBERG dataset, and β-catenin anchored H3K27ac HiChIP loops.B ottom: z oom-in on the MYC gene body and the colorect al cancer super enhancer region.USR = undetect able in single replicates: peaks defined as those ne v er called in single replicates at standard stringency ( q < 0.05) but that can be called with more relax ed parameters ( P < 0.01).