Fast detection of differential chromatin domains with SCIDDO

Abstract Motivation The generation of genome-wide maps of histone modifications using chromatin immunoprecipitation sequencing is a standard approach to dissect the complexity of the epigenome. Interpretation and differential analysis of histone datasets remains challenging due to regulatory meaningful co-occurrences of histone marks and their difference in genomic spread. To ease interpretation, chromatin state segmentation maps are a commonly employed abstraction combining individual histone marks. We developed the tool SCIDDO as a fast, flexible and statistically sound method for the differential analysis of chromatin state segmentation maps. Results We demonstrate the utility of SCIDDO in a comparative analysis that identifies differential chromatin domains (DCD) in various regulatory contexts and with only moderate computational resources. We show that the identified DCDs correlate well with observed changes in gene expression and can recover a substantial number of differentially expressed genes (DEGs). We showcase SCIDDO’s ability to directly interrogate chromatin dynamics, such as enhancer switches in downstream analysis, which simplifies exploring specific questions about regulatory changes in chromatin. By comparing SCIDDO to competing methods, we provide evidence that SCIDDO’s performance in identifying DEGs via differential chromatin marking is more stable across a range of cell-type comparisons and parameter cut-offs. Availability and implementation The SCIDDO source code is openly available under github.com/ptrebert/sciddo. Supplementary information Supplementary data are available at Bioinformatics online.


Overview of experimental data
All analyses were carried out using the official IHEC human hg38/GRCh38 assembly available at www.epigenomes.ca/data/CEMT/resources. We selected the following high quality DEEP samples to include both closely related as well as more distantly related cell types in our analysis: two replicates of HepG2 (HG 1 and 2; Supplementary Table S1), two replicates of hepatocytes (He 2 and 3; Supplementary Table S1), three replicates of monocytes (Mo 1, 3, and 5 [Wallner et al., 2016]) and two replicates of macrophages (Ma 3 and 5 [Wallner et al., 2016]). All primary cell types were isolated from healthy, adult donors. For each replicate, we downloaded the DEEP reference alignments for six histone marks (H3K4me1, H3K4me3, H3K27ac, H3K27me3, H3K36me3, H3K9me3) and the corresponding Input control as BAM files (Supplementary Table S1). Additionally, we downloaded DEEP mRNA expression data for all samples as raw read FASTQ files (Supplementary Table S2). The hg38 genome reference was restricted to fully assembled auto-and gonosomes for all data preprocessing steps. The differential analysis with SCIDDO was then limited to autosomes and chromosome X to alleviate any effects arising from the uneven distribution of sexes in our dataset. Annotation data were likewise limited to the same set of chromosomes. The GeneHancer [Fishilevich et al., 2017] enhancer annotation was licensed for academic use on 2017-05-30. The GeneHancer annotation was reduced to gene-enhancer pairs that could be mapped to gene identifiers in the GENCODE v21 annotation [Harrow et al., 2012]. The analysis of chromatin dynamics at enhancer regions was based on EP300 data downloaded from ENCODE [The ENCODE Project Consortium et al., 2012] under accessions ENCFF674QCU and ENCFF806JJS.

Generation of chromatin state maps
Following IHEC recommendations, all histone BAM files were filtered using Sambamba v0.6.6 [Tarasov et al., 2015] to exclude low quality reads (mapping quality ≥ 5; no duplicated, unmapped or non-primary reads/alignments). These filtered BAM files were used as input to generate chromatin state segmentation maps for all samples. We used the pre-trained 18-state ChromHMM (CMM18) model provided by the Roadmap Epigenomics Mapping Consortium (REMC [The Roadmap Epigenomics Consortium et al., 2015]). We decided to use this pre-trained model because it has been carefully designed using the large compendium of epigenomes generated by the REMC. We thus assumed that this model robustly captures chromatin states irrespective of the biological source of the samples at hand. As an additional benefit, the chromatin states of the CMM18 model were functionally characterized and labeled by the REMC to make interpretation of the state segmentation maps straightforward (Supplementary Figure S1 and Table  S3). We executed version 1.12 of ChromHMM with commands BinarizeBam -b 200 and MakeSegmentation -b 200 and otherwise default parameters to create the state segmentation maps.

Differential gene expression analysis
Gene expression estimates per replicate were computed with Salmon v0.9.1 [Patro et al., 2017] using the GENCODE v21 [Harrow et al., 2012] annotation for protein coding genes. For each gene in the GENCODE reference, we extracted genomic coordinates for the gene body (5' to 3' end) and for the promoter (-2500 bp to +500 bp around the 5' end) using custom scripts (see section "Availability of raw data and code" for link to sources). After expression quantification, we used DESeq2 v1.18.1 [Love et al., 2014, Soneson et al., 2016] to obtain differential expression estimates for all six possible pairs of sample replicate groups in our dataset. We split the DESeq2 results into groups of differentially expressed genes (DEGs) and non-differentially expressed genes (stable genes) based on an absolute log2 fold change in expression of at least 2 and a multiple testing corrected p-value of less than 0.01.

Differential histone peak calling
We selected PePr [Zhang et al., 2014] as a current state-of-the-art tool for differential chromatin analysis as a reference to compare to. We executed PePr v1.1.18 to perform differential analysis including postprocessing for all six possible pairs of sample replicate groups in our dataset. All available replicates were processed in a single run of PePr for each comparison. PePr was executed with histone peak type set to broad for the mark H3K36me3, and otherwise default parameters. The resulting histone peak sets were filtered to peaks with a q-value of less than 0.01 using custom scripts (see section "Availability of raw data and code" in the main text for link to sources).

Generation of randomly sampled genomic regions
For each set of DCDs, we generated randomly sampled genomic regions as control set. The DCDs were used as input to a custom script that generated a random sample of genomic regions following the DCD length distribution. Since this process was performed in parallel over chromosomes, the generated control regions resemble the DCDs both in length and in chromosomal distribution. Additionally, the control regions for a specific set of DCDs are disjoint (as DCDs are also disjoint by construction). To limit the runtime of the sampling process, a lower limit of at least 95% generated control regions was set (see section "Availability of raw data and code" in the main text for link to sources). 4 1.7 Algorithm 1: computation of length normalization factor L The Expect (E) value of a differential chromatin domain is computed according to the formula where the value L is the length of the chromosomal sequence, R is the raw score of the segment, and K and λ are the Karlin-Altschul parameters that are estimated by external routines (see main text for references). The length normalization factor L is computed as follows: Algorithm 1: Compute length normalization factor L Input: chromatin state maps for 2 sample groups (GROUPS), each consisting of at least one replicate r Output: Length normalization factor L c per chromosome c L c = (3+1+1)+(3+0)−3 = 5. Note that Algorithm 1 is used when SCIDDO's parameter --adjust-group-length is set to adaptive for the scan subcommand. This is the recommended setting for comparing two groups of high-quality (replicated) samples. For cases where the sample groups to be compared are heterogeneous, the --adjust-group-length parameter should be set to linear. For the linear setting, the length normalization factor L c will simply be computed as L c = |c| · #comparisons. For the example above, the result would then be L c = 3 · 6 = 18.

Fit of random scores to Gumbel-type extreme value distribution
The calculation of the E-value (see Materials and Methods in the main text) assumes a null model of random sequences. Following the theory (cf. Theorem 1 in [Karlin et al., 1990] and examples in [Karlin and Altschul, 1993]) the normalized maximal scores should follow a Gumbel-type extreme value distribution when comparing random state sequences, in the limit of the sequence length n. Since SCIDDO supports the use of customized scoring schemes, it also supports the user in assessing if the chosen scoring scheme follows this theoretical assumption. To that end, SCIDDO scans the randomly shuffled chromatin state maps of all sample pairs for high scoring subsegments and retains only the maximally scoring subsegment per chromosome; if several segments with identical scores emerge, only the first one is kept. This process is iterated until a pre-specified number of these "random" scores have been found. The scores underlying Figure S2A have been generated in that way. The user can then use these "random" scores and, e.g., assess their fit to a Gumbel-type extreme value distribution following our example in Figure S2A. Notably, in Figure S2A, we jointly fitted all "random" scores of all chromosomes to simplify the visualization. 6 2 Supplementary Results

Differential chromatin scores follow extreme value distribution
The last step in the SCIDDO workflow described main Figure 1 consists of turning the differential chromatin scores (DCSs) into an E-value that is used for filtering the set of candidate regions to obtain the final set of differential chromatin domains (DCDs). This step relies on theory developed for biological sequence analysis (see main text Materials and Methods) and requires first a normalization of the raw cumulative DCSs to account for the fact that comparing longer chromosomal sequences increases the chances of observing higher cumulative DCSs. This normalization uses two estimated statistical parameters, λ and K. These parameters have no biological interpretation, but can be thought of as scaling factors for the scoring system and the sequence length, respectively. Moreover, the theory assumes a null model of random sequences, and under this null model, the distribution of the scores should in the limit converge in distribution to a Gumbel-type extreme value distribution (see main text Materials and Methods). We confirmed that this is indeed the case in our analysis by comparing randomized chromatin state maps with each other and fitting all maximal DCSs identified during this sampling procedure to a Gumbel distribution ( Figure S2A). We also plotted the per-chromosome estimates of the statistical parameters λ and K that are needed for the score normalization ( Figure S2B; see Methods), and could confirm that the estimates are within reasonable bounds given examples from literature [Karlin and Altschul, 1993]. The observed agreement with theory thus supports the last step in the SCIDDO analysis (Main Figure 1 step (C)) of filtering candidate regions based on their E-value.

SCIDDO robustly identifies differential chromatin domains
Histone ChIP-seq data is known to be affected by various sources of noise, e.g., ranging from artifacts introduced during library preparation, to irregularities caused by varying mappability in the reference genome or to spurious signal due to unspecific antibody binding [Bailey et al., 2013, Head et al., 2014, Landt et al., 2012. In combination, biological and technical variation can render any differential analysis pointless if the results are dominated by noise, and not by the biological signal of interest. To test if the identified candidate regions were indeed representative and not replicate-specific, we computed the Spearman correlation of the E-values between all overlapping candidate regions. We visualized an exemplary case selected based on the mean of all comparisons. This exemplary case shows a Spearman correlation of 0.72 between the candidate regions ( Figure S3). The red bars in the lower left corner indicate candidate regions that are unique to the respective replicate comparison. It can be observed that unique candidate regions tend to have comparatively lower E-values whereas those candidate regions found in both replicate comparisons tend to have higher E-values. In general, the average Spearman correlations across all replicate comparisons are consistently in high range from 0.67 (HepG2 vs. hepatocytes) to 0.73 (HepG2 vs. monocytes; Table S5). 7

Chromatin state transitions in DCDs show consistent patterns of changes in genomic activity
For all six sample comparisons, we plotted the chromatin state transitions observed in the identified DCDs to examine if the overall change in chromatin activity showed comparison-specific characteristics ( Figure S4). The chromatin state transition frequencies indicate a general trend that activity states related to transcription and transcriptional regulation are down-regulated by, e.g., polycomb-mediated silencing. For the four comparisons involving one liver and one blood cell type ( Figure S4 B-E), a switch from polycomb-repressed to heterochromatin can also be observed quite frequently, indicating long-term silencing of the respective region. The state transitions for the monocytes versus macrophages comparison seem to show a different overall pattern, with more state switches within the borad functional categories related to (active) transcription and transcriptional regulation. In other words, when comparing more distantly related cell types, the identified DCDs seem to indicate rather broad changes in genomic activity including long-term silencing via heterochromatin formation. For the two closely related cell types monocytes and macrophages, the chromatin state switches appear to be more constrained to the respective functional activity level.

DCDs overlapping regulatory regions show higher E-values
The results presented in main Section 3.2 ("Differential chromatin domains occur in variousregulatory contexts") illustrate that the distribution of overlaps seems not to be affected by the number of DCDs identified. In all comparisons (main Figure  . This effect is most pronounced for annotated promoters and transcription factor binding sites (TFBS), and this seems not to be an effect of regulatory region size ( Figure S6, top panel). The average number of distinct regulatory region overlaps per DCD shows that a DCD often spans several of the shorter regulatory regions, with the exception of TFBS, which is the least abundant region type with a median size < 1 kbp in the Regulatory Build. At the other end of the size spectrum are promoters, which also show hardly any variation around a median of one DCD overlap per promoter.

Methodological and biological limitations for chromatin-based detection of differentially expressed genes
The theory borrowed from local scoring and implemented in SCIDDO is used to assign a measure of statistical stringency -the E-value -to each discovered DCD (see main Materials and Methods). Yet, the theory does not offer a way to decide what threshold on the E-value best separates genuine from chance observations. The necessary normalization to account for the length of the sequences being compared immediately suggests that short but biologically true differential regions will be assigned an (untransformed) E-value above SCIDDO's default threshold of 1. We checked the extent to which the default E-value threshold of 1 could limit SCIDDO's ability to identify -especially short -DEGs. We binned all DEGs by their gene body size and plotted the amount of genes with a DCD overlapping their gene body at E-value thresholds of 1 and 100 (Supplementary Figure S13). The histogram shows the expected behavior of SCIDDO to predominantly recover longer DEGs by means of finding a DCD in their gene body. However, relaxing the E-value threshold seems not to affect this general trend as the additional DEGs also show a tendency toward longer gene bodies. We thus wondered if other technical or biological artifacts might exacerbate the detection of DEGs on the chromatin level. We focused specifically on the comparison of monocytes to macrophages where approximately only 54% of all DEGs could be recovered using DCDs (see main Figure 3F).
As a first step, we examined if artifacts in the data could be the reason for the low DEG recovery rate. Besides chromatin states with annotated function, chromatin state maps usually include a so-called background state that represents regions of no detectable signal (state number 18 labeled as "quiescent" in the CMM18 model). It is important to realize, though, that the interpretation of this background state is difficult. While it is conceivable that technical problems caused this lack of a signal in certain regions of the genome, it may be biologically meaningful in others. Moreover, the six canonical histone marks included in this study certainly cover a wide range of functionally important chromatin signals, but they do not represent the entire regulatory chromatin landscape. To give an example, the recently characterized H3K122ac histone modification is also found at active enhancers that lack the canonical H3K27ac marking [Pradeepa, 2017]. Given these uncertainties, we opted for a conservative approach and considered the background state as not differential relative to all other chromatin states (see main Materials and Methods). We evaluated how many DEGs might not be recoverable under these conditions for the monocyte to macrophage comparison. For each of the 1110 DEGs that could not be recovered, we computed the percentage of the gene body length covered with the background state (averaged over all replicates in the respective groups). We found that close to a hundred genes that are covered to at least 60% with the background state are shared between the monocyte and the macrophage group (Supplementary Figure S14A). At a higher threshold of 80% body coverage, this number drops to 35 genes. Given that this considers genes that are in the same uninformative chromatin state to roughly the same extent in all samples -and being differentially expressed at the same time -it seems justifiable to assume that the non-detection of these genes is not a limitation of SCIDDO. When focusing on the genes that are covered with the background state in either monocytes or macrophages, the numbers rise considerably ( Supplementary Figure S14B). 164 genes are above the lower threshold of at least 60% coverage, and when raising the threshold to at least 80% coverage, 72 genes are still affected. In this scenario, the non-detection of the DEGs is hence largely driven by the lack of a signal in one of the two sample groups.
Considerations involving the background state might explain a few hundred cases of DEGs that could not be recovered by SCIDDO. It follows that a considerable amount of genes were assigned biologically meaningful chromatin states and yet were not detectable. We hypothesized that a plausible cause for this could be a comparatively weak change in gene expression for non-detectable genes. When a gene is switched from "off" to "on", a substantial change in the histone marking can be expected. However, if the gene is already actively transcribed and then simply upregulated, e.g., by activating additional enhancer elements (cf. Supplementary Figures S7-S9), it is not obvious why this change in expression should lead to differential chromatin marking in the gene body. We tested this hypothesis by plotting the mean difference in expression, plus the minimal and maximal expression level in any sample, for all DEGs in the monocyte to macrophage comparison ( Figure S14C-E). We split the genes into three groups based on DCD overlap in their gene body, in any associated enhancers but not in the body and no DCD overlap at all, i.e., the non-detectable genes. The mean change in gene expression is significantly higher in genes overlapping with a DCD compared to those genes that have no differential chromatin marking. Interestingly, the minimal expression level ( Figure S14D) is still relatively high for those genes that show differential chromatin marking only in their enhancers. When relating the minimal to the maximal expression level (Figure S14D/E), the change in expression can be characterized as follows: genes with a DCD in their gene body jump from a low to a high expression level; genes with no DCD in their body but in their enhancer(s) show increased expression relative to an already high level, and genes with no DCD at all remain at a low to mildly elevated expression level. It should be pointed out that the implied directionality is supported by the observed expression changes for the monocyte to macrophage comparison (Supplementary Figure S8).
There is a multitude of mechanisms beyond the chromatin level that can fine-tune gene expression [Coulon et al., 2013, Kim and Ren, 2006, Orphanides and Reinberg, 2002. Given that the DEGs lacking any sign of differential chromatin marking show also limited dynamics in their expression changes, we wondered whether there was any evidence of post-transcriptional control of these genes. As control group, we selected all genes that were not classified as differentially expressed but nevertheless showed signs of differential chromatin marking in their gene body (N=760 for the monocyte to macrophage comparison). We then plotted the number of annotated micro RNA targets using the TargetScan v7.2 [Agarwal et al., 2015] annotation for both groups of genes (Supplementary Figure S15, bottom panel). There is indeed a small but statistically significant difference in the number of annotated micro RNA targets per gene between the two groups. This difference seems not to be caused by a difference in 3'-UTR length, where it is actually the group of DEGs without an overlapping DCD that has the larger 3'-UTR regions on average (Supplementary Figure S15, Figure S1 and ( , ) ( , ) ( , ) Figure S7: DCD formation affects gene expression: (left) all 20,091 genes were stratified by the amount of DCD overlap either covering more than 50% of the body (body; orange curve) or less than 50% of the body or the promoter region (partial; blue curve). Expression fold change of the genes in the respective groups is plotted along the x-axis within a restricted window for improved readability. Statistical significance of the difference in mean fold change of the groups relative to the no overlap group ("none") was computed separately for negative and positive fold change genes using a two-sided Mann-Whitney-U test ("*" significant with p < 0.01, "-" not significant otherwise). (middle/right) same analysis as for the gene body, but here counting the number of intra-and intergenic enhancers (anywhere, middle) or only intergenic enhancers (right) per gene that overlap a DCD. Expression fold changes plotted within a restricted window for improved readability. Statistical significance assessed as above.  Figure S8: DCD formation affects gene expression: (left) all genes were stratified by the amount of DCD overlap either covering more than 50% of the body (body; orange curve) or less than 50% of the body or the promoter region (partial; blue curve). Expression fold change of the genes in the respective groups is plotted along the x-axis within a restricted window for improved readability. Statistical significance of the difference in mean fold change of the groups relative to the no overlap group ("none") was computed separately for negative and positive fold change genes using a two-sided Mann-Whitney-U test ("*" significant with p < 0.01, "-" not significant otherwise; "//": not enough data to compute test statistic). (middle/right) same analysis as for the gene body, but here counting the number of intra-and intergenic enhancers (anywhere, middle) or only intergenic enhancers (right) per gene that overlap a DCD. Expression fold changes plotted within a restricted window for improved readability. Statistical significance assessed as above. : DCD formation affects gene expression: (left) all genes were stratified by the amount of DCD overlap either covering more than 50% of the body (body; orange curve) or less than 50% of the body or the promoter region (partial; blue curve). Expression fold change of the genes in the respective groups is plotted along the x-axis within a restricted window for improved readability. Statistical significance of the difference in mean fold change of the groups relative to the no overlap group ("none") was computed separately for negative and positive fold change genes using a two-sided Mann-Whitney-U test ("*" significant with p < 0.01, "-" not significant otherwise). (middle/right) same analysis as for the gene body, but here counting the number of intra-and intergenic enhancers (anywhere, middle) or only intergenic enhancers (right) per gene that overlap a DCD. Expression fold changes plotted within a restricted window for improved readability. Statistical significance assessed as above.  Figure S15: Difference in annotated miRNA targets and 3p UTR length: genes overlapping a DCD in their gene body were split into two groups based on expression behavior (stable or differential) for the monocyte to macrophage comparison. (top) box plots show distribution of 3p UTR length as annotated in Ensembl v78 for the genes in the respective groups. (bottom) box plots show distribution of number of annotated miRNA targets per genes in the respective groups (TargetScan v7.2). Differences in magnitude between the two groups were assessed with a one-sided Mann-Whitney-U test (alternative less or greater as indicated) and considered significant "*" at p < 0.01. DEGs. At least one DCD/differential H3K36me3 peak (PePr) in the gene body and at least three DCDs/differential H3K27ac peaks (PePr) in geneassociated enhancers were required for a DEG to be considered detected on the chromatin level. Differences in performance were assessed with a one-sided Mann-Whitney-U test and considered significant "*" at p < 0.01 At least one DCD/differential H3K36me3 peak (PePr) in the gene body was required for a DEG to be considered detected on the chromatin level. The quiescent chromatin state was not treated as "not differential" by default in the SCIDDO analysis. Differences in performance were assessed with a one-sided Mann-Whitney-U test and considered significant "*" at p < 0.01 At least one DCD/differential H3K36me3 peak (PePr) in the gene body and at least three DCDs/differential H3K27ac peaks (PePr) in gene-associated enhancers were required for a DEG to be considered detected on the chromatin level. The quiescent chromatin state was not treated as "not differential" by default in the SCIDDO analysis. Differences in performance were assessed with a one-sided Mann-Whitney-U test and considered significant "*" at p < 0.01     Table S4: Runtime (in minutes of wall clock time) of individual SCIDDO commands executed in order from top to bottom to perform the differential analysis. The runtime for the scan command refers to a single comparison of two versus two samples. The last scan command is provided as an example of the scaling behavior of SCIDDO (scanning for differential chromatin domains between the four liver and the five blood samples in the dataset