Identification of mammalian transcription factors that bind to inaccessible chromatin

Abstract Transcription factors (TFs) are proteins that affect gene expression by binding to regulatory regions of DNA in a sequence specific manner. The binding of TFs to DNA is controlled by many factors, including the DNA sequence, concentration of TF, chromatin accessibility and co-factors. Here, we systematically investigated the binding mechanism of hundreds of TFs by analysing ChIP-seq data with our explainable statistical model, ChIPanalyser. This tool uses as inputs the DNA sequence binding motif; the capacity to distinguish between strong and weak binding sites; the concentration of TF; and chromatin accessibility. We found that approximately one third of TFs are predicted to bind the genome in a DNA accessibility independent fashion, which includes TFs that can open the chromatin, their co-factors and TFs with similar motifs. Our model predicted this to be the case when the TF binds to its strongest binding regions in the genome, and only a small number of TFs have the capacity to bind dense chromatin at their weakest binding regions, such as CTCF, USF2 and CEBPB. Our study demonstrated that the binding of hundreds of human and mouse TFs is predicted by ChIPanalyser with high accuracy and showed that many TFs can bind dense chromatin.


INTRODUCTION
Site specific transcription factors (TFs) control gene expression by binding to gene promoters and enhancers ( 1 , 2 ) but prediction of their binding in different cell types has been a significant challenge to date.Ther efor e, we do not have a clear understanding of the underlying mechanism that underpins TF binding in various biological contexts.The advent of high throughput technologies, such as chroma tin immunoprecipita tion followed by sequencing (ChIPseq) drove huge progress in generating empirical data on TF binding profiles, which is now the gold standard method to experimentally determine TF binding profiles ( 3 ).Nevertheless, despite their notable impact on profiling TF binding, ChIP methods do not provide a mechanistic understanding of the binding e v ents of TFs to the genome.
TFs bind to the DNA at specific short sequences known as motifs , where TFs recognize and bind their motifs with much higher affinity than any other sequence (4)(5)(6).There is a wide array of methods, both in vitro and in vivo , that can be used to determine TF binding sites ( 3 , 7 , 8 ).Howe v er, the presence of a motif is not sufficient for a TF to bind or regulate a gene (9)(10)(11).Thus, other factors besides DNA sequence must influence TF binding, for example TF concentration (12)(13)(14)(15).Furthermore, TFs in humans and other higher eukaryotes also face the challenge posed by the complex structure of chromatin.Indeed, nucleosomes have long been known to impede TF binding and nucleosome-rich r egions ar e associated with transcriptionally inacti v e chromatin ( 16 , 17 ).Thus, nucleosome positioning and DNA accessibility is another significant factor influencing TF binding.
While it is assumed that in general TFs cannot bind nucleosome-rich chromatin, a subset of TFs known as pioneer factors can interact with nucleosomes and bind their co gnate DN A.Not onl y that, but they are also able to displace nucleosomes and make way for other TFs to bind without the use of ATP-dependent chromatin re-modelers ( 18 ).The first pioneer factors discovered were FOXA1 and GATA4, two TFs that play an important role in endoderm formation during embryogenesis ( 19 , 20 ).Since then, several other TFs involved in a variety of processes have been identified as having pioneer function ( 21 ).Other examples are the pluripotency factors (SOX2, OCT3 / 4, KLF4 and c-MYC), which can reprogram somatic cells and re v ert them to a pluripotent state.These factors have been shown to hav e pioneer properties, e xcept for c-MYC w hich a ppears to lack pioneer properties itself, but is a co-factor that enhances the activity of SOX2 / OCT4 ( 22 , 23 ).
Bookmarking of binding sites by TFs (i.e. the continued occupancy of TFs at their binding sites e v en during transcriptionally inacti v e phases, such as during mitosis) is a mechanism for quick reactivation of transcription sites after cell division.This behaviour has been observed in several TFs such as FOXA1 ( 24 ), GATA1 ( 25 ), as well as the pluripotency factors ( 26 ) and it serves to maintain cellular dif ferentia tion, or lack thereof, in the case of the pluripotency factors.
The mechanisms by which pioneer factors open chromatin are not well understood.While it has been shown tha t FOXA1-media ted chroma tin relaxa tion does not require ATP-dependent chromatin re-modelers, it is unclear whether r e-modelers ar e r ecruited ( 18 , 27 ).In the case of FOXA1, there is evidence for it directly causing chromatin relaxation through the displacement of linker histones ( 28 ).Other factors, such as OCT4, are known to recruit chromatin re-modelers to their binding site to facilitate chromatin opening.For example, OCT4 recruits the SWI / SNF complex of chromatin re-modelers, particularly Brg1 (29)(30)(31).
When a TF binds to dense chromatin, it can either open the chromatin, stay idle or help maintain it in a closed state.If binding results in opened chromatin, the TF would be classified as a pioneer factor, while in the latter case, it can be classified either as an insulator or a bookmarking TF.Insulators can bind at the boundary between dense and open chromatin and stop the spreading of heter ochr omatin, ther eby pr e v enting gene silencing and gene inactiva tion ( 32 ).In addition, insula tors can bind between the enhancer and the promoter, thus blocking their interaction and interfering with gene expression ( 33 ).One interesting example of an insulator is CTCF, which appears to have the ability to displace nucleosomes after cell division and maintain nucleosome depleted regions in some contexts ( 34 ), while other times its binding is inhibited by the presence of nucleosomes ( 35 , 36 ).This indicates that the relationship between TFs and chromatin is complex and context dependent.
Here, we investigate the binding profiles and chromatin accessibility pr efer ences of human and mouse TFs using ChIPanalyser, an e xplainab le statistical model we previously de v eloped ( 13 , 37 ).Other tools to predict ChIP profiles (38)(39)(40) train an opaque / black box machine learning (ML) model on ChIP data and predict binding profiles in other cell types.While those models can be interpreted, it is often challenging to interpret mechanistically what dri v es the binding of a TF.ChIPanalyser uses a bottom-up approach, where the models start from known biological and physical components of TF binding and we train the parameters of the model on ChIP data ( 11 , 13 , 37 , 41 ).One assumption we make is that the DNA binding motif (in the form of PWM) is an accurate r epr esentation of the binding pr efer ence of a TF.This is often the case ( 42 ), but there are also exceptions, and the ML models predict different binding motifs (38)(39)(40).
TF binding to DN A and DN A accessibility are dynamic processes.Consequentl y, identifying w hether TF binding is affecting DNA accessibility or DNA accessibility limits TF binding is a difficult task.For example, if one observes a transcription factor binding in open chromatin, it is difficult to know whether the presence of the TF opened the chromatin or the genomic area was already accessible and allowed that transcription factor to bind.ChIPanalyser allows to discriminate between these two scenarios by distinguishing between: (i) the case when a TF is bound in an open region of the chromatin because of being restricted only to binding in open regions of the chromatin, or (ii) the case when the chromatin is open at all the sites in the genome where the TF can bind to the DNA.To achie v e this, we model the TFs as being able to bind anywhere in the genome independently of DNA accessibility le v els and compare that to binding being restricted to the TF to top accessible regions.If ChIPanalyser predicts many peaks that are not observed in ChIP-seq data when allowing binding e v erywhere in the genome, but not when restricting the TF binding to top accessible regions, then we can conclude that the TF binding is affected by the DNA accessibility.In contrast, if by allowing binding e v erywhere in the genome (including dense chromatin regions) we do not predict novel peaks (not observed in ChIP-seq data) compared to allowing binding only in top accessible regions of the genome, then the binding of the TFs is not limited by DNA accessibility.Ne v ertheless, this approach (using ChIPanalyser), cannot discriminate whether a TF opens the chromatin (being a pioneer factor), binds together to the DNA with another TF that opens the chromatin (co-factor of a pioneer), or shares a similar binding motif with a pioneer TF or its co-factors.
We first train the ChIPanalyser model on bulk ChIP-seq and DNA accessibility data (e.g.ATAC-seq or DNaseI-seq) data in order to estimate TF binding parameters, and then use those to determine whether TFs prefer to bind open or nucleosome associated DNA.Our results show that many TFs bind to DNA in an accessibility independent manner at their strongest binding sites.We highlight different behavioural classes among mouse and human TFs.Additionally, the ability to bind to inaccessible DNA can be linked to various functional roles such as bookmarking, chromatin opening or chromatin insulation.

Datasets
K562 cell line.To investigate TF binding behaviour, we considered 244 TFs in K562 cells for which ChIP-seq data were available from ENCODE ( 43 , 44 ).First, we selected ChIP-seq datasets in WT K562 cells and excluded from further analysis data where the cells were subjected to different treatments (e.g.RNAi).Furthermore, TFs for which a PWM motif was not available in the JASPAR database and had less than 60 ChIP-seq peaks were also excluded from the dataset.All metadata information for the downloaded data can be found in supplementary data Supplementary Table S1.The final number of TFs after triage was 110 (see Supplementary Table S2).Where multiple experiments were available for one TF, the data were merged.
In addition to the TF ChIP-seq data, DNase I hypersensitivity data were also downloaded from ENCODE for the K562 cell line (experiment accession ENCSR000EOT).This was processed in the same way as the ChIP-seq data (see below) until the peak calling stage, where broad peaks were called with a q-value threshold of 0.1 instead.

Mouse cell lines.
We considered 78 ChIP-seq datasets in 8 mouse cell lines from ENCODE ( 43 , 44 ) and, following the same filtering steps as in the K562 cells (motif in JAS-PAR core and more than 60 ChIP-seq peaks), 60 ChIP-seq datasets were selected (Supplementary Table S6).We used se v eral datasets for DNA accessibility ( 43 , 45-48 ) (see Supplementary Table S7).
Note that whilst ATAC-seq, DNase-seq and MNase-seq follow the same accessibility pattern (starting at 100% accessible DNA for QDA = 0 and then gradually decreasing their percentage of accessible genome), NOMe-seq does not follow this pattern and in this case QDA = 0 corresponds to 41% of accessible regions and QDA = 0.99 corresponds to 8% accessible regions (Supplementary Figure S10C-D).In other words, there are no reads from 59% of the genome in the NOMe-seq data from IMR90.In the case of IMR90 cells, MNase-seq does not reach 0%, which can be explained by the quality of the data and the size of the sequencing library (MNase-seq measures the nucleosome profiles and, thus, increasing the sequencing depth could allow capturing regions with higher density of nucleosomes and lower accessibility).
To select regions with strong, medium and weak ChIP signal, we first ordered 50 kb regions based on the number of ChIP peaks they contain for the corresponding TF and then we selected groups of 50 regions from the top, middle and bottom of the list.
When analysing the different methods to estimate DNA accessibility, we found that ATAC-seq and DNaseI-seq showed better performance compared to MNase-seq and NOMe-seq.Since this was performed in two different cell lines with different DNA accessibility datasets, our results indica te tha t this would not be affected by the dataset quality.Howe v er, gi v en the lower number of analysed TFs for the comparison of the different methods to measure accessibility, these results could also be impacted by a single lower quality dataset.

Modelling TF binding with ChIPanalyser
Model description.ChIPanalyser implements an approximation of the statistical thermodynamics model and is described in detail in ( 54 , 37 ).Briefly, ChIPanal yser estimates ChIP-seq like profiles based on four parameters: (i) a weighted DNA binding motif r eferr ed to as a position weight matrix (PWM), (ii) DNA accessibility data, (iii) the number of molecules bound to the DNA (determined experimentally or predicted) and (iv) a factor that modulates TF specificity.The model outputs the probability that a TF is bound to a site j as gi v en by: where N is the number of TF molecules bound to the genome, a j is a measure of DNA accessibility at site j on the genome (the probability that site j is in accessible chromatin), λ is the specificity scaling factor, w is the PWM score and L and n are the length and ploidy of the genome, respecti v ely.

Computing optimal parameters
The number of bound molecules N and the TF specificity factor λ are two of the parameters that are needed for estimating ChIP-like profiles.These are difficult to measure experimentally, so we estimated them by training ChIPanalyser on the corresponding ChIP-seq data.We binned the genome into 50 kb bins and trained the model on the top 10 bins with the highest ChIP-seq signal.We then validated the estimated parameters on a different set of 50 bins of 50 kb each from the same ChIP-seq dataset.
Quantile density accessibility (QDA).To assess to role of chromatin accessibility in the binding of TFs, we calculated quantiles between 0 and 0.99 for the accessibility data and subset it based on these ( 37 ).We termed this analysis quantile density accessibility (QDA).Each QDA r epr esents a subset of the genomic regions that the model considers accessible.For each quantile, we selected the regions with accessibility scores equal or greater than the quantile value such that a QDA of 0 corresponds to the subset of regions greater or equal to the '0 quantile' of the distribution of scores (i.e.all regions), a QDA of 0.5 corresponds to the subset of regions with scores greater or equal to the median of the distribution (top 50% regions with highest DNA accessibility signal) and a QDA of 0.9 corresponds to the subset of regions with scores greater than the 90th percentile of the distribution (top 10% regions with highest DNA accessibility signal) (Figure 1 B, C).

Clustering of the TFs
To cluster TFs based on their pr efer ence for open or dense chromatin, we first used k-means clustering to identify the different classes of TFs from the data without prior assumptions.Our analysis showed that there are two main classes of TFs (Supplementary Figure S3A), namely: (i) TFs displaying accessibility independent binding ( 61) and (ii) TFs displaying a clear accessibility dependent binding ( 48 ).We did not find TFs that display inaccessibility dependent binding, which indica tes tha t either ther e ar e fewer TFs displaying this behaviour, or that our collection of ChIP-seq data does not contain many TFs displaying that behaviour.Ne v ertheless, four TFs showed higher accuracy of predictions when assuming they can bind anywhere in the genome, compared when their binding is restricted only to accessible regions (see purple TFs in Supplementary Figure S3B, C).This indica tes tha t some of the classes in Figure 1 B have too few TFs in our dataset to be detected by k -means clustering.Since only a few TFs displayed some specific behaviours, we also used a manual selection of thresholds to group TFs (Figure 3 A).We computed the mean AUC for the dense chromatin (QDA between 0 and 0.2) and open chromatin (QDA between 0.8 and 0.95) and the difference between these means.Using these values we classified TFs as: (i) accessibility independent factors (AIF) when both means are above 0.8 and the difference is lower than 0.1, (ii) accessibility dependent factors (ADF) when the difference is greater than 0.3 and the mean AUC in open chromatin is > 0.8, (iii) inaccessibility dependent factors (IDF) when the difference is > 0.3 and the mean AUC in dense chromatin is greater than 0.8, (iv) partial AIF / ADF when the difference is between 0.1 and 0.3 and the mean AUC in open chromatin is > 0.8, (v) partial AIF / IDF when the difference is between 0.1 and 0.3 and the mean AUC in dense chromatin is > 0.8 and (vi) poorly predicted if both means are below 0.65.TFs that did not meet any of these criteria were classified as other.The threshold of 0.8 AUC was selected based on the result from the k -means analysis by rounding the lowest AUC for the AIFs cluster, which was 0.79.In addition, we selected a threshold of 0.65 for poorly predicted TFs by rounding the average AUC for the others cluster in the k -means analysis, which was 0.64.

Modelling binding of human transcription factors
To investigate TF binding behaviour, we considered 110 TF ChIP-seq datasets as well as DNA accessibility data (DNaseI-seq) in K562 cells available from ENCODE ( 43 , 44 ) (Supplementary Figure S1 and Supplementary Table S1).For each TF, we used a binding motif, in the form of position weight matrix (PWM), available from the JASPAR core database ( 55 ).To model the binding profiles of these TFs in K562 cells, we used ChIPanalyser ( 13 , 37 ), a Bioconductor package that estimates ChIP-seq profiles based on four parameters: (i) binding motif of the TF in PWM format, (ii) DNA accessibility data, (iii) the number of TF molecules bound to the DNA (N) and (iv) a factor that modulates TF specificity ( ).The latter models the ability of a TF to discriminate between high and low affinity binding sites, where the affinity is estimated based on the PWM score.In particular, is inversely proportional to the capacity to dif ferentia te between high and low affinity sites.In other words, a high means that there are more of the weaker binding sites, while low values for means that there are fewer but stronger binding sites for a TF.The number of molecules bound (N) and specificity factor ( ) are difficult to measure experimentally but can be estimated by fitting the model to ChIP-seq data and selecting the values that minimise the mean squared error (MSE) between the predicted profile and the actual ChIP-seq profile ( 37 ) (Figure 1 A).To run this analysis, the genome was tiled into 50 kb bins and the model was then trained on 10 regions with strongest ChIP-seq signal.The top 10 regions of 50 kb size contain the strongest peaks (true positi v es), but at the same time, they contain many regions that are not bound by the TF (true negati v es).This pr ovides an appr opriate set of input data to train our model with sufficient true positi v es and true negati v es ( 37 ).Note that training on regions with fe wer and weaker ChIP-seq peaks would not contain enough true positi v es regions and the best trained model will be a flat line at 0 ( 37 ).Following training, the estimated parameters were v alidated b y computing the area under the curve (AUC) between the predicted profile and the ChIP-seq profile on the subsequent 50 regions with strongest ChIP-seq signal ( 37 ).
ChIPanalyser uses DNA accessibility data among other parameters to predict TF binding, and this provides the opportunity to investigate the role of DNA accessibility on the binding of TFs to the genome.We subset the DNA accessibility signal using a quantile vector into quantile density accessibility (QDAs) between 0 and 0.99 ( 37 ).In practice, this means that for each QDA, the model considers a percentage of the top ATAC-seq or DNaseI-seq signal regions as accessib le, regar dless of their actual accessibility scores (see Figure 1 B, C and Materials and Methods).
To observe the pr efer ences of the various TFs for chromatin accessibility, we ran the analysis for each QDA and used the AUC as a goodness of fit metric to estimate the model accuracy.If the AUC for the lower QDAs is high, it indica tes tha t the TF can bind dense chroma tin with high affinity.This is because the prediction accuracy remains high e v en though all or most of the genome is considered accessible to the TF by the model, including dense chromatin.In contrast, if the AUC is high only for the high values of QDAs, it indicates that the TF binds only in open chromatin, as only the most open r egions ar e consider ed accessible to the TF by the model.We were able to identify three main classes of TFs based on their chromatin acces-sibility pr efer ence: accessibility independent factors (AIF), accessibility dependent factors (ADF) and inaccessibility dependent factors (IDF) (Figure 1 D).Each TF class has different DNA binding properties and impact on the surr ounding chr oma tin, as illustra ted in Figure 1 D.
Figure 2 shows the analysis for CTCF in K562 cells.First, for each accessibility threshold value (QDA value), we fit the model to the ChIP-seq data over the training dataset and evaluate its performance over the validation dataset (Figure 2 A, B). Figure 2 B re v eals that there are negligible differences between the performance of the model when assuming that CTCF can bind to all regions of the genome independent of their DNA accessibility le v els (QDA = 0) or when assuming that CTCF can bind only to the top 10% accessible regions of the genome (QDA = 0.9).This means that CTCF binds to DNA in an accessibility independent way and, thus, can be classified as an AIF. Figure 2 C shows an example comparison between the predicted CTCF profile and the ChIP-seq da ta, illustra ting that the predictions are accur ate.Similar ly, Supplementary Figure S2A shows two additional examples for other TFs (PBX2 and BACH1), while Supplementary Figure S2B shows the distribution of AUC values across all TFs, which confirm that the predictions display high accuracy.A complete list of the estimated optimal parameters and AUC values for all TFs is available in Supplementary Tables S2  and S3.

Many human TFs are predicted to bind to the DNA at their strong binding regions independent of DNA accessibility
Follo wing this workflo w, we analysed the rest of the available ChIP-seq data in K562 cells.Supplementary Figure S2 confirms that our model accurately fits the ChIP-seq data.Mor e pr ecisely, 95 out of 110 ChIP-seq datasets were modelled with high accuracy having an AUC of at least 0.8 and none have an AUC lower than 0.65 (Supplementary Figure S2B).To group TFs with respect to their binding preference in open and dense chromatin, we first tried k-means clustering.We used k -means clustering to identify groups of TFs based on how well ChIPanalyser fits the ChIP-seq data (using AUC metric) assuming that the TF can bind in regions with different le v els of accessibility.Howe v er, w e w ere only able to detect two clusters despite observing multiple distinct behaviours when inspecting the AUC Each row r epr esents a TF and each column an accessibility thr eshold (QDA value).The blue colour r epr esents the AUC le v el for the corresponding QDA and TF.We mark with green and bold the TFs that were previously reported to act as pioneer TFs.For FOS, there are two different datasets leading to opposite results and that was highlighted by bold and orange colour of the text.Finally, we mark by purple and bold TFs that display a decrease in AUC with increasing the accessibility, which are potential Inaccessibility Dependent Factors (IDFs).trends.For example, K-means clustering was not able to identify any IDFs.This is most likely due to having too few TFs exhibiting IDF behaviour in our dataset (see Materials and Methods and Supplementary Figure S3).To evaluate if our original k -means analysis was too stringent, we also performed the k -means analysis with fiv e clusters (Supplementary Figure S4).This analysis resulted in grouping all IDFs together with poorly predicted ChIP-seq datasets and splitting the ADFs in two subgroups.Thus, we opted for a threshold-based approach to classify the TFs, where the values for the threshold selection were informed by the k -means clustering (see Materials and Methods).This analysis identified se v eral classes of TFs with respect to their binding in open and dense chromatin (Figure 3 A).Most importantly, the classification based on manual thresholds leads to similar results to K-means clusters (Supplementary Figure S4F).
Unexpectedly, we found that many TFs display no preference for open or dense chromatin (33 AIFs), while some display a slight pr efer ence for open chromatin (49 partial AIFs / ADFs) (Figure 3 B, C).CREB1, FOXA1, FOS, GATA1 and JUN were previously identified as pioneer factors (see Supplementary Table S4) and, thus, their classification as AIFs or partial AIFs (green TFs in Figure 3 B,  C) is supported by the pre vious wor k.Only a small number of TFs displayed strong pr efer ence for open chromatin (12 ADFs) or moderate pr efer ence for dense chromatin (4 partial AIFs / IDFs) (Figure 3 D, E).Se v en TFs only partially met the criteria for either ADFs or AIFs and were classified as 'other' (Figure 3 F).Finally, fiv e TFs (NR2C2, YBX1, SOX6, SMAD2 and E2F8) were not accurately predicted independently of the parameters that were used (Figure 3 G), likely due to the quality of the ChIP-seq data or the PWM motif (Supplementary Table S5).
It is worthwhile noting that FOS was classified as both an Accessibility Independent and Dependent Factor by our analysis (orange TFs Figure 3 B, D and Supplementary Figure S3B, C).Interestingly, the classification as an AIF was based on ChIP-seq da ta tha t was genera ted with an eGFP tagged version of FOS, while the ADF classification was based on the untagged version of FOS.One possibility is that, in the eGFP tagged experiment, the le v els of FOS are higher than endogenous le v els in K562 cells and at higher concentrations FOS could also bind in dense chromatin regions of the genome.This raises the possibility that a TF can act as an accessibility dependent factor in a cell line where it is expressed at low or medium le v els and as an accessibility independent factor in cells where it is expressed a high or very high levels.In addition, it is known that eGFP can dimerise ( 56 ) and this could lead to a stronger recruitment of the eGFP tagged version of FOS compared to untagged version of FOS.One possibility is that homodimers of eGFP tagged version of FOS could have the capacity to bind dense chromatin.
It is also possible that some of the TFs identified as AIFs in our analysis bind in the same regions of the genome and our analysis identifies co-factors of TFs binding in highoccupancy target (HOT) regions ( 57 ) To investigate this, we looked for overlapping peaks of all TFs classified as AIF in K562 cells.Between the peaks used for model validation (based on which we performed TF classification), we found only negligible overlaps (Supplementary Figure S5A), ther efor e the classification is not likely to be dri v en by TFs binding HOT r egions.Furthermor e, we also investigated overlaps among the top 1000 and top 5000 peaks and we observed that the extent of overlap increases with the addition of weaker peaks (Supplementary Figure S5B-C), indicating that most of these TFs do not co-localise in the strongest binding regions in the genome, but do colocalise in weaker regions.The three main groups of TFs that co-localise are USF1 with USF2, CREB1 with ATF1 and SP2 with NFYA.TFs belonging to the same families, such as USF1 and USF2 or CREB1 and ATF1 are expected to share some lower affinity binding sites.Indeed, both pairs of TFs have similar binding motifs to each other.Furthermore, SP2 has been predicted to interact with NFY family members such as NFYB and NFYC ( 58 ).Ther efor e, the co-localisation of these TFs is most likely a result of similar function or recognition of similar low-affinity motifs and not due to many TFs binding in HOT regions.
Next, we used an alternative method to evaluate which TFs display strong binding in regions of the genome with low accessibility.In particular, we plotted the average ChIPseq signal at 1000 regions with strongest and 1000 regions with lowest DNaseI-seq signal (Supplementary Figures S6 and S7).We found that 9 AIFs (ATF4, CTCF, E2F6, EGR1, IRF2, MAFK, NFIC, NFYB and ZKSCAN1), 10 partial AIFs / ADFs (CEBPB, ELF1, HMBOX1, JUN, JUND, MAFF, MEIS2, NFE2, YY1 and ZNF384), 3 partial AIFs / IDFs (ZNF146, REST and ZNF274) and only 2 ADFs (ZBTB40 and MYNN) display at least similar ChIPseq signal in dense chromatin regions as in open chromatin r egions.Inter esting, ZNF146 and REST showed stronger binding in dense chromatin than in open chromatin further supporting their classification as IDFs.Altogether, these results show that a large number of TFs that our analysis predicted to display DNA accessibility independent binding also show strong binding in dense chromatin.
In our analysis, we have considered all ChIP-seq peaks independent if they are promoter proximal or distal.Thus, we investigated if the classification of TFs as AIFs in our analysis has been impacted by their pr efer ence for proximal or distal binding to TSS.If open chromatin next to promoters were to dominate the signal (leading to classification of AIFs), then we should see that for all AIFs the majority of the ChIP-seq peaks used for the analysis are next to a TSS.In contrast, if dense chromatin was to dominate the signal (leading to classification of AIFs), then we should see that all AIFs have only a few peaks next to a TSS.W ha t we see is a mixture, with some TFs displaying preference for proximal binding and others for distal binding for both AIFs and ADFs (Supplementary Figure S8B), which indica tes tha t whether we train the model on TSS peaks or distal peaks does not influence the classification of TFs as AIFs or non-AIFs.We repeated the analysis by using only proximal or only distal ChIP-seq peaks for nine TFs that were selected since they were classified as AIFs, showed strong binding in dense chromatin (Supplementary Figures S6A  and S8A) and displayed different le v els of TSS proximal and distal binding (Supplementary Figure S8B).Our results showed only negligible differences in the AUCs of the nine TFs when considering either TSS proximal or TSS distal only binding, and TFs were consistently classified as (partial) AIFs (Supplementary Figure S8D).Note when TFs that were classified as partial AIFs or other (ZKSCAN1, NFIC and MAFK), the analysis was performed on regions with weaker ChIP-seq signal (Supplementary Figure S8C) because these TFs had few TSS proximal ChIP-seq peaks.

Mouse TFs display similar behaviour to human TFs
Next, we performed a similar analysis for ChIP-seq datasets in mouse cell lines from the ENCODE project ( 43 , 44 ).We obtained 60 ChIP-seq datasets covering 30 TFs (with a PWM motif in the JASPAR core database) in eight cell lines (Supplementary Figure S9 and Supplementary Table S6).We also considered and additional study of MYOD1 from ( 59 ).Most of the ChIP-seq datasets (90%) were modelled with high accuracy (AUC > 0.8); see Figure 4 A. Depending on availability, we used DNase or ATAC-seq datasets as measures of DNA accessibility (see Supplementary Table S7).For each ChIP-seq dataset, we computed the AUC for all QDA values (Supplementary Table S8) and then used the threshold-based approach to group these TFs into the different classes (Figure 4 B-H).Interestingly, we observed that while some TFs were classified in the same group for all cell lines, ther e wer e some that were classified in two or e v en three groups (Figure 4 B).( 59 ).Most of the ChIP-seq datasets (90%) were modelled with high accuracy (AUC > 0.8); see Figure 4 A.
Similarly, to our findings in the human cell line, CTCF was classified as an AIF in three of the mouse datasets.Interestingly, it was also classified as a partial IDF in two of the mouse datasets.This behaviour is in line with the literatur e surrounding CT CF, which suggests that its pr efer ence for chromatin state varies depending on context (34)(35)(36).Finall y, CTCF was poorl y predicted in one dataset, likely due to the quality of the data.
USF2 is the only TF that was identified in all four datasets (two in human and two in mouse) as an AIF suggesting that there is a high probability this TF binds in a DNA accessibility independent manner.USF1 displayed DNA accessibility independent binding characteristics in three datasets (two in mouse and one in human) and partial pr efer ence to dense chromatin in one dataset (mouse).These two TFs have been previously reported to bind heter ochr omatin ( 60 ), thus further supporting our findings.ZNF384 also displayed DNA accessibility independent binding characteristics in three datasets (mouse), but a partial pr efer ence to open chromatin in one dataset (human).NRF1 and ZKSCAN1 both showed consistent DNA accessibility independent binding in all three datasets.MYOD1, a pr eviously r eported pioneer TF (61)(62)(63) was identified in our study as displaying DNA accessibility

Binding pr edictions ar e consistently better when using A T A Cseq and DNaseI-seq and ar e mor e accur ate at r egions with strong and medium ChIP signal
One surprising finding from our analysis in K562 cells is that many TFs are predicted to bind independently of DNA accessibility despite only a handful of TFs having been previously identified as pioneer TFs.This suggests that pioneer TFs are only a subset of AIFs and other types of TFs, such as co-factors of pioneers, bookmarking TFs or chromatin re-modelers also bind in an accessibility independent manner.Howe v er, our model is unable to distinguish between these types of factors.Moreover, the use of DNaseI-seq data might also be a potential source of bias and using different methods to estimate DNA accessibility could change some of these results.Similarly, we focussed our analysis at regions of DNA displaying the strongest ChIP signal and it is not clear if these results would remain the same when investigating regions of the genome with weaker signal.
To address these issues, we analysed the binding of ele v en TFs in IMR90 cells and three TFs in HepG2 cells (Supplementary Figure S10A-B and Supplementary Table S9).First, we considered four different DNA accessibility measures (DNaseI-seq, ATAC-seq, MNase-seq and NOMEseq) in IMR90 and three (DNaseI-seq, ATAC-seq and MNase-seq) in HepG2; see Supplementary Figure S10C-D and Materials and Methods.Secondly, we ran the validation analysis on 50 regions with strong, medium and weak ChIPseq signal (see Supplementary Table S9 and Materials and Methods).Our results showed that while most analyses resulted in a high prediction accuracy, there were significantly more cases compared to previous analyses where our model did not fit the data well (compar e Figur e 5 A to Supplementary Figure S2B and Figure 4 A).To identify if there is a subgroup resulting in these lower accuracy models, we split the cases based on whether the validation ChIP signal was strong, medium or weak and found that most of the cases where our model did not fit the data well were at regions with weaker binding (Figure 5 B).Next, we ran a similar analysis, but we split the data based on the DNA accessibility method.We found that while ATAC-seq and DNaseIseq produced similar results, MNase-seq and NOMe-seq resulted in worse performance by our model (Figure 5 C).Altogether, these results support that PWM, DNA sequence, binding specificity ( ), TF concentration and DNA accessibility can accurately e xplain observ ed binding profiles of TFs at regions of the genome displaying strong and medium binding strength, but only partially at regions displaying weaker binding.Furthermore, ATAC-seq and DNaseI-seq resulted in consistently better predictions than MNase-seq or NOMe-seq in our model, but this might be a reflection of the quality of those particular datasets (see Supplementary Figure S10C, D and Materials and Methods).
In agreement with our findings in K562 cells and mouse cell lines, our analysis re v ealed that CTCF can be classified as both AIF and as partial IDF (Figure 5 D, Supplementary Figure S11 and Supplementary Table S10).Interestingly, the latter is mainly associated with regions of weaker binding, while the former with strong and medium binding.Supplementary Figure S12 shows an example where ChIP analyser r eproduces with high accuracy (AUC = 0.95) the CTCF ChIP-seq data when assuming that it can bind anywhere independent of the le v el of DNA accessibility ( A ) Histogram with the AUC for the optimal parameters of all cases considered (13 TFs in two cell lines analysed with three or four DNA accessibility methods and strong, medium and weak binding genomic regions).( B ) Density plot with the AUC for the optimal parameters of all combinations when the datasets are split based on ChIP-seq signal strength.We performed a Mann-Whitney U test and found that the differences in the means are statistically significant (p-value for strong compared to medium 0.046, strong compared to weak 2.46 × 10 −13 and medium compared to weak 3.37 × 10 −11 ).( C ) Density plot with the AUC for the optimal parameters of all combinations when the datasets are split based on DNA accessibility method.We performed a Mann-Whitney U test and found that ATAC-seq and DNaseI-seq lead to similar results and these are different from MNase-seq and NOMe-seq (p-value for ATAC-seq compared to DNaseI-seq 0.32, MNase-seq compared to NOMe-seq 0.66, ATAC-seq compared to MNase-seq 3.03 × 10 −5 , ATAC-seq compared to NOMe-seq 3.52 × 10 −5 , DNaseI-seq compared to MNase-seq 2.97 × 10 −4 and DNaseI-seq compared to NOMe-seq 5.15 × 10 −4 ).( D ) Classification of the 14 TFs in AIF, partial AIFs / ADFs, ADFs, partial AIFs / IDFs, other and poorly predicted.
(QDA = 0), but misses se v eral peaks (AUC = 0.5) when assuming that it can bind only to the top 5% accessible DN A (QDA = 0.95).Similarl y to CTCF, USF2, CEBPB and NFL2L2 act as AIFs at regions with stronger ChIP signal, but display IDF characteristics at weaker regions.RFX5 and MXI1 act as AIF, mainly at regions with strong binding.
FOXA1, CREB1 and FOS were previously identified as pioneer factors (Supplementary Table S4) and were classified by our analysis as (partial) AIF (Figure 5 D, Supplementary Figure S11 and Supplementary Table S10).Overall, we found that many TFs (nine out of fourteen) behave as AIFs, but pr efer entially at r egions with stronger binding.At regions with weaker binding, they either behave as AIFs or our model cannot accurately capture their binding profile.MAFK is predicted most of the times to act as IDF and most likely to prefer dense chromatin.MAZ and GATA4 were identified to bind preferentially to open chromatin in regions with strong and medium binding.
One possib le e xplana tion for the observa tion tha t fewer TFs are classified as AIFs at regions with weaker binding is that they have weaker binding sites in those regions.To investigate this, we used PWMEnrich ( 64 ) to measure the strength of the binding sites located in peaks within strong, medium and weak binding r egions.Our r esults showed that the majority of TFs (8 out 13) showed only small or negligible reduction in the strength of the binding sites located in medium or weaker binding regions (Supplementary Figure S13).Ne v ertheless, one e xception is CREB1, which displayed a large reduction in the strength of binding sites located in weaker binding regions compared to binding sites located in strong binding regions.

Jun is predicted to bind before the chromatin is open in MCF10A cells upon her2 o ver expr ession
Recent work showed that HER2 ov ere xpression in MCF10A cells resulted in a large number of regions gaining DNA accessibility and a small number losing DNA accessibility ( 65 ).One possibility is that these changes in DNA accessibility can be explained by changes in the le v els of some AIF binding to those regions.HER2 ov ere xpression also resulted in some TFs displaying changes in phosphorylation which could result in an increase or decrease of the number of bound TF molecules.We cross-r efer enced the TFs that were classified at least once as (partial) AIFs in our K562 experiment with the TFs that were shown to change phosphorylation upon HER2 ov ere xpression and found se v en TFs (ATF1, ETV6, JUN, JUND, MYC, NFATC3 and SRF).Using RNA-seq data, we estimated the number of bound molecules in MCF10A by m ultipl ying the number of bound molecules in K562 cells with the ratio between the mRNA le v els for the corresponding TF in the two cells, and kept the same as in K562 analysis (for each TF: N MCF10A = N K562 × mRN A MCF10A / mRN A K562 and MCF10A = K562 ).Rescaling the number of bound molecules by changes in RNA-seq between two cell types or two conditions was shown previously to generate results that reproduced ChIP-seq profiles with high accuracy ( 37 ).Then, for the HER2 ov ere xpression e xperiment, we further rescaled the number of bound molecules based on the change in amounts of phosphorylated TFs ( 65 ).
Figure 6 shows the predicted binding le v els at regions that gained and regions that lost DNA accessibility assuming that these se v en TFs can bind independently of DNA accessibility le v els.Interestingly, ETV6, JUN and SRF ar e pr edicted to bind strongest at these regions.SRF is the only TF that our model predicts to bind in an accessibility independent fashion and shows a noticeable increase in binding upon HER2 ov ere xpression, but that happens at both regions that gained and regions that lost DNA accessibility (also see Supplementary Figure S14).
Our results show that there are negligible differences in the binding of JUN at regions that gained accessibility upon HER2 ov ere xpression, suggesting that this TF is bound ther e befor e the r egions become accessible.JUN is part of the AP-1 complex, which has been reported to have pioneer functions (see Supplementary Table S4).Thus, while JUN can potentially bind inaccessible r egions, ther e isn't a subsequent opening of chromatin; instead, JUN appears to remain bound to closed chromatin.This is consistent with JUN being a bookmarking TF that binds to the DNA in dense chromatin and, when another co-factor is recruited, it could potentially open the chromatin.

DISCUSSION
Using our statistical thermodynamics frame wor k, we systema tically investiga ted the binding characteristics of a large number of human and mouse TFs in different cell lines and focussed our analysis on the relationship between TF binding and DNA accessibility.For se v eral TFs, we also investigated the impact of using different DNA accessibility methods (ATAC-seq, DNaseI-seq, MNase and NOMe-seq) on our binding predictions in depth, to better understand potential sources of bias.In addition, we also compared the effects of modelling regions with stronger binding and regions that display w eaker binding.Overall, w e found that more TFs than previously reported do not display binding pr efer ence for open or dense chr omatin (appr oximately one third), but this generally occurs at their strongest binding sites.Regions of the genome displaying weaker binding often are poorly modelled or show pr efer ence for either open or dense chromatin.In addition, we also provide evidence of DNA accessibility independent binding for TFs that have been previously reported to bind dense chromatin and explain their complex interaction with chromatin.

Sever al tr anscription factors have no or limited pr efer ence f or open or dense chromatin
A previous study proposed that a pproximatel y 16% of TFs display pioneer functions ( 42 ).Here, we find a higher proportion of TFs that bind chromatin independent of the DNA accessibility status (approximately one third).Furthermor e, another r ecent large scale study identified 32 TFs as potential pioneer TFs in se v eral human cell lines ( 66 ).Most importantly, 15 (out of 33) of the TFs that we classify as AIFs (ATF1, BACH1, CREB1, CTCF, CUX1, E2F6, USF2, EGR1, ELK1, GABPA, NFYA, NFYB, USF1, USF2 and ZKSCAN1) and 13 (out of 49) that we classify as partial AIFs (ATF1, CEBPB, E2F1, CUX1, NR4A1, ETS1, HMBOX1, MAX, NFATC3, NR2F2, POU5F1, SP1 and SREBF1) in this study wer e pr eviously identified as potential pioneer TFs using different methods in ( 42 , 66 ).In addition to pioneers, AIFs also include bookmarking TFs (binding both open and dense chromatin) and co-factors of pioneers (being recruited by pioneer factors), which indica tes tha t our estima tes of the number of TFs that bind in an accessibility independent fashion are supported by this previous study.Previous studies have reported that CTCF is a chromatin insulator ( 67 ).Ne v ertheless, CTCF also shows depletion of nucleosomes ∼200 base pairs (bp) at the centre of its binding site ( 68 ).While the former suggests that CT CF pr efers dense chromatin and does not open the chromatin, the latter indicates potential pioneer functions, wher e CT CF binds dense chromatin and opens it.In addition, it is reported that CTCF helps maintain the open chroma tin sta te by steric hindrance of the DNA methylation machinery ( 69 ).Our results also showed that regions displaying weaker binding of CTCF could represent areas of the genome wher e CT CF acts as a potential insulator, while regions with strong binding re v eal areas of the genome wher e CT CF might act as a pioneer factor.Altogether our results confirm that both functions are possible modes of action for CTCF.
Similarly to CTCF, CEBPB and NFE2L2 were identified as AIFs, mainly at stronger binding sites, and partial IDFs at weaker binding sites.CEBPB has been previously reported as having pioneer function and being able to maintain open chromatin ( 70 ), while NFE2L2 (NRF2) was reported as being a transcription activa tor ( 71 ).W hile our results support these previous findings, we also discovered a new role at weaker binding sites for these TFs.We propose that they pr efer entially bind the genome in denser chromatin and either maintain a closed chromatin state, or are unable to open the chroma tin a t those sites.This could be a consequence of the TF not being recruited in sufficient numbers to open the chroma tin a t weak sites, or it could indicate a bookmarking role, such as in the case of JUN.
USF1 and USF2 are members of the highly conserved basic-Helix-Loop-Helix-Leucine Zipper proteins (bHLH-LZ) that bind to the symmetrical DNA sequences called Ebox es ( 72 ).Pr eviously, it has been hypothesised that USF1 and USF2 are able to bind heter ochr omatin and behave as pioneers ( 60 ).USF1 has been shown to form heterodimers that act as insulators, pre v enting the spread of heter ochr omatin ( 73 ).PLAG1 is one cofactor of USF2 that was suggested to display pioneer activity and enable USF2 binding, but the direct interaction was not proven ( 74 , 75 ).Altogether, these previous findings support our results that USF2 binds as an AIF, and USF1 as an AIF or partial IDF.
Our analysis also identified NRF1, ZNF384 and BHLHE40 as AIF.NRF1 binding is methylation restricted in mouse embryonic stem cells, as it can be outcompeted by de novo DNA methylation.This suggests that methylationsensiti v e TFs may rely on neighbouring pioneer TF binding to ensure a h ypometh ylated environment ( 76 ).ZNF384 was shown to interact with a variety of structural and regulatory proteins (vimentin, zyxin, PCBP1), but their individual roles in transcription regulation is not entirely clear ( 77 ).BHLHE40 was shown to induce binding-site directed DNA demethylation and hypothesised to have pioneer function ( 78 , 79 ), which provides additional evidence for our classification as an AIF.
Many TFs in our analysis display accessibility independent binding, but this is mainly the case for regions of the genome for which they have the highest affinity.One question that arises is whether these TFs could truly bind independently of DNA accessibility.In our analysis, RFX5 was predicted to be an AIF only at regions with strong binding.Ne v ertheless, RFX5 was shown to displace nucleosomes, indica ting tha t it has pioneer function ( 80 ).Altogether, these results support the fact that e v en if a TF is found to bind in a DNA accessibility independent way only at its strongest binding sites, it does not mean it cannot be a pioneer factor or that they are pioneer factors and not co-factors or bookmarking TFs.
FOXA1 is one of the best characterised pioneer TFs ( 28 ).In our analysis, FOXA1 was characterised as a partial AIF in K562 cells and in three scenarios in HepG2 (for regions with strong and medium binding) and only in one case as an AIF in HepG2 cells (at regions with medium binding when using ATAC-seq data) (Figures 3 C and 5 D).Interestingly, a previous study also reported that FOXA1 would have a reduced pioneer activity ( 42 ).Nevertheless, it was shown that despite its capacity to bind nucleosomes ( 18 ), its binding is chromatin context dependent ( 81 ).In other words, modifications of the histone tails might have an effect of FoxA1 capacity to bind nucleosomes and could explain why in some of our datasets and conditions we predict to display only partial accessibility independent binding.
JUN is part of the AP-1 complex, which has been proposed as a pioneer factor ( 82 ).Our analysis identified JUN as a partial AIF in K562 cells and mouse cell lines.Nevertheless, we predicted that JUN is already bound at regions that become accessible upon HER2 over expr ession in MCF10A cells.This indica tes tha t JUN is more likely a bookmarking TF that can bind in dense chromatin and r equir es co-factor(s) expressed upon HER2 over expr ession to open the chromatin.
Finally, knockdown of MXI1 was shown to block chroma tin condensa tion ( 83 ) suggesting tha t this TF could potentially act as an IDF.In our analysis MXI1 was consistently classified either as an AIF or as a partial AIF.While our analysis does not contradict the capacity of this TF to bind dense chromatin, it provides additional insights into its binding mechanism.
In our analysis, we did not consider the role of DNA methylation on TF binding.Previous work has shown that some TFs can distinguish between methylated and unmethylated DNA (84)(85)(86)(87)(88).Se v eral pre vious in vitro studies ( 85 , 89 , 90 ) have observed that the binding of a large percentage of TFs seem to be affected by DNA methylation.Howe v er, in studies using cellular context, this is not the case and a recent paper showed that there was no significant change in the function of 97% of enhancers after changes in DNA methyla tion ( 87 ), indica ting tha t binding of TFs in vivo is independent of DNA methylation at many binding sites.Furthermore, another recent study found that both sites with no methylation or intermediate le v els of methylation displayed ChIP-seq peaks for se v eral TFs (including CTCF, CEBPA, and FOXA1) ( 91 ).Some of the TFs included in our analysis that pr efer entially bind unmethylated DNA include CTCF, GABPA, ELK1 and REST.The former three were classified as AIFs in our analysis and, for all thr ee, ChIP analyser pr edicts their ChIP-seq profiles accurately (AUC > 0.85; CTCF = 0.914, GABPA = 0.958 and ELK1 = 0.871) without considering DNA methylation.Supplementary Figure S6 confirms that CTCF displays very strong binding in dense chromatin (with similar strength as in open chromatin) further supporting that its binding is not affected by DNA accessibility.In contrast, GABPA and CREB1 display significantly weaker binding at dense chromatin regions (Supplementary Figure S6) indica ting tha t they might not be capable of binding dense chroma tin a t least by themselves or that their binding is impacted by DNA methylation in dense chromatin.Overall, this shows that without taking DNA methylation into account, ChIPanalyser is able to predict the vast majority of the TF binding sites in a cell type specific manner.

T r anscription factors that prefer dense or open chromatin
MAFK was found to be pr efer entially associated with dense chromatin in most cases and thus classified as an IDF.These r esults ar e supported by the fact that MAFK lacks a transactivating domain ( 92 ) and is mainly associated with heter ochr omatic parts of the genome ( 93 ).
We also identified four TFs with moderate pr efer ence for dense chromatin, namely: REST, ZNF274, ZNF24 and ZNF146.REST has been reported to be preferentially associated with silenced chromatin ( 94 ), ZNF274 is part of a complex that recruits H3K9me3 writers and, thus, is involved in heter ochr omatin establishment and maintenance ( 95 ), ZNF24 is associated with gene r epr ession ( 96 ), while ZNF146 pr efer entially binds at silenced Line-1 elements ( 97 ).These pr evious r eports support our classification of these TFs as IDFs.
TBP, ETS1, JUND, MAZ and GATA4 were TFs that were consistently identified to show binding pr efer ence for open chromatin.Interestingly, GATA4 has been reported to act as a pioneer factor (Supplementary Table S4), but in our analysis, we found only partial AIF properties at regions with medium and weak binding (Figure 5 D).This indica tes tha t GATA4 function might be concentration-and chromatin context-dependent.
W hen investiga ting dif ferent da tasets in the same cell type or in different cell types, we found that the classification of a TF may differ (e.g. Figure 4 B).There are se v eral possibilities that could explain this.One possibility is that the datasets have different qualities in terms of quality of the antibody or library and fragment size.Modelling datasets with different qualities can result in classifica tion dif ferences but selecting the most reproducible classification from biological replicates for the same TF can remove some of these biases.Alternati v ely, TFs will display different concentrations in different cell lines and, as seen in the analysis of JUN in K562, this can result in different preferences for dense chromatin.Finally, co-factors of TFs could be expressed in one cell type and not in another.This could help differentiate if a TF can open the chromatin by itself (it is modelled as an AIF in all cell types where it is expressed) or if it is also modelled as a bookmarking TF (modelled as an AIF in cell types where the pioneer co-factor is expressed and as an IDF where the co-factor is not expressed).

DA T A A V AILABILITY
All scripts used for pre-processing and further analysis can be accessed at https://doi.org/10.5281/zenodo.8085275 .

SUPPLEMENT ARY DA T A
Supplementary Data are available at NAR Online.

Figur e 1 .
Figur e 1. Wor kflow over view.( A ) ChIPanal yser models TF binding based on PWM motifs , DNA sequence , and DNA accessibility data together with ChIP-seq data.After extracting TF binding profiles from ChIP-seq data, we split the genomic regions into 50 Kb bins and selected the top 10 regions with the highest number of ChIP-seq peaks as training regions, and the following 50 regions with the highest number of ChIP-seq peaks as validation r egions.The ChIP analyser model was tr ained on the tr aining r egions to minimise mean squar ed error (MSE), then valida ted to estima te the accuracy of the binding profile predictions.(B, C) Approach to investigate TF pr efer ence for DN A accessibility. ( B ) DN A accessibility data (DNaseI-seq or ATAC-seq) is used to ( C ) select the regions of the genome that have a signal above a threshold (e.g.QDA of 0.1 results in selecting the 90% of the genome with the highest le v els of accessibility signal).( D ) Graphical r epr esentation of the binding properties of each class of TF and how each class af fects chroma tin structure after binding.Accessibility Independent Factors (AIFs) bind open or dense chromatin without preference, but AIF binding sometimes can displace nucleosomes.Accessibility Dependent Factors (ADFs) bind open chromatin only and no changes in accessibility are observed after binding.Finally, Inaccessibility Dependent Factors (IDFs) bind chromatin in thr ee differ ent scenarios: (i) bind dense chromatin and maintain it (no changes are observed); (ii) bind dense chromatin and r einfor ce it, thus changes are observed in the increased number of nucleosomes and lastly (iii) bind open chromatin, which becomes compacted.The different classes of TFs that group as AIFs (pioneer, bookmarking, chromatin remodeller and co-factors of those), ADFs (traditional TFs and their co-factors) and IDFs (insulators, chromatin remodeller and co-factors of those) are shown in the side panels.

Figure 2 .
Figure 2.Model of CTCF binding.We plot the analysis of CTCF binding in K562 cells.( A ) Heatmap showing the optimal range for the optimal QDA for CTCF (0.9). ( B ) AUC for the optimal parameters estimated for CTCF for all QDAs.( C ) ChIP profile estimated with ChIPanalyser based on the optimal parameters.The grey shaded area represents the ChIP signal, the orange line represents the ChIPanalyser prediction of the ChIP signal, the blue lines r epr esent occupancy at each locus, the yellow shaded areas represent closed chromatin, and the white shaded areas represent open chromatin.

Figure 3 .
Figure 3. K562 classification of TFs .( A ) We present the rules used to group the TFs in the dif ferent classes.(B-G) Hea tmaps with the AUC for the optimal parameters estimated for each TF for all QDAs: ( B ) AIFs, ( C ) partial AIFs / ADFs, ( D ) ADFs, ( E ) partial AIFs / IDFs, ( F ) other and ( G ) poorly predicted.Each row r epr esents a TF and each column an accessibility thr eshold (QDA value).The blue colour r epr esents the AUC le v el for the corresponding QDA and TF.We mark with green and bold the TFs that were previously reported to act as pioneer TFs.For FOS, there are two different datasets leading to opposite results and that was highlighted by bold and orange colour of the text.Finally, we mark by purple and bold TFs that display a decrease in AUC with increasing the accessibility, which are potential Inaccessibility Dependent Factors (IDFs).

Figure 4 .
Figure 4. Classification of TFs in mouse cell lines .( A ) Histogram with the AUC for the optimal parameters of the 60 TFs analysed in mouse cells.( B ) Bar plot r epr esenting the differ ent classifica tions for each TF in the mouse cell line.We also included the classifica tion in human cells for the corresponding TF if that analysis was performed in K562 cells.(C-H) Heatmaps with the AUC for the optimal parameters estimated for each TF for all QDAs: ( C ) AIFs, ( D ) partial AIFs / ADFs , ( E ) ADFs , ( F ) partial AIFs / IDFs , ( G ) other and ( H ) poorly predicted.Each row represents a TF and each column an accessibility threshold (QDA value).The blue colour represents the AUC le v el for the corresponding QDA and TF.

Figure 5 .
Figure 5.Effect of DNA accessibility method and ChIP-seq signal strength on TF classification .We consider 14 ChIP-seq datasets in IMR90 and HepG2 cells and investigated the effect of DNA accessibility method (DNaseI-seq, ATAC-seq, MNase-seq and NOME-seq in IMR90 and DNaseI-seq, ATAC-seq and MNase-seq in HepG2) and ChIP-seq signal strength (validation regions with strong, medium and weak ChIP-seq signals) on the classification of TFs.(A ) Histogram with the AUC for the optimal parameters of all cases considered (13 TFs in two cell lines analysed with three or four DNA accessibility methods and strong, medium and weak binding genomic regions).( B ) Density plot with the AUC for the optimal parameters of all combinations when the datasets are split based on ChIP-seq signal strength.We performed a Mann-Whitney U test and found that the differences in the means are statistically significant (p-value for strong compared to medium 0.046, strong compared to weak 2.46 × 10 −13 and medium compared to weak 3.37 × 10 −11 ).( C ) Density plot with the AUC for the optimal parameters of all combinations when the datasets are split based on DNA accessibility method.We performed a Mann-Whitney U test and found that ATAC-seq and DNaseI-seq lead to similar results and these are different from MNase-seq and NOMe-seq (p-value for ATAC-seq compared to DNaseI-seq 0.32, MNase-seq compared to NOMe-seq 0.66, ATAC-seq compared to MNase-seq 3.03 × 10 −5 , ATAC-seq compared to NOMe-seq 3.52 × 10 −5 , DNaseI-seq compared to MNase-seq 2.97 × 10 −4 and DNaseI-seq compared to NOMe-seq 5.15 × 10 −4 ).( D ) Classification of the 14 TFs in AIF, partial AIFs / ADFs, ADFs, partial AIFs / IDFs, other and poorly predicted.

Figure 6 .
Figure 6.Prediction of TF binding in MCF10A cells upon HER2 ov ere xpression .We considered the case of WT MCF10A cells and MCF10A cells were HER2 is over expr essed.Over expr ession of HER2 resulted in se v eral regions gaining DNA accessibility and a few losing DN A accessibility.Heatma ps with the prediction of TF binding using ChIPanalyser for ATF1, ETV6, JUN, JUND, MYC, NFATC3 and SRF at regions that lost or gained DNA accessibility ( ±2 kb) in control and HER2 over expression.