Mechanisms of in vivo binding site selection of the hematopoietic master transcription factor PU.1

The transcription factor PU.1 is crucial for the development of many hematopoietic lineages and its binding patterns significantly change during differentiation processes. However, the ‘rules’ for binding or non-binding of potential binding sites are only partially understood. To unveil basic characteristics of PU.1 binding site selection in different cell types, we studied the binding properties of PU.1 during human macrophage differentiation. Using in vivo and in vitro binding assays, as well as computational prediction, we show that PU.1 selects its binding sites primarily based on sequence affinity, which results in the frequent autonomous binding of high-affinity sites in DNase I-inaccessible regions (25–45% of all occupied sites). Increasing PU.1 concentrations and the availability of cooperative transcription factor interactions during lineage differentiation both decrease affinity thresholds for in vivo binding and fine-tune cell type-specific PU.1 binding, which seems to be largely independent of DNA methylation. Occupied sites were predominantly detected in active chromatin domains, which are characterized by higher densities of PU.1 recognition sites and neighboring motifs for cooperative transcription factors. Our study supports a model of PU.1 binding control that involves motif-binding affinity, PU.1 concentration, cooperativeness with neighboring transcription factor sites and chromatin domain accessibility, which likely applies to all PU.1-expressing cells.


Pham et al.

Mass spectrometry analysis of bisulfite-converted DNA -
Genomic regions for MassARRAY analyses were chosen that indicated either an epigenetic transition from HPC to monocytes (induction of H3K4me1) or the induced binding of transcription factors in combination with de novo H3K4me1 appearance during monocyte to macrophage differentiation. Primer design, sodium bisulfite conversion, amplification and MALDI-TOF mass spectrometry (MassARRAY Compact MALDI-TOF, Sequenom, San Diego, CA) were done as described. Methylation was quantified from mass spectra using the EpiTYPER software (Sequenom, San Diego, CA). The following primers were used to generate amplicons from bisulfite-treated DNA:

Microscale Thermophoresis -
The sequence of the full-length hPU.1 was amplified by PCR from pORF9-hSPI1 (InvivoGen, San Diego, USA) and recombined into a modified pDM8 vector, encoding an N-terminal His-tag, using the Gateway technology (Life Technologies). The protein was expressed in Rosetta 2(DE3)pLysS (Novagen) and purified by nickel affinity chromatography (Qiagen). Double-stranded DNA molecules were annealed from single-stranded, HPLC-purified oligonucleotides (Sigma-Aldrich). The annealing reaction (10 µl) was performed in 1x annealing buffer (20 mM Tris-HCl pH 7.4, 2 mM MgCl2, 50 mM NaCl) and comprised 20 µM of the Cy3-labeled oligonucleotide (upper strand) and 20.8 µM of the unlabeled oligonucleotide (lower strand). The annealing reaction was incubated for 15 min at 95°C in a thermoblock (peQLab) and afterwards allowed to cool slowly to room temperature overnight. The annealing reaction was checked on an 8% native polyacrylamide gel, which was analyzed on a fluorescence imager (FLA-5000, Fujifilm).
The unlabeled protein was titrated in a 1:1 dilution series starting with a concentration of 23 µM. Every binding assay comprised one control reaction without any protein. After loading the binding reactions into standard capillaries (NT.115), the mixture was incubated for 15 min at 25 °C in the NanoTemper instrument before starting the measurement. The data were analyzed using the NT-analysis acquisition software.

Figure Methods
Examples of thermophoresis curves obtained for four representative motifs. The top motif represents the motif with the highest PWM log-odds score, the bottom motif (mutated) contains a three nucleotide exchange in the core recognition site of PU.1 and shows no detectable binding.

ChIP-seq peak finding and annotation -
Analysis of mapped ChIP-seq tags was performed using HOMER, which is freely available at http://biowhat.ucsd.edu/homer/. ChIP-seq quality control, transcription factor peak finding, TSS annotation (based on GENCODE V13) and motif analysis were done essentially as described [1,2]. Genome Ontology annotation and ChIP-seq tag annotation of peak sets or motif-centered regions were done using scripts provided by HOMER. Next-generation sequencing data (either published or generated in this study) that were used in this study are listed below:

(A) Published sequencing data used in this study

De novo motif searches -
Motif enrichment in transcription factor peak sets was done using HOMER by comparing sequences of cell type-specific peaks (+/-100 bp) to 50,000 randomly selected genomic fragments of the same size, matched for GC content and autonormalized to remove bias from lower-order oligo sequences. Due to the numerous enrichment tests made during the motif discovery procedure and the vast search space, corrections for multiple hypothesis testing must be carried out empirically by randomizing the target and background assignments and repeating the motif discovery procedure. One hundred randomizations (which were performed for each individual motif search) failed to yield motifs with enrichment P-values less than 1e-19, implying that the false discovery rate for motifs with a P-value less than 1e-19 reported in this study is < 1%. Motif enrichment around bound motifs (+/-100 bp) was done by comparing motif-centered regions with non-overlapping, GC-matched, and autonormalized regions centered on non-bound motifs. In either case, motif enrichment is calculated using the cumulative hypergeometric distribution by considering the total number of target and background sequence regions containing at least one instance of the motif.
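The cumulative hypergeometric test described above can be sketched as follows. The sketch computes the upper-tail probability of observing at least the given number of motif-containing target regions, assuming that the population consists of all target plus background regions. This is a simplified illustration and not HOMER's implementation; the function names and the example counts are hypothetical.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k), computed via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def hypergeom_enrichment_p(n_target, n_target_hits, n_background, n_background_hits):
    """P(X >= n_target_hits) under the hypergeometric distribution.

    Population: all target and background regions (assumption).
    Successes in population: regions containing at least one motif instance.
    Draws: the target regions.
    """
    total = n_target + n_background
    total_hits = n_target_hits + n_background_hits
    log_denominator = log_comb(total, n_target)
    upper = min(total_hits, n_target)
    log_terms = [
        log_comb(total_hits, i) + log_comb(total - total_hits, n_target - i) - log_denominator
        for i in range(n_target_hits, upper + 1)
    ]
    # Sum the terms in log space to avoid underflow for very small P-values.
    peak = max(log_terms)
    return min(1.0, math.exp(peak) * sum(math.exp(t - peak) for t in log_terms))


# Hypothetical counts: 400 of 1,000 target regions and 5,000 of 50,000
# background regions contain the motif.
print(hypergeom_enrichment_p(1000, 400, 50000, 5000))
```

Working in log space keeps the calculation stable for the very small P-values (for example below 1e-19) that are used as thresholds in this kind of analysis.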
Reannotation of the PWM to the human genome (hg19, either total or repeat-masked) was done using the scanMotifGenomeWide.pl script contained in the HOMER suite. HPC, MO, and MAC PU.1 ChIP-seq tags were counted around all motif instances (+/-100 bp) across the repeat-masked human genome to determine non-bound PU.1 motifs (no ChIP-seq tag within the 200-bp window) across the non-repetitive genome. To determine the total set of bound PU.1 motifs, all bound PU.1 motif instances from HPC, MO and MAC were merged using bedTools. Extraction of log-odds scores for individual motifs or peaks was done using the annotatePeaks.pl program (provided by HOMER), which returns the highest scoring motif position as well as log-odds scores for each peak/region. Sequences were extracted using homerTools (provided by HOMER).

Table S1
Complete results of the microscale thermophoresis measurements for 75 selected sequences included in the PU.1 PWM.
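The log-odds scores referred to in the PWM reannotation step of the Methods can be illustrated with a minimal sketch. It scores every window of a sequence on both strands against a position probability matrix and reports the best-scoring position. The sketch assumes a uniform background of 0.25 per nucleotide and natural logarithms. The 4-bp matrix is a toy example and not the PU.1 PWM used in this study, and the code is not a substitute for scanMotifGenomeWide.pl or annotatePeaks.pl.

```python
import math

BACKGROUND = 0.25  # assumed uniform nucleotide background
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def log_odds_score(site, pwm):
    """Sum over positions of ln(p(base) / background)."""
    return sum(math.log(column[base] / BACKGROUND) for base, column in zip(site, pwm))


def best_hit(sequence, pwm):
    """Return (start, strand, score) of the highest-scoring window, or None."""
    width = len(pwm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if set(window) - set("ACGT"):
            continue  # skip windows with masked or ambiguous bases (e.g. N)
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            score = log_odds_score(site, pwm)
            if best is None or score > best[2]:
                best = (start, strand, score)
    return best


def column(favored):
    """Toy matrix column that strongly favors one base (hypothetical values)."""
    return {base: (0.97 if base == favored else 0.01) for base in "ACGT"}


TOY_PWM = [column(base) for base in "GGAA"]  # toy GGAA core, not the PU.1 PWM

print(best_hit("TTGGAATT", TOY_PWM))
print(best_hit("AATTCCAA", TOY_PWM))
```

A detection threshold on this score then decides whether a window is reported as a motif instance; higher scores indicate closer agreement with the matrix.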

Table S2
Characteristics of chromatin domain categories.

Figure S1
Dynamics of PU.1 binding during HPC to monocyte differentiation.

Figure S2
Genome Ontology enrichment analysis for bound and non-bound PU.1 motifs

Figure S3
Distribution of epigenetic marks at bound and non-bound PU.1 consensus sequences.

Figure S4
Relationship between transcription factor binding and DNA methylation.

Figure S5
Motif composition around bound vs. non-bound PU.1 consensus sites or around PU.1 peaks not recognized by the consensus PU.1 motif.

Figure S6
Distribution of motif log-odds scores at PU.1 bound regions for three cell stages and alternative PWM.

Figure S7
Comparison of motif log-odds scores with signal intensity Z scores from protein binding microarray (PBM) experiments.

Figure S9
Bound PU.1 motifs with differentiation-dependent DNase I accessibility changes.

Figure S11
Expression correlation in CTCF-flanked domains contingent on their H3K4me1 level.

Figure S12
Motif distribution in CTCF-flanked domains contingent on their H3K4me1 level.

Figure S13
Distribution of motif-associated PU.1 tag counts in CTCF-flanked domains contingent on their H3K4me1 level.

Figure S14
Motif score distribution and DNase I accessibility in CTCF-flanked domains contingent on their H3K4me1 level.

Figure S15
Cell type-specific domain activities in MO, MAC and HPC.

Figure S16
Motif analyses in domains with differential activity between MO and liver.

Figure S17
Enrichment of PU.1 co-associated transcription factor consensus sites in domains showing cell type-specific activity

Figure S18
Distribution of PU.1 motifs across domain categories.

[...], the remaining data was generated by the ENCODE or the Roadmap Epigenomics projects (high-throughput sequencing data sets used in this study are listed in Table S1).

Heat maps depict the methylation status of individual CpGs from red (100%) over blue (50%) to yellow (0%), with each box representing a single CpG. Data of at least three independent donors were averaged.

Figure S6
Distribution of motif log-odds scores at PU.1 bound regions for three cell stages and alternative PWM. (A-D) Combined bean & box plots showing the distribution of motif log-odds scores of annotated PU.1 motifs (white boxes) or best scoring motifs within total (blue boxes) or cell type-specific peaks (light blue boxes). (A) corresponds to the motif de novo extracted from human HPC PU.1 peaks, (B) corresponds to a motif de novo extracted from mouse peritoneal macrophage PU.1 peaks. In (C,D), de novo derived motifs from human macrophage and monocyte PU.1 peaks were used, which were generated using normalized motif frequencies to correct for the depletion of CpG-containing motifs. Solid bars of boxes display the interquartile ranges (25-75%) with an intersection as the median; whiskers, max/min values. Significantly different motif score distributions in pairwise comparisons are indicated (*** P < 0.001, Mann-Whitney U test, two-sided). The log-odds score representing the motif detection threshold is indicated by the horizontal dotted line. The motif logos are shown on top of each plot along with the fraction of PU.1 bound regions (200 bp) containing at least one motif instance, the expected frequency of the motif in random sequences (in parentheses) as well as P values (hypergeometric) for the overrepresentation

Figure S7
Comparison of motif log-odds scores with signal intensity Z scores from protein binding microarray (PBM) experiments. Published PBM data for the DNA binding PU.1 ETS domain and all possible 8-mers was used to compare PBM signal intensity Z scores, which represent a measure of protein affinity, with motif log-odds scores. To adjust for size differences between both measures, we focussed on the central 8-mers in our PWM (NNGGAANN). If several 12-mers of the original PWM overlapped a core PWM 8-mer, the highest log-odds score was assigned to it. The scatterplot shows a good agreement between both measures (coefficient of determination R² = 0.59).
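The size adjustment described in this legend can be sketched as follows: each 12-mer is reduced to its central 8-mer, the highest log-odds score among all 12-mers sharing that 8-mer is kept, and the agreement with a second measure is summarized as the squared Pearson correlation. The offset of 2 nucleotides for the central 8-mer and all sequences and scores in the example are illustrative assumptions, not data from the study.

```python
def core_8mer_scores(twelve_mer_scores, offset=2):
    """Map each central 8-mer to the highest log-odds score of any 12-mer containing it.

    offset=2 assumes the 8-mer is centered within the 12-mer (assumption).
    """
    best = {}
    for site, score in twelve_mer_scores.items():
        core = site[offset:offset + 8]
        if core not in best or score > best[core]:
            best[core] = score
    return best


def r_squared(xs, ys):
    """Coefficient of determination of a simple linear fit (squared Pearson r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


# Hypothetical 12-mers and log-odds scores (NNGGAANN core at positions 3-10).
example_scores = {
    "AAAAGGAAGTGA": 9.0,
    "CCAAGGAAGTCC": 7.5,
    "TTCCGGAACTTT": 5.0,
}
print(core_8mer_scores(example_scores))
```

The resulting 8-mer scores can then be paired with PBM Z scores for the same 8-mers and passed to `r_squared`.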

Figure S13
Distribution of motif-associated PU.1 tag counts in CTCF-flanked domains contingent on their H3K4me1 level. The PU.1 PWM was mapped across the masked human genome and all recognized sites were binned contingent on their motif log-odds scores and their location in CTCF-flanked domains. Bean plots show the distribution of PU.1 ChIP-seq tag counts (TC) associated with motifs. High-score motifs show the highest PU.1 tag counts and are almost always bound in domains with high activity. The dotted line represents a TC of 12, which was used to define binding events.
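The assignment of motifs to bound and non-bound classes can be illustrated with a minimal sketch that counts ChIP-seq tags in a +/-100 bp window around a motif center and applies the TC threshold of 12. Whether the threshold is inclusive, how tag counts are normalized, and the label for motifs with tag counts between 0 and the threshold are assumptions made here for illustration only.

```python
from bisect import bisect_left, bisect_right


def tags_in_window(sorted_tag_positions, motif_center, half_window=100):
    """Number of tags within [center - half_window, center + half_window].

    Expects tag positions on one chromosome, sorted in ascending order.
    """
    lo = bisect_left(sorted_tag_positions, motif_center - half_window)
    hi = bisect_right(sorted_tag_positions, motif_center + half_window)
    return hi - lo


def classify_motif(tag_count, bound_threshold=12):
    """Classify a motif by its tag count (threshold treated as inclusive: assumption)."""
    if tag_count == 0:
        return "non-bound"
    if tag_count >= bound_threshold:
        return "bound"
    return "intermediate"  # hypothetical label for motifs in neither class


# Hypothetical tag positions and motif center.
tags = [50, 120, 180, 200, 305]
count = tags_in_window(tags, motif_center=200)
print(count, classify_motif(count))
```

Using binary search on sorted positions keeps the counting step fast when many motif instances are evaluated against the same tag list.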

Figure S17
Enrichment of PU.1 co-associated transcription factor consensus sites in domains showing cell type-specific activity. Hierarchical clustering (Pearson correlation uncentered, average linkage) of enrichment values for co-association of the indicated PU.1-bound consensus motifs (within +/-100 bp) in liver- or osteoblast (OB)-specific domains and MO-specific domains (compared to either liver or OB). P values for motif co-enrichment were calculated using the hypergeometric test relative to the distribution in the total repeat-masked set. Data are presented as a heatmap where red (blue) coloring indicates a significant enrichment (depletion) of motif co-occurrence. Numbers in boxes represent corresponding relative changes in motif co-enrichment.