Discovery and validation of information theory-based transcription factor and cofactor binding site motifs

Abstract Data from ChIP-seq experiments can derive the genome-wide binding specificities of transcription factors (TFs) and other regulatory proteins. We analyzed 765 ENCODE ChIP-seq peak datasets of 207 human TFs with a novel motif discovery pipeline based on recursive, thresholded entropy minimization. This approach, while obviating the need to compensate for skewed nucleotide composition, distinguishes true binding motifs from noise, quantifies the strengths of individual binding sites based on computed affinity and detects adjacent cofactor binding sites that coordinate with the targets of primary, immunoprecipitated TFs. We obtained contiguous and bipartite information theory-based position weight matrices (iPWMs) for 93 sequence-specific TFs, discovered 23 cofactor motifs for 127 TFs and revealed six high-confidence novel motifs. The reliability and accuracy of these iPWMs were determined via four independent validation methods, including the detection of experimentally proven binding sites, explanation of effects of characterized SNPs, comparison with previously published motifs and statistical analyses. We also predict previously unreported TF coregulatory interactions (e.g. TF complexes). These iPWMs constitute a powerful tool for predicting the effects of sequence variants in known binding sites, performing mutation analysis on regulatory SNPs and predicting previously unrecognized binding sites and target genes.


INTRODUCTION
Transcription factors (TFs) interact with regulatory elements in genes to mediate positive or negative regulation of tissue-and stage-specific expression (1,2). TFs either directly bind to DNA by recognizing specific sequence motifs, or indirectly interact as partners (or cofactors) of sequence-specific TFs (3). Interactions between these two types of TFs, as well as between sequence-specific TFs, abound across the whole genome (3,4). For instance, NF-Y extensively coassociates with FOS over all chromatin states and CTCF extensively colocalizes with cohesins consisting of SMC1/SMC3 heterodimers and two non-SMC subunits RAD21 and SCC3 (5,6). The genome-wide distributions of both types of bound TFs have been analyzed by sequence analysis of immunoprecipitated chromatin (ChIPseq) (7). ChIP-seq can identify the repertoire of binding site sequences in a genome, and often pull down binding sites of coregulatory cofactors.
Sequence-specific TFs either recognize contiguous sequence motifs, or form homodimeric or heterodimeric structures that contact half sites separated by gaps that together comprise bipartite binding sites (8). Although generally the binding sequences of TFs are well conserved, significant variability at most positions of their binding motifs characterizes most TFs. Information theory-based position weight matrices (iPWMs) can quantitatively and accurately describe these base preferences. A contiguous iPWM is derived from a set of aligned binding sites using Shannon information theory and a uniform background nucleotide composition (9,10). This approach may be more appropriate for defining binding sites than Relative Entropy because the contacts between the TF and the nucleotides do not depend on the background genomic composition (10,11). A bipartite iPWM consists of two contiguous, adjacent iP-WMs, each corresponding to a half site, separated by a range of sequence gaps. The individual information content (R i ) of a TF-bound sequence, which represents the affinity of the TF-DNA interaction, is the dot product between the binary matrix of the sequence and an iPWM of the TF (10). The R sequence value of an iPWM is the mean of the R i values of all the binding site sequences used to compute the iPWM, and represents the average binding affinity (12). Our laboratory previously developed the Bipad software to generate bipartite (and contiguous) iPWMs from ChIP-seq data (8).
TF binding motifs have been derived from both experimental evidence and computational approaches. Weirauch et al. (13) measured TF binding by octanucleotide microar-

ENCODE ChIP-seq datasets
The ENCODE Consortium conducted ChIP-seq assays for human TFs and generated initial peak datasets for each replicate of each assay using a uniform peak calling pipeline (7,17). For some assays, these analyses produced optimal and conservative IDR-thresholded peaks after applying the IDR (irreproducible discovery rate) framework to the initial datasets to improve consistency of motifs obtained from multiple biological replicates. In addition, Factorbook (3,18) also reports motifs from refined datasets (limited to the top 500 peaks) generated by the SPP peak calling software (19).
We started with the IDR-thresholded peak datasets, because we found that these data are more likely to produce primary or cofactor motifs than the initial (i.e. unprocessed) datasets; they contain greater numbers of ChIPseq peaks (and thus more binding sites) than the truncated SPP datasets. The initial, unfiltered datasets were examined if neither IDR-thresholded nor SPP datasets were available.

The Maskminent motif discovery pipeline
Initially, iPWMs from ChIP-seq reads were derived by entropy minimization with Bipad (Supplementary Methods). However, we noted that these iPWMs sometimes exhibited cofactor or noise motifs, rather than the expected primary motifs. In order to improve detection of primary motifs, the Maskminent software, which implements a generalization of the objective function used in Bipad, enables new motif discovery by recursively masking sequences detected by previous analyses of a ChIP-seq dataset while defining thresholds for inclusion of the maximum number of top peaks to eliminate peaks with lower signal intensities whose inclusion can result in emergence of noise over primary or cofactor motifs (Supplementary Methods). Multiple ChIP-seq datasets from distinct cell lines for the same TF, if available, were examined for enriched sequence motifs to assess whether this approach was reproducible, and discover tissue-specific sequence preferences between these sources. This masking technique, which contrasts with the likelihood approach used by MEME (20), provides a means of discovering additional conserved motifs adjacent to primary TF binding sites within the same datasets. The sequences detected by motifs found in previous iterations are masked and the next lowest entropy motif is derived. The coordinates of all the predicted binding sites in a dataset scanned with prior iPWMs are recorded and skipped in the subsequent reanalysis. The specified parameters include the length of the motif, number of Monte Carlo cycles used in entropy minimization, a motif masking file for recursion, and for bipartite binding sites, the lengths of the left and right motifs and the gap length range between the half sites (Supplemental Methods). Once a motif is generated, another program, Scan, is used to detect binding sites in a DNA sequence and determine their respective information contents, or binding strengths.
To eliminate noisy patterns that suppress the expected TF binding motifs due to ChIP-seq peaks with low signal strengths (i.e. read counts), the dataset is truncated based on signal strengths as follows ( Figure 1). First, all the peaks are ranked in the descending order of strengths, and the top 200 peaks are selected. If the iPWM derived from the top 200 peaks exhibits the primary/cofactor motif, then the minimum threshold peak strength is contained within the range from the strength of the 200th peak (i.e. the initial value of G) to the peak with the weakest signal (i.e. the initial value of S). A half-interval search iterated over sets of progressively weaker peaks narrows this range until the number of peaks contained in the range is ≤500. The value of G is the threshold peak signal strength above which the top peaks can still produce the primary/cofactor motif. The minimum threshold obtained for G (i.e. the final value of G) defines the approximate peak set containing the maximum number of top peaks that can produce the primary/cofactor motif.

Binding site motif validation
The methods used to evaluate the accuracy of our iPWMs include: i) To detect experimentally proven binding sites in known target genes, derived iPWMs were used to evaluate the R i value of each site; ii) To predict changes in binding site strength, characterized variants were evaluated with the corresponding iP-WMs. The predicted changes were compared with experimentally supported effects on TF binding or gene expression; iii) The iPWMs were compared with the corresponding annotated motifs in the CIS-BP database (13) based on their normalized Euclidean distances; iv) To distinguish true binding motifs from noise motifs, we delineated the relationship between R i values of binding sites discovered by the iPWM and their corresponding binding energy (i.e. higher R i values have lower binding energies) (Supplementary Methods). Primary/cofactor motifs are expected to demonstrate this relationship, whereas noise motifs are not; that is, for primary/cofactor motifs, the linear regression fit between R i values and binding energy are expected to have slopes well below 0 which is the expected slope for noise motifs. After applying F-tests to evaluate this relationship, F-values for the two categories of motifs were compared using a Mann-Whitney U test.

RESULTS
The derived iPWMs displayed primary motifs for 93 TFs (Supplementary Table S1), as well as 23 cofactor motifs for 127 primary TFs (Supplementary Table S2). We also describe 6 high-confidence novel motifs that have not been previously annotated in these ChIP-seq data (Supplementary Table S3). The initial iPWMs directly exhibited primary motifs for 76 TFs and 18 cofactor motifs for 107 primary TFs. Thresholding the datasets revealed 31 primary motifs and 14 cofactors for 38 primary TFs. We used the masking technique to discover an additional 4 primary motifs; 7 cofactor motifs were also found in 21 datasets (Supplementary Tables S1  and S2).
For each TF ChIP-seq dataset with a derived primary motif (n = 367), we determined the false positive detection rate from the null R i distribution, which is approximately Gaussian (12). The iPWM was used to scan for binding sites in a random 10 000 nucleotide sequence that conserved the mono-and dinucleotide composition as the dataset (Supplementary Table S1). The means of all null distributions range from −97.5 to −12.3 bits with standard deviations from 6.9 to 22.5 bits. The probabilities of observing a potentially functional binding site, i.e. with R i > 0, in these sequences range from 1.2E-7 to 0.06.
Similarly, the independence of contributions of each position in a binding site to the overall information content was analyzed for one iPWM of each primary motif. The total mutual information, which measures the interdependence between individual positions in the same binding site, was determined by summing the pairwise mutual information at each position (Supplementary Table S1). Then, the percentage of the total mutual information relative to the average information, R sequence, was determined. For 83 TFs (∼89.2%), <10% of the information present in the iPWM is dependent, and for 62 TFs (∼66.7%), <5% is dependent. Neglecting the interactions between positions introduces a minimal error into the calculation of R i values of binding sites, and would be expected to have little impact on assessment of the mutations in these sequences.

Primary binding motifs
Contiguous iPWMs. Correct iPWMs were successfully derived for 65 TFs with contiguous binding motifs, which are concordant with published descriptions of these motifs (3). All of these motifs can be characterized as degenerate and do not correspond to published consensus sequences. Consensus sequences miss TF binding sites of weak or intermediate strength (16). We determined the frequencies of such sequences appearing on a genome scale for 10 TFs by counting the peaks containing these sequences in their respective datasets ( Figure 2A). Surprisingly, only 0.015-7.3% of all peaks contain binding sites with these sequences, demonstrating that these sites are extremely rare in ChIPseq datasets. Thus, intermediate and low-affinity TF-DNA interactions are the most prevalent in vivo and are able to regulate gene expression (21).
Bipartite iPWMs. For 19 TFs, bipartite iPWMs were successfully derived and were in agreement with previously reported motifs. The following examples illustrate key insights that can be taken from bipartite modeling: i) El Marzouk et al. (22) demonstrated that ESR1 is able to recognize binding sites with half sites separated by nucleotide spacer lengths from 0 to 4 nt, in which sites containing a 3 nt spacer are most common and have the highest binding affinities. We allowed the spacer length to vary from 0 to 5 nt in bipartite iPWMs derived from the T47D cell line data. The resultant iPWMs show the documented predominant sequences and are palindromic. The bipartite iPWM exceeds the average information content of the corresponding contiguous iPWM prepared from the same dataset, and the dominant gap between half sites is 3 nt ( Figure 2B). Nevertheless, 333 binding sites (∼9%) in this iPWM exhibit a 5 nt spacer, implying that ESR1 may be capable of binding to sites that were not previously detected. The symmetry between the half sites exhibited by the bipartite iPWMs suggests that dimeric ESR1 may bind a narrow range of sequences with similar half site affinities. ii) The palindromic predominant sequence of the AP2 family is 5 -GCCN 3 GGC-3 , and other binding sequences confirmed in an in vitro bindingsite selection assay include 5 -GCCN 4 GGC-3 and 5 -GCCN 3/4 GGG-3 . Another binding site 5 -CCCCAGGC-3 was also found in the SV40 enhancer (23). The spacer lengths in the bipartite iPWMs for AP2A and AP2C range from 2 to 4 nt, which is representative of the genome-wide pool of true binding sites ( Figure 2B). We also noted that the two outermost positions are the most variable, and that adenine (instead of the consensus guanine) can also appear at the first position of the right half site. These bipartite iPWMs exhibit similar conservation levels across all the individual positions, suggesting that these binding sites of the two AP2 members may exhibit similar degrees of binding affinity, though iPWMs can recognize different sequences. iii) The predominant spacer length separating half sites recognized by STAT1 is 3 nt; however, previous reports describe sites with a 2 nt gap, but not those separated by 4 nt (24). However, the STAT1 bipartite iPWM is based on 1709 binding sites (∼18%) with a 4 nt spacer, with most half sites separated by 2 or 3 nt ( Figure 2B). The left-and rightmost nucleotides are nearly invariant, whereas the inner 2 nt contacts in each half site are variable. iv) NFE2 and BACH1 heterodimerize with the MAF family (MAFF, MAFG and MAFK), and recognize two types of bipartite palindromic motifs, defined by the predominant binding sites TGCTGA(C)TCAGCA and TGCTGA(CG)TCAGCA (25). The previously reported binding motifs (3) are contiguous, and do not account for the dimeric interaction that gives rise to this bipartite binding pattern. The bipartite iPWMs indicate that the inner six positions surrounding the dominant 1 nt spacer exhibit higher information contents than the outer six positions ( Figure 2B).

Comparing iPWMs for the same TF in distinct cell lines.
Cell-type-specific differences between iPWMs of the same TF were evident for certain contiguous and bipartite motifs. For instance, among the three contiguous iPWMs of ESR1 derived from the ECC1 steroid-responsive endometrial cell line, conservation levels in the respective half sites are asymmetric, whereas the average information of these half sites are much more symmetric in iPWMs derived from T47D, a breast tumor cell line ( Figure 3A). For the TFs MAFF and MAFK, the discrepancy between the bipartite iPWMs from K562 and HepG2 cells is evident: the outer six positions show a greater degree of conservation than the internal six positions in HepG2, but in K562 the opposite trend is illustrated ( Figure 3A). The MAFK iPWM derived from ChIP-seq data of IMR90 cells resembles the HepG2 iPWMs, whereas the iPWMs from HeLa-S3 and H1-hESC datasets resemble the K562 iPWMs. The compositions of binding sites (i.e. different target genes for the same TF in different tissues) account for these differences because TFs can display distinct cell-type-specific DNA sequence preferences (26). Consistent iPWMs between replicate datasets makes it unlikely that the skewed base conservation between ChIP-seq datasets for the same TF in different cell lines arises from sampling differences; however, this possibility cannot be excluded.

Cofactor binding motifs
Discovery of the binding motif of a cofactor in the same ChIP-seq dataset for a primary TF implies that the two TFs transcriptionally co-regulate this set of common target genes. This could be accomplished either by formation of a physical complex on the promoter, or by synergistic or antagonistic cis-regulatory effects. De novo motif discovery from ChIP-seq datasets provides an effective approach for confirming or predicting statistically significant TF interactions on a genome-wide scale; by contrast, the abundant, existing literature overwhelmingly documents geneby-gene evidence about such interactions which constrains arguments supporting their generalizability. Figure 4 illustrates TF-cofactor interactions revealed by the Maskminent pipeline. Confirmation of known cofactors. The derived iPWMs confirmed genome-wide interactions between 22 cofactors and 102 primary TFs (Table 1), which were supported by the previous studies (3,5,6,15,. For example, the interaction between SP1 and multiple members of the ETS and AP1 families has been well characterized (94)(95)(96)(97)(98)(99). ELK1 and SRF can recruit each other to form a ternary complex on CArG-ETS elements (100). TEAD-AP1 cooperation with steroid receptor coactivators (SRC) drives downstream gene transcription to regulate cancer cell migration and invasion (101), and STAT1, STAT2 and IRF9 form a heterotrimer that regulates transcription of genes containing IFN-stimulated response elements (102). Consistent with previous reports (15), the existence of a YY1-THAP1 complex is predicted from co-segregation of their binding motifs in the K562 dataset of THAP1. Similarly, we predict that the SOX2-OCT4 complex colocalizes with BCL11A, similar to Wang et al. (3). A DNA-binding complex consisting of GATA1, TAL1, E2A, LMO2 and LDB1 is present in the erythroid cell lineage (103). Based on the proxim-ity and coprecipitation of these binding sequences, we and others (3,104) find that this complex, in which GATA1 and TAL1 contact DNA, coordinately binds with TEAD4 and other non-DNA binding proteins (P300, PML, RCOR1 and TBL1XR1). The GATA1-TAL1 and SOX2-OCT4 complexes emerged from the datasets of TAL1 and OCT4 as primary motifs, respectively, which implies the formation of the two complexes may be necessary for binding of TAL1 and OCT4.
Discovery of novel cofactors. Maskminent revealed a number of previously unrecognized cofactor motifs (n = 10) for 46 primary TFs (Table 1), which supports novel TF cobinding and interactions. This includes possible associations between the IRF and RUNX families, and their further cooperation with BCL11A, MEF2A, MEF2C, CEBPB, EED and P300 in GM12878 cells (Table 1 and Figure 4). Similarly, the TEAD-AP1 complex is predicted to recruit MYC, STAT3 and GATA2 in multiple cell lines. The finding that NR2F2 and STAT5A motifs are in close proximity to sequences recognized by the GATA1-TAL1 complex suggests  these factors may coordinately regulate target genes. Many cofactors were also discovered among datasets of nonsequence-specific primary TFs, which is consistent with the possibility that these primary TFs are recruited to gene promoters through their association with DNA-binding cofactors (Table 1).

Cofactor binding sites.
To validate the predicted cobinding between cofactors and primary TFs, we determined the intersite distance distributions by scanning the individual ChIP-seq intervals with the derived iPWMs for each ( Figure  5 and Supplementary Table S4). A minimum information threshold was applied to the R i values of predicted binding sites in order to remove the relatively large number of weak A yellow ellipse denotes a cofactor and a white ellipse denotes a primary TF. A hexagon denotes a TF family with dash lines connecting its members. For a TF family only members for which ENCODE provides peak datasets are shown. A red rectangle denotes a known or predicted TF complex with black or blue dotted lines indicating its components, respectively. An undirected line denotes the interaction between a primary TF and a cofactor which may be a complex or a TF family. A directed line links two cofactors, denoting that in a dataset of the starting TF, the ending TF was discovered as a cofactor. Black lines denote known interactions and blue lines denote the newly discovered interactions. Figure 5. Distributions of intersite distances between primary TFs and discovered cofactors versus negative controls. The minimum threshold on information contents of predicted binding sites is R sequence . Each graph illustrates a much higher frequency of short (<20 nt) intersite distances between primary TFs and cofactors (blue) compared to the negative control (SOX2-OCT4; red).
binding sites that are likely to be low-complexity sequences (e.g. R sequence [or 0.5 * R sequence , if too many cofactor binding sites were eliminated at the higher threshold]). The SOX2-OCT4 complex was used as a primary negative control, as it is primarily expressed in the H1-hESC cell line and is unlikely to be a cofactor for primary TFs in other cell lines. A large percentage of peaks have short intersite distances between the primary TF and the corresponding cofactor binding sites (e.g. <20 nt), whereas there is no such a trend for the negative control sequences and the primary TF. The same difference is observed between the distribution for the documented TEAD4-AP1 binding site pair and for the negative control set. Consistent with previous reports (4), the binding sites of cofactors and primary TFs in peak datasets were physically overlapped between the IRF and RUNX motifs, between the TEAD4 and AP1 motifs, and between USF and ATF3 (AP1) recognition motifs.
Tissue-specific preferences of predicted cofactors relative to primary TFs. Several cofactors were recurrently associated with different primary TF partners, notably in specific cell lines. One possible explanation is that these cofactors are coordinately regulated with different primary TFs preferentially in specific cell types. For example, the datasets of 25 primary TFs in which the IRF family was discovered as a cofactor were all derived from lymphoblastoid (e.g. GM12878) cell lines, with four exceptions (Table 1). Regulation by the IRF family is central to B-lymphocyte expression programs (105). All the datasets of 11 primary TFs from which the GATA and GATA1-TAL1 motifs emerged as cofactors were derived from K562 erythrocytic leukemia cells (Table 1), which is consistent with the activation role that the GATA family exhibits in hematopoietic lineage gene expression (106,107). Similarly, FOXA family members bind to the same sequences as seven primary TFs in the HepG2 cell line derived from hepatocellular carcinoma cells (Table 1), which is consistent with the fact that FOXA proteins regulate the initiation of liver development (108). Datasets of GATA3 and P300 from the T47D breast cancer cell line are also linked to FOXA. Another TF family known to be a key factor regulating hepatocyte differentiation and liver-specific functions is HNF4 (109), which was discovered as a cofactor of SP1 in a HepG2 dataset. SOX2 and the SOX2-OCT4 complex were unveiled as cofactors only in datasets of three primary TFs from the H1-hESC cell line representing embryonic stem cells (Table 1), which is supported by the requirement for SOX2, OCT4 and NANOG to maintain pluripotency (110). Interestingly, all the datasets (n = 12) in which YY was revealed as a cofactor were from K562 cells, with one exception (Table 1). Unlike the GATA TFs, the YY family is ubiquitously distributed and not known to play an especially central role in erythroid lineage development, although YY1 is known to act as a developmental repressor of the ⑀-globin gene along with GATA1 (111). Not surprisingly, the SP family was found to be capable of interacting with the maximum number of TFs, which is consonant with its role in constitutive transcriptional activation. Similarly, the ubiquitously expressed AP1 interacts with 10 TFs in multiple cell lines, and these interactions do not show any preference in cell type.
A number of primary TFs exhibit an extensive capability of interacting with multiple cofactors in different tissues. The unique distribution of these cofactors across multiple cell lines suggests the tissue-specific functions of the primary TFs. For instance, TEAD4 was found to coimmunoprecipitate with GATA1-TAL1 in K562 cells, NRSF in A549 cells, FOXA in HepG2 cells, and AP1 in multiple cell types. Cofactors of P300 include IRF-RUNX in GM12878 cells, SP in H1-hESC cells, AP1 and CEBPB in HeLa-S3 cells, FOXA in HepG2 and T47D cells and GATA1-TAL1 in K562 cells. Cosegregation analysis revealed interactions between BCL11A and IRF-RUNX in GM12878 cells, and SOX2-OCT4 in H1-hESC cells. STAT5A and TBL1XR1 cosegregated with members of the IRF family in GM12878 cells and with GATA1-TAL1 in K562 cells.
Discordance between iPWMs derived from the same ChIPseq assay. We noticed some discrepancies between IDRthresholded datasets and SPP datasets from the same ChIPseq assay. For example, for the primary TF BRG1, iP-WMs exclusively from SPP-derived datasets exhibit motifs of GATA1 and AP1; IDR-thresholded BRG1 data produced only noisy low information content motifs. We also noticed that the motifs derived from different biological replicates of the same ChIP-seq assay were sometimes inconsistent. One replicate of the TEAD4 ChIP-seq assay from the A549 cell line revealed only the NRSF binding motif, whereas both the cofactor AP1 and the primary motif were derived from the other replicate.

Novel binding motifs
We uncovered six high-confidence novel motifs that have not been previously annotated ( Figure 3B). The 'NM1' motif was considerably enriched in the datasets of BAF155 and BRG1 (which do not bind DNA directly) from HeLa-S3 cells and the 'NM2' motif was highly conserved in the datasets of BCL11A and NANOG from H1-hESC cells. The 'NM3' motif was revealed in the ESRRA and SREBF2 datasets from GM12878 cells, in the MAX dataset from HCT116, in the CREB1 and GTF3C2 datasets from K562, and in the non-DNA-binding RCOR1 dataset from IMR90 cells. The Euclidean distances between these novel motifs and primary motifs are dissimilar, ranging from 3.1 to 3.4 bits/nt. The 'NM4', 'NM5' and 'NM6' motifs were discovered in the datasets of GATA3, MXI1 and FOSL1 from MCF-7, SK-N-SH and H1-hESC cells, respectively, with distances ranging from 2.9 to 3.4 bits/nt. We investigated whether these novel motifs were enriched in hallmarks of open chromatin, based on the co-occurrence with DNase I hypersensitive sites and near H3K4me and H3K27ac histone modifications (112). After scanning the complete genome with these iPWMs, the proportions of sites detected within these corresponding ENCODE chromatin tracks were determined for the respective cell lines ( Table 2). These proportions (5-35%) are consistent with previously reports of binding sites for other TFs (113). The frequencies of sites detected with the NM2 and NM6 motifs within the H3K4me1 and H3K27ac peaks are significantly higher than those found after intersection of each NM binding site with the H3K4me2 and H3K4me3 tracks, Nucleic Acids Research, 2017, Vol. 45, No. 5 e27 respectively. The co-occurence of NM2 and NM6 with the H3K4me1 and H3K27ac epigenetic marks supports the assignment of these motifs as components of transcriptional enhancer elements, because these histone modifications are present in nucleosomes flanking enhancer elements (114). Additionally, the co-occurence of these two motifs within DNase I hypersensitive intervals exhibit the highest among all the six motifs. The remaining motifs could represent binding motifs of currently unknown TFs or other nonannotated functional elements.

Binding site motif validation
Detection of true binding sites with iPWMs. A total of 803 experimentally-confirmed, previously published binding sites were verified for the 93 TFs whose primary binding motifs had been identified (Supplementary Table S5). We detected these sites with the derived iPWMs by scanning promoters of known TF target genes for binding elements with positive R i values. There was complete concordance between these true binding sites and those detected with the iPWMs, both in terms of their locations and relative strengths. For example, an electrophoretic mobility shift assay analysis of the SERPINA3 promoter proved that the nucleotide sequence starting at GRCh38 (chr14:94612260) contains a stronger binding site of STAT1 than the one starting at GRCh38 (chr14:94612291) (Supplementary Table S5) (115); the binding site (5 -TTCTGGTAA-3 with R i = 9.02 bits; Row 781) detected by the bipartite iPWM is indeed 2 2.13 (or 4.38)-fold stronger than the other site (5 -TTCTCGGA-3 with R i = 6.89 bits; Row 782) detected in this promoter.
Correspondence between functionally characterized SNPs and changes in information content. Based on the change in the R i value of a binding site, the effect of a SNP on the binding site strength can be predicted with iPWMs (10,12). For 153 SNPs within the binding sites of 29 TFs, we determined R i values of the variant sequence for the corresponding iPWM and compared the predicted consequence to observed TF binding, and if available, published changes in expression (Supplementary Table S6). For 130 SNPs (∼85.0%) affecting binding sites of 27 TFs, the predictions of the iPWMs and the experimental observations are completely concordant. For 16 SNPs (∼10.5%) affecting binding sites of 10 TFs, the predicted and observed experimental findings are concordant, but the extents of these changes differ (e.g. TF binding is predicted to only be weakened, but binding or expression was completely abolished). For 7 SNPs (∼4.6%) altering binding sites of three TFs, the predicted and observed experimental changes were discordant. iPWMs for two (CEBPB and SP1) of these three TFs were validated for other SNPs.

Comparison between iPWMs and other binding motifs.
Binding motifs of eukaryotic TFs in the CIS-BP database were previously reconstructed from oligonucleotide binding selection assays (13); these motifs represent another type of ground truth reflecting the genuine sequence preferences of these TFs. For 133 TFs, we quantitatively compared the iP-WMs with these motifs by determining the normalized Euclidean distances between them, and classified the distances into three categories. We observed that the iPWMs derived in this study and the reconstructed motifs are nearly identical (<1 bit/nt) for 75 TFs, or only differ at 1 or 2 positions (1-2 bits/nt) for 18 TFs. The discovery of cofactors was the predominant explanation for large distances (>2 bits/nt) for 39 of these TFs.
Statistical analyses on iPWMs. To distinguish true binding motifs from noise motifs, the relationship between R i values and binding energy was evaluated by performing F tests on all binding sites in all of the contiguous iPWMs that we derived (674 primary/cofactor, 312 noise). The Fvalues are plotted as a histogram to illustrate probability density distributions ( Figure 6; data available in Supplementary Tables S1 and S2). The histogram shows that most F-values between 0 and 100 were significantly enriched for noise motifs. In general, the F-values of primary/cofactor motifs significantly exceed those derived from noise. The primary/cofactor motif and noise motif distributions are different (Mann-Whitney U test; P = 3.1E-57 at 1% significance level). We note that only primary and cofactor motifs exhibit F-values > 1000, which comprise 37.2% (251 of 674) of all iPWMs. The iPWMs with F-values < 1000 remain valid based on the other criteria described above.

DISCUSSION
In this study, we derived and validated TF binding motifs from ChIP-seq datasets using an information theorybased approach, also revealing TF cofactor binding sites e27 Nucleic Acids Research, 2017, Vol. 45, No. 5 PAGE 10 OF 14 and other novel motifs. The primary TF motifs were validated by comparison with motifs derived independently from binding studies, by analysis of gene variants known to alter TF binding affinities, and by comparing the locations of binding sites predicted by iPWMs with those of true sites previously determined in published binding and expression studies. In addition to contiguous iPWMs, bipartite iPWMs with variable-length spacers were also derived. These iP-WMs more precisely reflect the binding behavior of dimeric TFs, as they incorporate intermediate and often weak binding sites that are often excluded from consensus sequencebased (strong) binding site sets (3). This enables these iP-WMs to accurately quantify binding site strengths across a broad range of affinities (Supplementary Table S5). To test this, the iPWMs were applied to mutation analyses of regulatory SNPs (Supplementary Table S6). We have recently used this approach to identify and prioritize variants affecting TF binding in 20 risk genes of 287 hereditary breast and ovarian cancer patients (116) and 7 genes from 102 such patients (117). In present study, the iPWMs were also used to delineate known and novel TF-cofactor interactions. TF binding sites across the genome have been predicted from promoter accessibility analyses with high-throughput DNase-seq assays. For each of 20 TFs, Yardımcı et al. (118) obtained a set of true binding sites by intersecting ChIP-seq peaks with the 50 000 strongest binding sites predicted by JASPAR and TRANSFAC PWMs in the genome. The FLR (Footprint Log-likelihood Ratio), which is defined as the logarithm of the ratio between probabilities that a DNase I footprint is produced by either a true binding site or a background sequence, was determined at these sites. We attempted to detect these true sites using the derived iPWMs. For these 20 TFs, all of these sites (ranging from n = 31 to 21 550, depending on the TF) were successfully detected by the iPWMs (R i > 0). By contrast, the FLR identified 35-85% of the verified binding sites (Supplementary Table S7). As weak binding sites tend not to generate footprints and thus not to be discovered by DNase-seq, the expectation is that the sites detected by DNase-seq would be stronger than those that evade detection. In fact, this trend was observed for only 10 TFs and the average strengths of these classes of these binding sites were not significantly different.
In the Maskminent pipeline, the weak peaks below the threshold signal intensity do not necessarily contain weak or are missing binding sites; in fact, the distribution of R i values of binding sites in these bottom peaks is similar to that in the top peaks used to derive the iPWM (Supplementary Methods). Thresholding the dataset is required in order to ensure that the iPWM for the primary motif consists of binding sites from as many peaks as possible, while preventing alternative motifs from dominating the objective function used in Maskminent.
We also compared results produced by the Maskminent pipeline with other motif discovery tools from two perspectives of revealing primary and cofactor binding motifs (Supplementary Table S8). MEME-ChIP was previously used to derive motifs for 457 ChIP-seq datasets (119) and Se-qGL (120) was used to analyze 105 datasets. Among the sequence-specific TFs (n = 98) investigated by both tools, Maskminent and MEME-ChIP discovered primary motifs for 80 (∼81.6%) and 92 (∼93.9%) TFs, respectively. Among the 59 TF datasets analyzed by Maskminent, MEME-ChIP, SeqGL and HOMER (121), primary motifs were revealed for 45 (∼76.3%), 51 (∼86.4%), 49 (∼83.1%) and 47 (∼79.7%) datasets, respectively. The cofactor motifs that Maskminent found (which MEME-ChIP and SeqGL failed to detect) primarily comprise the SP family. Since MEME and SeqGL discriminate binding sites from background sequences using nucleotide frequencies computed from all input sequences, binding motifs with compositions similar to the background may fail to be discovered, such as the SP motif; in contrast, Maskminent does not rely on background compositions and will always return the lowest entropy motif. While MEME-ChIP and SeqGL revealed a greater number of cofactor motifs, selecting only the top 500 or 2000 peaks increases the likelihood that those cofactors appeared by chance. This is because MEME-ChIP and SeqGL were configured to report multiple motifs, whereas the main objective of Maskminent was to discover primary motifs (i.e. if the initial iPWM derived from a dataset exhibits the primary motif, the masking and thresholding techniques will no longer be used, unless it is explicitly masked). Finally, the ability of Maskminent, MEME-ChIP, SeqGL to reveal binding motifs was compared on the 105 datasets (120). Each tool discovers cofactor motifs that others do not recognize.
Arvey et al. (26) trained support vector machines (SVMs) that use flexible k-mer patterns to capture DNA sequence signals more accurately from 286 ChIP-seq experiments than traditional motif approaches, and these SVMs can also integrate histone modifications and DNase accessibility to significantly more accurately predict TF occupancy than simpler approaches. However, the SVM approach does not provide any insight into binding strength. Even though accessibility constrains the number of binding sites and increases the accuracy of binding site detection, it is not possible to compare binding site strengths once the designated sites are combined with DNase I hypersensitivity profiles and other chromatin accessibility marks.
In fact, the number of TFs for which cofactor motifs were revealed exceeds the number of TFs whose primary binding motifs were discovered, partially because only cofactor motifs can be found in the datasets of TFs which exhibit little or no sequence specificity (e.g. CCNT2, INI1 and P300). For 11 primary TFs, the binding site sequences were extremely variable; that is, the overall conservation levels of their binding motifs contain less information than noisy, low complexity sequences or cofactor motifs. For 18 primary TFs associated with cofactors, which themselves physically contact DNA, the primary TF motif was not enriched. The inability of the software to discover such primary motifs is a limitation of this approach. Interactions between the primary TFs and a subset of the cofactors which are known to cooperate with them were detected, since the association has to occur with a prevalence sufficient to produce a recognizable motif (usually >0.5 bit/nt over the entire site). Nevertheless, the algorithm may not find cofactors with weakly conserved motifs or those that overlap with other conserved motifs.
While unable to discover cofactors nor identify bipartite motifs of variable spacing, the oligonucleotide microarray technique adopted by Weirauch et al. (13) and Jolma et al. (14) theoretically is able to determine binding specificities for all the sequence-specific TFs, because contiguous binding sites of TFs are reconstructed from overlapping oligonucleotide sequences by directly detecting complexes with the TF. This eliminates interference of noisy sequences or cofactors which may emerge as false minimum entropies using our method.
The Maskminent pipeline can be applied to other ChIPseq data not included in ENCODE. The quality control criteria we described are capable of ensuring that the user-built iPWMs are accurate and can be used for binding site detection. The first and second criteria are particularly important, because they provide a straightforward assessment of iPWM performance. The recursively thresholded feature is crucial for guaranteeing that the discovered cofactors do not appear by chance, because the greater the number of peaks from which a cofactor is derived, the higher the confidence that the cofactor indeed interacts with the primary factor.
In summary, we comprehensively investigated and implemented a new approach to define TF binding specificities based on the ChIP-seq TF data that ENCODE has released. This allowed us to mine and quantify both known and previously unrecognized TF binding motifs and cofactor interactions on a genome scale. This information expands the granularity of the current knowledge on TF interaction with DNA and points out potential directions for future experimental study on interaction between TFs.