Discovery of Primary, Cofactor, and Novel Transcription Factor Binding Site Motifs by Recursive, Thresholded Entropy Minimization

Data from ChIP-seq experiments can determine the genome-wide binding specificities of transcription factors (TFs) and other regulatory proteins. In the present study, we analyzed 745 ENCODE ChIP-seq peak datasets of 189 human TFs with a novel motif discovery method that is based on recursive, thresholded entropy minimization. This method is able to distinguish correct information models from noisy motifs, quantify the strengths of individual sites based on affinity, and detect adjacent cofactor binding sites that coordinate with primary TFs. We derived homogeneous and bipartite information models for 89 sequence-specific TFs, which enabled discovery of 24 cofactor motifs for 118 TFs, and revealed 6 high-confidence novel motifs. The reliability and accuracy of these models were determined via three independent quality control criteria, including the detection of experimentally proven binding sites, comparison with previously published motifs and statistical analyses. We also predict previously unreported TF cobinding interactions, and new components of known TF complexes. Because they are based on information theory, the derived models constitute a powerful tool for detecting and predicting the effects of variants in known binding sites, and predicting previously unrecognized binding sites and target genes.

) 6 6 1 -4 2 5 5 ; Email: Transcription factors positively or negatively interact with the regulatory elements in genes to mediate elaborate regulation of tissue-and stage-specific expression, which is a critical contributor to the marvelous complexity of the human body (1,2). They could be divided into two general classes: one consisting of those TFs that are able to directly bind to DNA by recognizing specific sequence motifs, and the other comprising those TFs which indirectly bind to DNA by acting as interacting partners of the sequence-specific TFs (3). Some sequence-specific TFs have homogeneous binding motifs due to their independent binding activity, while others need to first form homodimers or heterodimers so that their binding sites are bipartite (4).
Not only do genome-wide interactions occur frequently between the two classes of TFs, but also abound between the sequence-specific TFs, with each interacting party being called a cofactor. For instance, NF-Y extensively coassociates with FOS over all chromatin states, cluster classes, and genic features, with 45% of NF-Y sites directly overlapping a FOS site and 39% of FOS sites directly overlapping an NF-Y site in K562 cells ( 5 ) . NF-Y and FOS sites are located just as close (<50bp) as that observed between the NF-Y subunits or between FOS and JUN. CTCF extensively colocalizes with cohesins consisting of SMC1/SMC3 heterodimers and two non-SMC subunits RAD21 and SCC3 ( 6 ) .
Although generally the binding sequences of TFs are highly conserved, some degree of variability at most positions of their binding sites is still evident in most TFs. Information theory-based models can be used to quantitatively and accurately describe base preferences in these motifs. A homogenous model for a TF is a positional weight matrix (PWM) can be derived from a set of aligned binding sites using Shannon information theory ( 7 ) . A bipartite model consists of two PWMs, each of which corresponds to a half site, separated by a range of sequence gaps. Our laboratory previously developed software to generate information-theory based models of bipartite (and homogeneous) binding sites (4). The individual information content (R i ) of a sequence bound by the TF, which represents the affinity of the TF-DNA binding site interaction, can be determined from these models.
The R sequence value of a model is the mean of the R i values of all the binding site sequences used to compute the model, and represents the average binding affinity ( 8 ) . ChIP-seq assays, which use chromatin immunoprecipitation (ChIP) to identify the repertoire of binding site sequences in a genome, can provide substrates for development of robust information models of TFs, and likely include binding site sequences of coregulatory TFs and potentially, cofactors.
Previous work related to the derivation of TF binding motifs has been based on experimental evidence supported by computational approaches. Weirauch et al.  sequencing. However, the oligonucleotide-based approach is not able to well account for variablelength spacers in bipartite binding sites, and it may produce multiple potential models among which correct ones, if present, may not be discriminated. In addition, the dynamic range of oligonucleotides used in the DNA microassays is limited so that not all the possible binding site sequences in the genome are covered as in the ChIP-seq assays, and there is no way to discover cofactors. Wang et al. (

)
carried out de novo motif discovery for 119 human TFs from 457 ChIP-seq datasets using the MEME-ChIP software suite, and Kheradpour et al.
provided a systematic motif analysis for 427 human ChIP-seq datasets of 123 TFs using five motif discovery tools. However, these studies did not generate bipartite motifs with half sites separated gaps varying in length; more importantly, the derived motifs were only based upon strongest ChIP-seq signal peaks (top 500 or 250 peaks), effectively eliminating thousands of intermediate or weak binding events and biasing the resulting PWMs toward high-affinity, consensus-like binding sites. This is necessary, as the weakest binding sites inferred from ChIP-seq are essentially noise that can obfuscate the detection of true binding motifs. Nevertheless, this type of bias distorts the binding strengths estimated from the population of sites (12).
In this study, we developed a novel motif discovery approach that integrates recursive and thresholded features into the entropy minimization algorithm (4). We then applied this method to 745 ENCODE (Encyclopedia of DNA Elements ) ChIP-seq datasets of 189 human TFs. The majority of the primary binding motifs were successfully recovered and refined for the immunoprecipitated TFs, in addition to motifs for many known and previously unreported cofactors.

ENCODE ChIP-seq datasets
The ENCODE Consortium conducted 907 ChIP-seq assays for human TFs and generated initial peak datasets for each replicate of each assay using the uniform peak calling pipeline (14), and for some assays further produced the optimal and conservative IDR thresholded peaks after applying the IDR (Irreproducible Discovery Rate) framework to the initial datasets to improve irreproducibility between multiple replicates of one assay. In addition, Factorbook also provided some refined datasets directly generated by the SPP peak calling software .
We primarily used the IDR thresholded peak datasets, because we found that these data is more likely to discover primary or cofactor motifs than from the initial (i.e., unprocessed) datasets; and they contain more peaks and thus more binding sites than the truncated corresponding Factorbook datasets. One notable exception, however, was BRG1 which was found to exhibit motifs of cofactors Initially, information models from ChIP-seq reads were derived by entropy minimization with Bipad (5).
However, we noticed that these models would not always contain the expected primary motifs and might show motifs of known cofactors of the primary TF. Additionally, datasets with an excess of weak and false positive peaks were noted to result in models consisting of overrepresented, noisy motifs. In order to enhance its ability to detect primary motifs in large datasets, we introduce a novel motif discovery algorithm by adding recursive masking and thresholding functionalities to entropy minimization ( Figure 1). Multiple ChIP-seq datasets from distinct cell lines for the same TF, if available, were all examined for enriched sequence motifs to assess whether this approach was reproducible, and discover tissue specific preferences between these data sources.
This masking technique provided a means of discovering additional conserved motifs in the same data by masking the motifs found in previous iterations of this modeling procedure. It is implemented by scanning a dataset with previous models and recording the coordinates of all the predicted binding sites, and then skipping these intervals when reanalyzing the dataset with Bipad. This is done to reduce the entire multiple alignment search space by the part covered by all the predicted sites. We rename and revise the objective function for building bipartite models to: where D is the ChIP-seq dataset, S i is the set of intervals of all the predicted sites of previous models to be masked, MA is one multiple alignment, respectively. Then, the operation of thresholding the dataset D to obtain the top n peaks is defined by: If the primary motif or a cofactor is unveiled by this operation, then theoretically there exists an accurate threshold δ on the number of top ranked peaks between showing the desired motif and not showing it and we will narrow down the error range within which δ is to less than 500 peaks. The masking and thresholding techniques can be combined to discover primary motifs.
Models are derived from ChIP-seq data with the Maskminent program, which implements the above algorithm. The -m option indicates whether to switch on the masking, otherwise the commands specifying the run to build homogeneous or bipartite models are similar to Bipad, respectively: The Scan program is used to detect potential binding sites with an information model in a DNA sequence fragment based on the following parameters: ./Scan InfoModelFile SequenceFile (1 LengthOfInfoModel)/(2 LengthOfHalfSite MinSpacerSize MaxSpacerSize)

Quality control
The quality control criteria used to evaluate the accuracy of our information models include: 1) Validating our models with experimentally proven binding sites in known target genes; 2) Comparing our models with the annotated motifs in the CIS-BP database ( 9 ) by determining their normalized Euclidean distances; 3) Evaluating the linearity of R i values versus binding energy by performing F-tests on linear models and nested constant models between the two properties of predicted binding sites in our information models, in order to distinguish between correct and noisy, low complexity motifs.
We found that the R i values are inversely proportionate to binding energy. The frequencies of binding sites appearing in a ChIP-seq dataset are linearly related to their binding energy (log 2 K d ), which is delineated by Equation 1 derived by us starting from . And the R i values of binding sites are proportionate to their frequencies, since the higher the affinity of a binding site is, the more the number of times it is bound by the TF will be in a ChIP-seq assay.
where K di is the dissociation constant of the ith predicted binding site, F i ' is its frequency, F 1 ' and K d1 are these values for the strongest site (i.e., the consensus sequence), and [Tf] is the concentration of unbound TF.
[Tf] is negligible for most TFs.
The frequency of each binding site sequence within the multiple alignment of a model occurring in the ChIP-seq dataset was calculated as follows. If only one binding site occurred in a peak, its frequency would increase by one. In the case that a peak contained multiple binding sites, the

RESULTS
The motif discovery pipeline was applied to 745 ChIP-seq peak datasets of 189 TFs. The derived information models displayed primary motifs for 89 TFs (Supplementary Table S1 and S2), as well as 24 cofactor motifs for 118 primary TFs (Supplementary Table S3). We also describe 6 highconfidence novel motifs that have not been previously annotated.

Primary binding motifs
Homogeneous information models. Correct information models were successfully derived for 62 TFs with homogenous binding motifs, which are concordant with published descriptions of these motifs ( 1  9  ) demonstrated that ESR1 is able to recognize binding sites of spacer lengths from 0bp to 4bp, with sites containing a 3bp spacer exhibiting the highest affinity. We allowed the spacer length to vary between 0 -5bp from T47D cell line ChIP-seq data. Consistent with their findings, the resultant models show the documented predominant motif and are palindromic ( Figure   2B). The bipartite model contains greater information contents than the corresponding homogeneous site, and the dominant gap size remains 3bp. However, the models contain some bipartite sites with positive R i values and a 5bp gap, implying that ESR1 may be capable of binding to sites that were not previously examined. The symmetry between the half sites exhibited by the bipartite models suggests that dimeric ESR1 may bind a narrow range of sequences with similar half site affinities.
The palindromic predominant motif of the AP2 family is 5'-GCCN 3 GGC-3', and other binding sequences confirmed in an in vitro binding-site selection assay include 5'-GCCN 4 GGC-3' and 5'- Another binding site 5'-CCCCAGGC-3' was also found in the SV40 enhancer . Indeed, the gaps in the bipartite models of AP2A and AP2C range from 2bp to 4bp, which is concordant with published reports; therefore, these models are representative of the genome-wide pool of bona fide binding sites ( Figure 2B). Besides the fact that they are quite symmetric, we also Discrepancies between information models of the same TF in distinct cell lines. In several instances, we noticed a potential discrepancy between models of the same TF from different cell types, in both homogeneous and bipartite motifs.
For instance, among the three homogeneous models of ESR1 from ECC1 cells one half site has very high conservation levels and the information content of the other half site is lower, while the two half sites are much more symmetric in the T47D models ( Figure 3A). For MAFF and MAFK, the discrepancy between the symmetric bipartite models from K562 and HepG2 cells is prominent; the outer 6 positions of these HepG2 models show a greater degree of conservation than the internal 6 positions, but the K562 models illustrates the opposite trend ( Figure 3A). Additionally, MAFK motif derived from IMR90 cells resembles the HepG2 model, while the models from HeLa-S3 and H1-hESC datasets resemble the K562 model. This could be accounted for by differences in the compositions of binding sites, i.e., different target genes for the same TF in different tissues.

Cofactor binding motifs
Discovery of the binding motif of a cofactor in one dataset of a particular primary TF suggests that there exists transcriptional co-regulation between the two TFs on their common target genes, either by binding of the cofactor in the complex formed by their physical interactions, or by their cobinding to achieve a synergistic cis-regulatory effect. De novo motif discovery from ChIP-seq datasets provides an effective approach for confirming or predicting the genome-wide TF interactions, in contrast to the abundant yet scattered existing literature evidences in specific genes about such interactions. Figure   4 illustrates the cofactors and their interactions with the primary TFs revealed by our motif discovery algorithm.
Confirmation of known cofactors. . Based on the proximity and coprecipitation of these binding sequences, suggest that this complex in which GATA1 and TAL1 contact DNA can coordinately bind with NR2F2, STAT5A, and TEAD4. The complex may also incorporate P300, PML, RCOR1 and TBL1XR1, which are non-DNA binding proteins.
Many cofactors were also discovered in datasets of non-sequence specific TFs, which likely means that these primary TFs are recruited to genes by associating with the DNA-binding cofactors.
Our predictions based on this include interactions between NFKB and KDM5A in H1-hESC cells, non-sequence specific TFs in multiple tissues.
To validate the predicted cobinding between cofactors and primary TFs, we investigated distributions of their intersite distances by scanning the ChIP-seq datasets with our information models ( Figure 5, Supplementary Table S4). We applied a threshold (e.g. R sequence or ½ * R sequence if using R sequence leads to few binding sites of cofactors) on the R i values of predicted binding sites in order to remove the relatively large number of weak binding sites that are likely to be low-complexity sequences. The SOX2-OCT4 complex was used as a primary negative control, as it is merely expressed in the H1-hESC cell line and thus cannot be a cofactor for the primary TFs in other cell lines. It can be seen that the percentage of peaks in which intersite distances between the discovered cofactor and the primary TF are quite small (e.g., less than 20bp) is overwhelming in the respective dataset, while there is no such a trend for the negative control and the primary TF. The distribution of intersite distances between TEAD4 and AP1, whose extensive colocalization was already observed in literature as mentioned above, was also generated. There exists the same difference between it and the distribution for TEAD4 and the negative control. site. Similarly, a large quantity of AP1 and USF binding sites intersect. Given the USF predominant sequence is CA(C/T)GTGAC, the common intersecting pattern is that the first two bases TG of an AP1 site are shared by the last two bases of the E-box in a USF site, and vice versa.

Tissue-preferential activity of cofactors and their tissue-specific distributions relative to primary TFs.
Several cofactors were revealed recurrently with different primary TF partners, and notably in specific cell lines. This is consistent with the possibility of coordinate regulation with these primary TFs in a tissue-preferential manner. For example, all the datasets of 24 primary TFs in which the IRF family was discovered as a cofactor came from GM12878 or related cell lines, except that the ones of STAT1, STAT2, ATF1 and one dataset of EZH2 were from K562 cells, and the one of FOXP2 was from SK-N-MC cells. The GM12878 cells are of B-lymphoblastic type, which is concordant with the critical role that the IRF members play in the regulation of the immune system ( 7 5 ) . All the datasets of 10 primary TFs from which the GATA and GATA1-TAL1 motifs emerged as cofactors were derived from K562 cells of the erythrocytic leukemia type, which is supported by the role the GATA family exhibits in hematopoietic lineage gene expression, e.g., GATA1 being an erythroid-specific activator . Similarly, FOXA was revealed to be a cooperating partner in datasets of 8 primary TFs from the HepG2 cell line derived from hepatocellular carcinoma cells, except that two datasets of GATA3 and P300 were produced from the T47D cell line representing breast cancer cells. This suggests that the FOXA proteins, which are known to regulate the initiation of liver development ( 7 8 ) , are highly active in liver tissue. Another TF family known to be a key factor regulating hepatocyte differentiation and liver-specific functions is HNF4 , which was discovered as a cofactor of SP1 in a HepG2 dataset. SOX2 and the SOX2-OCT4 complex were unveiled as cofactors only in datasets of 3 primary TFs from the H1-hESC cell line representing embryonic stem cells (ESC), which is in agreement with the fact that SOX2, OCT4 and NANOG are essential for maintaining ESC pluripotency

Novel binding motifs
We uncovered 6 high-confidence novel motifs that have not been previously annotated ( Figure 3B).
The "NM1" motif was considerably enriched in the datasets of non-DNA-binding proteins, BAF155 and BRG1, from HeLa-S3 cells and the "NM2" motif was highly conserved in the datasets of BCL11A and NANOG from H1-hESC cells.
The "NM3" motif was revealed in the datasets of ESRRA and SREBF2 from GM12878 cells, MAX from HCT116 cells, CREB1 and GTF3C2 from K562 cells, non-DNA-binding RCOR1 from IMR90 cells. The Euclidean distances between these novel motifs and primary motifs range from 3.1 -3.4 bits per nucleotide. The "NM4", "NM5" and "NM6" motifs were discovered in the datasets of GATA3, MXI1 and FOSL1 from MCF-7, SK-N-SH, and H1-hESC cells, respectively. These distances range from 2.9 to 3.4 bits per nucleotide. These motifs may comprise the binding motifs of currently unknown TFs or other non-annotated functional elements in vivo.

Quality control
Validation of information models with bona fide binding sites. We found a total number of 794 experimentally confirmed binding sites in literature for all of the 89 TFs whose primary binding motifs were discovered, and attempted to detect these sites using our information models, in order to assess the credibility and accuracy of our models well capturing the genuine binding specificities of these TFs.
It turned out that our models did not fail to detect a single true binding site. The detailed results are listed in Supplementary Table S5.

Comparison between our information models and other binding motifs. Binding motifs of eukaryotic
TFs were previously reconstructed from oligonucleotide binding selection assays (9); these motifs represent a type of ground truth reflecting the bona fide sequence preferences of these TFs. For 133 TFs, we compared our information matrices with these motifs quantitatively by determining the normalized Euclidean distances between them, and classified the distances into three categories (Table 1). We observed that, for most TFs, the models derived in this study and the reconstructed motifs are nearly identical or only differ at 1 or 2 positions. The discovery of cofactors by recursive, thresholded entropy minimization was the predominant explanation for large normalized distances for 39 of these TFs.
Statistical analyses on information models. An F-test was carried out on linear regression models to assess the distributions of the R i values and binding energy of putative binding sites in each information model. F-test values on 60 models showing primary binding motifs and 60 noise models randomly selected were calculated ( Figure 6; data available in Supplementary Table S2). The F-test values of 70% correct models (42/60) are above 1,000, while all the noise models have F-test values below 1,000, making 1,000 an appropriate threshold to distinguish between correct and noise models.

DISCUSSION
Using our motif discovery pipeline based on recursive, thresholded entropy minimization, from ENCODE ChIP-seq datasets of 189 human TFs we obtained accurate homogeneous and bipartite information models showing primary binding motifs for 62 and 18 TFs, respectively. In addition, we discovered 24 cofactor motifs for 118 TFs and 6 high-confidence novel motifs as well. Our findings lend support for predictions of known and novel TF-cofactor interactions, new subunits of known complexes, and can reveal target genes that previously, were not known to be regulated by both TF and one or more cofactors. In addition to homogeneous information models, bipartite models with variable-length gaps, which more precisely reflect the binding behavior of dimeric TFs, were also generated. Our approach is generally able to distinguish correct models from noisy, low complexity motifs that are coincidentally enriched in the ChIP-seq data. The models do not only take high-affinity binding sites into consideration, but also encompass many intermediate and lower-affinity sites. This level of accuracy justifies using these models to detect variants in known binding sites that alter the affinity of their interactions with their cognate TFs.
Within a peak from a ChIP-seq dataset, the strength of the binding site of the primary TF is expected to be proportionate to the integrated signal intensity of the peak. Therefore, the average strengths of the binding sites contained in a subset of top ranked peaks is greater than the remaining sites in a dataset; while this introduces some bias, this subset comprises many more sequences than those found in Factorbook for any of the transcription factors. Without this thresholding, the minimum entropy generates noisy or cofactor models, and therefore, top ranked peaks filtering more effectively reveals primary TF motifs. The masking and thresholding techniques decrease the multiple alignment search space from distinct directions, with masking removing a portion in which the initial model is contained, and thresholding increasing the possibility of a non-noise motif possessing the smallest entropy in the remaining search space.
In addition to PWMs, high-throughput DNase-seq assays have also been used to predict TF binding sites across the genome in literature, which is based on how similar a footprint shape of a candidate site is to an absolute DNase-seq footprint. In order to compare the ability of our information models and DNase-seq predicting direct binding, we obtained DNase-seq data on true binding sites of 20 TFs within ChIP-seq peaks from Yardimci et al.

( 8 2 )
, and attempted to detect these sites using our models. All these binding sites were successfully detected by our models, whereas DNase-seq footprinting was only able to identify 35%-85% of them, when 0 was used as the threshold on FLR values indicating the similarity between digestion shapes of candidate sites and definite footprints (Supplementary Table S6). As weak binding sites tend not to generate footprints and thus not to be discovered by DNase-seq, the expectation is that the sites detected by DNase-seq would be stronger than those that evade detection. In fact, this trend was observed for only 10 TFs and the average strengths of these classes of these binding sites were not significantly different.
These information theory-based models not only are capable of detecting strong binding sites, but also discover intermediate and weak sites (Supplementary Table S4). This can be ascribed to the fact that as many peaks as possible were retained when deriving these models from ChIP-seq datasets, under the precondition that the correct binding motifs were shown. By including weaker binding sites in addition to those resembling consensus-like motifs, the weights in the R i (b,l) PWMs are more representative of the full range of binding affinities. The resultant R i values are more effective for forecasting potential binding sites and target genes, detecting variants in known binding sites and predicting their effects. Our models can also be used to forecast whether a SNP (Single Nucleotide Polymorphism) linked to a disease risk lies within a TF binding site as a means of modulating expression of these genes, and potentially offer new explanations for disease etiology in some individuals. The novel cofactor binding sites in TF gene targets suggest previously unrecognized coregulation mechanisms in which they may be operative in these genes.
In fact, the number of TFs for which cofactor motifs were revealed exceeds the number of TFs whose primary binding motifs were discovered, partially because only cofactor motifs can be found in the datasets of TFs which exhibit little or no sequence specificity (e.g., CCNT2, INI1 and P300). For 11 primary TFs, the binding site sequences were extremely variable, that is, the overall conservation levels of their binding motifs contain less information than noisy, low complexity sequences or cofactor motifs. Entropy minimization finds the highest information content motif in a ChIP-seq dataset for a given length motif and degree of sequence conservation over that length. For 18 primary TFs associated with cofactors, which themselves physically contact DNA, the primary TF motif was not enriched. The inability of the software to discover such primary motifs is a limitation of this approach.
Interactions between the primary TFs and a subset of the cofactors which are known to cooperate with them were detected, since the association has to occur with a prevalence sufficient to produce a recognizable motif (usually >0.5 bits per nucleotide over the entire site). Nevertheless, the algorithm will miss some cofactors with weakly conserved or that overlap with other conserved motifs.
While unable to discover cofactors nor identify bipartite motifs of variable spacing, the PBM technique adopted by Weirauch et al.
and Jolma et al.
is able to determine binding specificities for a wide range of sequence-specific TFs, in that homogeneous binding sites of TFs are reconstructed from overlapping oligonucleotide sequences by directly detecting complexes with the TF. This eliminates interference of noisy sequences or cofactors which may emerge as false minimum entropies using our method. The configuration of outputting up to five significant motifs in MIME and the simultaneous use of five motif discovery tools lead Wang et al. (

)
and Kheradpour et al.
( 1  1  ) to discovering a few cofactors that were not revealed by our method, albeit using only top 500 or 250 peaks increased the likelihood of those cofactors appearing by chance.
Our motif discovery approach can be applied to other ChIP-seq data, to derive information models of primary TFs and to reveal cofactors that complex with primary TFs. The quality control criteria we described are capable of ensuring that the user-built models are accurate and can be used for binding site detection. The first quality control criterion is particularly important, because it provides a straightforward assessment of model performance. The recursively thresholded feature is crucial for guaranteeing that the discovered cofactors do not appear by chance, because the greater the number of peaks from which a cofactor is derived, the higher the confidence that the cofactor indeed interacts with the primary factor.
. Figure 1. The motif discovery algorithm. Pseudocode describing the conversion of ChIP-seq data sets into information models using Bipad, a program which discovers sequence elements in unaligned sequences.   for which ENCODE provides peak datasets are shown. A red rectangle denotes a known or predicted complex with black or blue dotted lines indicating its components, respectively. An undirected line denotes the interaction between a primary TF and a cofactor which may be a complex or a TF family.
A directed line links two cofactors, denoting that in a dataset of the starting TF the ending TF was discovered as a cofactor. Black lines denote known interactions and blue lines denote the newly discovered interactions.