Recent advances in technology have led to a dramatic increase in the number of available transcription factor ChIP-seq and ChIP-chip data sets. Understanding the motif content of these data sets is an important step in understanding the underlying mechanisms of regulation. Here we provide a systematic motif analysis for 427 human ChIP-seq data sets using motifs curated from the literature and also discovered de novo using five established motif discovery tools. We use a systematic pipeline for calculating motif enrichment in each data set, providing a principled way for choosing between motif variants found in the literature and for flagging potentially problematic data sets. Our analysis confirms the known specificity of 41 of the 56 analyzed factor groups and reveals motifs of potential cofactors. We also use cell type-specific binding to find factors active in specific conditions. The resource we provide is accessible both for browsing a small number of factors and for performing large-scale systematic analyses. We provide motif matrices, instances and enrichments in each of the ENCODE data sets. The motifs discovered here have been used in parallel studies to validate the specificity of antibodies, understand cooperativity between data sets and measure the variation of motif binding across individuals and species.
Chromatin immunoprecipitation (ChIP) (1) followed by hybridization to an array (ChIP-chip) (2,3) or sequencing (ChIP-seq) (4) enables the genome-wide identification of the binding locations of transcription factors (TFs) present in a given condition and cell type or tissue. As these technologies have matured, their use has become increasingly widespread. The resolution of these experimental techniques can be as low as 300 bp for ChIP-chip (5) and 50 bp for ChIP-seq (6), depending on the experimental design (e.g. fragment size, paired-end sequencing) and algorithmic processing of the raw data.
The use of these technologies on a variety of factors across many cell types has increasingly highlighted the complex nature of TF activity, often violating the simple model of a factor binding to its recognition pattern (motif) in isolation: binding has been shown to be dynamic across cell types, requiring the coordinated binding of cofactors or specific configurations of the underlying chromatin. Moreover, TF binding frequently occurs in the absence of any discernible motif instance (7,8) or to ‘hot-spots’ where several factors are simultaneously found (9). Understanding this complex binding necessitates identifying the underlying sequence features responsible. To address this need, we have performed a systematic, motif-centric analysis of hundreds of TF binding experiments made available as part of the human ENCODE project (8,10). As part of this, we provide a collection of motifs for each assayed factor, both taken from the literature and through de novo discovery, and also an annotation of motif instances genome-wide, which may be used to pinpoint the specific regulatory bases in regions bound by TFs.
We found that no single algorithm or database comprehensively assays the motifs relevant to the binding diversity surveyed by ENCODE. Therefore, our approach was to collect motifs from several literature sources (11–16) and supplement them with motifs discovered de novo on the data sets themselves using five established tools (17–21). Although this general approach of using multiple motif discovery tools is popular [e.g. (22–24)], its application to this number of data sets is unprecedented and permits the identification of TFs that are likely to be interacting or participating in common pathways.
This work is accompanied by a web interface for browsing the discovered and literature motifs along with their enrichments (Figure 1; http://compbio.mit.edu/encode-motifs). In addition to the browsing interface, we provide several data files including all motif matrices and their matches to the genome, as well as software to compute enrichments and perform unified motif discovery with the five tools we use. Together, these permit both analyses of individual factors (e.g. to identify cooperating TFs) in addition to systematic analysis (e.g. to examine differences between TFs). Moreover, the breadth of data sets available enables systematic comparisons and analyses that are not possible when only one or a few factors are studied in isolation.
Later in the text, we describe the details of how the resource was generated and conduct an initial analysis to provide examples of its usage and to highlight potentially interesting results.
MATERIALS AND METHODS
Our goals were to produce a resource that (i) contains a comprehensive collection of relevant motifs for each factor; (ii) avoids repetitive, weakly enriched motifs that do not contribute to the in vivo specificity of the factor or its partners; and (iii) excludes variants of the same motif, particularly among the discovered motifs. With this in mind, we conducted motif discovery separately on each data set using five motif discovery tools and manually placed all its data sets into ‘factor groups’ on the basis of known motifs and homology (Figure 2). Known motifs from the literature and the top 10 most enriched discovered motifs (excluding duplicates) were collected for each factor group (see Supplementary Methods) and named as TF_known# for known motifs and TF_disc# for discovered motifs, where TF denotes the factor group (e.g. FOXA, CTCF, etc.). Known motifs were ordered arbitrarily, whereas the discovered motifs were ordered in descending order of the enrichment value that was used for their selection.
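The naming and ordering convention for discovered motifs can be illustrated with a short sketch. It is not the pipeline itself: the function name `name_discovered_motifs`, the input layout and the enrichment values are hypothetical, and duplicate removal (see Supplementary Methods) is assumed to have been done beforehand.

```python
def name_discovered_motifs(factor_group, enrichments, top_n=10):
    """Assign TF_disc# names in descending order of enrichment.

    enrichments: dict mapping a candidate motif id to the enrichment value
    used for selection (hypothetical layout; duplicates already removed).
    """
    ranked = sorted(enrichments.items(), key=lambda kv: kv[1], reverse=True)
    return {
        f"{factor_group}_disc{rank}": motif_id
        for rank, (motif_id, _) in enumerate(ranked[:top_n], start=1)
    }


# Made-up enrichment values for three candidate motifs
names = name_discovered_motifs("CTCF", {"m_a": 12.0, "m_b": 45.5, "m_c": 3.2})
print(names)
```

Under this convention, the _disc1 motif of a factor group is always the one with the highest selection enrichment, whereas the numbering of known motifs carries no ranking information.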
The 427 ENCODE experiments analyzed correspond to 123 TFs, which we place into 84 factor groups (Figure 3a). We failed to discover an enriched motif for only 12 of the 84 factor groups, of which 9 lack DNA binding domains (BRF, CTBP2, HDAC8, KAT2A, NELFE, SUPT20H, SUZ12, WRNIP1 and XRCC4) as identified by UniProt (27), and 6 have all their data sets flagged as unreliable based on various quality metrics [BRF, KAT2A, NELFE, NR4A, SUPT20H and ZZZ3; see (A. Kundaje, L.Y. Jung, P.V. Kharchenko, B. Wold, A. Sidow, S. Batzoglou and P.J. Park, in preparation)]. Of these factor groups, only NR4A has a previously identified known motif.
We exclude from the discussion below motifs that we consider unlikely to be relevant to our analysis, while maintaining them as part of the overall resource where they may be useful. These include 46 discovered motifs that are either low-complexity (e.g. dinucleotide repeats) or consistently have weak enrichment (<2) and do not match known motifs (Supplementary Table S1). These are likely a consequence of slight biases in the discovery pipeline, or are due to real, but relatively weak, specificity for the factor. We also exclude an additional 36 motifs that have a weak similarity to the known motif for the factor but for which a better matching and enriched motif is also found (Supplementary Table S2). These are most frequently seen for longer motifs that can be broken up into recognizable, but globally dissimilar, patterns that are not captured by our automatic exclusion criteria (see Supplementary Methods). Together, these represent 28% of the 293 discovered motifs.
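The stated fraction follows directly from the two exclusion counts. The check below uses only the numbers given in the text; the variable names are illustrative.

```python
weak_or_low_complexity = 46   # Supplementary Table S1
weak_known_variants = 36      # Supplementary Table S2
total_discovered = 293

excluded = weak_or_low_complexity + weak_known_variants
percent = round(100 * excluded / total_discovered)
print(f"{excluded} of {total_discovered} discovered motifs excluded ({percent}%)")
```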
Using motif similarity metrics, we are able to link the discovered motifs directly to the TFs that recognize them through their known motifs. Here we use these inferred relationships between TFs to make specific biological insights, illustrating the types of analyses that our resource enables. In the interest of clarity, most descriptions of TFs will be omitted, but may be found along with further references at RefSeq (28) and Entrez (29).
Recovery of known specificity for TFs
Most of the known literature motifs we collect are derived from biochemical in vitro assays. Thus, they provide a largely independent, although somewhat imperfect way to evaluate the performance of our discovered motifs. Recovery of known motifs varies significantly by method, but taking the most enriched motif (our pipeline) is competitive with the best single method (Figure 3b). Overall, our pipeline found a motif matching a previously characterized literature motif for 41 of the 56 factor groups with a known motif.
One of the most striking observations of this analysis is how frequently other distinct motifs were also found. For 29 of these 41 factor groups other motifs are found, even after manually excluding redundant or repetitive motifs, and for 9 factor groups one or more of these discovered motifs is ranked higher than the motif matching a known motif (see Supplementary Table S3). In the next section, we will analyze the additional motifs we found for these factors, which in many cases identify factors known to interact, either cooperatively or competitively.
For the remaining 15 of 56 factor groups with a known motif (e.g. HSF, NANOG, PBX3, SREBP and TAL1) the known motif is not found at all, including NR4A where no enriched motif is discovered. Frequently this is because the known motif itself is not enriched and may not accurately capture the specificity of the factor in vivo. For example, the ‘known’ EP300 motif from Transfac was likely built on a specific bound region of EP300 and would not accurately capture its binding in all cell types where it interacts with a variety of factors and has no DNA binding domain of its own (we avoided removing such motifs to prevent bias in the database). Likewise, we do not discover a motif that matches the known ZBTB33 specificity, and moreover the known motif itself is not enriched at all in the bound regions.
Although some known motifs were of apparently low quality, we largely found our database of known motifs to be relatively comprehensive and had difficulty finding matches to novel motifs outside it. An exception is ZNF263_disc1, which does not match a motif in our database, but does roughly match the specificity for ZNF263 indicated in (30) despite only having weak enrichment (1.8-fold).
Although the motifs that match each other (either known or discovered) generally have similar enrichments, in some cases we find substantially higher enrichment for some motif variants over others (Figure 4 and Supplementary Table S3). For example, NFE2_disc1 matches the known NFE2 motif, but has a 76-fold maximal enrichment across NFE2 data sets, compared with 56-fold enrichment for the most enriched known NFE2 motif. Different known motifs for the same factor often show a broad range in enrichment: MEF2 has six motifs described in Transfac, with an enrichment differential of as much as 4-fold consistently across data sets. This enrichment analysis provides a systematic way to choose among variants of a motif.
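Choosing among motif variants by enrichment can be sketched as follows. This is a minimal sketch that assumes the selection criterion is the maximal enrichment across a factor group's data sets, as in the NFE2 comparison above. Only the maxima (76 and 56) come from the text; the data set labels, the remaining values and the name `NFE2_known_best` are placeholders.

```python
def best_variant(enrichment_by_motif):
    """Return the motif whose maximal enrichment across data sets is highest."""
    return max(
        enrichment_by_motif,
        key=lambda motif: max(enrichment_by_motif[motif].values()),
    )


# Maxima of 76 and 56 are from the text; everything else is a placeholder.
nfe2 = {
    "NFE2_disc1": {"dataset_A": 76.0, "dataset_B": 40.0},
    "NFE2_known_best": {"dataset_A": 56.0, "dataset_B": 35.0},
}
chosen = best_variant(nfe2)
print(chosen)
```

A different summary (for example, the median across data sets) could be substituted for the inner `max` if robustness to a single outlying data set is preferred.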
We also saw varying enrichment of the known motif, depending on the specific data set for a factor group. For example, CTCF_known2 is enriched in CTCF data sets in a range from 30- to 78-fold on identically processed data. This may be a result of varying quality of the samples across data sets or may be a consequence of true biological differences.
Identifying the sequence specificity for factors that were previously uncharacterized is of particular interest. In all, 17 factor groups had no known motif but now have discovered enriched motifs (BCL, BDP1, CCNT2, CHD2, CTCFL, HDAC2, HMGN3, RAD21, SETDB1, SIRT6, SMARC, SMC3, SP2, SIN3A, THAP1, TRIM28 and ZNF263). These discovered motifs may represent the direct or indirect (e.g. through cofactors) DNA binding specificity.
Shared motifs suggest interacting relationships
We find that most factors have motifs for other factors enriched in their binding sites (summarized in Supplementary Table S4). This may occur due to (i) cooperative binding of the two factors to the same locations; (ii) interfering binding between factors where one binds near the other to prevent binding; (iii) some similarity in motif specificity; (iv) the two factors functioning on a similar set of genes (e.g. ones specific to one tissue), without directly interacting; or (v) the factors binding to similar genomic regions (e.g. near genes). Our analysis does not directly rule out any of these possibilities; however, (iii) is generally verifiable using our motif similarity metrics and (v) can be examined by inspecting only the TSS-proximal enrichment.
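To illustrate how possibility (iii) might be checked, the sketch below compares two motif matrices with one common style of similarity score: the best Pearson correlation over ungapped offsets and both orientations. This specific metric, the minimum overlap of four positions and all function names are assumptions for illustration; the metrics actually used for the resource are described in the Supplementary Methods.

```python
from math import sqrt

BASES = "ACGT"


def one_hot(consensus):
    """Toy matrix: one row per position, columns ordered A, C, G, T."""
    return [[1.0 if b == base else 0.0 for b in BASES] for base in consensus]


def reverse_complement(matrix):
    # Reverse positions; reversing the A,C,G,T columns swaps A<->T and C<->G.
    return [row[::-1] for row in reversed(matrix)]


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def similarity(a, b, min_overlap=4):
    """Best Pearson correlation over ungapped offsets and both orientations."""
    best = -1.0
    for cand in (b, reverse_complement(b)):
        for off in range(min_overlap - len(cand), len(a) - min_overlap + 1):
            rows = [(a[i], cand[i - off])
                    for i in range(len(a)) if 0 <= i - off < len(cand)]
            if len(rows) < min_overlap:
                continue
            x = [v for ra, _ in rows for v in ra]
            y = [v for _, rb in rows for v in rb]
            best = max(best, pearson(x, y))
    return best


# TGAGTCA is the reverse complement of TGACTCA, so the two score as identical.
same = similarity(one_hot("TGACTCA"), one_hot("TGAGTCA"))
different = similarity(one_hot("TGACTCA"), one_hot("CCAATGG"))
print(same, different)
```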
The motif most enriched in multiple data sets was the TPA DNA response element (TRE; TGA[C/G]TCA), which is recognized by the AP1 TF when it is formed by FOS/JUN dimers (31) and other factors including MAF and NFE2. The enrichment of the TRE in a data set is often stronger than that of even the known in vitro sequence specificity and may arise from a number of phenomena, including (i) a cooperative interaction with AP1, (ii) competition with AP1 for the same binding sites, leading to a potentially repressive role for the TF or (iii) reuse of binding sites due to, for example, accessibility of chromatin. We find a motif matching the TRE motif for 20 factor groups (AP1_disc3, AP2_disc1, BATF_disc1, BCL_disc2, CTCF_disc8-9, EP300_disc1, GATA_disc2, HMGN3_disc1, IRF_disc2, MAF_disc1, MEF2_disc3, MYC_disc3, NFE2_disc1, NR3C1_disc2, PRDM1_disc2, RXRA_disc3, SMARC_disc1, STAT_disc2, TCF7L2_disc1 and TRIM28_disc1).
We found the enrichment of the TRE to be particularly notable for a few factors. GATA and AP1 have known cooperative binding (32). TFs in the SMARC factor group are members of the SWI/SNF chromatin remodeling complex (33), which is necessary for proper regulation by FOS/JUN dimers (34). TCF7L2_disc1, which matches the TRE, is more enriched than the known TCF7L2 motif (TCF7L2_disc2) only in the TCF7L2 data set from the colorectal cancer cell line HCT-116, consistent with the known interaction of JUN and TCF7L2 during intestinal cancer development (35).
AP1 also binds to the cAMP response element (CRE; TGACGTCA) when the dimer is formed by ATF3/JUN (31) and this is the motif we find as AP1_disc1. However, AP1_disc3 (which matches the TRE) is the most enriched motif in FOS data sets. Interestingly, ATF3_disc1 is not the CRE, but rather the E-box (see later in text). We do, however, find a variant of the CRE (with additional specificity) as ATF3_disc2. The most enriched discovered motif for E2F, E2F_disc1, also matches the CRE and is highly enriched in all data sets.
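As a concrete illustration of the consensus notation, the sketch below scans a sequence for the TRE, written with the IUPAC code S for C/G (TGASTCA), and for the CRE (TGACGTCA). The function names and the test sequence are made up for illustration. Only the given strand is scanned, which suffices here because both consensus sequences are their own reverse complements; this is plain consensus matching, not the matrix-based matching used for the resource.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    # A lookahead is used so that overlapping sites are also reported.
    return re.compile("(?=(" + "".join(IUPAC[c] for c in consensus) + "))")


def find_sites(consensus, sequence):
    """Return (0-based start, matched site) for every occurrence."""
    pattern = consensus_to_regex(consensus)
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


seq = "ccTGACTCAggTGAGTCAttTGACGTCAaa"  # made-up test sequence
tre_sites = find_sites("TGASTCA", seq)
cre_sites = find_sites("TGACGTCA", seq)
print(tre_sites)
print(cre_sites)
```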
MYC is a critical regulator, which recognizes the E-box sequence. To aid in comparisons, we include MAX, which forms complexes with MYC, and USF1/2, which also recognizes the E-box sequence, in the MYC factor group. We find multiple motifs enriched in MYC binding sites, highlighting the multifunctional role MYC and the other E-box recognizing proteins play. We found a version of the E-box with additional specificity (MYC_disc1) that was highly enriched in USF1/2 bound regions (max 98-fold for USF2 versus <9-fold enrichment for MYC/MAX). This motif was more enriched than the known E-box motifs, including known USF motifs, in many USF data sets. We find a second, less specific E-box motif (MYC_disc2), which shows more even enrichment across factors. We also find discovered motifs of other factors matching the E-box, including SIN3A_disc2 (discussed later in text), NFE2_disc2-3 and SIRT6_disc1. It is notable that although SIRT6 is a chromatin-associated protein without a known DNA binding domain (36), the only discovered motif matches the E-box (with 16-fold enrichment in SIRT6 bound regions), suggesting that MYC or another E-box recognizing factor may play an important, but indirect, chromatin-related role.
Motif enrichment is able to identify both positive and negative interactions for the same factor. For example, SIN3A, a corepressor known to interact with a number of proteins, has discovered motifs matching REST (SIN3A_disc1 and more weakly disc3–4) and MYC (SIN3A_disc2). These are consistent with SIN3A’s known involvement in repression by REST (37) and SIN3A being a known antagonist for MYC (38).
Moreover, MYC_disc4 matches RFX5 and is enriched particularly for MAX-bound regions in H1-hESC and GM12878, and MYC_disc5 matches the CEBPB known motif and is enriched in MYC regions bound in unstimulated K562 cells. MXI1, which was not included in the MYC factor group although it does interact with MAX to bind to MYC-MAX sites (39), has MXI1_disc1 that matches RFX5 in both the K562 and HeLa-S3 cell lines.
We analyzed six IRF family data sets: IRF1 binding in K562 cells stimulated by IFNa (viral innate response) or IFNg (viral, bacterial and tumor control); IRF3 in HepG2, GM12878 and HeLa-S3; and IRF4 in GM12878. The most strongly enriched motif (IRF_disc1, matching NFY) is highly enriched (>20-fold) for all three IRF3 data sets and IRF1 in K562 under IFNg stimulation. This suggests that binding of IRF to NFY sites occurs only under specific conditions and by only some IRF members and potentially expands on the previously documented interaction of NFY and IRF2 at a single promoter (40). IRF_disc4, which matches SP1, is enriched in the same cell types, albeit at much lower levels. IRF_disc3, which matches the known IRF consensus, shows weak-to-no enrichment in these data sets, but shows an enrichment of 8.8-fold for IRF1 bound regions in K562 cells under IFNa stimulation and 3.1-fold enrichment for IRF4 bound regions in GM12878. IRF_disc2, which matches the TRE, is enriched primarily in GM12878 regions bound by IRF4. The known SPI1 motif matches IRF_disc5, and reciprocally SPI1_disc2 matches the IRF motif, consistent with the importance of SPI1 in hematopoietic development (41).
Beyond the discovered motif for IRF, several other discovered motifs (AP1_disc2, CEBP_disc2, E2F_disc4, PBX3_disc1, RFX5_disc2 and SP1_disc1-2) match the known NFY specificity (CCAAT). These discovered motifs are consistent with several known interactions of NFY. RFX5 promotes the cooperative binding between RFX and NFY (42), CEBPB and NFY interact in at least one promoter (43) and SP1 and NFY are known to interact (44). E2F_disc4 has particularly high enrichment in E2F4 data sets, consistent with the cooperative role E2F4 and NFY play in cell cycle regulation (45).
STAT factors are involved in regulating a number of growth-related functions. We analyze STAT1, STAT2 and STAT3 here in the context of GM12878, HeLa-S3, MCF10A-Er-Src and K562 cells. We find relatively consistent enrichment of the STAT full site (TTCCNGGAA), which STAT_disc1 matches, while finding weak enrichment for just the half-site (TTCC). We also find motifs involved in other proliferative functions including STAT_disc2, which is particularly enriched in STAT3 data sets and matches the TRE, consistent with STAT3 being one of the many interaction partners for AP1 (46). STAT_disc3 matches the IRF consensus and has enrichment that is particularly high in STAT1 and STAT2 data sets stimulated by IFNa, highlighting the cooperativity of STAT factors and IRF in immune functions. STAT_disc4 is a match to the CEBPB motif and is found enriched in STAT3 data sets, consistent with the known cooperative role for these two factors (47).
TFs with ETS domains are highly conserved and involved in several cellular processes [reviewed in (48)]. A number of TFs have discovered motifs that match the ETS consensus, including EGR1_disc2, GATA_disc3, MEF2_disc2, NRF1_disc2, NR2C2_disc1 and PAX5_disc4. These discovered motifs are supported by known interactions between GATA and ETS in sea squirts (49), MEF2 and the ETS factor PEA3 (50) and NR2C2 with the ETS factor ELK4 (51). Moreover, PAX5 and ETS factors have shared roles in the development of B-cells (52,53). Looking at the discovered ETS motifs, we find that ETS_disc8 matches the known motif for MYB and the two have been known to cooperate, a relationship that is important in the context of certain cancers (54).
THAP1 has two discovered motifs, both of which match the known YY1 motif (the first with additional specificity added by an apparent HNF4 motif). To our knowledge, the relationship between THAP1 and YY1 has not been directly observed; however, THAP1 has been known to associate with the coactivator HCF-1 (55), and YY1 and HCF-1 are known to interact (56). Our result suggests that THAP1 and YY1, possibly with the addition of HNF4, may interact at least in the K562 cell line for which we have THAP1 binding data. RAD21_disc3 also matches YY1, suggesting an additional interaction.
NANOG, an important pluripotency TF, has a known motif that is only weakly enriched (1.3-fold) in the bound regions and not discovered by our pipeline. We see much stronger enrichment for the known POU5F1 and POU2F2 motifs, for which we also find similar motifs (NANOG_disc2 and NANOG_disc4, respectively), consistent with their shared roles in pluripotency (57,58). The interaction of these factors is further supported by POU5F1_disc2 matching the known POU2F2 motif. Additionally, NANOG_disc2 and disc3 match the known motifs for TCF7L2 and TCF12, respectively, again consistent with the important role TCF proteins play in stem cells (59).
CTCF plays a variety of vital roles in the organization of chromatin architecture (60) and the motifs we discover matching the known CTCF specificity (RAD21_disc1, SMC3_disc1,2-4, CTCFL_disc1,10, ZBTB7A_disc1,2, SP2_disc3 and RXRA_disc2,5; some weakly) are largely compatible with this role. RAD21 is a highly conserved protein involved in DNA double-strand repair (61) known to co-localize with CTCF (62). Cohesin, of which SMC3 is a subunit, is brought to the chromatin by CTCF (63). Further, although the function of the CTCF paralog CTCFL is not completely known, it does appear to be involved in imprinting through interaction with a histone methyltransferase (64).
Combinations of motifs
A few of the discovered motifs contain additional specificity or have distinct segments matching multiple motifs. For example, EGR1_disc4 appears to be a combination of multiple motifs (EGR1, IKZF1 and a homeobox motif), and SETDB1_disc1 contains the ZNF143 core sequence with significant additional specificity. The appearance of these motifs suggests highly specific ‘grammars’ for these motifs that may require specific spacing and orientation of binding sites for functionality.
We find several additional enrichments of potential interest. PBX3_disc2 matches the known MEIS1 motif, consistent with the known cooperative binding of MEIS1 and PBX (65). TAL1_disc1 matches GATA, with the potential connection that GATA and TAL1 are known to be important in hematopoiesis and vascular development (66,67). HSF_disc1 matches the known CEBP motif and has much higher enrichment in HSF data sets (31-fold) compared with the known motifs for HSF (<9-fold). Additionally, EGR1_disc5, HNF4_disc5, NRF1_disc3, PAX5_disc2, RXRA_disc4/PAX5_disc3 and SREBP_disc1 match the known motifs for ZIC, SOX, SP1, PAX2/PAX3, IRF and RFX5, respectively, suggesting additional previously uncharacterized interactions. Lastly, we find some motifs that show more ambiguous matches: SMARC_disc2 shows weak similarity to the homeobox TGTAGT motif, NR2C2_disc2–3 weakly match the known HNF4 motif and EGR1_disc3/SETDB1_disc2 match the repetitive NRF1 motif.
General factors enriched in cell line-specific key regulators
Factors directly responsible for the establishment of enhancers, chromatin restructuring or polymerase recruitment frequently exhibit binding that is highly cell type specific. Because most of these factors do not have their own sequence specificity, their binding is often correlated with that of regulators important for the specific cell line. We analyze several such factors (BCL, BDP1, CCNT2, EP300, FOXA, HDAC2, HMGN3, TATA, TCF12 and TRIM28) and find that key cell line regulators can be identified by examining enrichments in cell line-specific data sets.
As a transcriptional coactivator, EP300 interacts with numerous TFs [reviewed in (68)] and has been shown to have binding that can identify tissue-specific enhancers (69). Conversely, FOXA has a DNA binding domain and plays an important role in liver development and function (70) and is a pioneer factor responsible for priming chromatin for the binding of other factors [reviewed in (71)]. Other proteins involved in chromatin restructuring include HDAC2, which transcriptionally represses through histone deacetylation (72), and HMGN3 (73). Further, two factor groups are directly involved in transcription including three RNA Pol3 subunits (BDP1, RPC155 and TFIIIC-110) and CCNT2, which is involved in the elongation of Pol2 (74).
Eight of these ten factor groups have at least one data set in K562 (erythroleukemia cells), and for four of these we discover motifs that match the GATA consensus, which is then enriched specifically in the K562 data sets (BCL_disc5, CCNT2_disc1, HDAC2_disc1 and HMGN3_disc2). GATA has a known important role in K562 (75), and we also have previously found an association with GATA motifs and chromatin state-derived enhancers for K562 cells (76). We also find three additional motifs that have enrichment specific to the factor group’s K562 data set: BDP1_disc1, a 23-nt motif that contains the STAT consensus; HMGN3_disc1, which matches the TRE; and TRIM28_disc2, which matches no known motif and may be associated with an uncharacterized regulator active in this cell line.
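One simple way to flag a cell line-specific enrichment of this kind is sketched below. The two thresholds (`min_fold`, `min_ratio`), the function name and the example values are assumptions chosen for illustration, not criteria taken from this study.

```python
def specific_cell_lines(enrichment_by_cell_line, min_fold=2.0, min_ratio=2.0):
    """Return cell lines where a motif is enriched and clearly above all others.

    enrichment_by_cell_line: dict mapping a cell line to the motif's enrichment
    in that factor group's data set for the cell line (hypothetical layout).
    """
    hits = []
    for line, fold in enrichment_by_cell_line.items():
        others = [v for k, v in enrichment_by_cell_line.items() if k != line]
        if fold >= min_fold and all(fold >= min_ratio * v for v in others):
            hits.append(line)
    return hits


# Made-up enrichments for one discovered motif across three cell lines
k562_like = specific_cell_lines({"K562": 6.0, "HepG2": 1.1, "GM12878": 0.9})
uniform = specific_cell_lines({"K562": 3.0, "HepG2": 2.8})
print(k562_like, uniform)
```

A motif that passes this test in only the K562 data set, and that matches the GATA consensus, would correspond to the pattern described above for BCL_disc5, CCNT2_disc1, HDAC2_disc1 and HMGN3_disc2.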
Likewise, for GM12878, an EBV-mediated lymphoblastoid cell line, we find three discovered motifs (BCL_disc4, EP300_disc5 and TCF12_disc4) that match the known IRF consensus. IRF4 has been shown to be important in the establishment of these cell lines (77), and the family is an important player in immune cells (78). This enrichment is also consistent with our previous study using epigenetic marks (76), where we found IRF to be the strongest enriched motif in GM12878-specific enhancers. We also find GM12878-specific enrichment for motifs matching NFKB (BCL_disc6) and POU2F2 (TATA_disc9), consistent with the known biology of these factors (79,80).
The motifs we find specifically enriched in HepG2 (liver carcinoma) data sets match the known motifs for FOXA (EP300_disc3, HDAC2_disc2, and TCF12_disc2), HNF4 (FOXA_disc5 and HDAC2_disc5) and CEBP (EP300_disc2,6), three key liver regulators (70,81). We find motifs with enrichments specific to H1-hESC, which include matches to the pluripotency factor POU2F2 (TATA_disc9), the near universally expressed repressor REST (BCL_disc3 and HDAC2_disc4) and key metabolic regulator NRF1 (HDAC2_disc4). We find additional cell line-specific enrichments for FOXA_disc3 (TCF12) in ECC-1, FOXA_disc4 (STAT) in both T-47D and ECC-1 and EP300_disc2,6 (CEBP) and EP300_disc4 (ETS) with enrichment in the HeLa-S3 data set.
Even for these factors, we find motifs that are consistently enriched across assayed cell lines for a given factor. FOXA_disc1, for example, matches the known FOXA motif, indicating that FOXA’s own motif also plays an important role in its specificity. Most of the motifs we identify for RNA Pol2 machinery (TAF1, GTF2B, GTF2F1 and TBP) are enriched in all cell lines, including the known TATAAA motif (TATA_known2). Also, TATA_disc1, disc6 and disc8 have consistent enrichment and match the known motifs for YY1 (which is known to be important in establishing transcription) (82), NFY and ETS. The top discovered motif BCL_disc1 matches the known ETS motif and is also enriched across data sets.
Interestingly, we find that the TRE motif is found and enriched in a cell line-specific manner for several factors, but for different cell lines. For example, HMGN3_disc1 is enriched in K562, BCL_disc2 has the highest enrichment in GM12878, TRIM28_disc1 is only enriched in the HEK293 and U2OS cell lines and EP300_disc7 has enrichment in the neuroblastoma cell line SK-N-SH-RA and HeLa-S3. This suggests that perhaps AP1 or other factors recognizing TRE are selectively interacting with these proteins depending on the cell line.
Novel motifs raise possibility of unknown regulators
Although we are able to putatively explain the majority of the motifs we discover as either matches to previously known motifs or low complexity sequences, we do identify 30 putative novel motifs (Figure 5). We placed these into eight groups on the basis of their similarity: Novel1 (BRCA1_disc1, CHD2_disc1, ETS_disc3,6, NR3C1_disc3 and ZBTB33_disc1-4), Novel2 (EGR1_disc4, ETS_disc1,5,7, SETDB1_disc1, SIX5_disc1-3, SMARC_disc2 and ZNF143_disc1-3), Novel3 (SP2_disc3, TCF12_disc3 and ZBTB7A_disc2), Novel4 (RFX5_disc3), Novel5 (BDP1_disc2), Novel6 (TATA_disc5,7), Novel7 (TRIM28_disc2) and Novel8 (E2F_disc6).
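The shorthand used in the group listing (comma-separated indices and hyphenated ranges) can be expanded mechanically, which also confirms that the eight groups account for all 30 motifs. The helper `expand` is illustrative and assumes the `_disc` naming convention described in the Methods.

```python
def expand(shorthand):
    """Expand e.g. 'ETS_disc1,5,7' or 'ZBTB33_disc1-4' into full motif names."""
    prefix, _, spec = shorthand.rpartition("_disc")
    names = []
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        for i in range(int(lo), int(hi or lo) + 1):
            names.append(f"{prefix}_disc{i}")
    return names


groups = {
    "Novel1": ["BRCA1_disc1", "CHD2_disc1", "ETS_disc3,6", "NR3C1_disc3",
               "ZBTB33_disc1-4"],
    "Novel2": ["EGR1_disc4", "ETS_disc1,5,7", "SETDB1_disc1", "SIX5_disc1-3",
               "SMARC_disc2", "ZNF143_disc1-3"],
    "Novel3": ["SP2_disc3", "TCF12_disc3", "ZBTB7A_disc2"],
    "Novel4": ["RFX5_disc3"],
    "Novel5": ["BDP1_disc2"],
    "Novel6": ["TATA_disc5,7"],
    "Novel7": ["TRIM28_disc2"],
    "Novel8": ["E2F_disc6"],
}

sizes = {g: sum(len(expand(s)) for s in members) for g, members in groups.items()}
total = sum(sizes.values())
print(sizes, total)
```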
Novel1 (using ZBTB33_disc1) is highly enriched in at least one data set for each of the factor groups for which it is found (BRCA1, CHD2, ETS, NR3C1 and ZBTB33). All of these factor groups except CHD2 have at least one known motif, and for each of them Novel1 is more enriched in at least one data set than any known motif [the result for NR3C1 is questionable because only one data set has enrichment and that data set has been independently flagged as problematic; see http://www.encodeproject.org/encode/qualityMetrics.html]. The shared role of BRCA1 and CHD2 in DNA damage repair (83,84) suggests that Novel1 may be involved in this or other shared roles for these factors and highlights the utility in shared motif enrichment even outside of motifs directly tied to a factor.
Similarly, for SIX5, we see only weak enrichment of the known SIX5 motif and fail to discover a motif similar to it. However, Novel2 (using SIX5_disc1) shows over 100-fold enrichment for all three data sets (K562, GM12878 and H1-hESC). Novel2 also shows high enrichment in data sets for which it was not rediscovered, including ATF3 (all data sets have >20-fold enrichment with GM12878 having 106-fold) and NRF1 (all data sets have >30-fold enrichment). Moreover, the known ZNF143 motif, which is 4-fold enriched in the one ZNF143 data set, is also not recovered, but Novel2 is 24-fold enriched. The breadth of data sets sharing this motif suggests it may be recognized by an important yet unknown or under-characterized regulator.
Like the known ZBTB7A motif, Novel3 (using SP2_disc3) is largely poly-G, which causes us to underestimate its enrichment due to our shuffling process. Despite this, however, it does show enrichment in several data sets, including for the factor groups for which it was identified. This motif shows similarity to other poly-G motifs, such as known SP1 motifs, but appears to be distinct due to its other bases.
Novel4 (RFX5_disc3) shows moderate, but consistent (2- to 6-fold) enrichment across the RFX5 data sets. The consensus is composed of two of the same components as the known motifs (AAC and TGA), but ordered differently. Consequently, it may represent the binding specificity of, for example, an alternative isoform of RFX5. The remaining motifs (Novel5-8) were found for factors that show cell line-specific enrichments. Consequently, these may represent specificities for regulators that are previously unidentified.
Experimental and evolutionary validation of novel motifs
Following the motif discovery and selection of these putative novel motifs, a study released hundreds of new motifs generated using high-throughput SELEX (16). Two of the putative novel motifs described in this section match motifs generated by (16): Novel1 matches the motif for ETV6 and Novel6 matches ZBED1. Although we have incorporated these SELEX motifs into our resource, we continue to include Novel1 and Novel6 as putative novel motifs because they were identified without knowledge of these new specificities and thereby strengthen the evidence for the remaining novel motifs.
Four of these putatively novel motif groups (Novel1–3, 6) match motifs that were previously identified using conservation signals across four mammals (85) (Supplementary Table S5). Therefore, this study provides additional support for these conservation-based motifs and, conversely, the motifs identified here gain comparative evidence. The relatively few distinct novel patterns that are found in this study and the comparative support for many of the few that are found suggest that there may be a limited number of human TF motifs with many instances and which interact with one of the assayed factors that remain unknown.
In this article, we provide a systematic and comprehensive collection of motifs for hundreds of human TF binding data sets. TF binding can be complex, with a factor recognizing several motifs or binding in the apparent absence of any motif [reviewed in (86)]. We also show that it is possible to identify cofactors that may be partially responsible for binding or function.
This motif resource has already been used in several articles while this article was in preparation, demonstrating its value for high-throughput analyses. Our motifs are being matched at low stringency to identify peaks that are void of any motif to understand the mechanism through which motif-less peaks are generated (8). The collection of known motifs and enrichment techniques we present here was also used as a secondary validation of peaks (87). Because having the motifs allows for more precisely determining the bases responsible for binding, these motifs enable analyses involving population data (88) and the interpretation of GWAS data (89). Two other ENCODE articles also perform motif discovery: (90) produce a non-redundant list of discovered motifs but do not perform an extensive analysis of the relationships between factors and (91) use DNaseI footprinting data to identify relevant motifs.
Having a motif catalog is also the first step in identifying high-quality computational targets of factors, which may allow the identification of binding sites that were, for example, not found in the conditions assayed. Two popular strategies are used for this purpose. One is using clustering of motif instances for factors known to cooperate to form cis-regulatory modules (92,93). This resource is well suited for this purpose because it naturally provides sets of motifs that are likely to cooperate.
A second approach is the use of conservation across many closely related species (85,94–97). This can be performed readily on these motif instances because a dense tree of mammalian species has been sequenced, readily permitting their alignment and the measurement of selection at a near-nucleotide level. Because changes in the underlying motif matches are largely responsible for changes in binding across species (98), evolutionary-based approaches on the motif instances may be a means to deal with the high rate of non-functional binding (99–101).
A web interface, along with data files and accompanying software, is available at http://compbio.mit.edu/encode-motifs.
Supplementary Data are available at NAR Online, including [102–110].
National Institutes of Health (NIH) [HG004037, HG007000 and HG006991]. Funding for open access charge: NIH [HG004037, HG007000 and HG006991].
Conflict of interest statement. None declared.
The authors thank Ewan Birney, Christopher Bristow, Luke Ward, Jason Ernst, Anshul Kundaje, Gerald Quon and other members of the Kellis Laboratory for helpful discussions.