Knowledge of all binding sites for transcriptional activators and repressors is essential for computationally aided identification of transcriptional networks. The techniques developed for defining the binding sites of transcription factors tend to be cumbersome and not adaptable to high throughput. We refined a versatile yeast strategy to rapidly and efficiently identify genomic targets of DNA-binding proteins. Yeast expressing a transcription factor is mated to yeast containing a library of genomic fragments cloned upstream of the reporter gene URA3 . DNA fragments with target-binding sites are identified by growth of yeast clones in media lacking uracil. The experimental approach was validated with the tumor suppressor protein p53 and the forkhead protein FoxI1 using genomic libraries for zebrafish and mouse generated by shotgun cloning of short genomic fragments. Computational analysis of the genomic fragments recapitulated the published consensus-binding site for each protein. Identified fragments were mapped to identify the genomic context of each binding site. Our yeast screening strategy, combined with bioinformatics approaches, will allow both detailed and high-throughput characterization of transcription factors, scalable to the analysis of all putative DNA-binding proteins.
Identification of DNA-binding sites for transcription factors can be a difficult task, in particular because they often consist of short, degenerate DNA sequences. Traditionally, DNA-binding sites have been identified by electrophoretic mobility shift assays (EMSA) ( 1 ), SELEX enrichment ( 2 , 3 ) and DNase I footprinting assays ( 4 ). More recently, microarray technology has been adapted to this task ( 5 ), but the identified in vitro binding sites do not necessarily address the in vivo activity of a particular binding site. The human genome encodes 2000–3000 transcription factors ( 6–9 ), but at present only 123 transcription factors (from any species) have experimentally determined binding sites in the JASPAR database ( http://mordor.cgb.ki.se/cgi-bin/jaspar2005/jaspar_db.pl ). This indicates a strong need for additional approaches that are more efficient.
Experimental strategies in the yeast Saccharomyces cerevisiae can be fast, cheap and almost universally accessible, as has been demonstrated by the broad use of the two-hybrid assay ( 10 ) and a wide array of additional experimental strategies in yeast for high-throughput screens ( 11–18 ). More recently, Deplancke and colleagues ( 19 ) studied 72 digestive tract promoters of Caenorhabditis elegans in high-throughput yeast assays with 117 proteins and found a large number of previously unknown protein–DNA interaction networks. ChIP-Seq ( 20 ), a recent technique based on high-throughput sequencing, provides a promising new approach, but it requires high-quality antibodies to the transcription factor of interest and sufficient starting material for chromatin immunoprecipitation. Thus the technique is not readily scalable to hundreds or thousands of transcription factors.
We have created the necessary tools and protocols to perform yeast screens that identify both the sequences of DNA-binding targets of transcription factors and biologically active sites of binding. This article demonstrates the utility of the approach using libraries made from the zebrafish and mouse genomes. We have validated the approach for both libraries using two transcription factors: FoxI1 and p53. For both transcription factors, yeast screens generate accurate consensus DNA-binding sites and potential target genes. The techniques are readily scalable to new, high-throughput sequencing approaches for comprehensive binding data on large numbers of transcription factors.
Transcription factor expression plasmid
A cDNA expression vector, pYoh-1, was constructed by inserting a double-stranded oligo containing multiple restriction enzyme sites (forward, 5′-TCG AGC TCA GTC GAC TGG TAC CGA TAT CGA ATT CGG ATC CCC GGG GCC TC-3′ and reverse, 5′-CAT GGA GGC CCC GGG GAT CCG AAT TCG ATA TCG GTA CCA TGC GAC TGA GC-3′) into pACT2 (Clontech) at NcoI/XhoI sites and replacing LEU2 with ADE2 (Supplementary Data). The ADE2 gene was amplified from yeast genomic DNA using the primers 5′-AAT GCA ATC GAT TAA CGC CGT ATC GTG ATT AAC-3′ and 5′-ACG TAA GCG GCC GCC GCT ATC CTC GGT TCT GC-3′. Zebrafish FoxI1 cDNA coding region was subcloned into pYoh-1 at the multiple cloning site and fused with the Gal4 activation domain and the HA epitope tag [originally described in Ref. ( 21 )]. Yeast expression plasmids for human p53 have been described previously ( 22 , 23 ).
Transcription factor plasmids were introduced into MATα yeast strains W303 (kindly provided by Carl Wu's laboratory; genotype ade2-1, can1-100, his3-11,15, leu2-3,112, trp1-1, ura3-1 ) or BY4735 ( ade2Δ::hisG his3Δ200 leu2Δ0 met15Δ0 trp1Δ63 ura3Δ0 ) ( 24 ). Standard yeast manipulations were performed as described ( 25 ). Fusion protein expression of FoxI1 was confirmed by western blot using anti-HA tag antibody.
Yeast genomic libraries
To generate the genomic libraries, the URA3 reporter plasmid pHQ366 ( 26 ) was modified by replacing the PstI -p53-binding site-EcoRI linker with a new PstI-EcoRI linker (5′-GTA TCT CGA GG-3′ and 5′-AAT TCC TCG AGA TAC TGC A-3′), yielding plasmid pYoh366. Zebrafish or mouse genomic DNA was partially Tsp509I-digested, gel-purified and cloned into pYoh366 at the EcoRI site. Ligation reactions were ethanol precipitated and electroporated into Escherichia coli ElectroMAX DH5α-E cells (Invitrogen). The library complexity was assessed by counting a serial dilution of transformants on LB-ampicillin plates. The remainder was plated on large LB-ampicillin plates, allowed to grow and washed into a flask containing LB media. Plasmids were isolated using standard procedures. Yeast strain BY404 ( MAT a ade2Δ::hisG his3Δ200 leu2Δ0 trp1Δ63 ura3Δ0 ) ( 24 ) was transformed with genomic fragment-containing pYoh366 and frozen in aliquots (5×10 8 cells/ml).
We used the method from Ref. ( 27 ) as an optimized mating protocol, originally designed for yeast two-hybrid screening. Mating of two haploid yeast strains of opposite mating type, each harboring one of the respective plasmids, results in the formation of doubly transformed diploid zygotes. An aliquot of BY404 with the genomic library in pYoh366 was combined with a freshly raised culture of p53-expressing or FoxI1-expressing yeast ( MAT α) and then subjected to the standard mating procedures. Mating efficiency was calculated as the number of diploid colonies on plates selecting for diploid yeast divided by the total colony number on YPAD or YPD plates. When ADE2 plasmids were used (i.e. the FoxI1 screening), yeast expressing the transcription factor also contained an empty HIS3 plasmid. This allowed for selection of diploid yeast on the actual screening plates (-his, -trp, -ura) while using the ADE2 -based color assay for detection of the transcription factor plasmid.
Sequencing of positive clones
Library plasmids were rescued from positive yeast clones as described ( 28 ) and sequenced using Forward366 (5′-GCG CTT TAA GAG AAA ATA TTT GTC CTG-3′) and Reverse366 (5′-CGG CTA TTT CTC AAT ATA CTC CTA ATT AAT AC-3′). Alternatively, the genomic library fragments were PCR amplified directly from yeast using Forward366 and ReverseUra (5′-GTA GCA GCA CGT TCC TTA TAT GTA GC-3′). ReverseUra targets the plasmid in a region outside the SPO13 promoter. Before sequencing, 20 μl of PCR product were treated with 0.3 U shrimp alkaline phosphatase (Amersham) and 3 U EXO I (USB) in 20 mM Tris–HCl, pH 8.0 and 10 mM MgCl 2 for 1 h at 37°C, followed by 15 min at 80°C. The fragments were diluted with 95 μl of ddH 2 O and sequenced with nested primer Reverse366.
General strategy in yeast for genome-wide screening for binding sites of DNA-binding proteins
We devised an improved assay system to perform whole genome screens for transcription factor-binding sites. The screens in yeast yield minimal background and quickly eliminate false-positives ( Figure 1 ). In essence, a transcription factor is tested against random genomic fragments to isolate DNA that can be directly bound by the protein of interest. By fusing the transcription factor to the activating domain of the yeast transcription factor GAL4 (GAL4AD), the protein will become a transcriptional activator, regardless of the normal role it plays in vivo or any co-factors that it would normally need to activate transcription. Binding of the transcription factor to the genomic DNA fragment results in activation of the URA3 reporter gene and growth of yeast on plates lacking uracil.
The screening efficiency relies on several key aspects: (i) high-quality libraries of genomic DNA with potentially several-fold coverage of the zebrafish and mouse genomes; (ii) tight repression of the SPO13 promoter upstream of the URA3 reporter gene; (iii) negative selection of genomic library yeast in 5-FOA containing media prior to screens resulting in few false-positives; (iv) ADE2 -based expression plasmids for both auxotrophic selection and a visual color ‘sectoring’ assay; (v) maintenance of libraries and transcription factors in haploid yeast of opposite mating type for fast and easy execution of library screens and (vi) PCR amplification of the genomic fragments directly from yeast for rapid sequencing and reanalysis of the positive genomic fragments.
Construction of whole genome libraries for zebrafish and mouse
Libraries were constructed in plasmid pYoh366 that contains the URA3 reporter gene downstream of the SPO13 promoter ( 28 ) in a CEN plasmid (one copy per yeast cell) with the selective marker gene TRP1 ( Figure 1 ) ( 29 ). Genomic DNA from zebrafish and mouse was partially digested with Tsp509I. Genomic fragments of ∼500 bp in length were gel-purified and cloned into the EcoRI site of pYoh366. A total of 3×10 7 independent clones with an average size of 300 bp were obtained for the zebrafish library, an approximate 4- to 6-fold coverage of the zebrafish genome. A total of 1.7×10 7 independent clones with an average size of 700 bp were obtained for the mouse library, representing an approximately 3- to 4-fold coverage of the mouse genome. For both libraries, >90% of plasmids had inserts. The libraries were transformed into yeast strain BY404 ( MAT a) ( 24 ), and >10 8 independent colonies from each library were pooled and frozen in aliquots.
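The fold-coverage estimates above follow from simple arithmetic: the number of independent clones, times the fraction carrying an insert, times the average insert size, divided by the genome size. The sketch below reproduces that calculation. The genome sizes are rough assumptions and are not taken from the text, and the insert fraction of 0.9 is the lower bound implied by ">90% of plasmids had inserts".

```python
def fold_coverage(n_clones, mean_insert_bp, genome_bp, insert_fraction=1.0):
    """Expected fold coverage of a genome by a shotgun fragment library."""
    return n_clones * insert_fraction * mean_insert_bp / genome_bp


# Approximate haploid genome sizes: assumptions, not values from the text.
ZEBRAFISH_GENOME_BP = 1.5e9
MOUSE_GENOME_BP = 2.7e9

# Clone counts and average insert sizes as reported for the two libraries.
zebrafish = fold_coverage(3e7, 300, ZEBRAFISH_GENOME_BP, insert_fraction=0.9)
mouse = fold_coverage(1.7e7, 700, MOUSE_GENOME_BP, insert_fraction=0.9)

print(f"zebrafish: {zebrafish:.1f}-fold, mouse: {mouse:.1f}-fold")
```

Under these assumptions both values fall within the ranges quoted for the libraries; a different genome size estimate shifts them proportionally.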
Yeast expression plasmids for DNA-binding proteins
Several marker genes (such as HIS3 or LEU2 ) are available for yeast expression plasmids. These markers allow for screens that involve more than one transcription factor. The marker gene ADE2 is particularly advantageous because loss of the plasmid can be monitored by accumulation of a red adenine precursor on plates with limiting amounts of adenine ( 30 ). A colony that grows on plates lacking uracil yet shows red sectors or is completely red indicates that part or all of the colony has lost the transcription factor plasmid, easily identifying it as a false-positive clone.
Whole-genome screens for DNA-binding proteins in yeast
Screens were performed according to the outline in Figure 1 . A total of 2×10 8 haploid yeast of mating type MAT a containing the genomic library were mated on non-selective YPD plates with an equal number of haploid yeast ( MAT α) containing the expression plasmid for a transcription factor. This approach is significantly more efficient than transforming the library into yeast expressing the target transcription factor, because (i) maintaining the library in yeast makes it an easily renewable resource and (ii) mating assays are more convenient and efficient than DNA transformation. The typical mating efficiency was 5–10%, yielding >10 7 diploid yeast per screen. After 3 days of growth at 30°C, Ura + clones (non-sectoring white colonies in the case of ADE2 plasmids) were single-colony purified and checked for plasmid-dependency of the Ura + phenotype. The library plasmids were rescued, transformed into BY404, retested for the Ura + phenotype in mating assays for the transcription factor, sequenced and analyzed computationally.
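As a quick check on the screen size, the number of diploids per screen can be estimated from the number of library cells and the mating efficiency. The sketch below is illustrative only: the function names are hypothetical, and the colony counts passed to `mating_efficiency` are made-up example values rather than measured data.

```python
def mating_efficiency(diploid_colonies, total_colonies):
    """Fraction of plated cells that formed diploids.

    diploid_colonies: colonies on plates selecting for diploids.
    total_colonies:   colonies on non-selective (YPAD or YPD) plates.
    """
    if total_colonies <= 0:
        raise ValueError("total_colonies must be positive")
    return diploid_colonies / total_colonies


def expected_diploids(library_cells, efficiency):
    """Number of diploid yeast expected from one mating."""
    return library_cells * efficiency


# Hypothetical plate counts from the same dilution (illustration only).
example_efficiency = mating_efficiency(diploid_colonies=75, total_colonies=1000)

# 2x10^8 library cells at the reported 5-10% mating efficiency.
low = expected_diploids(2e8, 0.05)
high = expected_diploids(2e8, 0.10)
print(f"{low:.1e} to {high:.1e} diploids per screen")
```

At the reported efficiencies this gives on the order of 10^7 diploids, so each screen samples the library at roughly the scale of its complexity.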
Characterization of the mouse genomic DNA library using p53
p53 is a tumor suppressor protein that is activated in response to cellular stresses. It has well-established DNA-binding characteristics ( 2 , 31 ). All experiments were performed with the complete mouse library of 1.7×10 7 independent genomic fragments. Growing the yeast containing the genomic library in 5-FOA-containing media prior to the screen eliminated background from ‘self-activating’ fragments almost entirely (18 Ura + clones per 1×10 6 yeast).
We performed a screen with p53 under fully optimized conditions. Yeast with an empty control plasmid was processed in parallel to determine the background of false-positives. We obtained 330 Ura + colonies per 1×10 6 diploid yeast for p53, whereas yeast with an empty plasmid yielded only two colonies per 1×10 6 .
The very low level of false-positives in our assay allowed us to streamline the screening procedure significantly. Once Ura + clones emerged after 2–5 days, a PCR reaction was used to amplify the genomic insert directly from yeast. One primer hybridized to the upstream SPO13 sequence and one to the downstream URA3 sequence so that only plasmid DNA, and not yeast genomic DNA, was amplified. The PCR product was reintroduced into haploid yeast by co-transformation with a partially overlapping gapped plasmid, and the reporter gene plasmid was generated in yeast by homologous recombination ( 32 ). The phenotype of the genomic fragment was confirmed in yeast after mating to haploid yeast with the transcription factor. In parallel, the PCR product was purified and sequenced. This approach reduces manual processing of yeast clones, speeds up screens and is adaptable to high-throughput screening.
Characterization of the zebrafish genomic DNA library using FoxI1
FoxI1 is a forkhead class transcription factor involved in proper organization of the zebrafish and mouse otic vesicle and zebrafish cranio-facial cartilage during early embryonic development ( 33–36 ). Because FoxI1 can be either an activator or a repressor in vivo ( 21 ), we fused the open reading frame of FoxI1 to the Gal4 activating domain (to ensure transcriptional activation) and screened the zebrafish library (3×10 7 unique genomic inserts, 4- to 6-fold coverage of the zebrafish genome). In parallel, we performed a control-mating assay for haploid yeast with an empty plasmid. For this particular screen, we cloned the FoxI1 open reading frame into the pYoh-1 plasmid (modified from the Clontech plasmid pACT-1), which is marked with ADE2 . False-positives do not need to maintain the FoxI1 plasmid to grow on SC-Ura plates, so their colonies take on a ‘sectored’ or all-red appearance. True-positives should remain completely white.
Using the approach of Figure 1 and described for p53, 2.5×10 7 diploid yeast were screened on plates lacking uracil for 5 days at 30°C. The negative control resulted in 14 false-positive Ura + colonies per 1×10 6 diploid yeast screened. In contrast, FoxI1 resulted in 132 Ura + colonies per 1×10 6 . A total of 710 white colonies were single colony purified. We PCR-amplified the genomic inserts directly from yeast and sequenced the fragments with nested PCR primers. The fragments were simultaneously re-tested in yeast as described before. All 710 fragments were again Ura + and dependent on FoxI1 for growth.
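The colony counts reported for the two screens can be compared directly by expressing them as fold enrichment over the empty-plasmid control. The short sketch below does this; the function names are hypothetical, and treating the control rate as the expected background among the transcription factor colonies is a simplifying assumption, not a claim made in the text.

```python
def fold_over_background(tf_per_million, control_per_million):
    """Ura+ colonies with the transcription factor relative to the empty-plasmid control."""
    return tf_per_million / control_per_million


def expected_background_fraction(tf_per_million, control_per_million):
    """Share of Ura+ colonies expected to be factor-independent,
    assuming the control rate applies unchanged in the screen."""
    return control_per_million / tf_per_million


# Ura+ colonies per 1x10^6 diploid yeast, as reported for each screen.
p53_fold = fold_over_background(330, 2)
foxi1_fold = fold_over_background(132, 14)

print(f"p53:   {p53_fold:.0f}-fold over control")
print(f"FoxI1: {foxi1_fold:.1f}-fold over control, "
      f"background fraction {expected_background_fraction(132, 14):.1%}")
```

The lower enrichment in the FoxI1 screen is one reason the ADE2 sectoring assay and retesting of each fragment are useful as additional filters.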
Computational analysis of the genomic fragments for consensus DNA-binding sites
Of the 140 starting clones from the p53 screen, 37 were duplicates and 3 had no insert. Of the remaining 100, 90 clones could be fully sequenced using primers up- and downstream of the genomic fragments, and 10 clones had an interior sequence gap. For four of these 10 clones, sequences from the reference mouse genome were retrieved to fill the gap. The remaining six clones were chimeric and could not be mapped precisely because the sequencing gap included the region where the two distinct genomic fragments join. The February 2006 mouse genome assembly and BLAT at http://genome.ucsc.edu/ were used to map the 94 complete clones. Thirty-five clones could not be mapped because they contained exclusively repetitive sequences. Of the remaining 59 clones, 33 mapped to a single genomic locus and 26 were split clones that mapped to more than one locus (24 with 2, 1 with 3 and 1 with 4 fragments). Chimeric clones were a result of the library construction because the genomic fragments had compatible ends. In total, because of some chimerism, 59 clones mapped to 81 unique loci. Typically, chimeric clones had a predicted binding site in only one of the two (or more) genomic fragments in the clone. In summary, ∼70% of library plasmids (94/140) were useful for the p53 DNA-binding site analysis and ∼40% could be uniquely mapped to the genome.
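The clone bookkeeping in this paragraph can be followed step by step. The sketch below simply restates the reported counts and checks that they add up; it assumes that the gap-containing clones not counted among the six unresolved chimeras are the ones completed with reference sequence. It does not attempt to re-derive the number of unique loci.

```python
starting_clones = 140
duplicates = 37
no_insert = 3
unique_with_insert = starting_clones - duplicates - no_insert

fully_sequenced = 90
with_interior_gap = 10
unresolved_chimeras = 6
# Assumption: the other gap-containing clones were completed from the reference genome.
gap_filled = with_interior_gap - unresolved_chimeras
complete_clones = fully_sequenced + gap_filled

repeats_only = 35
mappable = complete_clones - repeats_only
single_locus, multi_locus = 33, 26

# Consistency checks against the counts given in the text.
assert unique_with_insert == fully_sequenced + with_interior_gap
assert mappable == single_locus + multi_locus

print(f"complete clones: {complete_clones} "
      f"({complete_clones / starting_clones:.0%} of starting clones)")
print(f"mappable clones: {mappable} "
      f"({mappable / starting_clones:.0%} of starting clones)")
```

The two percentages correspond to the rounded figures of roughly 70% and 40% given in the summary sentence.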
The consensus p53 DNA-binding site is two half-sites of RRRCWWGYYY separated by 0–13 bp ( 2 ). We built a new matrix from a larger data set of 162 p53 DNA-binding sites. According to the site predicted by our 162 known p53-binding sites, the spacing of 0 between the p53 half-sites is much preferred over any other spacing. This result was confirmed in yeast assays ( 26 ). A logo of this half-site model is shown in Figure 2 A.
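To make the consensus concrete, the sketch below translates the IUPAC half-site RRRCWWGYYY into a regular expression and pairs half-sites separated by 0–13 bp. This is a plain pattern match for illustration, not the weight-matrix model built from the 162 known sites; the function names and test sequence are hypothetical.

```python
import re

# IUPAC nucleotide codes used here, expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "D": "[AGT]", "N": "[ACGT]",
}

HALF_SITE = "RRRCWWGYYY"


def iupac_to_regex(motif):
    """Convert an IUPAC consensus string to a regular expression."""
    return "".join(IUPAC[base] for base in motif.upper())


def half_site_starts(seq):
    """Start positions of all (possibly overlapping) half-site matches.

    RRRCWWGYYY equals its own reverse complement, so scanning one
    strand is enough for this particular pattern.
    """
    pattern = re.compile(f"(?=({iupac_to_regex(HALF_SITE)}))")
    return [m.start() for m in pattern.finditer(seq.upper())]


def find_full_sites(seq, max_spacer=13):
    """Return (start, spacer_length) for pairs of half-sites 0..max_spacer bp apart."""
    starts = half_site_starts(seq)
    sites = []
    for first in starts:
        for second in starts:
            spacer = second - (first + len(HALF_SITE))
            if 0 <= spacer <= max_spacer:
                sites.append((first, spacer))
    return sites


# Hypothetical test fragment: two half-sites with no spacer, flanked by TT and AA.
fragment = "TT" + "GGGCATGTCC" * 2 + "AA"
print(find_full_sites(fragment))
```

A strict pattern match like this misses sites with one or two mismatches, which is why a scored matrix with a cutoff is the more sensitive choice for real data.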
We applied the CONSENSUS program ( 37 ) to 94 isolated clones with complete sequence data. The best motif is a 10 bp approximate palindrome that matches our standard p53 DNA-binding site model perfectly with a P -value of 1.73×10 −6 ( Figure 2 A).
The p53 motif signal was very strong when analyzing the complete set of 94 clones. To test whether we can obtain an accurate motif for p53 with fewer sequences, we generated random subsets of the 94 sequences, ranging in size from 5 to 50 sequences, and applied CONSENSUS to these data sets. With 20 or more sequences, the top predictions matched perfectly with our p53 DNA-binding site model at least 70% of the time. We then added two additional motif-finding programs, Gibbs sampler ( 38 ) and Projection ( 39 ). The combined sensitivity of these three programs was 100% on 10 random subsets of 20 sequences each, indicating that just 20 genomic fragments per transcription factor may be sufficient to characterize the majority of putative transcription factors.
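The subsampling test can be sketched as a small harness that draws random subsets, runs a motif finder on each and records how often the expected motif is recovered. The code below is only an illustration under stated assumptions: `most_common_kmer` is a toy stand-in and not CONSENSUS, Gibbs sampler or Projection, and the input sequences are synthetic clones with a planted 10-mer rather than the isolated fragments.

```python
import random
from collections import Counter


def most_common_kmer(sequences, k=10):
    """Toy motif finder: the k-mer present in the largest number of sequences."""
    counts = Counter()
    for seq in sequences:
        # Count each k-mer at most once per sequence.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts.most_common(1)[0][0]


def recovery_rate(sequences, find_motif, is_correct, subset_size,
                  n_trials=10, seed=0):
    """Fraction of random subsets in which the motif finder returns a correct motif."""
    rng = random.Random(seed)
    hits = sum(
        bool(is_correct(find_motif(rng.sample(sequences, subset_size))))
        for _ in range(n_trials)
    )
    return hits / n_trials


# Synthetic clones (hypothetical): one planted site with varying flanks.
PLANTED = "GGGCATGTCC"
toy_clones = [
    "ACGT"[i % 4] * 5 + PLANTED + "ACGT"[(i + 1) % 4] * 5
    for i in range(40)
]

rate = recovery_rate(toy_clones, most_common_kmer,
                     lambda motif: motif == PLANTED, subset_size=20)
print(f"motif recovered in {rate:.0%} of subsets")
```

Any real motif finder can be substituted by wrapping it in a function that takes a list of sequences and returns its top prediction, with `is_correct` comparing that prediction to the reference model.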
We mapped the likely p53 DNA-binding site in the library sequences. Out of the 94 complete library clones, 81 contained at least one perfect p53 DNA-binding site. Some clones contained as many as five predicted sites. Eleven of the remaining clones contained a good match to the p53 consensus DNA-binding site, but scored below our very stringent cutoff value. Two clones contained only one good p53 half-site.
In the FoxI1 screen, a total of 154 unique sequences were obtained from the 710 total clones. One hundred and thirty-four of the 154 unique sequences could be mapped to the zebrafish genome (zebrafish build Zv6). Occasionally a sequence mapped to more than one location with 100% identity; these occurrences can often be explained by errors in the current zebrafish genomic build. In general, mapping was less ‘robust’ because of high polymorphism rates in the zebrafish genome and the less ‘complete’ nature of the assembly when compared to mouse. Approximately 50% of the clones were chimeric, consisting of two different fragments fused together (Supplementary Data).
Human FoxI1 (HFH-3, FREAC6, HNF-3, FKHL10) binds to the motif TRTTTRKDD as determined by SELEX enrichment ( 40 ). The CONSENSUS program identified two common motifs in the isolated fragments. The less frequent motif was identical to the published binding site ( Figure 2 B). However, the more commonly represented binding site, TSATTGGYY, while similar, had some obvious differences ( Figure 2 B), particularly the presence of an A in position 6 that is a T in the published consensus. This could either reflect slight differences in the preference of FoxI1 binding between the human and zebrafish forms, or differences in binding preference in the context of histone packaging in vivo . Forkhead class transcription factors have been shown to bind more stably to DNA in the context of histones ( 41 ), which may change the recognition site preferences when compared to in vitro determinations. Thirteen fragments did not contain either predicted motif. The discovery of another putative binding site for FoxI1 as well as the confirmation of the previously published site reinforces the value of the technique. It is of particular note that motif 1 contains the NF-Y core sequence CCAAT, common in most eukaryotic promoters ( 42 ). It is unclear what the presence of this sequence signifies, but NF-Y has been shown to be involved in DNA compaction ( 43 ), similar to the role described for FoxI1 ( 21 ), and like forkhead proteins, has a structure that is similar to histones ( 44 , 45 ). Further study is needed to determine the significance of the CCAAT motif.
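The CCAAT core is not visible when TSATTGGYY is read as written, because it lies on the opposite strand. The sketch below computes the reverse complement of an IUPAC motif to show this; the function name is hypothetical and the complement table covers the standard IUPAC nucleotide codes.

```python
# Complement table for IUPAC nucleotide codes (e.g. R<->Y, K<->M, D<->H, B<->V).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(motif):
    """Reverse complement of a DNA motif written in IUPAC codes."""
    return motif.upper().translate(COMPLEMENT)[::-1]


yeast_motif = "TSATTGGYY"
opposite_strand = reverse_complement(yeast_motif)

print(opposite_strand)
print("contains CCAAT:", "CCAAT" in opposite_strand)
```

Read on the opposite strand, the motif is RRCCAATSA, which contains the NF-Y core sequence.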
Identification of putative target genes
Fifty-nine unique clones identified 81 fragments with unique genomic locations because several clones were chimeric. Of the 56 fragments with a good to excellent p53 DNA-binding site, 10 mapped to introns. The remaining fragments were found several thousand base pairs up- or downstream of genes, similar to the findings of ChIP studies for p53 ( 46 , 47 ) (Supplementary Data). These 46 fragments would not have been sampled by typical ‘promoter’ arrays. From the list of genes neighboring the isolated fragments with p53 DNA-binding sites, six genes ( Sec61a2, Ass1, Aldh2, Kit1, Ela3b and Rprm ) are known to be up-regulated in a p53-dependent fashion, but their p53 DNA-binding sites have not been reported ( 48–51 ).
Because of chimerism, 154 unique sequences were mapped to 134 genomic loci. The remaining fragments were unmappable (Supplementary Data). There were only 20 instances where the fragment mapped within 10 kb of a transcriptional start site. This is consistent with a proposed second role for FoxI1 as a global chromatin remodeling protein ( 21 ). There are not many genes known to be directly regulated by FoxI1, but within the small set of genes that have FoxI1 sites <10 kb away, two genes, Lhx3 and Cldna , have roles in ear development ( 52 , 53 ). This makes them candidates for regulation by FoxI1 since FoxI1 has a known role in ear development ( 34–36 , 54 ).
In vivo testing of zebrafish fragments
Recently developed techniques allow researchers to use zebrafish embryos as a rapid readout for testing in vivo activity of putative cis -regulatory elements ( 55 , 56 ). We selected 13 fragments and subcloned them into a TOL2-based transposon vector containing Gateway™ compatible cloning sites upstream of a minimal promoter and GFP (gift of A. McCallion). All 13 fragments showed a significant increase in GFP expression compared to empty vector alone ( Figure 3 A). This demonstrated a specific transcriptional response dependent on the inserted genomic fragment since previous research has shown that random genomic fragments do not consistently activate GFP expression ( 55 , 56 ). Typically, expression of GFP was similar to the known expression pattern of FoxI1 ( Figure 3 B). In order to demonstrate that transactivation was dependent on FoxI1, we selected two fragments, z84 and z11, and tested them in the presence or absence of morpholino oligonucleotides that inhibit FoxI1 expression (OpenBiosystems). Fragment z11 showed no significant differences in the presence or absence of embryonic FoxI1. However, fragment z84 showed a significant increase in GFP expression when FoxI1 protein was inhibited ( Figure 3 C, P < 0.001). We have previously demonstrated that FoxI1 is capable of remodeling higher-order chromatin structure and that only a few genes are activated or inhibited by FoxI1 expression ( 21 ). In this study, we further found that FoxI1 has an inhibitory role on GFP expression in the context of certain genomic fragments. It is likely that FoxI1 binding is crucial for recruiting other transcription factors, acting as positive and/or negative regulators, to the sites on those fragments.
In summary, the fragments isolated using the yeast technique showed strong enrichment for cis -regulatory elements (13/13 tested) and we have demonstrated in one instance (of two tested) that FoxI1 is responsible for transcriptional regulation from the fragment.
We present an approach in yeast that can rapidly identify the target binding sequence of nearly any transcription factor of interest, and simultaneously identify target sites that are biologically active and relevant to the function of that transcription factor. This redesigned and optimized yeast approach combines three important elements: (i) yeast mating (instead of transformation) for efficient screening of an entire genome in one pass; (ii) red/white sectoring of the ADE2 gene for visual identification of false-positives and (iii) use of URA3 as the reporter gene for expansion of libraries in 5-FOA to eliminate yeast with ‘self-activating’ fragments and reduce false-positives to near zero. In two test cases, p53 and FoxI1, a yeast-based screening approach proved to be a powerful and efficient way to study DNA-binding proteins. Recent studies using our p53-binding site data as a starting point have identified endogenous retroviral (ERV) LTRs carrying functionally active p53-binding sites specific to the primate lineage, indicating the potential utility of this approach both to identify binding sites that are active in vivo , and to make novel discoveries about cis -regulatory regions in general ( 57 ). The genomic fragments can be directly PCR-amplified from yeast for both sequencing and immediate retesting in yeast using ‘gap repair’. The approach is readily extended to hetero-dimeric or -multimeric transcription factor complexes.
This technique provides a low-cost, potentially high-throughput and rapid assessment of DNA-binding targets for transcription factors that nicely complements the current technologies, such as ChIP-Seq. We have shown that DNA-binding sites can be predicted in as few as 20 isolated sequences and the identified fragments have a very high likelihood of being biologically active.
Supplementary Data are available at NAR Online.
We thank Carl Wu for providing reagents. Support for T.W. is from the Helen Hay Whitney Foundation. G.S. is supported through funding from NIH grant HG00249. This research was supported in part by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health (S.B.). Funding to pay the Open Access publication charges for this article was provided by the National Human Genome Research Institute.
Conflict of interest statement . None declared.