Structural basis for Klf4 recognition of methylated DNA

Transcription factor Krüppel-like factor 4 (Klf4), one of the factors directing cellular reprogramming, recognizes the CpG dinucleotide (whether methylated or unmodified) within a specific G/C-rich sequence. The binding affinity of the mouse Klf4 DNA-binding domain for methylated DNA is only slightly stronger than that for an unmodified oligonucleotide. The structure of the C-terminal three Krüppel-like zinc fingers (ZnFs) of mouse Klf4, in complex with fully methylated DNA, was determined at 1.85 Å resolution. An arginine and a glutamate interact with the methyl group. By comparison with two other recently characterized structures of ZnF protein complexes with methylated DNA, we propose a common principle of recognition of methylated CpG by C2H2 ZnF proteins, which involves a spatially conserved Arg–Glu pair.

Cytosine modifications can be 'interpreted' or 'read' by effector (or reader) molecules. Three classes of mammalian proteins containing domains that recognize modified DNA are currently best characterized. The first class includes methyl-binding domains that recognize methylated cytosine in fully methylated CpG dinucleotides (11). The second class includes Su(var)3-9, En(zeste), and Trithorax (SET) and really interesting new gene (RING) finger-associated domains that recognize hemimethylated CpG sites, which are transiently generated during DNA replication and methylated on the parental strand only (12). The third class comprises the C2H2 zinc finger (ZnF) proteins that preferentially bind to methylated CpG within a longer specific DNA sequence (13). This feature of ZnF proteins is important in that 'sequences longer than CpG would be necessary for the regulation of gene expression by methylation' (14).
Recently, ZnF DNA-binding domains from two proteins, Kaiso and Zfp57, were structurally analyzed in complex with their respective methylated DNA elements (15,16). Here we analyze the interaction of transcription factor Krüppel-like factor 4 (Klf4) with its target methylated DNA element. By comparing three examples of ZnF-methylated DNA interactions, we have derived an apparent consensus sequence motif associated with recognition of methylated CpG elements.
Klf4 is one of 26 members of the specificity protein/Krüppel-like factor (Sp/Klf) family of ZnF transcription factors (17-19) and is one of the four Yamanaka reprogramming factors (20). Two recent studies suggested that Klf4 binds specific methylated and/or unmethylated elements. Using a DNA pull-down approach combined with quantitative mass spectrometry, three Klf proteins (Klf2, Klf4 and Klf5) were identified as 5mC readers in mouse embryonic stem (ES) cells (21). A protein microarray-based approach showed that 47 human transcription factors, including human KLF4, could bind to methylated CpG sites (22). Mouse Klf4 and human KLF4 share an identical DNA-binding domain composed of three standard Krüppel-like ZnFs (Figure 1a). The consensus-binding element for Klf4 was determined by both base-specific mutagenesis [5′-(A/G)(G/A)GG(C/T)G(C/T)-3′] (18) and chromatin immunoprecipitation sequencing (ChIP-seq) [5′-GGG(C/T)G(T/G)GG-3′] (23). Both share a central GG(C/T)G, which contains either CpG, which can be methylated, or TpG, which is intrinsically methylated on one strand and can be methylated on the other strand (CpA) by DNA methyltransferase 3a (Dnmt3a) (24,25). In other words, as with Kaiso (see below), TpG can substitute for the (±methyl)-CpG dinucleotide in the consensus sequences, and has a methyl group in the same position as methylated C (5-carbon of the pyrimidine). Here we analyze the Klf4 interactions with methylated DNA both structurally and biochemically.
Figure 1. [...] (22), R458A and D460A of hKLF4 are located in the last (third) ZnF, which does not directly participate in methyl-CpG binding. (b) Schematic representation of mKlf4 DNA-binding ZnF domain. The sequence and the secondary structure are shown as follows: (arrows) β strands and (ribbons) α helices. The positions highlighted are responsible for Zn ligand binding (C2H2) and DNA base-specific interactions at −1, −4, −5 and −7 positions (relative to the first zinc-binding histidine): solid lines (direct hydrogen bonds) and dashed lines (van der Waals contacts). The DNA sequence used for the study is shown with the majority of base interactions involving the top strand from 3′-to-5′ (left-to-right). The central GCG sequence is colored in magenta and the letter 'm' indicates the methyl group in 5mC. Dotted and solid vertical lines indicate specific binding interactions. (c) The mKlf4 ZnF protein binds in the major groove of DNA with ZnF1 (blue), ZnF2 (green) and ZnF3 (pink). (d) Lys413 of ZnF1 at the −7 position interacts with the O6 oxygen atoms of both guanines at G9 (of upper strand) and G10 (of lower strand). (e) His416 of ZnF1 at the −4 position interacts with the TpG dinucleotide. (f) Arg443-Gua6 interaction; a layer of ordered water molecules (marked 'w') shields the methyl group of lower strand 5mC. (g) The upper strand 5mCpG interacts with Arg443 and forms a 5mC-Arg-Gua triad. (h) One of the carboxylate oxygen atoms of Glu446 forms a weak C-H···O type of hydrogen bond with the methyl group of the upper strand 5mC. (i) Asp445 of ZnF2 at the −5 position interacts with Arg443 at the −7 position and the N4 atom of 5mC of the upper strand. (j) Arg449-G4 interaction. (k) Arg471-G3 interaction; Asp473 of ZnF3 at the −5 position interacts with Arg471 at the −7 position and the N4 atom of Cyt4 of the lower strand. (l) His474-A2 interaction. (m and n) Structural comparison of mKlf4 Glu446 in the absence (m) and presence of methylation (n).
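The two Klf4 consensus elements quoted in the introduction, one from base-specific mutagenesis and one from ChIP-seq, can be written as simple sequence patterns. The following is a minimal sketch, not part of the original study: the names `MUTAGENESIS_MOTIF`, `CHIP_SEQ_MOTIF`, `SHARED_CORE` and `find_sites` are hypothetical, and only the top strand is scanned.

```python
import re

# Alternatives such as (A/G) from the text become character classes.
MUTAGENESIS_MOTIF = re.compile(r"[AG][AG]GG[CT]G[CT]")  # 5'-(A/G)(G/A)GG(C/T)G(C/T)-3'
CHIP_SEQ_MOTIF = re.compile(r"GGG[CT]G[TG]GG")          # 5'-GGG(C/T)G(T/G)GG-3'
SHARED_CORE = re.compile(r"GG[CT]G")                    # central GG(C/T)G


def find_sites(motif, sequence):
    """Return 0-based start offsets of all (possibly overlapping) matches."""
    # A zero-width lookahead lets overlapping occurrences be reported.
    lookahead = re.compile("(?=" + motif.pattern + ")")
    return [m.start() for m in lookahead.finditer(sequence.upper())]


# The 10-bp oligonucleotide used for crystallization (top strand, 5'-to-3').
oligo = "GAGGCGTGGC"
mutagenesis_hits = find_sites(MUTAGENESIS_MOTIF, oligo)
chip_seq_hits = find_sites(CHIP_SEQ_MOTIF, oligo)
core_hits = find_sites(SHARED_CORE, oligo)
```

In this sketch the crystallization oligonucleotide matches the mutagenesis-derived element and the shared core, but not the longer ChIP-seq element, which requires three consecutive guanines upstream of the (C/T)G step.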

Crystallography
The purified Klf4 protein was incubated with annealed oligonucleotides at an equimolar ratio for 0.5 h on ice before crystallization. The final solution contained 0.8 mM protein-DNA complex. Crystals were obtained by the sitting-drop method; the mother liquor contained 100 mM Tris-HCl (pH 8.5), 250 mM NaCl and 20% polyethylene glycol 8000. Crystals grew within 3 days at 16 °C.
The crystals were flash frozen by plunging into liquid nitrogen. X-ray diffraction data were collected at the SER-CAT beamline at the Advanced Photon Source, Argonne National Laboratory. HKL2000 (26) and CCP4 packages (27) were used for data processing. The structure was solved by molecular replacement with the coordinates of 2WBU (28) as the initial search model using the PHENIX (29) and Phaser programs (30). Model refinement was performed with COOT (31) and PHENIX. Molecular graphics were generated with the PyMOL program (DeLano Scientific LLC). The Dali server (32) was used for determining the root-mean-squared deviations of different structures.

Genomic analysis of methylation profiles within Klf4-binding sites
For the methylation profile in the mouse Klf4-binding sites in ES cells, the published Klf4 ChIP-seq (Gene Expression Omnibus (GEO) data set GSM288354) (23) and methylome data in both ES cells and neuronal progenitor (NP) cells (GEO data set GSE30202) (33) were used. We recorded the frequency of the methylation levels at the GGCG sites in the 60-nt windows centered around the midpoints of the reported ChIP-seq peak regions (Figure 2e). Of the 10 297 Klf4 ChIP-seq peak regions in the mouse ES cells, 5530 contain the GGCG motif. As a comparison, we randomly picked an equal number of 60-nt windows in the mouse ES genome and carried out the same analysis (Figure 2f). Similarly, for the methylation profiles in Klf4-binding sites in human H1 cells, the published KLF4 ChIP-seq (GEO data set GSM447584) (34) and methylome data (GEO data set GSM432685) (34) were used. Mouse genome mm9 and human genome hg19 were used in these analyses.
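The window-based motif count described above can be sketched as follows. This is an illustrative outline only, not the authors' pipeline: the function names are hypothetical, coordinates are assumed to be 0-based and half-open, and scanning both strands for GGCG is an assumption, since the text does not state how strand was handled.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (N is kept as N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def window_around_midpoint(chrom_seq, peak_start, peak_end, width=60):
    """Return the width-nt window centered on the midpoint of a peak region."""
    mid = (peak_start + peak_end) // 2
    left = max(0, mid - width // 2)  # clip at the chromosome start
    return chrom_seq[left:left + width]


def has_motif(window, motif="GGCG"):
    """True if the motif occurs on either strand of the window (assumption)."""
    w = window.upper()
    return motif in w or motif in reverse_complement(w)


def count_motif_windows(chrom_seq, peaks, width=60):
    """Count peak-centered windows that contain the motif."""
    return sum(
        has_motif(window_around_midpoint(chrom_seq, start, end, width))
        for start, end in peaks
    )


# Toy chromosome and peak list, invented purely for illustration.
toy_chrom = "A" * 100 + "GGCG" + "A" * 100 + "CGCC" + "A" * 100
toy_peaks = [(90, 114), (190, 220), (0, 40)]
n_with_motif = count_motif_windows(toy_chrom, toy_peaks)
```

The same function applied to an equal number of randomly placed windows would give the background comparison; the methylation level at each GGCG site would then be looked up in the methylome data.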

Structure of Klf4 bound with methylated DNA
Klf4 contains three tandem C2H2 fingers at its C-terminus (Figure 1a). A structure was previously determined for the mouse Klf4 ZnF domain bound to the 10-bp oligonucleotide (5′-GAGGCGTGGC-3′) (28), which is present in the basic transcriptional element of the CYP1A1 gene (18). We modified this 10-bp DNA to contain a fully methylated CpG site and determined the complex structure of Klf4 at a resolution of 1.85 Å (Table 1). Except for the side chain of Glu446 (see below), the overall structure of Klf4 is essentially unchanged between complexes with methylated or unmodified DNA, with a root-mean-squared deviation of <0.5 Å when comparing 85 pairs of Cα atoms. The three ZnFs of Klf4 bind in the major groove of the DNA (Figure 1b and c). ZnF3 interacts with the 5′ sequence (GAG), ZnF2 interacts with the central (potentially) methylated GCG and ZnF1 interacts with the 3′ sequence (TGGC) (Figure 1b and c; note that, from left to right, the protein sequence runs from N to C termini, whereas the DNA sequence of the recognition strand runs from 3′ to 5′).
As with the Zfp57-DNA complex (16), the two 5mCs of the two DNA strands exhibit different patterns of interaction with Klf4. A layer of ordered water molecules (marked 'w' in Figure 1f) envelops the methyl group of 5mC on the bottom strand. In contrast, the methyl group of 5mC in the top strand makes van der Waals contacts with the guanidino group of Arg443, which in turn forms bifurcated hydrogen bonds with the 3′ guanine G6 (Figure 1g), forming a 5mC-Arg-Gua triad (35). In addition, the 5mC methyl group interacts with the carboxylate group of Glu446, forming a weak (3.6 Å) C-H···O type of hydrogen bond (Figure 1h), a common but underappreciated interaction in biomolecules and molecular recognition (36).
Among the side chains involved in DNA base-specific interactions, Glu446 of Klf4 undergoes one of the largest conformational changes on binding methylated versus unmethylated CpG DNA. In the structure of Klf4 bound with unmodified DNA (28), the carboxylate group of Glu446 points away from the C5 position of the cytosine, and forms a weak hydrogen bond with Arg449, which in turn interacts with the 5′ guanine G4 (Figure 1m). Superimposing the two structures reveals that Glu446 moves from the Arg449-interacting conformation to the 5mC-interacting conformation via a ~100° rotation of the side chain torsion angle χ1 (Figure 1n).
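A side chain rotation of this kind can be quantified by computing the χ1 torsion angle (defined by the N, Cα, Cβ and Cγ atoms) in each structure and taking the difference. The sketch below shows the geometry only. The helper names are hypothetical, the coordinates and angles are toy values rather than values from the deposited structures, and real atom positions would have to be read from the two coordinate files.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 bond.

    For chi1, pass the N, CA, CB and CG coordinates in that order.
    """
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))


def rotation_between(angle_a, angle_b):
    """Smallest signed rotation (degrees) taking angle_a to angle_b."""
    return (angle_b - angle_a + 180) % 360 - 180


# Toy geometry (not from any structure): a 90-degree torsion.
toy_chi1 = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
# Toy angles showing that the difference wraps correctly across +/-180.
toy_rotation = rotation_between(170, -90)
```

Wrapping the difference into the range from -180° to 180° matters because two rotamers on either side of ±180° would otherwise appear to differ by far more than the rotation actually required.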

Sequence and methyl-specific binding in solution
To verify the structural observation of Klf4 binding to methylated GCG, we used fluorescence polarization analysis to measure the dissociation constants (KD) between Klf4 fingers and double-stranded oligonucleotides containing a single CpG dinucleotide. Because contact with the 5mC methyl group on the bottom strand involves only water-mediated interactions (Figure 1f), we initially altered only the top strand, replacing the 5mC with unmodified cytosine (C) or three different oxidative modifications (5hmC, 5fC and 5caC) [...]. Thus, methylation had significant, though modest, effects on binding. In contrast, mutating the sequence outside of the central GCG abolished Klf4 binding, regardless of the GCG modification status (Figure 2c). Together, these data indicate that the interaction between Klf4 and DNA depends largely on the specific sequence context and significantly but less profoundly on the cytosine modification state. This observation differs from that of Spruijt et al., who used a DNA pull-down with the recombinant GST-Klf4-ZF domain, followed by western blotting against GST (21). Two examples with different DNA sequences were shown: a sequence containing four repeats of GAC and a sequence containing three overlapping Klf4 consensus motifs. In both instances, western blots indicated the highest binding to oligonucleotides containing 5mC (21), even though the (GAC)4 sequence does not resemble the Klf4 consensus sequence.
[Figure 2, caption fragment: [...] [E446A] (nM) [...] (33) that is present within the core GGCG Klf4-ChIP sites identified in ES cells (23). The red arrows indicate changes of methylation levels, from hypomethylation in mouse ES cells to hypermethylation in NP cells (see Table 2).]
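Dissociation constants of the kind measured here by fluorescence polarization are commonly estimated by fitting a binding isotherm to the titration curve. The sketch below assumes a simple one-site model with protein in large excess over labeled DNA; the text does not state which model or software was used, so the model, the function names and the synthetic data are all assumptions for illustration.

```python
def fraction_bound(conc, kd):
    """One-site binding: fraction of labeled DNA bound at a protein concentration."""
    return conc / (kd + conc)


def log_grid(lo, hi, n):
    """n candidate KD values spaced evenly on a log scale between lo and hi."""
    return [lo * (hi / lo) ** (i / (n - 1)) for i in range(n)]


def fit_kd(concs, signals, kd_grid):
    """Scan candidate KD values; for each, solve the free and bound
    polarization levels by linear least squares and keep the best fit."""
    best = None
    n = len(concs)
    for kd in kd_grid:
        f = [fraction_bound(c, kd) for c in concs]
        mean_f, mean_s = sum(f) / n, sum(signals) / n
        sxx = sum((x - mean_f) ** 2 for x in f)
        sxy = sum((x - mean_f) * (y - mean_s) for x, y in zip(f, signals))
        slope = sxy / sxx
        intercept = mean_s - slope * mean_f
        sse = sum((intercept + slope * x - y) ** 2 for x, y in zip(f, signals))
        if best is None or sse < best[0]:
            best = (sse, kd, intercept, intercept + slope)
    return {"kd": best[1], "p_free": best[2], "p_bound": best[3]}


# Synthetic, noise-free titration (nM and arbitrary polarization units);
# these numbers are invented and are not data from the study.
concs = [0, 5, 10, 25, 50, 100, 250, 500, 1000, 2000]
signals = [100 + 150 * fraction_bound(c, 50.0) for c in concs]
fit = fit_kd(concs, signals, log_grid(1.0, 10000.0, 401))
```

A grid scan is used here only to keep the example dependency-free; a nonlinear least-squares routine would normally be preferred, and a model accounting for ligand depletion would be needed if the DNA concentration were comparable to KD.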
The structural results strongly implicate Glu446 in cytosine modification discrimination, so we replaced the negatively charged Glu446 of mouse Klf4 with alanine (E446A). The E446A mutant exhibited no detectable selectivity for 5mC over the oxidized derivatives 5fC and 5caC, though it maintained 5mC selectivity over unmodified and 5-hydroxymethylated cytosines (C and 5hmC) (Figure 2d). This change in selectivity arises not from a decreased relative affinity for 5mC, but from an increase in the relative affinities for 5fC and 5caC. As with the corresponding glutamate residue in Zfp57 (37), the size and charge of the Glu446 side chain in Klf4 are dispensable for methyl group recognition. An extensive substitution study of the corresponding glutamate in Zfp57 suggested that the negatively charged carboxylate group of the glutamate side chain might be critical in discriminating against the negatively charged carboxylate moiety of 5caC (37).
Using previously published data sets of Klf4 ChIP-seq profiles (23) and bisulphite-sequenced methylomes (33), we then examined the methylation status of the CpG site in the core Klf4-binding motif GGCG. Our analyses indicate that a substantial number of Klf4-binding sites in mouse (Figure 2e) and human ES cells (Figure 2g) are methylated (21). Importantly, many of the unmethylated Klf4-binding sites in mouse ES cells become hypermethylated in NP cells (33) (Figure 2e). Table 2 lists 15 such sites containing the 5′-GGCGTG-3′ Klf4-binding sequence that exhibit both hypomethylation (<20%) in ES cells and hypermethylation (>80%) in NP cells. This suggests that Klf4 may bind methylated loci in differentiated cells (which should be tested by ChIP-seq of Klf4 in NP cells), and thereby initiate stem-cell-specific gene expression patterns during reprogramming (21). For comparison, randomly chosen GGCG sites in mouse genomes are heavily methylated in both ES and NP cells (Figure 2f). Because the bisulphite sequencing method used to generate mouse methylomes in stem cells and NPs (33) does not distinguish between 5mC and 5hmC, or among C, 5fC and 5caC (38), the exact modification status of these sites is unknown. All of these sites (Table 2) are located in gene bodies, as are over half of the Klf4 ChIP-seq sites. While promoter methylation strongly correlates with gene silencing, DNA (hydroxyl)methylation within the gene body is associated with gene activation (39), and the mechanisms by which gene body (hydroxyl)methylation correlates with gene expression are currently under investigation (40). In this respect, it is noteworthy that a very recent study in mouse ES cells (41) indicates that the 5mC dioxygenase Tet1 primarily affects 5hmC levels at gene promoters and transcription start sites, whereas Tet2 mainly modulates those levels in gene bodies.
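The selection criterion described above (hypomethylated in ES cells, hypermethylated in NP cells) amounts to a two-threshold filter over per-site methylation levels. The following is a minimal sketch under stated assumptions: the record layout, the site identifiers and the methylation fractions are invented for illustration and are not entries from Table 2.

```python
def switched_sites(sites, low=0.20, high=0.80):
    """Keep sites with <20% methylation in ES cells and >80% in NP cells."""
    return [s for s in sites if s["es"] < low and s["np"] > high]


# Hypothetical records: methylation expressed as a fraction between 0 and 1.
toy_sites = [
    {"id": "site_A", "es": 0.05, "np": 0.95},  # switches from low to high
    {"id": "site_B", "es": 0.10, "np": 0.50},  # NP level below the 80% cutoff
    {"id": "site_C", "es": 0.90, "np": 0.92},  # already methylated in ES cells
    {"id": "site_D", "es": 0.20, "np": 0.85},  # ES level not strictly below 20%
]
selected = [s["id"] for s in switched_sites(toy_sites)]
```

Both comparisons are strict, matching the "<20%" and ">80%" wording; because bisulphite data do not separate 5mC from 5hmC, the fractions would describe the combined signal.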

Structural comparison with Zfp57 and Kaiso
As in previously characterized DNA-binding ZnF structures (42), the DNA base contacts are made by the side chains in the N-terminal portion of the α helix, together with the residue immediately preceding the α helix. Because the first zinc-binding histidine (CX2-4CX12HX2-6H) is almost always located in the middle of the DNA recognition α helix and the spacing between Cys2 and His2 is constant (12 residues), we use the amino acids at positions −1 to −8 (relative to the first zinc-binding histidine) in the following text to discuss the residues making base contact. This numbering scheme allows us to discuss from the perspective of primary sequence without relying on the more variably spaced first position of the α helix. In the ZnF2 of Klf4, the arginine at the −1 position (RH) makes direct base contact to the 5′ Gua, the glutamate at −4 interacts with the central 5mC and the arginine at −7 recognizes the 3′ Gua of methylated GCG (Figure 3a). Zfp57 recognizes the sequence GMGGCA [where M = 5mC; in the literature, the sequence of the opposite strand TGCCGC was initially used (43)]. Structural analysis of the complex between fully methylated DNA and the tandem two fingers of the mouse Zfp57 DNA-binding domain revealed that the methylated GCG sequence is read by the second ZnF using an arginine at the −1 position (RH), a glutamate at −4 and an arginine at −8 (Figure 3b). Pairwise comparison of the corresponding ZnFs of Klf4 and Zfp57 indicates that the Arg side chains, at the −7 position of Klf4 and −8 in Zfp57, are spatially superimposable (Figure 3c).
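The numbering scheme relative to the first zinc-binding histidine can be applied mechanically to a finger sequence. The sketch below is illustrative only: `helix_positions` is a hypothetical helper, and the example sequence is an assumed Krüppel-like finger chosen so that Arg, Asp, Glu and Arg fall at −7, −5, −4 and −1 as described for ZnF2; it is not quoted from the source.

```python
import re

# C-X(2-4)-C-X(12)-H-X(2-6)-H; the first zinc-binding histidine is captured.
C2H2_PATTERN = re.compile(r"C.{2,4}C.{12}(H).{2,6}H")


def helix_positions(finger_seq, offsets=range(-1, -9, -1)):
    """Map positions -1..-8 (relative to the first zinc-binding His) to residues."""
    match = C2H2_PATTERN.search(finger_seq)
    if match is None:
        raise ValueError("no C2H2 zinc finger pattern found")
    his_index = match.start(1)
    return {offset: finger_seq[his_index + offset] for offset in offsets}


# Illustrative finger sequence (an assumption, see the note above).
example_finger = "CDWDGCGWKFARSDELTRHYRKH"
positions = helix_positions(example_finger)
```

Anchoring on the histidine works because the 12-residue spacing before it is fixed, so the offsets index the recognition helix consistently even when the spacing between the two cysteines varies.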
Kaiso recognition sequences contain either a methylated CpG (44) or a TpG dinucleotide (45) (both have a methyl group at the 5-carbon of the pyrimidine). Structures of the three-ZnF DNA-binding domain of Kaiso, in complex with its methylated CpG or TpG-containing cognate sequences, have been examined (15). Strikingly, interactions similar to those in Klf4 and Zfp57 are observed: an arginine at the −1 position (RH) of ZnF1 interacts with the 3′ guanine of the 5mCpG or TpG dinucleotides, while a glutamate at the −5 position of ZnF2 interacts with 5mC (Figure 3d) or T. The fact that methyl-CpG sites within the consensus recognition sequences of Klf4 and Kaiso can be substituted by TpG raises an intriguing possibility. Perhaps TpG/CpA sites, which could be methylated by Dnmt3a (24), are selected for when it is advantageous for a particular DNA sequence to be treated as if it is a permanently (hemi)methylated version of the recognition sequence; the shared recognition mechanism for (5mC/T)pG provides that option.

A noncontiguous Arg-Glu pair for methyl-CpG recognition
The fact that all three ZnF proteins examined here use an arginine and glutamate pair to recognize the methyl group of 5mC (or thymine) implies that glutamate might be a favorable amino acid for recognizing 5mC (or T). However, there is one substantial difference: Glu446 of Klf4 makes a weak C-H···O type of hydrogen bond with the 5mC methyl group (Figure 3e), whereas Glu182 of Zfp57 (like Glu335 of Kaiso) forms a van der Waals contact with the methyl group of 5mC, while one of its carboxylate oxygen atoms also interacts with the N4 atom of the same 5mC base (Figure 3f and g). This difference in interaction might explain the relatively small increase of binding affinity of Klf4 for 5mCpG versus CpG.
The prediction of methyl-CpG binding proteins from primary sequences is still challenging. However, to date, all structurally characterized methyl-CpG binding proteins (except the base-flipping SET and RING finger-associated domain proteins) involve a 5mC-Arg-Gua triad (35). Here we examined structures of three ZnF proteins in complex with modified and unmodified DNA molecules. We propose that the presence of a spatially conserved (nonconsecutive but spatially proximate) Arg-Glu pair in C2H2 ZnF proteins is suggestive of a 5mCpG binding preference (see Figure 3). In the cases of Klf4 and Zfp57, the Arg-Glu pair is within a single ZnF, with an arginine at the −7 or −8 position and a glutamate at −4 (relative to the first zinc-binding histidine). In the case of Kaiso, the methyl-CpG binding residues come from two neighboring ZnFs, with an arginine at the −1 position (RH) of the N-terminal ZnF and a Glu at the −5 position of the C-terminal ZnF.
Significantly, sequence identity is >65% among the three-finger DNA-binding domains of the Sp/Klf family (46) (Figure 3h). This conservation implies a similar pattern of DNA recognition among the family members and suggests that other Sp/Klf proteins might be sensitive to DNA methylation status. It is interesting that, while the ubiquitous G/C-rich elements serve as binding sites for all tested Sp/Klf family members, Sp proteins mainly function as transactivators, whereas Klf proteins can activate or repress gene expression (46). The ability to bind either unmethylated or modified elements may contribute to the diverse regulatory mechanisms of Sp/Klf-mediated gene expression by selectively recruiting chromatin cofactors (47). Although our knowledge is currently limited to the genome-wide distribution of 5mC and 5hmC during cellular differentiation, the ability to predict transcription factor sensitivity to DNA modifications is becoming increasingly important.

ACCESSION NUMBERS
The X-ray structures (coordinates and structure factor files) of Klf4-5mC DNA have been submitted to PDB under accession number 4M9E.