On the dependent recognition of some long zinc finger proteins

Abstract The human genome contains about 800 C2H2 zinc finger proteins (ZFPs), most of which are composed of long arrays of zinc fingers. The standard ZFP recognition model asserts that longer finger arrays should recognize longer DNA-binding sites. However, recent experimental efforts to identify in vivo ZFP binding sites contradict this assumption, with many ZFPs exhibiting short motifs. Here we use ZFY, CTCF, ZIM3, and ZNF343 as examples to address three closely related questions: What impedes current motif discovery methods? What are the functions of those seemingly unused fingers? And how can we improve motif discovery algorithms based on the biophysical properties of long ZFPs? Using ZFY, we employed a variety of methods and found evidence for ‘dependent recognition’, in which downstream fingers can recognize some previously undiscovered motifs only in the presence of an intact core site. For CTCF, high-throughput measurements revealed that its upstream specificity profile depends on the strength of its core. Moreover, the binding strength of the upstream site modulates CTCF’s sensitivity to different epigenetic modifications within the core, providing new insight into how the previously identified intellectual disability-causing and cancer-related mutant R567W disrupts upstream recognition and deregulates the epigenetic control by CTCF. Our results establish that, because of irregular motif structures, variable spacing, and dependent recognition between sub-motifs, the specificities of long ZFPs are significantly underestimated. We therefore developed an algorithm, ModeMap, to infer the motifs and recognition models of ZIM3 and ZNF343, which facilitates high-confidence identification of specific binding sites, including repeat-derived elements. With revised concepts, techniques, and algorithms, we can discover the overlooked specificities and functions of those ‘extra’ fingers, and therefore decipher their broader roles in human biology and disease.


Supplemental Information and Experimental methods descriptions

Protein constructs and expression
For ZFY-related proteins, a 6x-His-SUMO tag was fused to the N-terminus of the ZFP coding sequences and cloned into the NEB DHFR construct, whereas for CTCF and its mutant, a 6x-His-HALO tag was fused to the N-terminus of the CTCF coding sequences, with a TEV protease cleavage site as the linker sequence, as shown in Fig. S1 and S2.

Table S1. Recombinant proteins used in the current work

Figure S1
The ZFY and CTCF constructs used in the current study.

Figure S2
EMSA separation of bound CTCF-DNA complexes from unbound DNA for various constructs. The binding reaction volume for each sample was 20 µL, with 2 µL of PURExpress reaction containing the CTCF construct added. Before sample loading, all reactions were equilibrated at room temperature for at least 30 min. Samples were loaded onto 12% Tris-glycine gels in a cold room. Gels were run at 200 V for 50 min.

Data analysis procedures and reproducibility check
The data analysis protocols for ZFY and CTCF are very similar to those of previous work. A general introduction to the data analysis protocol can be found at https://github.com/zeropin/ZFPCookbook.
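The reproducibility check summarized in Figure S3 can be illustrated with a minimal sketch. It assumes that the 0.5 kT bounds mean two replicate energy estimates for the same site should differ by no more than 0.5 kT; the function name `fraction_within_bound` and the example energies are hypothetical and are not taken from the ZFPCookbook repository.

```python
def fraction_within_bound(rep1, rep2, bound_kt=0.5):
    """Return the fraction of paired energies (in kT) that differ by at most bound_kt."""
    if len(rep1) != len(rep2) or not rep1:
        raise ValueError("replicates must be non-empty and of equal length")
    within = sum(1 for a, b in zip(rep1, rep2) if abs(a - b) <= bound_kt)
    return within / len(rep1)


# Made-up binding energies (kT) for five sites measured in two replicates.
replicate_a = [0.0, 0.8, 1.5, 2.1, 3.0]
replicate_b = [0.1, 0.6, 1.4, 2.9, 3.2]

fraction = fraction_within_bound(replicate_a, replicate_b)
print(fraction)  # four of the five pairs fall inside the 0.5 kT bound
```

A fraction close to 1 would indicate that nearly all sites fall between the dashed bounds; the threshold is a parameter so other tolerances can be tried.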

Figure S3
Data reproducibility for different constructs (dashed lines are 0.5 kT energy deviation bounds).

Figure S6. Comparison of inactivation rates without competitor DNA. Without competitor probes, we still observed slowly decreasing anisotropy values over time, most likely due to spontaneous inactivation of the zinc finger proteins at room temperature. The observed inactivation rates showed no major difference between constructs and are significantly slower than the measured dissociation rates.

Upstream motifs of CTCF in regular and extended spacing formats

Supplemental Information for ZIM3 and ZNF343 results

Figure S11. Distribution of binding site strengths estimated by RCADE analysis of ZIM3's flanking motifs at positions (-5, 4, 5, 6), and the aggregate ChIP-exo footprinting plots as in Fig. 5E.

Figure S12. Aggregate ChIP-exo footprinting of all possible defective core sites within the human genome.

Figure S13. Correlation analysis between ChIP-exo reads around each type of anchor site and the binding energy predicted by the ZIM3 motif inferred from the intact-core case.
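The kind of correlation analysis described for Figure S13 can be sketched as follows. This is an illustration under stated assumptions, not the ModeMap or RCADE implementation: it assumes an additive energy matrix (one energy per base per position, in kT, with 0 for the preferred base) and correlates the predicted energy with the logarithm of the read count. The matrix, the sites, and the read counts are all made up.

```python
import math


def site_energy(site, energy_matrix):
    """Sum per-position energies (kT) for a site under an additive model."""
    return sum(column[base] for base, column in zip(site, energy_matrix))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical 3-position energy matrix; not a real ZIM3 motif.
energy_matrix = [
    {"A": 0.0, "C": 1.0, "G": 1.5, "T": 2.0},
    {"A": 1.2, "C": 0.0, "G": 0.8, "T": 1.6},
    {"A": 2.0, "C": 1.0, "G": 0.0, "T": 0.5},
]
sites = ["ACG", "ACT", "CCG", "TTA"]   # made-up flanking sequences
reads = [120, 80, 45, 3]               # made-up ChIP-exo read counts

energies = [site_energy(s, energy_matrix) for s in sites]
r = pearson(energies, [math.log(x) for x in reads])
print(energies, r)
```

Under this convention a lower energy means stronger binding, so a motif that explains the data should give a negative correlation between predicted energy and log read count.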

Figure S14
A) Contact residues for human ZNF343 and motif prediction by the B1H method; B) Motif from RCADE analysis of ChIP-exo data; C) Extended motif from reanalysis of ChIP-exo data with the prefixed core GAAGCG; D) HT-SELEX results for ZNF343; E) Distribution of binding sites by binding energy predicted from the inferred flanking motifs; sites can be further sorted into four groups of equal energy bandwidth; F) Aggregate ChIP-exo reads by group, with GAAGCG prefixed at positions -3 to +2; G) Extended motifs from reanalysis of ChIP-exo data with all single variants of GAAGCG as the prefixed core; the heatmap is generated by auto-correlation analysis of all extended motifs with different cores and the HT-SELEX result; the ChIP-exo read footprints near the associated prefixed cores are shown on the right; H) Aggregate ChIP-exo signals classified by core type and by group, respectively; read counts are normalized by the number of sites within each group; I) Inferred recognition model of ZNF343; J) Annotation of identified specific binding sites in Groups I and II associated with good cores; K) Top five repeat names for each repeat class among specific binding sites.

Figure S15. Alignment of identified ZIM3 specific binding sites mapped to the consensus sequence of the corresponding repeat element.
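The grouping in panel E of Figure S14, where sites are sorted into four groups of equal energy bandwidth, can be sketched as equal-width binning over the range of predicted energies. This is a minimal illustration: the function name `assign_energy_groups` is hypothetical, the example energies are made up, and numbering the lowest-energy bin as group 1 is an assumption rather than something the caption states.

```python
def assign_energy_groups(energies, n_groups=4):
    """Assign each energy to one of n_groups equal-width bins (1 = lowest energy)."""
    lo, hi = min(energies), max(energies)
    if hi == lo:
        # All sites have the same energy; place them in a single group.
        return [1] * len(energies)
    width = (hi - lo) / n_groups
    groups = []
    for e in energies:
        index = int((e - lo) / width)
        # The maximum energy would land in bin n_groups; clamp it to the last bin.
        groups.append(min(index, n_groups - 1) + 1)
    return groups


# Made-up predicted binding energies (kT) for seven sites.
example_energies = [0.0, 0.4, 1.0, 1.9, 2.5, 3.1, 4.0]
example_groups = assign_energy_groups(example_energies)
print(example_groups)  # [1, 1, 2, 2, 3, 4, 4]
```

Equal-width bins keep the energy range per group constant, so the number of sites per group can differ widely; this is why per-group read counts need to be normalized by the number of sites, as in panel H.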