Ankit Gupta, Ryan G. Christensen, Heather A. Bell, Mathew Goodwin, Ronak Y. Patel, Manishi Pandey, Metewo Selase Enuameh, Amy L. Rayla, Cong Zhu, Stacey Thibodeau-Beganny, Michael H. Brodsky, J. Keith Joung, Scot A. Wolfe, Gary D. Stormo, An improved predictive recognition model for Cys2-His2 zinc finger proteins, Nucleic Acids Research, Volume 42, Issue 8, 1 April 2014, Pages 4800–4812, https://doi.org/10.1093/nar/gku132
Abstract
Cys2-His2 zinc finger proteins (ZFPs) are the largest family of transcription factors in higher metazoans. They also represent the most diverse family with regard to the composition of their recognition sequences. Although there are a number of ZFPs with characterized DNA-binding preferences, the specificity of the vast majority of ZFPs is unknown and cannot be directly inferred by homology due to the diversity of recognition residues present within individual fingers. Given the large number of unique zinc fingers and assemblies present across eukaryotes, a comprehensive predictive recognition model that could accurately estimate the DNA-binding specificity of any ZFP based on its amino acid sequence would have great utility. Toward this goal, we have used the DNA-binding specificities of 678 two-finger modules from both natural and artificial sources to construct a random forest-based predictive model for ZFP recognition. We find that our recognition model outperforms previously described determinant-based recognition models for ZFPs, and can successfully estimate the specificity of naturally occurring ZFPs with previously defined specificities.
INTRODUCTION
Defining the grammar underlying the transcriptional regulatory elements within the human genome remains a critical step in understanding both developmental and disease processes (1). The advent of high-throughput sequencing technology has fueled the development of methodologies for the genome-wide characterization of regulatory features, such as global histone modifications (1–10). These data coupled with global analysis of RNA transcript levels (6,11), chromatin immunoprecipitation (ChIP)-based occupancy data for sequence-specific transcription factors (TFs) (7,12–14) and chromatin conformational capture techniques (15) provide a framework for deconvoluting regulatory networks directing gene expression patterns (16,17). Currently, only a small subset of human TFs has been characterized by ChIP-based approaches in any given cell line (7,13,14), although some sequence occupancy can be inferred from DNaseI (12,17) and MNase (18) data. In the absence of genome-wide binding data, knowledge of the DNA-binding specificities of the TFs within regulatory networks in concert with data sets on sequence conservation, chromatin accessibility and histone modifications can be exploited by computational algorithms to predict TF genomic occupancy, and thereby construct more elaborate transcriptional regulatory models (1,9,17,19–24). Given the difficulty in characterizing the diverse binding patterns of all expressed TFs in all possible temporal and spatial expression patterns in vertebrates, the ability to estimate the specificity of the constellation of TFs expressed at any given time in a given cell type provides a critical data set for constructing these regulatory models.
Cys2-His2 zinc finger proteins (ZFPs) are the largest class of TFs within most metazoans (25), with an estimated 675 members in the human genome (26) harboring an average of 8.5 finger units per gene (27). The majority of these ZFPs are believed to be involved in DNA-recognition, as many of the neighboring fingers are connected by a Krüppel-type TGE(K/R)P linker, which is a hallmark of DNA-binding fingers (28). The canonical DNA-recognition model for an individual finger is based on the ZFP-DNA co-crystal structure of Zif268 (29,30) and other naturally occurring and engineered ZFPs (31–35), wherein each finger potentially recognizes a 4-bp subsite that overlaps the recognition site of the neighboring N- and C-terminal fingers by 1 bp (Figure 1A). Amino acid residues at positions −1, +2, +3 and +6 of the recognition helix typically mediate the recognition preference of a finger within its subsite. The target site preference of a tandem array of fingers reflects a complex interaction between the individual finger modules, as the recognition properties of an individual finger can be influenced by its position within an array and the recognition determinants displayed by its immediate neighbors (36–41).
Figure 1. (A) Schematic representation of the canonical recognition pattern of two zinc fingers recognizing a hexamer sequence. Each zinc finger unit spans ∼30 amino acids and folds into a ββα-motif around a tetrahedrally coordinated zinc ion (42,43). DNA-binding specificity is typically mediated by residues at positions −1, +2, +3 and +6 of the recognition helix, where the numbering scheme refers to the position of each residue relative to the start of the α-helix. The boxed base pair (N4) represents the position of potential recognition overlap in the canonical recognition model. (B) Schematic representation of the two-stage process used to identify two-finger modules with the desired sequence preference. In Stage 1, the B2H system is used to select two-finger modules from an OPEN-based library, where the finger pools used correspond to the finger 2 (F2) and finger 3 (F3) subsites in each target site (44,45). These two-finger libraries are selected in the context of a constant finger 1 (F1) module that recognizes GCG in the neighboring subsite. The DNA-binding specificity of active clones recovered from the B2H selection was determined using the B1H system using a 6-bp randomized library adjacent to the constant GCG F1 binding site. The recovered binding sites are determined by Illumina sequencing and then a binding site motif is calculated from these sequences (46).
DNA-binding specificities have been determined for only a small fraction of ZFPs in metazoan genomes (13,17,26,47–50). Unlike other TF families where the majority of the resident factors in diverse species share a high degree of homology (26,51–54), evolutionary analysis of ZFPs indicates that a substantial fraction of resident members do not have highly conserved homologs across metazoans. Instead, the number and composition of fingers within these ZFPs is dynamic between species (27,55,56) and can even vary within a species [e.g. the variation in human PRDM9 isoforms (57,58)]. The specificity determinants within these ZFPs are under strong positive selection, implying the rapid diversification of their recognition potential (27). Consequently, naturally occurring ZFPs can specify a wide variety of different DNA sequences based on both the number and composition of fingers within the array.
Although some principles that govern the recognition properties of zinc fingers have been established, the accurate prediction of their DNA-binding specificity remains challenging. Specificity determinants at individual recognition helix positions with defined base preferences have been extracted from the biochemical and structural characterization of naturally occurring ZFPs (42,47,49,50,59–61) and the selection and characterization of artificial ZFPs that recognize novel target sequences (37,38,41,44,62–74). These data provide a foundation for the construction of predictive recognition models that estimate DNA-binding specificity based on the sequence of the recognition helix of each incorporated finger. Initial models focused on using the amino acid identity at key determinant positions (−1, +2, +3 and +6) to estimate the base preference at their primary DNA contact positions within the DNA subsite bound by each individual finger (75–77). Recently, more advanced predictive models have been constructed with improved performance that incorporate context-dependent recognition, which allows determinants to influence more binding site positions than prescribed by the standard recognition model (76–82). However, the construction of these models has been hampered by the limited amount of existing quantitative specificity data for ZFPs that links individual fingers with recognition of particular subsites.
A comprehensive recognition model for canonically binding ZFPs should be achievable using the growing archive of quantitative specificity data from recent bacterial one-hybrid (B1H) analysis of a large number of artificial (41,62,71) and naturally occurring ZFPs (49,50), where the position of each finger within the recognition sequence is defined or can be inferred. This data set spans 678 two-finger modules, including the characterization of 95 two-finger modules generated using the Oligomerized Pool ENgineering (OPEN) system (44,45) described herein. A sizeable fraction of these data explicitly examines the impact of recognition residues at the finger–finger interface on the preferred specificity at the junction of the finger binding sites, which remains the most challenging recognition feature to model. These data permit an improved estimation of context-dependent effects requiring the use of predictive models [such as support vector machines (83) or random forests (RFs) (84)] that implicitly capture these complex properties. Building on our previous efforts using RF models to estimate the specificity of homeodomains (85), we have constructed an RF predictive model for ZFPs using our B1H data that is superior to existing predictive models and that can effectively estimate the DNA-binding specificity of a number of naturally occurring ZFPs.
MATERIALS AND METHODS
OPEN finger selections
OPEN selections were performed to generate a set of two-finger modules that recognize all 64 possible GNNGNG-type sequences in the context of an N-terminal ‘GCG’ binding anchor zinc finger (recognition helix: RSDTLAR). All target sites used in the selection of novel recognition fingers were of the form GNNGNGGCG. Zinc finger libraries for each target site were assembled from the corresponding Finger 2 and Finger 3 OPEN pools as previously described but with a fixed Finger 1 module (44,45). OPEN selections were performed essentially as previously described (44,45) but using a beta-lactamase (bla) antibiotic-resistance gene instead of the HIS3 gene (70). For each of the 64 selections, we assayed the abilities of up to five clones to activate expression of a lacZ reporter gene in a bacterial two-hybrid (B2H) system as previously described (45) and determined the amino acid sequences of these clones. Fifty-eight of the 64 selections displayed active clones, from which we chose 95 clones that could activate expression of lacZ in the B2H system by ∼2.5-fold or more for further evaluation via B1H binding site selections (Supplementary Table S1).
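As a quick check on the size of the target set, the 64 sites can be enumerated directly. The following is a minimal sketch, not part of the original protocol; the function name `gnngng_targets` is hypothetical, and the only inputs taken from the text are the GNNGNG pattern and the GCG anchor subsite.

```python
from itertools import product

ANCHOR = "GCG"  # subsite bound by the fixed finger 1 anchor, as stated in the text


def gnngng_targets():
    """Return every 9-bp target site of the form GNNGNG followed by the anchor."""
    sites = []
    # Three variable positions, four bases each: 4**3 = 64 combinations.
    for n1, n2, n3 in product("ACGT", repeat=3):
        sites.append(f"G{n1}{n2}G{n3}G{ANCHOR}")
    return sites


targets = gnngng_targets()
print(len(targets))   # number of distinct target sites
print(targets[:3])    # first few sites in lexicographic order
```

Three variable positions with four possible bases each give 4³ = 64 sites, matching the number of selections described above.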
CV-B1H method
To determine binding site specificities of OPEN-selected and other 2F-modules, the CV-B1H (Constrained Variation Bacterial one-Hybrid) assay was performed essentially as described previously (46). Two-finger modules were evaluated as fusions to the GCG anchor finger. Following transformation into the selection strain, 1 × 10⁶ cells containing the zinc finger plasmid (1352-omega-UV2-ZFP) and the 6-bp randomized binding site library (in pH3U3) were plated on selective NM minimal medium plates (100 × 15 mm) containing 50 µM IPTG and 1 or 2 mM 3-AT and grown at 37°C for 22–30 h. All cells on the plate were pooled, and the pH3U3 plasmids containing the compatible binding sites were isolated for identification of the functional DNA sequences. The binding site region was PCR amplified, barcoded and sequenced via Illumina sequencing, and then binding specificities were determined from these data using GRaMS modeling and the log-odds method (46,71,86).
Construction of the RF ZFP regression model
Based on a pilot study and previous work with homeodomain recognition modeling (85), we developed a recognition modeler based on an RF regression approach (84) using the ‘randomForest’ package for R [http://www.r-project.org/ (87)]. Two different ZFP RF regression models were trained based on the B1H specificity data: one-finger and two-finger models. The training data for the two-finger model consisted of 678 protein sequences for two fingers of ZFPs and the position frequency matrices (PFMs) obtained from the B1H experiments described above. The one-finger model was trained on the same set but contained 1209 individual fingers (redundancy removed, Supplementary Table S2). Preliminary analysis showed that including additional protein positions beyond the canonical −1, +2, +3 and +6 recognition positions in each finger did not improve the accuracy of the model, so all further training used only those positions. Of the 678 two-finger examples, there are 530 unique combinations of residues at positions −1, +2, +3 and +6; all of them are kept in the data set because the PFMs, while similar between repeats, are not identical and this maintains the inherent variability in the data. These models use the RF regression engine that was previously described (85). The modeler predicts the PFM for a zinc finger protein based on its sequence at the recognition positions, and the RF regression minimizes the mean-squared error (MSE) between the predicted and observed PFMs. MSE values for a single position can range from 0, if the two PFMs are identical, to 0.5 if they contain probabilities of 1.0 for different bases. A random position (probability of 0.25 for each base) would have a maximum MSE of 0.1875 compared with a position with probability of 1.0 for any base. This has the effect of generating PFMs that tend toward random at some positions instead of making high probability predictions that are frequently incorrect.
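The MSE bounds quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes that the per-position MSE is the squared difference between the two probability vectors averaged over the four bases; this definition is an assumption, chosen because it yields the stated limits of 0.5 and 0.1875. The function names `column_mse` and `pfm_mse` are illustrative and are not taken from the ZFModels code.

```python
def column_mse(p, q):
    """Mean squared difference between two base-probability columns (A, C, G, T)."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def pfm_mse(observed, predicted):
    """Average the per-position MSE over two aligned PFMs of equal length."""
    if len(observed) != len(predicted):
        raise ValueError("PFMs must have the same number of positions")
    return sum(column_mse(p, q) for p, q in zip(observed, predicted)) / len(observed)


certain_a = [1.0, 0.0, 0.0, 0.0]   # probability 1.0 for A
certain_c = [0.0, 1.0, 0.0, 0.0]   # probability 1.0 for C
uniform = [0.25, 0.25, 0.25, 0.25]  # a "random" position

print(column_mse(certain_a, certain_a))  # 0.0    (identical columns)
print(column_mse(certain_a, certain_c))  # 0.5    (certainty for different bases)
print(column_mse(uniform, certain_a))    # 0.1875 (random versus certain)
```

Under this definition a uniform prediction can never cost more than 0.1875 at a position, whereas a confident but wrong prediction costs 0.5, which is why minimizing MSE favors hedged predictions at uncertain positions.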
We used the default value of 500 trees while training the RF model. In this model, a single tree picks predictive variables, specific amino acids at specific positions, randomly and then applies regression to estimate their contribution to each PFM parameter. The set of individual trees is then weighted by regression to minimize the overall MSE between the observed and predicted PFMs. Accuracies were determined by 10-fold cross-validation, where the total data set was divided into 10 subsets and training was based on nine of them and the accuracy measured on the remaining subset. Each of the subsets was left out in turn, and the testing accuracy is reported as the means and medians on the test sets.
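The cross-validation scheme can be illustrated with a short partitioning routine. This is a sketch under stated assumptions: the text does not say how the 10 subsets were formed, so the random shuffle, the fixed seed and the function name `k_fold_indices` are all hypothetical choices made for illustration.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Items are shuffled once (an assumption; the source does not describe
    the partitioning), then each of the k subsets is held out in turn.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal subsets
    for held_out in range(k):
        test = folds[held_out]
        train = [j for f in range(k) if f != held_out for j in folds[f]]
        yield train, test


# 678 two-finger modules, as in the training set described in the text.
splits = list(k_fold_indices(678, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

A model would be fitted on each `train` list and scored on the matching `test` list, with the mean and median of the test-set MSE values reported across the held-out subsets.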
We chose to minimize MSE because we are specifically trying to find optimal PFMs that fit the entire distribution of binding site affinities. However, other objectives could be used instead. There have been a large number of different methods proposed to compare motifs with each other and determine a quantitative measure of similarity (88–94). The MSE that we use is closely related to maximizing the Pearson correlation and is often a highly ranked method, particularly when trying to assign a motif to a specific class of transcription factors. In other approaches more emphasis is put on matching high information content positions in the binding sites and low information content positions are scored similar to mismatches. For example, the recently published zinc finger predictor from the Princeton group (82) specifically maximizes the number of correctly predicted positions with high information content, which has advantages for some purposes (see later in the text).
Construction of ZFP recognition motif predictions
We established a Web site that will predict the binding motif for an input ZFP containing any number of fingers (http://stormo.wustl.edu/ZFModels/). ZFP sequences can be submitted in two forms as follows: a concatenation of the four critical recognition residues of each finger (−1, +2, +3 and +6) or the entire protein sequence. In the latter case, the Web site will determine the locations of the recognition residues in each finger based on a HMMER analysis (95) of zinc finger motifs present within the sequence. Three different ZFP motif generation methods are available based on the trained RF regression models: one-finger model, multi-finger model and the average of these models. In the one-finger model, the predictions are based on training of single fingers, and the complete motif is predicted by concatenating the individual predictions. In the multi-finger model, the predictions are based on the two-finger training data, and the complete motif is stitched together from the overlapping two-finger predictions, where the positions of overlap between the motifs are averaged (Supplementary Figure S1). The third method averages together the prediction from the one-finger and two-finger models to generate the final prediction. Generally, the different predictions are in close agreement but sometimes there is a divergence and the most accurate may depend on the specific zinc finger protein; therefore, we advocate testing with each model to examine the inherent variation.
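The stitching step of the multi-finger model can be sketched as follows. The code assumes that each two-finger prediction is a 6-column PFM, that consecutive predictions are offset by one 3-bp finger subsite, and that the caller supplies them in order along the binding site; the function name `stitch_two_finger_pfms` and the `step` parameter are hypothetical and do not come from the ZFModels implementation.

```python
def stitch_two_finger_pfms(pfms, step=3):
    """Merge overlapping PFMs into one motif, averaging the shared columns.

    pfms: list of PFMs ordered along the binding site; each PFM is a list
          of columns, and each column is [pA, pC, pG, pT].
    step: offset in base pairs between consecutive PFMs (assumed to be 3).
    """
    width = len(pfms[0])
    total = step * (len(pfms) - 1) + width
    sums = [[0.0] * 4 for _ in range(total)]
    counts = [0] * total
    for i, pfm in enumerate(pfms):
        for j, column in enumerate(pfm):
            pos = i * step + j
            counts[pos] += 1
            for b in range(4):
                sums[pos][b] += column[b]
    # Divide by the number of predictions covering each position.
    return [[v / counts[pos] for v in sums[pos]] for pos in range(total)]


# Toy example: two 6-bp predictions for a three-finger array.
first = [[1.0, 0.0, 0.0, 0.0]] * 6   # every column predicts A
second = [[0.0, 1.0, 0.0, 0.0]] * 6  # every column predicts C
motif = stitch_two_finger_pfms([first, second])
print(len(motif))  # 9 positions
print(motif[4])    # a shared column: average of the two predictions
```

For a three-finger array this yields a 9-bp motif in which the three columns belonging to the middle finger are the mean of the two overlapping predictions.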
Evaluation of Bcl6 predictive motif for predicting ChIP-seq peaks
The predicted DNA-binding specificity of Bcl6 was estimated using the multi-finger model through the ZFModels interface. The top 100 ChIP-seq peaks for Bcl6 (96) were extracted using Galaxy (97), and a motif for Bcl6 was extracted from these peaks using MEME (zoops mode) (98). MSE was calculated from this PFM against different motifs as described above. FIMO (99) was used to determine the number of the top 100 ChIP peaks containing favorable Bcl6 binding sites (P < 10⁻⁴) based on each motif.
RESULTS
Selection and characterization of two-finger modules recognizing GNNGNG target sites
We used OPEN selections (44,45) to identify two-finger modules recognizing 64 different 6-bp target sites of the form GNNGNG (Figure 1B). This set of target sites was chosen to include a focused set of sequences that were available in the OPEN system to explore the quality of the B2H-generated fingers. In addition, for the defined target positions (constant guanines), there are strong expectations about the complementary recognition determinants that would be selected. Deviations from the expected residues in the recovered sequences would be indicative of context-dependent effects. These two-finger modules were selected via the B2H system in the context of a three-finger array harboring a fixed N-terminal anchor finger that recognizes a GCG subsite. Fifty-eight of these selections yielded zinc finger arrays that bound their target site as evidenced by their ability to activate transcription in a B2H lacZ reporter assay (Supplementary Table S1).
We determined the DNA-binding specificity of a representative set of the B2H-selected two-finger modules using the B1H system (49,71). Each two-finger module was characterized using a reporter system containing a 6-bp randomized binding site library adjoining the finger 1 recognition element—GCG (46,71) (Figure 1B). After selection, surviving colonies carrying the functional DNA-sequences for each two-finger module recovered from this library were pooled and characterized by Illumina sequencing from which a preferred recognition motif was determined (46). This analysis yielded motifs for 95 OPEN-selected two-finger modules (Supplementary Figure S2). For 64 of these two-finger modules, the preferred recognition sequence matched the expected target site. The remaining modules are complementary to their target sequence, but actually prefer a related binding site. These modules expand the population of characterized two-finger modules for the construction of artificial zinc finger arrays, and the coupled specificity data provide additional information on the recognition potential of specific determinant combinations for the construction of improved predictive models.
Assessing context dependence in our selected two-finger modules
As a basis set for constructing predictive recognition models for ZFPs, we have used quantitative B1H specificity data on a large group of naturally occurring (49,50) and artificial (41,62,71) zinc finger arrays. To facilitate the evaluation of DNA-recognition by these zinc fingers, we have parsed this data set into 1209 different one-finger modules or 678 different two-finger modules. For example, a characterized three-finger array is broken down into three one-finger modules or two overlapping two-finger modules with their associated subsite motifs (Supplementary Figure S1). Figure 2 shows the base preferences at base pair positions 1, 2 and 3 within the core subsite (contacted by specificity determinants at positions +6, +3 and −1, respectively; see Figure 1) for this data set of one-finger modules. In general, the observed amino acid to base correlations at each position are consistent with previous studies of recognition preferences for zinc finger proteins (42,43,50,76–78). The strongest correlations are observed at the central base; amino acid changes at position +3 in the recognition helix primarily influenced recognition at the middle base position of the altered finger subsite in our two-finger modules when examined over the data set (Supplementary Figure S3). The independence of recognition at this position was previously harnessed to expand the recognition diversity of our two-finger modules in a directed manner in many instances (71).
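The decomposition of an array into one-finger and two-finger modules can be sketched in a few lines. The code assumes that each finger is supplied as a 7-residue recognition helix string covering positions −1 to +6 (so that positions −1, +2, +3 and +6 fall at string indices 0, 2, 3 and 6); this convention, the function names and the second and third helix sequences in the example are illustrative assumptions. Only RSDTLAR, the anchor finger helix, is taken from the text.

```python
def recognition_residues(helix):
    """Return the residues at positions -1, +2, +3 and +6 of a 7-residue helix.

    Assumes the string lists positions -1, +1, +2, +3, +4, +5, +6 in order.
    """
    if len(helix) != 7:
        raise ValueError("expected a 7-residue recognition helix")
    return helix[0] + helix[2] + helix[3] + helix[6]


def split_into_modules(helices):
    """Break an array into one-finger modules and overlapping two-finger modules."""
    one_finger = [recognition_residues(h) for h in helices]
    two_finger = [one_finger[i] + one_finger[i + 1]
                  for i in range(len(one_finger) - 1)]
    return one_finger, two_finger


# Three-finger array: the anchor helix from the text plus two hypothetical helices.
array = ["RSDTLAR", "QSGDLTR", "RSDNLTR"]
one_finger, two_finger = split_into_modules(array)
print(one_finger)  # three 4-residue keys
print(two_finger)  # two overlapping 8-residue keys
```

A three-finger array therefore contributes three one-finger modules or two overlapping two-finger modules, each of which would be paired with the corresponding 3-bp or 6-bp portion of the observed motif.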
Figure 2. Base preferences observed across the data set for specificity determinants at each of the canonical recognition positions (+6, +3 and −1). For each amino acid (X-axis) at the finger positions +6 (top), +3 (middle) and −1 (bottom), the corresponding base preferences, averaged over all examples, are garnered from the B1H-determined recognition motifs. Base preferences at binding site position 1 are indicated for position +6 specificity determinants; base preferences at binding site position 2 are indicated for position +3 specificity determinants; base preferences at binding site position 3 are indicated for position −1 specificity determinants.
Weaker correlations at other positions highlight the role of context on specificity. The influence of context dependence on the DNA-binding specificity of individual fingers is apparent from a qualitative analysis of finger sets within our data set, particularly at the finger–finger interface for a subset of two-finger modules where residues on both sides of the interface were randomized to more effectively capture these effects (Figure 1A) (62,71). For many individual two-finger modules, the base at position 4 is highly specified. However, when the preferred specificity at this position is binned across the data set based on the type of residue at position +6 of the N-terminal finger (Figure 3A), some amino acids are associated with each of the four bases in different C-terminal finger contexts. Glutamate at position +6 provides a notable example, where two-finger modules containing this residue display distinct preferences for each of the four bases at position 4 (Figure 3B). The potential influence of residues within the C-terminal finger, in particular the residue at position +2, on recognition at base position 4 is well documented (29,31,38,100). Consistent with the potential influence of position +2 on recognition, changes in the residue at position +2 in the recognition helix in many instances appear to influence neighboring base preference, particularly at position 4 (Supplementary Figure S4). These data highlight the need for a predictive model that can capture the influence of each determinant position on multiple base positions within the zinc finger recognition sequence.
Figure 3. Context-dependent preferences observed for the base at position 4 (P4) recognized by the two-finger modules across the entire data set. (A) Stacked bar plot showing the distribution of base preferences dictated by each amino acid at position +6 of the N-terminal finger in a two-finger module. The height of each bar corresponds to the number of zinc finger modules with the amino acid labeled on the X-axis. The height of each colored bar segment corresponds to the number of modules preferring a particular base. Preference was defined as nonspecific if the information content at a position is <0.3. (B) Examples of context-dependent preference at position P4. Logos representing the specificity of four different two-finger modules with Glu at position +6 (red) of the N-terminal finger with different base preferences at P4. Above each observed motif are the amino acids at the four canonical recognition positions (−1, +2, +3 and +6) for the N-terminal and C-terminal fingers.
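The binning rule in this caption can be sketched as below. The caption gives the 0.3 threshold but not the formula or units, so the code assumes the standard Shannon information content in bits against a uniform background, IC = 2 + Σ p·log2(p); the function names are hypothetical.

```python
import math


def information_content(column):
    """Information content in bits of a column [pA, pC, pG, pT].

    Assumes a uniform background, so IC ranges from 0 (uniform) to 2 (one base).
    """
    return 2.0 + sum(p * math.log2(p) for p in column if p > 0)


def preferred_base(column, threshold=0.3, bases="ACGT"):
    """Return the most probable base, or 'nonspecific' below the IC threshold."""
    if information_content(column) < threshold:
        return "nonspecific"
    best = max(range(len(bases)), key=lambda i: column[i])
    return bases[best]


print(preferred_base([0.25, 0.25, 0.25, 0.25]))  # nonspecific
print(preferred_base([0.10, 0.10, 0.70, 0.10]))  # G
```

Counting the labels returned for position 4 across all modules that share a given residue at position +6 would give the kind of tally shown in a stacked bar plot of this type.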
RF recognition models for ZFPs
Zinc fingers have been the focus of several studies on qualitative recognition codes [reviewed in (42,43)]. More recently, several groups have developed models that predict quantitative motifs for zinc finger proteins based on the residues present at canonical recognition positions within each finger (76–79). Although superior to purely qualitative recognition codes, their accuracies leave considerable room for improvement. These models were limited because they were trained primarily on qualitative data: collections of proteins and their binding sites with high binding affinity, but where the preference of each ZFP for its target site relative to other sequences was unknown. Our B1H-characterized zinc finger data provide a much larger training set with quantitative information about the preferences of different proteins for different DNA binding sites, which allows us to train new recognition models to obtain higher accuracy predictions. In pilot studies, we tested the feasibility of creating recognition models using several different machine learning algorithms, including neural networks (78), support vector machines (83), k-nearest neighbors (101), partial linear regression (102) and RF (84). We found that RF-based models performed as well as or better than those built with other methods and that their implementation was computationally less demanding, so we used an RF regression algorithm to create a predictive model for ZFPs. The results of these preliminary studies were similar to those we previously reported for predicting the specificity of homeodomain proteins (85).
We trained RF predictive models on either one-finger or two-finger module specificity data, where the latter model is designed to capture context-dependent effects between neighboring fingers. Training the two-finger model takes as input the amino acids at the eight canonical recognition positions (−1, +2, +3 and +6 of each finger) and builds regression trees to predict recognition preference over the entire 6-bp binding site. (The one-finger model was similarly trained on individual fingers and each 3-bp binding site.) Importantly, these models are not restricted to the canonical interactions between particular finger recognition positions and bases within the binding site, unlike many previous recognition models (76,77). Because we have a much larger training set than was available for previous models, a wider range of potential interactions between these recognition positions and the binding site are allowed within the model to capture context-dependent effects observed within the data. Consequently, each recognition position within the two-finger module contributes to the overall predicted PFM, although the strongest contributions within the model will be between the most highly correlated amino acids and base pairs.
The objective during model training is to minimize the MSE between the observed and predicted PFM values for each two-finger module. Table 1 shows the average value (both the mean and median with standard deviations) obtained in a 10-fold cross-validation of our two-finger model. This was compared with predictions by each of four other published models that were readily available for testing (76–79). The MSE is greatly reduced with the new ZFModels predictions to less than half for means and less than one-third for medians when compared with other prior models. The prediction error is fairly evenly distributed across the positions of the binding sites (Table 2). Figure 4 displays several examples that are near the median value of MSE to show the degree of similarity between observed and predicted PFMs. Many of the highest accuracy examples contain guanine at positions 1 and 6 because the training set was biased with fingers recognizing guanine at these positions. Figure 4 highlights examples deviating from this pattern, demonstrating that our ZFModels can generate accurate predictions for a wide variety of different types of motifs. As expected, the two-finger predictive model can capture the context dependence at the finger–finger junction observed in our data set, such as the motifs in Figure 3B, whereas the one-finger predictive model fails to capture this subtlety (Supplementary Figure S5).
Figure 4. Examples of observed motifs for two-finger modules that are within our data set, and predicted motifs for these fingers using our final predictive model. Above each observed motif are the amino acids at the four canonical recognition positions (−1, +2, +3 and +6) for the N-terminal and C-terminal fingers. The MSE value between the observed and predicted PFMs is displayed above the predicted motif.
Table 1. MSE for several prediction programs
| Program | ZFModels | Benos | Kaplan | Zifnet | ZIFIBI |
|---|---|---|---|---|---|
| Mean | 0.017 ± 0.005 | 0.044 | 0.047 | 0.040 | 0.072 |
| Median | 0.009 ± 0.002 | 0.033 | 0.035 | 0.032 | 0.063 |
Table 2. MSE for each position, for one-finger and two-finger models (mean/median)
| Nucleotide position | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1 finger | 0.016/0.004 | 0.015/0.005 | 0.008/0.001 | | | |
| 2 fingers | 0.006/0.001 | 0.007/0.003 | 0.006/0.001 | 0.012/0.004 | 0.010/0.004 | 0.004/0.000 |
Note: The reported median values represent the bin the median value falls in, where the bins are 0.001 wide and labeled with the lower value. So if the median value is reported as 0.000 that means the median is in the bin between 0.000 and 0.001. These values come from training and testing on the complete data rather than from cross-validation, resulting in lower values than in Table 1.
Evaluating the utility of the RF-based zinc finger recognition model
Several published studies have determined specificity of ZFPs using SELEX (26,103–105). None of these examples were included in the training data and so they constitute an independent test set. Supplementary Figure S6 contains the logos from the published PFMs for a subset of these ZFPs and the logos predicted by ZFModels. In every case, the predictions match preferred binding sites from the experiments when we take into account the variable spacing between neighboring fingers due to noncanonical linkers in some instances. However, the quantitative models are less consistent than the average fits to zinc fingers within our data set via cross-validation analysis (Supplementary Table S3). This may be due to the SELEX data being evaluated after multiple rounds of selection where the resulting PFM is heavily weighted toward a subset of the highest affinity sites, leading to an over-specified motif. We also compared the ZFModels predictions on some of the same data sets with the predictions made by a recently published method (zf.princeton.edu) based on support vector machine training (83). ZFModels makes more accurate predictions as measured by MSE (Supplementary Table S4) on these independent test sets than the Princeton model, although the Princeton model often contains more matching positions with high information content (see Discussion).
Ideally, our recognition model would also allow prediction of ZFPs with uncharacterized DNA-binding specificity throughout the genome. We chose to evaluate its predictive utility for Bcl6, as this ZFP has been characterized by B1H (50), PBM (47) and SELEX-seq (26), which allows a comparison of our predictive motif against DNA-binding specificities determined via multiple methods, and against ChIP-seq data for this factor (96). The Bcl6 recognition motifs produced by B1H, PBM and SELEX-seq are all similar, although the SELEX-seq motif appears over-specified (Figure 5). We also generated a predicted recognition motif for Bcl6 using the Princeton SVM model for comparison with our model. The Princeton motif has greater information content than our ZFModels motif, but at many positions, the Princeton motif predicts a particular base with absolute certainty, which, much like the SELEX-seq motif, suggests that it is over-specified. When judged against an independent source, a MEME (98) motif from the top 100 Bcl6 ChIP-seq peaks (96), the B1H and PBM motifs appear most similar. The ZFModels multi-finger predictive model also shows good similarity to the determined motifs (MSE values of 0.04 from the MEME-ChIP motif, 0.05 from either the PBM- or B1H-based motifs, 0.05 from the Princeton motif and 0.08 from the SELEX-seq motif), but these values are somewhat worse than the average value of <0.01 in our cross-validation studies. FIMO analysis (99) of these ChIP peaks using each motif confirms this assessment: the MEME-derived motif from the Bcl6 ChIP data discovers a good Bcl6 binding site (P < 10⁻⁴) in 74 of 100 peaks, the B1H motif in 56 of 100 peaks, the PBM motif in 52 of 100, the SELEX-seq motif in 43 of 100, the ZFModels predicted motif in 25 of 100 and the Princeton motif in 9 of 100, where only four would be expected by chance.
Thus, our predictive motif has value for the discrimination of binding sites within the genome, and in this example is superior to the Princeton motif, but it can still benefit from the incorporation of additional experimental data to improve its quality. Figure 5 displays logos in two formats, the original information-based method (106) and a PFM-based method where the height of each base is proportional to its frequency in the model (107). The frequency representation demonstrates that even in cases where our model does not make a confident (high probability and high information content) prediction, it generally gets the preferred base correct. Combining all of the experimental models with the MEME model from the ChIP-seq data, one finds a consensus sequence of TTCCTnGAAAG (positions 5–15 in the alignment). Our model agrees at every position except 13, where it prefers G slightly to A, but many of those predictions are low confidence. In contrast, the Princeton model has more high information content positions that match the consensus, but it also contains several positions where the preferred base is assigned a very low probability. Our model has an overall better fit to the other models, as evaluated by MSE and similarities to the rank distributions of all possible binding sites, but there are some purposes for which maximizing the number of high confidence, correct predictions is useful (see ‘Discussion’ section).
Figure 5. Comparison of the MEME motif from the top 100 Bcl6 ChIP peaks (96) with the motif predicted for the five canonically linked fingers by ZFModels and the Princeton SVM method (82) and the recognition motifs determined directly for Bcl6 by B1H (50), SELEX-seq (26) and PBM (47). The left column displays the motifs as information content, whereas the right column displays the motifs as position frequency plots. The frequency of a strong motif match (P < 10⁻⁴) for each motif in the top 100 ChIP peaks as determined by FIMO is indicated above each motif.
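The difference between the two logo formats can be made concrete with a short sketch. It assumes the standard information-content definition for DNA with a uniform background and no small-sample correction (2 bits minus the Shannon entropy of the column), with each base's height in the information logo equal to its frequency times the column's information content. The function names and the example column are hypothetical and are not taken from ZFModels.

```python
import math

def information_content(col):
    """Information content (bits) of one PFM column, uniform background assumed."""
    entropy = -sum(p * math.log2(p) for p in col.values() if p > 0)
    return 2.0 - entropy

def logo_heights(col, mode="information"):
    """Letter heights for one column.

    mode="information": height = frequency * information content of the column.
    mode="frequency":   height = frequency (all columns have total height 1).
    """
    scale = information_content(col) if mode == "information" else 1.0
    return {base: p * scale for base, p in col.items()}

# Hypothetical low-confidence column: A is preferred, but only weakly.
column = {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2}

info = logo_heights(column, mode="information")
freq = logo_heights(column, mode="frequency")
print(information_content(column))
print(max(info, key=info.get), max(freq, key=freq.get))
```

In this example the preferred base is A in both representations, but the information logo compresses the whole column to a small fraction of a bit, which is why a weakly confident but correct prediction is easier to see in the frequency format.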
DISCUSSION
The development of platforms for rapidly characterizing the specificity of transcription factors has dramatically increased the amount of data that is available for all of the major TF families (108), but there are still barriers to generating data for all naturally occurring ZFPs. The average number of fingers in a human ZFP is 8.5 (27), and these polydactyl (i.e. many fingered) ZFPs may have complex binding modes due to the presence of independent DNA-recognition modules. For example, genome-wide ChIP analysis of NRSF (109,110), a 9-finger ZFP, recovered two different types of binding sites: a prominent motif that contains a juxtaposition of two subsites and a set of additional motifs with variable spacing between these subsites. Taipale and colleagues noted the difficulty in characterizing ZFPs by either SELEX-seq or PBM (26): they successfully characterized only 8% of ZFPs and only 3% with more than eight fingers (26). Similarly, our B1H motif set includes only seven naturally occurring ZFPs with ≥8 fingers with a success rate of ∼38% of the attempted Drosophila ZFP genes (50). With the possibility that polydactyl ZFPs use different finger sets to bind multiple distinct motifs, describing their recognition properties is critical to understanding their regulatory mechanisms. The growing body of quantitative specificity data for naturally occurring and artificial ZFPs provides a foundation for the development of improved predictive models for this family to help facilitate a broader understanding of their function as regulators within the genome, where other direct analysis methods may be challenging to use.
Our efforts to construct an improved predictive model have focused on two aspects of the problem: expanding the population of quantitatively characterized finger modules and using new methods for training improved recognition models. We have used OPEN-based ZFP selection methods (44,45) to expand our existing set of B1H-characterized artificial and naturally occurring fingers to 1209 one-finger modules and 678 two-finger modules. The latter group captures context-dependent effects that can occur at the finger–finger interface, allowing the construction of recognition models that span more than a single finger, thereby providing additional information on the recognition potential of specific determinant combinations for the construction of improved predictive models. These finger archives and the underlying data also have value in the design of artificial ZFPs to recognize specific sequences. Thus, the assembly of these modules can be data driven by applying ‘rules’ for recognition of particular sequences to estimate which assembled finger models are likely to provide the desired composite specificities.
Our assessment of ZFModels shows that the motif predictions obtained are superior to previously published predictors. This is likely due to our larger and better (i.e. quantitative) training sets that allow us to consider more interactions, not just the canonical ones that have been primarily used in the past. We have also leveraged our two-finger module data to extend the model construction beyond one-finger to two-finger units, where the two-finger model constructs motifs by assembling interfaces via a stitching assembly (62) to try to minimize edge effects of the two-finger module data on the resulting motif. This model is accessible to the community through our Web site (http://stormo.wustl.edu/ZFModels/). Users can input a protein sequence and an HMM-based algorithm will extract the determinants in each finger for construction of a recognition motif. Users can use either the one-finger or multi-finger model, or a hybrid (average) of these two models for generating a motif for their factor. On an independent test set, the hybrid model performed slightly better (Supplementary Table S3), although the results from each method are similar.
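One simple way to picture the hybrid option is as a position-by-position average of the two predicted motifs. The sketch below assumes both models return PFMs of equal length that are already aligned; how ZFModels actually aligns and combines the one-finger and multi-finger predictions is not described here, so the function name and example data are hypothetical.

```python
BASES = "ACGT"

def hybrid_pfm(one_finger_pfm, multi_finger_pfm):
    """Average two aligned PFMs base by base (assumed form of the hybrid model)."""
    if len(one_finger_pfm) != len(multi_finger_pfm):
        raise ValueError("PFMs must cover the same positions")
    return [
        {b: (c1[b] + c2[b]) / 2.0 for b in BASES}
        for c1, c2 in zip(one_finger_pfm, multi_finger_pfm)
    ]

# Hypothetical predictions for two positions from each model.
one_finger = [
    {"A": 0.6, "C": 0.2, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
multi_finger = [
    {"A": 0.8, "C": 0.1, "G": 0.05, "T": 0.05},
    {"A": 0.1, "C": 0.1, "G": 0.5, "T": 0.3},
]

hybrid = hybrid_pfm(one_finger, multi_finger)
print(hybrid)
```

Because each input column sums to 1, each averaged column does as well, so the result can be used directly as a PFM without renormalization.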
There is still room for improvement in our predictive model, especially for some classes of C2H2 ZFs with noncanonical linkers that may lead to alternate finger sequences or binding modes, but in nearly every case tested the predictions are at least partially correct and allow for the alignment of the individual fingers with the segments of the binding motifs that they interact with. A recently reported large compendium of zinc finger proteins selected for binding to specific DNA sequences (74), and then with their specificities determined by B1H, may provide additional, more diverse information to improve the predictive models further, but this has not been tested yet. Currently, predictions from our models are not accurate enough on their own to make reliable regulatory networks, but may be useful in conjunction with accessibility data and DNaseI footprinting data (12) to identify their regulatory sites. They can also aid in assigning ZF-TFs to particular motifs that are discovered through computational analysis of other genomic features, although for that particular problem, the alternative SVM approach of the Princeton group (82) will sometimes work better. Their approach trains their model to maximize the number of high information content positions that are correctly predicted. By then applying string matching methods, one can sometimes identify a ZF-TF that is likely to bind to a known motif [e.g. PRDM9 (58)] in cases where our model may yield a less definitive consensus because it may predict many low information content positions. In some cases, these approaches may also allow us to determine whether only a subset of ZFs are used to recognize DNA, or if different subsets are used to recognize different classes of binding sites, as when ZFPs use alternative modes of binding for interacting with different sequences. 
Given the rapid diversification of ZFPs during evolution and the technical challenges associated with experimental determination of their specificities, the continued refinement of predictive models will likely play an important role in understanding the roles of these proteins in transcriptional regulatory networks.
FUNDING
U.S. National Institutes of Health (NIH) [GM068110 to S.A.W., HG000249 to G.D.S., HG004744 to M.H.B. and S.A.W., GM078369 to J.K.J., S.A.W., G.D.S.]. Funding for open access charge: U.S. National Institutes of Health (NIH).
Conflict of interest statement. J.K.J. has financial interests in Editas Medicine and Transposagen Biopharmaceuticals. J.K.J.’s interests were reviewed and are managed by Massachusetts General Hospital and Partners HealthCare in accordance with their conflict of interest policies.
ACKNOWLEDGEMENTS
The authors thank members of the Brodsky, Joung, Stormo and Wolfe laboratories for their assistance with these studies.
REFERENCES
Author notes
†These authors contributed equally to the paper as first authors.







