Understanding transcription factor (TF) mediated control of gene expression remains a major challenge at the interface of computational and experimental biology. Computational techniques predicting TF-binding site specificity are frequently unreliable. On the other hand, comprehensive experimental validation is difficult and time consuming. We introduce a simple strategy that dramatically improves robustness and accuracy of computational binding site prediction. First, we evaluate the rate of recurrence of computational TFBS predictions by commonly used sampling procedures. We find that the vast majority of results are biologically meaningless. However clustering results based on nucleotide position improves predictive power. Additionally, we find that positional clustering increases robustness to long or imperfectly selected input sequences. Positional clustering can also be used as a mechanism to integrate results from multiple sampling approaches for improvements in accuracy over each one alone. Finally, we predict and validate regulatory sequences partially responsible for transcriptional control of the mammalian type A γ-aminobutyric acid receptor (GABA A R) subunit genes. Positional clustering is useful for improving computational binding site predictions, with potential application to improving our understanding of mammalian gene expression. In particular, predicted regulatory mechanisms in the mammalian GABA A R subunit gene family may open new avenues of research towards understanding this pharmacologically important neurotransmitter receptor system.
Co-regulation is a basic mechanism to coordinately control expression of genes in modules, biochemical pathways and protein complexes ( 1–3 ). In eukaryotes, expression is most often mediated by transcription factors (TFs) that bind upstream of the transcription start site (TSS) and recruit the polymerase assembly ( 4 ). TFs bind, with varying affinity, to a set of similar, short (∼6–20 nt) sequences ( 5 ). Computational binding site discovery focuses on finding significantly overrepresented sequences in upstream regions of co-regulated genes ( 6–8 ). Thus, computational TFBS prediction algorithms must begin with an input set of promoters from genes hypothetically co-regulated by a shared TF. The algorithms aim to predict the binding positions and consequently the nucleotide specificity of the TF ( 9–11 ).
The first part of transcription factor binding site (TFBS) discovery, the input set, can be identified using either computational or experimental methods. Experimental techniques, such as chromatin immunoprecipitation (ChIP) ( 12 ), have been successfully used to generate a genome scale mapping of approximate TF-binding positions ( 10 , 13 , 14 ). Computational techniques, such as phylogenetic profiling ( 15 , 16 ) and artificial neural networks, can also be used to identify sets of co-regulated genes. Both experimental and computational approaches, however, suffer from a significant false positive (FP) prediction rate. Inclusion of extraneous promoters in the input sets dilutes the enrichment of the shared TFBS sequences making computational TFBS discovery significantly more challenging ( 17 ). We term such erroneously included promoters decoy sequences (DSs).
After receiving a set of upstream regions co-regulated by a shared TF as input, computational methods aim to predict the binding positions of that TF ( 6–8 , 18 ). Given a set of input promoters, motif detection algorithms identify a set of short, oligonucleotide segments hypothesized to bind to the TF of interest. The predicted sequences can be used to construct a position weight matrix (PWM) representing the average nucleotide frequencies for each position in the site ( 19 ). Ideally, computational detection will return all sequences that bind to every TF with biologically relevant function in those upstream regions. However, since the source of binding specificity for TFs is not well understood ( 20 ), heuristic approaches and ad hoc multiple alignment based scoring schemes are used to identify locally optimal solutions ( 17 ). Each local optimum that exists in a given set of promoters may correspond to distinctly different motifs, and may score differently relative to each other according to different scoring schemes.
Binding site prediction algorithms are generally confounded by several factors: degeneracy in the binding site; the unknown length of the binding site; the relatively large length of promoters; and the inclusion of DSs in the input sets ( 17 , 21 , 22 ). As a result as few as 10% of predicted positions correspond to biologically functional binding sites ( 23 ). Due, in part, to the low accuracy rate, computational binding site identification has been of limited use ( 23 ). Problems identifying binding sites are further exacerbated in mammalian genomes by larger promoter regions ( 24 ) and scarcity of reliable information on co-regulation of genes. Thus, the most demanding test of efficacy for TFBS identification approaches lies in their application to mammalian systems and subsequent validation of predictions.
Because of computational complexity of the problem, Gibbs sampling is often used to identify binding positions ( 18 ). In this paper, we present a new strategy that clusters Gibbs sampling results at each input nucleotide—a technique we term positional clustering—to improve accuracy of predicted TF binding. We evaluate the efficacy of our approach using known examples of binding and regulation in yeast and experimentally testing predicted TF-binding sites upstream of the subunit genes coding for the heteromeric mammalian neurotransmitter receptor system, the type A γ-aminobutyric acid receptor (GABA A R).
The GABA A R is the major inhibitory neurotransmitter receptor in the central nervous system (CNS) ( 25 , 26 ) with important roles in development ( 27 , 28 ) and disease ( 29–31 ). The receptor is believed to be a pentamer made up of multiple subunits that come from at least four different subunit classes (α, β, γ and δ) ( 32 ). At least 19 genes code for the various subunits that differentially combine to form numerous pharmacologically distinct GABA A receptor isoforms ( 29 , 30 ). Isoform utilization depends in part on the relative abundance of the subunits, which may change under various conditions ( 33–35 ). Understanding subunit regulatory mechanisms may provide insight into GABA A receptor isoform usage and related phenotypes ( 36 ).
In the current study, we test the ability of positional clustering to detect known TF-binding sites in a series of increasingly noisy sets of yeast promoters, and found marked improvement in the percentage of correct predictions over Gibbs sampling alone. We also present de novo predictions of TF-binding sites in promoter regions of GABA A receptor subunit genes (GABRs) whose expression is altered (either up-regulated or down-regulated) in an animal model of temporal lobe epilepsy ( 35 ). Positional clustering identified a number of putative cis -regulatory sites, many of which correspond to known binding elements for TFs found in the CNS. Mobility shift assays showed several predicted GABR-binding sequences specifically bind nuclear proteins derived from primary neocortical neurons kept in culture. Furthermore, a particular non-consensus GABR putative regulatory sequence was shown to have a functional role in cultured cortical neurons demonstrating the efficacy of positional clustering in detecting functional regulatory elements in mammals.
Saccharomyces cerevisiae promoter selection
We identified S.cerevisiae genes predicted at high confidence ( P ≤ 0.001) to be regulated by the TF STE12 in YPD growth media, according to whole-genome TF location data ( 14 ). For the 51 identified genes, we collected upstream intergenic promoters. Intergenic regions were truncated at 1 kb upstream of the gene's TSS.
GABR promoter selection
We selected for study a set of six GABRs: GABRA1, GABRA4, GABRB1, GABRB3, GABRD and GABRE. Promoters were extracted for each gene, including two alternative first exons of the GABRB3 ( 37 ), giving a set of seven promoters. The length of each promoter was: GABRA1, 3733 bp; GABRA4, 1546 bp; GABRB1, 1353 bp; GABRB3 (exon 1), 1310 bp; GABRB3 (exon 1A), 2080 bp; GABRD, 6625 bp; and GABRE, 5278 bp. We augmented the input set with orthologous promoters from rat, with the exception of GABRB3 for which an orthologous gene from mouse was used. In total, 14 promoters upstream of six GABRs were selected for analysis.
Evaluating long-term Gibbs sampling behavior
For a given input set of promoters, we ran the Gibbs sampler BioProspector ( 8 ) 400–550 times, evenly distributed across all motifs widths from 6–15 bp. We used a third-order background model derived from appropriate genomic promoters. We collected the best three results from each BioProspector run. We counted the number of times BioProspector identified each nucleotide in the input set. For each promoter, we identified the maximally occurring nucleotide, and extracted all positions identified by BioProspector >35% of the maximum. We clustered together neighboring positions into putative TFBS. As a dust filter, we removed all putative TFBSs <6 bp long ( Figure 1 ).
For sets of S.cerevisiae promoters, we used 1200 results from 400 BioProspector runs in our evaluation. For GABRps, we considered all 127 non-empty subsets of the seven promoters (orthologous sequences were always considered together). We used results from 70 000 BioProspector runs, evenly distributed across all promoter subsets, in our analysis. In addition to dust filtering, we required putative TFBSs to occur both in the human and in the orthologous rodent promoter.
Evaluating STE12-binding site recovery
Evaluating robustness to DSs
We complemented the seed set of 51 STE12-bound promoters with 1–50 randomly chosen yeast promoters. We performed our motif detection procedure on each input set, and compared the PPV of putative TFBS with that of raw BioProspector results ( Figure 2 , solid lines).
To evaluate the background rate of STE12-binding site recovery, we created a seed set of 51 randomly chosen S.cerevisiae promoters. We evaluated the percentage of STE12-like binding sites identified in the random seed set, as well as in versions of the seed set augmented with 1–50 randomly chosen yeast promoters ( Figure 2 , dashed lines).
For additional yeast evaluations (HAP4, TEC1, YAP1 and YDR026C), we substituted for BioProspector an in-house implementation of the BioProspector algorithm. Comparisons of results from each implementation show the two implementations to be approximately equivalent.
Identification of known binding motifs in GABR predictions
We ran MotifScanner ( 39 ) to search GABR promoters for all vertebrate TF-binding motifs found in TRANSFAC ( 40 ). For each promoter analyzed, we used a prior probability of 0.1 and the corresponding organism specific third-order promoter background model from Eukaryotic Promoter Database (EPD) ( 41 ).We considered positional overlap between MotifScanner predictions and putative TFBSs indicative of known binding motifs in our predictions.
Electrophoretic mobility shift assay (EMSA)
Double-stranded oligonucleotides for EMSA contained the following sequences:
GABRA4: GTGCACACACACGCCCACCGCGGCTCGGG and
Nuclear extracts were prepared ( 42 ) and used for gel shift analysis after concentration (Microcon no. 10 columns, Amicon, MA). Quantification was performed on EMSAs under conditions that yield a standard curve for band intensity.
Double-stranded oligonucleotide functional analysis
Single-stranded sense and antisense phosphorothioate oligonucleotides for the predicted GGCGGCGTGCACACACACGCCCACCGCGG binding site are annealed by boiling sense and antisense oligonucleotides for 5 min at equal molar ratios in dH 2 O. Oligos are then cooled to room temperature and placed on ice. Transfections using DOTAP (Roche)/HEPES solutions are performed with oligonucleotides corresponding to wild-type, mutant or with DOTAP (Roche)/HEPES solution lacking oligonucleotides (MOCK) as described in ( 29 ). Effects of oligonucleotide application to neurons are assessed by real-time RT–PCR.
Gibbs sampling framework
Since TFBS are predicted computationally by local optimization strategies, we evaluate the extent to which one of these strategies, Gibbs sampling, identifies the same set of segments in repeated runs using the same input data. Identifying stably recurring motifs requires clustering of related results which, in turn, requires definition of ‘related’. Sequence similarity based clustering is impaired by the combination of sequence variation within motifs, the short length of TF-binding sites, and aligning motifs of different lengths. Instead of using sequence based clustering, we chose to cluster results by position, counting the number of times Gibbs sampling identifies each nucleotide in the promoter ( Figure 1 ). We find that Gibbs sampling predictions, generated using BioProspector ( 8 ) are power-law distributed over nucleotide position (Supplementary Figure S2). Gibbs sampling converges on the majority of nucleotides very infrequently, and a small number of nucleotides very frequently. Thus, the most frequently recurring nucleotides appear in as few as 10% of results. Moreover, we find the power-law distribution of results is robust to Gibbs sampling algorithm and scoring scheme (data not shown). We can hypothesize that the most frequently occurring positions are the most biologically significant. Thus, discarding the least frequent Gibbs sampling results may yield higher accuracy and robust identification of biologically insignificant positions.
As a preliminary test of the above hypothesis, we applied repeated runs of Gibbs sampling to a set of 51 S.cerevisiae promoters enriched in STE12 binding as identified by whole-genome ChIP-chip experiments ( 10 ). We used positional clustering of 1200 results to identify the most frequently recurring positions (see Methods). Incorporation of additional results did not significantly alter the distribution of results (data not shown). We chose STE12 because it is one of the best studied TFs, with a well known, highly conserved and experimentally well-defined binding motif ( 10 , 43 ). The most frequently recurring positions were compared with the known STE12-binding motif ( 40 ). We classified predictions into two categories: true positive (TP) if they resemble the experimentally identified STE12-binding motif, and false positive (FP) otherwise (see Methods). Finally, we calculated the positive predictive value PPV as PPV = TP/(TP + FP).
We find that positional clustering and subsequent selection of frequently recurring nucleotides improved the PPV of the STE12 binding site by at least 37% over Gibbs sampling alone ( Figure 2 ). To validate that the above results were not specific to the number of input promoters, the STE12-binding motif, or the particular Gibbs sampling implementation, we repeated the above prediction process for promoters predicted to bind to YAP1, TEC1, HAP4 and YDR026C. We also repeated the analysis replacing the original Gibbs sampling procedure with our own implementation and MotifSampler ( 44 ). In all cases, we found positional clustering significantly improves on results over local optimization procedures alone ( Figure 3 ).
Computational discovery of TFBS can have two types of FP predictions. One type is the identification of an incorrect motif from a set of upstream regions known to bind to a TF of interest as described above (see Methods). The second type of FP error is the background discovery rate of the correct motif using upstream regions that do not bind to the TF. To simulate this rate for STE12-like binding site recovery we repeated the analysis as described above starting with 51 randomly chosen yeast promoters. We find that positional clustering identifies STE12-like sites in <5% of results, compared with 10–15% for Gibbs sampling alone. Thus, using positional clustering, the performance of computational motif discovery is enhanced not only by improving the positive predictive value in promoters of genes co-regulated by STE12, but also by decreasing the false discovery of STE12-like sites by ∼10%.
Robustness to DSs
Next, we evaluated the effect of adding DSs on the performance of Gibbs sampling with and without positional clustering. Addition of DSs dilutes enrichment of the TF-binding site in the input set, making motif detection more challenging ( 17 , 22 ). Modeling DSs, we repeated our estimate of PPV of TFBS detection with the addition of 1–50 random yeast promoters (DSs) to the original set of 51 STE12-bound promoters. We found that positional clustering improves the PPV of Gibbs sampling by >20% through the addition of up to 80% noise or 40 DSs ( Figure 2 , Supplementary Figure S3). Additionally, results of Gibbs sampling both with and without positional clustering decay linearly with the addition of decoys [ R2 = 0.81 and 0.95, respectively (Supplementary Figure S4)]. Extrapolating, we predict positional clustering will maintain an improved PPV through the addition of >100% noise or 70 DSs.
To address issues of generality, we repeated the procedure on additional sets of S.cerevisiae promoters (YAP1, TEC1, HAP4 and YDR026C). An added benefit is that we can evaluate the effect of information content of the binding motif and number of promoters on the improvement from positional clustering ( 22 ). Repeating the analysis, we again find that independently of the set or sampling procedure, positional clustering improves accuracy through a broad range of random DSs ( Figure 3 ). Improvement appears to be limited and unreliable only when sampling alone correctly identifies the binding site in fewer than ∼20% of results. This result is consistent with our analysis of STE12-bound promoters ( Figure 2 ), and may correspond to a lower limit for the efficacy of positional clustering.
Integrating sampling strategies
Recently, researchers have noted that complementary motif detection approaches can be used together to predict binding sites more effectively than either method alone ( 23 ). With this in mind, we evaluated positional clustering in terms of its ability to combine results from two different sampling implementations. For each dataset, an equal number of results from each approach were combined into a single dataset, and positional clustering was used to predict binding sites as described above ( Figure 3C ). We measured the average percent change in PPV for each TF on each dataset, and found positional clustering improved combined sampling by 94% compared with 25% and 27% improvement for BioProspector and MotifSampler, respectively. Additionally, clustering combined sampling improved 19 of the 22 datasets evaluated, whereas clustering of BioProspector and MotifSampler results improved 17 and 16 datasets, respectively. Thus, positional clustering is an effective mechanism to integrate results from multiple sampling procedures.
Identification of GABR cis -regulatory sequences
As described above in Introduction, identifying functional TFBS in mammals is difficult due in part to inclusion of decoy sequence from long upstream regions and lack of information on co-regulation of genes. Positional clustering, as shown above, is more robust to noisy input than Gibbs sampling alone, and thus may be better suited to identify de novo cis -regulatory elements in mammalian promoters that are coordinately regulated. To test this possibility, we chose seven mammalian GABR promoters (GABRps) whose activity is potentially altered in response to status epilepticus as identified through change in mRNA levels of the gene products ( 31 , 35 ). We also included orthologous rodent promoters in the input sets ( 45 ). Orthologous promoters were included to provide more instances of binding sites in the input set than would be expected by random, allowing for easier detection of the sites. Inclusion of orthologous promoters has the additional effect of selectively amplifying evolutionarily conserved binding sites. Such binding sites are more likely to have major functional roles in the regulation of the GABR receptor. Thus, sensitivity to such sequences is improved at the expense of sensitivity to species-specific binding sites. With this effect in mind, we require all GABR-binding site predictions to exist in orthologous promoters. Since the mechanisms of co-regulation for the seven GABRs are unknown, hypothetical co-regulation models were evaluated by querying all 2 7 possible subsets of the seven GABRps. Clustering results on nucleotide positions and selecting the most frequently occurring positions, we predicted 13 functional TF-binding sites.
Predictions were compared with instances of known binding motifs from TRANSFAC ( 40 ), and 8 of the 13 predictions (61.5%) resembled known binding sites for 10 TFs ( Table 1 ). Of the 10 TFs, 7 have been identified in the CNS of rodents: SP-1 ( 46 ); AP-2, TST-1 (POU3F1), OCT-1 (POU2F1), OLF-1 ( 47 ); CP-2 ( 48 ); and RREB-1 ( 49 ). Furthermore, previous analyses of GABR promoter regions agree with our predictions that assign putative regulatory roles to SP-1, OCT-1, OLF-1 in the regulation of GABRs ( 29 ). We chose to validate novel motif predictions with EMSAs and functional studies in primary cultured neurons.
In total, we predict 15 orthologous pairs of regulatory sequences, representing 13 unique sequences. Comparing with known mammalian binding motifs, eight of the predictions show similarity to previously characterized TFBS, as indicated. Where no known binding motif exists, the corresponding in vitro EMSA and functional assay, if applicable, is indicated. Similar predictions are grouped together and aligned by hand.
EMSA ( 50 ) was performed with an excess of cold competitors to define specificity of protein binding in nuclear extracts derived from primary neocortical neurons and fibroblasts (FIBs) kept in culture. As shown in Figures 4–6 , out of six predicted binding sites found upstream of the (α, β, γ and δ) subunit genes, four (GABRA4, GABRB1, GABRB3 and GABRD) displayed specific binding. In addition to specific binding of neuronal extracts to novel GABRA4 motifs, we have evidence for specific binding using FIB extracts ( Figure 5A and B ), of especial interest given that the expression of GABRs is restricted to the nervous system and repressors such as the RE1-silencing transcription factor (REST) ( 51 , 52 ) expressed in non-neuronal cells have been implicated in the neural specificity of gene expression.
Clearly, protein binding to DNA does not always necessitate regulatory function. To begin to address the functional significance of our predicted regulatory motifs, we evaluated the effects of transfecting neurons with double-stranded oligonucleotides containing one of the GABRA4 novel binding motifs (dsA4O), as described above. GABRA4 is especially interesting given that it is regulated by brain derived neurotrophic factor (BDNF) after status epilepticus ( 31 , 53 ). Transfection with the dsA4O produced a significant down-regulation of GABRA4 gene expression in neocortical neurons as monitored by quantitative real-time RT–PCR with no change after MOCK transfection or transfection with a dsO containing three copies of a cAMP regulatory element (CRE) ( Figure 6 ).
How reliable are the binding site predictions returned by Gibbs sampling based TFBS identification algorithms? We began by evaluating the stability of binding site predictions via repeated runs of Gibbs sampling. To quantify the robustness of predictions, we counted the number of Gibbs sampling results at each nucleotide position in the input set ( Figure 1 ) over a large number of repeated trials. We find that the most frequently returned positions better predict TF binding sites than the maximally scoring motifs from Gibbs sampling ( Figures 2 and 3 ). Since scoring functions are empirically derived and do not necessarily represent biological reality, the result is not altogether unexpected ( 17 ). However, we find that selecting frequently recurring positions allows filtering of up to 90% of spurious sampling results caused by convergence on biologically uninformative local minima. Positional clustering allows unbiased aggregation of results from different motif widths, thus approximating the width of the binding site ‘for free’ ( 54 ).
Next we show that positional clustering improves robustness to the addition of DSs ( Figures 2 and 3 ). Such sequences arise from inclusion of promoter regions in input sets without direct binding to the TF either due to experimental error or computational mis-annotation ( 17 , 22 ). In the STE12 example studied, linear regression models indicate our approach will maintain an advantage over traditional Gibbs sampling through addition of up to 150% noise to the original signal (Supplementary Figure S4). Empirical data, however, show a sharp decrease in improvement close to the addition of 45 DSs, or roughly double the input set ( Figure 2 ). Moreover, evaluations using promoters co-regulated by other TFs indicate positional clustering is less likely to improve predictions when Gibbs sampling identifies a correct site in <20% of repetitions ( Figure 3 ). Thus, it is possible the rather simplistic linear model overestimates improvement in robustness beyond what is practically achievable. Moreover, when multiple motifs exist in the input promoters, preliminary evidence suggests positional clustering will uniquely identify a single dominant motif (Supplementary Figure S5). With further refinement, however, it may be possible to recover subordinate motifs, enabling identification of cis -regulatory modules. In spite of these limitations, using positional clustering of repeated runs, researchers can successfully apply sampling algorithms in identification of functional binding sites in datasets with a significant proportion of noise.
Computational prediction of TF binding in mammalian genomes poses just such a challenge due to increased decoy sequence in large upstream regions ( 24 ). Thus, having established increased robustness to DSs in yeast, we applied our approach to identify potentially unknown GABA A receptor subunit gene regulatory sequences that may participate in the response of the genome to seizure activity. We reasoned that GABA A receptor subunit genes either up-regulated or down-regulated in the animal model of epilepsy would share common binding motifs. Using positional clustering, we predicted 13 TF-binding sites upstream of GABA A receptor subunit genes ( Table 1 ). Twelve of our predictions were verified by either comparison to known binding sites or experimental verification using in vitro binding assays. Initially positive experimental results highlight the ability of computational techniques to direct research into transcriptional regulation in mammalian models. As such, our approach may be applicable in the study of other protein complexes in the mammalian proteome.
The reported predictions may enable pharmacologically important downstream research. For example the predicted sites can be used as a starting point for quantifying in vivo effect on downstream transcription; for identifying the TFs bound; and even for the more complex task of understanding the role of each site in determining the relative abundance of GABA A receptor isoforms. Research along these lines may dramatically improve our understanding of GABA A receptor regulation and its role in disease and development. Additionally, a more comprehensive evaluation of the remaining GABA A receptor subunit genes may reveal additional TF-binding sites that uncover the evolutionary significance of γ-α-β GABR clusters in the human genome.
Supplementary Data are available at NAR Online.
Charles DeLisi is partially supported by NIH grants A08 POGM66401A and J50 01-130021. Daniel S. Roberts is supported by NIH training grant 2T32 GM00854. Shelley J Russek is supported by NIH/NINDS Grant NS050393. Funding to pay the Open Access publication charges for this article was provided by the Boston University Bioinformatics Program.
Conflict of interest statement. None declared.