Identification and annotation of all the functional elements in the genome, including genes and the regulatory sequences, is a fundamental challenge in genomics and computational biology. Since regulatory elements are frequently short and variable, their identification and discovery using computational algorithms is difficult. However, significant advances have been made in the computational methods for modeling and detection of DNA regulatory elements. The availability of complete genome sequence from multiple organisms, as well as mRNA profiling and high-throughput experimental methods for mapping protein-binding sites in DNA, have contributed to the development of methods that utilize these auxiliary data to inform the detection of transcriptional regulatory elements. Progress is also being made in the identification of cis -regulatory modules and higher order structures of the regulatory sequences, which is essential to the understanding of transcription regulation in the metazoan genomes. This article reviews the computational approaches for modeling and identification of genomic regulatory elements, with an emphasis on the recent developments, and current challenges.
The identification of genomic regulatory elements is an important but unsolved problem in genome annotation ( 1 ). Of the ∼5% of mammalian genome that is estimated to be under evolutionary selection pressure, less than a third is coding ( 2 , 3 ). The remaining portion is believed to be composed of untranslated regions, non-coding genes, chromosomal structural elements and regulatory elements that control a variety of biological processes including gene expression, translation, chromosomal replication and condensation. However, little is currently known about the vast array of these regulatory elements. Whereas the number of coding genes in many of the sequenced organisms can now be reasonably estimated, there is no clear estimate of the number of functional regulatory elements in these genomes, especially in the metazoa. In eukaryotes ranging from the nematodes to flies to the mammals, the number of coding genes is similar ( 2 , 4 – 6 ), and it is now thought that organismal complexity may be attributed to phenomena such as alternative splicing, DNA rearrangement and increased number of transcriptional regulatory elements as well as transcription factors (TFs) which regulate gene expression ( 7 ). The identification of cis- regulatory elements controlling gene expression, and characterization of their interaction with the respective TFs, thus lie not only at the very heart of understanding of the network of gene interactions but also of explaining the origins of organismal complexity and development.
Genomic regulatory elements are frequently represented by DNA motifs. As such these representations are general and can be used to describe any class of short DNA sequence elements. However the theories for weight matrix model, which is a common way to represent a collection of DNA elements, are based on the biophysical considerations of protein–DNA interactions ( 8 – 11 ) (as described in the following section). Therefore, here we primarily discuss the transcriptional regulatory elements, more specifically the DNA sites that are bound by the TFs.
It is estimated that there are ∼2000 TFs in the mammalian genomes ( 2 , 5 ) and ∼1000 in the flies and worms ( 7 ). However, only for a minority of the TFs (∼900 in human, 700 in mouse, 200 in Drosophila and 100 in Caenorhabditis elegans ) is there currently any known information on binding sites or interacting protein partners ( 12 ). DNA-binding site models are available for ∼500 vertebrate TFs, and <5000 genomic sites are known in all vertebrates in fewer than 3000 genes ( 12 ). Based on the information available from a few of the well-studied genes ( 13 ), it appears that the total number of such sites in the multicellular genomes could be at least an order of magnitude higher than the number of coding genes, i.e. in the order of hundreds of thousands or more. Thus, our knowledge of the TFs, and especially their binding sites and regulated genes, is severely limited at this present time.
Computational methods for modeling and identification of DNA regulatory elements have been developed over the past two and a half decades ( 14 – 20 ). Orthogonal information from comparative genomics or co-regulation at the transcriptional level have also been integrated into these methods to identify cis- regulatory sites ( 21 – 25 ). More recently, methods have been developed ( 26 – 32 ) to analyze composite regulatory elements, i.e. modules consisting of multiple DNA sites bound by the regulatory factors ( 33 ). All of these methods have been valuable in expanding our limited knowledge of regulatory elements in the genome.
Here we discuss the progress made in the computational identification of genomic regulatory elements, recent advances, the utility of orthogonal data, existing challenges in the field and some selected examples where these methods have been successfully applied to discover novel functional elements. Rather than exhaustively covering the literature, we focus on the key concepts. Significant progress has also been made in the experimental characterization of regulatory sequences, a discussion of which is beyond the scope of this article, and the reader is referred elsewhere for further information ( 34 – 43 ).
REPRESENTATION AND SEARCH OF DNA REGULATORY ELEMENTS
Recurrent motifs in a collection of DNA sites are most commonly modeled by sequence patterns (also called regular expressions), or by position weight matrices (PWMs, also called profiles and position-specific scoring matrices, PSSMs). The sequence patterns are simply strings over the 4-letter alphabets [A,C,G,T] that form the DNA. In order to capture variation in a specific position, the degenerate IUPAC nucleic acid codes ( 44 ) ( http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html , nomenclature for incompletely specified bases in nucleic acid sequences) are used ( Figure 1 ). An L -long sequence motif can also be represented by a 4 × L matrix with weights giving the frequency of the four DNA bases (or the logarithm, see below) in each of the L positions ( 18 , 45 , 46 ) ( Figure 2 ). It is worth noting here that some DNA-binding site motifs are bipartite, i.e. have two halves that are sometimes separated by a spacer element in between. Such bipartite motifs can often be palindromic, e.g. 5′-CGGnnnnnnnnnnnCCG-3′, the binding site for yeast TF, Gal4 ( 47 ).
The pattern and PWM representations serve as complementary approaches and have been used widely since the 1980s. The DNA sequence patterns are simpler in representation, and advantageous in terms of exact enumeration of their significance in the genome using statistical methods (explained in detail later). The PWMs are able to capture information on the variability of a collection of DNA sites in a quantitative manner, which is not possible with the DNA patterns. However when detecting significant PWMs in DNA sequences, heuristic methods have to be used instead of exhaustive enumeration. Perhaps the most important practical utility of these models is their application in scanning DNA sequences for new regulatory element candidates. For this, one needs the appropriate motif models representing the regulatory elements, a statistical framework to score sites and determine their significance, and suitable thresholds to minimize false positives and false negatives. These issues are discussed below in more detail.
Currently there are two comprehensive and curated databases containing information on TFs binding site profiles ( 12 , 48 ). JASPAR ( 48 ) contains a smaller set that is non-redundant (i.e. each TF has only one profile), while TRANSFAC ( 12 ) contains multiple profile models for some TFs. In addition to the above generic databases, other organism-specific databases exist that host transcriptional regulation data ( 47 , 49 – 51 ) ( http://arep.med.harvard.edu/dpinteract/ ). Profiles are biased by the observed sites on which they are built. Therefore if a large enough sampling of sites is not available, they may not capture all possible variations of the functional sites. This tends to produce false negative calls on new sites that do not match well with the previously characterized ones. The binding sites for structurally related TFs are often similar, and in such cases building familial binding profiles instead of profiles for each TF may be useful ( 52 ).
In both patterns as well as PWM representations, the significance of a particular motif is given by a measure of statistical surprise (or likelihood) for the motif given the data. In the case of patterns, significance can be calculated given the distribution of all occurrences of patterns using standard statistical procedures ( 20 , 53 ). This has been used in several software packages that discover over-represented patterns from input sequences ( 54 – 58 ). In the methods using weight matrix models, the measure of significance is commonly given by the information content (IC, also called relative entropy) ( 8 , 9 , 11 ):
Given a DNA motif, searching sequences for candidate sites is straightforward; but distinguishing real sites from artifactual ones is difficult. Consequently the signal-to-noise ratio in such searches is often small. The problem of distinguishing true versus false sites arises from the fact that TF DNA-binding sites are usually degenerate and many subsequences may match a given motif. The situation is illustrated with the collection of eight known sites for a Saccharomyces cerevisiae TF, Rox1 [taken from the SCPD database ( 47 )] ( Figure 2 ). Even though there is a preferred base at 11 of the 12 positions, at 8 of those more than one base is tolerated (in a larger collection of sites, even these 4 conserved positions may show variations). Only two of the known binding sites correspond to the consensus pattern, (T/C)CCA T T GTT CTC (conserved positions are shown in boldface), so a search of the genome with this consensus pattern would miss many functional sites. On the other hand, the IUPAC representation of the sites, YYHMTKGTTBDB, captures the variation of all the sites but is also likely to find a large number of false positives and non-functional sites in a genome-wide search.
Given a sequence and a PWM, one can score any subsequence in that input. Suppose the length of the PWM is L and the weight of base i at position j is w i,j , the score of a subsequence ( s ), when aligned to the PWM, is then given by the following equation:
In addition to the motif degeneracy, there are other biological issues relating to the predictions of functional TF-binding sites in the genome, which are discussed below. Although there are limitations, the PWM models work fairly well in representing the specificity of DNA-binding sites and predicting TF-binding probability to a given site ( 11 , 18 ). Some TFs are by nature moderately or poorly specific in their DNA binding and achieve higher specificity only in the context of other binding partners. One also has to remember that the chromatin structure and DNA methylation play important roles in gene regulation ( 68 – 70 ). Large portions of the chromosomal DNA are sequestered by histones forming part of the nucleuosomal structure, and are therefore not accessible for binding by the TFs. DNA methylation can inhibit interaction of the regulatory proteins with cognate DNA sites and also influence the chromatin structure. While doing genome-wide searches for putative binding sites using a motif model, one typically does not know the chromosomal regions that are open for the regulatory proteins to bind, or (with a few exceptions) the binding partners for a given TF. A blind search of the entire genome without such information usually returns a large number of sites, many of which would probably bind to the TF (strongly or weakly, depending on the match of the sites to the model and the specificity of the model) if the DNA sequences were open for binding, but are biologically non-functional in vivo . Genomic sequences that play an active role in transcriptional regulation, such as the promoters, may often be outside of the nucleosome structure, or at least available a part of the time due to chromatin remodeling ( 71 ). So the rate of non-functional site predictions in these sequences is likely to be lower than other parts of the genome. However, without the information on DNA availability, methylation status or other binding partners, the binding site predictions with individual models are unlikely to achieve the same level of specificity that are achieved in vivo by the TFs.
Despite the fact that for individual PWM models determining thresholds for distinguishing real versus non-functional sites is difficult, computational approaches exist that address this issue. One approach is based on the IC of the PWMs. A sample-size adjusted IC may be defined as the true IC ( Equation 1 ) minus the IC expected from an arbitrary alignment of an equal number of random sites. The PATSER program ( 59 ) ( ftp://ftp.genetics.wustl.edu/pub/stormo/Consensus/ ) uses a default cutoff score for which the log e (probability) (i.e. the probability of observing a score greater or equal to the cutoff) equals the sample-size adjusted IC. Another approach is based on prediction rates on sequences that contain known sites and on sequences that are likely not to contain sites ( 66 , 72 ). False negative rates are computed from predictions made on the experimentally characterized TF-binding sites, and false positive rates may be computed, for example, from predictions in exons, where the number of regulatory elements is expected to be low ( 66 ). Given these prediction rates, the selection of the appropriate cut-off score depends largely on the user's objectives. A third approach has been to use a distribution of scores for a PWM and use one or two standard deviations below the mean score as a threshold for filtering out low-scoring sites ( 73 ), but retaining most of the true positive ones. Recently, Djordjevic et al . ( 10 ) have described a support vector machine (SVM) method for representation of DNA motifs based on thermodynamic principles of protein–DNA binding. The biophysical treatment and optimization of the SVM automatically provides a threshold to distinguish the binding energies of the real sites versus those that are likely to be false. This natural and objective thresholding is one of the strengths of their approach and it provides a way to avoid the use of the more subjective or user-defined thresholds that are frequently employed.
Since many of the TFs bind DNA in the context of other TFs, where information is available about the organization of multiple DNA-binding sites or the binding partners for a given TF, it can be utilized to reduce the rate of false positive predictions. Even when individual PWMs tend to be fairly non-specific, searching for co-occurrence of binding sites that form regulatory modules has been shown to be an effective approach to increasing prediction specificity without losing sensitivity ( 74 ).
Finally, we discuss one limitation of the additive PWM models in representing DNA sites, and recent developments which address this issue. As seen from Equations 1 and 2 , the simple PWM approach assumes independent contributions from each position within a DNA site towards the binding free energy of the TF (mononucleotide model). This has been shown not to hold true in several situations ( 37 , 75 , 76 ) where sufficient binding site data are available for detailed analysis. In such cases, higher-order models, i.e. those that consider interdependence between positions within a site, are better representations and give more accurate predictions when searching for candidate sites ( 10 , 76 – 79 ). Most of the higher-order models ( 10 , 75 – 81 ) use di-nucleotide interactions, which is a reasonable compromise between the mononucleotide PWMs and the more complete models with interactions between multiple (more than two) positions. This is because (i) the large number of characterized binding sites that is needed to build complex models (with increased number of parameters) is rarely available for any particular TF ( 12 ) and (ii) in a few cases where sufficient experimental data are available ( 37 , 75 ), interdependencies between adjacent nucleotide positions appear to be the most significant of the within-site interactions.
From the DNA-binding site data available for a small number of TFs, it has been estimated that ∼25% of sites may show significant within-site positional correlations ( 76 ). Even in cases where intra-site interactions exist, the simpler additive model has been suggested to be a good approximation ( 80 ). As more experimental data become available with the application of high-throughput methods for characterizing TF-binding sites, it remains to be seen if significant interdependence between positions of DNA-binding sites is prevalent in TF–DNA interactions, and whether the more complex sequence models provide a considerable enhancement in modeling accuracy over the additive PWMs in many cases.
Since the structural contexts of both DNA and protein can contribute towards specific DNA binding by TFs ( 82 ), some studies have investigated the use of structural information of protein–DNA recognition in building predictive models for DNA-binding sites ( 82 – 88 ). The structure-based approaches are promising as they can predict binding sites for TFs where no previous sites have been characterized before ( 85 ), and can provide improved predictive power over the simple sequence profile models ( 83 ). However, they have been limited in their general use so far, since derivation of the quantitative structural parameters is dependent on the small number of solved protein–DNA complex structures. A more thorough discussion of the structural aspects of protein–DNA recognition is beyond the scope of this review and the readers are referred to the articles above (and references therein) for further information on this topic.
With larger datasets, the issues with modeling complex DNA–protein interactions can be investigated more thoroughly than what has been done so far, and the building of models (with or without structural parameters) with higher accuracy is likely to become more feasible.
DISCOVERY OF DNA MOTIFS
Discovery of motifs in sequence data was an early problem to be addressed in computational biology. The DNA motif discovery algorithms that have been developed can be divided into two categories, namely pattern driven methods (those that identify DNA patterns) and sequence driven or alignment driven methods (those that identify profile models) ( 16 , 20 ). The concepts behind these algorithms, their advantages and limitations, and a few example methods are discussed below.
In the pattern driven methods, given a set of DNA sequences and a length L of pattern that we wish to find, the challenge is to identify the most significant patterns of that length. The solution to the problem can be obtained by generating all possible patterns of length L , searching for the number of occurrences of each pattern, and then reporting the ones with highest frequency as being the most significant in the given data ( 52 – 58 , 89 – 91 ). This enumerative approach is exact and guaranteed to find optimal solutions in the restricted search space. Approximate sequence patterns, or patterns that contain degeneracy at one or more positions, can also be identified from the sequence. The similarity between any two patterns may be given by the Hamming distance (the number of positions in which they differ) or the Levinstein distance (the number of substitutions, insertions or deletions needed to transform one string into another) ( 20 ). When multiple patterns are close in terms of their distance, they can be merged into one approximate pattern ( 89 , 92 ). Although the exact enumeration is an advantage of these methods, one limitation is that searching for long patterns is computationally expensive, and an exhaustive search through the sequence space of 4 L words often becomes impractical for L > 10 ( 93 ). Two general strategies have been taken to address this limitation: (i) the use of efficient methods [e.g. pattern graphs ( 93 ) or projections ( 94 )] for pre-processing the data so that the search space is reduced and actual pattern search becomes less expensive, and (ii) combining the shorter overlapping patterns found from the data to yield longer or more complex patterns ( 89 , 92 , 95 , 96 ). In addition to the exact enumerative methods, efficient data structures like the suffix trees ( 97 ) have also been applied to the DNA pattern discovery problem ( 98 – 100 ). Although not exact algorithms, the advantage of the suffix trees is that they allow one to search for patterns of longer lengths since the search time is not exponential in the length of the patterns, but exponential in the number of mutations to be tolerated in the sites ( 98 , 100 ).
In the sequence driven methods the challenge is to find the location of the sites and the representative PWM using only the sequence data, without any assumptions on the statistical distributions of patterns in the sequences. If the locations of sites are known, building a DNA profile for them is trivial. However in the ab initio motif discovery problems, this information is not known (‘missing information’) and has to be learned from the input data. Such problems with missing information can be solved by employing machine learning algorithms. Several machine-learning approaches have been applied to the problem of motif finding ( 59 , 62 – 64 , 76 , 101 – 107 ). Unlike some pattern driven methods where the most significant motifs can be identified by exact enumeration, obtaining the globally optimal results cannot be guaranteed in these methods. But motifs of arbitrary lengths can be searched, since the search time does not depend significantly on the length of sites.
The first amongst the sequence driven methods was the greedy algorithm ( 59 , 61 , 62 ). Given a set of n sequences, and a motif length L to be searched, this algorithm progressively builds matrices by including the sites which maximize the IC ( Equation 1 ). The algorithm first builds a set of significant matrices by comparing all pairs of sequences. In subsequent iterations, sites that increase the IC of the alignment are added to the matrix. Another method that has been commonly used is the expectation–maximization (EM) algorithm ( 63 , 102 , 103 ). The EM algorithm simplifies the analysis of problems with missing information by iteratively substituting the locations of sites by expected locations. The algorithm starts with a guess PWM, which can be random or based on some prior knowledge about the binding sites [see e.g. ( 52 , 94 )]. Using the PWM, the probability of each subsequence being a binding site is estimated, and the PWM is updated based on those probabilities. This cyclic process is iterated until a convergence criterion is reached. A stochastic variant of the EM algorithm that is now very widely used in sequence motif recognition is the Gibbs sampling method ( 64 , 106 ). Gibbs sampling, which is a type of Markov chain Monte Carlo (MCMC) algorithm, tends to provide a more robust optimization of the PWMs as the probabilistic sampling of sites helps the avoidance of local minima. Variations of the Gibbs sampling technique have been implemented in many motif finding software that are used in the bioinformatics community ( 76 , 104 , 105 , 108 , 109 ). For stochastic algorithms like the Gibbs sampling, multiple searches have to be performed with the input dataset in order to confirm that the same motifs are discovered starting from different random points in the sequence space.
Controlling for background sequence is an important issue in DNA motif discovery that has been effectively addressed in recent years. Variation in local base composition can adversely affect sequence alignment and discovery of relevant motifs. Because such variations can be complex, and since binding motifs are often AT- or GC-rich, these adverse effects can be difficult to control using existing masking algorithms. In addition, some low complexity sequence motifs [like ploy(A) or poly(GC)] are widespread in genomes of certain organisms. As a consequence, often the strongest motifs that are discovered from any input data are these common motifs that are prevalent in the genome but do not represent the specific regulatory elements that are being sought. It is therefore important to detect those motifs that are significantly more frequent in the positive sequence set (the input set in which we want to discover motifs) relative to the background ( 91 , 92 ). The available programs which identify such motifs (often called discriminative motifs) do so by modeling the background with a Markov chain ( 105 , 109 , 110 ), or by accounting for the motifs that are frequent in a given background set ( 26 , 92 , 108 ) through sampling or enumeration.
Recently, Tompa et al . ( 111 ) reported the most comprehensive comparative study yet performed for different ab initio motif-finding algorithms. Assessment was done for 13 commonly used DNA motif discovery tools that do not use any auxiliary information, such as comparative sequence analysis, mRNA expression levels or chromatin immunoprecipitation (ChIP–chip) data. Predictions were compared with known binding sites, using various statistics to assess their accuracy. One important, but not unexpected, observation from this work was that a few different tools tend to complement each other's performance. For example, MotifSampler's ( 109 ) predictions complement well the predictions of MEME ( 101 , 102 ), oligo/dyad-analysis ( 56 , 90 ), ANN-Spec ( 108 ) and YMF ( 110 ). The authors rightly suggest that biologists would be well advised to use a few complementary tools in combination rather than relying on a single one, and to pursue the top few predicted motifs of each rather than the single most significant motif ( 111 ).
There are currently some examples where ab initio motif finding algorithms have been employed to identify novel regulatory elements that have been validated by follow-up experimental studies. A few of these are described below with the objective of illustrating the types of input data that can be used and the approaches that were taken in the studies. Using the program AlignACE on the upstream promoter region of 248 distinct groups of genes in yeast S.cerevisiae , Hughes et al . ( 104 ), generated a list of 3311 putative regulatory motifs. By applying stringent thresholds and selecting motifs which had highest specificity for certain groups of genes, this large set was reduced to a small set of 54 motifs that were then grouped to 25 motif clusters based on the similarity of multiple motifs. Of the 25 motif clusters, 16 were previously known motifs representing the binding sites for the yeast TFs. One of the previously unidentified motifs, representing the binding sites for the TF Rpn4, was verified using mRNA expression analysis of the Rpn4 protein knockout and overexpressing strains. GuhaThakurta et al . ( 112 ) have described the identification of one novel regulatory element in the nematode C.elegans heat-shock response starting from a set of microarray experiments in which genes were robustly upregulated on heat-shock treatments at both early and late time points. Using a set of 28 heat-shock upregulated genes they used the ab initio motif discovery algorithms, Ann-Spec ( 108 ) and Consensus ( 59 , 108 ), to identify two strong motifs that are over-represented in the promoters of these genes. One of those motifs was the previously characterized heat-shock element that is broadly conserved in eukaryotes and known to bind to the TF called heat-shock factor. The second motif, which was novel, was shown to be biologically functional in vivo through transgenic and mutational studies. In fact, the two elements were shown to function in a cooperative manner; the contribution to heat-shock mediated expression by either one was weak, but when placed together in close proximity they strongly regulated expression. Studies similar to the one described above was done to identify several new muscle regulatory elements in C.elegans, which were shown to contribute to muscle expression in a cooperative fashion ( 113 ). The identified elements were also highly predictive of additional muscle-specific genes in that organism ( 114 ). These studies show that ab initio prediction methods can be valuable in elucidating unknown regulatory elements not only in unicellular organisms, but also the more complex metazoan genomes.
Although much progress has been made in the methods for ab initio detection of DNA regulatory elements, the problem still remains a difficult one, especially where the input sequences are long and motifs are weak. Therefore, the incorporation of auxiliary information into such methods can be of significant benefit. Since the availability of many complete genomes, the application of comparative genomics in the identification of DNA regulatory elements has become an important area of study and it is covered in detail in the next section. Here, as an example, we discuss a different approach which leverages mRNA expression data in DNA motif discovery. Bussemaker et al . ( 24 ) described recently a method to discover regulatory elements that uses correlation of DNA patterns with the expression levels of genes in a single profiling experiment. They use a simple linear regression model to fit the logarithm of the expression of each gene to the sum of contributions from a set of DNA patterns in the upstream promoter region. All the genes are simultaneously fit, and statistically significant patterns that best fit the expression data are selected. Using yeast expression datasets, they reconfirm most of the motifs originally found by clustering of expression data and then running motif finding algorithms on clustered gene sets ( 115 , 116 ). Since there are frequently cooperative (non-additive) interactions between TFs regulating a gene, the modeling is simplistic. However, the advantage of this method is the fact that limited expression data are required for analysis and discovery of the DNA motifs. A similar approach has also been used recently by Conlon et al . ( 117 ). In addition to the above methods which integrate motif discovery with expression data, several tools and studies have been described that identify DNA-binding site motifs for TFs from a set of sequences defined by mRNA profiling ( 112 , 118 , 119 ) or ChIP–chip data ( 120 , 121 ).
COMPARATIVE GENOMICS IN THE SEARCH FOR REGULATORY ELEMENTS
In multicellular organisms the sequence space in which regulatory elements can be present in the genome is vast. In addition to the core promoter region, auxiliary transcription regulatory elements like enhancers, silencers and insulators can be present in the distant 5′ upstream region, 3′ downstream region and the introns ( 122 ). In Drosophila such DNA elements can be spread over a region of 10 kb around the genes whereas the average transcribed DNA is 2–3 kb, and in mammals these elements can be scattered over distances of hundreds of kb ( 123 ). Approaches that help to limit this search space are hence of significant value to the analysis of regulatory elements. Also, it is intriguing to study those regulatory mechanisms and elements that are likely to be of fundamental importance in maintaining certain cellular functions and therefore conserved in evolution. Consequently, phylogenetic footprinting ( 124 , 125 ), which is based on the premise that functional elements are likely to be under selection pressure and thereby evolve at a rate that is slower than the surrounding non-functional sequence, has been widely applied in recent years to the identification of regulatory sequences ( 21 , 123 , 126 – 129 ). Purely for the purpose of illustration, a simple example of the application of phylogenetic footprinting for the identification of putatively conserved TF-binding sites is given in Figure 3 .
Enrichment of regulatory elements has been clearly demonstrated in non-coding DNA in the human–mouse conserved regions ( 21 , 127 , 130 ). For example, Wasserman et al . ( 21 ) found that 98% of the known muscle regulatory elements are located within 19% of the sequence that is most conserved between human and mouse. Non-coding sequences that can be aligned across two or more organisms have been shown to have functional regulatory roles ( 123 , 129 ). Analysis of motif frequencies and correspondence in conserved regions across multiple species has been used to identify known and novel regulatory elements in organisms ranging from yeast ( 25 , 131 – 133 ) to the mammals ( 134 , 135 ). Based on human–rodent sequence alignments, and known sets of regulatory sequences and ancient repeats in those genomes, methods have been developed to distinguish conserved regulatory regions from neutral sequences ( 136 , 137 ). In addition to its application in eukaryotic genomes, comparative genomic approaches have helped in elucidation of regulatory elements in prokaryotes and archaea ( 22 , 138 – 141 ). Therefore, significant developments have been made over the past several years in utilizing the sequences from multiple species to identify functional regulatory elements in all phyla.
There are now numerous methods that are available for alignment of genomic sequences from two or more species ( 142 – 151 ). Some of these have been integrated with motif finding and visualization programs to provide practical tools for analysis of regulatory elements within cross-species conserved regions ( 152 – 154 ). Multiple sequence alignment methods can be advantageous in comparative genomics since they utilize more information relative to pairwise sequence alignments; they can also be more powered to identify regulatory motifs ( 135 , 140 , 155 ). Prakash and Tompa ( 135 ) recently compared the performance of six global and local multiple alignment tools with respect to their potential of identifying highly conserved short (10mers) DNA patterns from the immediate upstream promoter sequences of orthologous genes from multiple vertebrate species (human, chimp, mouse, rat, chicken). Two of the methods tested, namely MLAGAN ( 147 ) and TBA ( 156 ), appeared to perform better than several others for this purpose ( 135 ). More advanced methods have now been developed that directly take into consideration the phylogenetic distances between the organisms that are being aligned in order to identify conserved DNA motifs ( 107 , 126 , 157 – 159 ).
In addition to phylogenetic footprinting, comparative genomics is now being utilized in other sophisticated and intriguing ways to identify regulatory elements of potentially conserved function. Some methods have not only utilized sequence conservation but also gene network conservation (based on the hypothesis that multiple sets orthogonal genes may be regulated by common TFs) to identify sets of regulatory motifs ( 133 , 160 ). Although developed in yeast, one of these studies show that such methods can be sufficiently powered to detect regulatory elements in much longer sequences in the multicellular organisms, including mammalian genomes ( 133 ).
There are now many examples where comparative genome sequence analysis has been used successfully to elucidate novel regulatory elements which would have been difficult to identify without that information; three are illustrated below. McCue et al . ( 140 ) used an extended Gibbs sampling algorithm to identify probable transcription regulatory sites upstream of Escherichia coli genes by cross-species comparison. A set of 184 genes with orthologs from two or more other gamma proteobacterial organisms were analyzed. Of their predictions 81% corresponded with the documented sites known to regulate these genes, whereas 67% corresponded when data from only one other species were available, suggesting that addition of one more species aids in sensitivity of the binding site detection. One of the novel predictions, a DNA site bound by the TF YijC, was verified by experiments. In an elegant study, Loots et al . ( 123 ) demonstrated the utility of phylogenetic footprinting by identifying a regulatory region conserved between human and mouse for IL-5 that is located ∼120 kb away from the gene itself. In another study comparing genomic sequences from multiple primate species, Boffelli et al . ( 129 ), identified regulatory elements for APOA, a recently evolved primate gene. A region of high conservation adjacent to the transcription start site was shown to interact with one or more DNA-binding proteins using electrophoretic mobility shift assays with nuclear extracts from liver cells. Identification of such primate-specific functional elements would be unattainable through the comparison of species that are evolutionarily more distant.
As evident from the discussions above, approaches that utilize conservation across species are very useful in elucidating functional DNA regulatory elements. Significant advances have been made in this area over the past 6 years. Despite its utility and widespread use, there are limitations in phylogenetic footprinting approaches with respect to their use in the identification of regulatory elements. The short regulatory elements may be missed if the genomic sequences that are being aligned come from distant organisms ( 112 , 161 , 162 ). If the organisms are too close, however, the alignments are extensive and therefore unable to distinguish the conserved functional elements from the non-functional ones ( 161 , 162 ). It is unclear at this time whether cross-species alignments would be equally useful in finding regulatory elements in all phylogenetic clades ( 163 , 164 ). In a recent comparison of two Drosophila species, the known regulatory elements were found to be only modestly enriched in the conserved regions ( 164 ), although the amount of conservation in the non-coding regions of these Drosophila species was roughly the same as in human–mouse. In another study, individual binding sites appear to be conserved across two nematode species, but they were not located in sequences that were aligned by the software for phylogenetic footprinting ( 112 ). Therefore, whether cross-species sequence alignment is likely to be an effective approach in all organisms in elucidating most of their functional regulatory elements, as well as the issues such as optimal phylogenetic distances, and the number of organisms needed to detect these elements, are still matters of debate and investigation ( 163 ).
COMPOSITE MOTIFS AND CIS -REGULATORY MODULES
In eukaryotes, TFs rarely act alone in regulating the expression of a given gene. In most cases multiple factors bind DNA, often in close proximity with each other, forming regulatory modules ( 13 , 33 , 165 , 166 ). By utilizing combinatorial interactions between multiple factors these cis -regulatory modules (CRMs) confer specific spatial and temporal patterns of transcription. Therefore identification of composite modules and higher order regulatory structures is currently an active area of research in computational analysis of regulatory sequences. The problem is difficult however since the combinatorial interactions between the regulating factors can be very complex (e.g. see the array of regulatory interactions in the immediate upstream region of Endo 16 in sea urchin ( 13 )]. There have been some developments in the past 6 years in addressing this problem in DNA sequence analysis. The current approaches can be classified as follows:
Softwares that identify CRMs given a set of known DNA motifs fall into two categories, those that use hidden Markov models to represent the CRMs ( 27 , 169 ), and those that rely on observation of frequent joint occurrence of sites within a certain window, modeling the frequency of multiple sites with an appropriate statistical (e.g. Poisson) distribution ( 30 , 31 , 74 , 167 , 169 , 170 , 175 , 176 ). Because the DNA-binding site models for the majority of TFs are currently unknown and the information on TFs that bind DNA together in CRMs is very limited ( 166 ), several ab initio methods have been developed to identify composite DNA elements given just a set of input sequences ( 26 , 28 , 98 , 105 , 172 – 174 ). A few of these methods use efficient suffix-tree structures to identify multiple or dyad patterns ( 28 , 98 ), and others employ Gibbs sampling or Monte Carlo strategy to identify multiple PWMs that may represent binding sites in regulatory modules ( 26 , 105 , 172 – 174 ). By combining information from multiple sites, these methods have the potential to identify motifs that are too weak (i.e. information poor) to be identified individually ( 26 ).
Two examples of the application of computational methods for identification of novel CRMs are discussed. The first describes the utility of an ab initio motif finder in identifying a novel composite motif regulating cell-cycle genes in yeast, and the second describes how the information on TFs that are known to bind DNA jointly can be leveraged to make genome-wide predictions of regulatory modules involving sites for those factors. With the application of an ab initio composite motif discovery program, CoBind ( 26 ), Pramila et al . ( 177 ) identified a CRM involving the binding sites for Yox1 and Mcm1 from the promoters of a set of 28 genes upregulated in the yeast Yox1 knockout strain. Yox1, which represses the expression of genes in the M/G 1 interval of yeast cell-cycle, binds DNA in conjunction with the generic MADS family TF, Mcm1. The binding sites for both TFs were jointly identified using the software and subsequently validated through mutational and gel mobility shift studies. The second example is of the transcriptional program in Drosophila embryo. By using known DNA binding specificity data for five TFs, Berman et al . ( 170 ) identified genomic regions containing unusually high concentrations of predicted binding sites for these factors. A significant fraction of these binding site clusters overlap known CRMs. In addition, many of the remaining clusters were adjacent to genes that were expressed in a pattern characteristic of those regulated by these factors. The authors tested one of the newly identified clusters, mapping upstream of the gap gene giant ( gt ), and showed that it acted as an enhancer that recapitulates the posterior expression pattern of gt.
Although the above approaches show potential, the developments in computational identification of CRMs are recent and there is significant room for further investigations and improvement, since the arrangements of sites in the modules are complex. The number of datasets containing collections of known composite regulatory elements that exist today is very limited [a few examples are ( 74 , 164 , 167 )], which poses an obstacle in the training and testing of general computational methods in this realm.
Understanding the mechanisms of transcriptional regulation has been an object of extended and difficult quest in biological disciplines. Clearly, significant advances have been made over the past two and half decades not only in the representation and modeling of the DNA regulatory elements, but also methods for their identification in genomic DNA. However, our knowledge of the transcriptional regulatory elements in the genome and their contribution to gene expression in different spatial and temporal contexts is still limited. Given the complex pattern of regulatory interactions, the challenges involved in the complete elucidation of these elements in the genome are substantial.
One of the major challenges is to associate the computationally identified regulatory elements with their cognate TFs. Genome-wide analyses often identify a host of putative regulatory elements ( 25 , 132 – 135 ). In order to get an understanding of the regulatory processes it is essential to associate the regulatory proteins with these elements. There have been some investigations into this problem ( 119 , 178 , 179 ) in prokaryotes and yeast, but further developments are required.
An important utility of characterizing the cis -regulatory elements is their use in the computational reconstruction of gene regulatory networks. Several studies have applied the information on cis -regulatory elements, either in isolation or in combination with other orthogonal sources of information (e.g. microarray expression data), to construct regulatory networks ( 180 – 183 ). These studies demonstrate how the information on transcriptional regulatory elements can be integrated to create gene networks. However, there are significant opportunities for investigations in this area.
The ab initio motif discovery tools and comparative genomics approaches have made it possible to detect regulatory elements in many genomes. In addition, information that can guide the search for regulatory elements to the most relevant regions of the genome is becoming available, e.g. the accurate location of transcriptional start sites ( 184 ), DNA-ase hypersensitive sequences within nuclear chromatin that represent regulatory regions (including promoters, enhancers, silencers, locus-control regions) ( 43 ), and TF binding locations from the ChIP–chip experiments ( 38 , 40 , 41 ). Individually, there are methodological or practical limitations in the computational and experimental approaches in elucidating the location and function of the full set of regulatory elements. For example, the computational DNA motif finding algorithms have limitations in terms of the sensitivity and specificity of signals they can detect, whereas the high-throughput TF-binding site location technologies (e.g. ChIP–chip) are currently limited in the extent of intergenic sequences they can explore ( 185 , 186 ). Therefore it appears that the most efficient path to elucidating the novel regulatory elements and mechanisms will lie in the judicious integration of the various methods and data ( 182 ).
Despite inherent challenges in the field, rapid progress has been made over the past few years in the computational identification of regulatory elements. Successes of the computational methods have been demonstrated through experimental validations, and efficient methods for comparative genomics and analysis of CRMs are being fruitfully utilized to elucidate complex regulatory elements in organisms ranging from the unicellular bacteria to the mammals.
We are indebted to Gary Stormo for many discussions over the years. We wish to thank Yanqing Chen, Amit Kulkarni, Haoyuan Zhu, and especially David Haynor, Tao Xie and Jeffrey Sachs for comments on the manuscript. One anonymous reviewer is thanked for helpful comments and suggestions. Funding to pay the Open Access publication charges for this article was provided by Merck & Co., Inc.
Conflict of interest statement. None declared.