Genome Surveyor 2.0 is a web-based tool for discovery and analysis of cis-regulatory elements in Drosophila, built on top of the GBrowse genome browser for convenient visualization. Genome Surveyor was developed as a tool for predicting transcription factor (TF) binding targets and cis-regulatory modules (CRMs/enhancers), based on motifs representing experimentally determined DNA binding specificities. Since its first publication, we have added substantial new functionality (e.g. phylogenetic averaging of motif scores from multiple species, and a novel CRM discovery technique), increased the number of supported motifs about 4-fold (from ∼100 to ∼400), added provisions for evolutionary comparison across many more Drosophila species (from 2 to 12), and improved the user interface. The server is free and open to all users, and there is no login requirement. Address: http://veda.cs.uiuc.edu/gs
Cis-regulatory analysis is a key step in understanding and decoding transcriptional regulatory networks. The researcher is interested in determining which transcription factors (TFs) regulate a gene (or genes) of interest, the locations of binding sites for those TFs, and, if the analysis has an evolutionary component, how those binding sites and regulatory influences evolve across species. For Drosophila researchers, these tasks have been greatly facilitated by the availability of 12 Drosophila genomes ( 1 , 2 ) and vast amounts of other genetic and genomic data ( 3–5 ). In addition, a variety of computational tools can complement high-throughput experimental approaches to the above tasks, and help the biologist design and conduct hypothesis-driven experiments efficiently. For instance, available computational methods can summarize known binding specificities of TFs as ‘motifs’ and search the genome (or genomic regions near a specific gene) for matches to these motifs, thus identifying putative TF binding sites. Other more sophisticated methods can produce estimates of TF binding strength in a DNA segment, by integrating all putative binding sites, both weak and strong, present in that segment. Application of these methods to multiple Drosophila genomes, coupled with whole-genome alignments, can help describe the evolution of TF binding events. Cross-species comparison can also improve the accuracy of predicting TF binding targets ( 6–8 ).
Computational methods have also been used to search for clusters of binding sites of multiple TFs, with the goal of identifying cis-regulatory modules (CRMs, also called enhancers). CRMs are ∼500–1000-bp long regulatory elements that harbor multiple binding sites that together mediate a specific expression pattern of a neighboring gene ( 9 ). The identification of CRMs can provide a meaningful context in which the role of individual TF binding sites can be interpreted; it may also help reduce false positives in predicting individual binding sites. More recently, statistical methods have been demonstrated to recover functional CRMs without prior knowledge of relevant TFs and/or their motifs. Such motif-blind approaches adopt the alternative paradigm of ‘supervised CRM discovery’, where a set of known CRMs with similar functionality (expression patterns) is used as ‘training data’ to locate other similar CRMs in the genome ( 10 , 11 ).
Genome Surveyor 2.0 presents an easy-to-use, web-based graphical interface to many of the cis-regulatory analysis tools mentioned above. It allows the user to perform TF target prediction and CRM discovery using any motif(s) from the FlyFactorSurvey database ( 12 ), the most comprehensive resource for Drosophila motifs today. It displays genome browser ‘tracks’ that profile matches to individual motifs or user-selected combinations of motifs, based on sequence information from a single genome or a combination of genomes. It also provides tracks for ‘supervised CRM prediction’ ( 10 ), driven by a user-selected subset of known CRMs from the REDfly database ( 13 ). Additional tracks are available to visualize related information such as chromatin immunoprecipitation (ChIP)-based profiles of TF occupancy, and previously characterized CRMs from the literature. In addition to providing locus-centric visualization of cis-regulatory elements, Genome Surveyor 2.0 provides an interface to search for motif/ChIP-based binding site clusters genome-wide.
Genome Surveyor 2.0 provides users with the following components to perform cis-regulatory analysis in Drosophila melanogaster ( Figure 1 A).
Single/multi-species motif profiles. A motif profile displays the estimated binding site presence for a user-selected TF motif as a function of genomic coordinates. We obtain the single-species profiles by running the program Stubb ( 14 ), and the multi-species profiles by averaging the profiles of orthologous regions from the selected species.
Supervised CRM discovery profiles. This component allows the user to specify a set of known CRMs and search for novel CRMs that have a similar k-mer composition to the specified set. Supervised CRM discovery methods do not require pre-selection of motifs, and provide a viable alternative to predicting functional CRMs, as explained in ( 10 ).
Profiles of other cis-regulatory information. ChIP-based binding profiles (from BDTNP) and experimentally validated CRMs (from REDfly) can be displayed along with other profiles. In addition to Stubb-based motif profiles, the user may visualize binding site predictions by a more traditional method (individual matches above a threshold).
Search for Motif/ChIP clusters of binding sites. This component provides the user with an ability to search the entire genome (or list of loci) for the most significant clusters of motif matches and/or ChIP sites.
The first three components are implemented as plugins for GBrowse ( 15 ), and their outputs are ‘tracks’ that may be added to the current view of GBrowse. Note that all of these tracks/profiles can be displayed simultaneously, as illustrated in Figure 1 B.
Single/multi-species motif profiles
We have pre-computed the motif profiles of a large collection of experimentally validated TFs for D. melanogaster ( 12 ) using the Hidden Markov Model-based program Stubb ( 14 ). (Stubb examines each 500-bp window and computes a score for the presence of one or more strong or weak binding sites in that window, without imposing arbitrary thresholds on what constitutes a motif match.) We have also generated motif profiles for 11 other Drosophila species and mapped them to the D.mel coordinates. All profiles are normalized using their genome-wide mean and standard deviation. Users may select from the following options related to motif profiles:
Individual species, individual motif: This option displays the profiles of the selected motif(s) in the selected species. Given this option, users might easily check, for example, whether a specific potential binding event is conserved between D.mel and D.pse by turning on the tracks of the corresponding motif for both species. Also, they may easily assess the similarity between the targets of two or more TFs. All tracks are directly linked to (and just a click away from) the FlyFactorSurvey database ( 12 ) that provides detailed information about the binding site's specificity and the method used to characterize it.
Individual species, multi-motif: This option averages the profiles of selected motifs for each selected species. This provides a convenient way to look for clusters of binding sites of several TFs, as a means to discover novel CRMs. For example, a user searching for enhancers regulating dorsal/ventral (D/V) patterning may choose to select the motifs involved in this process (e.g. those for the TFs Dl, Twi, Sna ) and examine their average profile. The user may repeat this process for other species as well, to examine if the predicted CRM in D. melanogaster is independently supported by predictions at orthologous locations in those species.
Multi-species, individual motif: This option combines the profiles of a selected motif from different species, using simple averaging or a phylogenetic tree-based averaging ( 7 ). The peaks in this profile represent the TF targets that are conserved across species.
Multi-species, multi-motif: This option averages all the profiles from selected motifs and species to create a single track. The peaks in this profile represent the strong clusters of binding sites that are conserved across species, and may thus correspond to functional CRMs.
User-defined motif: This option allows users to input their own Position Weight Matrices (PWMs), rather than selecting from a pre-defined list of motifs. Although there has been an intense effort to characterize the binding specificities of all TFs in D. melanogaster ( 16 ), there remain many TFs with unknown binding specificity. The user-defined motif option allows motifs that are not part of the publicly available database to be used.
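The normalization and averaging steps behind the options above can be illustrated with a minimal sketch. It assumes each profile is a list of per-window scores already mapped to common D.mel coordinates; the function names `zscore` and `combine_profiles` are hypothetical, and the optional `weights` argument is only a stand-in for a weighting scheme, not the phylogenetic tree-based averaging of ( 7 ).

```python
from statistics import fmean, pstdev


def zscore(profile):
    """Normalize a profile by its own mean and standard deviation."""
    mu = fmean(profile)
    sigma = pstdev(profile)
    if sigma == 0:
        # A flat profile carries no signal; return zeros instead of dividing by 0.
        return [0.0 for _ in profile]
    return [(x - mu) / sigma for x in profile]


def combine_profiles(profiles, weights=None):
    """Average several equal-length profiles after z-score normalization.

    `weights` is a hypothetical hook (e.g. one weight per species);
    with weights=None this is a simple unweighted average.
    """
    if weights is None:
        weights = [1.0] * len(profiles)
    normalized = [zscore(p) for p in profiles]
    total = sum(weights)
    return [
        sum(w * p[i] for w, p in zip(weights, normalized)) / total
        for i in range(len(normalized[0]))
    ]


# Toy per-window scores for two motifs (illustrative numbers only).
dl = [0.1, 0.2, 3.0, 0.2]
twi = [0.0, 0.3, 2.5, 0.1]
combined = combine_profiles([dl, twi])
peak_window = combined.index(max(combined))
```

Here the same helper serves the multi-motif and the multi-species cases: the input list holds either the profiles of several motifs in one species or the profiles of one motif in several species. In the toy data the whole list plays the role of the genome, whereas the server normalizes against genome-wide statistics.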
Supervised CRM discovery profiles
The REDfly database catalogs over 800 experimentally characterized CRMs in D. melanogaster , along with their spatial/temporal expression patterns ( 13 ). This extensive resource can be used as ‘training data’ to computationally predict novel CRMs genome-wide, through ‘supervised CRM discovery’ methods. These methods score a genomic segment for sequence similarity to any given set of known CRMs. The similarity score is based on frequencies of short words in the sequences, and can detect the presence of shared binding sites without relying on prior knowledge of motifs. As such, this is a pragmatic approach to CRM discovery when the likely transcriptional regulators of a gene are not known in advance, or their binding specificities have not been characterized. Genome Surveyor 2.0 allows the user to profile any genomic region with two different scores [HexMCD and IMM (M. Kazemian, Q. Zhu, M. S. Halfon, S. Sinha, manuscript under preparation)] ( 10 ). The training set of CRMs may be selected as one of over 30 different subsets of REDfly CRMs, defined by the tissue/stage of development that they help regulate ( 11 ). The user may also upload a FASTA file of CRM sequences.
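The idea of scoring a segment by its short-word frequencies can be illustrated with a minimal sketch. This is not the HexMCD or IMM score; it swaps in a plain cosine similarity between k-mer count vectors, considers only the forward strand, and uses hypothetical function names and parameter defaults.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k=6):
    """Count all overlapping k-mers on the forward strand."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine(a, b):
    """Cosine similarity between two k-mer count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_windows(sequence, training_crms, k=6, window=500, step=50):
    """Score sliding windows by k-mer similarity to the pooled training CRMs."""
    model = Counter()
    for crm in training_crms:
        model.update(kmer_counts(crm, k))
    scores = []
    for start in range(0, max(len(sequence) - window, 0) + 1, step):
        segment = sequence[start:start + window]
        scores.append((start, cosine(kmer_counts(segment, k), model)))
    return scores


# Toy example: the second window is identical to the training sequence.
training = ["ACGTACGTACGT"]
region = "TTTTTTTTTTTT" + "ACGTACGTACGT"
scores = score_windows(region, training, k=3, window=12, step=12)
```

Windows that share many words with the training set score close to 1, and windows with no shared words score 0. A production method would typically also model a background set and both strands, which this sketch omits.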
Binding sites, ChIP and REDfly profiles
Users may select from the following three tracks for additional information to aid their analysis:
Binding sites above a threshold. This functionality is taken from the MotifFinder plugin ( http://gmod.org/wiki/MotifFinder.pm ). It displays individual binding sites predicted based on how well they match the selected/provided motifs.
ChIP profiles. This track displays ChIP-based measurements of TF occupancy ( 17 ). At this time, these profiles are available for a limited number of TFs.
REDfly CRMs. This track shows experimentally verified D. melanogaster CRMs from REDfly ( 13 ). It helps users check whether any known enhancer lies in their region of interest. Each CRM is linked back to the REDfly database for detailed information (e.g. CRM expression pattern, the evidence for the element, the source, binding sites).
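For the first of these tracks, the threshold-based prediction of individual binding sites can be illustrated with a minimal sketch. It is not the MotifFinder implementation: it assumes the motif is given as per-position base probabilities, uses a uniform background and a small pseudocount (both assumptions), and scans the forward strand only. The names `log_odds` and `scan` are hypothetical.

```python
from math import log2

BASES = "ACGT"


def log_odds(pwm, background=0.25, pseudocount=0.01):
    """Convert per-position base probabilities into log-odds scores."""
    matrix = []
    for column in pwm:
        total = sum(column[b] + pseudocount for b in BASES)
        matrix.append(
            {b: log2((column[b] + pseudocount) / total / background) for b in BASES}
        )
    return matrix


def scan(sequence, pwm, threshold):
    """Return (start, site, score) for every forward-strand match >= threshold."""
    lod = log_odds(pwm)
    width = len(lod)
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - width + 1):
        site = seq[start:start + width]
        if any(base not in BASES for base in site):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(lod[i][base] for i, base in enumerate(site))
        if score >= threshold:
            hits.append((start, site, score))
    return hits


# Toy 3-bp motif that strongly prefers "GGA" (illustrative, not a real TF motif).
toy_pwm = [
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
]
hits = scan("TTGGATT", toy_pwm, threshold=5.0)
```

Unlike the Stubb-based profiles, this approach reports only sites that pass the cutoff, so weak sites that might contribute to binding are discarded.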
Search interface for Motif/ChIP clusters of binding sites
CRMs are known to harbor binding sites for several TFs, which act together to achieve specific regulatory functions. As such, computational tools for genome-wide CRM discovery typically search for clusters of binding sites with suitably chosen collections of TF motifs. Genome Surveyor 2.0 provides an interface for users to search for the most significant clusters of binding sites in the D. melanogaster genome for any user-specified combination of TFs ( Figure 2 A).
The search interface may be accessed from the main page of Genome Surveyor 2.0. Users first select the type of binding site profiles that will be used for search (Single/multi species motif profiles or ChIP profiles). Next, they may choose to scan the entire genome, or provide a list of genomic loci where the search will be performed. Advanced options (e.g. the number of top hits or the minimum number of different TFs in a predicted cluster) are available, but default settings are provided and help pages provide guidance for changing them. Finally, the user selects the motif or ChIP profiles of interest and begins the search. The output of the search tool is a table of predicted regulatory sequences (500-bp segments with clusters of binding sites) in the D. melanogaster genome, with links to appropriate GBrowse views ( Figure 2 B). The results are sorted based on the average value of the selected profiles in the segments. Single as well as multi-species scores are reported for each segment. Moreover, a score representing each motif's presence in the segment is shown separately, to help the user determine which motifs contribute significantly to the cluster. The output also includes information about the nearest neighboring genes and their distances from the binding site cluster.
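The ranking step of such a search can be illustrated with a minimal sketch. It assumes each selected profile has already been reduced to one normalized score per 500-bp segment; the function name `rank_segments` and the `presence_cutoff` used to decide whether a TF counts toward the minimum number of TFs are hypothetical, since the criterion used by the server is not described here.

```python
def rank_segments(profiles, top_n=10, min_tfs=1, presence_cutoff=1.0):
    """Rank segments by the average of the selected profiles.

    profiles: dict mapping TF name -> list of per-segment scores (equal lengths).
    presence_cutoff: assumed score above which a TF counts as present.
    """
    names = sorted(profiles)
    n_segments = len(profiles[names[0]])
    rows = []
    for i in range(n_segments):
        per_tf = {name: profiles[name][i] for name in names}
        n_present = sum(1 for s in per_tf.values() if s >= presence_cutoff)
        if n_present < min_tfs:
            continue  # too few distinct TFs contribute to this segment
        mean_score = sum(per_tf.values()) / len(per_tf)
        rows.append({"segment": i, "mean": mean_score, "per_tf": per_tf})
    rows.sort(key=lambda row: row["mean"], reverse=True)
    return rows[:top_n]


# Toy per-segment scores for three D/V patterning TFs (illustrative numbers only).
toy = {
    "Dl": [0.1, 2.0, 3.0],
    "Twi": [0.2, 2.5, 0.1],
    "Sna": [0.0, 1.5, 0.2],
}
strict = rank_segments(toy, top_n=2, min_tfs=2)
loose = rank_segments(toy, top_n=2, min_tfs=1)
```

Keeping the per-TF scores alongside the mean mirrors the output table, where a separate score for each motif helps the user see which motifs contribute to a cluster.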
Stubb is a popular CRM discovery tool that has been tested by multiple groups in different species ( 18–20 ). We have shown previously that regions with high Stubb scores are highly enriched for experimentally observed TF binding (ChIP), and that the enrichment improves significantly upon incorporating multi-species information ( 7 ). Stubb score profiles can be utilized to investigate the binding site composition of any genomic region. Figure 3 shows an example of motif regulatory analysis for two known CRMs. The strategy of combining the Stubb profiles of multiple TFs and identifying the segments with the highest average scores ( Figure 3 ) has been demonstrated to recover known CRMs ( 16 ). Genome-wide predictions of the ‘supervised CRM prediction’ methods included in Genome Surveyor 2.0 have been assessed statistically and validated experimentally ( 10 ).
Funding for open access charge: This work was supported in part by grants from the National Institutes of Health (grant R01HG004744-01 to M.H.B., grant R01GM085233-01 to S.S.). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Conflict of interest statement. None declared.