The identification and study of the cis-regulatory elements that control gene expression are important areas of biological research, but few resources exist to facilitate large-scale bioinformatics studies of cis-regulation in metazoan species. Drosophila melanogaster, with its well-annotated genome, exceptional resources for comparative genomics and long history of experimental studies of transcriptional regulation, represents the ideal system for regulatory bioinformatics. We have merged two existing Drosophila resources, the REDfly database of cis-regulatory modules and the FlyReg database of transcription factor binding sites (TFBSs), into a single integrated database containing extensive annotation of empirically validated cis-regulatory modules and their constituent binding sites. With the enhanced functionality made possible through this integration of TFBS data into REDfly, together with additional improvements to the REDfly infrastructure, we have constructed a one-stop portal for Drosophila cis-regulatory data that will serve as a powerful resource for both computational and experimental studies of transcriptional regulation. REDfly is freely accessible at http://redfly.ccr.buffalo.edu.
Regulated spatial and temporal control of gene transcription is a fundamental process for all metazoans. Critical to this process is the interaction of transcription factors (TFs) with specific cis-regulatory DNA sequences. These regulatory sequences, for instance enhancers and promoters, are organized in a modular fashion, with each module containing one or more binding sites for a specific combination of TFs (1). We use the term ‘cis-regulatory module’ (CRM) as a generic term to refer to all enhancers and similar regulatory elements that are located outside of the core promoter region and which function to regulate transcription in a spatio–temporal-specific manner. We use the more general term ‘cis-regulatory element’ to refer to either a CRM or a TF binding site (TFBS).
Cis-regulatory elements are clearly important for many areas of biology: CRMs and TFBSs act as major control nodes in embryonic development, and variation in cis-regulatory elements plays an important role in both evolutionary change and normal phenotypic variation (2, 3). Nevertheless, our knowledge of these sequences is surprisingly limited. The vast majority of CRMs are not known and, of those that are, relatively few have been characterized in detail. Even Drosophila melanogaster, a well-studied organism with a richly annotated genome, has identified CRMs for fewer than 2% of its ∼14 000 genes (4). Likewise, fewer than 1% of Drosophila genes currently have annotated TFBS data (5).
A well-annotated collection of known CRMs and their constituent TFBSs would be of significant use in many important areas of biological research, including studies of transcriptional regulation, genome structure and organization and the evolution of gene regulation. Such a resource would have considerable value in aiding subsequent CRM and TFBS discovery, for example, by providing training data for supervised learning or other bioinformatics approaches. Currently, two databases play an important role in the study of cis-regulation in the model organism D. melanogaster: the FlyReg DNase I footprint database (5), a database of empirically defined TFBSs; and the REDfly (Regulatory Element Database for Fly) database (4), a highly annotated source of information on experimentally proven CRMs. These resources have served as the basis of a number of large-scale studies of cis-regulation and have allowed statistical, computational and comparative genomics methods to be brought to bear on its study (6–20). In this report, we describe the merger of these two databases into REDfly v2.0. With this release, REDfly has become a unified source of Drosophila cis-regulatory element annotation with one-stop access to both CRM and TFBS data for one of the best-studied model organisms, and the most comprehensive open-access resource for curated regulatory data in any metazoan species.
REDfly v2.0 (August 2007) contains records for 665 CRMs (up from 544 in the initial release) and 1341 TFBSs (down from 1367 in the initial FlyReg release due to removal of TFBSs not attributed to target genes). Only sequences with empirical support are included in the database. The goal of REDfly is to include all experimentally verified fly CRMs and TFBSs along with their DNA sequence, their associated genes, and the expression patterns they direct. At present, curation has focused on literature reports of sequences that have been unambiguously demonstrated to be sufficient to regulate gene expression, primarily through reporter gene assays in transgenic animals, and on TFBSs discovered by DNase I footprinting assays. For the most part, CRMs are included directly as reported in the literature. Where multiple nested sequences with identical activity were reported, the shortest such sequence was selected. Sequences with identical activity that are distinct but minimally overlapping are mostly reported separately, although in some instances of more substantial overlap, one or more sequences were omitted.
TFBS records include primarily DNase I (but not hydroxyl radical or copper nuclease) footprinting experiments that used protein obtained from nuclear extract (either crude or purified) or recombinant expression (either partial or full-length). When a binding factor purified from nuclear extract has been shown to be the product of a specific gene, footprints were attributed to the gene encoding that factor; otherwise the binding factor for nuclear extract footprints has been left as ‘unspecified’. Where possible, we followed the rule of precedence in attributing footprint data to a particular reference, unless members of the same research group reported refined coordinates in a subsequent publication. When two or more overlapping motifs for the same TF were reported for a single footprinted region, they were merged and annotated as one footprint.
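The merging rule for overlapping motifs can be illustrated with a short sketch. It is not the curation procedure itself, which was applied by curators to individual literature reports; the tuple layout `(tf_name, start, end)`, the function name `merge_footprints` and the use of inclusive coordinates are assumptions made only for this illustration.

```python
from collections import defaultdict


def merge_footprints(footprints):
    """Merge overlapping footprints reported for the same TF.

    `footprints` is an iterable of (tf_name, start, end) tuples with
    inclusive coordinates on one sequence (a hypothetical layout).
    Footprints for different TFs are never merged with each other.
    """
    by_tf = defaultdict(list)
    for tf, start, end in footprints:
        by_tf[tf].append((start, end))

    merged = []
    for tf, spans in by_tf.items():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlaps the footprint being built: extend it.
                cur_end = max(cur_end, end)
            else:
                # No overlap: close the current footprint, start a new one.
                merged.append((tf, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((tf, cur_start, cur_end))
    return sorted(merged)
```

In this sketch, two sites for one factor at 10–20 and 18–30 collapse into a single footprint at 10–30, while an overlapping site for a different factor is kept as a separate record.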
All REDfly sequence features are mapped to the most current release (release 5; http://www.fruitfly.org/sequence/release5genomic.shtml) of the D. melanogaster genome sequence. Coordinates are also provided for the two previous sequence releases for maximum convenience and backward compatibility with other sequence resources. We store the actual DNA sequences as well as the coordinates so that sequences can be downloaded without ambiguity. Because TFBS sequences are often too short to be mapped uniquely to the genome, we also include a ‘TFBS with flank’ option that provides ∼25 bp of additional sequence both 5′ and 3′ to the TFBS. All records contain hyperlinks to the FlyBase (21) and FlyMine (22) entries for the target gene whose expression is regulated by the CRM or TFBS, and all features can be displayed on the GBrowse or UCSC genome browsers (23, 24). For TFBS records, hyperlinks to FlyBase, FlyMine and FlyTF (16) are also available for the TF that binds the site, when known. For CRM records, controlled vocabulary descriptions of the expression pattern mediated by the CRM are provided using the Drosophila anatomy ontology (25). This is a key feature of REDfly and allows users to search for expression patterns using a tree-based browser interface (Figure 1). Selecting a term from the tree will query REDfly for any CRMs annotated with that term or any of its descendant terms. Alternatively, users can search for only a single term. Because expression patterns are described using the anatomy ontology, users can link from a CRM record to any other REDfly CRMs that are annotated as mediating the same gene expression pattern, or to records in FlyBase or the Berkeley Drosophila Genome Project's in situ expression pattern database (26, 27) for genes expressed in that pattern. These features promise to be highly useful for investigating properties of tissue-specific CRMs.
For example, we recently made use of the expression pattern annotations to demonstrate that one class of CRMs, those that drive gene expression in the Drosophila early embryonic blastoderm, has characteristics that distinguish it from other CRMs (6). Detailed instructions on using the ontology to facilitate searching for CRMs that regulate specific expression patterns are provided in REDfly's online help.
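The difference between searching on a term together with its descendants and searching on a single term can be sketched as follows. This is a toy illustration only: the term names, CRM names and the dictionary-based data layout are hypothetical placeholders, not REDfly's stored ontology or its query code.

```python
# Toy data (hypothetical): parent term -> child terms, and CRM -> annotated terms.
CHILDREN = {
    "embryo": ["mesoderm", "ectoderm"],
    "mesoderm": ["somatic muscle"],
}
ANNOTATIONS = {
    "CRM_A": ["somatic muscle"],
    "CRM_B": ["ectoderm"],
    "CRM_C": ["embryo"],
}


def descendants(term, children):
    """Return `term` together with every term below it in the ontology."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:  # guards against terms reachable by two paths
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen


def crms_for_term(term, children, annotations, include_descendants=True):
    """List CRMs annotated with `term`, optionally including descendant terms."""
    terms = descendants(term, children) if include_descendants else {term}
    return sorted(
        crm for crm, crm_terms in annotations.items() if terms & set(crm_terms)
    )
```

With the toy data, `crms_for_term("embryo", CHILDREN, ANNOTATIONS)` returns all three CRMs, whereas passing `include_descendants=False` returns only the CRM annotated directly with that term.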
A major advantage of integrating the REDfly and FlyReg databases is the unprecedented level of detailed information that can now be obtained by mapping TFBSs directly to the CRMs of which they are a part. Upon entry of a new CRM or TFBS, the sequence coordinates of the new element are checked against the coordinates of all of the stored TFBSs or CRMs, respectively. If a TFBS falls within a known CRM, the name of the CRM and a link to its REDfly record are provided. Similarly, all CRM records are linked to the REDfly annotations of any TFBSs that fall within them (Figure 2). Searches of REDfly can be restricted to just those TFBSs that map to known CRMs, and vice versa. Currently, 70% of TFBSs in REDfly map to a known CRM, while 26% of CRMs contain annotated TFBSs. Using these new REDfly features, it is now possible, for example, to investigate the association of TFBS sequences with expression patterns via their corresponding CRMs. REDfly is the only resource for regulatory bioinformatics that provides such a highly integrated annotation of CRMs and their constituent TFBSs.
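The coordinate check described above amounts to an interval containment test. The following is a minimal sketch under stated assumptions: the `Feature` record, its field names and the feature names in the usage note are hypothetical, and a TFBS is taken to 'fall within' a CRM only when it lies entirely inside the CRM on the same chromosome arm.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    """A CRM or TFBS with inclusive coordinates (hypothetical record layout)."""
    name: str
    chrom: str
    start: int
    end: int


def contains(crm, tfbs):
    """True if the TFBS lies entirely within the CRM on the same chromosome."""
    return crm.chrom == tfbs.chrom and crm.start <= tfbs.start and tfbs.end <= crm.end


def link_tfbs_to_crms(crms, tfbss):
    """Map each TFBS name to the names of all CRMs that contain it."""
    return {t.name: [c.name for c in crms if contains(c, t)] for t in tfbss}


def fraction_mapped(links):
    """Fraction of TFBSs that map to at least one CRM."""
    if not links:
        return 0.0
    return sum(1 for crm_names in links.values() if crm_names) / len(links)
```

For example, a site at 2L:150–165 is linked to a CRM spanning 2L:100–900, while a site that extends past the CRM boundary, or that lies on another chromosome arm, is left unlinked. Summary figures such as the proportion of TFBSs that map to a known CRM follow directly from the resulting mapping.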
The database schema has been designed to be both fast and extensible so that additional species can be added to the existing database structure at a later date and utilize the same search capabilities that have been developed for REDfly's Drosophila data. The tables in the database are grouped into four categories as diagrammed at the REDfly site at http://redfly.ccr.buffalo.edu/?content=/database.php:
Species-specific fixed terms (outlined in red)
The CRM definition (yellow)
The binding site definition (blue)
External reference information (green).
The fixed-term tables, equivalent to the dimension tables of a star database schema, contain information that changes infrequently, including anatomy terms (using the controlled vocabulary), evidence terms (also using a controlled vocabulary), chromosome numbers, sequences and gene names and IDs. Fixed-term information can be associated with a particular species, or can be common across all species. By utilizing tables of fixed terms we can load information for multiple species into the database and then reference this information from CRMs or binding sites without duplicating the information. Fixed terms also allow us to reduce query times and prevent the introduction of typographical errors when entering data.
The CRM definition tables consist of the basic information describing each CRM, such as the CRM name, species to which the CRM belongs, free text notes, references to information stored in the fixed-term tables, citation data and references to external websites. The binding site tables describe the TFBSs and provide information similar to that found in the CRM definition tables. A mapping also exists from CRMs to binding sites that are associated with that CRM, and vice versa.
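The relationship between fixed-term tables and CRM definition tables can be sketched with a small in-memory database. This is an illustration only, not the REDfly schema: all table and column names are hypothetical, and SQLite (via Python's standard `sqlite3` module) is swapped in for MySQL so that the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Fixed-term tables: loaded once and referenced by ID thereafter.
CREATE TABLE species (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE anatomy_term (
    id         INTEGER PRIMARY KEY,
    species_id INTEGER REFERENCES species(id),  -- NULL = common to all species
    name       TEXT NOT NULL
);
-- CRM definition table: refers to fixed terms rather than repeating them.
CREATE TABLE crm (
    id         INTEGER PRIMARY KEY,
    species_id INTEGER NOT NULL REFERENCES species(id),
    name       TEXT NOT NULL,
    notes      TEXT
);
CREATE TABLE crm_anatomy (
    crm_id  INTEGER NOT NULL REFERENCES crm(id),
    term_id INTEGER NOT NULL REFERENCES anatomy_term(id)
);
""")

conn.execute("INSERT INTO species (id, name) VALUES (1, 'Drosophila melanogaster')")
conn.execute("INSERT INTO anatomy_term (id, species_id, name) VALUES (1, 1, 'mesoderm')")
conn.executemany(
    "INSERT INTO crm (id, species_id, name) VALUES (?, 1, ?)",
    [(1, "CRM_A"), (2, "CRM_B")],
)
# Both CRMs point at the same anatomy term row; the term text is stored once.
conn.executemany("INSERT INTO crm_anatomy VALUES (?, ?)", [(1, 1), (2, 1)])

rows = conn.execute(
    """
    SELECT crm.name
    FROM crm
    JOIN crm_anatomy ON crm_anatomy.crm_id = crm.id
    JOIN anatomy_term ON anatomy_term.id = crm_anatomy.term_id
    WHERE anatomy_term.name = ?
    ORDER BY crm.name
    """,
    ("mesoderm",),
).fetchall()
```

Because curators select an existing term row rather than typing the term again, the term is stored once however many CRMs reference it, which is the property the fixed-term design relies on to avoid duplication and typographical errors.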
Any reference that a CRM or TFBS record makes to an external site, such as FlyBase (21), is considered an external reference. The external reference tables contain information on how to construct references to external sites, such as a template for the URL and the required parameters. Citations are also included in external references.
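Constructing an external link from a stored URL template and a list of required parameters might look like the sketch below. The site key, the `example.org` URL and the parameter name are placeholders invented for this illustration; they are not the templates stored in REDfly or the URL schemes of any real resource.

```python
from string import Template
from urllib.parse import quote

# Hypothetical external-reference table: one URL template per external site,
# plus the parameters that must be supplied to fill it in.
EXTERNAL_SITES = {
    "example_db": {
        "url_template": "https://example.org/gene/$gene_id",
        "required": ("gene_id",),
    },
}


def build_link(site, **params):
    """Fill in the URL template for `site`, checking required parameters."""
    spec = EXTERNAL_SITES[site]
    missing = [p for p in spec["required"] if p not in params]
    if missing:
        raise ValueError(f"missing parameters for {site}: {missing}")
    # Percent-encode values so that they are safe inside a URL path.
    safe = {key: quote(str(value), safe="") for key, value in params.items()}
    return Template(spec["url_template"]).substitute(safe)
```

Keeping the template in a table rather than in each record means that a change to an external site's URL scheme requires only one update.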
Snapshots of the MySQL schema (i.e. a database dump) are recorded daily and are available for download. This provides extra backup and versioning protection, as well as an alternative method of access to the data for interested users.
RECENT IMPROVEMENTS TO REDFLY
In addition to the inclusion of TFBSs and their association with corresponding CRMs, we have implemented a number of other key improvements to REDfly. In particular, we have developed extensible markup language (XML) representations for both CRM and TFBS records and have made XML-formatted data available as one of the download options. The XML format is the most comprehensive available for REDfly and allows for a complete dump of the database contents. Development of the XML format has also helped us to automate our input procedures and should help to increase the pace of updates and additions to the database.
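Reading XML-formatted records programmatically could proceed along the lines of the sketch below. The element and attribute names (`records`, `crm`, `gene`, `coordinates`, `tfbs`) and the sample values are invented for this illustration and should not be taken as the REDfly XML format; the downloaded files themselves define the actual structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real REDfly XML layout may differ.
SAMPLE = """<records>
  <crm name="CRM_A">
    <gene>gene1</gene>
    <coordinates chrom="2L" start="100" end="900"/>
    <tfbs name="TFBS_1" factor="tf1" start="150" end="165"/>
    <tfbs name="TFBS_2" factor="tf2" start="300" end="312"/>
  </crm>
</records>"""


def parse_crms(xml_text):
    """Convert each <crm> element into a plain dictionary."""
    root = ET.fromstring(xml_text)
    records = []
    for crm in root.iter("crm"):
        coords = crm.find("coordinates")
        records.append({
            "name": crm.get("name"),
            "gene": crm.findtext("gene"),
            "chrom": coords.get("chrom"),
            "start": int(coords.get("start")),
            "end": int(coords.get("end")),
            # Names of the binding sites nested inside this CRM.
            "tfbs": [site.get("name") for site in crm.findall("tfbs")],
        })
    return records
```

Nesting binding sites inside their parent module is one way an XML representation can carry the CRM-to-TFBS association in a single record.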
Data sharing with ORegAnno
We have also established data-sharing standards with the ORegAnno community regulatory annotation database (28) and implemented a two-way exchange of data. All REDfly data are automatically shared with ORegAnno, where they represent 33% and 27% of the total number of curated CRM and TFBS records in ORegAnno, respectively. Although ORegAnno lacks Drosophila-specific functionality and several of the detailed annotation fields contained in REDfly, the inclusion of our data within ORegAnno allows for alternative access through the ORegAnno web services and the newly implemented ORegAnno tracks in the UCSC browser (23). REDfly data that do not correspond to core ORegAnno fields are stored as ‘metadata’ within the ORegAnno record. Importantly, ORegAnno is an open community-based annotation platform. Therefore, community users can annotate fly CRMs and TFBSs via the easy-to-use ORegAnno interface that includes automatic mappings to the current genome build. REDfly can then automatically retrieve these data from ORegAnno and map them to the appropriate REDfly fields using the XML representations. During the synchronization process, REDfly also performs real-time queries to NCBI and UCSC to augment the ORegAnno data with related information such as literature citations and mapping of feature locations to multiple genome sequence releases. The records are then passed to the REDfly curators for validation, for the addition of any further annotation not provided in the ORegAnno metadata, and for connection to external Drosophila-specific resources not supported by ORegAnno. Over time, we anticipate that this will be the primary route of entry for REDfly data, either from the community or by our own curators.
In this way, we are able to take advantage of ORegAnno's general database cross-referencing functions, community-based annotation model and UCSC Browser tracks while still maintaining REDfly's ability to provide Drosophila-specific information such as expression pattern data, links to external Drosophila resources and cross-references between CRMs and TFBSs.
In addition to continued curation of REDfly, we have targeted several areas for development within the near future. Important among these is to expand our curation to include TFBS data from sources other than footprint experiments, e.g. from electrophoretic mobility shift assay (EMSA) and chromatin immunoprecipitation experiments. We also plan to annotate a broader range of CRMs, including negative regulatory elements (silencers), and to increase the amount of annotation for each CRM to include features such as the FlyBase transgenic transposon ID for the reporter gene construct used in the assay that defined the CRM, and the position of the CRM with respect to the organization of the gene it regulates (e.g. transcription start site, exon boundaries). These additions will increase the comprehensiveness of REDfly as a source of Drosophila cis-regulatory data and facilitate the mining of these data in more diverse and sophisticated ways. A further planned development is the addition of images of reporter gene expression driven by each CRM in order to associate cis-regulatory sequences directly with embryonic expression patterns; this work is being conducted in collaboration with the FlyExpress project (29).
Over the longer term, we plan to incorporate a more extensive use of formal ontologies to describe not only expression pattern data but also experimental evidence, assay types and sequence features in order to maximize opportunities for data mining and for interoperability with other databases. Toward this goal, we have been working with the Sequence Ontology (SO) developers (30) to expand and refine SO's treatment of cis-regulatory sequences. We note that REDfly is easily adaptable to curation of cis-regulatory elements from species other than D. melanogaster with only minor modifications to the current schema, which raises the possibility of incorporating multi-species regulatory data (either by direct curation or via our links with ORegAnno) into the database. This, along with our use of ontologies to allow interspecies mapping of genes and tissues, has the potential to make REDfly an unparalleled platform for comparing regulatory strategies and studying the organization of regulatory elements throughout evolution.
REDfly is freely available to all users without restriction at http://redfly.ccr.buffalo.edu. A snapshot of the current MySQL schema is posted daily on the REDfly server. Source code and other detailed information are available upon request.
We thank Guruprasad Sarvothaman for programming assistance and Stephen Montgomery, Obi Griffith, Bryan Chu and Erin Pleasance for assistance with the REDfly-ORegAnno data exchange. Grant support is from the National Institutes of Health (HG002489 to M.S.H.). Funding to pay the Open Access publication charges for this article was provided by the University at Buffalo.
Conflict of interest statement. None declared.