Summary: Despite increasing numbers of computational tools developed to predict cis-regulatory sequences, the limited availability of high-quality datasets of transcription factor binding sites restricts advances in the bioinformatics of gene regulation. Here we present such a dataset based on a systematic literature curation and genome annotation of DNase I footprints for the fruitfly, Drosophila melanogaster. Using the experimental results of 201 primary references, we annotated 1367 binding sites from 87 transcription factors and 101 target genes in the D.melanogaster genome sequence. These data will provide a rich resource for future bioinformatics analyses of transcriptional regulation in Drosophila such as constructing motif models, training cis-regulatory module detectors, benchmarking alignment tools and continued text mining of the extensive literature on transcriptional regulation in this important model organism.
The fruitfly Drosophila melanogaster has one of the most highly annotated metazoan genome sequences with respect to gene and transposable element content (Misra et al., 2002; Kaminker et al., 2002). In contrast, the cis-regulatory sequences that control transcription are only just beginning to be incorporated explicitly into the genome annotation, despite the vast literature of functionally characterized cis-regulatory elements that exists for this species (http://www.flybase.org/). This lack of a systematic, publicly available compilation of cis-regulatory sequences for D.melanogaster, such as the SCPD in yeast (Zhu and Zhang, 1999), limits progress in the computational analysis of gene regulation for this important model species. The need for such a resource is clear from the fact that cis-regulatory curation efforts of limited scope for genes involved in early development (Ludwig et al., 2000; Spirov et al., 2000; Berman et al., 2002; Papatsenko et al., 2002; Rajewsky et al., 2002; Emberly et al., 2003; Lifanov et al., 2003) have rapidly proven useful for subsequent bioinformatic and comparative studies of gene regulation (Costas et al., 2003; Grad et al., 2004).
To contribute to a comprehensive annotation of cis-regulatory sequences in D.melanogaster, we report here a database of DNase I footprint sequences derived from a systematic literature curation and annotation effort. We have chosen to focus on DNase I footprint data since they are an abundant and high-quality source of data on transcription factor specificity (Galas and Schmitz, 1978). In contrast to previous binding site compilations in Drosophila, these data are derived from the same experimental data type, cover all available aspects of development and are explicitly linked to the finished Release 3 genome sequence coordinates (Celniker et al., 2002). The purpose of this note is to present a basic characterization of these data and to make them available in a single database as a resource for computational analyses of transcriptional regulation in one of the most important model organisms.
Our literature curation yielded a total of 201 references with non-redundant experimental data from DNase I footprinting experiments (see Supplemental Files 1 and 2). Our set of references is a superset of all those meeting the same criteria in previous compilations of binding site data for the Drosophila early embryo (Ludwig et al., 2000; Spirov et al., 2000; Berman et al., 2002; Papatsenko et al., 2002; Rajewsky et al., 2002; Emberly et al., 2003; Lifanov et al., 2003) and Transfac 5.0 (Wingender et al., 2001). The overlap between the present and previous compilations is detailed in Supplemental File 1. Our current work includes information from 113 primary references not present in any previous compilation, doubling the number of references with curated Drosophila DNase I footprint data consolidated in a single public database.
Of the 1367 footprints annotated, 1341 footprints (98%) can be attributed to 101 target genes, with 26 footprints (2%) obtained from chromatin immunoprecipitation experiments having ‘unknown’ targets (Supplemental File 3). The mean (median) number of footprints annotated per target gene is 13.3 (5), with a skewed distribution (Fig. 1a): the top ten genes (Ubx, Antp, h, ftz, eve, dpp, kni, en, Ddc, Sgs4) contribute nearly half (49%) of the footprints mapping to known targets. Likewise, 1164 (85%) of the 1367 footprints annotated can be attributed to 87 purified or recombinant transcription factors, with the remaining 203 footprints (15%) coming from ‘unspecified’ factors of unknown identity derived from crude or purified nuclear extract (Supplemental File 3). The distribution of the number of footprints per factor is also skewed, with a mean (median) number of footprints annotated per factor of 13.4 (6) (Fig. 1). As with the distribution by target, the top ten factors (hb, Trl, ftz, Ubx, en, bcd, Kr, abd-A, z, dl) also contribute nearly half (49%) of the footprints derived from known factors. Although these data represent the most comprehensive collection of binding site data in Drosophila to date, it is clear that binding site information is lacking for the majority of factors and genes, a limitation that can hopefully be overcome in the future by high-throughput experimental techniques (e.g. Bulyk et al., 2001).
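The per-target and per-factor summaries above (number of labels, mean, median and top-ten share) can be reproduced from any flat list of footprint labels. The following is a minimal Python sketch under stated assumptions, not the procedure used for this note: the function name `summarize`, the placeholder strings in `UNKNOWN_LABELS` and the toy gene list are all hypothetical, and the real records are those in Supplemental File 3.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical placeholder labels for footprints without a known target/factor.
UNKNOWN_LABELS = {"unknown", "unspecified"}


def summarize(labels, top_n=10):
    """Summarize footprints per label (target gene or factor).

    `labels` holds one entry per footprint; entries in UNKNOWN_LABELS
    are excluded before any statistic is computed.
    """
    counts = Counter(label for label in labels if label not in UNKNOWN_LABELS)
    per_label = list(counts.values())
    total = sum(per_label)
    # Footprints contributed by the `top_n` most frequent labels.
    top_total = sum(n for _, n in counts.most_common(top_n))
    return {
        "n_labels": len(counts),
        "n_footprints": total,
        "mean": mean(per_label),
        "median": median(per_label),
        "top_share": top_total / total,
    }


# Toy data only (not the real annotation): one label per footprint.
toy_targets = ["eve"] * 5 + ["h"] * 3 + ["kni"] + ["unknown"]
summary = summarize(toy_targets, top_n=1)
print(summary)

# Consistency check on the totals reported in the text:
# 1341 footprints / 101 targets and 1164 footprints / 87 factors.
print(round(1341 / 101, 1), round(1164 / 87, 1))
```

The last line recomputes the two reported means directly from the totals given in the text (13.3 footprints per target and 13.4 per factor); the medians and top-ten shares cannot be recovered from totals alone and require the per-footprint records.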
Individually, the 1363 footprints that map to euchromatic arms (four footprints map to heterochromatic scaffolds) comprise a total of 26 983 bp of DNA sequence, but since nearly half (45%, n = 613) of the footprints annotated overlap at least one other footprint, these data span only 21 372 bp of genomic DNA, or approximately 0.0183% of the Release 3 euchromatic genome sequence. The footprinted sequences annotated range in length from 5 to 140 bp, and surprisingly have a mean (median) length of 19.8 bp (17 bp) (Fig. 1). In fact, the vast majority (81%, n = 1101) of the footprinted sequences annotated are longer than both the 10.5 bp length needed for one turn of the B-form DNA helix (Wolffe, 1998) and the core recognition motif length (5–10 bp) typically reported for most transcription factors. The prevalence of long footprinted sequences may simply result from steric hindrance of the transcription factor preventing access to DNase cleavage, but may also suggest an under-appreciated role for non-core motif nucleotides in transcription factor-DNA interactions and/or a high frequency of homo-cooperative binding interactions. Certainly, the magnitude of overlap among footprinted sequences suggests the possibility of extensive hetero-cooperative interactions in these data. With the resource presented here, these and other hypotheses can now be tested using the wide array of experimental and computational methods available for the functional analysis of transcription factor binding sites.
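The difference between the summed footprint length and the non-redundant genomic span arises from merging overlapping intervals. The following Python sketch illustrates that calculation on toy data; it is an illustration under stated assumptions rather than the method used here. The function names, the `(arm, start, end)` tuple layout and the choice of 1-based inclusive coordinates are all hypothetical.

```python
from collections import defaultdict


def group_by_arm(footprints):
    """Group (arm, start, end) tuples by arm and sort each group by start."""
    by_arm = defaultdict(list)
    for arm, start, end in footprints:
        by_arm[arm].append((start, end))
    return {arm: sorted(intervals) for arm, intervals in by_arm.items()}


def summed_length(footprints):
    """Total length when every footprint is counted individually."""
    # Assumes 1-based inclusive coordinates, hence the +1.
    return sum(end - start + 1 for _, start, end in footprints)


def merged_span(footprints):
    """Non-redundant span: overlapping footprints are merged before counting."""
    total = 0
    for intervals in group_by_arm(footprints).values():
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:           # overlaps the current merged block
                cur_end = max(cur_end, end)
            else:                          # gap: close the block, start a new one
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
    return total


def count_overlapping(footprints):
    """Number of footprints that overlap at least one other footprint."""
    n_overlapping = 0
    for intervals in group_by_arm(footprints).values():
        max_prev_end = None
        for i, (start, end) in enumerate(intervals):
            hits_earlier = max_prev_end is not None and start <= max_prev_end
            hits_later = i + 1 < len(intervals) and intervals[i + 1][0] <= end
            if hits_earlier or hits_later:
                n_overlapping += 1
            max_prev_end = end if max_prev_end is None else max(max_prev_end, end)
    return n_overlapping


# Toy footprints only (not real annotations).
toy = [("2R", 100, 119), ("2R", 110, 125), ("2R", 200, 209), ("3L", 50, 59)]
print(summed_length(toy), merged_span(toy), count_overlapping(toy))
```

On this toy input the summed length is 56 bp, the merged span is 46 bp and two footprints overlap another footprint, which mirrors on a small scale the relationship between the 26 983 bp and 21 372 bp figures reported above.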
Supplementary data for this paper are available on Bioinformatics online.
We thank Nicholas Blanchard for assistance with literature curation; FlyBase Cambridge for access to the Drosophila offprint collection; Michael Ashburner, Douda Bensasson, Thomas Down and Rachel Drysdale for suggestions on data format and representation; and three anonymous reviewers and Nikolaus Rajewsky for helpful comments on the manuscript. This work was supported in part by NIH grants HG00750 and GH002673 to G.Rubin and SEC, respectively. CMB is supported by NIH training grant T32 HL07279 to E.Rubin and by a Royal Society USA Research Fellowship.