SwissRegulon is a database containing genome-wide annotations of regulatory sites in the intergenic regions of genomes. The regulatory site annotations are produced using a number of recently developed algorithms that operate on multiple alignments of orthologous intergenic regions from related genomes in combination with, whenever available, known sites from the literature, and ChIP-on-chip binding data. Currently SwissRegulon contains annotations for yeast and 17 prokaryotic genomes. The database provides information about the sequence, location, orientation, posterior probability and, whenever available, binding factor of each annotated site. To enable easy viewing of the regulatory site annotations in the context of other features annotated on the genomes, the sites are displayed using the GBrowse genome browser interface and can be queried based on any annotated genomic feature. The database can also be queried for regulons, i.e. sites bound by a common factor.
Regulation of the rate of transcription initiation is one of the main mechanisms through which cells regulate the expression of proteins encoded in their genomes. Transcription regulation is generally implemented through the sequence-specific binding of transcription factors (TFs) to target sites in the DNA, which are most often in the intergenic region upstream of the regulated gene.
The sequence segments recognized by TFs are generally short, i.e. typically ∼20 bp for prokaryotic TFs and ∼10 bp for eukaryotic TFs, and are normally degenerate. In spite of decades of extensive experimental work, the number of experimentally known binding sites accounts for only a small fraction of the total number of functional sites that likely exist. For example, probably the most extensive data are available for Escherichia coli with ∼1000 sites that have experimental support (1), and for Saccharomyces cerevisiae with a few hundred sites that have direct experimental support (2). However, even for E.coli this probably constitutes less than one-fifth of all binding sites that exist genome-wide, and only about a third of the ∼300 TFs in E.coli are represented with at least one binding site.
Computational approaches for inferring transcription factor binding sites stretch back almost two decades (3–5). However, only with the recent advent of large numbers of fully sequenced genomes, and the availability of genome-wide gene expression and chromatin immunoprecipitation data, has it become computationally feasible to comprehensively annotate regulatory sites genome-wide. For example, several approaches have been developed recently that identify regulatory sites by searching for significantly conserved sequence segments within multiple alignments of orthologous intergenic regions of related genomes (6–11). In this context several yeast species were sequenced recently (12,13) with the aim of identifying regulatory sites genome-wide. In addition to comparative genomic approaches, large-scale ChIP-on-chip experiments have been undertaken recently in yeast to determine the intergenic regions bound by over 100 TFs (14,15). Computational approaches that combine comparative genomic analysis of orthologous intergenic regions with the analysis of these large-scale ChIP-on-chip data have led to the first comprehensive genome-wide annotations of binding sites in yeast (11,15–17).
REGULATORY SITE ANNOTATION METHODS
The methods that we use to produce the regulatory-site annotation for a given genome depend on the amounts and kinds of data that are available for that organism. For most organisms currently in SwissRegulon the only available data consist of the sequence of the genome and the genome sequences of related organisms. For these organisms our annotations are based on a careful comparison of orthologous intergenic regions from sets of related organisms as described below. For some genomes there are collections of known binding sites and we use these to build position-specific weight matrices (WMs) that represent the sequence-specificities of the TFs for which sites are available. For yeast there are also comprehensive ChIP-on-chip binding data available and we use these in combination with known sites to build a large set of WM models of yeast TFs. We use these sets of WMs to scan multiple alignments of orthologous intergenic regions genome-wide using the algorithm MotEvo (16).
IRUS: Intergenic Regions Under Selection
At the time of writing there are 354 complete microbial genomes available from the NCBI database (18). For all but a handful of these genomes there are no known regulatory sites, nor any ChIP-on-chip data available. However, for almost any genome in this collection one can find a number of related genomes that are close enough that recognizable sequence homology in intergenic regions remains, even though a substantial fraction of nucleotides has been substituted since the common ancestor of the species. We have developed an automated pipeline that, starting from such a set of related genomes, predicts segments in intergenic regions that are under selection genome-wide. The details of this procedure, called IRUS, will be presented elsewhere. Here we briefly list the main steps:
1. We extract the genome sequences from GenBank (18) and identify orthologous genes between all pairs of species.
2. We reconstruct the phylogenetic tree relating the species. We first estimate the tree topology from multiple alignments of orthologous genes. Then we determine all pairwise distances from aligned third positions in 4-fold degenerate codons. Finally we fit the pairwise distances to the tree topology to obtain the branch lengths in the tree.
3. We construct multiple alignments of orthologous intergenic regions using T-Coffee (19).
4. We scan all alignments for putative regulatory sites using a probabilistic algorithm that explicitly models the evolution of regulatory sites along the phylogenetic tree. The algorithm returns posterior probabilities for each segment to contain a regulatory site and we select a set of segments with high posterior probability.
Note that the IRUS pipeline can be applied to any set of related species for which genome sequences are available.
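The distance step of the tree reconstruction can be illustrated with a minimal sketch. It is not the IRUS implementation, whose details are not given here. It assumes the standard genetic code, an in-frame pairwise alignment of two orthologous coding sequences, and a Jukes-Cantor correction (the correction is our assumption, the text does not name one). The function names fourfold_distance and three_taxon_branches are hypothetical. For three species the fit of pairwise distances to the tree is exact and has the closed form shown; larger trees would need a general fitting procedure such as least squares.

```python
import math

# First two codon positions of the eight 4-fold degenerate families in the
# standard genetic code (Leu CTN, Val GTN, Ser TCN, Pro CCN, Thr ACN,
# Ala GCN, Arg CGN, Gly GGN).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def fourfold_distance(cds_a, cds_b):
    """Jukes-Cantor distance from third positions of shared 4-fold codons.

    cds_a and cds_b are aligned, in-frame coding sequences of equal length.
    The Jukes-Cantor correction is an assumption made for this sketch.
    """
    if len(cds_a) != len(cds_b):
        raise ValueError("sequences must be aligned to equal length")
    sites = 0
    diffs = 0
    for i in range(0, len(cds_a) - 2, 3):
        codon_a = cds_a[i:i + 3].upper()
        codon_b = cds_b[i:i + 3].upper()
        # Keep the codon only if both species share the same 4-fold prefix.
        if codon_a[:2] != codon_b[:2] or codon_a[:2] not in FOURFOLD_PREFIXES:
            continue
        # Skip gaps and ambiguous nucleotides at the third position.
        if codon_a[2] not in "ACGT" or codon_b[2] not in "ACGT":
            continue
        sites += 1
        diffs += codon_a[2] != codon_b[2]
    if sites == 0:
        raise ValueError("no shared 4-fold degenerate codons")
    p = diffs / sites
    if p >= 0.75:
        return math.inf  # substitutions are saturated
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def three_taxon_branches(d_ab, d_ac, d_bc):
    """Branch lengths (a, b, c) of the unrooted tree on three species."""
    a = (d_ab + d_ac - d_bc) / 2.0
    b = (d_ab + d_bc - d_ac) / 2.0
    c = (d_ac + d_bc - d_ab) / 2.0
    return a, b, c
```

In practice the distance would be computed over the concatenated alignments of many orthologous genes, so that each pairwise estimate rests on a large number of 4-fold degenerate positions.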
Reconstructing WMs from known sites and ChIP-on-chip data
First, for E.coli and S.cerevisiae we reconstructed a set of WMs from the known binding sites in RegulonDB (1) and SCPD (2) by an automated curation procedure using the PROCSE algorithm (20). PROCSE is a probabilistic clustering algorithm that assumes the input sequences derive from an unknown number of unknown WMs; it simultaneously partitions the sites into subsets that derive from a common WM and aligns the sequences within the subsets. For each TF we also determined the site length that maximized the overall probability of the data. For E.coli this curation led to 97 WMs for 58 different TFs, and for S.cerevisiae to 67 WMs for 62 different TFs and complexes of multiple TFs. Second, for S.cerevisiae we used the extensive binding data from (15) to infer WMs using the PhyloGibbs algorithm on alignments of orthologous intergenic regions of the Saccharomyces sensu stricto species as described in Ref. (11). Finally we combined and hand-curated the WMs resulting from the curation of the known sites and the WMs obtained with PhyloGibbs. This led to a total of 72 high confidence WMs, most of which correspond to the binding motif of a given yeast TF, whereas a small number correspond to the binding motif of a complex of yeast TFs.
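To make the notion of a WM concrete, the following minimal sketch turns one already aligned cluster of sites into a matrix of per-column base probabilities. It is not PROCSE: the clustering, alignment and site-length selection described above are not reproduced. The function names and the pseudocount of 0.5 per base are assumptions chosen for illustration, and the input is assumed to contain only the letters A, C, G and T.

```python
import math

ALPHABET = "ACGT"


def build_wm(sites, pseudocount=0.5):
    """Return one dict of base probabilities per column of the aligned sites."""
    if not sites:
        raise ValueError("at least one site is required")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("sites must be aligned to equal length")
    wm = []
    for col in range(length):
        # Start every count at the pseudocount so no probability is zero.
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[col].upper()] += 1
        total = sum(counts.values())
        wm.append({base: counts[base] / total for base in ALPHABET})
    return wm


def information_content(wm):
    """Total information content in bits, relative to a uniform background."""
    return sum(
        2.0 + sum(p * math.log2(p) for p in column.values())
        for column in wm
    )


# Hypothetical aligned sites, not taken from any database.
example_wm = build_wm(["ACGT", "ACGA", "ACGT", "TCGT"])
```

The pseudocount keeps every probability strictly positive, which matters as soon as the WM is used to score sequences that contain a base not seen in the training sites.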
MotEvo is a newly developed algorithm which identifies binding sites for a set of predefined WMs by scanning multiple alignments of intergenic regions (16). MotEvo exhaustively reports putative locations of binding sites and assigns a posterior probability to each reported site. For E.coli we ran MotEvo with the 97 curated WMs on multiple alignments of orthologous intergenic regions from E.coli, Salmonella typhi, Yersinia pestis KIM, Photorhabdus luminescens, and Photobacterium profundum SS9. For this dataset MotEvo reported 6237 putative sites in the E.coli genome, 1162 of which have a posterior probability >0.5. For S.cerevisiae we ran MotEvo with the 72 curated WMs on the multiple alignments of orthologous intergenic regions of the Saccharomyces sensu stricto species. For this dataset MotEvo reported over 85 000 putative sites, of which ∼57 000 have a posterior probability >0.1 and ∼17 000 have a posterior probability >0.5. For each gene MotEvo was run on the multiple alignment of intergenic regions from all species for which orthologs were available. For genes for which none of the other species has an ortholog, MotEvo was run on the intergenic region of the reference species only. For these cases MotEvo effectively reduces to a WM matching algorithm.
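The single-species case, in which the scan reduces to WM matching, can be sketched as follows. This is a deliberately simplified two-hypothesis model and not the MotEvo algorithm: each window is scored independently as either a site drawn from the WM or background sequence, whereas MotEvo models site evolution along the phylogeny. The prior of 0.01, the uniform background and the helper consensus_wm are assumptions made for illustration; the sequence is assumed to contain only A, C, G and T.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def consensus_wm(consensus, match=0.97):
    """Hypothetical helper: a WM that strongly prefers one consensus word."""
    other = (1.0 - match) / 3.0
    return [
        {base: (match if base == wanted else other) for base in "ACGT"}
        for wanted in consensus
    ]


def window_likelihood(window, wm):
    """Probability of the window under the WM (columns are independent)."""
    prob = 1.0
    for base, column in zip(window, wm):
        prob *= column[base]
    return prob


def scan(sequence, wm, prior=0.01, background=0.25):
    """Return (start, strand, site, posterior) for every window and strand.

    start is 1-based; the posterior compares 'site' against 'background'
    for that window alone, which is a simplification.
    """
    sequence = sequence.upper()
    width = len(wm)
    p_bg = background ** width
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            p_wm = window_likelihood(site, wm)
            posterior = prior * p_wm / (prior * p_wm + (1.0 - prior) * p_bg)
            hits.append((start + 1, strand, site, posterior))
    return hits


hits = scan("TTGGACTT", consensus_wm("GGAC"))
best = max(hits, key=lambda hit: hit[3])
```

Thresholds such as the posterior cut-offs of 0.1 and 0.5 quoted above would then be applied to the fourth element of each reported hit.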
Currently the SwissRegulon database contains regulatory site annotations for the following 18 organisms: S.cerevisiae, Agrobacterium tumefaciens, Bacillus subtilis, Brucella suis, Burkholderia, Chlamydophila caviae, Corynebacterium glutamicum, Ehrlichia canis, E.coli K12, Mycobacterium tuberculosis, Neisseria meningitidis, Prochlorococcus marinus, Pseudomonas syringae, Ralstonia eutropha, Rickettsia typhi wilmington, Staphylococcus aureus, Streptococcus pneumoniae and Vibrio cholerae. Our regulatory site annotations are shown in the context of the general genome annotations provided for each of these organisms. For all organisms except for yeast the genome annotations were obtained from GenBank (18). For S.cerevisiae the genome annotation, which is significantly more extensive, was obtained from the Saccharomyces genome database (SGD) (21).
For all organisms except E.coli and S.cerevisiae the annotated regulatory sites are based on IRUS predictions only. For each site the genomic location, strand, sequence and posterior probability as given by IRUS are recorded in the database. The number of regulatory sites predicted by IRUS varies from ∼750 sites for E.canis to ∼14 000 sites for Burkholderia. For E.coli and S.cerevisiae the regulatory site annotations of MotEvo are given in addition to the IRUS predictions. For each of these sites the binding TF is also identified. In addition, the database contains WM logos and regulons, i.e. lists of all annotated sites sorted by posterior probability for each TF. Finally, for S.cerevisiae the database also displays the experimentally determined binding sites from SCPD (2) and regulatory site annotations (15,17) that were downloaded from SGD. All genome-wide binding site annotations are available as flat files in GFF format from the download section. For E.coli and S.cerevisiae we also provide flat files of the WMs that are used in the annotations.
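A downloaded annotation file can be processed with a few lines of code. The sketch below relies only on the generic tab-separated GFF layout (sequence, source, feature, start, end, score, strand, frame, attributes). That the posterior probability is stored in the score column is an assumption that should be checked against the actual files, and the example lines, including the source and attribute values, are invented for illustration.

```python
from typing import NamedTuple


class Site(NamedTuple):
    seqid: str
    start: int
    end: int
    strand: str
    posterior: float
    attributes: str


def parse_gff(lines, min_posterior=0.0):
    """Collect sites whose score column is at least min_posterior."""
    sites = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 8 or fields[5] == ".":
            continue  # skip malformed records and records without a score
        posterior = float(fields[5])
        if posterior >= min_posterior:
            attributes = fields[8] if len(fields) > 8 else ""
            sites.append(Site(fields[0], int(fields[3]), int(fields[4]),
                              fields[6], posterior, attributes))
    return sites


# Invented example records; real files may use different conventions.
example_lines = [
    "# hypothetical excerpt",
    "chrII\texample\tbinding_site\t600\t612\t0.93\t+\t.\tTF=RAP1",
    "chrII\texample\tbinding_site\t750\t762\t0.20\t-\t.\tTF=RAP1",
]
confident_sites = parse_gff(example_lines, min_posterior=0.5)
```

With a real file, the same function can be called on an open file handle, since iterating over a text file yields its lines.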
The SwissRegulon database can be accessed at the address: . The database uses the Generic Genome Browser (GBrowse) (22) as an engine and is fully compatible with the original GBrowse. For a detailed description of GBrowse usage and features please see the original manual at the developers' page. Briefly, the genome browser graphically displays a section of the genome and all features annotated on it. The user can zoom in and out and scroll through the genome and click on features to obtain more detailed information.
Users can specify a genome segment to display, e.g. chrII:600..1000, or query the database by entering a keyword that may include wild card characters, e.g. SKO*. Such a query returns a list of matches to the search term. For example, to find all annotated binding sites for the transcription factor RAP1 one would query the database for RAP1* (note that, in addition to the binding sites, this query would also return the RAP1 gene). By clicking on one of the sites in the list the user will see the section of the genome where the site occurs.
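The two query forms can be mimicked offline, for instance when working with the downloaded flat files. The following sketch is not the GBrowse implementation. It assumes that a segment query has the form name:start..end and that keyword matching is case-insensitive with shell-style wild cards; the function names and the feature names in the example are hypothetical.

```python
import fnmatch
import re

# A segment query such as chrII:600..1000.
REGION = re.compile(r"^(?P<seqid>[^:\s]+):(?P<start>\d+)\.\.(?P<end>\d+)$")


def parse_query(query):
    """Classify a query as a genome segment or a keyword pattern."""
    query = query.strip()
    match = REGION.match(query)
    if match:
        return ("region", match["seqid"],
                int(match["start"]), int(match["end"]))
    return ("keyword", query)


def match_features(pattern, names):
    """Return the feature names matching a wild card pattern.

    Case-insensitive matching is an assumption made for this sketch.
    """
    return [name for name in names
            if fnmatch.fnmatchcase(name.upper(), pattern.upper())]


# Hypothetical feature names.
matches = match_features("RAP1*", ["RAP1", "RAP1_site_12", "SKO1"])
```

As in the RAP1* example above, a trailing wild card matches both the gene name itself and any site identifiers that begin with it.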
Annotated regulatory sites are displayed as rectangular boxes with an arrow inside showing the strand of the site. The posterior probability assigned to the site is represented by the intensity of the box's color. That is, the higher the posterior probability, the more intense is the color of the box. Every box is labeled by an identifier which is either the name of the TF that binds the site or a unique identifier if the site has not been assigned to any known TF.
An example screen shot is shown in Figure 1. Placing the cursor on the box brings up a pop-up legend with the sequence of the site and its posterior probability. Clicking on a binding site box links to a page with detailed information about the site. For binding sites assigned to a TF this information includes the ‘regulon list’ of all sites for the same TF, and a logo of its WM. The regulon list shows for each site in the regulon the name(s) of the upstream gene(s) it regulates, the genomic coordinates of the intergenic region in which it occurs, the genomic coordinates of the site, and the posterior probability of the site. For convenient browsing the user can filter out sites according to their posterior probability. Filters are accessible under the ‘Results and Analysis’ pop-up menu.
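The regulon list and the posterior filter amount to a grouping followed by a sort. The sketch below illustrates this on plain tuples of the form (TF name, location, posterior probability). The record layout, the function name and the example values are assumptions, not the database schema.

```python
from collections import defaultdict


def build_regulons(sites, min_posterior=0.0):
    """Group sites by TF and sort each regulon by decreasing posterior."""
    regulons = defaultdict(list)
    for tf_name, location, posterior in sites:
        if posterior >= min_posterior:
            regulons[tf_name].append((location, posterior))
    return {
        tf_name: sorted(members, key=lambda member: member[1], reverse=True)
        for tf_name, members in regulons.items()
    }


# Invented example records.
example_sites = [
    ("RAP1", "chrII:600..612", 0.62),
    ("RAP1", "chrIV:1200..1212", 0.95),
    ("SKO1", "chrII:900..909", 0.30),
]
regulons = build_regulons(example_sites, min_posterior=0.5)
```

Raising min_posterior shortens each list to its most confident sites, which mirrors the filtering option offered in the browser.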
COMPARISON WITH EXISTING RESOURCES
There are a number of databases that collect known TF binding sites from the literature. Most of these focus on regulatory sites from a single organism, e.g. RegulonDB for E.coli (23), SCPD for S.cerevisiae (2) and the more recent regulatory site annotations based on ChIP-on-chip data (15,17), DBTBS for B.subtilis (24), AGRIS for Arabidopsis thaliana (25), and the DNase I footprint database for Drosophila melanogaster (26). Similarly, there are a number of databases that contain known binding sites in vertebrate genomes (27–29). The well-known commercial TRANSFAC database (30) probably contains the largest collection of experimentally determined binding sites from multiple organisms, mainly from eukaryotic organisms. Finally, the PRODORIC database (31) focuses on prokaryotic genomes and contains collections of known binding sites from a number of bacteria, with E.coli, B.subtilis, and Pseudomonas aeruginosa represented by a substantial number of sites. Most of these databases also contain WMs for the TFs for which known sites are available (32). Some of the databases also offer the possibility of scanning intergenic regions with these WMs, and in some cases to filter the resulting sites for conservation in related species. Additionally, databases and web servers have been made available that show the results of ‘phylogenetic footprinting’ methods (33–35), i.e. that display conservation profiles for particular sets of related genomes.
Over the last few years we have developed a number of probabilistic methods (11,16,20) for rigorously combining information from known binding sites and ChIP-on-chip data with motif finding methods and phylogenetic footprinting. By applying these methods we obtain genome-wide regulatory site annotations across different genomes using a unified methodology, which rigorously assigns quality estimates, i.e. posterior probabilities, to all predicted sites. The main aim of SwissRegulon is to make these regulatory site annotations available across as many genomes as possible, both prokaryotic and eukaryotic. In addition we make all the annotations available using a common GBrowse genome browser interface that shows the binding sites in the context of other features annotated on the genome. Through this user-friendly graphical interface the SwissRegulon resource will be useful for people researching regulatory mechanisms both experimentally and computationally.
First, in the near future we will significantly expand the number of organisms represented in SwissRegulon, especially bacterial ones. For the bacteria for which significant collections of known sites exist, e.g. B.subtilis and P.aeruginosa, we will incorporate these sites into the predictions, as is currently done for E.coli. Eventually we also intend to include comprehensive regulatory site annotations for higher eukaryotes, i.e. vertebrate genomes, flies and worms.
Second, we intend to incorporate ChIP-on-chip data in the SwissRegulon database in the near future. The combination of the binding site annotations with condition-specific ChIP-on-chip data will give insight into the conditions under which different sites are bound by their TFs.
Third, currently binding sites are shown on a per genome basis even though site conservation across related organisms is used in the predictions. In the future we intend to provide explicit information about conservation for each binding site and to link each binding site to the orthologous binding sites in the related genomes.
Finally, we have recently implemented a web server for running the PhyloGibbs motif and regulatory site finding algorithm (11). In the future we intend to integrate these two resources. This will allow users to run PhyloGibbs on input data that was selected in the genome browser, and to see the results in the context of the existing regulatory site annotations.
The research in this study was supported by SNF grant 3152A0-105972. Funding to pay the Open Access publication charges for this article was provided by the Biozentrum, University of Basel.
Conflict of interest statement. None declared.