DNA-binding domain (DBD) is a database of predicted sequence-specific DNA-binding transcription factors (TFs) for all publicly available proteomes. The number of proteomes has increased from 150 in the initial version of DBD to over 700 in the current version. All predicted TFs must contain a significant match to a hidden Markov model representing a sequence-specific DNA-binding domain family. Access to TF predictions is provided through http://transcriptionfactor.org, where new search options are now available, such as searching by gene name in model organisms and searching for all proteins that contain a particular DBD family in a specific organism. We illustrate the application of this type of search facility by contrasting trends of DBD family occurrence throughout the tree of life, highlighting the clear partition between eukaryotic and prokaryotic DBD expansions. The website content has been expanded to include dedicated pages for each TF containing domain assignment details, gene names, links to external databases and links to TFs with similar domain arrangements. We compare the increase in the number of predicted TFs with proteome size in eukaryotes and prokaryotes. The number of TFs increases more slowly with proteome size in eukaryotes than in prokaryotes, which could be due to the presence of splice variants or an increase in combinatorial control.
Sequence-specific DNA-binding transcription factors (TFs) each recognize a family of cis-regulatory DNA sequences described by a consensus motif ( 1 ) or position-specific weight matrix ( 2 ). They regulate spatial and temporal gene expression by binding to DNA and either activating or repressing the action of RNA polymerase. Like other proteins, TFs are composed of evolutionary units called domains, which belong to families that can occur in many different proteins and in various domain combinations. In the DBD database, we define TFs as proteins containing a sequence-specific DNA-binding domain (DBD). Other databases, such as TrSDB ( 3 ), or data sets, such as Messina et al. ( 4 ), include both specific and general TFs. Our precise definition of TFs as sequence-specific DNA-binding proteins is useful in a wide variety of studies. Examples include: improving genome annotation; high-throughput experiments such as ChIP–chip, protein chip or yeast one-hybrid ( 5 ); and studies of the evolution of gene regulation comparing multiple genomes ( 6 ), or gene regulation networks ( 7 ). The DBD database has been used as an annotation tool in the context of the InterPro ( 8 ) and FlyTF (http://FlyTF.org) ( 9 ) databases.
Access to the DBD database is via http://transcriptionfactor.org, where all data are available for viewing and immediate download. The community can browse predictions by species (over 700, from Arabidopsis thaliana to Zymomonas mobilis) or by DBD family (including helix–turn–helix, zinc-finger, homeobox and many others); search predictions by sequence identifier or domain family; receive classifications for submitted protein sequences; and download our domain assignments, as well as our manually curated list of DBDs.
The prediction method in the DBD database ( 10 ) uses hidden Markov models (HMMs) to identify domains in proteins from two databases: SUPERFAMILY ( 11 ) and Pfam ( 12 ). From DBD release 2.0 onwards, updated annotation resulted in 303 HMMs from SUPERFAMILY and 145 from Pfam, compared to a total of 251 HMMs in the first version of DBD. The HMMs from SUPERFAMILY represent 37 superfamilies and 87 families according to the definitions in the SCOP database ( 13 ). This includes 98 new models representing 37 sequence-specific DBD families. For the 150 organisms in the original version of DBD, the updated models increased the number of TF predictions by 4.7%.
The pipeline used to predict TFs begins with a domain annotation of all proteins from completely sequenced genomes with all HMMs from the SUPERFAMILY and Pfam databases (Supplementary Figure 1). A protein is classified as a TF if it has a significant match to a model we annotated as being a DBD, with the significance thresholds for HMM matches taken from the Pfam and SUPERFAMILY databases. This results in an estimated 1–5% of false-positive annotations. The TF predictions are limited to the families in our annotated collection, which means that the coverage is about two-thirds of known TFs. At the same time, up to an additional 50% of proteins are predicted as TFs that have annotations such as ‘hypothetical protein’, particularly in metazoan genomes. For details of benchmarking, please refer to ( 10 ). The prediction method is general and applicable to any proteome or sequence set. In fact, the database has grown to encompass TF repertoires of over 700 publicly available genomes. Predictions for newly sequenced genomes are continuously added to the database.
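The classification rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used to build the database: the names `DomainHit` and `predict_tfs`, the model identifiers and the numerical cutoffs are all hypothetical, and significance is represented here by a per-model E-value cutoff, whereas the actual thresholds are those taken from the Pfam and SUPERFAMILY databases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainHit:
    """One HMM match against a protein (hypothetical record layout)."""
    protein_id: str
    model_id: str
    evalue: float


def predict_tfs(hits, dbd_models, thresholds, default_threshold=1e-2):
    """Return the IDs of proteins with at least one significant DBD match.

    hits              -- iterable of DomainHit for all proteins and all models
    dbd_models        -- set of model IDs curated as sequence-specific DBDs
    thresholds        -- per-model significance cutoffs (assumed E-values)
    default_threshold -- assumed fallback cutoff for models without an entry
    """
    predicted = set()
    for hit in hits:
        if hit.model_id not in dbd_models:
            continue  # matches to non-DBD models never make a protein a TF
        cutoff = thresholds.get(hit.model_id, default_threshold)
        if hit.evalue <= cutoff:
            predicted.add(hit.protein_id)
    return predicted


# Toy input: identifiers and scores are invented for illustration.
hits = [
    DomainHit("protA", "HTH_example", 1e-20),     # significant DBD match
    DomainHit("protB", "HTH_example", 0.5),       # DBD match, not significant
    DomainHit("protC", "kinase_example", 1e-30),  # significant, but not a DBD
]
dbd_models = {"HTH_example"}
thresholds = {"HTH_example": 1e-3}

predicted = predict_tfs(hits, dbd_models, thresholds)
print(sorted(predicted))
```

In this toy input only `protA` qualifies: `protB` matches a DBD model below significance, and `protC` has a strong match to a model that is not in the curated DBD list.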
The current DBD database contains information on over 200 000 predicted TFs. These TFs are distributed across the tree of life. Not surprisingly, we find a greater number of TFs in larger genomes. To investigate the relationship between TF abundance and proteome size in different lineages, we graph these variables on a log–log plot as in Kummerfeld and Teichmann ( 10 ) (Supplementary Figure 2 in this paper). To illustrate the difference between the eukaryotic and prokaryotic superkingdoms, we perform a separate model fitting for each of these lineages. From the linear relationship on the log–log scale a power law can be inferred. This power law could be due to the underlying distribution of DBDs: a small number of DBDs (such as the helix–turn–helix and zinc-finger families) occur in the majority of TFs, whereas most DBDs occur in only a small number of TFs. In agreement with van Nimwegen ( 14 ) and Ranea et al. ( 15 ), we find that a higher proportion of TFs is required to regulate larger proteomes. We also find that TF abundance in archaea and bacteria expands more rapidly than in eukaryotes. Thus, in general, the same number of TFs regulates fewer prokaryotic genes than eukaryotic genes. The higher degree of combinatorial control in eukaryotes, where gene expression is regulated not by one TF but by a group of TFs, may also contribute to the lower eukaryotic TF requirements: different combinations of TFs mean that the number of gene regulation modes can grow with a smaller increase in the number of TFs. Bacteria and archaea obey the same power law in terms of number of TFs and number of proteins. This is in accordance with their shared repertoire of DBD families, which we will return to below.
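A straight line on a log–log plot corresponds to a power law of the form T = c·N^a, where N is the proteome size, T the number of TFs and a the slope of the line. The sketch below shows one simple way such a fit could be made separately for two lineages, using ordinary least squares on log-transformed values. It is an illustration only: the function name `fit_power_law` is hypothetical, the data are synthetic, and the exponents used to generate them are arbitrary rather than values from our analysis.

```python
import math


def fit_power_law(proteome_sizes, tf_counts):
    """Fit tf_count = c * proteome_size ** exponent.

    The fit is ordinary least squares on log10-transformed values, so every
    size and count must be strictly positive.
    Returns (exponent, c).
    """
    xs = [math.log10(n) for n in proteome_sizes]
    ys = [math.log10(t) for t in tf_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx                        # slope of the log-log line
    c = 10 ** (mean_y - exponent * mean_x)      # intercept, back-transformed
    return exponent, c


# Synthetic proteome sizes; the generating exponents (1.8 and 1.2) are
# arbitrary and serve only to show that the fit recovers them.
sizes = [1000, 2000, 4000, 8000, 16000]
lineage_fast = [1e-4 * n ** 1.8 for n in sizes]  # steeper TF expansion
lineage_slow = [1e-2 * n ** 1.2 for n in sizes]  # shallower TF expansion

exp_fast, c_fast = fit_power_law(sizes, lineage_fast)
exp_slow, c_slow = fit_power_law(sizes, lineage_slow)
print(round(exp_fast, 3), round(exp_slow, 3))
```

Fitting each lineage separately makes the comparison of slopes direct: the lineage with the larger exponent adds TFs more rapidly as its proteome grows.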
Apicomplexa appear not to follow either the prokaryote or typical eukaryote trends, perhaps because they are obligate parasites, and only survive in the nutrient-rich environment of their hosts. Thus, a different mode of gene regulation may be used by this lineage, or it is possible that their TFs are not well characterized by the current model libraries. Below, we will illustrate in more detail how the DBD database provides a consistent framework for comparison of the distribution of DBDs across the tree of life.
Researchers can use the DBD database in several ways. For instance, all TF predictions are available to download. However, most users are only interested in a small number of TFs, so we have expanded the website search options to allow retrieval of individual TFs and subsets of TFs. New search capabilities include: searching for gene names, for example lacI or P53; listing all TFs that contain either a specified DBD or non-DBD family, for instance all TFs containing the bZIP (leucine zipper) family; and retrieving all TFs containing a specified DBD family that occur in a particular organism, e.g. all homeodomain-containing TFs in human (Figure 1a and b).
We illustrate the search for TFs containing a specified DBD family in a particular organism in Figure 1, where a hypothetical researcher is interested in the homeobox TFs. These TFs are known to regulate vertebrate limb formation amongst other processes ( 16 ). Figure 1a depicts the search for TFs in Homo sapiens containing the homeobox domain. A subset of the results of this search is shown in Figure 1b. By selecting the HOXA9 TF from this result set, the researcher can examine one of the new pages containing detailed information on each TF (Figure 1c). The detailed pages include the sequence of the TF, links to external databases containing further information on the protein, domain assignment regions and an indication of the quality of the domain assignment in the form of an E-value. Links to predicted TFs with similar domain combinations are also provided on these pages. An example of predicted TFs with similar Pfam architectures to the HOXA9 TF (i.e. an N-terminal Hox9 activation region and a C-terminal homeobox domain) is shown in Figure 1d.
Using the data on DBD families in different organisms, we compare the occurrence of DBDs (from the Pfam project) across the tree of life. The heatmap in Figure 2 demonstrates the lineage-specific DBD expansions and contractions. The lists of species and DBDs are included in Supplementary Tables 1 and 2. We counted the number of occurrences of each DBD in each organism, and then normalized this number by the proteome size of that organism. In order to represent both contractions and expansions, we calculated a Z-score for each of the normalized DBD occurrence values. The Z-score is calculated from the distribution of normalized DBD occurrence across genomes for a particular DBD family, and has a mean of zero and a standard deviation of one. It is negative when the normalized DBD occurrence is below the mean, and positive when above the mean. In Figure 2, DBD expansions (positive Z-scores) are represented using red, and contractions (negative Z-scores) using green.
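The normalization and Z-score calculation described above can be sketched as follows for a single DBD family. This is a minimal illustration, not the code used to produce Figure 2: the function name `dbd_zscores`, the organism labels and the counts are hypothetical, and the use of the population standard deviation is an assumption (it is the choice for which the resulting Z-scores have a standard deviation of exactly one).

```python
from statistics import mean, pstdev


def dbd_zscores(counts, proteome_sizes):
    """Z-scores of normalized occurrence for one DBD family.

    counts         -- {organism: number of occurrences of the DBD family}
    proteome_sizes -- {organism: number of proteins in that proteome}
    Returns {organism: Z-score}.
    """
    # Normalize each count by the proteome size of the same organism.
    normalized = {org: counts[org] / proteome_sizes[org] for org in counts}
    values = list(normalized.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        # No variation across organisms: neither expansion nor contraction.
        return {org: 0.0 for org in normalized}
    return {org: (value - mu) / sigma for org, value in normalized.items()}


# Invented counts and proteome sizes for three hypothetical organisms.
counts = {"org_a": 10, "org_b": 40, "org_c": 100}
proteome_sizes = {"org_a": 5000, "org_b": 10000, "org_c": 20000}

zscores = dbd_zscores(counts, proteome_sizes)
for organism, z in sorted(zscores.items()):
    label = "expansion" if z > 0 else "contraction"
    print(organism, round(z, 2), label)
```

Because the Z-score is computed per family, a family that is rare everywhere can still show a strong expansion in the one lineage where it is comparatively frequent.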
Different sets of DBDs expand in different lineages. There is a clear separation between the DBD occurrence pattern in eukaryotes (in the top section of the heatmap) and prokaryotes. The DBD occurrence in prokaryotes is relatively uniform. For instance, there is a significant overlap between the DBD repertoires of the actinobacteria, proteobacteria and firmicutes. This is almost certainly due to the ubiquitous horizontal gene transfer between prokaryotes. The DBD expansion pattern in archaea is similar to that in bacteria, even though archaea share their conserved basal transcriptional machinery with eukaryotes rather than with bacteria. The majority of these prokaryotic DBDs have the helix–turn–helix as part of their structure ( 17 ).
The eukaryote-specific DBD expansions have considerably greater variety than the prokaryotic expansions. DBDs are also more kingdom-specific in the eukaryotes: the metazoan, fungal and plant kingdoms are clearly distinguishable (Figure 2a). The fungal and metazoan kingdoms share more DBDs than the plant and metazoan kingdoms, which reflects their closer phylogenetic relationship ( 18 ). The metazoa, in the top right section of Figure 2a, have the largest kingdom-specific DBD repertoire. This is most likely due to the regulatory overhead of metazoan complexity in terms of cell types.
The significant plant-specific DBD expansion is possibly due to the regulation of a large defence system, which plants have because they cannot escape toxic environmental conditions. Figure 2b clarifies the nature of the DBD expansions in the viridiplantae lineage. The AP2 family is expanded throughout this lineage, but is believed to also occur in the apicomplexa ( 19 ). Figure 2c shows the AP2 domain in complex with DNA. This family is known to bind to the GCC-box pathogenesis-related promoter element ( 20 ) and activate defence genes. Several families are specifically expanded in the plant genomes of A. thaliana, Medicago truncatula and Oryza sativa (as opposed to the other viridiplantae, which are algae), including the family of ethylene insensitive 3 (EIN3) DBDs. This family regulates transcription in response to the chemically simplest plant hormone, ethylene ( 21 ).
Above we described novel developments in the display facilities and search tools, as well as the content of the DBD database, with a few examples of the type of insight this provides. In the future, we will continue to update the HMM libraries, which will result in improvements to the TF prediction coverage. When updating the Pfam HMMs we will make use of, and incorporate, the Pfam clan information ( 12 ). We will also continue to add and update predictions for new proteomes. Exciting new eukaryotic proteomes we hope to add soon include higher eukaryotes such as orangutan, marmoset and wallaby, disease vector insects, additional nematodes and several plants.
We have excluded several eukaryotic genomes (Xenopus tropicalis, Apis mellifera and Populus trichocarpa) from our analysis of DBD occurrence because they contain uncharacteristically high numbers of bacterial DBDs. This was a known problem in the X. tropicalis (frog) genome ( 22 ). The use of lineage-specific information on the occurrence of DBDs is a promising method for reducing false-positive TF classifications in the eukaryotes.
We also plan to refine the TF prediction procedure by taking into account that DBDs have typical patterns of domain repetition or combination with other DBDs or non-DBDs. It may be possible to make use of over-represented domain combinations to further improve our predictions, for instance by including marginal DBD matches if they occur in common TF domain arrangements as indicated by the statistical methods used in ( 23 ) and ( 24 ).
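One way this planned refinement might look is sketched below. It is purely illustrative and does not describe an implemented feature of the database: the function `classify_with_context`, the two cutoffs, the domain labels and the set of common architectures are all hypothetical, and the set of over-represented domain arrangements is simply taken as given here rather than derived with the statistical methods of ( 23 ) and ( 24 ).

```python
def classify_with_context(domains, dbd_models, common_architectures,
                          strict_cutoff=1e-3, marginal_cutoff=1e-1):
    """Decide whether a protein is predicted as a TF.

    domains              -- list of (model_id, evalue), ordered N- to C-terminal
    dbd_models           -- set of model IDs curated as sequence-specific DBDs
    common_architectures -- set of model-ID tuples assumed to be common TF
                            domain arrangements
    strict_cutoff        -- assumed E-value cutoff for a confident DBD match
    marginal_cutoff      -- assumed looser cutoff for a marginal DBD match
    """
    dbd_evalues = [e for model, e in domains if model in dbd_models]
    if not dbd_evalues:
        return False  # no DBD match at all
    best = min(dbd_evalues)
    if best <= strict_cutoff:
        return True   # confident DBD match, as in the current procedure
    # Marginal match: accept only within a known common TF arrangement.
    architecture = tuple(model for model, _ in domains)
    return best <= marginal_cutoff and architecture in common_architectures


# Invented labels and scores for illustration.
dbd_models = {"homeobox_like"}
common_architectures = {("activation_region", "homeobox_like")}

in_context = [("activation_region", 1e-12), ("homeobox_like", 0.05)]
out_of_context = [("unrelated_domain", 1e-12), ("homeobox_like", 0.05)]

print(classify_with_context(in_context, dbd_models, common_architectures))
print(classify_with_context(out_of_context, dbd_models, common_architectures))
```

The same marginal DBD match is accepted in the first protein and rejected in the second, because only the first occurs in an arrangement listed as common.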
Supplementary Data are available at NAR Online.
We gratefully acknowledge comments on the manuscript from Subhajyoti De and Siarhei Maslau. This work was funded by Medical Research Council; Royal Thai Government Scholarship to VC. Funding to pay the Open Access publication charges for this article was provided by Medical Research Council.
Conflict of interest statement . None declared.