ProDom is a comprehensive database of protein domain families generated from the global comparison of all available protein sequences. Recent improvements include the use of three-dimensional (3D) information from the SCOP database; a completely redesigned web interface (http://www.toulouse.inra.fr/prodom.html); visualization of ProDom domains on 3D structures; coupling of ProDom analysis with the Geno3D homology modelling server; and Bayesian inference of evolutionary scenarios for ProDom families. In addition, we have developed ProDom-SG, a ProDom-based server dedicated to the selection of candidate proteins for structural genomics.
Received September 15, 2004; Revised and Accepted September 23, 2004
The ProDom protein domain family database originates from the early recognition that automated methods are needed to achieve comprehensive protein domain analysis ( 1 – 3 ). This comprehensiveness makes ProDom a unique resource that usefully complements expert-derived databases, such as PROSITE ( 4 ), PFAM ( 5 ) or SMART ( 6 ), thereby helping to sustain the rapid growth of InterPro ( 7 , 8 ). In recent years, we have developed relationships between ProDom and three-dimensional (3D) structural information, both for ProDom construction and in the ProDom user interface.
METHODS USED TO CONSTRUCT ProDom
ProDom2004.1 was built anew from the SWISS-PROT41.23 and TrEMBL24.11 databases, essentially as described previously ( 2 , 7 ). In the first stage, well-characterized domain families were used to recruit homologous domains using PSI-BLAST. Among these families, 21 families were derived from in-house expertise and 1352 structural domain families were selected from SCOP release 1.63 ( 9 ) using the ASTRAL compendium ( 10 ) on the basis of the following criteria: (i) length homogeneity (the shortest sequence should be at most 25% shorter than the longest); (ii) sequence homogeneity (family diameter below 450 PAM); (iii) the family should contain at least two different domains; (iv) domains should not contain internal repeats; and (v) they should not be longer than 500 amino acids. In the second stage, domain families were automatically clustered using the MKDOM2 program ( 2 ). The resulting protein domain families were aligned using an improved parallelized program called ProDomAlign, developed in C++ using OpenMP. ProDomAlign is based on MultAlin ( 11 ), a program well suited to aligning very large sequence families (thousands of sequences). Multiple alignments were assessed using the norMD objective function proposed by Thompson et al. ( 12 ). Other consistency indicators include family diameter and radius of gyration as proposed previously ( 3 ). Each family is identified by a unique accession number, which is tracked across successive ProDom releases using the MatchDom program ( 7 ). Links among ProDom, InterPro ( 8 ) and Pfam-A ( 5 ) were calculated using MatchDom. Links with PROSITE were calculated using pftools ( 4 ). Matches with the Protein Data Bank (PDB) ( 13 ) were more difficult to maintain because of sequence numbering inconsistencies. We therefore searched the PDB for sequence identity with ProDom domains using the PDB ATOM record rather than SEQRES, in order to superimpose sequence and structural information.
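The five selection criteria applied to SCOP families can be read as a simple predicate. The following Python sketch is only an illustration of that reading and is not the code used to build ProDom: the class and function names are hypothetical, the family diameter and the internal-repeat flag are assumed to have been computed beforehand, and "at least two different domains" is interpreted here as at least two distinct sequences.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Domain:
    sequence: str
    has_internal_repeats: bool = False  # assumed to be precomputed elsewhere


@dataclass
class SeedFamily:
    name: str
    domains: List[Domain]
    diameter_pam: float  # assumed to be precomputed elsewhere


def passes_seed_criteria(family: SeedFamily) -> bool:
    """Apply criteria (i) to (v) to one candidate structural family."""
    lengths = [len(d.sequence) for d in family.domains]
    if not lengths:
        return False
    shortest, longest = min(lengths), max(lengths)
    distinct_sequences = {d.sequence for d in family.domains}
    return (
        shortest >= 0.75 * longest          # (i) at most 25% shorter than the longest
        and family.diameter_pam < 450       # (ii) diameter below 450 PAM
        and len(distinct_sequences) >= 2    # (iii) at least two different domains
        and not any(d.has_internal_repeats for d in family.domains)  # (iv)
        and longest <= 500                  # (v) no domain longer than 500 residues
    )
```

Under these assumptions, a family with domains of 100 and 80 residues and a diameter of 300 PAM passes, whereas the same family with a 70-residue member fails criterion (i).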
The current ProDom2004.1 release covers 726 272 protein sequences and contains 186 303 domain families with two or more domains.
THE GRAPHICAL USER INTERFACE ON THE ProDom WEBSITE
The ProDom website was completely redesigned to provide a more ergonomic user interface. The main ProDom form consists of two parts. The first part (ProDom Browsing) allows querying of ProDom in a variety of ways: (i) by accession number (Display a ProDom entry); (ii) by the display of all proteins belonging to one or several ProDom families with logical AND/OR operators (All proteins in ProDom families); (iii) by related databases (InterPro, PROSITE, PFAM or PDB); (iv) by SWISS-PROT/TrEMBL identifier or accession number; and (v) by keyword search with AND/OR operators. The output is either information on a given domain family ( Figure 1 ) or cartoons displaying the domain arrangements of all proteins matching the query ( Figure 2 ). The number of different cartoons available for domain display was increased from 14 160 to 237 888 with the use of 64 colours, making outputs more legible while preserving consistency across different displays.
The ProDom graphical interface also provides for the display of ProDom domains on 3D structures ( Figure 1 ). It is possible to display one or all ProDom domains, either on one polypeptide chain or on all chains, with different colour codes for different domain families. These domain-enhanced structures can be displayed with Rasmol ( 14 ), MDL Chime or in VRML (Virtual Reality Modeling Language), provided the corresponding helper applications have been installed. Alternatively, they can be rendered as static images from three different angles generated with the help of DSSP ( 15 ) and MOLSCRIPT ( 16 ). Users may also choose to define particular viewing angles and opt for stereo display.
The second part of the main ProDom form allows for BLAST searches in ProDom (Compare your sequence with ProDom), suggesting a possible domain arrangement for any query protein. When 3D structures are available for target domains, the output is directly linked to both SWISS-MODEL ( 17 ) and Geno3D ( 18 ) servers for homology-based domain modelling.
ProDom-CG: RESTRICTION OF ProDom TO COMPLETED GENOMES
ProDom-CG is a subset of ProDom, restricted to sequences derived from completely sequenced genomes. Bacterial protein sets ( 19 ) were retrieved from the ExPASy server ( ftp://www.expasy.org/databases/hamap/complete_proteomes ), while eukaryotic protein sets were retrieved from the EBI server ( http://www.ebi.ac.uk/integr8 ). All relevant multiple alignments and characteristics were recalculated on the resulting families. In order to provide insight into the evolution of domain families, we used a Bayesian network methodology to infer the most probable evolutionary scenario for each family. Such a scenario may be complex, including domain loss or horizontal transfer events. The taxonomy tree encompassing completely sequenced genomes was colour-coded so as to indicate ancestral nodes predicted to contain domains in a given ProDom-CG family. These colour-coded trees are available for each ProDom-CG entry on the ProDom website.
ProDom-SG FOR STRUCTURAL GENOMICS
In the framework of structural genomics projects, it is extremely useful to identify potential targets unlikely to share homology with already known structures. We therefore developed the ProDom-SG (Structural Genomics) server, designed to assist in the selection of protein domain families corresponding to potentially new folds on the basis of lack of detected homology. The server also allows for the identification of favourable protein candidates for crystallization studies. ProDom-SG was built in three steps. In the first step, only ProDom families with norMD values above 0.5 were considered. In the second step, potential homology relationships between ProDom families were identified using PSI-BLAST with family-specific position-specific scoring matrices. When applicable, the existence of such related families is indicated using a specific logo appearing at the top of the family information sheet (in field A, Figure 1 ). In the third step, both direct and indirect links to the PDB were recorded for each family. A direct link implies that an experimental structure is available for at least one domain in the ProDom family, whereas an indirect link reflects the existence of structural information for at least one domain in a related family. These relationships are stored in a PostgreSQL database that can be accessed on the ProDom-SG website and can be queried by keyword or species of interest. The user can retrieve ProDom families either linked or not linked to the PDB, directly or indirectly. Thus ProDom-SG provides for inspection of domain families on the basis of structure availability, which makes it possible to characterize protein families containing a known fold. Conversely, ProDom-SG provides a quick handle on candidate domain families for which structural information is neither available nor readily inferred. Such families are indicated by a ProDom-SG logo appearing at the top of the family information sheet (in field A, Figure 1 ).
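The three construction steps can be summarized as a small classification routine. The Python sketch below is a hedged illustration only, not the ProDom-SG implementation: the function name, the input dictionaries and the family accessions in the usage note are hypothetical, the PSI-BLAST relationships are assumed to be available as a precomputed mapping, and an indirect link is taken to mean a single step to a related family, as stated in the text.

```python
from typing import Dict, Set


def classify_pdb_links(
    normd: Dict[str, float],
    related: Dict[str, Set[str]],
    has_structure: Set[str],
    normd_min: float = 0.5,
) -> Dict[str, str]:
    """Label each retained family as 'direct', 'indirect' or 'none'.

    normd         -- norMD score per family (step 1 keeps scores above normd_min)
    related       -- families related by PSI-BLAST (step 2, assumed precomputed)
    has_structure -- families with at least one domain matching a PDB entry
    """
    links = {}
    for family, score in normd.items():
        if score <= normd_min:
            continue  # step 1: discard poorly aligned families
        if family in has_structure:
            links[family] = "direct"
        elif any(r in has_structure for r in related.get(family, ())):
            links[family] = "indirect"
        else:
            links[family] = "none"  # candidate for a potentially new fold
    return links
```

For example, with hypothetical families PD1 (structure known), PD2 (related to PD1) and PD3 (no relatives), the routine labels them "direct", "indirect" and "none" respectively; families labelled "none" correspond to those flagged with the ProDom-SG logo.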
ProDom families retrieved can be further filtered on the basis of sequence homogeneity (family diameter) and the number of domains in the family. It is also possible to restrict the search to ProDom families containing at least one mono-domain protein, thus obviating the need for engineering individual domains separately before crystallization attempts.
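These optional filters amount to three independent tests on each retrieved family. The following Python sketch shows one way they could be combined; it is an assumption-laden illustration rather than the server's code. The record layout, parameter names and accessions are hypothetical, and a mono-domain protein is taken to be a protein containing exactly one ProDom domain.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FamilyRecord:
    accession: str
    diameter_pam: float
    member_proteins: List[str]  # one entry per domain occurrence in the family


def filter_families(
    families: List[FamilyRecord],
    domains_per_protein: Dict[str, int],
    max_diameter: Optional[float] = None,
    min_domains: Optional[int] = None,
    require_mono_domain: bool = False,
) -> List[str]:
    """Return accessions of families passing every filter that is switched on."""
    selected = []
    for fam in families:
        if max_diameter is not None and fam.diameter_pam > max_diameter:
            continue  # family too heterogeneous in sequence
        if min_domains is not None and len(fam.member_proteins) < min_domains:
            continue  # family too small
        if require_mono_domain and not any(
            domains_per_protein.get(p, 0) == 1 for p in fam.member_proteins
        ):
            continue  # no member protein consists of a single domain
        selected.append(fam.accession)
    return selected
```

Leaving a parameter at its default disables the corresponding filter, which mirrors the optional nature of these restrictions on the query form.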
AVAILABILITY AND LICENSING
The ProDom database is copyrighted by INRA and CNRS. ProDom is freely accessible at http://www.toulouse.inra.fr/prodom.html but commercial users need to sign a license agreement for download and local usage.
The ProDom project was supported by the ‘Programme de Bio-Informatique Inter-Organismes’, the ‘Réseau des Génopoles’ and the European Union.