Three-dimensional structures are now known within most protein families, and it is likely, when searching a sequence database, that one will identify a homolog of known structure. The goal of Entrez's 3D-structure database is to make structure information and the functional annotation it can provide easily accessible to molecular biologists. To this end, Entrez's search engine provides several powerful features: (i) links between databases, for example between a protein's sequence and structure; (ii) pre-computed sequence and structure neighbors; and (iii) structure and sequence/structure alignment visualization. Here, we focus on a new feature of Entrez's Molecular Modeling Database (MMDB): graphical summaries of the biological annotation available for each 3D structure, based on the results of automated comparative analysis. MMDB is available at: http://www.ncbi.nlm.nih.gov/Entrez/structure.html.
Received September 30, 2002; Revised and Accepted October 9, 2002
The Molecular Modeling Database (MMDB) is Entrez's ‘Structure’ database ( 1 ). By querying with text terms, one may, for example, identify structures of interest based on a protein name. Links between databases provide other search mechanisms. A query of Entrez's MEDLINE® database, for example, can identify articles referring to a particular protein name. Links from this set of articles to ‘Structure’ may identify structures not found by direct query, since MEDLINE abstracts contain additional descriptive terms. At the time of writing, MMDB serves about 50 000 queries per day.
Experimental 3D structure data are retrieved from the Protein Data Bank ( 2 ). Agreement of atomic coordinate and sequence data for each structure is checked and sequences are automatically modified, if necessary, to achieve exact agreement with coordinates. Data are mapped into a computer-friendly format encoded in ASN.1. This validation and encoding supports interoperable sequence, structure and alignment displays. MMDB currently contains about 20 000 structure entries, corresponding to about 40 000 chains and 70 000 3D domains.
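The agreement check described above can be illustrated with a minimal sketch. It is not MMDB's actual procedure: the function name `reconcile_sequence`, the input layout (a one-letter declared sequence plus a mapping from 0-based position to the three-letter residue name seen in the coordinates), and the use of "X" for unrecognized residues are all assumptions made for illustration.

```python
# Illustrative sketch only; names and data layout are assumptions, not MMDB code.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def reconcile_sequence(declared, observed):
    """Make a declared sequence agree with residues seen in the coordinates.

    declared: one-letter amino acid string.
    observed: dict mapping 0-based sequence position to the three-letter
              residue name found in the coordinate records (assumed to be
              already mapped onto sequence positions).
    Returns the corrected sequence and a list of (position, old, new) changes.
    Positions without coordinates are left as declared.
    """
    seq = list(declared)
    changes = []
    for pos, name in sorted(observed.items()):
        if not 0 <= pos < len(seq):
            raise ValueError(f"position {pos} lies outside the declared sequence")
        letter = THREE_TO_ONE.get(name.upper(), "X")  # "X" marks unknown residues
        if seq[pos] != letter:
            changes.append((pos, seq[pos], letter))
            seq[pos] = letter  # coordinates take precedence over the declared residue
    return "".join(seq), changes


fixed, changes = reconcile_sequence("MKTAY", {1: "LYS", 2: "SER", 4: "TYR"})
print(fixed, changes)
```

In this sketch the coordinates always take precedence, which mirrors the statement that sequences are modified to agree with coordinates; residues lacking coordinates are simply kept as declared.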
Links, neighbours, and visualization
Sequences derived from MMDB are entered into Entrez's protein or nucleic acid sequence database, preserving a link to the corresponding structure. Links to MEDLINE are generated by citation matching ( 1 ). Links to Entrez's organism taxonomy database are validated manually ( 3 ). Sequence neighbours are identified by BLAST ( 4 ), and links to the Conserved Domain Database (CDD) by the reverse PSI-BLAST algorithm ( 5 ). Structure neighbours are identified by VAST ( 6 ). Entrez's integrated viewer, Cn3D ( 7 ), provides molecular-graphics visualization.
Entrez's ‘Structure summary’ provides a concise description of the contents of an MMDB entry and available annotation. Figure 1 presents an example, Hck Kinase, 1QCF ( 8 ). Links to MEDLINE and Taxon are provided together with descriptive text and a ‘View’ control to launch molecular-graphics visualization. The remainder of the display presents a graphical summary of macromolecular components. Each polypeptide (or polynucleotide) is described by a ‘sequence ruler’ that indicates chain lengths and the locations of protein domains. This graphical display links to annotation pertaining to individual chains and protein domains.
MMDB employs two distinct but related definitions of protein domain. ‘3D domains’ are identified automatically as compact units within a polypeptide chain. As shown in Figure 1 , colouring of 3D domains in the molecular graphics display matches that of the ‘boxes’ indicating their locations on the sequence ruler. 3D domains are the units for which automated structure neighbour calculations are performed, and the ‘box’ for each 3D domain (and complete chain) links to a display of its structure neighbours. A link to Entrez's text-listing of 3D domains is useful for advanced queries combining structural similarity with other attributes ( 3 ).
Entrez's CDD defines protein domains as recurrent evolutionary modules. In Figure 1 , for example, a CDD ‘oval’ indicates that the region corresponding to the second 3D domain contains a member of the SH2 family. The SH2 ‘oval’ links to a detailed sequence/structure alignment, as predefined in CDD ( 5 ). Correspondence between 3D domains and conserved domains is not exact. The tyrosine kinase domain (‘TyrKc’) defined in CDD, for example, corresponds to two 3D domains, each representing a compact lobe in the structure.
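The inexact correspondence between conserved domains and 3D domains can be made concrete with a small interval-overlap sketch. The residue ranges, the domain labels, and the 50% overlap threshold below are invented for illustration and are not taken from 1QCF or from any MMDB procedure.

```python
# Illustrative sketch only; ranges, labels and threshold are hypothetical.

def overlap(a, b):
    """Number of residues shared by two inclusive (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def map_domains(conserved, domains3d, min_fraction=0.5):
    """For each conserved domain, list the 3D domains it largely covers.

    A 3D domain is reported when at least `min_fraction` of its residues
    fall inside the conserved-domain footprint.
    """
    mapping = {}
    for cname, cspan in conserved.items():
        hits = []
        for dname, dspan in domains3d.items():
            dlen = dspan[1] - dspan[0] + 1
            if overlap(cspan, dspan) / dlen >= min_fraction:
                hits.append(dname)
        mapping[cname] = hits
    return mapping


# Hypothetical chain: one conserved domain spans two compact 3D domains.
conserved = {"cdA": (10, 100), "cdB": (120, 380)}
domains3d = {"d1": (5, 105), "d2": (118, 200), "d3": (201, 385)}
print(map_domains(conserved, domains3d))
```

Here the hypothetical conserved domain "cdB" maps to two 3D domains, the same one-to-many situation described for the tyrosine kinase domain and its two lobes.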
Structure neighbours are a rich source of biological annotation. Figure 2 shows an example, the structure neighbours of the SH2 domain of 1QCF. The structure of loop regions contributing to the intra-molecular phosphotyrosine binding site is preserved in 1JYR, a complex of Grb2 SH2 domain with a phosphotyrosine-containing peptide ( 9 ), and in 1FBV, a complex of c-Cbl with a phosphotyrosine-containing peptide ( 10 ). One may infer that proteins preserving this site are likely to bind phosphotyrosine. Consistent with this inference, structure 1G99, an Archaeal acetate kinase ( 11 ), does not preserve this site. If this protein shares a common ancestor with SH2 domains, it presumably belongs to a lineage that diverged prior to evolution of phosphotyrosine binding. While superpositions based on 3D domains are normally adequate for structure-function analyses of this kind, Cn3D's alignment editing tools may be used to modify alignments and superpositions when necessary.
On average, there are over 600 structure neighbours for each 3D domain in MMDB. To help identify neighbours that provide useful annotation, Entrez's ‘VAST Summary’ provides a series of controls for selecting and sorting structure neighbours. As illustrated in Figure 2 , the ‘alignment footprint’ of each neighbour indicates the region on the 3D domain serving as query that can be well superposed onto that neighbour. This display identifies structure neighbours similar to one another, where visualization of multiple-structure superpositions is informative. Other controls sort structure neighbours by measures of similarity and select subsets that include only one representative of sequence-similar subgroups. VAST-Search, which identifies neighbours of user-submitted structures, provides the same analysis tools.
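The selection and sorting controls described above can be sketched as follows. This is a simplified illustration, not the VAST Summary implementation: the `Neighbour` record, its `score` and `seq_group` fields, and the use of a Jaccard index to compare alignment footprints are all assumptions, and the example identifiers are placeholders rather than real structures.

```python
# Illustrative sketch only; record fields and the Jaccard measure are assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class Neighbour:
    name: str             # placeholder identifier
    score: float          # hypothetical similarity score, higher is better
    seq_group: int        # hypothetical id of a sequence-similar subgroup
    footprint: frozenset  # query residue positions that superpose well


def representatives(neighbours):
    """Keep the best-scoring neighbour of each sequence-similar subgroup,
    sorted by decreasing score."""
    best = {}
    for n in neighbours:
        current = best.get(n.seq_group)
        if current is None or n.score > current.score:
            best[n.seq_group] = n
    return sorted(best.values(), key=lambda n: n.score, reverse=True)


def footprint_overlap(a, b):
    """Jaccard index of two alignment footprints (1.0 means identical)."""
    union = a.footprint | b.footprint
    if not union:
        return 0.0
    return len(a.footprint & b.footprint) / len(union)


a = Neighbour("A", 12.0, 1, frozenset(range(0, 10)))
b = Neighbour("B", 9.0, 1, frozenset(range(0, 10)))
c = Neighbour("C", 10.0, 2, frozenset(range(5, 15)))

print([n.name for n in representatives([a, b, c])])
print(footprint_overlap(a, b), footprint_overlap(a, c))
```

Neighbours with a high footprint overlap superpose onto the same region of the query, so they are natural candidates for viewing together in a multiple-structure superposition.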
Links to protein classifications like CDD are a valuable source of annotation, since descriptions and functional-site definitions are the result of expert curation. CDD alignments also identify the conserved core, and in future we plan to use this information in sorting structure neighbours ( 12 ). Automated identification of sequence and structure neighbours nonetheless provides the raw material for curated resources, and it allows Entrez users to discover new relationships not yet described there. We plan to further improve tools for identification of informative sequence and structure neighbours.
We thank the NIH Intramural Research Program for support. Questions should be addressed to: email@example.com .