Three-dimensional structures are now known within many protein families and it is quite likely, in searching a sequence database, that one will encounter a homolog with known structure. The goal of Entrez’s 3D-structure database is to make this information, and the functional annotation it can provide, easily accessible to molecular biologists. To this end Entrez’s search engine provides three powerful features. (i) Sequence and structure neighbors; one may select all sequences similar to one of interest, for example, and link to any known 3D structures. (ii) Links between databases; one may search by term matching in MEDLINE, for example, and link to 3D structures reported in these articles. (iii) Sequence and structure visualization; identifying a homolog with known structure, one may view molecular-graphic and alignment displays, to infer approximate 3D structure. In this article we focus on two features of Entrez’s Molecular Modeling Database (MMDB) not described previously: links from individual biopolymer chains within 3D structures to a systematic taxonomy of organisms represented in molecular databases, and links from individual chains (and compact 3D domains within them) to structure neighbors, other chains (and 3D domains) with similar 3D structure. MMDB may be accessed at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure.
Received September 20, 2001; Accepted September 24, 2001.
Experimental 3D structure data for Entrez (1) are retrieved from the RCSB Protein Data Bank (PDB) (2). Theoretical models from PDB are omitted. Agreement of atomic coordinate and chemical-sequence data is checked and sequence data are automatically modified, if necessary, to achieve exact agreement with coordinates. Data are mapped into an easily parsed form encoded in the ASN.1 language (3). This validation and encoding allow Entrez’s molecular-graphics viewer, Cn3D (4), to support integrated sequence, structure and alignment displays efficiently. Author-annotated features provided by PDB are fully recorded in MMDB (5). Uniformly defined secondary-structure and 3D-domain features are added, to support structure neighbor calculations. Coordinate subsets representing backbone-only and single-conformer models are also added, to support Cn3D visualization and structure neighbor calculations. MMDB currently contains ∼15 000 structure entries, corresponding to ∼35 000 chains and ∼50 000 3D domains.
Links, neighbors and visualization
Sequences derived from MMDB entries are entered into Entrez’s protein and nucleic acid sequence databases, preserving a link to the corresponding 3D structure. Links to the MEDLINE scientific literature database are generated by processing citation data within MMDB. These links allow Entrez to provide access to publications describing the original structure determination. Sequence neighbors of MMDB-derived sequences are identified automatically using the BLAST algorithm (6). Sequence-neighbor relationships are reciprocal, and MMDB-derived sequences also appear as neighbors of other sequences in Entrez. Structure neighbors are identified using the VAST algorithm, a structure–structure alignment method (7). While VAST uses a conservative significance threshold, the structural similarities it detects often represent remote relationships not detectable by sequence comparison. Some structural similarities may represent evolutionary convergence, however, and the Cn3D viewer provides 3D superpositions, so that users may examine and interpret structural similarities for themselves. Cn3D is available at http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml.
Links to NCBI’s taxonomy database (1) are generated by semi-automatic processing of ‘source’ and other descriptive text provided by PDB. Since PDB staff refer to the taxonomy database when creating ‘source’ descriptions (2), links normally follow the genus and species information provided. In some cases ‘source’ descriptions may omit genus and species, and refer only to the manner in which a sample was obtained or prepared. In these cases other descriptive information is examined manually, and sequence-similarity searches are sometimes conducted in an effort to determine an appropriate taxonomy link. The ‘source’ string for PDB entry 1FU2, for example, is ‘synthetic construct’. The primary citation provided by PDB indicates that the sample is human insulin, however, and a link to taxon Homo sapiens was therefore assigned within MMDB. We note that taxonomy is assigned at the level of individual chains and also recorded in MMDB-derived sequence records. Taxonomy assignments have been made for all MMDB entries, and new organisms represented only in MMDB have been added to the taxonomy database, in consultation with NCBI taxonomists.
We emphasize that taxonomy links in MMDB provide more than a means to search for structures from a particular genus or species. In each case a complete lineage, or location in the ‘tree of life’, has been recorded via the link to the NCBI taxonomy database. This means that one can search in Entrez for all 3D structures from mammals, for example, or from other taxonomic groups above the level of genus and species. This type of search is not possible using PDB files, which do not contain lineage information. To illustrate this capability, we survey in Figure 1 how some major taxonomic groups are populated by the 3D-structure database. The figure shows, for selected taxa, the numbers of species for which one or more structures are known and the total number of structures by taxon. We also list in Table 1 the 10 species for which the most structures have been determined. Further information on MMDB taxonomy assignments is available at http://www.ncbi.nlm.nih.gov/Structure/PDBEAST/pdbeast.shtml. A browser for the NCBI taxonomy database, useful for identifying the scientific names of different taxa, is accessible at http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi.
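The value of recording a complete lineage can be illustrated with a minimal sketch. It assumes a toy parent table and a hypothetical structure-to-organism mapping; the four PDB identifiers and their organisms are those mentioned in this article, but the tree is heavily abbreviated and the function names are illustrative, not part of Entrez or MMDB.

```python
# Toy lineage table: each taxon maps to its parent (None marks the root).
# Heavily abbreviated for illustration; real lineages have many more ranks.
PARENT = {
    "root": None,
    "Eukaryota": "root",
    "Mammalia": "Eukaryota",
    "Homo sapiens": "Mammalia",
    "Malus x domestica": "Eukaryota",
    "Eubacteria": "root",
    "Escherichia coli": "Eubacteria",
    "Archaea": "root",
    "Pyrococcus horikoshii": "Archaea",
}

# Structure-to-organism assignments, as described in the text of this article.
STRUCTURE_TAXA = {
    "1FU2": "Homo sapiens",
    "1B8G": "Malus x domestica",
    "1AMQ": "Escherichia coli",
    "1DJU": "Pyrococcus horikoshii",
}


def lineage(taxon):
    """Return the path from `taxon` up to the root of the toy tree."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path


def structures_in(group, structure_taxa):
    """Select structures whose source organism lies anywhere under `group`."""
    return sorted(
        pdb_id for pdb_id, taxon in structure_taxa.items()
        if group in lineage(taxon)
    )


print(structures_in("Mammalia", STRUCTURE_TAXA))
print(structures_in("Eukaryota", STRUCTURE_TAXA))
```

A search restricted to genus and species strings could only match the leaves of this table; walking the lineage is what makes a query such as "all structures from mammals" possible.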
Related 3D domains
Calculations for MMDB’s structure neighbor database and visualization of VAST superpositions have always employed comparisons at the level of individual chains and compact 3D domains. In earlier versions of Entrez’s search engine, however, these were recorded only as ‘related structures’, a list containing the structure neighbors for all chains and 3D domains of a given structure. The current Entrez version links each structure to its ‘3D domains’, a list of all polypeptide chains and any compact domains within them. Each ‘3D domain’ is in turn linked to ‘related 3D domains’, that is, the structure neighbors of that particular chain or domain. In earlier versions, for example, structure neighbors of an antibody–lysozyme complex included both antibody and lysozyme structures. In the current version the structure neighbors of the ‘3D domain’ representing lysozyme list other lysozyme chains, while those of the antibody chains (and their compact domains) list other immunoglobulin family structures.
3D domains within individual polypeptide chains in MMDB are identified automatically, using an algorithm that searches for one or more breakpoints, falling between major secondary structure elements, such that the ratio of intra- to inter-domain contacts falls above a set threshold (8). This method is very similar to others proposed for identification of autonomously folding domains from 3D structure data, such as that of Holm and Sander (9). We emphasize that 3D domains identified in this way provide means to increase the sensitivity of structure neighbor calculations (7), and to present 3D superpositions based on compact domains as well as complete polypeptide chains. They are not intended to represent domains identified by comparative sequence and structure analysis, as modules that recur in related proteins, though there is often good agreement between domain boundaries identified by these methods (10). NCBI’s Conserved Domain Database (CDD) provides information on domains identified by comparative analysis (11) and is available at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. We note that structure neighbors for domains with boundaries chosen by the user are available through VAST-Search at http://www.ncbi.nlm.nih.gov/Structure/VAST/vastsearch.html.
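The breakpoint criterion described above can be sketched for the simplest case of a single breakpoint. This is an illustration under stated assumptions, not the MMDB algorithm of reference (8): the contact list, the candidate breakpoints and the threshold value are all hypothetical, and refinements such as multiple breakpoints or a minimum domain size are omitted.

```python
def contact_ratio(contacts, breakpoint):
    """Ratio of intra-domain to inter-domain contacts for one breakpoint.

    `contacts` is an iterable of residue-index pairs (i, j); residues with
    index < breakpoint form the first domain, the rest form the second.
    """
    intra = inter = 0
    for i, j in contacts:
        if (i < breakpoint) == (j < breakpoint):
            intra += 1
        else:
            inter += 1
    return intra / inter if inter else float("inf")


def best_split(contacts, candidates, threshold):
    """Return the candidate breakpoint with the highest ratio, or None.

    `candidates` stands in for positions between major secondary structure
    elements; `threshold` is a hypothetical cutoff, not the MMDB value.
    """
    scored = [(contact_ratio(contacts, b), b) for b in candidates]
    if not scored:
        return None
    ratio, breakpoint = max(scored)
    return breakpoint if ratio > threshold else None


# Toy chain of six residues: two tight triangles joined by one contact.
toy_contacts = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(best_split(toy_contacts, candidates=[2, 3, 4], threshold=4.0))
```

In this toy case the split after residue 2 leaves six contacts within domains and one between them, so it is the only candidate that exceeds the cutoff. Without a minimum domain size, a breakpoint near a chain terminus can score well trivially, which is one reason a practical method needs further constraints.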
A simple query
MMDB is an integrated part of Entrez and can be accessed by querying Entrez’s ‘3D structure’ database for particular terms or keywords. This allows one to identify structures based on protein names, author names, publication dates, species names or other terms. A query such as this will produce a list of MMDB entries, and one may browse this list, following links to other databases, for example those to MEDLINE abstracts. At the time of writing, MMDB’s servers receive approximately 25 000 3D structure queries per day.
As an example, we consider a search with the terms ‘aminocyclopropane synthase’. This identifies several 3D structures available for this enzyme, including the structure with PDB identifier ‘1B8G’ (12), the protein from Malus x domestica, the apple tree. Following the link to ‘3D domains’, one sees that structure neighbors are available for eight different substructures, the complete chains A and B plus three compact domains identified within each. Following the link to ‘related 3D domains’ for domain ‘1B8G A 3’ (the third domain in chain A, as numbered from the N-terminus of the chain), one sees that 3D superpositions are available for over 1000 structure neighbors of this domain.
A more advanced query
Entrez provides a query refinement feature that allows one to combine the results of simple queries involving term-match hits, links or neighbors. To continue with the example above, suppose one wishes to identify some of the most evolutionarily distant structure neighbors of domain ‘1B8G A 3’, as a means to identify conserved residues that may be associated with its binding and/or catalytic function. One option is to examine the tabular listing of VAST superposition statistics, available by following the link from the domain identifier ‘1B8G A 3’, and to choose structure neighbors with a low percentage of identical residues in the structural alignment. Another powerful method, however, is to choose structure neighbors from phylogenetically distant organisms. For this search it is necessary to combine the results of an MMDB search by taxonomy with structure neighboring results.
As may be seen by following the taxonomy links from domain ‘1B8G A 3’, this protein is derived from an organism (apple tree) in the superkingdom Eukaryota. The most distantly related organisms will be those from the two other superkingdom taxa, Eubacteria and Archaea. Searching Entrez’s ‘3D Domain’ database for ‘Archaea’ (with ‘limits’ set to ‘organism’), one finds that there are approximately 1000 3D domain structures known for this taxon. To select those that are also structure neighbors of 3D domain ‘1B8G A 3’, one uses Entrez’s ‘history’ window to request the Boolean ‘AND’ of the 3D domains identified by each simple query: <1> AND <2>, where <1> and <2> represent query numbers as recorded in Entrez’s history list. Performing this search, one finds approximately 20 structures that are both structure neighbors of ‘1B8G A 3’ and derived from Archaea, among them domain ‘1DJU A 3’, a domain from an aromatic aminotransferase from Pyrococcus horikoshii (13). Proceeding similarly for ‘Eubacteria’, one finds that several hundred structure neighbors of ‘1B8G A 3’ derive from this taxon, including ‘1AMQ 2’, an aspartate aminotransferase from Escherichia coli (14).
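The Boolean combination of history entries amounts to a set intersection, as the following minimal sketch shows. The history list and function names are hypothetical stand-ins for Entrez's behavior, and the result sets are tiny illustrative samples: ‘1DJU A 3’ and ‘1AMQ 2’ come from the text, while the ‘XXXX’ and ‘YYYY’ identifiers are placeholders.

```python
# Hypothetical stand-in for Entrez's history list: one result set per query.
history = []


def record(result_ids):
    """Store a query result and return its 1-based history number."""
    history.append(frozenset(result_ids))
    return len(history)


def combine_and(*query_numbers):
    """Return the identifiers present in every referenced history entry."""
    result_sets = [history[n - 1] for n in query_numbers]
    return sorted(frozenset.intersection(*result_sets))


# Query <1>: structure neighbors of the query domain (illustrative sample).
q1 = record({"1DJU A 3", "1AMQ 2", "XXXX A 1"})  # XXXX is a placeholder
# Query <2>: 3D domains from Archaea (illustrative sample).
q2 = record({"1DJU A 3", "YYYY B 2"})            # YYYY is a placeholder

print(combine_and(q1, q2))
```

Because each history entry is kept as a set, the same two queries can be reused for other combinations without repeating either search.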
Visualization of structure neighbors is available from the ‘View’ link provided with tabular listings of VAST superposition statistics. Choosing the structure neighbors ‘1DJU A 3’ and ‘1AMQ 2’ from among the other neighbors of ‘1B8G A 3’, and pressing the ‘View’ button, one may launch a Cn3D display as shown in Figure 2. Setting Cn3D to color aligned residues by variability, one can immediately see that conserved residues are concentrated in a single region of these domains. Furthermore, since each structure contains a bound pyridoxal phosphate cofactor (or related compound), one can verify that these conserved residues line the binding pocket, and are presumably necessary for cofactor binding and aminotransferase activity. We note that tabular listings of VAST superposition statistics provide several controls for sorting and subset selection, as an aid to browsing. To reproduce the superposition in Figure 2 it is helpful to select subset ‘all of MMDB’ and sort by ‘aligned residues’. This allows one to identify structure neighbors having extensive similarity (many aligned residues) and (in this example) with bound cofactors.
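The idea behind locating conserved residues in a structural alignment can be sketched as follows. This is a simplified illustration, not Cn3D's coloring scheme: it assumes the alignment is given as equal-length strings with ‘-’ for gaps, and the three toy rows are invented for the example rather than taken from the aminotransferase structures.

```python
def conserved_columns(aligned_rows):
    """Return indices of columns where every row has the same residue.

    Columns containing a gap ('-') are never reported as conserved.
    """
    length = len(aligned_rows[0])
    if any(len(row) != length for row in aligned_rows):
        raise ValueError("all aligned rows must have the same length")
    return [
        i for i, column in enumerate(zip(*aligned_rows))
        if len(set(column)) == 1 and column[0] != "-"
    ]


# Invented toy alignment of three sequences (not real protein data).
toy_alignment = ["GKLDA", "GKMDS", "GK-DT"]
print(conserved_columns(toy_alignment))
```

Mapping such column indices back onto residue positions in each structure is what allows conserved positions to be inspected together in a 3D superposition.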
We thank the NIH Intramural Research Program for support. We thank Scott Federhen, Detlef Leipe and other members of the NCBI taxonomy team for assistance with taxonomy assignments. Comments, suggestions and questions are welcome and should be addressed to email@example.com.
To whom correspondence should be addressed. Tel: +1 301 435 7792; Fax: +1 301 480 9241; Email: firstname.lastname@example.org
ᵃGroups of sequence-similar chains (with BLAST E-value < 10⁻⁷) are counted only once, so as to illustrate the number of unrelated protein structures known from each species. The total number of chains with known 3D structure is larger, since structures of many sequence-similar chains have been determined more than once, with and without bound ligands, for example.