PDBsum (http://www.ebi.ac.uk/pdbsum) provides summary information about each experimentally determined structural model in the Protein Data Bank (PDB). Here we describe some of its most recent features, including figures from the structure's key reference, citation data, Pfam domain diagrams, topology diagrams and protein–protein interactions. Furthermore, it now accepts users’ own PDB format files and generates a private set of analyses for each uploaded structure.
Since its inception in 1971 (1) the Protein Data Bank (PDB) has released over 55 000 experimentally determined structural models of proteins and nucleic acids. The archiving, management and quality control of these models is nowadays performed by the worldwide PDB (wwPDB), a consortium whose partners comprise: the Research Collaboratory for Structural Bioinformatics (RCSB), the Macromolecular Structure Database (MSD) at the European Bioinformatics Institute (EBI), the Protein Data Bank Japan (PDBj) at Osaka University and, more recently, the BioMagResBank at the University of Wisconsin-Madison (2). The wwPDB makes the structural models available via ftp together with, in many cases, the original experimental data.
As the formats in which the structural models are supplied can be rather hard to interpret on their own, a number of web sites have been set up to provide ‘atlas’ pages that describe each model in a more readily digestible form. These are usually aimed at a wide variety of users with an interest in protein structure, including structural biologists and bioinformaticians. The pages contain general information, obtained from the relevant PDB file, supplemented by links to other databases and various analyses and schematic diagrams that are often unique to the specific atlas. The best known atlases are: the RCSB PDB (3), the MSD (4), OCA, JenaLib (5), PDBj, MMDB (6) and PDBsum (7). A recent review compared these atlases, or ‘comprehensive information resources’ as it called them, and identified their similarities and differences (8).
Here we provide an update of the last-mentioned atlas, PDBsum (http://www.ebi.ac.uk/pdbsum). We focus on changes made to it since it was last described (9), and specifically concentrate on features that are unique to it.
FIGURES FROM THE LITERATURE
The first major change described here involves the inclusion of figures, plus corresponding figure legends, from the structures’ principal literature references. The aim here is to encapsulate some of the rich information that the given structural model provided when it was solved; and where better to get that information than from the authors’ original publication. A well-chosen figure can often provide much information, and may be especially useful for users who do not have free access to the journal in question. Furthermore, many of the figures are quite beautiful, so their inclusion is likely to draw the user to the original reference.
The figures come either from Open Access publications or from journals whose editors and/or publishers have granted us permission for use of their copyrighted material. In fact, most journals and publishers were very generous and cooperative in this respect, although there were one or two disappointing exceptions. In general, we were given permission to use up to two figures from each relevant paper. The statistics showing the numbers of key references from which figures have been taken are given in PDBsum's ‘Figure Stats’ page.
The actual capture of the figures and their captions from the online copies of the articles has proved something of a challenge. For one thing, the journals keep changing their formats. Furthermore, the very old papers are only available as PDF versions scanned in from the original hard copies. In these cases, each journal page is, in effect, just an image. To process these images requires use of optical character recognition (OCR) utilities to extract the text (a somewhat error-prone procedure) and then each separate block of text has to be categorized as: part of the main body of the paper, a figure legend, other text such as headings or tables, or text within the figures themselves (i.e. labels or parts of the picture misinterpreted as text by the OCR utility). From the placement of the text blocks the likely coordinates of each figure on the page are calculated and the figure spliced out. This procedure has been described in greater detail elsewhere (10). For the more recent papers, the figures can generally be downloaded as image files directly from the online versions of the papers, although this, too, is not 100% reliable.
Selection of which figure, or pair of figures, to use from each paper is made by an algorithm called a support vector machine trained to distinguish interesting figures from dull ones on the basis of the words and word-pairs that appear in the figure legend (10). However, rather than wholly rely on this automated method, the choice of figures is, where possible, e-mailed to the lead author of the paper with a request to review, and possibly change, which figures are to be used and an invitation to add the author's own comment(s). About one in six authors respond to these e-mails and over 200 in all have taken the trouble to annotate their entries with additional information. A useful spin-off from this system has been an increase in the level of feedback from the authors about PDBsum and has led to several improvements, most notably the Pfam domain diagrams described below.
As well as a PDB entry's key reference, it is also useful to know of more recent literature relating to the structure. To this end, we have started adding citation data to PDBsum; that is, references that cite each structure's key paper. Currently, the data comes from CiteXplore (http://www.ebi.ac.uk/citexplore), and is supplemented by additional citations automatically harvested from the web. Many references are still not captured by either method, but our coverage of the literature should increase with time. As of September 2008, there were over 44 000 citing references for the 25 000 key references in PDBsum.
Pfam DOMAIN DIAGRAMS
One crucial aspect of PDB structural models that many non-expert users fail to appreciate is that very often the model corresponds to only ‘part’ of the full protein sequence; sometimes it is only a single domain, and occasionally merely a fragment. To show just how the given structure relates to its parent sequence, PDBsum has a little schematic diagram near the top of the structure's page to provide the information at a glance. The diagram represents the full length of the sequence, including any constituent Pfam (11) domains, where known. Below the sequence is shown the full length of the structure in terms of its secondary structure elements. From the diagram one can easily see the correspondence between the structure and the full-length sequence. Figure 1a shows an example, in which the structure is of only the last domain. Also marked on this diagram, where known, are any CATH structural domains and any residue positions where the sequence in the PDB file disagrees with the sequence in UniProt (12). Clicking on a blank part of the image brings up an alignment between the two versions of the sequence, which can be used to identify these mismatches.
Given that a structural model may be incomplete in this way, PDBsum can illustrate ‘all’ structural models in the PDB for a given UniProt sequence. Figure 1b gives an example. Here one can see that there are in fact three models of the full-length sequence, any of which may be more informative than the single domain model given in Figure 1a.
IMPROVED WIRING DIAGRAMS
For each unique protein chain in a given structural model, PDBsum provides a ‘protein page’ that includes a schematic diagram of the protein's secondary structure. These ‘wiring diagrams’ have been improved to provide a prettier picture than previously. And now the diagrams can be enlarged to five times their size for publication purposes (Figure 2a).
Each protein page also includes a topology diagram showing the arrangement and connectivity of the protein's helices and strands (Figure 2b). Where the protein chain consists of more than one domain, as defined by CATH (13), a separate diagram is generated for each and is colour-coded according to the domain colouring on the wiring diagram. The topology diagrams are generated from the hydrogen bonding plots of Gail Hutchinson's HERA program (14).
Another new feature is the addition of schematic diagrams illustrating the interactions across protein–protein interfaces. Where a structural model contains more than one protein chain (e.g. Figure 3a), the interfaces between the chains are depicted by three types of plot: the first shows an overview of which chains interact with which (Figure 3b), the second summarizes the interactions across any selected interface (Figure 3c), and the third shows in detail which residues actually interact across that interface (Figure 3d).
PDBSUM PAGES FOR USER-SUBMITTED STRUCTURES
Finally, an upload option has been added to allow users to submit their own PDB-format files to PDBsum and have a set of PDBsum analyses and pages generated for it. The generated pages are password-protected for privacy, and are deleted after about 6 months. Currently the server is receiving an average of around 40 uploads per week.
Funding for open access charge: European Molecular Biology Laboratory (EMBL).
Conflict of interest statement. None declared.
We would like to thank Gail Hutchinson for use of her HERA program to help generate the topology diagrams. Also thanks to the many authors who contributed their time to reviewing the figures included in PDBsum and, in particular, to those who added comments to their pages or provided valuable feedback.