Motivation: The Protein Data Bank (PDB) contains over 43 800 experimentally determined 3D models of macromolecular structures and their complexes. Each 3D model reveals something interesting and important about the given molecule's function and biological significance. Usually the best source of this information is the original article describing it, and it is often possible to discern the key aspects of the structure from just one or two of the figures in that article.
Results: Here we describe how, with the permission of the journals and their publishers, we have endeavoured to make these key figures publicly available to enhance the functional information relating to each PDB entry in our PDBsum database.
PDBsum (Laskowski et al., 2006), which was set up in 1995, is one of a number of web-based databases that provide information on all experimentally determined structural models released by the Protein Data Bank, PDB (Berman et al., 2000). Other databases include the MSD (Tagari et al., 2006), the Jena Library of Biological Macromolecules (Reichert and Sühnel, 2002), and of course the PDB's own recently revamped site at http://www.rcsb.org/pdb. Despite some inevitable overlap, the databases do complement one another, each providing unique information not found in the others.
A primary aim of PDBsum has always been to represent the structural information for each 3D model in as pictorial a manner as possible, providing schematic diagrams both of the molecules making up each PDB entry—i.e. protein/DNA/RNA chains, ligands and metals—and of the interactions between them. Over the years many new and unique features have been added. Yet, despite the wealth of structural information that each entry now contains, it is not always clear, particularly to the non-expert, what a given protein, or protein complex, actually does in real life; what is its biochemical function and overall biological role? The various links to other databases, together with the literature references, enable one to laboriously compile such information for oneself, but it would be preferable to have it presented there and then.
PDBsum does contain some functional annotation. Gene Ontology (The Gene Ontology Consortium, 2000) annotations for the corresponding UniProt sequence are provided where available, as is functional annotation from the UniProt Knowledgebase (The UniProt Consortium, 2007). For enzymes, a reaction diagram is given with any products or reactants highlighted if they are similar to any ligand bound in the given structure. This information is useful, but may not always tell one what the wider significance of the structure is.
A far richer source of information is of course the scientific literature and, specifically, the original article written by the structure's authors. Here the authors will have described how they determined the 3D structure, their analysis of the structure itself and, in most cases, an explanation of how the 3D structure relates to, or even explains, the biological function of the molecule(s) in question. A simple link to such an article can lead one to this information, but, crucially, not everyone has free access to the articles themselves.
In an attempt to make at least some of this valuable information more widely available, we have been approaching the main journals in the field and requesting permission to use selected figures from the relevant articles, together with their captions, on the PDBsum pages. The motivation behind this is that the key aspect of a structure can often be readily discerned from just one or two carefully selected figures from the relevant paper. And, as PDBsum is primarily a pictorial database, this fits in well with the database's aims. To date we have received an encouraging response from many of the top journals and have started to add the figures to the PDBsum pages in accordance with the permission granted by each journal or publisher. However, it has not been a simple or straightforward procedure and, in this article, we describe the technical difficulties that we had to overcome to, firstly, identify and locate the articles associated with each PDB entry, then extract their figures, and finally to select which would be the best to include on the PDBsum pages.
2.1 Identifying the references
The first difficulty we faced was to reliably identify the references relating to each entry. Most PDB files include a list of references in their header records. The ‘key’ reference is given in the JRNL records and usually corresponds to the description of the structure in question. Additional references are listed in REMARK records and tend to correspond to older papers describing earlier work and so are not usually directly relevant, yet may be of interest. The biggest problem is that in many cases the key reference is annotated merely as ‘TO BE PUBLISHED’, i.e. the full citation details were not available at the time of the structure's deposition. The problem is compounded by the fact that frequently the PDB entry is never updated with the full citation information, even when the paper appears in print. So, in such cases all we have is the paper's title and the names of its author(s), and it is not unusual for either one or both of these to change by the time the paper is published. As of February 2007, there were 11 700 references marked as ‘TO BE PUBLISHED’ in the 65 500 references cited by the 43 800 PDB entries.
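As an illustration of how the key reference can be read from a PDB header, the following is a minimal sketch rather than the code used for PDBsum. It assumes the fixed-column layout of JRNL records (sub-record name in columns 13 to 16, free text from column 20); the function names `key_reference` and `is_unpublished` are hypothetical.

```python
def key_reference(pdb_lines):
    """Collect the JRNL sub-records (AUTH, TITL, REF, ...) into a dict.

    Assumed layout: sub-record name in columns 13-16, text from column 20.
    Continuation lines of the same sub-record are joined with a space.
    """
    fields = {}
    for line in pdb_lines:
        if not line.startswith("JRNL"):
            continue
        tag = line[12:16].strip()
        text = line[19:79].strip()
        if tag:
            fields[tag] = (fields.get(tag, "") + " " + text).strip()
    return fields


def is_unpublished(fields):
    """True when the key reference carries only the placeholder citation."""
    return fields.get("REF", "").upper().startswith("TO BE PUBLISHED")
```

For such entries only the `AUTH` and `TITL` fields are available as search terms, which is the situation the matching procedure has to handle.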
Another problem is that, even where the full reference details are given, there are many cases where one or more of these are missing or incorrect. This, too, complicates the matching of the citation details to the actual papers.
Some of these problems have already been resolved by the wwPDB and, even though the details given in the PDB files may be inaccurate or incomplete, the XML versions of these files include a mapping to the PubMed identifiers of the articles in question. The Japanese arm of the wwPDB is currently in the process of ‘cleaning up’ all the literature references so one day these data will be more complete.
In the meantime, we needed to adopt a strategy that would be able to cope both with the remaining ‘TO BE PUBLISHED’ cases and the erroneous or incomplete citations. Our method makes use of the EBIMed database, http://www.ebi.ac.uk/Rebholz-srv/ebimed (Rebholz-Schuhmann et al., 2006). Two types of search are performed: the first on author names, and the second on words taken from the paper's title. The author search uses the first two, or where present three, non-identical author surnames. If this fails to produce a positive match then three to eight words, of at least five letters in length, taken from the title are used to scan the database.
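The two kinds of search term can be sketched as follows. This is an illustrative reading of the description above, not the actual scripts: the text does not say which title words are chosen, so taking the first eight qualifying words is an assumption, and the function names are hypothetical.

```python
import re


def author_query(surnames, max_names=3):
    """Return the first two or three distinct surnames, in order."""
    distinct = []
    seen = set()
    for name in surnames:
        key = name.lower()
        if key not in seen:
            seen.add(key)
            distinct.append(name)
    return distinct[:max_names]


def title_query(title, min_len=5, min_words=3, max_words=8):
    """Return three to eight title words of at least five letters.

    Assumption: the first qualifying words are used. An empty list
    means the title is too short to give a usable query.
    """
    words = [w for w in re.findall(r"[A-Za-z]+", title) if len(w) >= min_len]
    words = words[:max_words]
    return words if len(words) >= min_words else []
```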
Any hits returned are compared against whatever details are given in the PDB file. If there is an exact match to either the citation details (i.e. journal, year, volume and start page), or to the title and author list, the corresponding PubMed identifier is assigned to this reference. Otherwise, a comparison is made between the title and author details given in the PDB file and those returned by each hit from the database search. To allow for differences due to changes or errors in the title and/or author names, the comparison is performed using a simple dynamic programming algorithm as used for protein sequence alignment (Needleman and Wunsch, 1970). Here, rather than align letter by letter, each word/name is treated as a separate unit and each ‘sequence’ consists of the string of words in the title followed by the author surnames. Pairs of equivalent words in the two sequences are scored as follows: identical words score 10 while similar words score 5. Two words count as similar when the letters of the shorter word that are also present in the longer word account for at least 60% of the length of the longer word. A gap penalty of 5 is used. The final alignment is scored by counting the numbers of identical and similar words in equivalent positions as a percentage of the length of the longer sequence. Matches scoring over 70% are stored as potential matches and written out for manual verification. As of February 2007, 4200 of the 11 700 ‘TO BE PUBLISHED’ references could be assigned a PubMed identifier in this manner.
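The word-level alignment can be sketched as below. The scores (10, 5, gap penalty 5), the 60% similarity rule and the 70% threshold are taken from the description above; everything else is an assumption made for illustration, in particular counting shared letters as a multiset, scoring dissimilar aligned words as 0, and ignoring case. The function names are hypothetical.

```python
from collections import Counter

IDENTICAL, SIMILAR, GAP = 10, 5, 5


def word_score(a, b):
    """Score a pair of words: 10 if identical, 5 if similar, else 0."""
    a, b = a.lower(), b.lower()
    if a == b:
        return IDENTICAL
    shorter, longer = sorted((a, b), key=len)
    # Assumption: shared letters are counted with multiplicity.
    shared = sum((Counter(shorter) & Counter(longer)).values())
    return SIMILAR if shared >= 0.6 * len(longer) else 0


def align_score(seq_a, seq_b):
    """Globally align two word lists; return % of matched positions."""
    n, m = len(seq_a), len(seq_b)
    if max(n, m) == 0:
        return 0.0
    # best[i][j] is the best score for seq_a[:i] against seq_b[:j].
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = -GAP * i
    for j in range(1, m + 1):
        best[0][j] = -GAP * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + word_score(seq_a[i - 1], seq_b[j - 1]),
                best[i - 1][j] - GAP,
                best[i][j - 1] - GAP,
            )
    # Trace back, counting identical or similar words in equivalent positions.
    i, j, matched = n, m, 0
    while i > 0 and j > 0:
        s = word_score(seq_a[i - 1], seq_b[j - 1])
        if best[i][j] == best[i - 1][j - 1] + s:
            if s > 0:
                matched += 1
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] - GAP:
            i -= 1
        else:
            j -= 1
    return 100.0 * matched / max(n, m)


def is_potential_match(seq_a, seq_b, threshold=70.0):
    """Flag a hit for manual verification when the score exceeds 70%."""
    return align_score(seq_a, seq_b) > threshold
```

Each input list would hold the title words followed by the author surnames, so that a changed title can still be rescued by an unchanged author list, and vice versa.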
2.2 Extracting figures from the references
Given a reference's PubMed identifier the next step is to determine the URL of the online version of the paper so its figures can be downloaded. We use an internal database (Peter Stoehr, personal communication) to obtain the URLs. In principle, the papers and their figures can then be downloaded using a simple script. In practice, there are various technical problems to overcome. The first is the wide variety and complexity of the HTML code used by the online journals; this makes it far from straightforward to universally locate and identify the figure images and their captions. The second is that the URLs given rarely relate to the HTML versions of the papers. In general they point to some ‘front page’ from which it is necessary to identify the links to the HTML and/or PDF versions of the paper and, often, further links need to be followed to reach the full-size versions of the figures.
To solve the problem of interpreting the HTML we use the Lynx browser to download the HTML pages and perform the interpretation for us. Lynx is a text-based browser generally used for viewing web pages on standard xterms, but it also provides a command-line dump option for writing out an HTML page as a simple text file with all the mark-up removed and the links given in a standard bracketed format. While this simplifies the interpretation, the problem of the wide variety of journal formats remains. To deal with this we have compiled a set of templates specifying how to identify the required information on each journal's pages. The information to be identified includes the start and end of each figure caption, the URL of each full-sized image, and the e-mail address of the corresponding author. For most journals, several templates are required to describe the different links to be followed to the final full-size images. Furthermore, for some journals several alternative templates are required to cope with format changes made over the years.
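The sketch below shows one way the Lynx output might be consumed. It relies on `lynx -dump`, which renders links as bracketed numbers in the text and lists their targets under a trailing "References" heading; the exact layout of that list can differ between Lynx versions, so the parsing here is an assumption. The function names and the idea of expressing a template as a regular expression on the link label are hypothetical, not a description of the PDBsum templates.

```python
import re
import subprocess


def lynx_dump(url):
    """Run `lynx -dump` on a URL and return the plain-text rendering."""
    result = subprocess.run(
        ["lynx", "-dump", url], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_dump(dump):
    """Split a dump into its body text and a {link number: URL} table."""
    body, _, refs = dump.partition("\nReferences\n")
    pattern = r"^[ \t]*(\d+)\.[ \t]+(\S+)[ \t]*$"
    links = {int(n): url for n, url in re.findall(pattern, refs, re.MULTILINE)}
    return body, links


def links_labelled(dump, label_pattern):
    """Return the URLs of links whose label matches a template pattern."""
    body, links = parse_dump(dump)
    found = []
    for match in re.finditer(r"\[(?P<num>\d+)\]" + label_pattern, body):
        number = int(match.group("num"))
        if number in links:
            found.append(links[number])
    return found
```

A per-journal template would then amount to a small set of such patterns, one for each link that has to be followed on the way to the full-size image.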
2.3 Figures from PDF files
Many of the older references are not available in HTML format but only as PDF files. For the very old references, prior to around 1996 (or even later for some journals), these PDF files were created by scanning in the articles from a paper copy of the journal. These present a particularly challenging problem as far as extracting figures and captions is concerned; each PDF file is essentially a set of TIFF images, one per page of the article (or, in some cases, sets of 20 or more images per page), with no simple means of separating text from images.
To extract the figures from PDF papers we use a combination of freely available utilities and a custom-written program. First, we use the pstotext script, which calls the Ghostscript program to extract the text from the PDF file. For scanned-in PDF files, this involves optical character recognition (OCR) to identify the text and is highly error-prone. There are better, commercially available, alternatives to pstotext, but these are expensive. Secondly, we use the Linux utility pdfimages to extract the images from the PDF file. This works well for most standard PDF files, but not all. For the scanned-in PDFs the utility is only able to pull out the image(s) of each page, rather than of the figures on it. In these cases the figures have to be excised from the images using a utility such as ImageMagick's convert program. But to do this requires accurately knowing the location of each figure on the page, and this is the tricky part.
First, the coordinates and sizes of each text fragment (i.e. word, part of a word, or single character), as returned by pstotext, are used to mask an array representing the layout of the given page. From the masked regions it is possible to identify the separate blocks of text on the page using connected component analysis (Figure 1). The block(s) of text corresponding to any figure legend(s) on that page are found. These, depending on the journal, will be blocks that start with the words ‘Figure n.’ or ‘Fig. n.’, where n is the figure number. (This assumes that the OCR algorithm has correctly identified the words as ‘Fig. n.’, rather than as, say, ‘Pig. n.’). Next, any small blocks that do not correspond to either a figure legend or to the main text of the paper are eliminated. These text fragments correspond to page headers and footers, tables, text captions in figures and, occasionally, parts of figures mistakenly identified as text by the OCR program.
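The masking and connected component steps can be sketched as follows. This is a minimal illustration on a coarse grid, not the custom program itself: the grid resolution, the use of 4-connectivity and the function names are all assumptions, and in practice the boxes would probably need padding so that words separated by spaces merge into one block.

```python
from collections import deque


def mask_page(width, height, boxes):
    """Mark every grid cell covered by a text fragment's bounding box.

    Each box is (x, y, w, h) in grid cells, with the origin at top left.
    """
    grid = [[False] * width for _ in range(height)]
    for x, y, w, h in boxes:
        for row in range(max(0, y), min(height, y + h)):
            for col in range(max(0, x), min(width, x + w)):
                grid[row][col] = True
    return grid


def text_blocks(grid):
    """Return bounding boxes (x0, y0, x1, y1) of 4-connected masked regions."""
    height = len(grid)
    width = len(grid[0]) if grid else 0
    seen = [[False] * width for _ in range(height)]
    blocks = []
    for r in range(height):
        for c in range(width):
            if not grid[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited masked cell.
            seen[r][c] = True
            queue = deque([(r, c)])
            x0 = x1 = c
            y0 = y1 = r
            while queue:
                cr, cc = queue.popleft()
                x0, x1 = min(x0, cc), max(x1, cc)
                y0, y1 = min(y0, cr), max(y1, cr)
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if (0 <= nr < height and 0 <= nc < width
                            and grid[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            blocks.append((x0, y0, x1, y1))
    return blocks
```

Blocks whose text starts with ‘Figure n.’ or ‘Fig. n.’ would then be treated as legends, and small blocks that are neither legend nor main text would be dropped.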
After removing the fragments, one is left with the locations of the figure captions, blocks of text from the main body of the paper, and blank regions where the figures (and tables) must lie. The coordinates of the most appropriate blank region, adjacent to a figure caption, are then extracted and can be used to cut out the figure from that page's TIFF image (Figure 2). As the OCR text recognition is so poor, we include the figure caption together with the figure when cutting out the image so as not to lose information due to the OCR's tendency to garble things. A problem which we have not yet solved is where a table lies next to a figure. It is very difficult to correctly identify where the table ends and the figure begins, so such tables tend to be included as part of the figure when it is cut out.
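Once a blank region and its caption block are known, the cut can be expressed as an ImageMagick crop. The sketch below only builds the command line; it assumes boxes given as (x0, y0, x1, y1) in pixels of the page image, with x1 and y1 exclusive, and that the coordinates reported by the text extraction have already been scaled to the image resolution. The function names and file names are hypothetical.

```python
def union_box(a, b):
    """Smallest box containing both a and b, e.g. a figure and its caption."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def crop_command(page_image, region, out_file):
    """Build an ImageMagick `convert` call that cuts `region` from a page.

    The geometry string has the form WIDTHxHEIGHT+XOFFSET+YOFFSET;
    +repage resets the virtual canvas of the cropped image.
    """
    x0, y0, x1, y1 = region
    geometry = f"{x1 - x0}x{y1 - y0}+{x0}+{y0}"
    return ["convert", page_image, "-crop", geometry, "+repage", out_file]
```

The resulting list can be passed to `subprocess.run`. Taking the union of the figure region and the caption block reflects the choice to keep the caption in the image.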
While the above strategy works for the majority of papers and figures, there are inevitably cases where it fails. As a last resort, it is possible to cut out any missed figures manually from the relevant PDF file using any standard screen-capture utility. As of February 2007, over 1800 scanned-in PDFs have had their figures captured using the above automated procedure. For some very old PDF files the pstotext program is unable to identify any text at all, and indeed even the Acrobat reader fails, so these cannot be processed automatically in this way.
2.4 Selecting the best figures
Once all figures and their captions have been extracted from a paper, the next stage is to select which one(s) to show on the relevant PDBsum pages. The number of figures we are allowed to use depends on the permission that the journal or publishers have granted us. In most cases it is two figures per paper. The simplest strategy is to rank all figures in each paper according to how interesting/informative they are and then use the most highly ranked ones on the PDBsum page. The ranking is achieved using a support vector machine (SVM) trained on a set of examples based on the words in the figure captions. We used a training set of around 1000 papers from J Mol Biol, J Biol Chem and Biochemistry. Volunteers from the Thornton group at the EBI were repeatedly served a paper at random from the training set and asked to class each figure in the paper as being of strong, medium or weak interest. Once a large enough dataset of responses had been compiled the medium-interest figures were discarded and the strong and weak figures were used to train an SVM. We used the SVMTorch program (Collobert and Bengio, 2001). From each figure caption, all words of length 4 or greater consisting solely of letters, and all pairs of such adjacent words, were extracted and their occurrences in the strong and weak groups were counted. Words or word pairs occurring only once in all captions were discarded, as were those occurring in more than 10% of the figures. The net result was 6610 words and word pairs from 500 figure captions.
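The caption features can be sketched as below. The length, letters-only and frequency rules come from the description above; the tokenisation, the lower-casing, the reading of ‘adjacent’ as adjacent in the original caption, and the counting of each term once per caption are assumptions. The function names are hypothetical and the sketch stops short of the SVM training itself.

```python
from collections import Counter


def caption_terms(caption, min_len=4):
    """Words of >= 4 letters, plus pairs of such words that are adjacent."""
    tokens = [t.strip(".,;:()[]'\"").lower() for t in caption.split()]
    keep = [t.isalpha() and len(t) >= min_len for t in tokens]
    terms = {t for t, ok in zip(tokens, keep) if ok}
    for i in range(len(tokens) - 1):
        if keep[i] and keep[i + 1]:
            terms.add(tokens[i] + " " + tokens[i + 1])
    return terms


def build_vocabulary(captions, max_fraction=0.10):
    """Drop terms seen only once or in more than 10% of the captions."""
    counts = Counter()
    for caption in captions:
        counts.update(caption_terms(caption))
    limit = max_fraction * len(captions)
    return sorted(t for t, n in counts.items() if 1 < n <= limit)


def to_vector(caption, vocabulary):
    """Binary presence vector of a caption over the vocabulary."""
    terms = caption_terms(caption)
    return [1 if t in terms else 0 for t in vocabulary]
```

Vectors of this kind, labelled strong or weak, are the sort of input a two-class SVM would be trained on; the frequency cut-offs remove terms that are either too rare to generalise or too common to discriminate.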
Once trained, the SVM could be used to score, and then rank, the figures in any paper. However, rather than leave the figure selection solely to this automated process we decided to allow the authors of the papers to have a say in the matter. Where possible, an automatically generated e-mail is sent to the paper's corresponding author with a request to review the selected figures and to change the selection if necessary via a secure web page. The e-mail also serves to inform the authors, as a matter of courtesy, of the use of their figures in PDBsum. To date, around a fifth of the papers with valid e-mail addresses have been checked by their authors in this way, with some authors volunteering additional information for inclusion in the PDBsum pages. About 16% of the e-mail addresses were either invalid or out of date, and approximately 8% of the papers either had no e-mail address or the programs were unable to retrieve it from the paper. These latter percentages are likely to increase as many of the older references involving scanned-in PDF files have yet to be processed.
2.5 Added references
In addition to the references listed in the PDB files, another source of relevant publications is the tables of contents of some journals. For example, Acta Crystallographica and Nature Structural Biology annotate their tables of contents with the PDB identifiers of the 3D structures to which the articles relate. These references are regularly harvested and, where not already included in the PDB files, are shown as ‘added references’ on the relevant PDBsum pages.
As of February 2007, PDBsum includes figures from over 12 000 papers from around 30 different journals. These break down as 19 350 key references, 6500 secondary references and 100 added references (many of the papers are cited by several PDB entries). The 19 350 key references mean that 44% of the current 43 800 PDB entries have at least one figure from the scientific literature on their PDBsum page. These numbers will grow as we include more papers and seek permission from more journals. The current statistics can be seen in PDBsum at http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum/figstats.
One can also see some examples of the figures themselves by going to the PDBsum Gallery at http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum/Gallery. This gives a randomly generated selection of papers and their chosen figures each time the Renew button is pressed.
We would like to thank Harald Kirsch for his scripts for searching the EBIMed database, and Peter Stoehr for access to his database of literature URLs. We would also like to thank Janet Thornton, Dietrich Rebholz-Schuhmann and Neville Kallenbach for useful discussions, and Gerard Kleywegt and Jette Kastrup for comments on the early version of the system. Finally, we are grateful to the journals and publishers that have allowed us to make use of their copyright material in this way.
Conflict of Interest: none declared.