PoSSuM v.2.0: data update and a new function for investigating ligand analogs and target proteins of small-molecule drugs

PoSSuM (http://possum.cbrc.jp/PoSSuM/) is a database for detecting similar small-molecule binding sites on proteins. Since its initial release in 2011, PoSSuM has grown to provide information related to 49 million pairs of similar binding sites discovered among 5.5 million known and putative binding sites. This enlargement of the database is expected to enhance opportunities for biological and pharmaceutical applications, such as predictions of new functions and drug discovery. In this release, we have provided a new service named PoSSuM drug search (PoSSuMds) at http://possum.cbrc.jp/PoSSuM/drug_search/, in which we selected 194 approved drug compounds retrieved from ChEMBL, and detected their known binding pockets and pockets that are similar to them. Users can access and download all of the search results via a new web interface, which is useful for finding ligand analogs as well as potential target proteins. Furthermore, PoSSuMds enables users to explore the binding pocket universe within PoSSuM. Additionally, we have improved the web interface with new functions, including sortable tables and a viewer for visualizing and downloading superimposed pockets.


INTRODUCTION
The number of released protein entries in the Protein Data Bank (PDB) (1) has reached 100 000. To elucidate protein functions from the abundant structural data, an efficient approach must be used to examine ligand-binding sites specifically because proteins exhibit their functions through interaction with other molecules. Of particular interest are smallmolecule binding pockets, which are crucial in structurebased drug discovery. To retrieve similar binding pockets of biological relevance, many methods have been proposed in the last few decades (2)(3)(4)(5)(6)(7)(8)(9). However, those methods are applicable only to a limited subset of protein binding pockets mainly because of the time complexity. We developed an extremely fast and efficient method for finding similarities between vast numbers of ligand-binding pockets in our previous study (10,11) and then applied it to all-againstall similarity searches against 1.8 million known and putative small molecule-binding pockets throughout the PDB. Eventually, we discovered 14 million pairs of similar binding sites. All enumerated similar pairs, along with biological annotations, were compiled into the PoSSuM database (http://possum.cbrc.jp/PoSSuM/) (12).
Since its initial release (2011), PoSSuM has grown to include more than 49 million pairs of similar binding sites identified among about 5.5 million binding sites. This database expansion is expected to enhance opportunities for biological and pharmaceutical applications such as the prediction of new protein functions. The binding pockets were assigned with UniProt (13) identifiers, Enzyme Commission (EC) numbers (14), Gene Ontology (15) terms, and domain annotations from CATH (16), SCOP (17) and SCOPe (18), and compiled into a relational database.
A PoSSuM similarity search starts by choosing a known ligand-binding pocket (SearchK) or a whole protein structure (SearchP) as a query. However, it is also useful to search for similar binding pockets that bind a specific type of lig-and, such as a small-molecule drug. The orally available small-molecule drugs are regarded as pharmaceutically important chemicals because of their safety and simple applicability. Therefore, finding an unexpected similarity between a known drug target and an unknown one might provide a hint for consideration of an 'off-target effect' of the corresponding drug. Another important consideration is that these small-molecule drugs have a wealth of drug discovery data: bioactivity data, medicinal chemistry data and target information. Therefore, it is possible to verify the similarity between proteins that have similar binding pockets even in the absence of protein-ligand complex data. In this study, we performed a survey, in which we focused on 194 bioactive small-molecule drugs with ligand-protein complex structures available in the PDB, and compared their binding pockets against the entire PDB by using a PoSSuM similarity search. The results were compiled into a new web interface, PoSSuMds, which is expected to be useful for investigating analogs of drugs and also for fishing for potential targets.
In addition to this update and new developments described above, we have introduced new features including redundancy removal on the result page, sortable tables and the improved molecular viewer, as well as a downloading function of the coordinates of superimposed pockets.

DATA GROWTH TO OVER 4.7 MILLION BINDING SITES
As of September 2014, the PoSSuM data source contains 49 million pairs of similar binding sites detected between 4.7 million binding pockets (Table 1). Since its initial release in November 2011, the number of binding pockets has increased by approximately three times. This data expansion is attributable not only to our annual updates on the release of new PDB entries, but also to the results from the removal of the use of a representative subset. Previously, putative binding pockets were detected from non-redundant protein structures (defined with a cutoff of 95% sequence identity) by using Ghecom, a novel pocket prediction method (19). In the new release, all PDB protein structures, except for structures with a resolution of less than 4.0Å or containing more than 3000 residues, are used to generate putative pockets. This extension not only increases the opportunity to obtain hits, but also provides a more convenient service in 'SearchP', in which a PDB ID is specified as a query.
Given a query of a binding pocket, PoSSuM enumerates all similar pockets irrespective of the sequence redundancy. However, users can filter out the redundancy by setting a UniProt/UniRef50 filter. If multiple similar pockets with the same UniProt ID and HET code are identified, then the pairs with the longest aligned length can be displayed in the case of a UniProt filter. In the case of a UniRef50 filter, the UniProt ID was substituted with the UniRef50 ID. In this event, the longest matched pairs were displayed.

POSSUM DRUG SEARCH (POSSUMDS)
For any given drug compound, a challenging task in structural bioinformatics is to find related (and biologically relevant) compounds and target proteins. Our approach is to collect a set of binding sites similar to the pockets to which the query ligand is known to bind. This type of approach is useful for retrieving various ligand analogs, such as natural ligands, inhibitors and metabolites, and also for obtaining potential target proteins ( Figure 1).

Small-molecule drug data set and results of a similarity search
Small-molecule drug compounds and their chemical properties were retrieved from the ChEMBL database (20). From ChEMBL release 19, we selected all of the drug compounds that fulfilled the following criteria: -Approved drug (molecule dictionary.max phase = 4) -Used as a therapeutic (molecule dictionary. therapeutic flag > 0) -Oral drug (molecule dictionary. oral > 0) -Non-prodrug (molecule dictionary.prodrug = 0) Of these, 211 unique drug compounds were identified as having at least one protein-ligand complex structure in the PDB by matching the standard InChIKey against the Ligand Expo database (21). The PoSSuM source data included 194 such drug ligands (HET codes), which have at least one binding pocket showing similarities to different pockets. In all, 2595 binding pockets for the 194 ligands were used as queries in this study (Table 2). Query pockets were then searched against all of the non-self pockets by using PoSSuM. Subsequently, 530 898 similarities were detected between 26 509 unique pockets, which were occupied by 5312 unique ligands (HET codes) and derived from 12 220 unique protein structures (PDB IDs). Finally, all of the results were compiled into individual web pages based on the name of the query drug (HET codes .

Diversity of identified ligands and receptors
To describe the diversity of the retrieved ligands, we compared the ligand structures by using two types of molecular fingerprints: MACCS (166 bits) and Babel FP2 (1024 bits), in which ligand compounds were encoded into fingerprints by OpenBabel (http://openbabel.sourceforge.net/). Similarities between them were then measured by using the Jaccard Index (JI) metric. To represent relationships between ligands graphically, two ligands were connected with an edge D394 Nucleic Acids Research, 2015, Vol. 43, Database issue  Figure S1). To investigate the diversity of retrieved receptors, we compared functional and structural classifications, which included EC numbers, CATH homologous super families and SCOPe super families, between query proteins and target proteins identified by pocket similarity searches.
To provide users with better and dynamic graphics, all figures on the result page were generated with D3.js (https: //github.com/mbostock/d3), a Java Script library. The PoSSuMds interface starts with a table containing  the 194  Each result page is subdivided into four sections. The first section summarizes drug descriptions based on the WHO ATC classification and chemical properties of the query drug, and is followed by a table of statistics related to the results of the pocket similarity searches, including binding pockets, binding ligands and receptor proteins.

WEB INTERFACE OF POSSUMDS
The variety of the retrieved ligands is shown in the second section. At the top of this section, the distribution of the retrieved ligands is displayed in a bar plot ( Figure 3A). Chemical similarities among the ligands can be viewed as heatmaps ( Figure 3B), where darker colors represent higher JI values. The relationships can also be visualized as a network ( Figure 3C), where a ligand is denoted by a node and the chemical similarity is represented by an edge. This type of visualization is expected to be useful for understanding the distribution of ligand analogs that bind to structurally similar pockets. In the case of imatinib (HET: STI), an example of a typical kinase inhibitor, the chemical similarities to natural ligands, metabolites (e.g. ATP, ADP and AMP) and to other inhibitors such as dasatinib (HET: 1N1) and sunitinib (HET: B49), are apparent in the network view. Furthermore, up to 50 of the top ligands, in descending order of the number of binding pockets, are shown in the table at the end of this section. All the other ligands can also be downloaded at the end of this table.
The third section presents a description of the diversity of receptor proteins. The distributions of the target binding pockets in PDB entries and sequence groups (UniProt IDs and UniRef50 IDs) are summarized in bar plots (Supplementary Figure S2A), which are followed by pie charts representing the distributions of functional and structural groups (Supplementary Figure S2B). Unique groups such as EC codes observed only on the target side are shown in highlighted colors.
In the fourth and last section, one can retrieve all of the binding pockets on both the query and target sides (Figure 4A), as well as all of the similarity details between them ( Figure 4B and C). For several query ligands, such as kinase inhibitors, the number of similar pocket pairs exceeds 10 000, which is difficult to display in a web page. Therefore, we generated a subset of pocket pairs in the following manner. Presuming that a query pocket is associated with UniProt ID P1 and HET code H1, and that one similar pocket was identified as associated with UniProt ID P2 and HET code H2, if multiple similar pocket pairs were identi-   fied between (P1, H1) and (P2, H2), then only the pair with the longest aligned length is selected. For putative pockets, the HET code H2 was ignored. If the interaction between a query ligand and an identified target receptor has been tested by any binding assay, then we assigned a flag 'Yes' to the similarity pair in the last column ( Figure 4B and C). ChEMBL assay information was retrieved via the Target-Mine data warehouse (22). Users can download not only the subset displayed in the tables, but also all pairs at the end of the tables ( Figure 4B and C).
Because PoSSuMds is fully integrated into PoSSuM, users can identify a target pocket similar to one of the binding pockets for the 194 compounds and proceed to additional searches and further investigation ( Figure 4A).

SUPERIMPOSITION VIEWER
In addition to browsing the statistical results, users can visualize individual superimposed pocket pairs (Figures 4B, C and 5). We have improved the superposition page, where the user can display/undisplay the query, the target, and can check corresponding amino acids based on the structural alignment, and can also download the 3D coordinates. As a demonstration, Figure 5 shows superimposed binding pockets for the example described in the section above. For the 3D molecular viewer, we employed JSmol (25), which was developed based on the HTML5 technology and which requires no enabling of Java in the user's web browser.

DISCUSSION AND FUTURE DIRECTIONS
In this update, we specifically examined approved smallmolecule drugs. We plan to expand the list of the query ligands to oral drugs, drug candidates and metabolites in the future. We currently use TM-align (26), which can align similar binding sites only in a sequence-order-dependent manner. However, fast and efficient pocket comparison methods have been proposed recently (27,28). Such methods should be used to compare poorly aligned pocket pairs by TMalign. Adopting such methods is expected to enhance our database. Another crucial factor is flexibility. Some drug compounds can bind to their receptors in various conformations, which in turn changes the shape and size of the binding pockets. To capture conformational changes, it is necessary to employ a method that incorporates binding pocket flexibility.