sc-PDB: a 3D-database of ligandable binding sites—10 years on

The sc-PDB database (available at http://bioinfo-pharma.u-strasbg.fr/scPDB/) is a comprehensive and up-to-date selection of ligandable binding sites of the Protein Data Bank. Sites are defined from complexes between a protein and a pharmacological ligand. The database provides the all-atom description of the protein, its ligand, their binding site and their binding mode. Currently, the sc-PDB archive registers 9283 binding sites from 3678 unique proteins and 5608 unique ligands. The sc-PDB database was publicly launched in 2004 with the aim of providing structure files suitable for computational approaches to drug design, such as docking. During the last 10 years we have improved and standardized the processes for (i) identifying binding sites, (ii) correcting structures, (iii) annotating protein function and ligand properties and (iv) characterizing their binding mode. This paper presents the latest enhancements in the database, specifically pertaining to the representation of molecular interactions and to the similarity between ligand/protein binding patterns. The new website puts emphasis on the pictorial analysis of data.


INTRODUCTION
The 3D structures of macromolecules, as collected by the Worldwide Protein Data Bank (PDB) organization (http://wwpdb.org, (1)), offer a wealth of information for computer-aided approaches to drug design. During the last 30 years, the steady increase of the PDB archive (2) has prompted the development of 3D methods for hit identification by virtual screening of chemical libraries, de novo ligand design and hit-to-lead. Many success stories have been reported in the literature (3). However, some proteins have never been efficiently modulated by chemical compounds despite intense efforts in medicinal chemistry. The concept of ligandability has thus been suggested to qualify the ability of a protein to bind a small molecular weight compound with high affinity (4,5). Recent studies demonstrated that simple geometric and physico-chemical descriptors of protein cavities (principally size, shape and polarity) are sufficient to predict structural ligandability (6-9).
The sc-PDB is a specialized structure database focused on ligand binding sites in ligandable proteins (10). We have selected from the PDB all proteins in complex with a small synthetic or natural ligand (140 Da < MW < 800 Da), provided that this ligand is well buried and biologically relevant and, since 2013, that the binding site is predicted to be ligandable by a machine learning-based model. The different stages of the database design process are detailed in the online documentation and summarized in Figure 1.
The first publicly available version of sc-PDB was released in 2004. The database has been updated annually, with regular additions of new features (see Supplementary Table S1 for a summary of changes since the database creation). Not only have the quality and precision of the data improved over the 10 years, but new tools have also allowed global analysis of the data. A major example is the clustering of sites for proteins present in multiple copies in the database (11). The new functionalities in sc-PDB, introduced after 2011, are discussed in detail in this paper.

sc-PDB CONTENT
The sc-PDB data are directly compatible with computational methods, such as docking, molecular mechanics and electrostatic calculations. Unlike the PDB, which generally neither represents hydrogen atoms nor defines the ionization state of titratable groups, the sc-PDB provides an all-atom model of molecules: (i) hydrogen atoms are added to amino acids considering that arginine and lysine are positively charged and aspartic and glutamic acids are negatively charged, (ii) hydrogen atoms are added to other residues according to ionized templates built from the HET group dictionary (12), (iii) the intermolecular H-bonding network is optimized using the BioSolveIT Hydescorer program (13). The overall processing of an original PDB entry yields atomic data for a single ligand, the protein chain(s) surrounding this ligand and its binding site (i.e. all protein residues with at least one heavy atom closer than 6.5 Å to any ligand heavy atom). Of note, the protein and binding site contain standard amino acids, and may include cofactor(s), metal ion(s) and covalently bound residue(s), such as carbohydrates. Last, each sc-PDB entry is characterized with functional and chemical annotations.
The current sc-PDB release contains 9283 entries, representing 3678 different UniProt (14) proteins and 5608 different HET ligands. The data set is non-redundant: although about 10% of ligands and almost half of proteins are present more than once in the database, each sc-PDB ligand/protein complex is unique. Fewer than 5% of proteins are encountered more than 10 times in the database, yet some of them have a very high copy number. The three most frequent proteins are HIV protease (248 entries), cyclin-dependent kinase 2 (180 entries) and beta-secretase 1 (155 entries). More statistics are given in Supplementary Figure S1, and at http://cheminfo.u-strasbg.fr/scPDB/ABOUT.
The total size of the compressed database is 1.5 GB. Its downloadable content is summarized in Table 1.
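The binding site definition given above (residues with at least one heavy atom closer than 6.5 Å to any ligand heavy atom) can be illustrated with a minimal Python sketch. The function name, the tuple layout of the atoms and the toy coordinates are assumptions made for illustration only; this is not the sc-PDB processing code.

```python
import math

CUTOFF = 6.5  # Å, the distance threshold quoted in the text


def binding_site_residues(protein_atoms, ligand_atoms, cutoff=CUTOFF):
    """Return sorted residue IDs with >= 1 heavy atom closer than `cutoff`
    to any ligand heavy atom.

    protein_atoms: iterable of (residue_id, element, (x, y, z))  # assumed layout
    ligand_atoms:  iterable of (element, (x, y, z))              # assumed layout
    """
    # Only heavy (non-hydrogen) ligand atoms take part in the distance test.
    ligand_heavy = [xyz for element, xyz in ligand_atoms if element != "H"]
    site = set()
    for residue_id, element, xyz in protein_atoms:
        # Skip hydrogens and residues that are already selected.
        if element == "H" or residue_id in site:
            continue
        if any(math.dist(xyz, lig) < cutoff for lig in ligand_heavy):
            site.add(residue_id)
    return sorted(site)


# Toy coordinates (hypothetical, not taken from any PDB entry).
protein = [
    ("ASP13", "O", (0.0, 0.0, 0.0)),
    ("GLY40", "C", (5.0, 0.0, 0.0)),
    ("LYS99", "N", (20.0, 0.0, 0.0)),
    ("LYS99", "H", (3.0, 0.0, 0.0)),  # hydrogen, ignored by the heavy-atom rule
]
ligand = [("C", (1.0, 0.0, 0.0)), ("H", (19.0, 0.0, 0.0))]
site = binding_site_residues(protein, ligand)
```

In this toy case only ASP13 and GLY40 are retained: the sole LYS99 atom within range is a hydrogen, and the ligand hydrogen near LYS99 does not count either.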

Depiction of protein-ligand complexes
The latest sc-PDB release enables the user to depict protein-ligand complexes according to different needs and complexity levels. For example, a medicinal chemist may be primarily interested in the PoseView 2D sketch highlighting the ligand structure, binding site boundaries and main interactions (Supplementary Figure S2A). A cheminformatician may focus on the nearby tabulated list of protein-ligand interactions, including the atoms involved and a full topological description (distance, angle) of each interaction (Supplementary Figure S2B). Last, a structural biologist can access a 3D picture of the complex embedded in the OpenAstex viewer (Supplementary Figure S2B) (15). The interaction table is graphically linked to the 3D picture: moving the mouse over any interaction line in the table interactively displays the corresponding interaction in the neighboring 3D picture.

Water molecules
Water is by essence the biological fluid. The role of water in molecular recognition events is not yet fully understood although it has been extensively studied experimentally and theoretically (see (16) for a comprehensive review). Observations made for water molecules at the binding interface between a drug and its protein target demonstrated that ordered solvent molecule(s) can either reinforce or, by contrast, weaken the stability of the complex depending on the studied system (17). In drug design, interfacial water molecules have a profound impact on calculations, both the inexpensive computational protocols, such as hit finding by high-throughput docking (18), and the more sophisticated algorithms, such as lead optimization using free-energy perturbation calculations (19).
Since 2012, a sc-PDB protein contains all water molecules that establish two or more hydrogen bonds with the binding site (i.e. donor-acceptor distance < 3.5 Å and 120° < donor-H-acceptor angle < 240°). These water molecules are expected to be difficult for a ligand to displace because of their tight binding to the protein (20). Water molecules are present in about two-thirds of sc-PDB complexes (Figure 2). The number of water molecules per site ranges from 1 to 10, but the distribution is largely biased toward smaller values. Although only a few of the selected water molecules interact directly with the ligand, using this information is key to structure-based design and drastically influences virtual screening, for example.
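The water selection rule stated above (two or more hydrogen bonds to the binding site, each with a donor-acceptor distance below 3.5 Å and a donor-H-acceptor angle between 120° and 240°) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names and the coordinate tuples are hypothetical, and the angle window is read here as a deviation of at most 60° from linearity, which for an unsigned angle in [0°, 180°] reduces to requiring an angle above 120°.

```python
import math


def angle_deg(a, b, c):
    """Unsigned angle a-b-c in degrees, with b as the vertex."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(y * y for y in v2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def is_hbond(donor, hydrogen, acceptor, max_dist=3.5, min_angle=120.0):
    """Geometric test using the thresholds quoted in the text.

    Assumption: an unsigned angle never exceeds 180 degrees, so the 240 degree
    upper bound is treated as the mirror image of the 120 degree lower bound.
    """
    return (math.dist(donor, acceptor) < max_dist
            and angle_deg(donor, hydrogen, acceptor) > min_angle)


def keep_water(candidate_bonds, min_bonds=2):
    """Keep a water molecule if it makes at least `min_bonds` hydrogen bonds.

    candidate_bonds: list of (donor_xyz, hydrogen_xyz, acceptor_xyz) triples
    between the water molecule and binding site atoms (hypothetical layout).
    """
    return sum(1 for d, h, a in candidate_bonds if is_hbond(d, h, a)) >= min_bonds


# Toy geometries (hypothetical coordinates).
linear = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.8, 0.0, 0.0))  # 180 degrees, 2.8 Å
bent = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0))    # 90 degrees
kept = keep_water([linear, linear])
dropped = keep_water([linear, bent])
```

A water molecule with two well-oriented contacts is kept, whereas one with a single valid hydrogen bond is discarded.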

Query for similar binding sites
The molecular basis of the ligand/protein recognition gives insights into the specificity of a drug for its target protein. For example, structural variations in the binding site may explain the permissive binding of different ligands to a single protein. As mentioned in the Introduction, we have previously addressed this issue by analyzing the multiple binding sites in a given protein (11). The sc-PDB clusters of binding sites can reveal differences in location, size, composition or 3D structure. For example, clustering the sc-PDB sites of adenylosuccinate synthetase yields three clusters: two of them have similar structures and compositions except for the guanosine diphosphate (GDP) and Mg2+ cofactors; the third one is localized in a different region of the protein (Supplementary Figure S3). Other high-quality databases derived from the PDB also facilitate the comparison of binding sites across a protein family (21-23). The sc-PDB database is, however, the only meta-database that enables searching the PDB using user-defined queries mixing protein, ligand, binding site and binding mode properties. For example, a single query in the sc-PDB enables the selection of all protein-ligand complexes for which (i) the target is a protein kinase, (ii) the ligand is a fragment with a molecular weight between 150 and 300 Da, (iii) the binding site comprises at least one bound water molecule, (iv) the ligand is neutral and contacts its target by one aromatic face-to-face interaction.
Nucleic Acids Research, 2015, Vol. 43, Database issue D403
Local structural similarity between non-homologous proteins can account for the promiscuity of a ligand, and thus can help explain the side effects of a drug or suggest its repositioning toward a novel target and therapeutic indication (24,25). The sc-PDB database now enables the identification of similar sites in distinct proteins using a pre-computed all-against-all comparison with the in-house developed Shaper algorithm (8). The sc-PDB website allows the user to query the matrix of scores for any given sc-PDB site. It displays the distribution of scores and lists the entries whose similarity score is higher than a given threshold (default value is 0.44). For example, the binding site for phosphomethylphosphonic acid-guanylate ester in Escherichia coli adenylosuccinate synthetase (PDB ID: 1HOP, HET: CGP) shares significant 3D similarity with a single site in sc-PDB, that of GTP in a murine homologous protein (Supplementary Figure S4).
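Listing the entries whose similarity score exceeds a threshold, as the website does for a query site, amounts to filtering one row of the pre-computed score matrix. The Python sketch below illustrates this under stated assumptions: the function name, the dictionary layout and the entry identifiers are hypothetical and the scores are invented; only the default threshold of 0.44 comes from the text.

```python
def similar_entries(scores, threshold=0.44):
    """Return (entry_id, score) pairs scoring strictly above `threshold`,
    ordered from most to least similar.

    scores: dict mapping an entry identifier to its similarity score against
    the query site (assumed layout of one row of the score matrix).
    """
    hits = [(entry, score) for entry, score in scores.items() if score > threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Invented scores for a hypothetical query site.
row = {"site_A": 0.91, "site_B": 0.44, "site_C": 0.52, "site_D": 0.10}
hits = similar_entries(row)
```

The comparison is strict because the text speaks of scores "higher than" the threshold, so an entry scoring exactly 0.44 is not listed here. The same function could be applied with another threshold to a different score matrix.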

Query for similar binding patterns
The non-bonded interactions between a ligand and its protein define a 3D pattern that characterizes the binding mode. We have recently developed a new geometrical method to encode and compare protein-ligand interaction patterns (26). Briefly, each interaction is represented by three points: the interacting ligand atom, the interacting protein atom and a pseudo-atom at the geometric center of these two atoms. Each interaction is assigned a molecular type according to the type of non-bonded interaction (hydrophobic, aromatic, hydrogen bond, ionic bond, metal-ion bond). The 3D pattern is defined by all the triplets of interaction pseudo-atoms, and graph theory is applied to find the maximal common subgraph (clique) between two 3D patterns. The similarity score evaluates the quality of overlap after 3D alignment of the two patterns. Using this approach, we recently demonstrated that the protein-ligand binding mode is generally conserved within a family of homologous proteins even though the bound ligands are dissimilar.
The sc-PDB database now enables the identification of similar 3D patterns in distinct complexes; the all-against-all comparison of sc-PDB complexes was computed using the program Grim (26). The sc-PDB website allows the user to query the matrix of scores for any given sc-PDB ligand/protein binding mode. It displays the distribution of scores and lists the entries whose similarity score is higher than the threshold selected on the distribution (default value is 0.65). For example, the binding mode of phosphomethylphosphonic acid-guanylate ester to E. coli adenylosuccinate synthetase (PDB ID: 1HOP, HET: CGP) shares significant similarity with 25 complexes in sc-PDB, representing 19 different proteins bound to GDP, GTP or close analogs. The two top scorers are, respectively, a homologous protein in wheat (PDB ID: 1DJ3) and the functionally unrelated signal recognition particle protein (PDB ID: 1RJ9, Figure 3).
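The encoding and matching idea summarized above can be sketched in Python with a generic clique search. This is not the published implementation (26): the correspondence-graph construction, the Bron-Kerbosch search, the 1.0 Å distance tolerance, the choice to match only the central pseudo-atoms, and all names and coordinates are assumptions for illustration. The alignment-based similarity score is not reproduced; the sketch stops at the set of matched interactions.

```python
import math
from itertools import combinations


def interaction_points(ligand_xyz, protein_xyz, kind):
    """Encode one interaction as three points: ligand atom, protein atom and
    a pseudo-atom at their geometric center."""
    center = tuple((l + p) / 2 for l, p in zip(ligand_xyz, protein_xyz))
    return {"kind": kind, "ligand": ligand_xyz, "protein": protein_xyz, "center": center}


def correspondence_graph(pattern_a, pattern_b, tol=1.0):
    """Nodes pair interactions of the same type; an edge joins two pairs when
    the center-to-center distances agree within `tol` Å (assumed tolerance)."""
    nodes = [(i, j) for i, a in enumerate(pattern_a)
             for j, b in enumerate(pattern_b) if a["kind"] == b["kind"]]
    adj = {n: set() for n in nodes}
    for (i1, j1), (i2, j2) in combinations(nodes, 2):
        if i1 == i2 or j1 == j2:
            continue  # one interaction cannot be matched twice
        da = math.dist(pattern_a[i1]["center"], pattern_a[i2]["center"])
        db = math.dist(pattern_b[j1]["center"], pattern_b[j2]["center"])
        if abs(da - db) <= tol:
            adj[(i1, j1)].add((i2, j2))
            adj[(i2, j2)].add((i1, j1))
    return adj


def max_clique(adj):
    """Largest clique found by the Bron-Kerbosch algorithm (no pivoting)."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = list(r)
            return
        for v in list(p):
            expand(r + [v], p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand([], set(adj), set())
    return best


# Two toy patterns (hypothetical coordinates); b is a translated copy of a
# in which the ionic interaction is replaced by an aromatic one.
a = [interaction_points((0, 0, 0), (0, 0, 2), "hbond"),
     interaction_points((4, 0, 0), (4, 0, 2), "hydrophobic"),
     interaction_points((0, 3, 0), (0, 3, 2), "ionic")]
b = [interaction_points((10, 0, 0), (10, 0, 2), "hbond"),
     interaction_points((14, 0, 0), (14, 0, 2), "hydrophobic"),
     interaction_points((10, 9, 0), (10, 9, 2), "aromatic")]
matched = sorted(max_clique(correspondence_graph(a, b)))
```

Here the hydrogen bond and the hydrophobic contact are matched between the two patterns, while the ionic and aromatic interactions have no counterpart of the same type. Exhaustive clique enumeration is only practical for small patterns such as these.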

A new interface
The main architecture of the database has not been changed, but the sc-PDB website has been completely redesigned to enhance interactivity. For every entry, the user can navigate in the main menu and directly switch views in the same window, focusing on either a simple description of the entry, or a full characterization of the ligand or its binding site. Only searches for similar binding sites or binding modes open a new window with the rank-ordered list of sc-PDB hits corresponding to the query. In almost all sections of the web interface, molecules (protein, ligand, binding site), interaction patterns, tabulated results (hit lists, protein-ligand interactions) and charts (ligand and binding site properties, distribution of similar binding sites or binding modes) can be downloaded in the relevant file format (mol2, xlsx, csv, tsv, png, jpg, svg, pdf). For binding site and binding mode similarity searches, the aligned molecules are also downloadable (Table 1).