Abstract

Motivation

Three-dimensional protein structures are important starting points for elucidating protein function and for applications such as drug design. Computational methods in this area rely on high-quality validation datasets, which are usually assembled manually. Given the growing number of published structures and the increasing demand for specially tailored validation datasets, automatic procedures should be adopted.

Results

StructureProfiler is a new tool for automatic, objective and customizable profiling of X-ray protein structures based on the selection criteria most frequently used to assemble benchmark datasets. As examples, four dataset configurations (Astex, Iridium, Platinum, combined), all results of the combined tests and the list of all PDB IDs passing the combined criteria set are provided in the Supplementary Material.

Availability and implementation

StructureProfiler is available as part of the ProteinsPlus web service http://proteins.plus and as a standalone tool in the NAOMI ChemBio Suite. Dataset updates together with the tool can be found at http://www.zbh.uni-hamburg.de/structureprofiler.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Three-dimensional structure models are the foundation of structural bioinformatics. The information content of a protein structure is tightly coupled to the richness of the supporting experimental data. Depending on the application scenario, different sets of quality criteria should be applied to select structure collections. Use cases include molecular dynamics simulations as well as docking and scoring of protein-ligand complexes. Developing methods in this area demands large datasets to allow statistically sound validation. For this purpose, the Astex Diverse Set (85 protein-ligand complexes, Hartshorn et al., 2007), Iridium HT (207 protein-ligand complexes, Warren et al., 2012) and the Platinum dataset (4548 ligands for bioactive conformation prediction, Friedrich et al., 2017) were created, spanning a decade of available structures in the Protein Data Bank (PDB, Gutmanas et al., 2014). For all three sets, the authors published extensive information about their selection criteria and multi-tool chains. The selection criteria catalog for Astex controls some model parameters, such as the resolution, as well as ligand characteristics [Lipinski’s Rule of 5 by Lipinski et al. (1997)]. The Iridium dataset is based on criteria intended to be applied on top of the Astex criteria and emphasizes the necessity of available high-quality experimental data. The Platinum dataset adds controls for the diffraction precision index (DPI, Goto et al., 2004), Rfree, and bond angle and length deviations of the ligand. Whereas Astex and Iridium needed manual curation to check a structure’s electron density support, Platinum uses the newly developed electron density score for individual atoms and molecular fragments (EDIAm, Meyder et al., 2017) to automate this estimate objectively at the atomic level. While Platinum receives frequent updates, both Astex and Iridium remain static due to the high manual workload.
Additionally, none of the tool chains of the three datasets is readily accessible or modifiable. To address these demands, we developed StructureProfiler as part of the NAOMI ChemBio Suite. We provide configurations that closely resemble the selection criteria of Astex, Iridium and Platinum. StructureProfiler is also integrated in our free web service ProteinsPlus (Fährrolfes et al., 2017; Fig. 1).

Fig. 1. PDB ID 1nax evaluated by StructureProfiler with the combined criteria set

2 Materials and methods

All selection criteria available in StructureProfiler are listed in Supplementary Table S1. The real-space correlation coefficient (RSCC) is implemented as published by Jones et al. (1991). Because implementations of the RSCC vary [see Tickle (2012) for a discussion of this topic], we use EDIAm, or a tailored variant in the case of Iridium, as a clearly defined and reproducible estimate of electron density support. The Iridium dataset allows up to two heavy atoms to lack electron density support. We therefore adjusted EDIAm to leave out the two worst-scored atoms and refer to this variant as EDIAi. We also added selection criteria such as the B-factor distribution to extend the criteria catalog beyond those of the three datasets.
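The idea of tolerating the two worst-scored atoms can be illustrated with a minimal Python sketch. It is not the StructureProfiler implementation: the function names are hypothetical, and the shifted power mean used to aggregate per-atom scores (shift 0.1, exponent -2) is an assumption made for this example, so the exact EDIAm definition should be taken from Meyder et al. (2017).

```python
from typing import Sequence


def power_mean_score(atom_scores: Sequence[float],
                     shift: float = 0.1,
                     exponent: float = -2.0) -> float:
    """Aggregate per-atom scores with a shifted power mean.

    ASSUMPTION: this aggregate and its parameters are illustrative only.
    A negative exponent makes low-scoring atoms dominate the result.
    Scores are expected to be non-negative so that score + shift > 0.
    """
    if not atom_scores:
        raise ValueError("need at least one atom score")
    mean = sum((s + shift) ** exponent for s in atom_scores) / len(atom_scores)
    return mean ** (1.0 / exponent) - shift


def score_ignoring_worst(atom_scores: Sequence[float],
                         n_ignored: int = 2) -> float:
    """Drop the n_ignored lowest-scored atoms, then aggregate the rest."""
    kept = sorted(atom_scores)[n_ignored:]
    if not kept:
        raise ValueError("no atoms left after ignoring the worst ones")
    return power_mean_score(kept)


# Hypothetical per-atom scores for one ligand: two atoms are poorly supported.
scores = [0.90, 0.95, 0.10, 0.20, 1.00, 0.85]
strict = power_mean_score(scores)        # all atoms count
tolerant = score_ignoring_worst(scores)  # two worst atoms are left out
print(f"strict={strict:.2f} tolerant={tolerant:.2f}")
```

With these hypothetical values the tolerant score is clearly higher than the strict one, which is the behavior needed to mirror a criterion that accepts up to two unsupported heavy atoms.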

3 Results

StructureProfiler was validated against the three aforementioned datasets. In the following, the most important discrepancies per dataset are briefly discussed; full results can be found in Supplementary Section S2. Electron density maps were downloaded from PDBe (Gutmanas et al., 2014). As a final application, we profiled the PDB (downloaded on 2018-02-21, X-ray structures with a maximum resolution of 3.5 Å) with the combined criteria set. The PDB IDs, annotated with ligand identifiers, that currently pass the combined filter criteria are given in the Supplementary Material. We plan to update this list regularly and to provide all test results at http://www.zbh.uni-hamburg.de/structureprofiler.
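The preselection described above (X-ray structures up to 3.5 Å) amounts to a simple filter over entry metadata. The following Python sketch assumes a hypothetical `EntrySummary` record and placeholder identifiers; treating the 3.5 Å limit as inclusive is also an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class EntrySummary:
    """Hypothetical per-entry metadata record (not a StructureProfiler type)."""
    pdb_id: str
    method: str                  # experimental method as annotated in the entry
    resolution: Optional[float]  # in Angstrom; None if not reported


def preselect(entries: Iterable[EntrySummary],
              max_resolution: float = 3.5) -> List[str]:
    """Keep IDs of X-ray entries with a resolution of at most max_resolution."""
    return [
        e.pdb_id
        for e in entries
        if e.method.upper() == "X-RAY DIFFRACTION"
        and e.resolution is not None
        and e.resolution <= max_resolution
    ]


# Placeholder entries; IDs and values are made up for illustration.
entries = [
    EntrySummary("aaaa", "X-RAY DIFFRACTION", 1.8),
    EntrySummary("bbbb", "X-RAY DIFFRACTION", 3.9),
    EntrySummary("cccc", "SOLUTION NMR", None),
]
print(preselect(entries))
```

Only entries that survive this step would then be passed to the more expensive per-ligand and per-site criteria.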

3.1 Astex diverse set

All ligands in the Astex set need to fulfill Lipinski’s Rule of 5. In G17905 (905, 1ygc), we detected 8 Lipinski donors and 11 acceptors (Supplementary Fig. S2). Furthermore, a (CH2)4 linker is prohibited but present in DFPP-G (HA1, 1v48, Supplementary Fig. S1). As EDIA is more sensitive to atoms inconsistently supported by electron density, four ligands with low EDIAm values were detected (Meyder et al., 2017). Additional information can be found in Supplementary Section S2.1.
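To make the donor and acceptor counts concrete, the Rule of 5 thresholds (at most 5 hydrogen bond donors, at most 10 acceptors, molecular weight at most 500 Da, logP at most 5) can be checked as in the Python sketch below. The class and function names are hypothetical, descriptor calculation is assumed to happen elsewhere, and how many violations a configuration tolerates is left open.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LigandDescriptors:
    """Hypothetical container for precomputed ligand descriptors."""
    h_bond_donors: int
    h_bond_acceptors: int
    molecular_weight: Optional[float] = None  # Da; None if not computed
    log_p: Optional[float] = None             # None if not computed


def rule_of_five_violations(lig: LigandDescriptors) -> List[str]:
    """Return the names of the Rule of 5 thresholds the ligand exceeds."""
    violations = []
    if lig.h_bond_donors > 5:
        violations.append("donors")
    if lig.h_bond_acceptors > 10:
        violations.append("acceptors")
    if lig.molecular_weight is not None and lig.molecular_weight > 500.0:
        violations.append("molecular_weight")
    if lig.log_p is not None and lig.log_p > 5.0:
        violations.append("log_p")
    return violations


# Counts reported above for G17905; weight and logP are not given in the
# text and are therefore left unset.
print(rule_of_five_violations(LigandDescriptors(h_bond_donors=8,
                                                h_bond_acceptors=11)))
```

For the counts reported for G17905, both the donor and the acceptor threshold are exceeded.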

3.2 Iridium HT

Twelve ligands with more than two atoms inconsistently supported by electron density with regard to EDIAi were found (Supplementary Fig. S4). In addition, four cases with crystal symmetry contacts closer than 6 Å were detected (Supplementary Fig. S3). In the case of Alpha(2,3)-Sialyllactose in chain C, asparagine E 7 is only 2.48 Å away. Furthermore, three active sites have atoms with an occupancy below 1 (Supplementary Section S2.2). One of these residues is in close proximity to the ligand and therefore does not meet the requirements of the Iridium HT set.
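Both checks in this subsection reduce to simple geometric or per-atom tests once symmetry mates have been generated. The Python sketch below assumes that ligand and symmetry-mate coordinates are already available as Cartesian tuples in Å; the function names and the example coordinates are hypothetical, and the 6 Å default mirrors the cutoff quoted above.

```python
import math
from typing import Iterable, Sequence, Tuple

Coord = Tuple[float, float, float]


def min_distance(ligand_atoms: Sequence[Coord],
                 mate_atoms: Sequence[Coord]) -> float:
    """Smallest pairwise distance between two non-empty atom sets."""
    return min(math.dist(a, b) for a in ligand_atoms for b in mate_atoms)


def has_symmetry_contact(ligand_atoms: Sequence[Coord],
                         mate_atoms: Sequence[Coord],
                         cutoff: float = 6.0) -> bool:
    """True if any symmetry-mate atom lies closer than cutoff to the ligand."""
    return min_distance(ligand_atoms, mate_atoms) < cutoff


def has_partial_occupancy(occupancies: Iterable[float]) -> bool:
    """True if at least one atom has an occupancy below 1."""
    return any(occ < 1.0 for occ in occupancies)


# Made-up coordinates chosen so that the closest contact is 2.48 Angstrom.
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
mate = [(3.98, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(has_symmetry_contact(ligand, mate))
print(has_partial_occupancy([1.0, 1.0, 0.5]))
```

A production check would typically use a spatial index instead of the quadratic pairwise loop, but the loop keeps the criterion itself easy to read.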

3.3 Platinum

We detected 196 ligands without full occupancy and 240 EDIAm violations. In addition to an EDIAm software update, we switched from the now defunct Electron Density Server (EDS) to PDBe for retrieving the maps. In an extreme case, AO1 (1r5g), this resulted in an EDIAm score drop from 0.84 (good) to 0.54 (medium). A discussion of the bond length and angle violations can be found in Supplementary Section S2.3. We also checked the Platinum set against the combined criteria set and detected, among other violations, 19 intermolecular clashes between active site and ligand. This shows that datasets suitable for one use case may not be suitable for others.

3.4 Usage

StructureProfiler is available as part of our ProteinsPlus web service. Enter the PDB ID of interest into the text field and then select the StructureProfiler tool on the right side of the web page. One of the four configurations (astex-like, iridium-like, platinum-like and combined) can be selected. Failed tests and substructures with at least one failed test are marked in red. All results and configuration files can be downloaded as INI/CSV files. The usage description of the customizable command line tool can be found in Supplementary Section S1.1.

4 Conclusion

StructureProfiler assembles the currently relevant catalog of structure quality criteria in a configurable, easy-to-use standalone tool, which is also available on the web. It allows rapid screening of in-house data as well as easy repeated screening of public databases. Through the use of EDIA, it reduces human curation of electron density support to a minimum, thereby removing the existing bottleneck in dataset curation. StructureProfiler is a next step towards the creation of large, high-quality datasets for docking, 3D-QSAR and the many machine-learning-based applications now emerging.

Conflict of Interest: none declared.

References

Fährrolfes R. et al. (2017) ProteinsPlus: a web portal for structure analysis of macromolecules. Nucleic Acids Res., 25, 1–7.

Friedrich N.-O. et al. (2017) High-quality dataset of protein-bound ligand conformations and its application to benchmarking conformer ensemble generators. J. Chem. Inform. Model., 57, 529–539.

Goto J. et al. (2004) Ph4Dock: pharmacophore-based protein-ligand docking. J. Med. Chem., 47, 6804–6811.

Gutmanas A. et al. (2014) PDBe: Protein Data Bank in Europe. Nucleic Acids Res., 42, D285–D291.

Hartshorn M.J. et al. (2007) Diverse, high-quality test set for the validation of protein-ligand docking performance. J. Med. Chem., 50, 726–741.

Jones T.A. et al. (1991) Improved methods for building protein models in electron density maps and the location of errors in these models. Acta Crystallographica Section A, 47, 110–119.

Lipinski C.A. et al. (1997) Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Delivery Rev., 23, 3–25.

Meyder A. et al. (2017) Estimating electron density support for individual atoms and molecular fragments in X-ray structures. J. Chem. Inform. Model., 57, 2437–2447.

Tickle I.J. (2012) Statistical quality indicators for electron-density maps. Acta Crystallographica Section D, 68, 454–467.

Warren G.L. et al. (2012) Essential considerations for using protein-ligand structures in drug discovery. Drug Discov. Today, 17, 1270–1281.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Alfonso Valencia