Ligandbook: an online repository for small and drug-like molecule force field parameters

Abstract Summary Ligandbook is a public database and archive for force field parameters of small and drug-like molecules. It is a repository for parameter sets that are part of published work but are not easily available to the community otherwise. Parameter sets can be downloaded and immediately used in molecular dynamics simulations. The sets of parameters are versioned with full histories and carry unique identifiers to facilitate reproducible research. Text-based search on rich metadata and chemical substructure search allow precise identification of desired compounds or functional groups. Ligandbook enables the rapid set up of reproducible molecular dynamics simulations of ligands and protein-ligand complexes. Availability and Implementation Ligandbook is available online at https://ligandbook.org and supports all modern browsers. Parameters can be searched and downloaded without registration, including access through a programmatic RESTful API. Deposition of files requires free user registration. Ligandbook is implemented in the PHP Symfony2 framework with TCL scripts using the CACTVS toolkit. Supplementary information Supplementary data are available at Bioinformatics online.


IMPLEMENTATION DETAILS
Ligandbook is written in PHP with the Symfony2 framework. It stores object annotations in a MYSQL relational database while deposited files use filesystem-backed storage (see Supplementary  Information Fig. 1). The PHP code interfaces with TCL scripts that use the CACTVS toolkit (Ihlenfeldt et al., 1994) for the cheminformatics functionality. Chemical structures for structurebased queries are drawn with the CACTVS Sketcher (Ihlenfeldt et al., 2009). The text search uses the Elasticsearch engine (https: //github.com/elastic/elasticsearch). Elastica indexes are created for a range of information, including but not exclusive to: canonical IUPAC name, common names, formula, molecular weight, charge, number of atoms, SMILES, PubChem CID, PDB ligand ID, CAS RN. Indexing these fields permits rapid sifting and searching of the repository.
The security of the server and of data contained is ensured by secured connections (SSL encryption), use of CAPTCHA for user identification and email address validation when the user account is created. User passwords are salted and hashed with a functionality provided by Symfony2. The access to database objects Fig. 1: Software architecture and information flow. Users interact through the Ligandbook web frontend to deposit parameter files or query the database to retrieve parameter files (structures and topologies). Internally, Ligandbook uses the CACTVS toolkit to process chemical structures.
is restricted via the use of Access Control Lists (ACLs) that define the user "OWNER" of a particular resource. Other users commonly have only "VIEW" permissions. Ligandbook maintainers have "ADMIN" permissions, with access to all data structures.

Software architecture
The software architecture of the Ligandbook repository is represented in Figure 1. The user-facing web frontend is built with the Symfony2 PHP framework (https://symfony.com/). Symfony2 provides high-quality, maintainable, and upgradable components for web applications following the model-viewcontroller (MVC) pattern. By using a reliable foundation, we are able to maintain the site for the long term at low costs and with low developer overhead. For instance, security updates can be easily incorporated. Symfony also comes with numerous components such as authentication, emailer (for registration and password recovery emails), a rich text editor widget, and an admin backend that are all used inside Ligandbook. Symfony2 also abstracts the access to the underlying MYSQL database. The CACTVS cheminformatics toolkit (Ihlenfeldt et al., 1994) (http://www.xemistry.com/) provides the functionality to identify compounds from 3D structures, perform substructure search, generate 2D structure images, and fetch metadata from PubChem. The PHP code in the web frontend communicates to CACTVS through scripts written in TCL.
The text search is performed with the Elasticsearch engine, which is built with the Apache Lucene Core search engine library (McCandless et al., 2010). Therefore, our text search supports the advanced Apache Lucene syntax with Boolean operators, grouping, wildcards, fuzzy search, and search by specific fields.
Users interact with the underlying database through the web frontend. A guiding principle of Ligandbook is that chemical structures are accurately represented and the whole site is "chemically aware", i.e. compounds can be searched by structure and are identified based on their chemistry. On deposition, the 3D coordinate file submitted by the user is parsed and represented as a 2D chemical structure. This step enables the user to spot and correct any mistakes that can occur when deriving bond order information from 3D structures. The chemical structure is then used to derive most of the meta data and to link to PubChem and to other data bases. The most accurate way to query the database is the chemical search whereby the user draws the 2D chemical structure (or provides a SMILES string to have the structure drawn).
The chemical structure is also required to automatically generate the chemical structure depictions (in SVG format). Two depictions are made available as zoomable images. A simplified representations only shows the heteroatoms explicitly. A detailed representation includes all atoms and labels each atom explicitly, which aids in identifying atoms within structure and topology files.

Data structures and versioning
Compounds Ligandbook associates an individual molecule, here named a compound, with meta data about the molecule itself and user-provided data. Compounds are described by their chemical 2D structure. Compound details are generated by CACTVS or fetched from external sources. Each compound is uniquely identified with a hash key (HASHISY) as generated by CACTVS. If available, the compound name is taken, in order of availability, from the first CommonName, the IUPAC name or the PDB.
In order to facilitate validation of of parameter sets, an arbitrary number of reference values can be associated with a compound. A reference value is an observable such as an experimental hydration free energy or a high-level quantum mechanical calculation of a completely unvalidated, the parameters run in a simulation, but no properties of the system have been validated 2 minimally validated, structural properties are acceptable but no explicit comparison to experiment was done 3 reasonable parameters that agree with 1-2 experimental measurements, major problems are apparent 4 strong agreement between a number of reference and computed properties, minor problems 5 perfect agreement between multiple experimental and computed values torsional energy barrier. Each reference value consists of a text description, a numerical value, the units, and a literature citation.
Packages and versions A package contains meta data and one or more versions. A version is a container for structure and topology files. If there are reference values asscociated with the compound then there should also be corresponding computed values stored with a version. These computed values are the values of the reference observable as produced by the particular parameter set contained in the version. Users may compare the computed values to the reference values in order to judge the quality of the parameters. Additionally, the depositor may assign a subjective reliability rating to a version (typically in conjunction with reference and computed values). For simplicity, a simple integer scale from 1 to 5 was adopted, together with a free form text field suitable for a short justification of the rating; the intended meaning of the scale is listed in Table 1. There can be multiple versions in a package, each with its own unique identifier. The versions make up the history of a parameter set. The files in versions are stored together with SHA1 checksums so that users can immediately identify files that changed between versions. Individual versions of parameters can be accessed through the web interface or through programmatic access (see Section 1.4).

Site structure
At the top of the Ligandbook home page (Fig. 2a) and all other top level pages contain a menu with the major functionalities:

Home
link to the home page ( Fig. 2) Search text search or search by 2D structure with the CACTVS Sketcher (see Section 3) Reorder search with a 3D structure and return a structure whose atom are reordered so that it can be used with matching topology files found in the repository (see  The top bar contains a text search box and links for registration and log in and underneath the main menu. The graphic slider shows various statistics about the molecules in the repository and is updated nightly. The "lucky choices" are randomly generated links to packages in the repository and invite the casual user to browse and explore the site. The footer menus provide links to specific documentation pages, the wiki, and the issue tracker. All pages display the same top bar and main menu as well as the footer menu and thus make it easy for the user to navigate the site.
By default, users can access the site anonymously. In order to deposit a package or in order to curate an existing package, users have to register with a valid email address and log in with their user name and password. The Register and Login menu items in the top right corner provide this functionality. If users forget their password, they can reset it by themselves and obtain a temporary password through their registered email address.
Further functionality and technical details of the repository are described in more detail in Documentation and Support sections of the web page that can be directly accessed from the footer of every page (Fig. 2b). The Getting started page contains step-by-step tutorials for commonly used operations. A dedicated Wiki and an Issue tracker are present to facilitate discussion with the community. The Policy governing the repository is also clearly spelt out.

Programmatic access
In order to increase interoperability with other tools and scripts, Ligandbook defines a RESTful API for searching the repository using URL-based queries. Results can be retrieved in YAML, XML and JSON formats for further automatic downstream processing. For instance, a text search for "benzene" is accessed with the URL In Python, a query can be easily parsed into a dictionary, using the yaml package: import urllib2 import yaml url = "https://ligandbook.org/search/benzene/results.C yml" response = urllib2.urlopen(url) results = yaml.load(response) The variable results will hold a data structure that contains the complete query. For instance, the total number of packages found is Any GET URL that is produced for a search in Ligandbook can thus be used to access the returned matches in a programmatic fashion. From the data one can then construct the URL to download the package. Using the benzene example from above, one can use the following Python code to select the first package and then download the contents of the latest version inside the package as a zip file (here just named "download.zip"): Programmatic access should enable other tools to directly use parameters from Ligandbook. For instance, in the future the Gromacs  topology generation tool pdb2gmx could query Ligandbook for parameters of small molecules that are not part of the included standard force field files and automatically create the simulation-ready topology for a protein with its ligand.

LICENSES
To credit the original parameters developers, upon creating a package users must accept to license the parameters under the CC BY-SA (Creative Commons Attribution-ShareAlike, https:// creativecommons.org/licenses/by-sa/2.0/), which ensures that the parameters will always be available as Open Data (https://okfn.org/opendata/). Meta data is published under the CC0 1.0 Universal (CC0) Public Domain Dedication (https://creativecommons.org/publicdomain/zero/ 1.0/) and effectively puts these data into the public domain, which would be necessary for future enhancements to Ligandbook such as assigning DOIs for individual parameter sets.

SEARCHING
One of the most important functionalities in Ligandbook is the search. Standard text search can be used for broad queries that potentially yield hundreds of results. Text search with fields and boolean operators can narrow down results substantially. However, the chemical exact and substructure search sets Ligandbook apart from all other sites that provide parameter sets. It allows the user to precisely search by the chemistry of the compounds, as encoded by a chemical 2D structure. Search results are presented as a list of packages with the package name, a zoomable image of the compound structure and links to corresponding entries in the Protein Data Bank (PDB) (Berman et al., 2000), PubChem (Bolton et al., 2008) and Common Chemistry (http://www. commonchemistry.org/).

Text search
Simple text search All meta data fields are indexed and can be searched by entering search terms into the search field that is present on each page and also on the Search page. For instance, we can search for isobutylphenylpropanoic acid, a nonsteroidal anti-inflammatory drug better known as ibuprofen.
1.Enter ibuprofen in the search box and select Search. 2.A list with results is shown (Fig. 3). We find two hits: the (S) and (R) enantiomers of isobutylphenylpropanoic acid. The

Searching by chemical structure
A key feature of Ligandbook is the capability to search by chemical structure. This enables a user to accurately describe the chemistry of compounds instead of relying on text matches, which are rarely sufficiently precise. Let's assume we suspect that the active core of ibuprofen is represented by ethylphenylpropanoic acid. We can search for ibuprofen and related compounds by a chemical substructure search with the active core.
1.Under the Search menu, sketch the 2D structure in the Sketcher window (or enter the SMILES C1=CC=CC=C1C(C(=O)O)C), as shown in Fig. 5a. The chemical structures of common compounds can also be fetched from PubChem (Bolton et al., 2008) by entering the compound name as a quoted string (e.g., "ibuprofen"). 2.Browse the list of results that are shown under the Sketcher window (Fig. 5b). As for the text search, we find the two enantiomers. Select the compound and parameter set of interest.
The most accurate way to search Ligandbook is to provide the query chemical structure in the Sketcher and to perform a search for an exact match. The Sketcher recognizes many compounds by name when entered into the SMILES text entry field as a quoted string. For example, ibuprofen can be entered as "ibuprofen" (a) Chemical structure and package details.
(b) Compound details and License. and the Sketcher will immediately display the chemical structure. In conjunction with the Exact search, Ligandbook can be chemically accurately queried.

SETTING UP A MOLECULAR DYNAMICS SIMULATION
Ligandbook makes it easy to set up molecular dynamics simulations to probe protein-ligand interactions. As an example we show how to study the interaction of the anti-inflammatory drug aspirin with a potential target enzyme, phospholipase A2 (PLA2) (Burke and Dennis, 2009), using the Gromacs software package Abraham et al., 2015). A crystal structure of aspirin in  complex with PLA2 (PDB ID 1OXR) was solved to 1.9Å resolution (Singh et al., 2005). Ligandbook contains two parameter sets for aspirin in the OPLS-AA force-field under packageIDs 785 (neutral form) and 2926 (anionic form). However the atom order of the parameter files is different from what is available in the PDB structures of the aspirinprotein complex. The user has to re-order the atoms in the structure to match the order in the topology file. This technical requirement is error-prone and tedious and creates an additional barrier to using parameters from a repository. Ligandbook overcomes this problem with its Reorder function that can re-order PDB structure files to match parameter files.
In order to completely define the structure, the user should add hydrogens to the ligand and so completely determine the protonation state. For example, using Schrödinger's Maestro PrepWizard one could protonate the AIN ligand to obtain the anionic acetylsalicylate base (charge -1) and save it as the file 1oxr resname-AIN Hs.pdb. (Alternatively, one could also use the open source RDKit (http://www.rdkit.org/) to protonate the ligand.) It is also possible to skip the protonation step and provide a structure without hydrogens. In this case, Ligandbook will add hydrogens and return all possible matches for different Example of a simulation with Ligandbook parameters for aspirin (2-acetyloxybenzoate, LBID 2926). A Phospholipase A2 (PLA2) with aspirin bound via calcium ion (yellow) and a hydrogen bond to W19. B Root mean square distance (RMSD) timeseries showing the apo enzyme (gray) and the PLA2-aspirin complex (black) as well as the RMSD for aspirin in the reference frame of the RMS-fitted protein (red). The complex shows smaller fluctuations than the free protein. C Probability distribution of the minimum distances between oxygen atoms and the calcium ion. D Probability distribution of the length of the hydrogen bond to W19. The arrow highlights the population of strong hydrogen bonds that is established towards the last ∼10 ns of the 100 ns MD simulation.
forms of the query molecule but this approach can be error prone and for best results it is recommended to always manually protonate the query structure.
Under the Reorder menu, upload the ligand-only PDB (1oxr resname-AIN.pdb or 1oxr resname-AIN Hs.pdb) to get the re-orderd parameter files.
If the ligand was not protonated manually, Ligandbook will generate re-ordered parameters for all protonation states of the molecule in the database and show all matching packages. For a protonated ligand, only exactly matching structures will be shown. The user downloads the desired re-ordered and protonated structure (PDB format, e.g., 57b9d65c162e4.pdb; note that file names are randomly generated for each uploaded compound) and corresponding topology file (Gromacs ITP format, e.g., 57b9d65c14721.itp).
The ligand structure needs to be combined with a properly protonated protein structure (e.g., produced with the Gromacs 4.6.1 pdb2gmx tool ): pdb2gmx -f 1oxr_xtalwater.pdb -o 1oxr_xtalwater_H.pdb -C ignh -ff oplsaa -water tip4 -p water.top pdb2gmx -f 1oxr_protein.pdb -o 1oxr_protein_H.pdb -ignh C -ff oplsaa -water tip4 -p system.top (We also keep the crystal water molecules but because only oxygen atoms are stored for crystal waters, we must also add hydrogens.) Then merge the three components (here using MDAnalysis) from MDAnalysis import Universe, Merge protein = Universe("1oxr_protein_H.pdb").atoms ligand = Universe("57b9d65c162e4.pdb").atoms water = Universe("1oxr_xtalwater_H.pdb").atoms u = Merge(protein, ligand, water) u.atoms.write ("1oxr_protein_AIN_xtalwater.pdb") Note that the position in space of the re-ordered ligand structure is the same as the input and hence it is possible to simply combine protein and ligand. The fully protonated system is 1oxr protein AIN xtalwater.pdb; manually edit the system.top file by including the ligand ITP file with the line #include "57b9d65c14721.itp" and adding the ligand and the number of water molecules to the [ molecules ] section. Further processing (solvating with water and ions, energy minimization, equilibration with position restraints) follows the standard Gromacs protocol, with the details summarized in the next section.
MD simulations Apo and drug-bound simulations of PLA2 were performed with Gromacs 4.6.1  using the OPLS-AA force field for proteins and ions (Kaminski et al., 2001;Jensen and Jorgensen, 2006), and the TIP4P water model (Jorgensen et al., 1983). The classical equations of motions were integrated with the leap-frog integrator with a time step of 2 fs. All bonds involving hydrogen atoms were constrained with P-LINCS (Hess, 2008) (order 4, 2 iterations). The neighbour list was updated every 5 steps. Simulations were performed at constant temperature T = 300 K using the Bussi velocity rescaling thermostat (Bussi et al., 2007) (coupling time constant 0.1 ps) and constant pressure P = 1 bar with the isotropic Parrinello-Rahman barostat (Parrinello and Rahman, 1981) (coupling time constant 1 ps, compressibility 4.5 × 10 −5 bar −1 ). Long range electrostatic forces were computed with the SPME method (Essman et al., 1995) (real space cutoff about 1 nm which was optimized by Gromacs on the fly to 1.502 nm, FFT grid spacing (started with 0.12 nm but was optimized to about 0.19 nm), 4th order spline) and Lennard-Jones interactions were cut off at 1 nm.
The protein-aspirin complex (PDB id 1OXR) was downloaded from the Protein Data Bank as described. Missing protein sidechains were added using the WHAT IF server (Vriend, 1990). The Ca 2+ ion and all crystal water molecules were retained during system preparation. For the apo simulation, the bound aspirin molecule was removed; for the protein-drug complex simulation it was retained but the atoms were reordered to match the topology file from Ligandbook. The protein was solvated in a dodecahedral simulation cell with a minimum protein-box edge distance of 1.5 nm. The simulation systems contained NaCl for a free ion concentration of about 150 mM, comprising about 35,000 atoms in total. Systems were energy minimized by alternating conjugate gradient and steepest descent algorithms until the maximum force in the system dropped below 1000 kJ · mol −1 · nm −1 . Protein (and drug) were position restrained (harmonic force constant 1000 kJ · mol −1 · nm −2 ) for 1 ns so that the solvent could relax without disrupting the protein structure or protein-drug interactions seen in the crystal structures. Unrestrained equilibrium simulations started from the final frame of the restrained simulation and were performed for 100 ns each.

Results
The interaction of phospholipase A2 (PLA2) with aspirin has been described by Singh et al. (2005) on the basis of a 1.9-A resolution crystal structure as a model of for interactions of proteins with non-steroidal anti-inflammatory drugs. We performed molecular dynamics (MD) simulations to better understand the difference between drug-protein interactions in a crystal structure and under physiological conditions.
Results are shown in Fig. 6. PLA2 in the apo state opens up somewhat compared to the crystal structure as indicated by the higher apo RMSD in Fig. 6B. The bound Ca 2+ ion is only held in place by a tight interaction with the carboxylate group of D49 (Fig. 6C). In the drug-protein complex simulation, aspirin remained tightly bound to PLA2 over 100-ns simulation as shown by the low RMSD of the drug (Fig. 6B). Primarily, aspirin is held in place by the electrostatic interaction of its carboxylate group with the Ca 2+ ion (Fig. 6C) that itself is tightly bound to PLA2 via D49 and the backbone carbonyl oxygens of Y28 (average distance 3.1 ± 0.9Å) and K31 (3.8 ± 2.0Å); G30:O, which contributes to the binding site in the crystal structure, flips out in the simulations (Fig. 6A). Aspirin effectively blocks access to the catalytic residues of PLA2, H48, Y52, and D94 (Burke and Dennis, 2009).
From the crystal structure an additional interaction with Y64 was described (distance 3.3Å in 1OXR) (Singh et al., 2005) but our simulations show this residue to be very mobile and the interaction not to be present under the conditions simulated here. However, a previously undetected interaction is established towards the end of the simulations by a hydrogen bond between the acetyl oxygen of aspirin and the nitrogen of W19 (Fig. 6A, D). The initial W19:Naspirin:O4 distance is about 4Å, corresponding to a weak hydrogen bond. However, towards the end of the simulation, the distance shrinks to < 3Å, which is indicative of the formation of a strong hydrogen bond.
These preliminary simulations indicate that the interactions between aspirin and PLA2 are qualitatively different in the crystal conditions and the physiological conditions present in the MD simulations. Future work would need to further validate the parameters for aspirin and investigate the robustness of the observed interactions under repeated and longer simulations.
This example shows how Ligandbook enables the study of drugprotein interactions by removing a substantial barrier to setting up the simulation system.