LigDig: a web server for querying ligand–protein interactions

Summary: LigDig is a web server designed to answer questions that previously required several independent queries to diverse data sources. It also performs basic manipulations and analyses of the structures of protein–ligand complexes. The LigDig webserver is modular in design and consists of seven tools, which can be used separately, or via linking the output from one tool to the next, in order to answer more complex questions. Currently, the tools allow a user to: (i) perform a free-text compound search, (ii) search for suitable ligands, particularly inhibitors, of a protein and query their interaction network, (iii) search for the likely function of a ligand, (iv) perform a batch search for compound identifiers, (v) find structures of protein–ligand complexes, (vi) compare three-dimensional structures of ligand binding sites and (vii) prepare coordinate files of protein–ligand complexes for further calculations.
Availability and implementation: LigDig makes use of freely available databases, including ChEMBL, PubChem and SABIO-RK, and software programs, including cytoscape.js, PDB2PQR, ProBiS and Fconv. LigDig can be used by non-experts in bio- and chemoinformatics. LigDig is available at: http://mcm.h-its.org/ligdig.
Contact: jonathan.fuller@h-its.org, rebecca.wade@h-its.org
Supplementary information: Supplementary data are available at Bioinformatics online.


Fig. 1.
A workflow showing the modularity of LigDig component tools, what input data they require (green: mandatory, orange: optional), which databases they access (grey), and what output data they produce (cyan). Possibilities to link output data from one tool to another are shown with green arrows from the tool icons.
Design: The PubChem queries are performed using Python to query the PUG interface. Results are passed from the LigDig server to the client browser in JavaScript Object Notation (JSON) format. To avoid cross-site scripting limitations when accessing UniChem to provide database cross-references, the LigDig web server queries UniChem using the Play framework web services extensions, and the results are passed to client-side JavaScript that updates the web page HTML asynchronously.
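A name-to-identifier lookup of this kind can be sketched as follows. This is a minimal illustration that assumes the PUG REST flavour of the PUG interface; the text does not say which PUG variant LigDig uses. The function names and the shape of the JSON sent to the browser are hypothetical. No network request is made here: the sketch only builds the request URL and parses a response body.

```python
import json
from urllib.parse import quote

# Base URL of PubChem PUG REST (assumed variant of the PUG interface).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_lookup_url(name):
    """Build a PUG REST URL asking for the CIDs that match a compound name."""
    return f"{PUG_REST}/compound/name/{quote(name, safe='')}/cids/JSON"


def parse_cids(payload):
    """Extract the CID list from a PUG REST JSON response body."""
    data = json.loads(payload)
    return data.get("IdentifierList", {}).get("CID", [])


def to_client_json(name, cids):
    """Serialise the result for the browser (hypothetical payload layout)."""
    return json.dumps({"query": name, "cids": cids})


# Example with a canned response body instead of a live request.
body = '{"IdentifierList": {"CID": [2244]}}'
print(cid_lookup_url("aspirin"))
print(to_client_json("aspirin", parse_cids(body)))
```

Keeping URL construction and response parsing separate from the HTTP call makes the lookup easy to test without contacting PubChem.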

Find an inhibitor
Motivation: Recently there has been much discussion of the impact of polypharmacology on drug discovery efforts. Similar principles are also important for systems biology approaches, where a chemical inhibitor can be applied to inhibit a specific component of a network or pathway. Many compounds exhibit polypharmacology, meaning that they act on multiple proteins. It would be desirable for a bench scientist to be able to discover these proteins easily. If they find a compound that inhibits multiple proteins within their pathway or network, they may then select another, more specific inhibitor, or choose to model this effect explicitly.

Functionality:
The "Find an Inhibitor" tool finds inhibitors of a given protein and then, for each inhibitor, finds all other proteins against which it has activity. To do this, a UniProt identifier is required. The search box on this page has autocomplete functionality to assist the user, who can provide a full-text string for which UniProt identifiers are suggested. ChEMBL is then queried once using the supplied UniProt identifier. The resulting compounds that have activity values (IC50, Kd, Ki, Potency, EC50, XC50, AC50) are returned. When more than 100 compounds meet this criterion, only those that exceed a user-defined ligand activity threshold value (default: 1000 nM) are retained, whereas for fewer than 100 compounds, every compound is returned. The retained compounds are then used as queries to find all proteins against which they have activity. Results are displayed using Cytoscape.js and additionally in a table. Compounds with activities exceeding the ligand activity threshold value are labelled as potential inhibitors, whereas compounds with activities exceeding a user-defined off-target activity threshold value (default: 100,000 nM) are marked as off-target compounds.
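The retention and labelling rules above can be sketched as follows. This is an illustrative reading, not the LigDig implementation. It assumes that "exceeding" a threshold means being at least as potent, i.e. having an activity value in nM at or below the threshold. It also assumes that exactly 100 compounds are treated like the "fewer than 100" case, and that the label "unlabelled" applies to anything weaker than both thresholds. All function and parameter names are hypothetical.

```python
def retain_compounds(activities_nm, threshold_nm=1000.0, max_unfiltered=100):
    """Apply the retention rule to a dict of compound id -> activity in nM.

    Assumption: a lower nM value means a more potent compound, so a compound
    "exceeds" the threshold when its value is <= threshold_nm.
    """
    if len(activities_nm) <= max_unfiltered:
        # Small result sets are returned in full.
        return dict(activities_nm)
    return {c: v for c, v in activities_nm.items() if v <= threshold_nm}


def classify(activity_nm, ligand_threshold_nm=1000.0,
             off_target_threshold_nm=100000.0):
    """Label one activity value using the two default thresholds."""
    if activity_nm <= ligand_threshold_nm:
        return "potential inhibitor"
    if activity_nm <= off_target_threshold_nm:
        return "off-target"
    return "unlabelled"  # hypothetical label for weaker activities


# Made-up values for illustration only.
small = {"cmpd_1": 5000.0, "cmpd_2": 10.0}
print(retain_compounds(small))        # fewer than 100: all retained
print(classify(10.0), classify(50000.0))
```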
Design: The SQLAlchemy Python package is used to interface with the ChEMBL MySQL database files downloaded after each new ChEMBL release (as of June 17th 2014, release 18), which are hosted on a MariaDB SQL server. The query result is rendered into a JSON format suitable for Cytoscape.js. UniChem is used to perform database cross-references for compounds. UniProt identifiers are provided by using the typeahead functionality from Twitter Bootstrap 2.3.2 and by querying the local ChEMBL database based on protein name.
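Rendering a query result into JSON for Cytoscape.js could look like the sketch below. It uses the grouped `nodes`/`edges` element format accepted by Cytoscape.js, in which each element carries a `data` object and edges name their `source` and `target` node ids. The function name, the extra `kind`, `query` and `activity_nm` fields, and the identifiers in the example are all hypothetical placeholders, not the fields LigDig actually emits.

```python
import json


def to_cytoscape_elements(query_protein, hits):
    """Convert (compound, protein, activity_nm) rows into Cytoscape.js elements."""
    # The queried protein is always present, even when there are no hits.
    nodes = {
        query_protein: {"data": {"id": query_protein, "kind": "protein",
                                 "query": True}}
    }
    edges = []
    for compound, protein, activity_nm in hits:
        # setdefault keeps the first definition, so each node appears once.
        nodes.setdefault(compound, {"data": {"id": compound, "kind": "compound"}})
        nodes.setdefault(protein, {"data": {"id": protein, "kind": "protein"}})
        edges.append({"data": {"id": f"{compound}-{protein}",
                               "source": compound,
                               "target": protein,
                               "activity_nm": activity_nm}})
    return {"nodes": list(nodes.values()), "edges": edges}


# Placeholder identifiers and made-up activities, for illustration only.
rows = [("CMPD_1", "PROT_A", 300.0), ("CMPD_1", "PROT_B", 2400.0)]
print(json.dumps(to_cytoscape_elements("PROT_A", rows), indent=2))
```

Each compound then appears as a single node connected to every protein it is active against, which is what makes shared (polypharmacological) compounds visible in the network view.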

Ligand functional annotation
The ligand functional annotation tool has two possible usage modes: the first uses PDB files as input, the second uses an InChI and an E.C. number. The results for each mode are displayed slightly differently. Both modes are described below.

PDB-based search
Motivation: Ligands that bind to an enzyme can have several functions, e.g., acting as a substrate, product or as different kinds of modifiers, depending on their enzymatic context. If the function of a ligand in a particular receptor-ligand complex structure is not reported as metadata in the PDB database (RCSB), this information must be obtained from other databases or the literature.

Functionality:
Starting with a PDB code, the corresponding enzyme names, UniProt identifiers and E.C. numbers, as well as ligand names and SMILES strings of the molecules in the structure, are retrieved from the PDB database. Depending on the user-supplied parameters, the enzyme names, UniProt identifiers and/or E.C. numbers are used as search terms to query the reaction database SABIO-RK. Subsequently, each compound name is used to query the PubChem database, and the structural similarity for all PDB ligand-SABIO-RK compound pairs is analysed by calculating the Tanimoto coefficient. The user is presented with a table containing all ligand:SABIO-RK compound pairs above a user-specified similarity threshold. In addition, compound function, given by an SBO term in SABIO-RK, is supplied.
Design: Python is used to perform RESTful queries of the RCSB PDB. SABIO-RK is also queried using RESTful web services, and results are returned as SBML, which is parsed using the Python wrappers for libSBML (Bornstein et al., 2008). Each compound name in the parsed SBML is then used to query the PubChem database for the isomeric SMILES string. For the sake of standardisation, SMILES strings are also requested from PubChem using the SMILES strings of each ligand provided by the PDB database. Using these SMILES strings, the structural similarity for all PDB ligand-SABIO-RK compound pairs is analysed by calculating the Tanimoto coefficient using the FP2 molecular fingerprint implemented in Open Babel, which encodes molecular structures in a series of bits that represent the presence or absence of particular substructures in the molecule.
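For two bit fingerprints A and B, the Tanimoto coefficient is the number of bits set in both divided by the number of bits set in either, T(A, B) = |A ∩ B| / |A ∪ B|. The sketch below computes it on plain Python sets of "on" bit positions and applies a similarity threshold to all ligand-compound pairs. It does not generate FP2 fingerprints; in practice these would come from Open Babel. The function names, the default threshold of 0.7 and the toy bit sets are assumptions made for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Two empty fingerprints: define the similarity as 0 to avoid 0/0.
        return 0.0
    return len(a & b) / len(union)


def similar_pairs(ligand_fps, compound_fps, threshold=0.7):
    """Return (ligand, compound, score) for all pairs scoring >= threshold."""
    pairs = []
    for ligand, ligand_fp in ligand_fps.items():
        for compound, compound_fp in compound_fps.items():
            score = tanimoto(ligand_fp, compound_fp)
            if score >= threshold:
                pairs.append((ligand, compound, score))
    # Most similar pairs first, as one would want in a results table.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy fingerprints: 2 shared bits out of 4 distinct bits.
print(tanimoto({1, 2, 3}, {2, 3, 4}))  # 0.5
```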

E.C. and InChI-based search
Motivation: Many enzymes have multiple functionalities, even if the E.C. number does not reveal them all. Knowledge of this functionality can help the user to find out whether a given compound might be involved in a reaction involving a specified E.C. number.

Functionality:
The user chooses an E.C. number for the enzyme that they are interested in. The user then supplies an InChI, which can be pasted from an external data source, or prepopulated by using the "Find Compound by Name" tool. SABIO-RK is queried to return an SBML file containing all reactions involving the given E.C. number. A list of all products and reactants for which a ChEBI or KEGG identifier exists in SABIO-RK is extracted. To assist the user, a Tanimoto similarity score is calculated for each compound with respect to the query compound. The data are displayed as a table that can be sorted or filtered as required by the user.
Design: The SBML file returned from the SABIO-RK RESTful interface (Wittig et al., 2012) is parsed using the JSBML package for Java (Dräger et al., 2011). A list of all products and reactants for which a ChEBI or KEGG identifier exists in SABIO-RK is extracted and an InChI is returned for each using the UniChem web service (Chambers et al., 2013). The CDK toolkit is used to generate 2D images for each compound, and a

Batch search for ligand identifier
Motivation: Many compound names used in the literature are stored as a list of synonyms in databases like PubChem or ChEBI. Due to ambiguous naming, database searches using compound names can result in multiple identifiers; therefore, a selection of the correct identifiers must be carried out by hand. This is time consuming and can be error-prone.
Functionality: LigDig enables automatic searching of the PubChem and ChEBI databases, via upload of an Excel file in .xlsx format containing the names of compounds of interest. Furthermore, if the user provides the molecular formula and/or the formal net charge for each compound, the retrieved database compounds are validated against these values by comparing the number of atoms and whether the given formal charge is possible. The highest and lowest formal charges are calculated using either the default pH values 1 and 14, respectively, or a user-specified pH range. In the case of a mismatch with the input molecular formula or a formal charge beyond the calculated range, the selected database compound identifier is not written to the output Excel file.
Design: Validation of a compound against the user-supplied molecular formula and/or net charge is done by calculating the structural topology, including hydrogens, of the retrieved SMILES string using Open Babel (O'Boyle et al., 2011) in combination with the MMFF94 force field (Halgren, 1996). Excel files are written using the openpyxl Python package.
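The validation step can be illustrated with a simplified check. The sketch below compares atom counts parsed from two molecular formula strings and tests whether the user-supplied formal charge lies within a given range. It is a stand-in, not the described method: LigDig derives the topology from SMILES with Open Babel and computes the charge range from the pH limits, whereas here the formula of the database hit and the charge limits are simply passed in. The parser handles only flat formulas such as `C9H8O4` (no brackets, hydrates or charge suffixes), and all names are hypothetical.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a flat molecular formula such as 'C9H8O4'."""
    counts = Counter()
    # Element symbol (capital + optional lowercase) followed by an optional count.
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def is_valid_hit(expected_formula, expected_charge, hit_formula,
                 charge_min, charge_max):
    """Return True if a database hit is consistent with the user-supplied values.

    expected_formula / expected_charge may be None when the user gave no value.
    charge_min / charge_max are the lowest and highest formal charges assumed
    to be reachable over the chosen pH range (computed elsewhere).
    """
    if expected_formula is not None:
        if parse_formula(expected_formula) != parse_formula(hit_formula):
            return False
    if expected_charge is not None:
        if not (charge_min <= expected_charge <= charge_max):
            return False
    return True


print(parse_formula("CH3COOH") == parse_formula("C2H4O2"))  # True
```

A hit that fails either check would be left out of the output file, mirroring the behaviour described above.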

Find protein structures
Motivation: Many of the tools in LigDig require a PDB code as an input, e.g. "PDB-based ligand functional annotation", "Binding site comparison" or "Structure preparation". Therefore, we provide a simple way to find protein structures in the PDB. Typically, the user would reach this page by first using the "Find compound by name" tool as described previously.
Functionality: The input list of PDB codes can be entered manually from the search page of "Find Protein Structure" or be sent directly from the results page of "Find Compound by Name". The PDB files themselves are not downloaded at this stage. E.C. number, UniProt (The UniProt Consortium, 2013), SCOP (Murzin et al., 1995) and PFAM (Mistry et al., 2013) identifiers and links are presented for each PDB code in a sortable table.
Design: A list of PDB codes is passed to the RCSB PDB RESTful web services using Python.

Binding site comparison
The binding site comparison tool can be used to align two protein structures by superposing either ligands in the binding site, or amino acid residues that define the binding site.

Ligand-based superposition
Motivation: Ligand-based superposition of binding sites can be useful when two protein structures contain the same ligand or ligands that have similar chemical structures. Comparison of the binding sites relative to the superposed ligand orientation can give important clues about conserved binding residues.
Functionality: The user supplies a list of PDB files containing the ligands in which they are interested. This list can be autopopulated using the previously described "Find protein structures" tool. The ligand binding sites, defined by the set of residues that contains at least one atom within a defined distance of any ligand atom (typically 5 Å), are extracted. Ligands are then superposed using the fconv tool (Neudert and Klebe, 2011). The user is then supplied with a JSmol visualization and several UI components to help visualize the binding site.
Design: PDB files are downloaded from the RCSB PDB using the RESTful interface, and components used in the structure preparation tool are reused to extract the ligand and the binding site from the PDB file. For each binding site, the rotation and translation used to superpose the query ligand onto the reference ligand are computed and applied to the query binding site using a simple Python script, in order to visualize the binding sites correctly in relation to the reference ligand. JSmol is used for PDB file visualization.
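The two geometric steps described here, selecting binding site residues by distance and applying the ligand superposition transform to the binding site, can be sketched in plain Python. This is an illustration under stated assumptions: the rotation matrix and translation vector are taken as given (in LigDig they come from the fconv ligand superposition), the data layout of atoms as `(residue_id, (x, y, z))` tuples is hypothetical, and the function names are invented for this example.

```python
import math


def binding_site_residues(protein_atoms, ligand_coords, cutoff=5.0):
    """Return ids of residues with at least one atom within cutoff of the ligand.

    protein_atoms: iterable of (residue_id, (x, y, z)) tuples.
    ligand_coords: list of (x, y, z) tuples for the ligand atoms.
    """
    site = set()
    for residue_id, xyz in protein_atoms:
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_coords):
            site.add(residue_id)
    return site


def apply_rigid_transform(coords, rotation, translation):
    """Apply x' = R x + t to each coordinate; rotation is a 3x3 nested list."""
    moved = []
    for x in coords:
        moved.append(tuple(
            sum(rotation[i][j] * x[j] for j in range(3)) + translation[i]
            for i in range(3)
        ))
    return moved


# Toy data: residue 10 of chain A lies near the ligand, residue 55 does not.
atoms = [(("A", 10), (0.0, 0.0, 0.0)), (("A", 55), (20.0, 0.0, 0.0))]
ligand = [(3.0, 0.0, 0.0)]
print(binding_site_residues(atoms, ligand))

# A 90 degree rotation about z followed by a shift along z.
rot_z90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
print(apply_rigid_transform([(1.0, 0.0, 0.0)], rot_z90, (0.0, 0.0, 1.0)))
```

Applying the same rotation and translation to the binding site atoms as to the query ligand keeps each site in the correct position relative to its own ligand after superposition.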

Residue-based superposition
Motivation: Residue-based binding site superposition not only allows the comparison of binding sites at which similar ligands are known to bind, but it also allows the comparison of arbitrarily defined binding sites for which no ligand is known.
Residue-based superposition is particularly useful when similar binding sites bind similar ligands, even though the respective protein folds are quite different.
Functionality: This tool allows the user to find similar binding sites even if they are known to bind to very different ligands. The residue-based binding site superposition mode also allows new binding sites to be found, as a ligand is not necessarily required. It makes use of the ProBiS tool (Konc and Janezic, 2010). Structure files do not need to be prepared for use with the ProBiS web server; however, protein chains and ligands need to be selected by the user for each PDB file. A table of chain and ligand information is then displayed, in which the user can select which binding sites to superpose, or which PDB chains to search for similar binding sites. The ProBiS web server returns the superposed structures and further information about the alignment. This enables the user to visualize only the residues considered similar in the local structural alignment.
Design: The underlying ProBiS (Konc and Janezic, 2010) algorithm finds structurally similar protein binding sites and performs pairwise local structural alignments. The RESTful interface of a locally installed ProBiS tool is used to compute the local similarity of the residues around the specified ligands or in the complete protein chains. To assist the user in determining which binding sites to superpose using ProBiS, the RESTful web service for the internally available OpenSiteFinder web server returns a JSON file containing the list of proteins and ligands that determine binding sites that could possibly be superposed.

Structure preparation
Motivation: Most crystal structures of proteins lack hydrogen atoms. The aim of this tool is to add hydrogen atoms to structures of protein-ligand complexes so that the coordinates can be used in further calculations and analyses. The tool cannot currently handle missing non-hydrogen atoms.
Functionality: Starting either with a list of PDB codes or a single PDB file uploaded by the user, hydrogen atoms can be added to receptor and ligand crystal structures using PDB2PQR and fconv, respectively. Due to the limitations of PDB2PQR, metal ions, non-standard amino acids, and covalently bound ligands present in the input PDB files cannot be handled, and are passed unprocessed to the final downloadable output PDB file. In addition, intermediate files for the receptors and extracted ligands are provided in PDB and mol2 file formats, respectively.
Design: PDB files for the input PDB codes are downloaded automatically from the RCSB PDB or, for obtaining biological units, from the wwPDB. Alternatively, a user-specified PDB file can be uploaded manually for further processing. Metal ions and ligands are written to mol2 files and the latter are protonated using fconv (Neudert and Klebe, 2011). After changing all atom types in the individual mol2 files to the format used in PROPKA (Dolinsky et al., 2007), all ligand files are merged into a single mol2 file. Finally, PDB2PQR (Dolinsky et al., 2004, 2007) is used to add hydrogen atoms to the receptor structure and water molecules. In the case where a bound ligand is contained in the PDB file, the previously described mol2 file, suitable for use in PDB2PQR, is also provided. Version 1.8 of PDB2PQR cannot handle metal ions or modified amino acids by default; therefore, the atoms that are not processed by PDB2PQR are appended to the final PDB file.
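The first step of such a pipeline, separating receptor records from metal ions and ligands, can be sketched with the fixed-column PDB format, in which the record name occupies columns 1 to 6 and the residue name columns 18 to 20. This is a simplified illustration, not the LigDig code: it treats every `HETATM` record other than water as a ligand or ion, keeps water with the receptor because the receptor and water molecules are protonated together, and ignores complications such as modified residues or covalently bound ligands. The function name and the water residue names checked are assumptions.

```python
WATER_NAMES = ("HOH", "WAT")  # assumed residue names for water


def split_pdb_records(pdb_text):
    """Split PDB coordinate lines into receptor lines and hetero-group lines.

    Returns (receptor_lines, hetero_lines). ATOM records and water HETATM
    records go to the receptor; all other HETATM records go to hetero_lines.
    """
    receptor, hetero = [], []
    for line in pdb_text.splitlines():
        record = line[:6].strip()          # columns 1-6: record name
        if record == "ATOM":
            receptor.append(line)
        elif record == "HETATM":
            residue_name = line[17:20].strip()  # columns 18-20: residue name
            if residue_name in WATER_NAMES:
                receptor.append(line)
            else:
                hetero.append(line)
    return receptor, hetero


# Minimal made-up records laid out in the PDB column positions.
example = "\n".join([
    "ATOM      1  N   ALA A   1",
    "HETATM    2  O   HOH A 101",
    "HETATM    3 ZN    ZN A 201",
])
receptor_lines, hetero_lines = split_pdb_records(example)
print(len(receptor_lines), len(hetero_lines))
```

The hetero-group lines would then be the input for ligand protonation, while the receptor lines go to the hydrogen-adding step.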