Chun Meng Song, Shen Jean Lim, Joo Chuan Tong, Recent advances in computer-aided drug design, Briefings in Bioinformatics, Volume 10, Issue 5, September 2009, Pages 579–591, https://doi.org/10.1093/bib/bbp023
Abstract
Modern drug discovery is characterized by the production of vast quantities of compounds and the need to examine these huge libraries in short periods of time. The need to store, manage and analyze these rapidly increasing resources has given rise to the field known as computer-aided drug design (CADD). CADD represents computational methods and resources that are used to facilitate the design and discovery of new therapeutic solutions. Digital repositories, containing detailed information on drugs and other useful compounds, are goldmines for the study of chemical reaction capabilities. Design libraries, with the potential to generate molecular variants in their entirety, allow the selection and sampling of chemical compounds with diverse characteristics. Fold recognition, for studying sequence-structure homology between protein sequences and structures, is helpful for inferring binding sites and molecular functions. Virtual screening, the in silico analog of high-throughput screening, offers great promise for systematic evaluation of huge chemical libraries to identify potential lead candidates that can be synthesized and tested. In this article, we present an overview of the most important data sources and computational methods for the discovery of new molecular entities. The workflow of the entire virtual screening campaign is discussed, from data collection through to post-screening analysis.
INTRODUCTION
Introduction of new therapeutic solutions is an expensive and time-consuming process. It is estimated that a typical drug discovery cycle, from lead identification through to clinical trials, can take 14 years [1] at a cost of 800 million US dollars [2]. In the early 1990s, rapid developments in the fields of combinatorial chemistry and high-throughput screening technologies created an environment for expediting the discovery process by enabling huge libraries of compounds to be synthesized and screened in short periods of time. However, these concerted efforts not only failed to increase the number of successfully launched new molecular entities, but seemingly aggravated the situation [3, 4]. Hit rates are often low and many of the identified hits fail to be further optimized into actual leads and preclinical candidates [5–7]. Among the late-stage failures, 40–60% were reportedly due to absorption, distribution, metabolism, excretion and toxicity (ADME/Tox) deficiencies [8–10]. Collectively, these issues underscore the need to develop alternative strategies that can help remove unsuitable compounds before significant resources are exhausted [7].
In time, a new paradigm in drug discovery got underway, calling for early assessment of potency (activity) and selectivity of lead candidates, as well as their potential ADME/Tox liabilities. This helps reduce costly late-stage failures and accelerates successful development of new molecular entities. At the core of this paradigm shift is the application of computational techniques to facilitate the discovery of new molecular entities. Computer-aided drug design (CADD) is a widely used term that represents computational tools and resources for the storage, management, analysis and modeling of compounds. It includes development of digital repositories for the study of chemical interaction relationships, computer programs for designing compounds with interesting physicochemical characteristics, as well as tools for systematic assessment of potential lead candidates before they are synthesized and tested. The more recent foundations of CADD were established in the early 1970s with the use of structural biology to modify the biological activity of insulin [11] and to guide the synthesis of human haemoglobin ligands [12]. At that time, X-ray crystallography was expensive and time-consuming, rendering it infeasible for large-scale screening in industrial laboratories [13]. Over the years, new technologies such as comparative modeling based on natural structural homologues have emerged and begun to be exploited in lead design [14]. These, together with advances in combinatorial chemistry, high-throughput screening technologies and computational infrastructures, have rapidly bridged the gap between theoretical modeling and medicinal chemistry. Numerous successes of designed drugs were reported, including Dorzolamide for the treatment of cystoid macular edema [15], Zanamivir for therapeutic or prophylactic treatment of influenza infection [16], Sildenafil for the treatment of male erectile dysfunction [17], and Amprenavir for the treatment of HIV infection [18].
CADD now plays a critical role in the search for new molecular entities [7, 13, 19]. Current focus includes improved design and management of data sources, creation of computer programs to generate huge libraries of pharmacologically interesting compounds, development of new algorithms to assess the potency and selectivity of lead candidates, and design of predictive tools to identify potential ADME/Tox liabilities. Here, we review major tools and resources that have been developed for expediting the search for novel drug candidates. The pipeline anatomy of a typical virtual screening campaign from data preparation to post-screening analysis is discussed.
DATA SOURCES
Data accessibility is critical for the success of a drug discovery and development campaign. Huge amounts of organic molecules, biological sequences and related information have been accumulated in scientific literature and case reports. These data are collected and stored in a structured way in a number of databases. Every year, hundreds of biological databases are described in [20]. At the same time, computational algorithms are actively developed to facilitate the design of combinatorial libraries. The most important data sources are reviewed in this section.
Small molecule databases
Small molecule databases represent a major resource for the study of biochemical interactions and play an increasing role in modern discovery with the accumulation of data. A variety of repositories of biologically interesting small molecules and their physicochemical properties have been compiled [21]. These include databases of known chemical compounds, drugs, carbohydrates, enzymes, reactants, natural products and natural-product-derived compounds (Table 1) [22, 23]. PubChem (http://pubchem.ncbi.nlm.nih.gov/), under the umbrella of the National Institutes of Health (NIH) Molecular Library Roadmap Initiative (http://nihroadmap.nih.gov/), provides information on the biological activities of more than 40 million small molecules and 19 million unique structures. The Available Chemicals Directory (ACD) from Molecular Design Limited (http://www.mdli.com) serves as a central resource for docking studies. As of January 2009, the database holds information on >571 000 purchasable compounds, while its screening compound counterpart, the Screening Compounds Directory, stores over 4.5 million unique structures. ZINC [24], a free database of purchasable compounds, contains 20 089 615 3D structures of molecules annotated with biologically relevant properties (molecular weight, calculated Log P and number of rotatable bonds). LIGAND [22] provides records on 15 395 chemical compounds, 8031 drugs, 10 966 carbohydrates, 5043 enzymes, 7826 chemical reactions and 11 113 reactants (February 2009). DrugBank [25] stores detailed information on nearly 4800 drugs, including >1350 FDA-approved small molecule drugs, 123 FDA-approved biotech drugs, 71 nutraceuticals and >3243 experimental drugs. ChemDB [21] includes nearly 5 million commercially available compounds. Other small molecule databases exist and have been reviewed elsewhere [26].
Table 1: Some small molecule databases reviewed in this article
| Name | URL |
|---|---|
| PubChem | http://pubchem.ncbi.nlm.nih.gov/ |
| ACD | http://www.mdli.com |
| ZINC | http://zinc.docking.org/ |
| LIGAND | http://www.genome.jp/ligand/ |
| DrugBank | http://www.drugbank.ca/ |
| ChemDB | http://cdb.ics.uci.edu/ |
Biological databases
Sequencing of the human and other model organism genomes has produced increasingly huge amounts of data relevant to the study of human disease. Some of these data sources are described in Table 2. The international collaborative GenBank [27], DNA Data Bank of Japan (DDBJ) [28] and European Molecular Biology Laboratory (EMBL) [29] serve as worldwide repositories for nucleotide sequences of diverse origins. The three databases synchronize their records on a daily basis. Swiss-Prot [30] and Protein Information Resource (PIR) [31] provide comprehensive and expertly annotated protein sequence and functional information. A total of 410 518 protein sequences are currently (February 2009) indexed by Swiss-Prot. Translated EMBL (TrEMBL), a computer-annotated protein sequence database supplement of Swiss-Prot, includes all translations of EMBL nucleotide sequences that are not available in the database [30]. Protein Data Bank (PDB) [32] is the single worldwide archive of structural data of biological macromolecules. As of February 2009, a total of 56 066 biological macromolecular structures have been deposited in PDB.
Table 2: Some biological databases reviewed in this article
Apart from the wealth of information from general-purpose biological databases, a variety of specialist databases have also been developed. Collectively, these sources represent current accumulated knowledge on human biology and disease. Gene expression profiles provide hints of potential targets that may be signatures of diseases. For this, databases such as ArrayExpress [33], Gene Expression Omnibus (GEO) [34] and CIBEX [35] are popular repositories. In the field of proteomics, data from 2D gel electrophoresis have been deposited into resources such as SWISS-2DPAGE [36] and GELBANK [37], while mass spectrometry data is available in databases such as Open Proteomics Database (OPD) [38] and Global Proteome Machine Database (GPMDB) [39]. Metabolomic databases, which detail information of biological pathways and their workings, are available through resources such as the Human Metabolite Database (http://www.metabolomics.ca; HMDB), MDL Metabolite database (http://www.mdl.com/products/predictive/metabolite/index.jsp) and METLIN [40]. The Biomolecular Interaction Network Database (BIND) [41], Human Protein Reference Database (HPRD) [42] and IntAct [43] provide data on protein–protein interactions while transcriptional relationships are available from resources like TRANSFAC [44] and TRED [45]. Post-translational modifications to proteins are also characterized and gathered in dbPTM [46], RESID [47], among others. Collectively, some of these pair-wise relationships have been abstracted into biologically related pathways and networks and made available through resources such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways [48] and BioCarta pathways (http://www.biocarta.com/). These resources allow in-depth analysis of selected biomolecules and their roles in molecular pathways of disease. More detailed reviews are available elsewhere [20].
Virtual combinatorial libraries
Combinatorial chemistry is now a critical component of modern drug discovery. Often, combinatorial libraries are far too large to be synthesized or screened in their entirety. It is common for these resources to contain a large number of compounds that are highly similar in their physicochemical characteristics. Improved design that optimizes a library's diversity or its similarity to a target can help minimize redundancy or maximize the number of true leads discovered. Concepts such as diversity, coverage and representativeness are commonly adopted to ensure a good sampling of a library using the minimum number of molecules [49]. Virtual library design usually begins with the explicit enumeration of all molecular variants within appropriate chemical spaces, followed by subsetting to allow good sampling of all products in the library [50]. Two approaches are normally used for the enumeration of molecular variants: (1) Markush techniques, which attach a list of alternative functional groups to variable sites on a common scaffold [51], and (2) chemical transforms, which specify the parts of the reacting molecules that undergo chemical transformations and the nature of these transformations [52, 53]. These libraries may be optimized for molecular diversity or similarity using descriptors such as chemical composition, chemical topology, 3D structures and functionality [54], or for drug-likeness using heuristic rules to detect ADME/Tox deficiencies [49].
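The Markush-style enumeration described above can be illustrated with a minimal sketch. It assumes a scaffold written as a SMILES-like template with named placeholders; the scaffold, the substituent lists and the function name `enumerate_markush` are hypothetical and are not taken from any of the cited tools.

```python
from itertools import product

def enumerate_markush(scaffold, r_groups):
    """Enumerate every product of a scaffold template.

    scaffold: string with placeholders such as "{R1}" and "{R2}".
    r_groups: dict mapping placeholder name -> list of substituent fragments.
    """
    names = sorted(r_groups)
    library = []
    # Cartesian product over the alternatives listed for each variable site.
    for combo in product(*(r_groups[name] for name in names)):
        library.append(scaffold.format(**dict(zip(names, combo))))
    return library

# Hypothetical benzene scaffold with two variable sites.
scaffold = "c1cc({R1})ccc1{R2}"
r_groups = {"R1": ["O", "N", "Cl"], "R2": ["C", "C(=O)O"]}
library = enumerate_markush(scaffold, r_groups)
print(len(library))  # 3 alternatives x 2 alternatives
```

The library size grows as the product of the number of alternatives at each site, which is why subsetting for diversity or coverage usually follows enumeration.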
COMPUTATIONAL MODELING, ANALYSIS, OPTIMIZATION AND PIPELINING
Fold recognition and comparative modeling
Fold recognition plays an integral role in the modern drug discovery process, fueled by the rapid production of data from initiatives such as the Human Genome Project [55]. A potential drug target that is structurally similar to a well characterized protein with known biochemical function may help identify binding sites and molecular function [13]. Existing methods include sequence comparison and protein threading [56]. Sequence comparison typically involves searching a query sequence against a database of known protein sequences with experimentally defined 3D structures and evaluating the alignment using substitution matrices, gap penalties, or propensity scales [57–59]. In contrast, protein threading or side-chain conformation search involves substituting the backbone coordinates of a source structure with the probe sequence and assessing the plausibility of the model by means of a set of empirical potentials [60–62]. Such methods offer the potential to identify structurally conserved proteins with no evolutionary links, and are useful for modeling highly conserved receptor-ligand complexes. The search for T-cell epitopes and the subsequent design of peptide-based modalities exemplify the applicability of this approach [63].
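To make the role of substitution scores and gap penalties concrete, the following sketch computes a global alignment score by dynamic programming (the Needleman-Wunsch recurrence). It is only an illustration: a flat match/mismatch score stands in for a real substitution matrix, the gap penalty is linear, and all parameter values are assumptions.

```python
def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b.

    match/mismatch are a stand-in for a substitution matrix;
    gap is a linear penalty applied per gapped position.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    return score[-1][-1]

print(global_alignment_score("ACD", "AD"))
```

Database search tools use heuristics rather than this exhaustive recurrence, but the scoring ingredients are the same.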
Once the structure of a homologous protein has been identified, a 3D model of the target structure may be constructed by comparative modeling, which provides a foundation for structure-based drug design [64]. Such methods are based on the fact that the structures of evolutionarily related proteins are more conserved than their primary sequences [65]. Hence, models of a protein with unknown structure may be constructed based on proteins with similar sequences [66]. Successful model construction requires at least one experimentally solved 3D structure with significant amino acid sequence similarity to the target sequence [67]. It has been estimated that templates with 50% sequence identity can reliably generate 3D models with approximately 1 Å agreement between matched Cα atoms, while templates with 25% identity produce models with Cα root-mean-square-deviation (r.m.s.d.) of more than 2 Å [68]. A variety of techniques have been developed for model construction, such as fragment assembly, as implemented in SWISS-MODEL [67], COMPOSER [69] and 3D-JIGSAW [70], segment matching, as implemented in SegMod/ENCAD [71], spatial restraints, as implemented in Modeller [64], or ab initio prediction, as implemented in Rosetta [72]. Several inhibitors including human lipoxygenase inhibitors [73], kinase inhibitors [74] and cannabinoid CB2 receptor agonists [75] have been discovered using virtual screening with homology models.
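The Cα r.m.s.d. used above to quantify model accuracy can be computed as in the sketch below. It assumes the two structures have already been superimposed and that the coordinate lists are matched residue by residue; optimal superposition is not performed here, and the function name `ca_rmsd` is illustrative.

```python
import math

def ca_rmsd(coords_a, coords_b):
    """Root-mean-square deviation (in the input units, e.g. angstroms)
    between two equally long lists of matched (x, y, z) C-alpha positions.
    Assumes the structures are already superimposed."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

# Hypothetical model displaced by 1 angstrom along x from the reference.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(x + 1.0, y, z) for x, y, z in reference]
print(ca_rmsd(reference, model))
```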
Ligand selectivity
The discovery of new molecular entities for drug intervention is a highly combinatorial science due to the diversity of protein targets, as well as huge variability of possible lead candidates. The theoretical number of natural proteins is approximately 250 000 [76], while the number of real organic compounds with molecular weight <2000 Da is more than 10^60 [77, 78]. Due to the astronomically huge chemical space, the cost required for systematic studies can be extraordinarily high. Computational tools are increasingly used as a cost-effective way for the selection, modeling, analysis and optimization of potential lead candidates. This section surveys the computational methods that have been developed for the prediction of ligand selectivity.
Receptor-based techniques
The availability of a protein target structure is usually helpful in identifying potential ligand interactors. Such approaches usually involve explicit molecular docking of ligands into the receptor binding site, producing a predicted binding mode for each candidate compound [79]. Predicting the preferred binding poses of ligands within a protein active site is difficult. First, the location and geometry of the binding site must be known, which may not always be addressed by X-ray crystallography or NMR studies [80]. Second, the method must find the correct positioning of a compound in the active site of the protein [81]. Incremental construction algorithms are potentially useful in guiding the search for optimal binding poses. Fragments are placed in the binding site of proteins and then ‘grown’ to fill the space available. An example of such an approach has been reported by Rarey and colleagues [82], in which the conformational space of the ligand is sampled on the basis of a discrete model and a tree-search technique is used for extending the ligand within the active site. Boehm and coworkers [83] applied needle screening to identify compounds that bind to the ATP binding site of the bacterial enzyme DNA gyrase. There are also an increasing number of reports on the use of Monte Carlo procedures for protein modeling and design. An early use of such a procedure was described by Abagyan and Totrov [84], which randomly selects a conformational subspace and makes a step to a new position independent of the previous position, but according to a predefined continuous probability distribution. The use of conformational ensembles [85] and genetic algorithms [86] to predict the bound conformations of flexible ligands to macromolecular targets has also been explored. Other docking algorithms exist and have been described elsewhere [87].
A comparative evaluation of eight docking programs (DOCK, FlexX, FRED, GLIDE, GOLD, SLIDE, SURFLEX and QXP) for their capacity to recover the X-ray pose of 100 small-molecular-weight ligands was reported [88]. It was found that at a 1 Å r.m.s.d. threshold, docking was successful for up to 63% of cases, while at an r.m.s.d. threshold of 2 Å, the maximum success rate was 90%.
Third, the system must evaluate the relative goodness-of-fit or how well the compound can bind to the receptor in comparison with other compounds [13]. An early venture was described by Platzer and colleagues [89], on calculating the relative standard free energy of binding of substrates to α-chymotrypsin. At that time, computational limitations did not allow the inclusion of solvation or entropic effects in the simulations. Since then, new methods have been devised which allow the basic handling of such configurations [90]. Physical-based potentials use atomic force fields to model free energies of binding, and may be coupled with methods such as free energy perturbation (FEP) [91] and thermodynamic integration (TI) [92] for higher accuracy. Tools that implement physical-based scoring methods include Assisted Model Building and Energy Refinement (AMBER) [93], Chemistry at HARvard Molecular Mechanics (CHARMM) [94] and DOCK [95]. Empirical-based potentials are fast and hence widely used in most docking algorithms. Such an approach requires the availability of receptor–ligand complexes with known binding affinity, and uses additive approximations of several energy terms such as van der Waals potential, electrostatic potential, hydrophobicity potential, among others, for binding free energy estimations [80]. Examples of tools that deploy empirical-based scoring functions include FlexX [82], SCORE [96], Internal Coordinate Mechanics (ICM) [84] and VALIDATE [97]. Knowledge-based methods, which are implemented in Potentials of Mean Force (PMF) [98] and DrugScore [99], compute binding free energies based on the frequencies of inter-atomic contacts. Such methods are also fast to compute and do not require availability of binding affinity data [79]. The Poisson–Boltzmann equation [100], which describes inter-molecular electrostatic interactions, has also been reportedly used for assessing the quality of a virtual screen.
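The idea of an additive score built from pairwise energy terms can be sketched as follows. This is a toy illustration rather than the scoring function of any program named above: it sums a 12-6 Lennard-Jones term and a Coulomb term over ligand-receptor atom pairs, and the well depth, contact distance, dielectric constant, cutoff and the approximate conversion factor of 332 are all assumed values.

```python
import math

def pair_energy(r, q1, q2, epsilon=0.2, r_min=3.5, dielectric=4.0):
    """Van der Waals (12-6 Lennard-Jones) plus Coulomb energy for one
    atom pair at distance r (angstroms). Parameter values are illustrative."""
    ratio = r_min / r
    lennard_jones = epsilon * (ratio ** 12 - 2 * ratio ** 6)
    coulomb = 332.0 * q1 * q2 / (dielectric * r)  # approx. kcal/mol units
    return lennard_jones + coulomb

def score_pose(ligand_atoms, receptor_atoms, cutoff=8.0):
    """Sum pair energies over all ligand-receptor atom pairs within cutoff.
    Each atom is a tuple (x, y, z, partial_charge)."""
    total = 0.0
    for lx, ly, lz, lq in ligand_atoms:
        for rx, ry, rz, rq in receptor_atoms:
            r = math.dist((lx, ly, lz), (rx, ry, rz))
            if 0.0 < r < cutoff:
                total += pair_energy(r, lq, rq)
    return total

# Hypothetical one-atom ligand near a two-atom receptor fragment.
ligand = [(0.0, 0.0, 0.0, -0.4)]
receptor = [(3.5, 0.0, 0.0, 0.4), (20.0, 0.0, 0.0, 0.4)]
print(score_pose(ligand, receptor))
```

Lower (more negative) totals indicate more favourable poses under this toy model; empirical scoring functions additionally fit the weights of such terms to complexes with measured affinities.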
Ligand-based techniques
Central to screening procedures based on ligands is the Similarity Property Principle [101], which asserts that molecules with similar structures are likely to share similar properties. This forms the basis for many ligand-based screening efforts where molecular structure and property descriptors of interacting molecules are extrapolated to search for other molecules with similar characteristics [54, 102]. For this, various machine learning techniques have been described, including the use of decision trees [103], recursive partitioning [104], artificial neural networks (ANN) [105] and support vector machines (SVM) [106, 107]. More recently, several groups have also demonstrated the use of mapping methods that transform molecular features into various representations. For instance, Godden and coworkers [108] introduced the concept of Dynamic Mapping of Consensus positions (DMC) for mapping consensus positions of specific compound sets to binary-transformed chemical descriptor spaces, as well as the idea of Distance in Activity-Centered Chemical Space (DACCS) for accurately detecting molecular similarity relationships in ‘raw’ chemical spaces of high dimensionality [109]. Eckert et al. [110] introduced an extension of DMC, DynaMAD, which maps compounds to ‘activity-class-dependent’ descriptor values using unmodified descriptor value distributions. Molecular fingerprints based on 2D or 3D descriptors are also applied for virtual screening applications, as seen in methods such as MOLPRINT 2D [111], Property Descriptor value Range-derived FingerPrint (PDR-FP) [112], Rapid Overlay of Chemical Structures (ROCS) [113], shape fingerprints [114], and 3D pharmacophore fingerprints [115]. Dynamic activities over the past few years have also seen the development of hybrid techniques that integrate the strengths of both structure-based and ligand-based techniques.
For example, Cherkasov and coworkers [116] reported a combined approach integrating docking and structure-activity modeling using ANN to predict nonsteroidal compounds that bind to human sex hormone binding globulin. Although useful in practice, ligand-based procedures are usually non-generalizable and highly dependent on the quantity and quality of available experimental data. Where the training dataset is limited or biased, these models suffer from poor accuracy. Reported successes of ligand-based virtual screening include the discovery of novel cyclooxygenase 2 (COX-2) inhibitors [106] and anti-malaria compounds [117].
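A common way to put the Similarity Property Principle into practice with molecular fingerprints is to rank library compounds by their Tanimoto coefficient to a known active. The sketch below assumes fingerprints are represented as sets of 'on' bit positions; the bit values and compound names are invented for illustration and do not correspond to any of the fingerprint methods cited above.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as
    collections of 'on' bit positions: |A & B| / |A | B|."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def rank_by_similarity(query_fp, library):
    """Return library compound names sorted from most to least
    similar to the query fingerprint."""
    return sorted(
        library,
        key=lambda name: tanimoto(query_fp, library[name]),
        reverse=True,
    )

# Hypothetical fingerprints (sets of bit indices).
query = {1, 2, 3, 8}
library = {
    "cmpd_A": {1, 2, 3, 8},   # identical to the query
    "cmpd_B": {2, 3, 4},      # partial overlap
    "cmpd_C": {10, 11},       # no overlap
}
print(rank_by_similarity(query, library))
```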
Assessment of ADME/Tox deficiencies
The disposition of a pharmaceutical compound may be described by its pharmacokinetic or ADME properties [118]. In order to exert a pharmacological effect in tissues, a compound has to penetrate various physiological barriers, such as the gastrointestinal barrier, the blood–brain barrier and the microcirculatory barrier, to reach the blood circulation. It is subsequently transported to its effector site for distribution into tissues and organs, degraded by specialized enzymes, and finally removed from the body via excretion. In addition, genetic variation in drug metabolizing enzymes implies that some compounds may undergo metabolic activation and cause adverse reactions or toxicity in humans [119]. Accordingly, the ADME/Tox properties of a compound directly impact its usefulness and safety.
The membrane permeability of a compound is determined by a combination of factors including compound size, aqueous solubility, ionizability (pKa) and lipophilicity (log P). It has been reported that the polar surface area (PSA) inversely correlates with the lipid penetration ability [120]. Compounds that are completely absorbed by humans tend to have PSA values of ≤60 Å², while compounds with PSA >140 Å² are less than 10% absorbed. Lipinski [121] carefully studied the physico-chemical properties of 2245 drugs from the World Drug Index (WDI) and found that poor absorption and permeation are more likely to occur when molecular weight >500 g/mol, Clog P > 5, hydrogen bond donors >5 and hydrogen bond acceptors >10. A ‘rule of five’ was subsequently proposed with respect to drug-likeness. These rules were extended by other researchers, including Ghose et al. [122] and Oprea [123]. A more stringent ‘rule of five’ was proposed by Wenlock et al. [124] after analyzing 594 compounds from the Physicians' Desk Reference 1999, wherein molecular weight <473 g/mol, Clog P < 5, hydrogen bond donors <4 and hydrogen bond acceptors <7. Congreve and coworkers [125] performed an analysis on a range of targets derived by NMR and X-ray crystallography and found that the fragments obeyed, on average, a ‘rule of three’ for lead-likeness, in which molecular weight is <300 g/mol, hydrogen bond donors ≤3 and Clog P ≤ 3. However, these rules could only serve as the minimal criteria for evaluating drug-likeness. It has been estimated that 68.7% of compounds in the Available Chemicals Directory (ACD) Screening Database (2.4 million compounds) and 55% of compounds in ACD (240 000 compounds) do not violate the ‘rule of five’ [126]. A collection of 1203 compounds which represents 2973 pharmacokinetic measurements is now available in the PK/DB database [127]. The general rules for assessing ADME/Tox properties have been extended to more complex computational and mathematical models.
Procedures based on genetic algorithms (GAs) [128, 129], ANNs [129, 130], SVMs [131] and statistical models [132, 133] have been widely used for predicting aqueous drug solubility and human intestinal absorption. Likewise, machine-learning algorithms and mathematical models for predicting Caco-2 permeability [133, 134] and blood–brain barrier penetration [135, 136] have been described. A comprehensive description of these methods can be found in a recent review [126]. Collectively, these systems allow detailed modeling of pharmacokinetics and drug delivery. The result will be realistic models that should match the complexities of external drug administration and greatly assist our understanding of the fate of compounds ingested or otherwise delivered externally to a human. An example is the successful screening for novel inhibitors of human carbonic anhydrase II using a series of hierarchical filters to reduce the initial data sample based on functional group requirements and pharmacophore matching [137].
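Heuristic filters of the 'rule of five' kind are straightforward to apply once descriptors have been computed. The sketch below flags a compound for each property that exceeds Lipinski's thresholds (molecular weight above 500 g/mol, Clog P above 5, more than 5 hydrogen bond donors, more than 10 hydrogen bond acceptors). The descriptor values are made up, the `Compound` class is hypothetical, and the number of tolerated violations is exposed as an assumed parameter.

```python
from dataclasses import dataclass

@dataclass
class Compound:
    name: str
    mol_weight: float   # g/mol
    clogp: float
    h_donors: int
    h_acceptors: int

def rule_of_five_violations(c):
    """Count how many of the four Lipinski thresholds are exceeded."""
    return sum([
        c.mol_weight > 500,
        c.clogp > 5,
        c.h_donors > 5,
        c.h_acceptors > 10,
    ])

def passes_rule_of_five(c, max_violations=0):
    """max_violations is an assumption; some workflows tolerate one."""
    return rule_of_five_violations(c) <= max_violations

# Hypothetical descriptor values for illustration only.
compounds = [
    Compound("small_polar", 180.2, 1.2, 1, 4),
    Compound("large_greasy", 712.9, 6.3, 2, 9),
]
kept = [c.name for c in compounds if passes_rule_of_five(c)]
print(kept)
```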
Stereochemical quality assessment
Errors in protein structures may be identified by evaluating the stereochemical quality of generated models. A commonly used indicator of protein quality is the Ramachandran plot, which displays the φ and ψ backbone conformational angles for each residue in a protein [138]. Such an approach evaluates the correctness of structural coordinates based on standard deviations in φ and ψ angle pairs for residues in a protein. More complex forms of such metrics exist, which incorporate additional parameters such as bond lengths for protein structure verification [139]. An alternative method for assessing protein stereochemical quality is to test the compatibility of the model with its own amino acid sequence using a 3D profile computed from the atomic coordinates of the structure; the 3D profiles of correct protein structures match their own amino acid sequences with high scores [140]. A wrongly folded segment in a structure may be identified by examining the profile score in a moving-window scan. Appropriate use of these resources will enhance the quality of developed models and prediction accuracies.
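The moving-window scan mentioned above can be sketched as follows, assuming a per-residue compatibility score is already available. The scores, window length and threshold here are invented for illustration; computing a real 3D profile is outside the scope of this sketch.

```python
def window_averages(scores, window):
    """Average score for every contiguous window of the given length."""
    if window < 1 or window > len(scores):
        raise ValueError("window must be between 1 and len(scores)")
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]

def flag_low_segments(scores, window, threshold):
    """Start indices of windows whose average falls below threshold,
    marking candidate misfolded segments."""
    return [
        start
        for start, avg in enumerate(window_averages(scores, window))
        if avg < threshold
    ]

# Hypothetical per-residue scores with a poorly scoring stretch in the middle.
scores = [1, 1, 1, -1, -1, -1, 1, 1, 1]
print(flag_low_segments(scores, window=3, threshold=0.0))
```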
Data pipelining
Data pipelining is increasingly applied to streamline and automate the process of virtual screening campaigns. In such systems, data automatically flow from one task to another allowing complete data analysis to be performed. This is achieved by constructing and executing workflows using components that perform specific data integration, calculation or analysis tasks. An example of workflow technology is Pipeline Pilot [141] developed at SciTegic, Inc. The system allows for the analysis of discovery data such as chemical series and high throughput screening results using an array of cheminformatics tools that include molecular fingerprints, similarity calculations, clustering, maximal common subgraph search and Bayesian model learning.
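The workflow idea can be illustrated with a minimal sketch in which each component is a function that takes a list of records and returns a new list, and the pipeline simply passes data from one component to the next. The component names, record fields and values are hypothetical and do not reflect how Pipeline Pilot is implemented.

```python
from functools import reduce

def deduplicate(records):
    """Keep the first record seen for each SMILES string."""
    seen, unique = set(), []
    for rec in records:
        if rec["smiles"] not in seen:
            seen.add(rec["smiles"])
            unique.append(rec)
    return unique

def filter_by_weight(records, max_weight=500.0):
    """Drop records heavier than max_weight (g/mol)."""
    return [rec for rec in records if rec["mol_weight"] <= max_weight]

def rank_by_score(records):
    """Sort so that the most favourable (lowest) score comes first."""
    return sorted(records, key=lambda rec: rec["score"])

def run_pipeline(records, steps):
    """Feed the output of each step into the next one."""
    return reduce(lambda data, step: step(data), steps, records)

# Hypothetical screening records.
records = [
    {"smiles": "CCO", "mol_weight": 46.07, "score": -3.1},
    {"smiles": "CCO", "mol_weight": 46.07, "score": -3.1},
    {"smiles": "c1ccccc1", "mol_weight": 78.11, "score": -4.5},
    {"smiles": "hypothetical_large", "mol_weight": 812.0, "score": -9.0},
]
hits = run_pipeline(records, [deduplicate, filter_by_weight, rank_by_score])
print([rec["smiles"] for rec in hits])
```

Because every component shares the same input and output shape, steps can be reordered, removed or replaced without touching the rest of the workflow.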
A ROADMAP FOR A STRUCTURE-BASED VIRTUAL SCREENING CAMPAIGN
In order to make sense of the wide assortment of tools available for virtual screening, we present a simplified solution as a roadmap, with a small set of selected options for each step (Figure 1). Specific tools and resources have been selected on the basis of their availability and performance. These tools have been shown to perform consistently either autonomously or as part of different analysis pipelines. As such, we wish to recommend them as a general procedure for small or large scale virtual screening initiatives.
Figure 1: A roadmap for a structure-based screening campaign, comprising (i) target selection, (ii) library preparation and (iii) stereochemical quality assessment, ADME/Tox assessment and computational optimization.
A campaign usually begins with the selection of biological targets whose role in the disease pathway is established. The BLAST suite of programs [58] is usually helpful in inferring functional and evolutionary relationships between sequences, and helps in identifying members of gene families. Where available, the structures of target proteins may be downloaded from PDB [32] and the corresponding amino acid sequences from Swiss-Prot [30] or PIR [31]. PSI-BLAST [58], 3D-PSSM [142] and SAM-T2K [143] are useful for inferring the binding sites of target proteins. Next, it is necessary to prepare the library of compounds to be screened. Potential sources of publicly available chemical libraries include ZINC [24] and DrugBank [25]. Tools like Open Babel [144] and JOELib (http://sourceforge.net/projects/joelib/) are useful for inter-conversion of different chemical file formats such as PDB [32], Chemical Markup Language (CML) [145], MDL Molfile [146], simplified molecular input line entry specification (SMILES) [147] and SYBYL Line Notation (SLN) [148]. In addition, computational software exists for combinatorial library enumeration and 3D structure generation. SmiLib [149] allows the rapid combinatorial generation of chemical compounds by attaching different functional groups to a common chemical scaffold. The commercial software CORINA [150] or Converter (http://www.accelrys.com) is useful for generating the 3D structure of a small molecule. Tools such as CLEVER [151] support chemical library creation and manipulation, combinatorial chemical library enumeration using user-specified chemical components, chemical format conversion and visualization, as well as chemical compound analysis and filtration with respect to drug-likeness, lead-likeness and fragment-likeness based on the physicochemical properties computed from the derived molecules.
For a target protein of unknown structure, a 3D model may be constructed with the help of SWISS-MODEL [67] or Modeller [64]. The de novo structure prediction algorithm, Rosetta, may be applied to predict the conformations of structurally divergent regions in comparative models [72]. Tools such as Relibase+ [152] are helpful for selecting conserved water molecules for inclusion in docking screens. WHAT_CHECK [139], PROCHECK [153] or Ramachandran Plot 2.0 [154] may be applied, before and after the screening process, to check stereochemical quality and identify errors in protein structures. Protonation of the target protein, energy minimization and molecular docking may be performed using DOCK [85], AutoDock [86] or the commercial software ICM [84]. The commercial software ACD/LogD Suite (http://www.acdlabs.com/products/phys_chem_lab/logd/suite.html) by Advanced Chemistry Development can be used to predict ADME-related properties including hydrophobicity, lipophilicity and pKa, while Pharma Algorithms provides a suite of products via ADME Boxes (http://www.ap-algorithms.com/adme_boxes.htm) that address issues such as solubility, oral bioavailability, absorption and distribution. To remove toxic hits from the chemical libraries, counter pharmacophore screening may also be performed, using compounds whose inhibition leads to toxic effects [155]. This entire process may be iterated, with the inclusion of modified analogues or additional compound sets, for further optimization of potency and drug-like properties.
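The post-docking steps described above can be sketched as a simple triage funnel: keep compounds with a favourable docking score, discard those with poor predicted ADME-related properties, and remove any that match a counter pharmacophore. Everything below is hypothetical: the `Hit` fields, the compound data and the cutoff values are illustrative assumptions, not outputs or defaults of DOCK, AutoDock, ICM or any ADME package named in the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    name: str
    docking_score: float  # assumed convention: more negative = better
    log_p: float          # predicted lipophilicity
    mol_weight: float     # molecular weight in Da
    matches_counter_pharmacophore: bool  # flagged as potentially toxic


def triage(hits, score_cutoff=-7.0, max_log_p=5.0, max_mol_weight=500.0):
    """Return hits that pass all filters, best docking score first.

    The default cutoffs are illustrative assumptions and would need to be
    tuned for a real target and scoring function.
    """
    kept = [
        h for h in hits
        if h.docking_score <= score_cutoff
        and h.log_p <= max_log_p
        and h.mol_weight <= max_mol_weight
        and not h.matches_counter_pharmacophore
    ]
    return sorted(kept, key=lambda h: h.docking_score)


# Made-up screening results for demonstration only.
hits = [
    Hit("cmpd_A", -9.1, 3.2, 410.0, False),
    Hit("cmpd_B", -8.4, 6.1, 450.0, False),  # fails lipophilicity filter
    Hit("cmpd_C", -7.5, 2.0, 320.0, True),   # fails counter-screen
    Hit("cmpd_D", -7.8, 1.5, 290.0, False),
    Hit("cmpd_E", -5.0, 2.2, 300.0, False),  # fails docking score cutoff
]

shortlist = triage(hits)
print([h.name for h in shortlist])
```

In an iterated campaign, analogues of the shortlisted compounds would be added to the library and the same funnel applied again.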
CONCLUSION
Since the first reported success of discovery by design over three decades ago, there has been an explosion in the number, variety and sophistication of resources and analysis tools. CADD is now widely recognized as a viable alternative and complement to high-throughput screening. The search for new molecular entities has led to the construction of high-quality datasets and design libraries that may be optimized for molecular diversity or similarity. In parallel, advances in molecular docking algorithms, combined with improvements in computational infrastructure, are enabling rapid improvement in screening throughput. Propelled by increasingly powerful technology, distributed computing is gaining popularity for large-scale screening initiatives. Recent examples include the European Union-funded WISDOM (World-wide In Silico Docking on Malaria) project, which analyzed over 41 million malaria-relevant compounds in ∼1 month using 1700 computers from 15 countries [155], and the Chinese-funded Drug Discovery Grid (DDGrid) for anti-SARS and anti-diabetes research, with a calculation capacity of >1 Tflops [156]. Combined with concerted efforts towards the design of more detailed physical models, such as those for solubility and protein solvation, these advancements will, for the first time, allow the realization of the full potential of lead discovery by design.
Numerous bioinformatics tools and resources have been developed to expedite the drug discovery process. This article provides an overview of the most important data sources and computational methods for the discovery of new molecular entities. We have also provided guidelines on the workflow of the entire virtual screening campaign, from data collection through to post-screening analysis.
Data accessibility is critical for the success of a drug discovery and development campaign. Small molecule databases represent a major resource for the study of biochemical interactions. Biological databases represent current accumulated knowledge on human biology and disease. Combinatorial libraries allow for optimization of a library's diversity or similarity to a target and can help minimize redundancy or maximize the number of discovered true leads.
A campaign usually begins with the selection of biological targets whose role in the disease pathway is established. Next, it is necessary to prepare the library of compounds to be screened. For a target protein of unknown structure, a 3D model may be constructed using homology modeling techniques. This is usually followed by protonation of the target protein, energy minimization, molecular docking and stereochemical quality assessments.
