-
PDF
- Split View
-
Views
-
Cite
Cite
Naruki Yoshikawa, Kei Terayama, Masato Sumita, Teruki Homma, Kenta Oono, Koji Tsuda, Population-based De Novo Molecule Generation, Using Grammatical Evolution, Chemistry Letters, Volume 47, Issue 11, November 2018, Pages 1431–1434, https://doi.org/10.1246/cl.180665
- Share Icon Share
Abstract
Automatic molecule design with machine learning and simulations has shown a remarkable ability to generate new and promising drug candidates. We propose a new population-based approach using a grammatical evolution named ChemGE, that can update a large population of molecules concurrently and evaluate with multiple simulators in parallel. In computational experiments, ChemGE succeeded in finding hundreds of candidate molecules whose affinity for thymidine kinase is better than that of known binding molecules in a database (DUD-E).
Designing new molecules with desirable properties is an important task for pharmaceutical science and materials science, with the improvement of binding affinity for a protein and ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) profile, a priority for hit-to-lead and subsequent optimization stages of drug discovery. Computer-based de novo molecule design1 is an active research fields in drug discovery. Fragment-based methods like RECAP,2 which generate molecules by linking known fragments, are known as a popular approach. However, these methods have some problems with structural diversity of generated molecules and patentability. On the other hand, several de novo molecule generation methods recently suggested do not require fragments. They formulate the molecule generation problem as a black-box optimization of SMILES (Simplified Molecular Input Line Entry System) strings3 via deep neural networks and molecular simulations.4–12
Although these methods have achieved great results in finding drug candidates, most still have problems in simulation concurrency and diversity. Because few, often no more than one, molecules are generated at a time, it is difficult to parallelize their subsequent evaluation through molecular simulation. The lack of diversity in generated molecules is also a problem,13 mainly due to the way most methods are designed to find an optimal molecule based on an a priori defined score: drug discovery is a stage-wise process and it is impossible to design a comprehensive score where all aspects are reflected. However, improving molecular diversity is necessary to increase the chance of survival, down the drug discovery pipeline.
In this paper, we propose a new efficient molecule generator (ChemGE) for generating multiple novel molecules taking advantage of parallel computation. ChemGE employs a population-based approach grammatical evolution14 to optimize a population of molecules. Grammatical evolution is a population-based optimization technique to optimize a population of strings that follow a context-free grammar. Such population-based evolutionary methods have been regaining popularity for solving black-box optimization problems, such as hyperparameter optimization and neural network design.15 Their main advantage is inherent concurrency. Hence, ChemGE can perform many simulations simultaneously and can easily be adapted for parallel computation.
Compared to other evolutionary methods using SMILES,16 the mutation operation in ChemGE causes a big change of molecules. This nature is beneficial to finding novel molecules. The SMILES format is just a linear representation of a molecule graph, so SMILES itself is a good prior knowledge of molecule graph distribution. Since generation of SMILES string is computationally cheap, we can create more diverse molecules under an identical computation environment.
In addition, mutation operations in evolution ensure large diversity throughout the optimization process. In benchmarking experiments using a druglikeness score, we confirmed that ChemGE is able to find many more molecules than deep learning-based methods using identical computational resources. ChemGE effectively finds novel molecules that are affinitive to thymidine kinase using rDock.17 Employing 32 cores in parallel, ChemGE generated 189 molecules whose docking scores are better than the best molecule in a database (DUD-E18) in 26 hours. The generated molecules were highly diverse, and dissimilar to known active ones in DUD-E.
Before introducing ChemGE, we review SMILES and its context-free grammar. SMILES3 is a notation of chemical structure using ASCII strings. A context-free grammar G is defined as a 4-tuple: |$\text{G} = (\text{V},\Sigma ,\text{R},\text{S})$|, where V is a finite set of non-terminal symbols, Σ is a finite set of terminal symbols, R is a finite set of production rules, and S is a special non-terminal symbol called the start symbol. A production rule defines a transformation from a non-terminal symbol V to |$(\text{V} \cup \Sigma )^{*}$|, where the asterisk denotes a Kleene star. To generate a string, a sequence of production rules is applied to non-terminal symbols until no non-terminal symbol remains. OpenSMILES19 specifies the context-free grammar of SMILES. In this work, we used a subset of the grammar. Note that following the grammar alone is not sufficient for generating valid molecules. The validity of SMILES strings is evaluated by a package like RDKit.20
In ChemGE, a chromosome is defined as a sequence of integers that is translatable to a SMILES string through a mapping process. Figure 1 illustrates how this translation is conducted. Non-terminal symbols are successively converted by production rules defined in grammar until a string is generated. Integers in a chromosome specify which rule is to be applied. At the k-th step of the mapping process, the k-th integer in the chromosome c = C[k] is looked up. The number of rules applicable to the leftmost non-terminal symbol is denoted by r. Among them, the ((c mod r) + 1)-th rule is chosen and applied. The process iterates until the string has no non-terminal symbol or the final integer of the chromosome is used.

Mapping process to translate a chromosome into SMILES with an example of CO2. A chromosome is a sequence of integers. At the k-th step, the k-th integer is looked up and it specifies the production rule applied to the leftmost non-terminal symbol.
ChemGE evolves a population of molecules in accordance with the (μ + λ) strategy21 as the following; (1) create λ new chromosomes by repeatedly drawing a random chromosome from the population, and changing one integer at a random position to a random value (i.e., mutation). (2) Each of the λ chromosomes is translated to a SMILES string and then to a molecule and its fitness is evaluated (e.g. by docking simulation). If translation fails, the fitness is set to −∞. (3) A new generation is made by selecting the μ top-fit molecules from the merged pool of existing and new molecules. We did not use a crossover operation here, as it did not improve the performance (data not shown).
To validate the usefulness of our method, we conducted two types of computational experiments. One is for searching drug-like molecules. The other is for high scoring ligands for a protein.
Concerning drug-like molecules, we optimized an indicator of druglikeness, JlogP score,4 defined as,
where logP(m) is the octanol-water partition coefficient (logP) of a molecule, m. SA(m) is synthetic accessibility.22 “ring-penalty(m)” is for imposing the number of carbon rings less than seven. SA(m) and ring-penalty(m) are normalized so that their mean and standard deviation values are set to zero and one, respectively. Roughly speaking, molecules with larger JlogP scores have more suitable structural profiles as pharmaceutical drugs. Since very large logP values (JlogP > 5) cause low metabolic stability and other pharmacokinetic defects, JlogP values of launched drugs and clinical tested compounds collected from Clarivate Analytics Cortellis database range from −10 to 5.
We compared ChemGE with CVAE,4 GVAE,5 and ChemTS10 using this score. CVAE uses variational autoencoder (VAE)23 to obtain continuous representation of SMILES strings, conduct Bayesian optimization over the continuous space, and obtain strings from the optimized representation. GVAE is an updated version of CVAE, where grammar information on SMILES is considered. ChemTS uses Monte Carlo tree search and recurrent neural networks to design SMILES strings.
In this experiment, ChemGE is applied with different initial population sizes: (μ, λ) = (10, 20), (100, 200), (1000, 2000), (10000, 20000), randomly chosen from the ZINC database, which contains 35 million commercially available compounds.24 JlogP(m) was used as fitness score, but set to −∞, if the molecular weight was bigger than 500 or an identical molecule had already been generated before. In order to compare with other methods, no parallel computation was conducted here. We used an Intel Xeon CPU E5-2630 v3 computing core.
We tabulated the maximum score of each method after running for 2, 4, 6, and 8 hours in Table 1. The importance of choosing a suitable population size is well-documented in evolutionary algorithm literature,14 as an overly small population cannot have sufficient diversity, while an overly large population cannot optimize each molecule sufficiently due to the small number of generations. We found that, at (μ, λ) = (1000, 2000), the final ChemGE score was the largest and outperformed other methods in score and speed of generating molecules. The efficiency of ChemGE enables evaluation of a much larger number of molecules than computationally-demanding deep learning models, such as RNN or VAE, or than Bayesian optimization, which is used in CVAE and GVAE, and gets slower as search progresses.
Maximum score J at time points 2, 4, 6, and 8 hours achieved by different molecular generation methods. The average values and standard deviations over 10 trials are shown. The two numbers after ChemGE specifies (μ, λ). The second column from right shows the number of generated molecules per minute (duplication is eliminated). The rightmost column shows the number of generations in 8 hours.
Method . | 2 h . | 4 h . | 6 h . | 8 h . | Mol/min . | Generation . |
---|---|---|---|---|---|---|
ChemGE (10, 20) | 4.46 ± 0.34 | 4.46 ± 0.34 | 4.46 ± 0.34 | 4.46 ± 0.34 | 14.5 ± 4.0 | 1030000 ± 24000 |
ChemGE (100, 200) | 5.17 ± 0.26 | 5.17 ± 0.26 | 5.17 ± 0.26 | 5.17 ± 0.26 | 135 ± 22 | 8360 ± 1340 |
ChemGE (1000, 2000) | 4.45 ± 0.24 | 5.32 ± 0.43 | 5.73 ± 0.33 | 5.88 ± 0.34 | 527 ± 62 | 704 ± 59 |
ChemGE (10000, 20000) | 4.20 ± 0.33 | 4.28 ± 0.28 | 4.40 ± 0.27 | 4.53 ± 0.26 | 555 ± 68 | 72.5 ± 9.5 |
CVAE4 | −30.18 ± 26.91 | −1.39 ± 2.24 | −0.61 ± 1.08 | −0.006 ± 0.92 | 0.14 ± 0.08 | — |
GVAE5 | −4.34 ± 3.14 | −1.29 ± 1.67 | −0.17 ± 0.96 | 0.25 ± 1.31 | 1.38 ± 0.91 | — |
ChemTS10 | 4.91 ± 0.38 | 5.41 ± 0.51 | 5.49 ± 0.44 | 5.58 ± 0.50 | 40.89 ± 1.57 | — |
Method . | 2 h . | 4 h . | 6 h . | 8 h . | Mol/min . | Generation . |
---|---|---|---|---|---|---|
ChemGE (10, 20) | 4.46 ± 0.34 | 4.46 ± 0.34 | 4.46 ± 0.34 | 4.46 ± 0.34 | 14.5 ± 4.0 | 1030000 ± 24000 |
ChemGE (100, 200) | 5.17 ± 0.26 | 5.17 ± 0.26 | 5.17 ± 0.26 | 5.17 ± 0.26 | 135 ± 22 | 8360 ± 1340 |
ChemGE (1000, 2000) | 4.45 ± 0.24 | 5.32 ± 0.43 | 5.73 ± 0.33 | 5.88 ± 0.34 | 527 ± 62 | 704 ± 59 |
ChemGE (10000, 20000) | 4.20 ± 0.33 | 4.28 ± 0.28 | 4.40 ± 0.27 | 4.53 ± 0.26 | 555 ± 68 | 72.5 ± 9.5 |
CVAE4 | −30.18 ± 26.91 | −1.39 ± 2.24 | −0.61 ± 1.08 | −0.006 ± 0.92 | 0.14 ± 0.08 | — |
GVAE5 | −4.34 ± 3.14 | −1.29 ± 1.67 | −0.17 ± 0.96 | 0.25 ± 1.31 | 1.38 ± 0.91 | — |
ChemTS10 | 4.91 ± 0.38 | 5.41 ± 0.51 | 5.49 ± 0.44 | 5.58 ± 0.50 | 40.89 ± 1.57 | — |
Maximum score J at time points 2, 4, 6, and 8 hours achieved by different molecular generation methods. The average values and standard deviations over 10 trials are shown. The two numbers after ChemGE specifies (μ, λ). The second column from right shows the number of generated molecules per minute (duplication is eliminated). The rightmost column shows the number of generations in 8 hours.
Method . | 2 h . | 4 h . | 6 h . | 8 h . | Mol/min . | Generation . |
---|---|---|---|---|---|---|
ChemGE (10, 20) | 4.46 ± 0.34 | 4.46 ± 0.34 | 4.46 ± 0.34 | 4.46 ± 0.34 | 14.5 ± 4.0 | 1030000 ± 24000 |
ChemGE (100, 200) | 5.17 ± 0.26 | 5.17 ± 0.26 | 5.17 ± 0.26 | 5.17 ± 0.26 | 135 ± 22 | 8360 ± 1340 |
ChemGE (1000, 2000) | 4.45 ± 0.24 | 5.32 ± 0.43 | 5.73 ± 0.33 | 5.88 ± 0.34 | 527 ± 62 | 704 ± 59 |
ChemGE (10000, 20000) | 4.20 ± 0.33 | 4.28 ± 0.28 | 4.40 ± 0.27 | 4.53 ± 0.26 | 555 ± 68 | 72.5 ± 9.5 |
CVAE4 | −30.18 ± 26.91 | −1.39 ± 2.24 | −0.61 ± 1.08 | −0.006 ± 0.92 | 0.14 ± 0.08 | — |
GVAE5 | −4.34 ± 3.14 | −1.29 ± 1.67 | −0.17 ± 0.96 | 0.25 ± 1.31 | 1.38 ± 0.91 | — |
ChemTS10 | 4.91 ± 0.38 | 5.41 ± 0.51 | 5.49 ± 0.44 | 5.58 ± 0.50 | 40.89 ± 1.57 | — |
Method . | 2 h . | 4 h . | 6 h . | 8 h . | Mol/min . | Generation . |
---|---|---|---|---|---|---|
ChemGE (10, 20) | 4.46 ± 0.34 | 4.46 ± 0.34 | 4.46 ± 0.34 | 4.46 ± 0.34 | 14.5 ± 4.0 | 1030000 ± 24000 |
ChemGE (100, 200) | 5.17 ± 0.26 | 5.17 ± 0.26 | 5.17 ± 0.26 | 5.17 ± 0.26 | 135 ± 22 | 8360 ± 1340 |
ChemGE (1000, 2000) | 4.45 ± 0.24 | 5.32 ± 0.43 | 5.73 ± 0.33 | 5.88 ± 0.34 | 527 ± 62 | 704 ± 59 |
ChemGE (10000, 20000) | 4.20 ± 0.33 | 4.28 ± 0.28 | 4.40 ± 0.27 | 4.53 ± 0.26 | 555 ± 68 | 72.5 ± 9.5 |
CVAE4 | −30.18 ± 26.91 | −1.39 ± 2.24 | −0.61 ± 1.08 | −0.006 ± 0.92 | 0.14 ± 0.08 | — |
GVAE5 | −4.34 ± 3.14 | −1.29 ± 1.67 | −0.17 ± 0.96 | 0.25 ± 1.31 | 1.38 ± 0.91 | — |
ChemTS10 | 4.91 ± 0.38 | 5.41 ± 0.51 | 5.49 ± 0.44 | 5.58 ± 0.50 | 40.89 ± 1.57 | — |
We also employed ChemGE to design molecules that have strong binding affinity for a specific target protein with rDock,17 which is one of the fastest and most accurate docking simulators, commonly used in high throughput virtual screening. We adapted thymidine kinase (KITH), a well-known target of antiviral drugs, as the target protein for this study. After taking the structure data from a KITH-ligand complex (PDB ID: 2B8T), we generated a cavity using the default reference ligand method (radius: 6.0 Å). To calculate ChemGE's fitness, we took the best intermolecular score (Sinter) among three rDock runs from different initial conformations under default parameter settings. The fitness was defined as,
because a small intermolecular score implies high affinity and SA score helps the generated molecule being synthetically available. If the molecular weight was bigger than 500 or the molecule was already generated, the fitness was set to −∞. In this computational experiment, sulfur, boron, and phosphorus are excluded from grammar.
We reserved 32 cores for calculation and set (μ, λ) = (32, 64). Evaluation of molecules in a population was conducted in parallel. The computation for 1000 generations was finished in about 26 hours, resulting in a large library of 9466 generated molecules. We compared the score of generated molecules with that of 57 known inhibitors derived from DUD-E.18 We found 189 molecules whose intermolecular score was better than the best score of 57 known inhibitors. Although synthetic routes and ADMET properties are unknown for these molecules, we believe that this large-scale library can offer great opportunities and inspiration for medical chemists.
We investigated the diversity of generated molecules found by ChemGE. Larger diversity generally increases the chance of survival at multiple ADMET endpoints.
Figure 2 shows a visualized distribution in the chemical space explored by ChemGE. A mapping from fingerprints to a two dimensional space is constructed by applying the ISOMAP algorithm25 to ZINC. Then, the following molecule sets are mapped to the two dimensional space: Figure 2(a) known inhibitors, Figure 2(b) initial population of ChemGE, Figure 2(c) 100-th generation, Figure 2(d) 1000-th generation. The initial population is distributed in various places and a large fraction lies close to known inhibitors. As evolution goes on, the distribution moves away from the initial position and the location of final generation was different from known inhibitors. This observation suggests that ChemGE detected a new class of binding molecules, although they need to be confirmed by biological assays, and synthesizability and ADMET issues would have to be solved. (e) shows an example of generated molecules whose scores are better than known binding molecules.

ISOMAP visualization of the chemical space and the best molecule. Blue dots represent molecules in ZINC database. Red cross in (a) represent known inhibitors of thymidine kinase. Red crosses in (b), (c), (d) represent ChemGE populations in different generations. The molecule that has the best score is shown. (e) shows an example of generated molecules whose score are better than known inhibitors.
We evaluated internal diversity13 of generated molecules as an indicator of diversity. Using Morgan fingerprints,26 the internal diversity of a set of molecules A is defined as |$\text{I(A)} = \sum\nolimits_{(\text{x},\text{y}) \in \text{A} \times \text{A}}\text{T}_{\text{d}}(\text{x},\text{y})/|A|^{2} $|, where Td is the Tanimoto distance |$\text{T}_{\text{d}}(\text{x},\text{y}) = 1 - |\text{x} \cap \text{y}|/|\text{x} \cup \text{y}|$|. We evaluated a “ChemGE-active” molecule set consisting of 189 molecules whose scores were better than the best known inhibitor. The internal diversity was 0.52, larger than that of known inhibitors, 0.46. The difference is substantial, considering that the diversity of the whole ZINC database is 0.65. Deep reinforcement learning methods often generate very similar molecules13 and special countermeasures are necessary to maintain diversity. In grammatical evolution, diversity is inherently built-in, because a large fraction of molecules is mutated in each generation.
Figure 3 shows the effect of concurrent simulation. We used Amazon Web Services (AWS) to set up a parallel environment and changed the number of instances to run docking simulations. As the number of instances increased, optimization got faster. This demonstrates parallelization contributes to increased speed.

Elapsed time and docking score (Sinter) are shown. As the number of instances increased, score improvement got faster. This illustrates that concurrent simulation contributes to acceleration of molecule design.
In summary, we developed ChemGE as a new molecular generator to generate functional molecules using grammatical evolution. In benchmark experiments, ChemGE efficiently generated far more molecules than deep learning methods, at similar resource levels. In computational docking experiments with thymidine kinase, a large library of activity-biased molecules was successfully obtained. Among them, hundreds of molecules had better scores than known active molecules. Our library was shown to retain high diversity, which may contribute to increasing the survival rate in the following steps of the drug discovery process. Our method is suitable for parallel computation and our implementation can be applied to even larger computational environments. This work demonstrated that molecule generation is possible without costly deep learning and showed a new direction for research. Nevertheless, to increase performance further, exploitation of a probabilistic or neural model in the evolutionary process might be beneficial.27
References
R. Gómez-Bombarelli, D. Duvenaud, J. M. Hernández-Lobato, J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, A. Aspuru-Guzik, arXiv preprint 2016, arXiv:161002415v2.
M. J. Kusner, B. Paige, J. M. Hern’andez-Lobato, Proceedings of the 34th International Conference on Machine Learning, 2017, 70, 1945.
G. L. Guimaraes, B. Sanchez-Lengeling, P. L. C. Farias, A. Aspuru-Guzik, arXiv preprint 2017, arXiv:170510843.
I. Dempsey, M. O'Neill, A. Brabazon, Foundation in Grammatical Evolution for Dynamic Environments, Springer, Germany, 2009. doi:10.1007/978-3-642-00314-1.
M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, arXiv preprint 2017, arXiv:171109846.
C. A. James, OpenSMILES specification, http://opensmiles.org.
RDKit: Open-source cheminformatics, http://www.rdkit.org.
D. P. Kingma, M. Welling, Proceedings of the International Conference on Learning Representations, 2014.

K. Tsuda

N. Yoshikawa

K. Terayama

M. Sumita

K. Oono