Molecular generation strategy and optimization based on A2C reinforcement learning in de novo drug design

Abstract Motivation In the field of pharmacochemistry, it is a time-consuming and expensive process for the new drug development. The existing drug design methods face a significant challenge in terms of generation efficiency and quality. Results In this paper, we proposed a novel molecular generation strategy and optimization based on A2C reinforcement learning. In molecular generation strategy, we adopted transformer-DNN to retain the scaffolds advantages, while accounting for the generated molecules’ similarity and internal diversity by dynamic parameter adjustment, further improving the overall quality of molecule generation. In molecular optimization, we introduced heterogeneous parallel supercomputing for large-scale molecular docking based on message passing interface communication technology to rapidly obtain bioactive information, thereby enhancing the efficiency of drug design. Experiments show that our model can generate high-quality molecules with multi-objective properties at a high generation efficiency, with effectiveness and novelty close to 100%. Moreover, we used our method to assist shandong university school of pharmacy to find several candidate drugs molecules of anti-PEDV. Availability and implementation The datasets involved in this method and the source code are freely available to academic users at https://github.com/wq-sunshine/MomdTDSRL.git.


Introduction
It is important to design effective and high-quality compounds, but the process is time-consuming and expensive for the new drug development.Molecular design (Valverde 2001) is usually to generate compounds with specific chemical properties and pharmaceutical potential based on ligand or target structure.Effective molecular design approaches facilitate the rapid discovery of compounds with desired properties, such as high bioactivity, low toxicity and excellent synthetic properties.Such methods are widely employed in the new drug development, including lead compound discovery and molecular structure optimization.However, the existing drug molecular design methods face a significant challenge in terms of generation efficiency and quality.There is still a huge gap between chemically generated molecules and the actual requirements for drug development.The fundamental difficulty stems from the need to systematically establish a correlation mechanism between molecular structure and a variety of drug properties.
Molecular structure determines the pharmacological properties and functions of compounds, therefore, molecular characterization is a fundamental problem in drug computing.In terms of intuitiveness, the 3D structural representation of drug molecules can provide spatial configuration information, facilitating the understanding of the molecule's spatial properties and interactions (Ragoza et al. 2020, Luo et al. 2021).But the accuracy of its prediction results can be affected by factors such as molecular conformational changes and computational resources.Another representation is the molecular graph, which represents the molecule as a graph of nodes and edges, used to analyze the chemical structure and properties of drug molecules (Schu ¨tt et al. 2017, Li et al. 2018, Mercado et al. 2021).But cannot express the molecule's stereochemistry and spatial properties, which can affect applications such as virtual screening and activity prediction.However, the most commonly used representation is the SMILES representation, which has the advantages of simplicity, easy storage, and transmission, but may suffer from ambiguity and partial loss of chemical information (Blaschke et al. 2020).And current existing methods can use the molecule's atomic and bond information to avoid problems in SMILES representation (Scarselli et al. 2008).
Regarding the design methods, deep learning techniques have found extensive application in molecular design.A series of methods based on recurrent neural network (RNN), variational autoencoder (VAE), generation adversary network (GAN) and reinforcement learning (RL) have realized the generation of candidate drugs for specific targets.RNN was a widely used neural network structure, which showed a strong processing ability in the drug molecular design (Merk et al. 2018, Moret et al. 2020).The RNN-based generation model uses low-dimensional vector representation of molecular character sequences to calculate hidden states at each time step to output probability distributions (Olivecrona et al. 2017, Popova et al. 2018).But, this method only considers random combination of fragments or de novo design, ignoring the advantages of molecular scaffolds.The concept of variational automatic encoder was first proposed at the end of 2013 (Kingma and Welling 2019).The VAE-based molecular generation models map the input molecules to a Gaussian distribution in the latent space using an encoder and decoder, aiming to generate a latent variable distribution similar to the original molecular data by minimizing the reconstruction loss and KL divergence loss (Bjerrum and Sattarov 2018, Dai et al. 2018, Simonovsky and Komodakis 2018).GAN could be combined with multi-task learning to optimize different properties of compounds for joint training, so as to complete the multi-objective molecular generation task (Dong et al. 2020, Li et al. 2021).The GAN-based molecular generation model optimizes the generator and discriminator loss functions by mutual competition, achieving a mapping from random noise space to molecular space to generate new molecules with desired properties (Guimaraes et al. 2017, Yu et al. 2017, De Cao and Kipf 2018).Reinforcement learning could flexibly adjust different parameters to adapt to multiple tasks in drug computing (Cherti et al. 2017, Popova et al. 2018, Wang et al. 2021).The RL-based molecular generation model follows the Markov decision process, interacting with the chemical space, taking actions to generate new molecules, and evaluating performance by reward function.By constantly updating the policy function, new molecules with superior properties can be generated.Whether it is based on VAE, GAN, or RL methods, the evaluation mechanism for molecule generation is limited to the enumeration of simple properties which cannot be compatible to consider the key druggable properties such as physicochemical properties and bioactivity.Moreover, the scale of molecular docking calculation used for evaluating molecular bioactivity is also extremely large.
Aiming at the above question, we propose a novel molecular generation strategy and optimization based on A2C reinforcement learning.The main contributions of our approach are the following: 1) The generation strategy adopts transformer-DNN to retain the scaffolds advantages, while accounting for the generated molecules' similarity and internal diversity by dynamic parameter adjustment, further improving the overall quality of molecule generation.2) In molecular optimization, heterogeneous parallel supercomputing is introduced for large-scale molecular docking based on message passing interface (MPI) communication technology to quickly obtain bioactivity information, thereby improving the efficiency of drug design.3) Experiments show that our model can generate highquality molecules with multi-objective properties at a high generation efficiency, with effectiveness and novelty close to 100%.

Materials and methods
This paper presented a novel molecular generation strategy and optimization by combining transformer-DNN, heterogeneous parallel supercomputing and reinforcement learning.As shown in Fig. 1, we integrated molecular characterization, generation strategy and molecular optimization into the framework based on actor-critic reinforcement learning.
The specific submodules will be described in detail below.

Molecular characterization based on the balanced binary tree
Molecular characterization is a fundamental problem in drug design.It directly affects the construction of a clear correlation between molecular structure and drug properties.Therefore, we proposed the molecular characterization based on balanced binary tree.

Fragment combination library construction
We constructed a virtual fragment combinatorial library by fragmenting compounds.The molecules were partitioned into ring structures, side chains and linkers.To achieve this, single bonds extending from a ring atom were cleaved, and the corresponding attachment points were recorded for subsequent assembly steps.

Fragment encoding based on balanced binary tree
For the fragment combination library constructed previously, we calculated the similarity between fragments and then encoded fragments to achieve molecular fragment data preprocessing.

1) Fragment similarity calculations
In the method, through encoding all fragments as binary strings, similar fragments got similar encodings.Specifically, we employed the Tanimoto-MCS (TMCS)similarity for measuring the similarity, its formula was defined as follows: Here, TMCSðM 1 ; M 2 Þ was the number of atoms in the maximum common substructure of molecules M 1 and M 2 , aðM 1 Þ and aðM 2 Þ were the number of atoms, respectively.2) Fragments encoding When constructing the tree, we calculated similarities among all fragments using Equation (1).Subsequently, pairs of fragments were created in a stepwise, bottom-up fashion, starting with the two most similar fragments.This procedure was iterated, and the two pairs containing the most similar fragments were amalgamated into a new tree with four branches.The computed similarity between two subtrees was determined as the highest similarity between any two fragments within those subtrees.This joining process was reiterated until all fragments were consolidated into a unified tree.
All fragments were encoded into binary strings by constructing a balanced binary tree that was based on fragments similarity calculations.Afterwards, each fragment used this tree to generate binary strings.The fragments encodings were determined by the paths from the root to the leaves.Every branching to the left will add a one ("1") to the end of the encoding and every branching to the right will add a zero ("0").

Molecular generation strategy based on transformer-DNN
Molecular characterization is the foundation, so the molecular generation strategy is the core of drug design.Molecular generation is able to generate new compounds by performing fragment replacement on an initial set of lead compounds.
In our method, the generation strategy adopted the transformer-DNN to identify one or more fragments for replacement.Firstly, the molecular sequence was preprocessed through molecular characterization to obtain fragment level embedding with the position information and scaffolds level embedding, which served as the input to the transformer-DNN.Then, using the transformer, the fragments importance was obtained by calculating the attention coefficient of each fragment.This process preserved the advantages of the molecular scaffolds and output a vectorized feature representation of a molecule with different fragment associations.Finally, the concatenated output was classified by the DNN, which estimated the probability distribution of which fragments to replace and with what fragments, as illustrated in Fig. 2.
The molecule's multi-granularity information features were processed through a transformer layer.This transformer encoder comprised multi-head attention layers, position feedforward network (FFN), residual connection layers and normalization layers.The multi-head attention layer consisted of multiple attention heads, with each head performing linear transformations before calculating attention weights.Different trainable parameter sets were used for each attention head to capture distinct relationships between fragment sequences.This allowed learning of internal molecular structures through attention-weighted calculations.Specifically, the input of the multi-head attention layer consisted of three vector sequences: , where l 1 and l 2 were the lengths of the first and second fragment sequences, respectively.Next, for the hth head, three weight matrices were used to map these three inputs to a lower dimension d p .The attention function was then performed: where , the attention function produced an array for each vector in the query sequence Q.A h 2 R l 1 Âd f represented the output of the attention function.Then, the output of the multi-head attention layer was obtained through a concatenation of linear transformations from attention heads: Here, H represented the number of heads, and W O 2 W HdpÂd f was the trained weight matrix.Subsequently, the output of the multi-head attention layer was fed into a feedforward neural network.The feedforward neural network consisted of two linear layers, with the first layer utilizing ReLU as an activation function.The position FFN transformed each vector in the sequence in a uniform manner, as illustrated below: Then, the multi-head attention and the feedforward neural network were connected through residual connections and normalization.After the encoding transformation, we can obtain features for each molecular sequence: where FT f represented fragment-level features, and FT s represents scaffold-level features of the molecular sequence.Molecular generation strategy and optimization 3

Multi-objective optimization of molecular properties
Molecular generation strategy is the core, in that way molecular optimization is the key in the drug design.In this article, we added the evaluation of physical and chemical properties and molecular docking to the reward of reinforcement learning, the latter of which was used to add biological activity constraints.Considering the computational scale of molecular docking, we had adopted parallel supercomputing, greatly accelerating the speed of molecular generation.Preliminary generated molecules were screened for toxicity by checking for the presence of known toxic groups.In addition, molecular weight (MW), effectiveness (E), lipophilicity (LogP) and polar surface area (PSA) also affected the pharmacokinetic properties.Only those molecules satisfying the specified range of properties had excellent pharmacokinetic properties.The calculation formula was as follows: where k represented multiple properties, l and u were the set range of molecular properties.We used large-scale molecular docking to obtain docking scores, and then input them into a perceptron classifier for bioactivity prediction.Docking score in each epoch needed to perform docking calculations between the generated compounds in batches and specific target structures, which was obviously a time-consuming operation.To address this issue, we introduced MPI parallel computing based on the heterogeneous parallel supercomputing.In this process, the master process was responsible for reading protein target file data and analyzing target sequences.It intelligently distributed the data to computing nodes within the sub-communication domain based on the relationship between the number of protein targets and the number of worker processes.The worker processes handled the reading of ligand small molecule file data and broadcast it to all computing nodes, while also monitoring the overall program's execution.The computing nodes were tasked with the actual molecular docking process, and once suitable docking conformations were found, they wrote the docking results to a file and sent a signal to the master process.
Specifically, we first preprocessed ligand data and protein target files to create multiple ligand sets and specific target collections.a "target management node" was established to read all target files and docking pocket information in batches, which was then broadcasted to each computing node.Simultaneously, multiple "ligand management nodes" were assigned to read small molecule files and distributed them to computing nodes.Subsequently, molecular docking was performed between target files and small molecule files in the ligand sets, resulting in docking outcomes upon completion.Figure 3 shows the specific task scheduling and allocation process.
In the molecular optimization, the bioactivity prediction was realized by constructing classifiers.Initially, the ic50 value was used as the benchmark, with active molecules as positive samples and inactive molecules as negative samples.The perceptron generated the activity information by virtually docking between the data of positive and negative samples and disease-related targets.From the docking results, we extracted the docking score, number of nonhydrogen atoms, and ratio of the first two.These features were then input into the perceptron to predict activity.In addition, a large-scale molecular docking was employed to quickly and accurately calculate the affinity score between small molecules and specific targets, and to predict the real activity of generated molecules, so as to further optimize the biological activity.
In brief, the data obtained through molecular characterization was fed as the input to the model.The actor then adapted the generation strategy using transformer-DNN, with particular emphasis on molecular optimization being incorporated into the reward mechanism.Finally, critic calculated the value of the generation strategy adopted by the actor after receiving the reward.If yes, the actor's action would be strengthened.If no, be blocked.Then, the current state was replaced by the new state.The actor was rewarded for each valid molecule generated.The reward was higher if it managed to produce molecules that met the expectations of multi-objective properties.The valid molecules generated were taken as the input for the next generation task.Through continuous iteration, the model accuracy was optimized to the maximum, which was marked by the stable proportion of high-quality molecules generated.

Dataset and preprocessing
We utilized two common datasets to evaluate model performance.The first one was the MOSES dataset (Polykovskiy et al. 2020), composed of 1.9 million lead compounds filtering from the ZINC database, which had favorable drug-like properties.The second one was the GuacaMol dataset (Brown et al. 2019), which was a subset of ChEMBL 24, containing 1.6 million molecules with diverse distributions in terms of MW, LogP, TPSA and the number of rotatable bonds.We randomly selected 100 000 molecules for model training, and in the testing phase, we evaluated the model using 10 000 generated molecules.
Moreover, we constructed a dataset for case studies focusing on specific Mpro targets.The dataset was divided into two parts.One part was the initial compound group that contained 175 lead compounds, selected from 969 lead compounds by virtual screening.The other part was a virtual fragment library with 9823 Mpro inhibitory activity from the ChEMBL.And, in the reward mechanism of reinforcement learning, the biological activity prediction was realized by constructing perceptron classifiers.The perceptron model used the docking software Ledock to dock between the existing relevant PDB files of 380 different 3D conformations of Mpro targets and molecules generated in each epoch.In the pretraining process of the perceptron classifier, we used the dekois 2.0 benchmark dataset library provided by Tubingen University, which contained the SARS coronavirus 3CL protease dataset (Ibrahim et al. 2015).
In molecular optimization, we optimized several multiobjective properties, including LogP, PSA, MW, and toxicity.The target ranges for these properties were shown in Table 1.To obtain the reference range, we calculated properties values of 2086 patent drug published on the DrugBank (Wishart et al. 2018) using the ADMETLab2.0(Xiong et al. 2021).We then made probability statistics as shown in Supplementary Figs S1-S4.According to the Lipinski rule and the empirical parameters guided by the statistical results of the combined drugs, the compounds falling within these reference ranges were likely to have better pharmacokinetic properties and higher bioavailability during metabolic processes in the organism.

Evaluation metrics
We used a set of metrics from the MOSES and GuacaMol benchmarks to assess the quality and diversity of the generated chemical compounds.
Validity: The proportion of molecules that satisfy chemical validity criteria.Higher validity means that the model has a better grasp of fundamental chemical rules.
Novelty: The ratio of generated valid molecules that do not exist in the training dataset, using to assess the model's ability to generate novel molecules.
Uniqueness: The distinct components within the molecules generated effectively, aiding in assessing the model's grasp of the molecular distribution in the data.Higher uniqueness means that the model can generate more diverse molecules.
Internal diversity: The similarity among molecules in the generated molecular set.As shown in Equation ( 8), T represents the Tanimoto similarity between molecules m 1 and m 2 in the generated set G. A higher value of this metric indicates higher diversity within the generated set.
Scaffold similarity: Cosine similarity of Bemis-Murcko scaffold frequencies between two sets of compounds.Expressing c s ðAÞ as the frequency of scaffold s appearing in molecules from set A, and defining S as the set of fragments found in either G or R, the metric is defined as shown in Equation ( 9): Fre ´chet ChemNet distance (FCD): A distance measure that quantifies the chemical and biological similarity between two sets of compounds.As shown in Equation ( 10), l G , l R are mean vectors and R G , R R are full covariance matrices of activations for molecules from sets G and R respectively.

Method comparison
To assess our method, we generated 10 000 molecules by sampling and benchmarked the model using We conducted molecular generation on the NOSES dataset and the GuacaMol dataset, sampling 10 000 molecules and counting various metrics for the generated molecules.Table 2 shows the performance of the baseline and our model on the MOSES dataset, indicating that our model achieves optimal novelty and diversity.However, due to the uneven probability distribution of fragment selection and replacement, the validity of our model was slightly lower.In contrast, JTVAE achieved 100% validity due to its tree decomposition and predefined subgraphs, avoiding learning SMILES syntax, but its other three indicators were not prominent.MoFlow also attained 100% validity (thanks to validity checking components) and scaffold similarity, but the internal diversity was lower, which meant that the model cannot effectively modify the scaffold and was less suitable as a scaffold transformation.
Table 3 displays the performance of the baseline and our model on the GuacaMol dataset.SMILES LSTM, AAE, and VAE exhibited strong validity, uniqueness, and novelty scores (>80%).Conversely, Organ underperformed significantly, demonstrating lower validity (37.9%), novelty (68.7%), and an FCD value of 0. This indicated that Organ struggled to effectively learn the data distribution of the training set and failed to adequately consider the molecular scaffold structure.AAE overcaptured features from the training data, resulting in similar molecules in the embedding space being closer, leading to lower FCD values.In contrast, our model had achieved the best performance in all aspects, especially achieving 100% uniqueness and novelty This conclusively demonstrated that our model outperforms others in overall performance.

Target-specific molecules generation
For the specific Mpro target, we introduced an activity optimization strategy.Utilizing the dataset we had created, we successfully generated high-quality molecular compounds, offering promising candidates for anti-PEDV to the Shandong University School of Pharmacy.Furthermore, we also conducted a comprehensive evaluation, covering various critical indicators, including analyzing the property distribution, similarity, synthetic accessibility (SA), and drug-likeness, as well as the binding model of the generated molecule and its corresponding crystal structure, to ensure that these molecules had good potential and feasibility in anti-PEDV drug research.
Molecular properties: We focused on optimizing four key properties, namely effectiveness, logP, PSA and MW.The percentage of generated molecules that met the target ranges for these properties was shown in Fig. 4. Overall, with the increase of epoch, the proportion of molecules that satisfied the target ranges for each property increased and tended to be stable.In the end, among the molecules we had successfully generated, nearly 98% were effective compounds, with over 40% of them meeting all desired properties.
In addition, we counted the properties distribution of MW, logP and PSA.Compared with the distribution in the initial set, the distribution of target properties generated in the last 10 epochs was shown in Supplementary Figs.S5-S7.It could be found that property distribution of the generated molecular sets (red) closely matched that of the initial lead molecule set and were within the target properties range specified in Table 1.
Molecular similarity: The similarity between newly molecules and the initial molecules remained stable at 0.8 under different Time parameters.As shown in Fig. 5, with the increase of the parameter, that was, the increase of replacement fragments, the similarity values of the generated new molecules and the original molecules scattered in different intervals.This indicated the increasing number of replaced molecular fragments led to the increase in the novelty of the newly generated molecules.This could be attributed to the fact that the initial molecular structures underwent multiple modifications, resulting in a greater diversity and novelty in the generated molecular structures.
Synthetic accessibility: SA score was used to assess the difficulty of compound synthesis.We analyzed the SA score distribution of the generated molecules in Fig. 6.Except for parameter 3, which had a relatively high number of outliers, The bold values signify the highest value for that metric.The bold values signify the highest value for that metric.
the molecular SA property distributions generated by the other parameters were mostly within the 1.5 interquartile range (IQR).In addition, the median of the SA properties was consistently close to 3, with a range of 1 to 10, where values closer to 1 indicated molecules that were easier to synthesize.
Binding model: We selected molecules with highest binding energy and ligand efficiency from the generated molecules and analyzed the binding model of the generated molecule and its crystal structure with SARS-CoV-2 Mpro of COVID-19.Then, we compared the binding energy and ligand efficiency of the generated molecules with their corresponding initial molecules.As shown in Fig. 7, we observed that the optimal docking score obtained for the generated molecules was superior to that of the known compounds.Drug-likeness: Drug-likeness (QED) scores were a quantitative estimate to evaluate the similarity between the generated compound molecules and known drugs.We randomly selected 10 molecules with better binding energy and ligand efficiency from the generated molecules and analyzed the change of QED score between the generated and the initial molecules.As shown in Fig. 8, the QED scores of the generated molecules was greater than that of the initial molecules, and where the QED score was >0.7.

Ablation studies
To validate the impact of properties optimization on generating high-quality molecules, we conducted a series of ablation experiments and analyzed validity, novelty, internal diversity  as well as scaffold similarity of the generated molecules.As shown in Table 4, the molecules generated in the model with properties optimization exhibited a significantly improved validity, rising from 0.820 to 0.950.Furthermore, more satisfactory results were achieved in terms of novelty (0.999 versus 0.869) and scaffold similarity (0.930 versus 0.785).In contrast, the impact of whether properties optimization was employed within the model on internal diversity was not as evident.This might be attributed to the fact that the objectives of properties optimization tend to ensure that the generated molecules meet specific attributes or rules, resulting in relatively consistent internal structures and a reduction in internal diversity.

Conclusion
Improving the efficiency and quality of molecule design is crucial in the process of the development of innovative drugs.The key is to balance the bioactivity and multi-physicochemical properties based on molecular scaffolds, which is difficult to achieve through opportunistic compound screening and artificial structure optimization.In this study, we proposed a novel molecular generation strategy and optimization based on A2C reinforcement learning.In molecular generation strategy, we adopted transformer-DNN to retain the scaffolds advantages, while accounting for the generated molecules' similarity and internal diversity by dynamic parameter adjustment, further improving the overall quality of molecule generation.In molecular optimization, we introduced heterogeneous parallel supercomputing for large-scale molecular docking based on MPI communication technology to rapidly obtain bioactive information, thereby enhancing the efficiency of drug design.On this basis, future research can be extended to the characterization based on graph with 3D geometric information.In molecular generation strategies, it is possible to consider the introduction of diffusion models or their combination with larger models to optimize drug molecule design.For molecular optimization, the ADMET properties of molecules can also be considered in future research to facilitate the simplification of downstream molecular synthesis and drug measurement.Moreover, our method may provide important insights into the refinement of molecule generation and optimization for self-learning automatic molecular design.

Figure 1 .
Figure 1.Molecular design framework based on A2C reinforcement learning.

Figure 3 .
Figure 3.The specific task scheduling and allocation process.

Figure 4 .
Figure 4.The percentage of valid molecules and the percentage of molecules that fulfill the target ranges specified in Table1.

Figure 5 .
Figure 5.The distribution of similarity between the generated and the initial molecules.

Figure 6 .
Figure 6.The distribution of SA score of generated molecules.

Figure 7 .
Figure 7. Binding model of the generated molecule and its crystal structure.A, B, and C correspond to 6L70.C shows the protein-ligand interactions of the known compounds and the generated molecules.

Figure 8 .
Figure 8.The change of QED score between the generated and the initial molecules.

Table 1 .
Targeted molecular properties and their desired value ranges.

Table 2 .
Performances metrics of different models on the MOSES dataset with sampling 10 000 molecules generated.

Table 3 .
Performances metrics of different models on the GuacaMol dataset with sampling 10 000 molecules generated.

Table 4 .
The impact of properties optimization on the model.