Molecular property prediction based on graph structure learning

Abstract
Motivation: Molecular property prediction (MPP) is a fundamental but challenging task in the computer-aided drug discovery process. More and more recent works employ graph-based models for MPP and have achieved considerable progress in prediction performance. However, current models often ignore the relationships between molecules, which could also be helpful for MPP.
Results: To this end, we propose a graph structure learning (GSL) based MPP approach, called GSL-MPP. Specifically, we first apply a graph neural network (GNN) over molecular graphs to extract molecular representations. Then, with molecular fingerprints, we construct a molecule similarity graph (MSG). Following that, we conduct GSL on the MSG, i.e., molecule-level GSL, to obtain the final molecular embeddings, which fuse both the GNN-encoded molecular representations and the relationships among molecules, thus combining intra-molecule and inter-molecule information. Finally, we use these molecular embeddings to perform MPP. Extensive experiments on 10 benchmark datasets show that our method achieves state-of-the-art performance in most cases, especially on classification tasks. Further visualization studies also demonstrate that our method learns good molecular representations.
Availability and implementation: Source code is available at https://github.com/zby961104/GSL-MPP.


Introduction
The accurate prediction of molecular properties is a critical task in drug discovery. Computational methods can accomplish this task with great efficiency, reducing both the time and expense associated with identifying drug candidates. This is particularly important considering that the average cost of developing a new drug is currently estimated at approximately $2.8 billion [4,30], that the development period lasts over a decade, and that the risk of clinical failure is high [20]. Naturally, a molecule can be abstracted as a topological graph, where atoms are treated as nodes and bonds as edges. In the past few years, deep graph learning methods, especially various graph neural networks (GNNs), have been applied in this field, offering effective molecular graph representations for accurate molecular property prediction [3,22,26]. In GNNs, nodes iteratively update their representations by aggregating information from their neighbors, and a final graph-pooling layer generates a graph representation for the molecule. Various message passing layers have been proposed and applied, including GAT [27], MPNN [5], and GIN [33]. Later studies further integrated edge features into the passed messages to improve the expressive power of the models, such as DMPNN [34] and CMPNN [22].
Despite this considerable progress, most recent studies focus only on the message passing within individual molecules. The relationships among molecules are often ignored, although they could also play an important role in property prediction [29]. One relatively easy and effective way to exploit them is to construct a relationship graph among molecules using structural similarity, because a critical assumption of medicinal chemistry is that structurally similar molecules tend to have similar biological activities [8]. For example, fingerprint similarity search (fingerprints carry the structural information of molecules) is often used in virtual screening [15]. However, this assumption is not always true, since a phenomenon called the activity cliff (AC) exists: an AC is defined as a pair of structurally similar compounds with a large potency difference against a given target [12,23,25,24]. Thus, a relationship graph constructed by structural similarity may not be "perfect" for downstream tasks, and we need to take measures to enhance it if we want to make full and proper use of it.
To address the problems above, we propose a novel two-level graph representation learning method for molecular property prediction, called GSL-MPP. Our method operates in a two-level molecular graph representation framework: (i) atom-level molecular graph representation, where molecular graphs composed of atoms and bonds represent the intra-structures of molecules; and (ii) molecule-level graph representation, where an inter-molecule similarity graph (MSG for short) is constructed by fingerprint similarity to encode similarities between molecules, allowing effective label propagation among connected similar molecules. Intra-molecular representation is done by GNNs, and inter-molecular representation is performed by graph structure learning (GSL). This two-level graph representation enables us to comprehensively exploit both intra-molecule and inter-molecule information to obtain better molecular representations and to overcome (to some degree) the AC problem, consequently boosting MPP performance. Specifically, we apply metric-based iterative graph structure learning: the MSG structure and molecular embeddings are updated T times, and during each iteration, GSL-MPP learns a better MSG structure based on better molecular embeddings and, in turn, learns better molecular embeddings with a better MSG structure. Besides, during training, we add a GSL-specific loss to the common supervised loss for better MSG structure learning on both classification and regression tasks. Our method is evaluated on 7 benchmark datasets, including 4 classification tasks and 3 regression tasks. Experimental results show that our model achieves state-of-the-art performance in most cases. Ablation studies show that the combination of fingerprint similarity and GSL is particularly effective.

Molecular Property Prediction
Most methods for predicting molecular properties can be summarized by a general framework: first, the input molecule m is transformed into a fixed-length vector h by a representation function, h = g(m); then a prediction function predicts a specific property y from h, y = f(h). In this framework, a good molecular representation is of vital importance for molecular property prediction.
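As a toy illustration of this two-stage framework (the helper names and the uppercase-counting "representation" below are hypothetical stand-ins, not part of the paper's method):

```python
def predict_property(m, g, f):
    """General MPP framework: a representation function g, then a prediction head f."""
    h = g(m)        # molecular representation (e.g., a fingerprint or GNN embedding)
    return f(h)     # property prediction from the representation

# Toy instantiation: g counts uppercase letters in a SMILES-like string
# (a crude stand-in for a fingerprint), and f thresholds the count.
g = lambda smiles: sum(c.isupper() for c in smiles)
f = lambda h: float(h > 5)
y = predict_property("CCO", g, f)
```

Any concrete MPP method can be seen as one choice of g (fingerprints, GNNs, Transformers) paired with one choice of f.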
At early stages, traditional chemical fingerprints such as Extended-Connectivity Fingerprints (ECFP) [14,6] were used to encode a molecule into a vector. These fingerprints carry the structural information of the molecules [16].
To improve expressive power, recent works have used graph neural networks (GNNs) to acquire graph-level representations as molecular embeddings. Examples include the graph convolutional network (GCN) [3], graph attention network (GAT) [27], message passing neural network (MPNN) [5], and graph isomorphism network (GIN) [33]. Later works extend the MPNN framework to consider bond information during the message passing procedure, such as DMPNN [34] and CMPNN [22]. Besides, CD-MVGNN [11] considers both atom-level and bond-level message passing, with a cross-dependency mechanism designed so that the two views rely on information from each other during feature updates, thereby enhancing expressive capability.
Recently, many efforts have also been made to integrate Transformers into graph neural networks. The Molecule Attention Transformer (MAT) [13] incorporates node distances and graph structural information when calculating attention scores. Grover [19] combines message-passing networks with the Transformer architecture to create a more expressive molecular encoder that captures information at two hierarchical levels. CoMPT [1] is also built upon the Transformer architecture; unlike previous graph Transformer models that treat molecules as fully connected graphs, it employs a message diffusion mechanism inspired by heat diffusion phenomena to integrate information from the adjacency matrix, alleviating the over-smoothing issue.
However, these methods focus only on the structure of a single molecule, ignoring the important role of inter-molecular relationships in property prediction.

Graph Structure Learning
The expressive power of GNNs often depends on the input graph structure. However, the initial graph structure is not always optimal for downstream tasks. On the one hand, the original graph is constructed from the original feature space, which may not reflect the "true" graph topology after feature extraction and transformation. On the other hand, errors can occur when data is measured or collected, making the graph noisy or even incomplete. Graph structure learning (GSL) can effectively address this problem by learning and optimizing the graph structure [35]. Recently, [2] proposed iterative deep graph learning (IDGL) for jointly and iteratively learning graph structure and node embeddings in the field of natural language processing (NLP). It was later used by [28] for few-shot molecular property prediction. Compared to [28], our method is not designed for the few-shot setting, and the datasets and baselines we choose are not few-shot benchmarks either. In addition, the specific implementation of GSL is different. More importantly, we construct an initial graph between molecules before applying GSL, which is confirmed to be necessary in the ablation study.

Overview
The structure of our method GSL-MPP is illustrated in Fig. 1; it operates on a two-level graph learning framework consisting of (i) the lower level: atom-level molecular graphs encoded by a GNN to extract the initial molecular representations, and (ii) the upper level: a molecule-level similarity graph, on which graph structure learning (GSL) is performed to iteratively learn the final molecular embeddings, exploiting inter-molecular relationships.
The workflow of GSL-MPP is as follows: (1) Molecular graphs are first encoded by a GNN to obtain initial molecular embeddings; meanwhile, molecules are represented as feature vectors using fingerprints. (2) With the molecular feature vectors, the initial molecular similarity graph (MSG) is constructed, where each node is a molecule initially represented by the above GNN embedding, and each edge is assigned a weight: the similarity between the two corresponding molecules. (3) GSL is performed on the MSG, iteratively updating the molecular embeddings and the graph structure. (4) The final molecular embeddings are used for property prediction.

Molecular Graph Embedding
Here, we describe how to represent a molecular graph as an initial vector by a GNN. A molecule m can be abstracted as an attributed graph G_m = (V, E), where V is the set of n_v atoms (nodes) and E is the set of n_e bonds (edges) of the molecule. We use x_v to denote the initial feature vector of node v, and N_v to denote the set of neighbors of node v.

Node embedding
We use the Graph Isomorphism Network (GIN) [33] as the intra-molecule GNN to extract each node's embedding:

h_v^(k) = MLP^(k)((1 + ε^(k)) · h_v^(k−1) + Σ_{u ∈ N_v} h_u^(k−1)),    (1)

where MLP denotes a multi-layer perceptron, h_v^(k) is the representation vector of node v at the k-th layer, and ε is a learnable parameter. We initialize h_v^(0) = x_v.

Graph pooling
After obtaining each node's embedding, a READOUT operation (e.g., summation over all nodes) is applied to get the initial molecular embedding h_g:

h_g = READOUT({h_v^(K) | v ∈ V}).    (2)
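The GIN update and READOUT above can be sketched as follows; the toy adjacency matrix, features, and the stand-in "MLP" are illustrative assumptions, not the trained components:

```python
import numpy as np

def gin_layer(H, A, mlp, eps):
    """One GIN layer: h_v = MLP((1 + eps) * h_v + sum of neighbor embeddings)."""
    agg = (1.0 + eps) * H + A @ H   # A is the binary adjacency matrix (no self-loops)
    return mlp(agg)

def readout(H):
    """Sum-pooling READOUT over node embeddings -> a graph-level embedding."""
    return H.sum(axis=0)

# Toy 3-atom molecule (a path graph) with 2-dim node features.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.eye(3, 2)                                     # placeholder initial atom features
mlp = lambda Z: np.maximum(Z @ np.ones((2, 2)), 0)   # stand-in for a learned MLP
h_g = readout(gin_layer(X, A, mlp, eps=0.1))
```

In practice the MLP weights and eps are learned, and several GIN layers are stacked before the readout.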

Constructing Molecular Similarity Graph (MSG)
Our inter-molecule graph reflects the relationships between molecules: each node represents a molecule, and each edge represents the relationship between two molecules. As shown in Fig. 1, the initial feature vector of each node is the molecule's embedding obtained by the GNN (the embedding matrix is denoted as X_r), and the initial adjacency matrix A^(0) is calculated from the structural similarity between molecules. Here, we compute the structural similarity between molecules based on their Extended-Connectivity Fingerprints (ECFP) [14].
ECFPs are circular fingerprints with several beneficial characteristics: (1) they can be computed quickly; (2) they are not predefined and can capture an almost limitless range of molecular characteristics, including stereochemical information; (3) they indicate the presence of specific substructures, facilitating interpretation of computational results [18]. Specifically, we compute each molecule's ECFP and use the Tanimoto coefficient as the similarity score. A hyperparameter ε_tc acts as a threshold to obtain a sparse matrix; that is, we mask off those elements of the adjacency matrix that are smaller than ε_tc. We use molecular fingerprints to construct A^(0) because they carry useful structural information [16] and offer an informative initial inter-molecule graph.
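A minimal sketch of this construction, assuming the ECFP bit vectors have already been computed (e.g., with RDKit); the toy 8-bit fingerprints below are made up:

```python
import numpy as np

def tanimoto_matrix(F):
    """Pairwise Tanimoto similarity for binary fingerprints F (n_mols x n_bits)."""
    inter = F @ F.T                                     # |a AND b|
    counts = F.sum(axis=1)
    union = counts[:, None] + counts[None, :] - inter   # |a OR b|
    return inter / np.maximum(union, 1)                 # avoid division by zero

def initial_adjacency(F, eps_tc):
    """A(0): mask off similarities below the threshold eps_tc."""
    S = tanimoto_matrix(F)
    A0 = np.where(S >= eps_tc, S, 0.0)
    np.fill_diagonal(A0, 0.0)                           # no self-loops
    return A0

# Three toy 8-bit "fingerprints" (in practice, ECFP bit vectors of radius 2).
F = np.array([[1, 1, 0, 1, 0, 0, 1, 0],
              [1, 1, 0, 1, 0, 0, 0, 0],
              [0, 0, 1, 0, 1, 1, 0, 1]], dtype=float)
A0 = initial_adjacency(F, eps_tc=0.5)
```

Here the first two molecules share 3 of 4 set bits (Tanimoto 0.75), so they stay connected, while the third is disconnected from both.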

Structure Learning on Molecular Similarity Graph
As discussed above, the similarity graph constructed this way may not be good enough for downstream tasks, so graph structure learning is employed to enhance the graph by exploiting inter-molecule relationships. Specifically, the initial matrix built from fingerprint similarity only measures structural similarity between molecules and may not "perfectly" reflect true similarity in molecular properties, so we use GSL to refine it. The core of GSL is the structure learner, which can be grouped into three types: (1) metric-based approaches use a metric function, such as cosine similarity, on pairwise node embeddings to calculate edge weights; (2) neural approaches employ neural networks to infer edge weights; and (3) direct approaches treat all elements of the adjacency matrix as learnable parameters [35].
In this paper, following IDGL [2], we adopt the metric-based approach and employ the m-perspective weighted cosine similarity as the metric function:

s^p_ij = cos(w_p ⊙ h_i, w_p ⊙ h_j),   s_ij = (1/m) Σ_{p=1}^{m} s^p_ij,    (3)

where s^p_ij estimates the cosine similarity between nodes v_i and v_j under the p-th perspective, ⊙ denotes element-wise multiplication, and each perspective p considers one part of the semantics contained in the vectors and corresponds to a learnable weight vector w_p. The obtained s_ij is the entry in row i and column j of the newly learned adjacency matrix A. The ε-neighborhood sparsification technique is also applied to obtain a sparse and non-negative adjacency matrix.
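A sketch of the multi-perspective weighted cosine metric with ε-neighborhood sparsification; the random embeddings and weight vectors below stand in for learned parameters:

```python
import numpy as np

def weighted_cosine_adjacency(H, W, eps):
    """m-perspective weighted cosine similarity (IDGL-style metric learner).

    H: node embeddings (n x d); W: m weight vectors (m x d), learned in practice.
    s^p_ij = cos(w_p * h_i, w_p * h_j); s_ij = mean over the m perspectives.
    """
    n = H.shape[0]
    S = np.zeros((n, n))
    for p in range(W.shape[0]):
        Hp = H * W[p]                                  # element-wise reweighting
        norms = np.linalg.norm(Hp, axis=1, keepdims=True)
        Hp = Hp / np.maximum(norms, 1e-12)
        S += Hp @ Hp.T                                 # cosine similarities for perspective p
    S /= W.shape[0]
    return np.where(S >= eps, S, 0.0)                  # epsilon-neighborhood sparsification

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))     # 5 molecules, 8-dim embeddings
W = rng.normal(size=(4, 8))     # 4 perspectives
A = weighted_cosine_adjacency(H, W, eps=0.1)
```

The resulting matrix is symmetric and non-negative, with entries below the threshold zeroed out.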
The node embeddings H_r and the adjacency matrix A are alternately refined T times. At the t-th iteration, A^(t) is calculated from the previously updated node embeddings H_r^(t−1) by Eq. (3). Then the learned graph structure A^(t) is used as a supplement to optimize the initial graph A^(0):

Ã^(t) = λA^(0) + (1 − λ)(ηA^(t) + (1 − η)A^(1)),    (4)

where A^(1) is the adjacency matrix learned from X_r at the 1st iteration, kept to preserve the initial node information, and λ and η are hyperparameters.
After learning the adjacency matrix, we employ an L-layer inter-molecule GNN to learn node embeddings. In the l-th layer, H_r^(t,l) is updated by

H_r^(t,l) = σ(Ã^(t) H_r^(t,l−1) W^(l)),    (5)

where σ is a non-linear activation function and W^(l) is the layer's weight matrix. H_r^(t) = H_r^(t,L) is the final node embedding matrix of this iteration, and H_r^(t,0) = X_r.
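The graph combination and one inter-molecule GNN layer might be sketched as follows; the row-normalized ReLU update is an assumption for illustration, since the exact layer form may differ:

```python
import numpy as np

def combine_adjacency(A0, At, A1, lam, eta):
    """Blend the initial graph A(0) with the learned graphs A(t) and A(1) (Eq. 4)."""
    return lam * A0 + (1 - lam) * (eta * At + (1 - eta) * A1)

def gnn_layer(A, H, W):
    """One inter-molecule GNN layer: row-normalized aggregation + ReLU (assumed form)."""
    deg = A.sum(axis=1, keepdims=True)
    A_norm = A / np.maximum(deg, 1e-12)
    return np.maximum(A_norm @ H @ W, 0.0)

rng = np.random.default_rng(1)
n, d = 4, 6
A0, At, A1 = (rng.random((n, n)) for _ in range(3))  # stand-in adjacency matrices
H = rng.normal(size=(n, d))                          # node embeddings
W = rng.normal(size=(d, d))                          # layer weights (learned in practice)
A_t = combine_adjacency(A0, At, A1, lam=0.8, eta=0.5)
H_next = gnn_layer(A_t, H, W)
```

Stacking L such layers starting from H = X_r yields the iteration's final embeddings H_r^(t).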

Loss Function
After T rounds of iteration, the node (molecule) embeddings H_r^(T) are the final molecular representations. Based on these, the prediction ŷ for a specific property is made with a fully connected layer (FC):

ŷ = FC(H_r^(T)).    (6)

The whole loss function used in our method consists of two parts: the label prediction loss and the GSL-specific loss. The label prediction loss L_pred is defined in a manner similar to existing methods:

L_pred = ℓ(ŷ, y),    (7)

where ŷ is the predicted value, y is the ground truth, and ℓ is the task loss function: the cross-entropy loss for classification tasks, and the mean squared error loss for regression tasks.
Since the quality of the learned inter-molecule graph structure is of great importance for our method, we further design a GSL-specific loss, hoping that the learned adjacency matrix does not contain wrong edges. We use S_train to denote the molecules in the training set and A^(T) to denote the final adjacency matrix after being refined T times. In classification tasks, there exists a ground truth for the matrix, A* (A*_ij = 1 if y_i = y_j, else 0), i.e., molecules with the same label should be connected by edges. Thus, we define the GSL-specific loss as

L_GSL = (1 / |S_train|^2) Σ_{i,j ∈ S_train} (A^(T)_ij − A*_ij)^2.    (8)

In regression tasks, however, the prediction for a molecule is a real value and no native ground truth exists, so we have to define one ourselves. For ease of calculation, we only consider molecular pairs whose predicted values differ beyond a certain threshold ε_y:

L_GSL = Σ_{i,j ∈ S_train, |ŷ_i − ŷ_j| > ε_y} A^(T)_ij.    (9)

The whole loss function combines both the task prediction loss and the GSL-specific loss: L = L_pred + L_GSL.
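A hedged sketch of the two GSL-specific losses as described above; the exact normalization used in the paper may differ:

```python
import numpy as np

def gsl_loss_classification(A_T, y):
    """Penalize learned edges that disagree with the label-agreement matrix A*."""
    A_star = (y[:, None] == y[None, :]).astype(float)  # A*_ij = 1 iff y_i == y_j
    return np.mean((A_T - A_star) ** 2)

def gsl_loss_regression(A_T, y_pred, eps_y):
    """Penalize edge weights between pairs whose predictions differ by more than eps_y."""
    diff = np.abs(y_pred[:, None] - y_pred[None, :])
    mask = diff > eps_y
    return (A_T * mask).sum() / max(mask.sum(), 1)

A_T = np.array([[0.0, 0.9, 0.1],
                [0.9, 0.0, 0.8],
                [0.1, 0.8, 0.0]])
loss_cls = gsl_loss_classification(A_T, np.array([1, 1, 0]))
loss_reg = gsl_loss_regression(A_T, np.array([0.2, 0.25, 1.0]), eps_y=0.5)
```

In the regression variant, the strong edge (0.8) between molecules with very different predictions is what drives the loss up.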

Algorithm
The algorithm of our method is presented in Algorithm 1. After obtaining the initial molecular embeddings and constructing the initial inter-molecule similarity graph MSG (corresponding to the adjacency matrix), T iterations of GSL are applied. During each iteration, the adjacency matrix is refined based on the node embeddings obtained in the last iteration, while the node embeddings are updated based on the adjacency matrix obtained in the last iteration.
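The alternating procedure described above can be sketched as a minimal loop; the metric and GNN below are stand-in callables, not the trained components:

```python
import numpy as np

def gsl_mpp_iterations(X_r, A0, metric_fn, gnn_fn, T, lam, eta):
    """Sketch of the alternating GSL loop: refine A, then refine H, T times."""
    H = X_r
    A1 = None
    for t in range(1, T + 1):
        A_t = metric_fn(H)                 # Eq. (3): learn adjacency from embeddings
        if t == 1:
            A1 = A_t                       # keep the first learned graph
        A_comb = lam * A0 + (1 - lam) * (eta * A_t + (1 - eta) * A1)  # Eq. (4)
        H = gnn_fn(A_comb, X_r)            # inter-molecule GNN restarts from X_r
    return H, A_comb

# Minimal stand-ins for the learned components.
metric = lambda H: np.clip(H @ H.T, 0, None)   # non-negative similarity
gnn = lambda A, X: np.tanh(A @ X)              # one toy propagation step
rng = np.random.default_rng(2)
X = rng.normal(size=(4, 3))
A0 = rng.random((4, 4))
H_final, A_final = gsl_mpp_iterations(X, A0, metric, gnn, T=3, lam=0.8, eta=0.5)
```

Each pass both refines the graph from the current embeddings and refines the embeddings on the refined graph, as the paper describes.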
Algorithm 1: The GSL-MPP algorithm
1: Obtain the initial molecular embedding h_g,i for each molecule m_i by a graph-based molecular encoder (an intra-molecule GNN);
2: X_r ← embedding matrix of all h_g,i;
3: H_r^(0) ← X_r;
4: Construct an initial molecule similarity matrix A^(0) using molecular fingerprint similarity;
5: for t = 1 to T do
6:   Use GSL to learn a refined adjacency matrix A^(t) from H_r^(t−1) by Eq. (3);
7:   Combine the initial and refined adjacency matrices A^(0), A^(t), and A^(1) to obtain Ã^(t) by Eq. (4);
8:   Update the node embeddings H_r^(t) with the inter-molecule GNN on Ã^(t);
9:   Calculate L_pred by Eq. (7) and L_GSL by Eq. (8);
10: end for

Experiments
Datasets. We use 7 benchmark datasets from MoleculeNet [31] for experiments, among which 4 are classification tasks and 3 are regression tasks. Specifically, BACE concerns the binding results of a set of inhibitors; BBBP is the blood-brain barrier penetration dataset; SIDER and ClinTox are two multi-task datasets corresponding to side effects and toxicity, respectively; ESOL, Lipophilicity, and FreeSolv are regression datasets of physical chemistry properties. The scaffold splitting of [34] is adopted to split each dataset into training, validation, and test sets with a 0.8/0.1/0.1 ratio, which is more realistic and challenging than random splitting. Following previous works [11,19], we report results over three independent runs on three random-seeded scaffold splits for each dataset.
Baselines. We compare our method against 12 baselines. TF_Robust [17] is a DNN-based multitask framework that takes molecular fingerprints as input. GCN (GraphConv) [3], Weave [9], and SchNet [21] are three graph convolutional models. MPNN [5] and its variants MGCN [10], DMPNN [34], and CMPNN [22] consider edge features during message passing. AttentiveFP [32] is an extension of the graph attention network. GROVER [19] and CoMPT [1] are two Transformer-based models; GROVER is compared without the pretraining process for a fair comparison (its original paper also does not report standard deviations). CoMPT utilizes both node and edge information in the message passing process, while CD-MVGNN [11] constructs two separate views for atoms and bonds.
Evaluation metrics. Following the evaluation criteria adopted by the baseline models, we use ROC-AUC to evaluate performance on classification tasks and RMSE on regression tasks.
Implementation details. Our model applies a polynomial decay scheduler to the learning rate, with two linear warm-up epochs and polynomial decay afterward. The power of the polynomial decay is set to 1, i.e., linear decay. The final learning rate is 1e-9 and the maximum number of epochs is 300.
For the proposed model, we try different hyperparameter combinations on each dataset and take the hyperparameter set with the best result. When building the initial inter-molecule graph, the radius of the ECFP is 2. The threshold of the GSL-specific loss for regression tasks (ε_y) is 0.01. More details of the hyperparameter settings are presented in Table 1. Notably, on the ClinTox dataset, our model obtains up to a 7.8% performance improvement, which also reflects the effectiveness of our molecule similarity graph construction and graph structure learning.

Ablation Study
To investigate the contribution of each component of our model, an ablation study is conducted. We consider four variant models for comparison:
• Not any: directly use H_r for prediction; this is almost a plain GIN network.
• Only A^(0): apply the GNN on the initial molecular similarity graph A^(0) constructed by ECFP similarity, without GSL.
• Only GSL: use de novo GSL without the initial graph reference A^(0).
• No GSL-Loss: use A^(0) and GSL, but apply only the prediction loss.
Ablation results are given in Table 3. Here we mainly consider the contributions of the initial adjacency matrix A^(0) constructed from ECFP fingerprints and of the GSL process. The results of "Not any", "Only A^(0)", and "No GSL-Loss" confirm that using A^(0) improves the performance of our model, and the improvement is much more significant when combined with GSL. Besides, it is interesting to note that "Only GSL" often performs worse than "Not any", which suggests that learning an inter-molecule graph from scratch is difficult and that utilizing the chemical information of molecular fingerprints to build an initial graph is necessary. Finally, comparing "No GSL-Loss" with the complete model shows that the GSL-specific loss does make a difference for our method.
We also conduct experiments with different values of some important hyperparameters on all the datasets. Table 4 reports the results of applying different values of λ, which balances the learned graph structure and the initial graph structure. A large λ value (0.8 or 0.9) yields relatively good results on most datasets, which indicates the importance of the initial inter-molecule graph. Besides, Table 5 shows the impact of the number of iterations T on performance. As T increases from 1 to 5, performance on most datasets does not improve continuously, which means that the best T is data-dependent.

Visualization of Molecular Representations
To examine the molecular representation learning ability of our model, we apply t-distributed Stochastic Neighbor Embedding (t-SNE) with default hyperparameters to visualize the final molecular representations on four datasets (Fig. 2). We can see that molecules of different labels have a clear boundary on both classification datasets, especially BBBP: molecules of the same label tend to cluster together, while molecules of different labels are located apart. There also seems to be a distribution pattern among molecules of different property values on the two regression datasets. For the FreeSolv dataset, molecules tend to move from the outer region to the inner region as the property value decreases; for the ESOL dataset, they tend to move from the upper left to the lower right as the property value decreases. These results indicate that our model generates reasonable molecular representations for downstream tasks.

Scaling Our Model to Larger Datasets
During GSL, the similarity metric function calculates similarity scores for all pairs of graph nodes, which requires O(n^2) complexity. We therefore need to address the scalability issue when datasets become larger. Following IDGL [2], we apply an anchor-based method. During each iteration, we learn a node-anchor similarity matrix R ∈ R^{n×s} instead of the original complete adjacency matrix A ∈ R^{n×n}, where s is the number of anchor nodes, a hyperparameter that can be set according to the dataset. By using R instead of A, the time and space complexity is reduced from O(n^2) to O(ns). Accordingly, Eq. (3) can be rewritten as

s^p_ik = cos(w_p ⊙ h_i, w_p ⊙ u_k),   s_ik = (1/m) Σ_{p=1}^{m} s^p_ik,    (10)

where s_ik is the similarity score between node v_i and anchor u_k. The message passing procedure should also be changed accordingly. The node-anchor similarity matrix R allows only direct connections between nodes and anchors; we call a direct transition between a node and an anchor a one-step transition described by R. Based on the theory of stationary Markov random walks, we can recover A from R by calculating the two-step transition probabilities.
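A sketch of the anchor-based approximation, using a single-perspective cosine similarity for brevity; the two-step recovery normalization below is one plausible reading of the random-walk view, not necessarily IDGL's exact formula:

```python
import numpy as np

def anchor_adjacency(H, anchors):
    """Node-anchor similarity R (n x s) plus a recovered full adjacency.

    Learn similarities only between n nodes and s anchors (O(ns)), then
    approximate the n x n graph via two-step transitions node -> anchor -> node.
    """
    Hn = H / np.maximum(np.linalg.norm(H, axis=1, keepdims=True), 1e-12)
    Un = anchors / np.maximum(np.linalg.norm(anchors, axis=1, keepdims=True), 1e-12)
    R = np.clip(Hn @ Un.T, 0.0, None)          # cosine similarity, non-negative
    delta = np.maximum(R.sum(axis=0), 1e-12)   # anchor "degrees"
    A_approx = (R / delta) @ R.T               # R @ diag(1/delta) @ R^T, symmetric
    return R, A_approx

rng = np.random.default_rng(3)
H = rng.normal(size=(100, 16))                    # 100 molecule embeddings
U = H[rng.choice(100, size=10, replace=False)]    # 10 anchors sampled from the nodes
R, A = anchor_adjacency(H, U)
```

Only R (100×10 here) is materialized during learning; the dense A is never needed explicitly unless one wants to inspect the recovered graph.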
Using the above anchor-based GSL, we first evaluate whether introducing anchor nodes has a significant impact on the prediction performance of our model. Results are given in Table 6.
We find that anchor-based GSL performs slightly worse than the original GSL on these molecular datasets, but the degradation is not significant. We therefore consider it appropriate to apply anchor-based GSL to larger-scale molecular datasets.
After this evaluation, we test the anchor-based GSL method on the HIV dataset, which includes over 40,000 molecules, and compare it with several existing models. Except for CD-MVGNN, the results of the other models on the HIV dataset are taken from PharmHGT [7], a recently proposed Transformer-based model that treats molecules as heterogeneous graphs. The ROC-AUC of CD-MVGNN on the HIV dataset was obtained by our own experiments.
Results are given in Table 7. Our method achieves the best ROC-AUC on the HIV dataset, showing that with anchor nodes introduced, our method extends well to larger-scale datasets and achieves satisfactory results.

Conclusion

Figure 1 :
Figure 1: The workflow of GSL-MPP. In the initial molecular similarity graph (MSG), each node is a molecule initially represented by the GNN embedding, and each edge is attached with the fingerprint similarity between the two corresponding molecules. GSL is then performed on the MSG.

Figure 2 :
Figure 2: Visualization of molecular representations for 4 datasets: (a) BBBP, (b) BACE, (c) FreeSolv, and (d) ESOL. For the classification datasets BACE and BBBP, molecules of label 1 are colored red and molecules of label 0 are colored blue. For the regression datasets FreeSolv and ESOL, the colors of the points change from red to blue as the property value increases.

Table 2 :
Performance comparison between our model and baselines.Mean and standard deviation of AUC or RMSE values are reported.

Table 3 :
Ablation study on four variants of our model.

Table 2 presents the results of our model and the baselines on all datasets. Boldfaced values are the best results, and underlined values are the 2nd best results. From Table 2 we have the following observations: (1) Compared to the SOTA model CD-MVGNN, our model performs better on 3 of the 4 classification datasets, with a 2.4% AUC lift on BBBP. Since our model is based on a simple GIN without the complicated message passing procedures used in CD-MVGNN and CoMPT, this result indicates the effectiveness of the inter-molecule graph for prediction tasks. (2) Our model performs relatively poorly on regression tasks compared to the SOTA models, which may be explained by the lack of real ground truth for relationship graphs in regression tasks. However, our model still achieves the 2nd best results on 2 of the 3 regression datasets. (3) Although our model uses GIN for intra-molecule graphs, it outperforms GIN on 7 of the 8 datasets.

Table 4 :
Results for different λ values on different datasets.

Table 5 :
Results for different T values on different datasets.

Table 6 :
Performance comparison between original and anchor-based GSL.

Table 7 :
Performance comparison between our model (using anchor-based GSL) and baselines.
In this paper, we propose a new model based on two-level molecular representation for molecular property prediction. Unlike previous attempts that focus exclusively on message passing between atoms or bonds within individual molecular graphs, we further make use of the inter-molecule graph. Concretely, we utilize the chemical information of molecular fingerprints to construct an initial molecular similarity graph and employ graph structure learning to refine it; the molecular embeddings produced by GSL on this inter-molecular similarity graph are used for MPP. Extensive experiments show that our model achieves state-of-the-art performance in most cases, especially on classification tasks. Ablation studies also validate the major components of the model. However, there is still room to improve our model in the following directions: (1) using more sophisticated graph-based models than GIN to encode molecular graphs; (2) designing new metrics other than weighted cosine similarity for graph structure learning; and (3) exploring new and more effective GSL methods.