Molecular property prediction by semantic-invariant contrastive learning

Abstract

Motivation: Contrastive learning has been widely used to construct pretext tasks for self-supervised pre-trained molecular representation learning models in AI-aided drug design and discovery. However, existing methods that generate molecular views by noise-adding operations may suffer from the semantic inconsistency problem, which leads to false positive pairs and consequently poor prediction performance.

Results: To address this problem, we first propose a semantic-invariant view generation method that properly breaks molecular graphs into fragment pairs. We then develop a Fragment-based Semantic-Invariant Contrastive Learning (FraSICL) model based on this view generation method for molecular property prediction. The FraSICL model consists of two branches that generate view representations for contrastive learning; a multi-view fusion mechanism and an auxiliary similarity loss are introduced to make better use of the information contained in different fragment-pair views. Extensive experiments on various benchmark datasets show that, with the smallest number of pre-training samples, FraSICL achieves state-of-the-art performance compared with major existing counterpart models.

Availability and implementation: The code is publicly available at https://github.com/ZiqiaoZhang/FraSICL.


Introduction
Molecular property prediction (MPP) based on deep learning techniques has become a hot research topic in the AI-aided Drug Discovery (AIDD) community [1,2,3,4,5,6,7]. Since most molecular properties of interest in drug discovery require in vivo or in vitro wet-lab experiments to measure, labeled data for MPP tasks are typically scarce, because acquiring such data is expensive and time-consuming [8]. In contrast, large amounts of publicly available unlabeled data exist [9,10,11]. Therefore, how to use these large-scale unlabeled molecular data to train deep neural networks to learn better molecular representations for MPP tasks is of great interest to the AIDD community.
Recently, as self-supervised pre-trained models (e.g. BERT [12], MoCo [13] and SimCLR [14]) have shown significant superiority in the fields of Natural Language Processing (NLP) and Computer Vision (CV), self-supervised learning (SSL) has become a mainstream method of utilizing large-scale unlabeled molecular data in MPP studies. These SSL methods typically use inherent features within or between samples to construct pretext tasks, so that unlabeled data can be leveraged to train deep models in a self-supervised manner [15]. Contrastive learning, masked language modeling and predictive learning are currently the three main categories of pretext-task design in MPP studies [16,17,18,19,20,21,22,23]. Inspired by SimCLR, contrastive learning methods aim at learning representations by contrasting positive data pairs against negative ones [16]. Original molecular structures are augmented into multiple views; views generated from the same molecule are typically used as positive pairs, while views of different molecules are taken as negative ones [16].
The way molecular views are generated is crucial to the design of contrastive learning pretext tasks for molecular representation learning. Molecules are special objects that can be represented in different ways, including molecular fingerprints [24], SMILES [25], IUPAC names [26], and molecular graphs. These different representations can naturally be leveraged to generate views for contrastive learning; the DMP [17] and MEMO [18] models are designed in this way. Following the practice in CV, another widely used category of methods adds noise to molecular structures to generate transformations of the original molecules. These noise-adding operations include deleting atoms, replacing atoms, deleting bonds, deleting subgraph structures, etc. MolCLR [16] and GraphLoG [21] are representative models of this kind.
Although noise-adding methods for view generation have been widely used in CV studies [14,27], when applying them to MPP tasks, a fact that has been overlooked is that molecules are very sensitive to noise. If the topological structure of a molecule is arbitrarily modified with noise, the generated structure may represent a totally different molecule. For instance, as shown in Fig. 1(a), adding noise to a dog image by randomly masking some area does not change the semantics of the generated view, which still depicts a yellow dog. However, in Fig. 1(b), for an acetophenone molecule, deleting a subgraph leads to a benzene molecule, so acetophenone's chemical semantics are completely changed. Moreover, small modifications to molecular structure can cause dramatic changes in the properties of the modified molecules, including both bio-activity and other physicochemical properties. Concretely, the PubChem database reports that the LogP value of acetophenone in Fig. 1(b) is 1.58, while that of benzene is 2.13, a difference of almost 35%. Therefore, it is unreasonable to treat these two views (molecules) as a positive pair for contrastive learning. To solve this semantic inconsistency problem, this paper proposes a Fragment-based Semantic-Invariant Contrastive Learning molecular representation model, named FraSICL. A semantic-invariant molecular view generation method is developed, in which a molecular graph is properly broken into fragments by changing the message passing topology while preserving the topological information. A multi-view fusion mechanism is introduced to FraSICL to make better use of the information contained in views of different fragments and to avoid the impact of randomness. In addition, an auxiliary similarity loss is designed to train the backbone Graph Neural Network (GNN) to generate better representation vectors.
Our contributions are summarized as follows:

• We raise the semantic inconsistency problem in molecular view construction for molecular contrastive learning and develop an effective method to generate semantic-invariant graph views by changing the message passing topology while preserving the topological information.

• We propose a novel Fragment-based Semantic-Invariant Contrastive Learning molecular representation model for effective molecular property prediction, which is equipped with a multi-view fusion mechanism and an auxiliary similarity loss to better leverage the information contained in unlabeled pre-training data.

• Extensive experiments show that, compared with SOTA pre-trained molecular property prediction models, the proposed FraSICL achieves better prediction accuracy on downstream target tasks with a smaller amount of unlabeled pre-training data.

Method
Here, we first formally define the semantic-invariant molecular view in Sec. 2.1, then propose a semantic-invariant molecular view generation method in Sec. 2.2 and a multi-view fusion scheme in Sec. 2.3. Finally, we present the structure of the Fragment-based Semantic-Invariant Contrastive Learning (FraSICL) molecular representation model in Sec. 2.4 and its loss functions in Sec. 2.5.

Semantic-invariant Molecular View
In Sec. 1, we gave an example illustrating how noise-adding operations may lead to semantic inconsistency and consequently false positive pairs. Here, we formally define the semantic-invariant molecular view.
Given a molecule m and its (hydrogen-depleted) molecular graph G = {V, E, X_atom, X_bond}, V denotes the set of nodes representing the atoms, E denotes the set of edges between nodes, representing the bonds, and X_atom and X_bond are the feature matrices of atoms and bonds, respectively.
Both semantic-conflict views and semantic-ambiguity views lead to false positive pairs in molecular representation contrastive learning. For example, assume that a Graph Neural Network g(·) serves as an encoder that embeds molecular graphs into latent graph embeddings h_G = g(G). Ignoring the randomness in the encoder, it is obvious that if molecule m has a semantic-conflict view w.r.t. molecule m_2, i.e., G' = F(G) = G_2, then the representation of G' embedded by the graph neural network will be the same as that of G_2, i.e., h_{G'} = g(G') = h_{G_2} = g(G_2). In this case, since h_G and h_{G'} are considered a positive pair in contrastive learning, h_G and h_{G_2} are consequently used as a positive pair. In other words, the contrastive loss will implicitly push the representations of molecules m and m_2 close together. However, as claimed before, the properties of different molecules may differ greatly, so they cannot be used as a positive pair for contrastive learning. Therefore, semantic-conflict views lead to false positive pairs and degrade learning performance.
On the other hand, if a semantic-ambiguity view is generated as defined in Def. 2, i.e., F(G_2) = G' = F(G), the contrastive loss will make h_G and h_{G'}, as well as h_{G_2} and h_{G'}, close in the embedding space, and thus h_G and h_{G_2} close too. That is, the representations of molecules m and m_2 are again pulled together by contrastive learning, so semantic-ambiguity views also lead to false positive pairs.
To boost the performance of contrastive learning for MPP, we should avoid generating both semantic-conflict views and semantic-ambiguity views. That is, we generate only semantic-invariant views, which are defined as follows:

Definition 3 (Semantic-invariant view): Given a view G' of molecule m with graph G, if G' is neither a semantic-conflict view nor a semantic-ambiguity view w.r.t. any other molecule, then we say G' is a semantic-invariant view of m.
In the next section, we give a method to generate semantic-invariant views.

Semantic-invariant View Generation
According to Def. 3 in Sec. 2.1, semantic-invariant views should be neither semantic-conflict views nor semantic-ambiguity views. Besides, from the perspective of prediction, they should also be discriminative, i.e., they should be encoded into different representations by neural network encoders. In this section, we propose a semantic-invariant view generation method that achieves these goals. In our previous study [6], to better capture the hierarchical structural information of molecules, a chemically interpretable molecule fragmentation method, FraGAT, was proposed. Considering acyclic single bonds as boundaries between functional groups, the FraGAT model randomly breaks one acyclic single bond to generate two graph fragments corresponding to chemically meaningful functional groups. The experimental results show that learning representations from chemically meaningful molecular graph fragments achieves good predictive performance on MPP tasks. Inspired by these findings, our semantic-invariant view generation method is designed as follows. Given a molecule m, its molecular graph is denoted as an annotated graph G = {V, E, X_atom, X_bond}, where the atom feature matrix X_atom and the bond feature matrix X_bond are computed according to Tab. 1. Then, removing one acyclic single bond e_ij from E, we obtain G' = {V, E', X_atom, X_bond}, where E' = E − {e_ij}. We accept G' as a generated view, i.e., a semantic-invariant view. Since the graph G' consists of two disconnected molecular graph fragments, it is also called a fragment-pair view.
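The view generation step above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the graph encoding (edge sets of node pairs, a per-node bond count standing in for the full atom feature vector) is an assumption made for brevity.

```python
# Minimal sketch of fragment-pair view generation: delete one acyclic
# single bond from the edge set while leaving node features untouched.

def make_view(num_nodes, edges, bond_counts, broken_bond):
    """Return a fragment-pair view: same node features, one edge removed.

    edges        -- set of frozensets {i, j}, the molecular bonds
    bond_counts  -- bond count per node, kept inside the node features
    broken_bond  -- the acyclic single bond (i, j) to delete
    """
    assert frozenset(broken_bond) in edges
    new_edges = edges - {frozenset(broken_bond)}
    # Node features (including bond counts) are deliberately NOT updated,
    # so the view still encodes the original topology.
    return num_nodes, new_edges, bond_counts

# Acetophenone-like toy graph: a ring (nodes 0..5) plus a chain 5-6-7.
edges = {frozenset(e) for e in
         [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (5, 6), (6, 7)]}
bond_counts = [2, 2, 2, 2, 2, 3, 2, 1]

# Breaking the acyclic single bond (5, 6) yields two disconnected
# fragments, while every node keeps its original bond count.
_, view_edges, view_counts = make_view(8, edges, bond_counts, (5, 6))
```

Note that after the break, node 5 has topological degree 2 but its features still record 3 bonds; this deliberate mismatch is what the semantic-invariance argument in the following paragraphs relies on.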
From the discrimination perspective, G' is a different graph from the original molecular graph G, so a GNN encoder will generate a different representation for it. Furthermore, as each acyclic single bond in a molecule is unique, breaking different acyclic single bonds leads to different fragment-pair views, whose representations after a GNN encoder will also differ. That is, the generated views of a molecule are discriminative.
Then, is G' really a semantic-invariant view of molecule m according to Def. 3? Let us check.
On the one hand, since the atom feature matrix X_atom is not modified, the bond counts of the two vertices i and j of the broken bond e_ij, as encoded in the atom feature vectors of the generated G', remain the same as in the original molecular graph G. However, the modified edge set E' contains no edge between i and j, so the bond counts encoded in the atom feature vectors are inconsistent with the topological structure: the degrees of nodes i and j in G' are lower than the bond counts of i and j encoded in X_atom. In other words, G' is not a valid molecular graph of any molecule. Furthermore, from the graph perspective, G' consists of two disconnected subgraphs, and no valid molecule corresponds to a disconnected graph. Therefore, G' cannot be a semantic-conflict view w.r.t. any molecule according to Def. 1.
On the other hand, since only one single bond is removed in view G', this discrepancy can only be found at nodes i and j: the bond counts of i and j encoded in X_atom are exactly 1 larger than their degrees, so the removed edge can only be the single bond between i and j. Thus, there is no other molecular graph G_2 that can generate the same G'. Similarly, from the graph perspective, the graph G of molecule m cannot equal the disconnected graph of any view generated from any other molecule. In summary, G' cannot be a semantic-ambiguity view w.r.t. any molecule according to Def. 2.
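The uniqueness argument above can be checked mechanically: comparing each node's topological degree against the bond count stored in its features exposes exactly the two endpoints of the deleted bond, so the original graph is recoverable from the view. This is an illustrative sketch reusing a toy edge-set/bond-count encoding (an assumption, not the paper's data structures).

```python
# Recover the broken bond of a fragment-pair view by comparing encoded
# bond counts with topological degrees.

def recover_broken_bond(num_nodes, edges, bond_counts):
    degree = [0] * num_nodes
    for e in edges:
        for node in e:
            degree[node] += 1
    # Nodes whose encoded bond count exceeds their current degree by 1
    # must be the endpoints of the single deleted bond.
    endpoints = [v for v in range(num_nodes) if bond_counts[v] - degree[v] == 1]
    assert len(endpoints) == 2, "a valid view has exactly one broken bond"
    return tuple(endpoints)

# Toy view: a ring 0..5 plus a chain 5-6-7 with the bond (5, 6) deleted.
view_edges = {frozenset(e) for e in
              [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (6, 7)]}
bond_counts = [2, 2, 2, 2, 2, 3, 2, 1]
```

Because the deleted bond is uniquely identifiable, two different molecules can never produce the same view, which is precisely why no semantic-ambiguity view can arise.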
Finally, from the perspective of graph rewiring [28,29,30], the topological information about the broken edge is still encoded in the atom feature matrix X_atom. Our method therefore preserves the topological structural information of the original molecular graph while propagating messages between nodes through a different topology, realizing a local decoupling of the input graph topology and the message passing topology. Moreover, compared with randomly breaking arbitrary edges in the molecular graph, our method generates chemically meaningful graph fragments that benefit the prediction of molecular properties, as demonstrated in the experiments of previous work [6].
In conclusion, our method is expected to generate better positive pairs, which helps train neural networks to produce better molecular representations by contrastive learning.

Multi-view Fusion
An organic molecule often has many acyclic single bonds, so various fragment-pair views can be generated from one molecule by the proposed view generation method. As demonstrated in the experiments of existing work [6], different fragment pairs contain different information about the functional groups that constitute a molecule, and show different predictive performance. To ensure that the neural network can access the information of the property-determining functional groups contained in the fragment pairs, FraSICL does not randomly generate fragment-pair views as other contrastive learning models do. Instead, a multi-view fusion mechanism is introduced as follows. Given a molecule with N_b breakable acyclic single bonds, all N_b fragment-pair views are generated, and their representations are computed by a GNN encoder. Then, a Transformer encoder fuses these representations via the multi-head attention (MHA) mechanism to produce a representation vector that contains information from all fragment pairs, called the fragment view. The details of fragment-pair view fusion are deferred to the next section. The fragment view and the molecule view (i.e., the original molecular graph) are used as the two views of a molecule for contrastive learning.
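The fusion step can be sketched without any deep learning framework. Note this is a simplification: the actual model uses a Transformer encoder with multi-head attention, whereas the sketch below substitutes a single-head scaled dot-product attention to show how the N_b fragment-pair representations are mixed and then summed into one fragment-view vector.

```python
# Dependency-free sketch of attention-based multi-view fusion.
import math

def attention_fuse(reps):
    """reps: list of N_b representation vectors (lists of floats)."""
    n, d = len(reps), len(reps[0])
    fused = [0.0] * d
    for i in range(n):
        # Scaled dot-product scores of view i against every view.
        scores = [sum(a * b for a, b in zip(reps[i], reps[j])) / math.sqrt(d)
                  for j in range(n)]
        m = max(scores)                      # subtract max for stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        w = [x / z for x in w]               # softmax attention weights
        # Attention-weighted mixture of all views (one fused token per view).
        mixed = [sum(w[j] * reps[j][k] for j in range(n)) for k in range(d)]
        # The fragment view is the sum over all fused tokens.
        fused = [f + x for f, x in zip(fused, mixed)]
    return fused
```

Because every fragment-pair representation attends to every other one, information from all N_b views contributes to the final fragment-view vector, which is the point of fusing rather than sampling a single random view.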

Model Structure
The structure of the FraSICL model is shown in Fig. 2. Given a molecule m with molecular graph G_mol = {V, E, X_atom, X_bond}, the model computes the representations of the two views via two branches: the left branch is the molecule view branch, which generates the molecule view, and the right one is the fragment view branch, which generates the fragment view.
In the molecule view branch, a GNN g_mol(·) is used as an encoder to compute the representation of the molecular graph, h_mol = g_mol(G_mol). Attentive FP [5] is employed as the graph encoder in this work. Then, following the structure of MolCLR [16], h_mol is fed to a projection head l_mol(·) and a normalization function norm(·) to produce the projection of the molecule view, p_mol = norm(l_mol(h_mol)). The structure of a projection head is shown in Fig. 3, and the normalization function is norm(v) = v / ||v||, which makes the projection vector have unit length. In the fragment view branch, all N_b breakable acyclic single bonds are enumerated and broken by the method proposed in Sec. 2.2 to generate N_b fragment-pair views. Then, a GNN g_frag(·) is used as an encoder to compute the representation of each fragment-pair view, h^i_frag = g_frag(G^i_frag); Attentive FP is also used here. Note that since each fragment-pair view G^i_frag contains two disconnected components, g_frag(·) reads out the two subgraphs separately and produces two subgraph embeddings. The representation of a fragment-pair view is obtained by element-wise addition of its two subgraph embeddings, which guarantees permutation invariance.
Then, as described in Sec. 2.3, a multi-view fusion mechanism is introduced to leverage all information related to the functional groups contained in the N_b fragment-pair views. Specifically, a Transformer encoder T(·) takes the fragment-pair representations h^i_frag as input tokens and computes the interaction relationships between the fragment-pair views by the multi-head attention (MHA) mechanism. The resulting attention scores serve as weights to fuse the representations into ĥ^i_frag. Summing all N_b fused representations gives the representation of the fragment view, h_fv = Σ_{i=1}^{N_b} ĥ^i_frag. Finally, h_fv goes through a projection head l_fv(·) and a normalization layer to obtain p_fv = norm(l_fv(h_fv)). Following the structure of MolCLR, the two projections p_mol and p_fv are used to calculate the contrastive loss. When fine-tuning on downstream tasks, the model outputs one of the representations h_mol or h_fv of a molecule as the learned molecular representation, and a downstream prediction head f(·) takes it as input to predict the molecular property: y = f(h_mol) or y = f(h_fv).
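Two small details of the branch structure are easy to make concrete: the unit-length normalization norm(v) = v / ||v|| applied after each projection head, and the permutation-invariant readout that merges the two subgraph embeddings of a fragment-pair view by element-wise addition. A minimal sketch (illustrative only, using plain lists instead of tensors):

```python
import math

def norm(v):
    """norm(v) = v / ||v||: scale a projection vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def fragment_pair_readout(sub_a, sub_b):
    """Merge the two subgraph embeddings of a fragment-pair view.

    Element-wise addition is symmetric, so the readout does not depend
    on the order in which the two fragments are presented.
    """
    return [x + y for x, y in zip(sub_a, sub_b)]
```

The symmetry of the readout is exactly the permutation-invariance property mentioned above: swapping the two fragments of a pair yields the same representation.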

In addition, the representations h^i_frag of the N_b fragment-pair views of a molecule go through another projection head and a normalization layer to produce projections p^i_frag = norm(l_frag(h^i_frag)). The inner products of these projections are computed to form a similarity matrix S = {s_ij | s_ij = <p^i_frag, p^j_frag>}, S ∈ R^{N_b×N_b}, where <·,·> denotes the inner product of two vectors. The usage of this similarity matrix S is introduced in the next section.
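Since the projections are normalized, each entry of S is an inner product of unit vectors, i.e. a cosine similarity in [-1, 1], and the diagonal is always 1. A small sketch with toy vectors (not the paper's code):

```python
import math

def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def similarity_matrix(projs):
    """projs: list of unit-length projection vectors p^i_frag.

    Returns S with s_ij = <p^i_frag, p^j_frag>.
    """
    return [[sum(a * b for a, b in zip(p, q)) for q in projs] for p in projs]

# Two toy fragment-pair projections; off-diagonal entries are cosine
# similarities between different views of the same molecule.
projs = [normalize([3.0, 4.0]), normalize([4.0, 3.0])]
S = similarity_matrix(projs)
```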

Projection Head
Figure 3: The structure of a projection head. Following the design proposed in BYOL [27], a projection head consists of a stack of a linear layer, a BatchNorm layer, an activation (ReLU) layer, a second linear layer and a dropout layer.

Loss Functions
The training of FraSICL in the pre-training phase is illustrated in Fig. 4. Given a batch of N molecules, the model calculates the projections p_mol and p_fv of each molecule. Then, contrastive learning is performed among all samples in the batch with the NT-Xent loss, where the inner product is adopted as the similarity sim(p^i_mol, p^i_fv) and τ is a temperature parameter. The sum of the contrastive losses over a batch of molecules is denoted as L_con. In addition, although the representations of different fragment-pair views have been fused, from the perspective of contrastive learning, the representations of different fragment-pair views of the same molecule should also be as close as possible. As demonstrated in a previous study [6], the representations of some fragment pairs of a molecule are highly predictive on downstream tasks, while others are less effective. We therefore hope that the representations of fragment-pair views can use information from each other to train the GNN encoder to extract better representations. To this end, an additional auxiliary loss L_sim is introduced to improve the similarity between the representations of the fragment-pair views of a molecule, based on the similarity matrix S. Since the inner product of two normalized vectors is equivalent to cosine similarity, whose maximum value is 1, assuming a molecule k has N^k_b fragment-pair views and the elements of its similarity matrix are s^k_ij = <p^{k,i}_frag, p^{k,j}_frag>, the auxiliary similarity loss of molecule k is L^k_sim = Σ_{i=1}^{N^k_b} Σ_{j=1}^{N^k_b} (s^k_ij − 1)^2, i.e., as shown in Fig. 4, the sum of the L2 losses between each element of the similarity matrix S and the corresponding element of an all-one matrix. Denoting the similarity loss of a batch of molecules as L_sim = Σ_{k=1}^{N} L^k_sim, the loss for pre-training the FraSICL model is L = L_con + γ L_sim, where γ is a hyper-parameter adjusting the influence of the auxiliary similarity loss.
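The two losses can be sketched in plain Python. The NT-Xent formulation below follows the standard SimCLR/MolCLR recipe (inner-product similarity with temperature τ over a batch of paired projections); the exact batching details are assumptions, not the paper's code.

```python
import math

def nt_xent(p_mol, p_fv, tau=0.1):
    """NT-Xent contrastive loss over a batch.

    p_mol, p_fv: lists of N unit projection vectors, one pair per molecule.
    """
    views = p_mol + p_fv                 # 2N projections in total
    n = len(p_mol)
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    loss = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)          # the other view of molecule i
        denom = sum(math.exp(dot(views[i], views[j]) / tau)
                    for j in range(2 * n) if j != i)
        loss += -math.log(math.exp(dot(views[i], views[pos]) / tau) / denom)
    return loss / (2 * n)

def aux_similarity_loss(S):
    """L^k_sim: squared distance between S and an all-one matrix."""
    return sum((s - 1.0) ** 2 for row in S for s in row)

def total_loss(l_con, l_sim, gamma=0.1):
    # gamma weights the auxiliary similarity loss, as in L = L_con + γ L_sim.
    return l_con + gamma * l_sim
```

With a single molecule whose two projections coincide, the contrastive loss vanishes, and a similarity matrix that is already all ones contributes no auxiliary loss, matching the intuition that both losses pull views of the same molecule together.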

Baseline experiments
Experimental setting. To construct the pre-training dataset, 200K molecules are randomly sampled from the pre-training dataset of MolCLR, in which 10 million molecules were gathered from the PubChem database [10]. The amount of pre-training data is smaller than that of the other baseline models, as shown in Tab. 2. 5% of the pre-training data are randomly selected as a validation set for model selection. 7 downstream tasks from MoleculeNet [31] are used as target tasks in the baseline experiments. Scaffold splitting is used on each downstream task, with an 8:1:1 ratio for the training/validation/test sets.
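The scaffold split keeps all molecules sharing a scaffold in the same partition, so the test set probes generalization to unseen scaffolds. A rough sketch, assuming scaffold keys (e.g. Bemis-Murcko scaffolds from RDKit) have already been computed for each molecule; the greedy largest-group-first assignment is common MoleculeNet practice, not necessarily the paper's exact procedure.

```python
from collections import defaultdict

def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split molecule indices into train/valid/test by scaffold groups.

    scaffolds -- one precomputed scaffold key per molecule.
    Whole groups are assigned, largest first, toward an 8:1:1 ratio,
    so no scaffold is shared across splits.
    """
    groups = defaultdict(list)
    for idx, scaf in enumerate(scaffolds):
        groups[scaf].append(idx)
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train += group
        elif len(valid) + len(group) <= frac_valid * n:
            valid += group
        else:
            test += group
    return train, valid, test
```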
When transferring a pre-trained FraSICL model to the target tasks, different strategies can be applied, including which branch of the model is used to produce molecular representations, and whether to fine-tune the pre-trained model (PTM) on the target tasks. In the baseline experiments, we adopt the more complex fragment view branch for molecular representations, and fine-tune the model together with the prediction head on the target tasks.

Compared baseline models.
Seven state-of-the-art self-supervised pre-training models for molecular representation learning are used as baselines for comparison: MolCLR [16], DMP [17], MEMO [18], GROVER [19], GraphLoG [21], PretrainGNNs [22] and KPGT [23]. The experimental results are shown in Tab. 3, where the results of the baseline models are cited from their original papers. The best score on each dataset is in bold, and the second-best is underlined.

Results and analysis. As shown in Tab. 3, FraSICL achieves the best predictive performance on 5 of the 7 downstream MPP tasks, and the second-best on another one. As FraSICL uses only 200K pre-training samples, the least among the compared models, the experimental results show that FraSICL makes better use of the information contained in the graph fragments of molecules to produce molecular representations with better predictive performance. Compared with the MEMO model, which uses the same amount of pre-training data, the predictive performance of FraSICL on the 7 downstream tasks is significantly improved, by more than 20% on the BBBP dataset. Compared with models such as GROVER and DMP-TF, FraSICL achieves comparable or even higher predictive performance with only about 1/50 of the pre-training samples. These results show the superiority of FraSICL over existing models on molecular property prediction tasks.

Conclusion
This paper focuses on the semantic inconsistency problem that may occur when noise-adding operations are used to generate views for contrastive learning in self-supervised molecular property prediction studies. To solve this problem, we first define the semantic-invariant molecular view by introducing two types of semantically inconsistent views that may lead to false positive pairs and consequently poor performance. Then, a semantic-invariant view generation method is proposed. The views generated by this method do not cause semantic inconsistency, and the method realizes the decoupling of the input graph topology and the message passing topology of GNNs. Thus, this method is expected to help GNN encoders extract better molecular representations.
Based on the semantic-invariant views, a Fragment-based Semantic-Invariant Contrastive Learning (FraSICL) molecular representation model is developed.FraSICL is an asymmetric model with two branches, the molecule view branch and the fragment view branch.A multi-view fusion mechanism is also introduced to make better use of the information contained in the views of different fragment pairs.Furthermore, an auxiliary similarity loss is designed to train the backbone GNN to produce better representations.
Baseline experiments are conducted on 7 target tasks, and the results show that FraSICL achieves state-of-the-art predictive performance with the least amount of pre-training data. Further experiments demonstrate that fine-tuning is effective in boosting performance and that the auxiliary similarity loss can improve predictive accuracy when a proper hyper-parameter γ is selected. These findings reveal that FraSICL makes better use of the information in pre-training samples and generates representations with superior predictive performance.

Figure 1: Illustration of the influence of noise-adding operations on the semantics of generated views. (a) After adding noise to the image by randomly masking some area, the semantics of the dog image do not change, i.e., the image still represents a dog. (b) By adding noise to the molecular structure by masking some atoms and edges, the molecule acetophenone becomes a completely different molecule, benzene, with different molecular properties. That is, the chemical semantics of acetophenone are completely changed.

Figure 2: The structure of the FraSICL model.

Figure 4: Illustration of FraSICL training. FraSICL is trained by both an NT-Xent contrastive loss and an auxiliary similarity loss. In the contrastive loss, the two projections of a molecule are treated as a positive pair, highlighted by red lines in the figure, while projections of other molecules in the batch are considered negative pairs, shown by blue lines. For each molecule, the L2 loss between the similarity matrix S and an all-one matrix is computed as the auxiliary similarity loss.
X_atom and X_bond are the feature matrices of atoms and bonds, respectively. A transformation function F(·) is used to generate a molecular graph view (or simply molecular view) G' of G, i.e., G' = F(G) and G' = {V', E', X'_atom, X'_bond}. In what follows, we first define two types of semantically inconsistent views.
Definition 1 (Semantic-conflict view): If there is another molecule m_2 whose molecular graph is G_2, and G_2 = G' = F(G), i.e., the view G' of m is the same as the molecular graph G_2 of m_2, then we say G' is a semantic-conflict view of m with regard to (w.r.t.) m_2.

Definition 2 (Semantic-ambiguity view): If there exists another molecule m_2 whose molecular graph is G_2, and F(G_2) = G' = F(G), i.e., the same view G' can be generated from both G and G_2, then we say G' is a semantic-ambiguity view of m w.r.t. m_2.

Table 1: Properties of atoms and bonds in X_atom and X_bond.

Table 2: The training details of the PTMs.