BatmanNet: bi-branch masked graph transformer autoencoder for molecular representation

Abstract

Although substantial efforts have been made using graph neural networks (GNNs) for artificial intelligence (AI)-driven drug discovery, effective molecular representation learning remains an open challenge, especially in the case of insufficient labeled molecules. Recent studies suggest that big GNN models pre-trained by self-supervised learning on unlabeled datasets enable better transfer performance in downstream molecular property prediction tasks. However, the approaches in these studies require multiple complex self-supervised tasks and large-scale datasets, which are time-consuming, computationally expensive and difficult to pre-train end-to-end. Here, we design a simple yet effective self-supervised strategy to simultaneously learn local and global information about molecules, and further propose a novel bi-branch masked graph transformer autoencoder (BatmanNet) to learn molecular representations. BatmanNet features two tailored complementary and asymmetric graph autoencoders to reconstruct the missing nodes and edges, respectively, from a masked molecular graph. With this design, BatmanNet can effectively capture the underlying structure and semantic information of molecules, thus improving the performance of molecular representation. BatmanNet achieves state-of-the-art results for multiple drug discovery tasks, including molecular property prediction, drug–drug interaction and drug–target interaction, on 13 benchmark datasets, demonstrating its great potential and superiority in molecular representation learning.

1 Introduction

AI-driven drug discovery (AIDD) has attracted increasing research attention. Many remarkable developments have been achieved for various tasks related to small molecules, e.g., molecular property prediction [1], drug-drug interaction [2], drug-target interaction prediction [3,4,5,6], and molecule design [7,8]. Effective molecular representation learning plays a crucial role in these downstream tasks. Recently, graph neural networks (GNNs) have exhibited promising potential in this emerging representation learning area, where the atoms and bonds of a molecule are treated as a graph's nodes and edges [9]. However, a few limitations remain, especially learning from insufficient labeled molecules, which hinder applications to real-world scenarios. Specifically, in the biochemistry domain, task-specific labeled data related to small molecules is very limited, as obtaining high-quality molecular property labels often requires time-consuming and resource-costly wet-lab experiments [10]. Training deep GNNs in a supervised manner on these limited datasets easily leads to overfitting [11].
To overcome these challenges, some recent studies suggested that pre-training a large neural network on unlabeled datasets with self-supervision enables better transfer performance in downstream molecular property prediction tasks. For example, [12,13] used the SMILES representation [14] of molecules to pre-train a sequence-based model with the masked language-modeling task. However, lacking an explicit topology representation, such methods cannot explicitly learn the molecular 2D topological structural information, instead focusing their learning on the grammar of molecular strings.
Recently, more research has focused on pre-training models directly from the 2D topology graphs. [15] proposed self-supervised tasks at the level of nodes and graphs to pre-train GNNs to learn local and global representations. [16] also proposed pre-training tasks at two different levels, i.e., contextual property prediction as the node/edge-level task and graph-level motif prediction, and designed a GNN-Transformer-style pre-training model, GROVER. Although the GROVER model achieved state-of-the-art performance on multiple downstream molecular property prediction tasks, we argue that molecular representation learning in this way is suboptimal. First, in order to predict the contextual property from a masked local subgraph, a contextual property dictionary has to be pre-defined in advance, which requires additional pre-processing of the raw molecular graph and introduces associated hyper-parameters to control the context radius manually. Thus, the pre-training task cannot be achieved efficiently in an end-to-end manner. Second, and more importantly, the best-performing GROVER model contains over 100 million parameters, and thoroughly training such a big model often requires a large-scale dataset (11 million unlabeled molecules were used for pre-training) and considerable computational resources, which is very time-consuming, computationally expensive, and environmentally unfriendly.
In this paper, we propose a novel molecular pre-training framework, BatmanNet, to significantly improve the effectiveness and efficiency of molecular representation learning. First, we propose a simple yet powerful self-supervised task, i.e., masking a high proportion (60%) of the nodes and edges in the molecular graph and reconstructing the missing part via a graph-based autoencoder framework, as illustrated in Figure 1. With this challenging self-supervised task, BatmanNet is encouraged to learn expressive structural and semantic information of molecules in an end-to-end fashion. Compared to GROVER [16], our method is much more elegant, scalable, and effective, directly operating on the finest granularity of atoms and bonds. Second, we propose a domain-tailored bi-branch asymmetric encoder-decoder architecture for molecular pre-training. Specifically, the encoder is a transformer-style architecture composed of multiple GNN-Attention blocks, in which a GNN and an attention layer extract the local and global information of molecular graphs, respectively. The encoder operates only on the subset of the molecular graph visible to the model (excluding the tokens that are masked). The decoder reconstructs the molecular graph from the learned representation together with mask tokens, and its architecture is similar to the encoder but much more lightweight. This asymmetric encoder-decoder architecture significantly reduces the amount of computation, shortens the overall pre-training time, and reduces memory consumption.
In summary, our contributions are as follows:
• We propose a novel self-supervised pre-training strategy for molecular representation learning, masking nodes and edges simultaneously with a high mask ratio (60%) and reconstructing them via an autoencoder architecture.
• We developed a bi-branch asymmetric graph-based autoencoder architecture, significantly enhancing the learning effectiveness and efficiency of the model and vastly reducing memory consumption.
• Extensive experiments show that our BatmanNet achieves state-of-the-art performance on multiple downstream tasks even though it is pre-trained with far fewer model parameters (2.575M) on a relatively smaller-scale dataset (250K molecules), compared to GROVER with 100M parameters pre-trained on 11M molecules.

Related Work
The existing molecular representation learning methods can be roughly classified into two categories. The first is the SMILES-based method. Inspired by the great success of BERT [17] in natural language processing (NLP), some works proposed to use language models to learn from the molecular sequence representation, SMILES, e.g., pre-training BERT-style models with masked language modeling [18,19,12,13] or using an autoencoder framework to reconstruct the SMILES representation [20,21,22]. However, SMILES itself has several limitations in representing small molecules. First, it is not designed to capture molecular similarity, e.g., two molecules with similar chemical structures might be translated into markedly different SMILES strings, which likely misleads language modeling relying on positional embeddings [7]. Second, some essential chemical properties of molecules, such as molecular validity, are not easily expressed by the SMILES representation, with the result that most text sequences over the SMILES character set do not correspond to valid molecules.
The second is the graph-based method, which can be further divided into two subgroups depending on the level of molecular information utilized. One group of methods pre-trains 2D GNN models from the molecular 2D topology, focusing on the adjacency of atoms [23,15,16,24,25]. The other group works on 3D geometry graphs with spatial positions of atoms by utilizing 3D GNN models [26]. Although graph-based methods explicitly take into account the molecular structural information, they usually require a large volume of molecule data for pre-training due to their complicated architectures, which may limit their generalization abilities when data is sparse.
Among the graph-based methods, GMAE [24], MGAE [25] and GraphMAE [27] are most relevant to our work. However, unlike GMAE, which masks nodes only, and MGAE, which masks edges only, our BatmanNet constructs a bi-branch complementary autoencoder whose dual branches perform node masking and edge masking, respectively, to enhance the expressiveness of the model. Compared with GraphMAE, which uses a descriptor to replace the masked nodes, our method directly removes the masked part, making the reconstruction task more difficult.

Preliminaries
In this section, we formally introduce GNNs and the attention mechanism in Transformer [28].

Graph Neural Networks (GNNs)
GNNs are a class of neural networks designed for graph-structured data, and they have been successfully applied in a broad range of domains. One of the key components of most GNNs is the message passing (also called neighborhood aggregation) mechanism between nodes in the graph, where the hidden representation $h_v$ of node $v$ is iteratively updated by aggregating the states of the neighboring nodes and edges. For a GNN with $K$ layers, i.e., repeating the message passing $K$ times, $v$'s hidden representation will contain the structural information of its $K$-hop neighborhood on the graph topology. Formally, the $k$-th layer of a GNN can be formulated as

$$m_v^{(k)} = \mathrm{AGG}^{(k)}\big(\{(h_u^{(k-1)}, h_v^{(k-1)}, e_{uv}) : u \in \mathcal{N}_v\}\big), \quad (1)$$

$$h_v^{(k)} = \sigma\big(W^{(k)} m_v^{(k)}\big), \quad (2)$$

where $m_v^{(k)}$ is the aggregated message for node $v$ at the $k$-th layer, $e_{uv}$ is the representation of edge $(u, v)$, $\sigma(\cdot)$ is the activation function, $W^{(k)}$ is a learnable weight matrix, and $\mathcal{N}_v$ is the set of neighbors of $v$. $\mathrm{AGG}^{(k)}(\cdot)$ is the neighborhood aggregation process of the $k$-th layer. For convenience, we initialize $h_v^{(0)}$ with the initial node features for every $v \in V$, where $V$ is the set of nodes (atoms).
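As a concrete sketch of this update rule — the sum aggregator, ReLU activation, and all names below are our own illustrative choices, not the paper's implementation — one message-passing layer could look like:

```python
import numpy as np

def message_passing_layer(h, edges, e_feat, W):
    """One GNN layer: each node sums messages (neighbor hidden state plus
    edge feature) over its incoming edges, then applies a shared linear
    map followed by ReLU, i.e. h_v = sigma(W m_v)."""
    m = np.zeros_like(h)
    for (u, v), e_uv in zip(edges, e_feat):
        m[v] += h[u] + e_uv          # AGG: sum over neighbors in N_v
    return np.maximum(0.0, m @ W)    # sigma = ReLU
```

Stacking K such layers gives each node a view of its K-hop neighborhood, as described above.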

Multi-head attention mechanism
The multi-head attention mechanism is the core building block of the Transformer [28], which has several stacked scaled dot-product attention layers. The input of a scaled dot-product attention layer consists of queries $q$ and keys $k$ with dimension $d_k$, and values $v$ of dimension $d_v$. In practice, the sets of $(q, k, v)$ are packed together into matrices $(Q, K, V)$ so that they can be computed simultaneously. The final output matrix is computed by

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V. \quad (3)$$

Multi-head attention allows the model to focus jointly on information from different representation subspaces. Suppose multi-head attention has $h$ parallel attention layers; then the output is

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O, \quad (4)$$

where $\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$ and $W^O$, $W_i^Q$, $W_i^K$, $W_i^V$ are learnable projection matrices.

4 BatmanNet

This section introduces the pre-training strategy, the BatmanNet architecture, and the fine-tuning steps. Before getting into the details of BatmanNet, we introduce its overall structure. As shown in Figure 2, BatmanNet is a bi-branch model that includes a node branch and an edge branch. Each branch focuses on learning the embeddings of nodes or edges from the input graph that can be used for fine-tuning on the downstream tasks. Similar to MAE [29], we propose a transformer-style asymmetric encoder-decoder architecture for each branch. By applying a bi-branch graph masking pre-training strategy, the encoder operates on partially observable signals of molecular graphs and embeds them into latent representations of nodes or edges. The lightweight decoder takes the latent representations of nodes and edges along with mask tokens to reconstruct the original molecule.
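A minimal NumPy sketch of the scaled dot-product and multi-head attention defined in the Preliminaries; the per-head input projections and the output projection $W^O$ are omitted for brevity, and all names are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(Q, K, V, h):
    # split the model dimension into h subspaces, attend in each, concatenate
    heads = [attention(q, k, v) for q, k, v in
             zip(np.split(Q, h, -1), np.split(K, h, -1), np.split(V, h, -1))]
    return np.concatenate(heads, axis=-1)
```

When all keys are identical, the softmax weights are uniform and the output reduces to the mean of the value rows, which is a quick sanity check for the implementation.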
4.1 Pre-training strategy: Bi-branch graph masking

[16] and [15] discussed that a good self-supervised pre-training strategy should include both node-level and graph-level tasks so that the model can learn local and global information of the molecular graph. Here we propose a self-supervised pre-training strategy that can effectively learn local and global information with a single prediction task. Inspired by MAE [29], we propose a bi-branch graph masking and reconstruction task for molecular pre-training. Specifically, given a molecular graph, for the node branch of BatmanNet, we randomly mask (i.e., remove) its nodes with a high ratio and let the encoder operate only on the remaining unmasked nodes. The same masking strategy is applied to the edge branch. The difference is that the edge branch takes the complementary edge-graph of the original molecular graph as the input. Given an input molecule, we denote its original molecular graph as the node-graph, where atoms are represented as nodes and bonds as edges. The edge-graph is the dual of the node-graph, whose nodes and edges correspond to the edges and nodes of the node-graph, respectively. Thus, like the node-graph, we can directly apply GNNs to the edge-graph to obtain its edge embeddings through message passing. More implementation details of the dual graph are deferred to Appendix A1. It is worth mentioning that, considering that the message passing process in GNNs is directed, we adopt the directed masking scheme [25] for the random masking of edges, i.e., (u, v) and (v, u) are different: removing (u, v) does not mean that (v, u) is also removed. To distinguish (u, v) from (v, u), we add the feature of the starting node (head node) to the initial feature of the edge.
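The masking step and the directed edge-graph construction can be sketched as below. This is an illustrative reconstruction, not the paper's code: the head-to-tail adjacency rule in `to_edge_graph` (an edge-node (u, v) connects to (v, w), excluding the immediate reverse edge) is our assumption based on the directed message passing described above.

```python
import random

def mask_nodes(node_ids, ratio=0.6, rng=random):
    """Randomly remove `ratio` of the nodes; the encoder sees only `visible`."""
    n_mask = int(round(len(node_ids) * ratio))
    masked = set(rng.sample(node_ids, n_mask))
    visible = [v for v in node_ids if v not in masked]
    return visible, sorted(masked)

def to_edge_graph(directed_edges):
    """Dual graph: every directed edge (u, v) becomes a node. Edge-node
    (u, v) is linked to (v, w) when one ends where the other starts,
    excluding the reverse edge (v, u), so (u, v) and (v, u) remain
    distinct under the directed masking scheme."""
    nodes = list(directed_edges)
    adj = [((u, v), (v2, w)) for (u, v) in nodes for (v2, w) in nodes
           if v == v2 and (v2, w) != (v, u)]
    return nodes, adj
```

The same `mask_nodes` routine applied to the edge-graph's nodes implements the edge-branch masking.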
We argue that with bi-branch graph masking and reconstruction, the pre-trained node and edge embeddings can learn the molecular information at both the node/edge and graph levels. First, after masking nodes and edges with a relatively high ratio, e.g., 60%, each node and edge has a high likelihood of missing neighboring nodes and edges simultaneously. To reconstruct the missing neighbors, each node and edge embedding needs to learn its contextual information locally. [16] proposed a node-level task that uses the node embedding to predict neighboring subgraphs within k hops. In our high-ratio random masking and reconstruction, the "subgraph" of missing neighboring nodes and edges is not restricted in scale or shape, so the node and edge embeddings learned from our task are encouraged to capture local contextual information beyond the k-hop range and limited shapes. Second, after removing nodes or edges with a high ratio, the remaining nodes and edges can be seen as a collection of "subgraphs" that are used to predict the full graph. This graph-to-graph prediction is much more challenging than other self-supervised pre-training tasks, which usually learn the global graph information with smaller graphs or motifs as the target [16]. Our "harder" pre-training task of bi-branch graph masking and reconstruction gives a larger capacity for learning high-quality node and edge embeddings that capture the molecular information at both the local and global levels.

Asymmetric Autoencoder for Pre-training
Encoder. The encoder maps the initial features of the visible, unmasked nodes and edges to embeddings in their latent feature spaces. As shown in Figure 2, the node and edge branches of the encoder are two symmetric multi-layer transformer-styled networks based on the implementation described in [16]. The encoder comprises a stack of N identical GNN-Attention blocks, and each block adopts a double-layer information extraction framework. In a GNN-Attention block, a GNN is used as the first layer of the information extraction network; it performs the message passing operation on the input graph to extract local information, and its output is the learned embeddings. A multi-head attention layer is then adopted to capture the global information of the graph. Concretely, a GNN-Attention block comprises three GNNs, i.e., $G_Q(\cdot)$, $G_K(\cdot)$, and $G_V(\cdot)$, that learn the embeddings of queries $Q$, keys $K$, and values $V$ as follows:

$$Q = G_Q(H), \qquad K = G_K(H), \qquad V = G_V(H), \quad (5)$$

where $H \in \mathbb{R}^{n \times d}$ is the hidden representation matrix of the $n$ nodes, with embedding size $d$. Then we apply equations (3) and (4) to get the final output of the GNN-Attention block. At the beginning of the encoder, we use a linear projection with added positional embeddings to keep the positional information of unmasked nodes and edges.
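A toy sketch of one GNN-Attention block following equations (3)–(5). Here each of $G_Q$, $G_K$, $G_V$ is stood in for by a single mean-aggregation GNN layer with self-loops; this is a simplification of the multi-layer GNNs the paper describes, and all names are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gnn(H, A, W):
    """Stand-in for G_Q / G_K / G_V: one round of mean-neighbor
    aggregation (with self-loops) followed by a linear map."""
    A_hat = A + np.eye(len(A))                 # add self-loops
    deg = A_hat.sum(axis=1, keepdims=True)
    return (A_hat @ H / deg) @ W

def gnn_attention_block(H, A, WQ, WK, WV):
    # local information via three GNNs, then global self-attention over nodes
    Q, K, V = gnn(H, A, WQ), gnn(H, A, WK), gnn(H, A, WV)
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
```

The design point is that the attention operates on GNN outputs, so every attention score is already informed by local graph structure.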
Here we adopt the absolute sinusoidal positional encoding proposed by [28], and the positions of nodes and edges in the input graph are indexed by RDKit before masking. By feeding the original graph and its dual graph to the two branches of the encoder, respectively, we get the aggregated node embedding $m_v$ and edge embedding $m_{vw}$ as follows:

$$m_v = \sum_{u \in \mathcal{N}_v} h_u, \qquad m_{vw} = \sum_{(u,v) \in \mathcal{N}_{vw}} h_{uv}, \quad (6)$$

where $h_u$ and $h_{uv}$ are the hidden states produced by the GNN-Attention blocks of the node branch and the edge branch, respectively. We add long-range residual connections from the initial features of nodes and edges to $m_v$ and $m_{vw}$ to overcome the vanishing gradient and alleviate over-smoothing at the message passing stage. In the last step, we apply a Feed Forward layer and LayerNorm to obtain the node embedding and edge embedding as the final output of the encoder.
Decoder. The decoder takes as input a full set of reordered molecular representations, including (i) the embeddings of unmasked nodes and edges from the encoder and (ii) mask tokens for the removed nodes and edges. Specifically, we use a Feature Reordering layer (shown in Figure 2) that concatenates (i) and (ii) and recovers their order in the original input graphs by adding the corresponding positional embeddings. The decoder uses the same transformer-styled architecture as the encoder but is more lightweight, i.e., it stacks M GNN-Attention blocks, where M ≪ N. The BatmanNet decoder is used only in the pre-training stage to perform the molecular reconstruction task; only the encoder is used to produce molecular representations for the downstream prediction tasks. As observed in [29], a narrower or shallower decoder does not impact the overall performance of the MAE. In this asymmetric encoder-decoder design, the nodes and edges of the full graph are processed only at the lightweight decoder, which significantly reduces the model's computation and memory consumption during pre-training.
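The Feature Reordering step can be sketched as below, assuming a single shared mask-token vector and precomputed positional embeddings (both hypothetical details of our sketch, not confirmed specifics of the paper's implementation):

```python
import numpy as np

def feature_reorder(visible_emb, visible_pos, mask_token, n_total, pos_emb):
    """Rebuild the full-length input the decoder expects: place each visible
    embedding back at its original graph index, fill every masked slot with
    a shared mask token, then add positional embeddings so masked slots
    still carry their position."""
    full = np.tile(mask_token, (n_total, 1))
    full[np.asarray(visible_pos)] = visible_emb
    return full + pos_emb
```

Because only the short visible sequence passes through the deep encoder, this recombination is the first point where the full-length graph is materialized.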
Reconstruction Target. The node and edge branches of BatmanNet reconstruct molecules by predicting all features of the masked nodes and edges, respectively. The features of nodes (atoms) and edges (bonds) used in BatmanNet are deferred to Appendix A2. A linear layer is appended to each decoder's output, and its output dimension is set to the total feature size of either the atom (for the node branch) or the bond (for the edge branch). Both the node and edge reconstruction tasks are high-dimensional multi-label predictions, which alleviates the ambiguity problem discussed by [16], where a limited number of atom or edge types are used as the node/edge-level pre-training targets. The pre-training loss is computed on the masked tokens, similar to MAE [29], and the final pre-training loss $L_{\text{pre-train}}$ is defined as

$$L_{\text{pre-train}} = L_{\text{node}} + L_{\text{edge}}, \qquad L_{\text{node}} = \frac{1}{|V_{\text{mask}}|}\sum_{v \in V_{\text{mask}}} L_{ce}(p_v, y_v), \qquad L_{\text{edge}} = \frac{1}{|E_{\text{mask}}|}\sum_{(u,v) \in E_{\text{mask}}} L_{ce}\big(p_{(u,v)}, y_{(u,v)}\big), \quad (7)$$

where $L_{\text{node}}$ and $L_{\text{edge}}$ are the loss functions of the node branch and edge branch, $V_{\text{mask}}$ and $E_{\text{mask}}$ represent the sets of masked nodes and edges, respectively, and $L_{ce}$ is the cross-entropy loss between the predicted node features $p_v$ and the corresponding ground truth $y_v$, or the predicted edge features $p_{(u,v)}$ and the corresponding ground truth $y_{(u,v)}$.
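A sketch of the masked cross-entropy loss for one branch, computed only on masked positions as in the loss above. For simplicity this uses single-label targets; the actual targets are multi-label feature vectors, and the function name is our own:

```python
import numpy as np

def masked_ce_loss(logits, targets, masked_ids):
    """Cross-entropy averaged over the masked positions only:
    L = (1/|V_mask|) * sum_{v in V_mask} CE(p_v, y_v)."""
    z = logits[masked_ids]
    z = z - z.max(axis=1, keepdims=True)               # stable log-softmax
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    picked = log_p[np.arange(len(masked_ids)), targets[masked_ids]]
    return float(-picked.mean())
```

Restricting the loss to masked tokens means the model gets no gradient for trivially copying the visible inputs, which is the MAE-style choice the text describes.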

Fine-tuning Model
We only use BatmanNet's encoder for downstream tasks. Unlike pre-training, where the model input is an incomplete molecule, the inputs of downstream tasks are complete molecules without masking. After N GNN-Attention blocks, both branches of BatmanNet's encoder perform Node Aggregation, producing two node representations $m_v^{\text{node-branch}}$ and $m_v^{\text{edge-branch}}$ as follows:

$$m_v^{\text{node-branch}} = \sum_{u \in \mathcal{N}_v} h_u, \qquad m_v^{\text{edge-branch}} = \sum_{u \in \mathcal{N}_v} h_{uv}, \quad (8)$$

where $h_u$ and $h_{uv}$ are the hidden states of the GNN-Attention blocks of the node branch and the edge branch. Then we also apply a single long-range residual connection to concatenate $m_v^{\text{node-branch}}$ and $m_v^{\text{edge-branch}}$ with the initial node features and edge features, respectively. Finally, we transform the two embeddings through Feed Forward layers and LayerNorm to generate the final two embedding outputs for the downstream tasks.
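The Node Aggregation plus long-range residual concatenation might look like the sketch below; the neighbor bookkeeping format is our own, and the hidden sizes are assumed equal across branches for brevity:

```python
import numpy as np

def aggregate_nodes(n_nodes, neighbors, h_node, h_edge, x_init):
    """For each node v, sum the hidden states h_u of its neighbors (node
    branch) and the hidden states h_uv of its incident edges (edge branch),
    then concatenate the initial node features as a long-range residual.
    `neighbors[v]` is a list of (neighbor_id, edge_id) pairs."""
    m_node = np.zeros_like(h_node)
    m_edge = np.zeros_like(h_node)
    for v in range(n_nodes):
        for u, edge_id in neighbors[v]:
            m_node[v] += h_node[u]
            m_edge[v] += h_edge[edge_id]
    return (np.concatenate([m_node, x_init], 1),
            np.concatenate([m_edge, x_init], 1))
```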
Through the above process, given a molecule $G_i$ and the corresponding label $y_i$, BatmanNet's encoder generates two node embeddings, $H_i^{\text{node-branch}}$ and $H_i^{\text{edge-branch}}$, from the node branch and the edge branch, respectively. Following GROVER [16], we feed these two node embeddings into a shared self-attentive READOUT function to generate two graph-level embeddings, $g^{\text{node-branch}}$ and $g^{\text{edge-branch}}$. They are both obtained by

$$A = \mathrm{softmax}\big(W_2 \tanh(W_1 H_i^\top)\big), \qquad g = \mathrm{Flatten}(A H_i), \quad (9)$$

where $W_1 \in \mathbb{R}^{d_{\text{attn\_hidden}} \times d_{\text{hidden\_size}}}$ and $W_2 \in \mathbb{R}^{d_{\text{attn\_out}} \times d_{\text{attn\_hidden}}}$ are two weight matrices. After that, we apply a Feed Forward layer to both branches to get the predictions $p_i^{\text{node-branch}}$ and $p_i^{\text{edge-branch}}$. The final loss consists of the supervised loss $L_{\text{sup}}$ and the disagreement loss [30] $L_{\text{diss}}$, where the disagreement loss trains the two predictions to be consistent by penalizing their squared difference. So, $L_{\text{fine-tune}} = L_{\text{sup}} + L_{\text{diss}}$.

5 Experiments

Datasets
Pre-training Datasets.We use the ZINC-250K molecule dataset from [31] to pre-train BatmanNet.The dataset is constructed by sampling 250K molecules from the ZINC database [32].Here we randomly split the dataset into training and validation sets with a ratio of 9 : 1.
Fine-tuning Datasets. To evaluate the performance of pre-trained BatmanNet on downstream tasks, we follow PretrainGNN [15] and select 8 classification benchmark datasets from MoleculeNet [33] with different classification targets. We apply scaffold splitting [34,35] to split each dataset into training, validation, and test sets at a ratio of 8 : 1 : 1 in each downstream task. More details are deferred to Appendix B.
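A deterministic scaffold split can be sketched as below. `scaffold_fn` is a placeholder for a real scaffold extractor such as RDKit's `MurckoScaffoldSmiles`, and the largest-group-first greedy policy shown is one common convention, not necessarily the exact one used in the paper:

```python
from collections import defaultdict

def scaffold_split(smiles, scaffold_fn, frac=(0.8, 0.1, 0.1)):
    """Group molecules by scaffold, then greedily assign whole groups
    (largest first) so train/valid/test never share a scaffold."""
    groups = defaultdict(list)
    for i, s in enumerate(smiles):
        groups[scaffold_fn(s)].append(i)
    buckets = sorted(groups.values(), key=len, reverse=True)
    n = len(smiles)
    train, valid, test = [], [], []
    for b in buckets:
        if len(train) + len(b) <= frac[0] * n:
            train.extend(b)
        elif len(valid) + len(b) <= frac[1] * n:
            valid.extend(b)
        else:
            test.extend(b)
    return train, valid, test
```

Because whole scaffold groups move together, the test set contains structurally novel molecules, making the evaluation harder and more realistic than a random split.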

Experimental Configurations
We use the Adam optimizer [41] and the Noam learning rate scheduler [17] to optimize the model and adjust the learning rate for both pre-training and fine-tuning. The batch size is set to 32, and BatmanNet is implemented in PyTorch.
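The Noam schedule combines a linear warm-up with inverse square-root decay. A sketch follows; the default `warmup` and `factor` values are illustrative, not the settings used in the paper:

```python
def noam_lr(step, d_model=100, warmup=4000, factor=1.0):
    """Noam schedule: lr grows linearly for `warmup` steps, peaks at
    step == warmup, then decays as 1/sqrt(step):
    lr = factor * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The two arguments of `min` intersect exactly at `step == warmup`, so the schedule is continuous at its peak.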

Pre-training
The masking ratio for both branches of BatmanNet is set to 0.6, the numbers of encoder and decoder layers are 6 and 2, and the hidden size is 100. For the GNN-Attention block of each layer, the number of GNN layers is 3 and the number of self-attention heads is 2. The autoencoder contains around 2.6M parameters and is pre-trained on one Nvidia RTX 3090 for two days.
Fine-tuning

Following the same protocol as [16], we extract an additional 200 molecule-level features using RDKit. We concatenate these features with the node and edge embeddings learned by the encoder of BatmanNet, and use MLPs for the fine-tuning prediction tasks. The model is selected based on its performance on the validation set. We run our model on three different scaffold splittings in each downstream prediction task and report the mean and standard deviation of the AUC after 30 epochs. More pre-training and fine-tuning details are deferred to Appendix C.

Table 1: Comparison between our proposed BatmanNet and all baselines on eight benchmark datasets. We report the mean (and standard deviation) AUC for each dataset over three random seeds with scaffold splitting.

Results on Downstream Tasks.
The overall performance of all models on the downstream tasks is presented in Table 1. We report the comparison of BatmanNet with GROVER base and GROVER large in terms of pre-training dataset size, number of parameters, and average AUC in Table 2. It demonstrates that our BatmanNet with the asymmetric encoder-decoder design can significantly reduce computational consumption while achieving better performance.

Ablation Study
To further investigate the factors that influence the performance of our proposed BatmanNet framework, we conducted ablation studies on the eight benchmark datasets.
How Powerful is the Bi-branch Information Extraction Network? To evaluate the expressive power of the bi-branch information extraction network in BatmanNet, we replaced it with a single-branch network (the node branch or the edge branch only). We pre-train them under the same training setting with nearly the same number of parameters (2.6M). As shown in Table 3, we can observe that: (1) BatmanNet with both branches achieves 1.4% and 1.6% improvements in average AUC over using only the node branch or only the edge branch, respectively, which demonstrates the effectiveness of the bi-branch design. (2) Even with a single branch, BatmanNet can match or exceed the performance of GROVER in average AUC, demonstrating that our bi-branch graph masking is highly effective in capturing molecular information from the graph.
The Effect of Different Masking Ratios. To evaluate the influence of different masking ratios, we pre-trained BatmanNet with masking ratios ranging from 10% to 90% and report the average AUC over all downstream tasks (see Figure 3; more detailed results are deferred to Appendix D2). The results show that setting the masking ratio to 60% achieves the best performance. This suggests that masking nodes and edges with a relatively high ratio makes a "hard" pre-training task that gives the remaining nodes and edges a larger capacity to capture molecular information in their embeddings. However, when the masking ratio is set higher than 60%, there is not enough information left in the remaining nodes and edges to recover the complete graph, and this starts to harm the quality of the learned embeddings.

Conclusion
We developed a novel pre-trained model for molecular representation learning. Designing effective pre-training tasks is at the core of pre-trained models. As noted by [15,16], pre-training tasks for molecular representation should be designed to learn both node-level and graph-level information. We investigated works in NLP [17] and computer vision (CV) [29], where pre-trained models have been dominantly used, and found that the key to most successful pre-training models is to design simple but effective tasks that scale well. Autoencoding tasks naturally fit these qualities. In our work, we design a simple bi-branch graph masking task on an autoencoder that does not require any specific domain knowledge, e.g., pre-defined motifs or subgraphs, and that fulfills the above requirements for learning molecular representations. Adopting an asymmetric transformer-styled autoencoder further increases the scalability of our approach. We improved results on eight downstream tasks over other extensively trained models with less pre-training data and fewer computational resources. It would be worth scaling our approach to a much larger pre-training dataset and evaluating how much further we can improve from the current small model and pre-training dataset. Since high-ratio masking is a very challenging task that gives the capacity to learn high-quality node/edge embeddings, we expect our pre-trained model to have strong generalization ability that can be transferred to many different downstream tasks and possibly perform well on few-shot learning tasks.

Figure 1: Illustration of the designed self-supervised tasks of BatmanNet. A very high portion of nodes or edges is randomly masked, and BatmanNet is pre-trained to reconstruct the original molecule from the latent representation and mask tokens.

Figure 2: Overview of the BatmanNet architecture with the node branch (left) and edge branch (right). The bottom sub-network colored light orange is BatmanNet's encoder. After pre-training, it is used as a feature extractor for the downstream molecular property prediction tasks by stacking two MLPs on top of the two branches, and the final prediction is averaged over the two outputs. The upper sub-network colored light blue is BatmanNet's decoder.

Figure 3: Comparison of different masking ratios. The y-axis is the average AUC over all downstream datasets in this paper.

(1) From Table 1 we can see that the fine-tuned BatmanNet models consistently outperform the other approaches on all datasets except PretrainGNN on the MUV dataset. They achieve an overall relative improvement of 0.5% in average AUC across all datasets compared to the previous SOTA model, GROVER large. This improvement justifies that our bi-branch graph masking and reconstruction task is more effective than the multiple node-level and graph-level tasks used in GROVER large at learning molecular information at both the node/edge and graph levels. (2) As shown in Table 2, the second-best-performing model, GROVER large, contains 100M parameters and was pre-trained on 11M unlabeled molecules using 250 Nvidia V100 GPUs over four days. In contrast, our BatmanNet has 2.575M parameters and used only 250K unlabeled molecules for pre-training; we pre-trained our model within two days on a single Nvidia RTX 3090 GPU. Substantially, with less data (2.273%), a smaller model (2.575%), and fewer GPUs (about 0.4%), BatmanNet outperforms GROVER large in terms of average AUC and training time.

Table 2: Comparison of BatmanNet with GROVER base and GROVER large in terms of pre-training dataset size, the number of parameters, the number of GPUs, the pre-training time, and average AUC on eight benchmarks.

Table 3: Comparison between BatmanNet with bi-branch and with a single branch (only the node or edge branch).