Similarity measures-based graph co-contrastive learning for drug–disease association prediction

Abstract Motivation An imperative step in drug discovery is the prediction of drug–disease associations (DDAs), which aims to uncover potential therapeutic possibilities for already validated drugs. Predicting DDAs with wet experiments is costly and time-consuming. Graph Neural Networks, as an emerging technique, have shown a superior capacity for DDA prediction. However, existing Graph Neural Network-based DDA prediction methods suffer from sparse supervised signals. As graph contrastive learning has shone in mitigating sparse supervised signals, we seek to leverage it to enhance DDA prediction. Unfortunately, most conventional graph contrastive learning-based models corrupt the raw data graph to augment data, which is unsuitable for DDA prediction. Meanwhile, these methods cannot model the interactions between nodes effectively, thereby reducing the accuracy of association prediction. Results We propose a model, called Similarity Measures-based Graph Co-contrastive Learning (SMGCL), to tap potential drug candidates for diseases. For learning embeddings from complicated network topologies, SMGCL includes three essential processes: (i) it constructs three views based on drug and disease similarities and DDA information; (ii) it applies two graph encoders over the three views to model both local and global topologies simultaneously; and (iii) it introduces a graph co-contrastive learning method that co-trains node representations to maximize the agreement between them, thus generating high-quality predictions. Contrastive learning serves as an auxiliary task for improving DDA prediction. Evaluated by cross-validation, SMGCL achieves strong overall performance. Further proof of SMGCL's practicality is provided by a case study of Alzheimer's disease. Availability and implementation https://github.com/Jcmorz/SMGCL.


Introduction
Rapid advances in drug research and development over the past few decades, as well as public health emergencies such as the outbreak of COVID-19, have forced researchers to explore effective ways to counter these risks. Computer-aided prediction of drug–disease associations (DDAs, a.k.a. drug repositioning) is becoming more appealing because it starts from de-risked compounds, which can lower total development expenses and shorten development schedules.
At present, popular DDA prediction methods can be roughly divided into two categories: methods based on matrix decomposition and completion, and methods based on Graph Neural Networks (GNNs). Among the former, BNNR (Yang et al. 2019) integrates the drug–drug, drug–disease, and disease–disease networks and uses a bounded nuclear norm regularization method to complete the drug–disease matrix under the low-rank assumption; GRGMF (Zhang et al. 2020b) is an improved neural collaborative filtering framework, which adaptively learns the neighborhood information for each node and draws on existing external similarity information to enhance prediction performance. Among the GNN-based methods, DRWBNCF (Meng et al. 2022) encodes known DDAs together with drug and disease neighborhoods and neighbor interactions, allowing specific network features to be taken into account; MVGCN (Fu et al. 2022) constructs multiple views by combining different similarity networks with the biomedical bipartite network and uses a neighborhood information aggregation layer to aggregate the information of inter- and intra-domain neighbors in different views. Although the above methods achieve promising performance, they all suffer from sparsely labeled data, since annotated data are limited because wet experiments are expensive and time-consuming. These data are insufficient to induce accurate representations of drugs and diseases in most cases, leading to suboptimal performance.
A contrastive learning paradigm from the computer vision domain is one approach to addressing these difficulties (Wu et al. 2018), which aims to construct consistent and inconsistent view pairs via data augmentations such as cutout and color distortion (Howard 2014). Some researchers have made preliminary attempts on graph data (Huang et al. 2021, Zhao et al. 2021). However, contrastive learning on drug repositioning poses unique challenges: (i) the graph of DDAs has fewer nodes and sparser edges (a number of diseases might only be treated by one drug); therefore, techniques based on node/edge dropout are unsuitable for DDA prediction. (ii) When creating self-supervision signals, most existing methods consider neighbors in isolation. We instead argue that interactions between neighboring nodes may reveal potential relations between them and the target node, and modeling such interactions can enrich the semantics of the target node representation.
To overcome the aforementioned limitations, we enrich DDA graph contrastive learning (GCL) by incorporating the drug–drug similarity graph and disease–disease similarity graph, motivated by the fact that similar drugs often share indications. On top of that, we propose an end-to-end Similarity Measures-based Graph Co-contrastive Learning (SMGCL) model for DDA prediction with three modules. The first module, "multi-source contrast views construction," builds the known DDA view and the drug-similarity and disease-similarity views (applying nearest neighbors) from three sources of data. The second module, "context-aware neighborhood aggregation," uses a bilinear GNN to capture complicated local features in the DDA view, and a global-aware attention mechanism to compensate for the receptive-field issue in bilinear aggregation. The last module is the "contrastive objective," where we introduce a sampling mechanism to thoroughly mine supervised signals for efficient co-contrastive learning. Furthermore, the prediction task and the contrastive learning task are unified under a "primary & auxiliary" learning paradigm. Cross-validation and extensive experiments on three benchmark datasets provide statistical evidence for the superiority of SMGCL over the baseline approaches, and a further case study demonstrates the practicability of SMGCL.

Materials and methods
We denote vectors by lowercase boldface, matrices by uppercase boldface, and sets by uppercase calligraphic font. Let $\mathcal{R} = \{r_1, r_2, \ldots, r_N\}$ denote the set of drugs, where $N$ is the number of drugs, and $\mathcal{D} = \{d_1, d_2, \ldots, d_M\}$ denote the set of diseases, where $M$ is the number of diseases. The objective of DDA prediction is to learn a mapping function $f((r, d) \mid x): \mathcal{E} \to [0, 1]$ from edges to scores, where $x$ denotes the learnable parameters, in order to determine the probability that a given drug is effective in treating a given disease. Figure 1 displays the architecture of the proposed method. Note that we describe the whole model from the drug side, since the drug and disease parts are dual.

DDA view
The DDA view can be regarded as an undirected graph $G = \{\mathcal{V}, \mathcal{E}\}$, where $\mathcal{V}$ represents the set of nodes corresponding to drugs and diseases, and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ denotes the set of edges, indicating interactions between the two kinds of nodes in $\mathcal{V}$. Furthermore, the graph $G$ can be represented as an incidence matrix $\mathbf{A} \in \{0, 1\}^{N \times M}$, where $A_{ij} = 1$ if drug $r_i$ can treat disease $d_j$, and $A_{ij} = 0$ otherwise.

Similarity view
A tremendous amount of effort has gone into calculating the similarity of drugs or diseases. Taking the construction of the drug-similarity view as an example: given the similarity of drugs, for a certain drug node $r_i$ we select the drugs with the top-$K$ highest similarity as its neighbor nodes, i.e. those most similar to this drug in chemical structure, side effects, etc. In this way, the drug-similarity view is denoted as $G_R = \{\mathcal{V}_R, \mathcal{E}_R\}$ with $N$ drugs, and its adjacency matrix $\mathbf{A}_R \in \{0, 1\}^{N \times N}$ satisfies $A_{R,ij} = 1$ if drug $r_j$ is among the top-$K$ nearest neighbors of drug $r_i$, and $A_{R,ij} = 0$ otherwise. In the same way, the disease-similarity view is denoted as $G_D = \{\mathcal{V}_D, \mathcal{E}_D\}$ with $M$ diseases, and its adjacency matrix $\mathbf{A}_D \in \{0, 1\}^{M \times M}$ is defined analogously. For descriptive purposes, we define terms that are used interchangeably throughout the literature: view is a synonym for graph.
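The top-$K$ neighbor selection above can be sketched as follows. This is a minimal illustration assuming a precomputed drug–drug similarity matrix `S`; the helper name `build_similarity_view` and the toy similarities are ours, not from the paper.

```python
import numpy as np

def build_similarity_view(S, k):
    """Build a top-K nearest-neighbour adjacency matrix from a similarity
    matrix S (N x N, higher value = more similar). Hypothetical helper."""
    N = S.shape[0]
    A = np.zeros((N, N), dtype=int)
    for i in range(N):
        sims = S[i].copy()
        sims[i] = -np.inf                  # exclude the node itself
        nbrs = np.argsort(sims)[::-1][:k]  # indices of the top-K most similar
        A[i, nbrs] = 1
    return A

# toy drug-drug similarity for three drugs
S = np.array([[1.0, 0.9, 0.2],
              [0.9, 1.0, 0.4],
              [0.2, 0.4, 1.0]])
A_R = build_similarity_view(S, k=1)
```

Note that the resulting adjacency matrix is generally asymmetric, since top-$K$ neighborhood is not a symmetric relation.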

Context-aware neighborhood aggregation
After view construction, we develop a context-aware neighborhood aggregation module consisting of two encoders, to capture both heterogeneous (homogeneous) and local (global) information. Each encoder is in charge of extracting useful information from one heterogeneous (homogeneous) graph to improve DDA prediction.

Node feature extraction
Each column of the adjacency matrix of the similarity view can act as an initial feature vector for the corresponding node; however, these vectors may not capture the higher-order connectivity of the graph. For this reason, we run Random Walk with Restart (Tong et al. 2006) separately on the drug-similarity matrix $\mathbf{A}_R$ and the disease-similarity matrix $\mathbf{A}_D$ to enrich the initial embedding of each node with local structural context. The process can be defined by the following recurrence equation:

$$\mathbf{x}^{(l+1)}_{r_i} = (1 - \alpha)\, \mathbf{P}_R\, \mathbf{x}^{(l)}_{r_i} + \alpha\, \mathbf{x}^{(0)}_{r_i},$$

where $\alpha$ is the restart probability and $\mathbf{P}_R$ is the probability transition matrix obtained from $\mathbf{A}_R$ by column-wise normalization. $\mathbf{x}^{(l)}_{r_i}$ is the column vector of drug node $r_i$, whose $j$-th entry indicates the probability of reaching node $j$ after $l$ steps. $\mathbf{x}^{(0)}_{r_i} \in \mathbb{R}^N$ is a one-hot vector whose $i$-th entry is 1 and 0 otherwise, denoting the initial vector representation of drug $r_i$.
After reaching the steady state, a single-layer perceptron is applied to obtain $\mathbf{e}_{r_i} = \mathrm{MLP}(\mathbf{x}^{(\infty)}_{r_i})$ on $\mathbf{A}_R$ for drugs, where $\mathbf{e}_{r_i} \in \mathbb{R}^t$ denotes the updated drug node representation with $t$ dimensions and the MLP contains a single hidden layer. In the same way, we obtain the disease node representation $\mathbf{e}_{d_j} \in \mathbb{R}^t$.
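The RWR recurrence described above can be sketched as a simple fixed-point iteration; this is a minimal numpy version (the follow-up single-layer perceptron is omitted, and the convergence tolerance is our choice).

```python
import numpy as np

def rwr(A, alpha=0.1, tol=1e-8, max_iter=1000):
    """Random Walk with Restart on adjacency matrix A. Returns the
    steady-state matrix whose i-th column is the enriched feature vector
    of node i (initialised as a one-hot vector)."""
    P = A / A.sum(axis=0, keepdims=True)   # column-wise normalisation
    N = A.shape[0]
    X = np.eye(N)                          # x^(0): one-hot per node
    X0 = X.copy()
    for _ in range(max_iter):
        # x^(l+1) = (1 - alpha) * P x^(l) + alpha * x^(0)
        X_new = (1 - alpha) * P @ X + alpha * X0
        if np.abs(X_new - X).max() < tol:
            return X_new
        X = X_new
    return X

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
X_inf = rwr(A, alpha=0.1)
```

Because the transition matrix is column-stochastic and the restart redistributes probability mass, each column of the steady-state matrix remains a probability distribution.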

DDA view encoder
GCN (Kipf and Welling 2017) assumes that neighboring nodes are independent of each other and uses a weighted sum to learn low-dimensional node representations. We formulate a GA aggregator for target node $v$ (drug $r$ or disease $d$) as:

$$\mathrm{GA}(v) = \sigma\Big(\sum_{i \in \tilde{\mathcal{N}}(v)} \alpha_{vi}\, \mathbf{W}\mathbf{e}_i\Big),$$

where $\mathrm{GA}(\cdot)$ is the non-linear aggregator, $\tilde{\mathcal{N}}(v) = \{v\} \cup \{i \mid A_{vi} = 1\}$ denotes the extended neighborhood of node $v$, which contains the node $v$ itself, and $\sigma$ is a non-linear activation function. $\alpha_{vi}$ is the weight of neighbor $i$ and follows the standard GCN normalization, $\alpha_{vi} = 1/\sqrt{|\tilde{\mathcal{N}}(v)|\,|\tilde{\mathcal{N}}(i)|}$.

In addition, the co-occurrence of two neighboring nodes can be regarded as an important feature of the target node. However, common GCNs ignore the possible interactions between neighboring nodes. Even a Graph Attention Network, which adaptively aggregates information from neighbors of different importance, cannot extract the possible interaction features between neighboring nodes. Meanwhile, multiplying two vectors can effectively model their interaction by emphasizing consistent information and weakening divergent information (Zhu et al. 2020). Thus, we define a BA aggregator for target node $v$ as:

$$\mathrm{BA}(v) = \sigma\Big(\frac{1}{b_v}\sum_{i \in \mathcal{N}(v)}\ \sum_{j \in \mathcal{N}(v),\, j > i} \mathbf{W}_b\,(\mathbf{e}_i \odot \mathbf{e}_j)\Big),$$

where $\mathrm{BA}(\cdot)$ is the non-linear aggregator, $b_v = \frac{1}{2} d_v (d_v - 1)$ denotes the number of interactions for node $v$, eliminating the bias of node degree to some extent through normalization, $\odot$ is the element-wise product, and $\mathbf{W}_b$ is the weight matrix for feature transformation.
Then, the encoder built on the DDA view for message passing between drugs and diseases extracts indirect interactions in the local structure. Specifically, for target node $v$, the DDA view encoder is defined as:

$$\mathbf{h}_v = \beta\, \mathrm{GA}(v) + (1 - \beta)\, \mathrm{BA}(v),$$

where $\beta$ is a hyper-parameter to trade off the strengths of the GA aggregator and the BA aggregator.
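The combined GA/BA aggregation can be illustrated with a small numpy sketch. This is a simplified illustration under our own assumptions: the GA weights are reduced to a uniform mean over the extended neighborhood, ReLU stands in for the unspecified activation, and the embeddings and weight matrices are toy values.

```python
import numpy as np

def ga_ba_aggregate(E, nbrs_of, v, W, Wb, beta=0.6):
    """Combine a GCN-style weighted-sum (GA) aggregator with a bilinear (BA)
    aggregator over element-wise products of neighbour pairs."""
    relu = lambda x: np.maximum(x, 0)
    nbrs = nbrs_of[v]
    ext = [v] + nbrs                       # extended neighbourhood ~N(v)
    ga = relu(sum(E[i] for i in ext) / len(ext) @ W)
    # bilinear term: element-wise products of all unordered neighbour pairs
    d = len(nbrs)
    pairs = [E[nbrs[i]] * E[nbrs[j]] for i in range(d) for j in range(i + 1, d)]
    b_v = d * (d - 1) / 2                  # number of interactions for v
    ba = relu(sum(pairs) / b_v @ Wb) if pairs else np.zeros(W.shape[1])
    return beta * ga + (1 - beta) * ba     # trade-off hyper-parameter beta

E = {0: np.array([1.0, 0.0]), 1: np.array([0.0, 1.0]), 2: np.array([1.0, 1.0])}
W = np.eye(2)
Wb = np.eye(2)
h0 = ga_ba_aggregate(E, {0: [1, 2]}, 0, W, Wb, beta=0.5)
```

With identity weight matrices and $\beta = 0.5$, the output is simply the average of the mean-pooled extended neighborhood and the normalized pairwise-product term.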

Similarity view encoder
Previous drug repositioning research assumed that similar drugs treat the same disease, but we argue that dissimilar drugs might also treat the same disease. To fully exploit this potential correlation, we design a global-aware strategy based on an attention architecture, which amplifies significant signals and weakens noisy signals when calculating the attention coefficient $\delta_{vi}$, yielding node representations that take multiple perspectives into account. Specifically, the attention mechanism considers the following two aspects. First, we calculate the average representation of all node embeddings in the similarity view. To explore the potential of drug treatment for non-indications, the node representation and the average information representation are used to calculate the following global attention score:

$$g_v = \mathrm{att}_1\big(\mathbf{W}_1 \mathbf{e}_v \,\|\, \mathbf{W}_1 \bar{\mathbf{e}}\big),$$

where $\mathrm{att}_1$ is a single-layer feedforward neural network with LeakyReLU as the activation function, $\mathbf{W}_1$ is a transformation matrix, and $\bar{\mathbf{e}}$ represents the average node information obtained by average pooling. In addition, we extend the message-passing process with an attention mechanism: if a drug neighbor node is more correlated with the target drug node, its contribution in aggregation toward the target node is more significant, and vice versa.
The local attention score is computed as:

$$l_{vj} = \mathrm{att}_2\big(\mathbf{W}_2 \mathbf{e}_v \,\|\, \mathbf{W}_2 \mathbf{e}_j\big),$$

where $\mathbf{W}_2$ is a transformation matrix, $\|$ denotes the concatenation operation, $\mathbf{e}_j$ is the representation of a neighbor of node $v$, and $\mathrm{att}_2$ is a single-layer feedforward neural network applying the LeakyReLU nonlinearity. Then, the global and local scores of each node are added, following the additive attention mechanism (Bahdanau et al. 2015). Besides, the softmax function is used to normalize the coefficients across all choices of $j$, so that coefficients can be compared directly between all nodes. The attention coefficient $\delta_{ij}$ between node $i$ and node $j$ is calculated as:

$$\delta_{ij} = \mathrm{softmax}_j\big(g_i + l_{ij}\big) = \frac{\exp(g_i + l_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(g_i + l_{ik})}.$$

Figure 1. The framework overview of the proposed SMGCL. Solid rounded rectangles in (a) indicate three kinds of views, which are constructed from three different kinds of data. The DDA view is constructed on the known associations in the training set. Next, the node representation generated by the random walk with restart is transformed and applied as input to the model. Then, filled rounded rectangles in (b) indicate neural network encoders. For each type of node, we can get two kinds of representations by the different neural network encoders. Finally, we co-train the node representations; the prediction task and the contrastive learning task are unified under a primary & auxiliary learning paradigm in (c). Best viewed in color.
Here, $l_{ij}$ determines the amount of information flowing from $j$, while $\delta_{ij}$ decides how much information target node $i$ may receive. In this way, we obtain another representation of drugs and diseases from the "drug-similarity view" and "disease-similarity view," respectively, denoted as $\mathbf{q}_v\ (v \in \{r, d\})$. The calculation is defined as:

$$\mathbf{q}_v = \sigma\Big(\sum_{j \in \mathcal{N}(v)} \delta_{vj}\, \mathbf{W}_3\, \mathbf{e}_j\Big),$$

where $\mathbf{W}_3$ is the weight matrix.
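The global-plus-local attention aggregation can be sketched as follows. This is a simplified stand-in: the learned networks $\mathrm{att}_1$/$\mathrm{att}_2$ and the transformation matrices are replaced here by plain dot products, so only the additive-then-softmax structure is illustrated.

```python
import numpy as np

def similarity_view_attention(E, nbrs, v):
    """Additive attention sketch: a global score against the average
    embedding plus a per-neighbour local score, softmax-normalised, then
    used to weight the neighbour embeddings."""
    leaky = lambda x: np.where(x > 0, x, 0.01 * x)
    e_bar = E.mean(axis=0)                         # global average representation
    g = leaky(E[v] @ e_bar)                        # global score (scalar)
    local = np.array([leaky(E[v] @ E[j]) for j in nbrs])
    scores = g + local                             # additive attention
    delta = np.exp(scores) / np.exp(scores).sum()  # softmax coefficients
    return sum(d * E[j] for d, j in zip(delta, nbrs))

E = np.array([[1.0, 0.0],
              [0.8, 0.2],
              [0.0, 1.0]])
q0 = similarity_view_attention(E, [1, 2], 0)
```

The neighbor most aligned with the target node receives the larger softmax coefficient, so the aggregated representation is pulled toward it.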
REMARK: We elaborately describe the drug representation learning process here. Because the disease representation learning is a dual process, we omit it for brevity.

Generating prediction and model optimization
To reconstruct the associations between drugs and diseases, our decoder $f(\mathbf{e}_{r_i}, \mathbf{e}_{d_j})$ outputs the predicted probability score $\hat{y}_{r_i, d_j}$. The DDA graph possesses two characteristics: (i) sparse edges (there are only a small number of known DDAs) and (ii) limited nodes (the numbers of drugs and diseases are far smaller than those of users and items in recommender systems). To make full use of all this information, we take all unknown drug–disease pairs as negative instances in the training set of each fold. Since there is no negative sampling, the setting of negative samples is the same in the training and test sets. Furthermore, some existing studies (Zhao et al. 2021) sample the same number of unknown DDAs as known associations in the training set. We argue that such random sampling is likely to introduce unnecessary noise. Given that known DDAs are far fewer than unknown or unseen DDAs, and that known DDAs have undergone extensive laboratory and clinical validation, they are highly reliable and crucial for enhancing predictive performance. Hence, our proposed SMGCL learns its parameters by minimizing the weighted binary cross-entropy loss:

$$\mathcal{L}_{pred} = -\Big(\eta \sum_{(i,j) \in \mathcal{S}^{+}_{rd}} \log \hat{y}_{r_i, d_j} + \sum_{(i,j) \in \mathcal{S}^{-}_{rd}} \log\big(1 - \hat{y}_{r_i, d_j}\big)\Big),$$

where $(i, j)$ indicates the pair of drug $r_i$ and disease $d_j$, $\mathcal{S}^{+}_{rd}$ denotes the set of all known DDAs, and $\mathcal{S}^{-}_{rd}$ represents the set of all unknown or unseen DDAs. The balance factor $\eta = |\mathcal{S}^{-}_{rd}| / |\mathcal{S}^{+}_{rd}|$ emphasizes the importance of observed associations to mitigate the damage of data imbalance, where $|\mathcal{S}^{-}_{rd}|$ and $|\mathcal{S}^{+}_{rd}|$ are the numbers of pairs in $\mathcal{S}^{-}_{rd}$ and $\mathcal{S}^{+}_{rd}$. Moreover, instead of minimizing the weighted binary cross-entropy loss, we also consider a variant of our model, named SMGCL-NS, which minimizes the unweighted binary cross-entropy loss; accordingly, it samples the same number of unknown DDAs as known associations.
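The weighted binary cross-entropy with the balance factor $\eta = |\mathcal{S}^-|/|\mathcal{S}^+|$ can be sketched directly. This is a minimal numpy version on a toy association matrix; the mean reduction and the numerical-stability epsilon are our choices.

```python
import numpy as np

def weighted_bce(y_hat, A_train):
    """Weighted binary cross-entropy over all drug-disease pairs; the balance
    factor eta = |S-| / |S+| up-weights the scarce known associations."""
    pos = A_train == 1
    neg = ~pos
    eta = neg.sum() / pos.sum()            # balance factor
    eps = 1e-12                            # numerical stability
    loss = -(eta * np.log(y_hat[pos] + eps).sum()
             + np.log(1 - y_hat[neg] + eps).sum())
    return loss / A_train.size

# toy 2-drug x 3-disease association matrix and predicted probabilities
A = np.array([[1, 0, 0],
              [0, 1, 0]])
Y = np.array([[0.9, 0.1, 0.2],
              [0.3, 0.8, 0.1]])
loss = weighted_bce(Y, A)
```

As expected, predictions closer to the ground-truth labels yield a strictly lower loss.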

Mining self-supervision signals
Through the above section, we have constructed two view encoders over three views, each of which can deliver complementary semantics to the other. As a result, it makes sense to improve each encoder by using the data from the other view.
In this section, we illustrate how SMGCL enhances DDA prediction by mining valuable self-supervision signals, following a co-training architecture. Given a drug $r_i$ and a disease $d_j$ in the DDA view, we choose their positive and negative samples within the same mini-batch using the representations learned over the similarity view, where $\mathbf{score}_{r} \in \mathbb{R}^{M}$ denotes the predicted probability of each disease being treatable by drug $r$ in the similarity view.
A natural intuition is that we can select highly confident diseases via the calculated probabilities, i.e. the top-$K$ ranked diseases, to supervise the drug embedding in the similarity view as augmented ground truths. The positive sample selection is defined as:

$$\mathcal{S}^{d+}_{r_i} = \mathcal{P}^{K}_{d}\big(\mathbf{score}_{r_i}\big),$$

where $\mathcal{P}^{K}_{d}$ denotes picking the corresponding diseases $d$ according to the top-$K$ probability scores with the highest confidence.
When it comes to picking negative samples, a simple intuition is to choose the diseases with the lowest scores. Nevertheless, such samples contribute minimally to the representation update and cannot distinguish complex and difficult cases. Thus, $K$ negative samples are randomly chosen from the diseases ranked in the top 50% of $\mathbf{score}_{r_i}$, excluding the positives, to construct $\mathcal{S}^{d-}_{r_i}$. We argue that these diseases should be considered hard negatives, providing finer-grained information at the cost of a slight possibility of false negatives that may mislead learning. Finally, the informative samples for disease embeddings are selected in the same way to obtain $\mathcal{S}^{r+}_{d_i}$ and $\mathcal{S}^{r-}_{d_i}$. The positive and negative pseudo-labels for each drug and disease in the similarity view are regenerated for every training batch. Repeating this procedure is expected to produce more hard negative samples. Note that the encoders can evolve under the guidance of informative samples, recursively extracting more hard samples.
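The pseudo-label sampling described above (top-$K$ positives; $K$ hard negatives drawn at random from the top-50% scored candidates, excluding positives) can be sketched as:

```python
import numpy as np

def sample_pseudo_labels(score_r, K, rng):
    """Pick the top-K highest-confidence diseases as positives, and K hard
    negatives drawn at random from the top-50% scored diseases excluding
    the positives (a sketch of the sampling mechanism)."""
    order = np.argsort(score_r)[::-1]            # diseases by descending score
    positives = order[:K]
    half = order[: max(len(order) // 2, K + 1)]  # top-50% candidate pool
    pool = [d for d in half if d not in set(positives)]
    negatives = rng.choice(pool, size=min(K, len(pool)), replace=False)
    return set(positives.tolist()), set(negatives.tolist())

rng = np.random.default_rng(0)
scores = np.array([0.9, 0.8, 0.6, 0.5, 0.3, 0.2, 0.1, 0.05])
pos, neg = sample_pseudo_labels(scores, K=2, rng=rng)
```

Restricting the negative pool to highly ranked candidates is what makes the negatives "hard": they are close to the decision boundary rather than trivially dissimilar.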

Co-contrastive learning
With the generated pseudo-labels, the graph co-contrastive learning task for evolving the encoders can be performed through contrastive objectives. We use NT-Xent (You et al. 2020) as our objective function to maximize the mutual information between the two views. Formally, the training objective for drug $\mathbf{h}_{r_i}$ is:

$$\mathcal{L}^{r}_{cl} = -\log \frac{\sum_{d \in \mathcal{S}^{d+}_{r_i}} \exp\big(\mathrm{sim}(\mathbf{h}_{r_i}, \mathbf{q}_{d}) / \tau\big)}{\sum_{d \in \mathcal{S}^{d+}_{r_i} \cup\, \mathcal{S}^{d-}_{r_i}} \exp\big(\mathrm{sim}(\mathbf{h}_{r_i}, \mathbf{q}_{d}) / \tau\big)},$$

where $\tau$ denotes the temperature parameter and $\mathrm{sim}(u, v)$ is the cosine similarity. In the same way, the training objective $\mathcal{L}^{d}_{cl}$ for disease $\mathbf{h}_{d_i}$ is defined symmetrically. Finally, we unify the prediction task with the auxiliary SSL task. The total loss $\mathcal{L}$ is defined as:

$$\mathcal{L} = \mathcal{L}_{pred} + \lambda\big(\mathcal{L}^{r}_{cl} + \mathcal{L}^{d}_{cl}\big),$$

where $\lambda$ is a hyper-parameter controlling the scale of the graph co-training. The weights are initialized following Glorot and Bengio (2010), and the model is optimized with the Adam optimizer (Kingma and Ba 2015). We train the model in a denoising setup by randomly dropping edges with a fixed probability, which helps the model generalize to unseen data and avoids over-fitting. For the graph convolution layers, we also use regular dropout.
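An NT-Xent-style objective over sampled positives and hard negatives can be sketched for a single drug as follows. This is a minimal illustration under our assumptions (summing exponentiated similarities over the positive set in the numerator); the paper's exact summation form may differ.

```python
import numpy as np

def nt_xent(h_r, Q_d, pos_idx, neg_idx, tau=0.1):
    """NT-Xent objective for one drug: pull its DDA-view embedding h_r toward
    the similarity-view embeddings of pseudo-positive diseases, push it away
    from hard negatives. Temperature tau sharpens the similarity scores."""
    cos = lambda u, v: u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    pos = sum(np.exp(cos(h_r, Q_d[j]) / tau) for j in pos_idx)
    neg = sum(np.exp(cos(h_r, Q_d[j]) / tau) for j in neg_idx)
    return -np.log(pos / (pos + neg))

h_r = np.array([1.0, 0.0])
Q_d = np.array([[0.9, 0.1],    # pseudo-positive, similar to h_r
                [0.0, 1.0]])   # hard negative, dissimilar to h_r
loss = nt_xent(h_r, Q_d, pos_idx=[0], neg_idx=[1])
```

Swapping the roles of the positive and the negative sharply increases the loss, which is the behavior the contrastive objective relies on.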

Datasets
We evaluate our model on three benchmark datasets: "Fdataset" (Gottlieb et al. 2011), "Cdataset" (Luo et al. 2016), and "LRSSL" (Liang et al. 2017), which are often used in DDA prediction. The basic statistics of the three datasets are shown in Table 1. Sparse ratio is defined as the ratio of the number of known associations to the number of all possible associations. Details of these benchmarks are in the Supplementary Material.

Evaluation metrics and parameters settings
To assess SMGCL's overall performance, we adopt the Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision–Recall curve (AUPR) as primary metrics. It is meaningful to measure the ROC and PR characteristics while treating the unknowns as true negatives, since the actual associations are limited in comparison to the total number of unknowns. Details of each metric are in the Supplementary Material. Our proposed SMGCL model uses the Adam optimizer. The values of all hyper-parameters follow the practices of previous researchers and are finally determined by grid search: the learning rate is set to 0.001, the batch size to 64, the restart probability $\alpha = 0.1$, the temperature $\tau = 0.1$, and the scale-control hyper-parameter $\lambda = 0.1$. For the trade-off hyper-parameter $\beta$, SMGCL has different optima on different benchmark datasets: on Fdataset, $\beta = 0.6$; on Cdataset and LRSSL, $\beta = 0.8$. Besides, all methods are compared under the same evaluation settings. For the baseline models with publicly available code, we run the code with the best parameters reported in the original papers, and our results are consistent with the publications. For the baseline models without available code, we report the published results directly, since we use the same datasets.

Overall performance
Following Li et al. (2020) and Zhang et al. (2020a), we adopt 10-fold cross-validation (10-CV) to evaluate the performance of prediction methods. In particular, for each 10-CV repetition we calculate all evaluation metrics, and the final evaluation results are obtained by averaging over 10 repetitions. The prediction model is constructed on the known associations in the training set and is used to predict the associations in the remaining fold, which serves as the test set. Besides, we deploy a t-test under the AUROC and AUPR metrics. Table 2 reports the performance comparison results and statistical significance, in which SMGCL-NS means that the same number of unknown DDAs as known associations is sampled. We have the following observations: 1) On all three datasets, BNNR and DRIMC perform better than expected. This might be attributed to the smaller number of nodes in DDA data compared to e-commerce and social recommendation data, which allows BNNR and DRIMC to achieve promising AUROC. In addition, as an improved neural collaborative filtering framework, GRGMF introduces two graph regularization terms to deal with nodes without any known link information, which may greatly alleviate the influence of unbalanced data and yields suboptimal (second-best) performance on AUPR. However, GRGMF does not explicitly model connectivity during embedding learning, which easily leads to its poor performance on AUROC. 2) Compared with NIMCGCN and LAGCN, the performance of DRWBNCF verifies that modeling neighbor interactions can improve representation learning.
MVGCN is the only model that uses contrastive learning apart from the proposed SMGCL. The difference with SMGCL is that MVGCN uses contrastive learning to obtain the initial representation of nodes, while SMGCL optimizes the contrastive objective and prediction task jointly. MVGCN obtains optimal performance on AUPR, which validates that contrastive learning can mitigate the impact of data imbalance. Surprisingly, in some cases, the performance of NIMCGCN, LAGCN, and MVGCN is worse than that of BNNR and DRIMC. The reason might be that NIMCGCN ignores the interaction of nodes in heterogeneous networks, and LAGCN indiscriminately mixes the network topology information of different domains (i.e. drug and disease domains), and MVGCN does not select the nearest neighbor of each node to construct the similarity view, which introduces a lot of noise information.

3) SMGCL achieves the best AUROC on Fdataset and Cdataset, and highly competitive AUROC on LRSSL. Compared with GRGMF and MVGCN, the average AUROC of SMGCL increases by 15.54% and 9.55%, respectively. Moreover, in the context of imbalanced classification, AUPR is also an indispensable evaluation metric: compared with BNNR and DRWBNCF, the average AUPR of SMGCL increases by 27.96% and 12.98%, respectively. To clarify the advantages of SMGCL, a more detailed comparison between SMGCL and MVGCN is provided in Supplementary Section S2.4. The benchmarking results show that SMGCL improves the comprehensive prediction performance because the known DDA information is co-trained with the neighborhood and neighbor-interaction information of drugs and diseases under the contrastive learning framework.

Model ablation
To evaluate the rationality of design sub-modules in our SMGCL framework, we consider three model variants as follows: 1) SMGCL without DDA view encoder (w/o-DE): We only use the similarity views to model drugs and diseases, removing the graph co-contrastive learning. 2) SMGCL without similarity view encoder (w/o-AE): We only use the DDA view to model drugs and diseases, removing the similarity views, interaction-aware similarity views, and the graph co-contrastive learning. 3) SMGCL without co-contrastive learning task (w/o-CL): We remove the graph co-contrastive learning task and only use simple summing of drug/disease embeddings on two views to get the final embedding.
As can be observed in Fig. 2, each component contributes to the final performance, with the DDA view encoder contributing the most. When only the DDA view encoder is used, the model achieves suboptimal (second-best) performance, which is much higher than that of SMGCL without the co-contrastive learning task on all three datasets; this demonstrates the effectiveness of modeling the interactions between neighboring nodes. By contrast, using only the similarity view encoder leads to a large performance degradation on the three datasets. Notably, removing the co-contrastive learning task and simply summing the drug/disease embeddings from the two views does not even reach suboptimal performance, which indicates that contrastive learning can automatically mine labels and thereby maximize the agreement between nodes across views. From this ablation study, we conclude that a successful DDA prediction model should consider not only the interactions between drugs and diseases, but also the relationships among drugs and among diseases.

Case study: approved drugs for Alzheimer's disease determined by calculation
We conduct a case study for the neurodegenerative disease Alzheimer's disease (AD), for which there are currently no effective treatments, in order to further evaluate the predictive capability of SMGCL. All of the known DDAs in the Fdataset are used as the training set and the unknown DDAs are used as the candidate set when trying to find possible AD drugs. Once the SMGCL predicts the probability of interaction of a given disease with all drug candidates, we rank the candidates according to that predicted probability, so that the topranked drug is the most likely to treat the disease.
We focus on the top 15 potential candidates for AD predicted by SMGCL in Table 3. For each drug, we show the DrugBank ID, the canonical name, and literature-reported evidence that corroborates the predicted DDA. We then select three drugs from Table 3 to describe in detail. Amantadine has antiviral, anti-Parkinsonian, and analgesic activities; it exerts its anti-Parkinsonian actions by promoting dopamine release from striatal dopaminergic nerve terminals and preventing its pre-synaptic reuptake. Furthermore, Erkulwater and Pillai (1989) showed that the mental status of two AD patients clearly improved after treatment with amantadine. Haloperidol is a highly effective first-generation antipsychotic drug and one of the most commonly used antipsychotics in clinical practice today. Devanand et al. (1998) conducted an experiment on the efficacy and side effects of haloperidol versus placebo in the treatment of psychosis and disruptive behavior in patients with AD; the results showed that haloperidol at a dose of 2-3 mg/day had a good therapeutic effect. Carbidopa is the levorotatory isomer of a synthetic hydrazine derivative of the neurotransmitter dopamine. Meyer et al. (1977) performed serial clinical assessments and neuropsychological measures of functioning in 10 patients with severe dementia consisting of AD, multi-infarct dementia (MID), or both, who took carbidopa; one patient with AD + MID demonstrated clinical and psychological improvement.
Overall, a variety of evidence from clinical trials and other literature data have validated 14 of the top 15 predicted drugs (93% success rate), ordered by confidence scores.

Conclusion
In this study, we investigate the potential of GCL to address the shortcomings of traditional DDA prediction. In particular, an end-to-end SMGCL model is proposed to tap candidate drugs for diseases. Specifically, we learn the representations of drugs and diseases on three relevant views and then introduce a co-contrastive learning strategy that samples positives and mines hard negatives to generate accurate node representations. Finally, experiments on three benchmark datasets justify the advantages of our proposal for DDA prediction, and the reliability of the newly discovered DDAs is supported by a case study.
Since the task of DDA prediction is closely related to biological safety and human health, it is crucial to design a reasonable negative sampling strategy for constructing a robust DDA prediction model. In future work, we will develop a proper negative sampling strategy for the DDA prediction task and analyze the performance improvement it brings to different SOTA models.

Supplementary data
Supplementary data are available at Bioinformatics online.