Hi-GeoMVP: a hierarchical geometry-enhanced deep learning model for drug response prediction

Abstract

Motivation: Personalized cancer treatment requires accurate drug response prediction. Existing deep learning methods show promise, but higher accuracy is needed to serve the purpose of precision medicine. Prediction accuracy can be improved by exploiting not only the topological but also the geometric information of drugs.

Results: A novel deep learning methodology for drug response prediction is presented, named Hi-GeoMVP. It synthesizes hierarchical drug representations with multi-omics data, leveraging graph neural networks and variational autoencoders for detailed drug and cell line representations. Multi-task learning is employed to improve prediction, while both 2D and 3D molecular representations capture comprehensive drug information. Testing on the GDSC dataset confirms Hi-GeoMVP's enhanced performance, surpassing prior state-of-the-art methods by improving the Pearson correlation coefficient from 0.934 to 0.941 and decreasing the root mean square error from 0.969 to 0.931. In blind tests, Hi-GeoMVP demonstrated robustness, outperforming the best previous models with a superior Pearson correlation coefficient in the drug-blind test. These results underscore Hi-GeoMVP's capabilities in drug response prediction, implying its potential for precision medicine.

Availability and implementation: The source code is available at https://github.com/matcyr/Hi-GeoMVP


Deep learning backbone
We used graph isomorphism networks [1], graph attention networks [2], and the variational autoencoder [3] as the backbones to learn the drug and cell line representations.

Graph Isomorphism Network
Graph isomorphism network (GIN) is a message-passing graph neural network [4]. Each node representation is recursively updated from the messages of its neighboring nodes at each layer. The $k$-th GIN layer, incorporating bond information, is:

$$h_u^{(k)} = \mathrm{MLP}_k\Big((1+\epsilon^{(k)})\, h_u^{(k-1)} + \sum_{v \in \mathcal{N}(u)} \big(h_v^{(k-1)} + d_{bond\text{-}uv}\big)\Big),$$

where $\mathrm{MLP}_k$ is the multilayer perceptron at the $k$-th layer that maps the aggregated messages for $u$ into the drug latent space. Here, $h_u^{(k)}$ denotes the representation of atom $u$ at the $k$-th layer, $d_{bond\text{-}uv}$ is the initial bond attribute vector for bond $uv$, $\mathcal{N}(u)$ is the set of neighboring atoms of atom $u$, and $h_u^{(0)} = d_{atom\text{-}u}$.
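The layer above can be sketched in plain NumPy (a minimal sketch, not the paper's implementation; `mlp`, `gin_layer`, and all parameter shapes are hypothetical stand-ins for the learned networks):

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    """Two-layer perceptron with ReLU, standing in for MLP_k."""
    return np.maximum(x @ W1 + b1, 0) @ W2 + b2

def gin_layer(h, bond_feat, neighbors, eps, params):
    """One GIN layer with bond information.

    h:         (n, d) atom representations h_u^(k-1)
    bond_feat: dict mapping (u, v) -> d-dim bond attribute d_bond-uv
    neighbors: dict mapping u -> list of neighboring atoms N(u)
    eps:       the (learnable) GIN scalar epsilon
    """
    h_new = np.zeros_like(h)
    for u in range(h.shape[0]):
        # Aggregate neighbor messages, injecting the bond attributes.
        agg = sum(h[v] + bond_feat[(u, v)] for v in neighbors[u])
        h_new[u] = mlp((1 + eps) * h[u] + agg, *params)
    return h_new
```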

Geometry-enhanced Graph Neural Network
We adopt GeoGNN [5] to incorporate the drug's atom-bond-angle geometric relations. GeoGNN maintains representations for both atoms and bonds: after initialization, the bond representation $h_{uv}$ at the $k$-th layer is updated by aggregating messages from its adjacent bonds together with their bond-angle attributes, and the atom representation $h_u$ is then reformulated by aggregating the updated representations of its incident bonds.
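The two-level message passing can be caricatured as follows (a structural sketch only, using dictionary-based graphs and plain sums; the actual GeoGNN update functions in [5] are learned networks):

```python
import numpy as np

def geo_gnn_layer(h_atom, h_bond, angle_feat, a2a, b2b):
    """One sketched atom-bond-angle message-passing step.

    h_atom:     dict u -> atom vector h_u
    h_bond:     dict (u, v) -> bond vector h_uv (directed keys)
    angle_feat: dict (bond, bond) -> attribute of the angle they form
    a2a:        dict u -> neighboring atoms of u
    b2b:        dict bond -> bonds sharing an atom with it
    """
    # Bond-angle graph: update each bond from adjacent bonds and angles.
    new_bond = {}
    for b, nbrs in b2b.items():
        new_bond[b] = h_bond[b] + sum(h_bond[nb] + angle_feat[(b, nb)]
                                      for nb in nbrs)
    # Atom-bond graph: reformulate each atom from neighbors and bonds.
    new_atom = {}
    for u, nbrs in a2a.items():
        new_atom[u] = h_atom[u] + sum(h_atom[v] + new_bond[(u, v)]
                                      for v in nbrs)
    return new_atom, new_bond
```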

Graph Attention Network
Graph attention networks apply both the attention mechanism and graph neural networks to learn graph representations. The updating process is as follows:
• Compute the attention coefficients. The raw attention coefficient is given by:
$$e_{ij} = \mathrm{LeakyReLU}\big(a^{\top}[W h_i \,\Vert\, W h_j]\big).$$
The coefficients are then normalized using the softmax function so they can be considered as probabilities, allowing the model to focus on important nodes:
$$\alpha_{ij} = \mathrm{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}.$$
• Update the node features. The output features of the GAT are computed as a weighted sum of the transformed features of the neighboring nodes, where the weights are the attention coefficients:
$$h_i' = \sigma\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij} W h_j\Big).$$
Here, $h_i$ is the feature vector of node $i$, $W$ is a weight matrix, $[W h_i \Vert W h_j]$ denotes the concatenation of the transformed feature vectors of nodes $i$ and $j$, $a$ is a weight vector, LeakyReLU is the activation function, $\mathcal{N}(i)$ is the set of neighbor nodes of node $i$, and $\alpha_{ij}$ is the attention coefficient between nodes $i$ and $j$.
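The three steps can be sketched in NumPy (a minimal single-head sketch without the final nonlinearity; the LeakyReLU slope of 0.2 and all parameter shapes are illustrative):

```python
import numpy as np

def gat_layer(h, W, a, neighbors):
    """One GAT layer: attention coefficients, then a weighted update.

    h: (n, d) node features; W: (d, d') weight matrix;
    a: (2*d',) attention weight vector; neighbors: i -> N(i).
    """
    Wh = h @ W
    out = np.zeros((h.shape[0], Wh.shape[1]))
    for i in range(h.shape[0]):
        nbrs = neighbors[i]
        # e_ij = LeakyReLU(a^T [Wh_i || Wh_j])
        e = np.array([np.concatenate([Wh[i], Wh[j]]) @ a for j in nbrs])
        e = np.where(e > 0, e, 0.2 * e)                     # LeakyReLU
        alpha = np.exp(e - e.max()); alpha /= alpha.sum()   # softmax over N(i)
        out[i] = sum(al * Wh[j] for al, j in zip(alpha, nbrs))
    return out
```

With identity weights and a uniform attention vector, every neighbor receives equal weight, so each output row is a simple average of its neighbors' features.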

Variational Autoencoder
We use a variational autoencoder (VAE) to learn an embedding of the complete gene expression profile, rather than only cancer-related gene expression features. The VAE is a generative model that learns a distribution, rather than distinct points, in the latent space. It is composed of two main components: an encoder and a decoder. In the VAE model, the whole-genome (WG) gene expression data for a given cell line $i$, denoted $c^{wg}_i$, is generated from a latent variable $z_i$, which follows a prior distribution $p_\theta(z_i)$, through a conditional distribution (decoder) $p_\theta(c^{wg}_i \mid z_i)$; the encoder $q_\phi(z_i \mid c^{wg}_i)$ maps $c^{wg}_i$ to the latent space. We use a 4-layer MLP for both the encoder and the decoder, and denote the reconstructed gene expression profiles by $\hat{c}^{wg}$. The VAE updates the encoder and decoder jointly by maximizing the evidence lower bound (ELBO):
$$\mathrm{ELBO} = \mathbb{E}_{q_\phi(z \mid c^{wg})}\big[\log p_\theta(c^{wg} \mid z)\big] - D_{KL}\big(q_\phi(z \mid c^{wg}) \,\Vert\, p_\theta(z)\big),$$
where $D_{KL}$ is the Kullback-Leibler (KL) divergence between two probability distributions, and the first term is the reconstruction loss, measured by the negative mean squared error: $-\mathrm{MSE}(c^{wg}, \hat{c}^{wg})$.
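With the reconstruction term measured by negative MSE, the ELBO has a direct NumPy form; a minimal sketch, assuming a diagonal-Gaussian encoder and a standard-normal prior (so the KL term takes its standard closed form), with all names hypothetical:

```python
import numpy as np

def elbo(x, x_hat, mu, log_var):
    """Negative-MSE reconstruction term minus KL(q || N(0, I)).

    mu, log_var parameterize the Gaussian encoder q(z | x); the KL term
    is the closed form for a diagonal Gaussian vs. a standard normal.
    """
    recon = -np.mean((x - x_hat) ** 2)               # -MSE(c_wg, c_wg_hat)
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return recon - kl
```

A perfect reconstruction with the encoder matching the prior gives ELBO = 0, its maximum under this parameterization.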

Model training details
We built the model using the deep learning libraries PyTorch and PyTorch Geometric. We applied early stopping on the validation set: the model was validated every five epochs, and training was halted if there was no improvement in validation error for eight consecutive validations. We performed a hyperparameter search on the validation set to select the best hyperparameter set; the hyperparameters are listed in Supplementary Table 1. The conventional machine learning methods were trained in MATLAB (SRMF), R (DualNets), and Python (all other methods).
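The early-stopping schedule described above can be sketched as follows (a hypothetical skeleton; `train_one_epoch` and `validate` stand in for the actual training and validation routines):

```python
def train_with_early_stopping(train_one_epoch, validate, max_epochs=500,
                              val_every=5, patience=8):
    """Validate every `val_every` epochs; halt when validation error
    fails to improve for `patience` consecutive validations."""
    best, stale = float("inf"), 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(epoch)
        if epoch % val_every:
            continue
        err = validate()
        if err < best:
            best, stale = err, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best
```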

Assessment methods
We assessed various deep learning models on a shared dataset, GDSC V1, adhering to a common testing methodology. The drug response dataset was divided into two segments in a 2:8 ratio. The larger portion was used for training and validating the neural network model, while the smaller portion was used to measure the model's accuracy under the evaluation metrics. Training and validating a model required a further division of the larger portion into five equally sized subsets. In each iteration, four of these subsets were combined to form the training set, leaving the fifth for validation. This cycle was repeated five times, each time with a different subset serving as the validation set.
Each experiment concluded with the calculation of accuracy on the test set. The average accuracy over the five trials was recorded and reported as the final result.
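The 2:8 split plus five-fold rotation can be sketched as follows (a hypothetical helper, assuming a simple random permutation for the hold-out split):

```python
import numpy as np

def five_fold_splits(n_samples, test_frac=0.2, seed=0):
    """Hold out `test_frac` for testing, then rotate 5 train/validation
    folds over the remaining 80% of samples."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_test = int(round(test_frac * n_samples))
    test, dev = idx[:n_test], idx[n_test:]
    folds = np.array_split(dev, 5)
    for k in range(5):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        yield train, val, test
```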
We evaluated the methods using four metrics: root mean square error (RMSE), mean absolute error (MAE), Pearson correlation coefficient (PCC), and the $R^2$ statistic.
The root mean square error (RMSE) and mean absolute error (MAE) capture the deviation between the true and predicted responses of a method and are defined as:
$$\mathrm{RMSE}(y, \hat{y}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad \mathrm{MAE}(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert.$$
The Pearson correlation coefficient (PCC) of $y$ and $\hat{y}$ is defined as:
$$\mathrm{PCC}(y, \hat{y}) = \frac{(y - \mu(y))^{\top}(\hat{y} - \mu(\hat{y}))}{\lVert y - \mu(y)\rVert_2 \,\lVert \hat{y} - \mu(\hat{y})\rVert_2},$$
where $\mu(\cdot)$ denotes the mean of the components of a vector. The R-squared statistic ($R^2$) is defined as:
$$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \mu(y))^2}.$$
Smaller values of RMSE and MAE indicate better performance, meaning the predicted responses are closer to the true values. Conversely, values close to one for the PCC and $R^2$ metrics indicate good performance.
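The four metrics have direct NumPy implementations; a minimal sketch:

```python
import numpy as np

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def mae(y, y_hat):
    return float(np.mean(np.abs(y - y_hat)))

def pcc(y, y_hat):
    yc, pc = y - y.mean(), y_hat - y_hat.mean()
    return float((yc @ pc) / np.sqrt((yc @ yc) * (pc @ pc)))

def r2(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```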

Hard split for blind test
In drug response prediction, models are developed based on the observation that drugs with similar chemical characteristics typically exhibit similar responses [6]. Similarly, the responses of two cell lines to a particular drug tend to be alike if their genomic profiles are comparable. To rigorously assess model performance in blind tests, we implemented a stricter data-splitting strategy. For the drug blind test, we divided the drugs by their scaffold structures [7], which provides a more realistic estimate of model performance in prospective evaluations than random splitting [8]. For the cell line blind test, we employed maximum dissimilarity sampling, using cosine similarity of gene expression to select the training, validation, and test cell lines.
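The cell line selection can be sketched as a greedy farthest-point ordering under cosine similarity (a hypothetical helper; the ordering's head and tail can then be assigned to disjoint splits, and the actual selection procedure may differ):

```python
import numpy as np

def max_dissimilarity_order(X, start=0):
    """Greedily order profiles so each new pick is, on average, the
    least cosine-similar to everything already selected."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T                      # pairwise cosine similarity
    chosen = [start]
    remaining = set(range(len(X))) - {start}
    while remaining:
        rem = sorted(remaining)
        avg = sim[np.ix_(rem, chosen)].mean(axis=1)
        nxt = rem[int(np.argmin(avg))]   # most dissimilar candidate
        chosen.append(nxt)
        remaining.discard(nxt)
    return chosen
```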

Datasets and data preprocessing

Cell line profiles
We adopted the gene expression, CNV, and mutation profiles from the GDSC dataset as the cell line features for this study. We used the same preprocessed genomic profiles as in [9], where genes with low variation in expression are removed, and only cancer-related gene sets according to COSMIC are kept for the mutation and CNV data. We further filtered the gene expression by the COSMIC cancer-related gene sets to build the gene-gene interaction graph. For the multi-omics data, we obtained $c^{ge\text{-}whole} \in \mathbb{R}^{734\times8046}$, $c^{ge} \in \mathbb{R}^{734\times414}$, $c^{mut} \in \{0,1\}^{734\times636}$, and $c^{cnv} \in \{0,1\}^{734\times696}$. All the processed data can be downloaded from https://github.com/Jinyu2019/Suppl-data-BBpaper.
We constructed multiple cell line graphs for the different genomic data. In contrast to the natural graph structure of molecular drugs, cell lines have no explicit relational structure among genes. We used the Gold Standard Positives (GSPs) from HumanNet [10] to derive the gene interactions $G^{ge}_{GSP} \in \mathbb{R}^{414\times414}$ in our study. GSPs are experimentally validated gene pairs, making the resulting interactions more robust and reliable than those derived from PPIs. Additionally, the smaller number of interactions guarantees that the cell line graph is sparse without the need to set any threshold on the interaction score.
Furthermore, existing research on GNNs demonstrates that, beyond the known topology, the similarity between initial node attributes also has a significant impact [11], suggesting that gene interactions can also be considered in a higher-level latent space. We created $G^{ge}_{sim} \in \{0,1\}^{414\times414}$ by defining gene-gene interactions from $c^{ge}$ using a Pearson correlation coefficient (PCC) threshold of 0.5 and a corresponding $P$-value threshold of 0.05. $G^{mut} \in \{0,1\}^{636\times636}$ and $G^{cnv} \in \{0,1\}^{694\times694}$ were created from $c^{mut}$ and $c^{cnv}$ by calculating cosine similarity scores, with the threshold set at the top 1% of similarity scores. We use only similarity-based graphs for mutation and CNV, since the genes in these two omics sets that match the GSPs are rather limited, which could filter out important cancer-related genes.
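The PCC-thresholded similarity graph can be sketched as follows (`gene_similarity_graph` is a hypothetical helper; the $P$-value filter is omitted here for brevity):

```python
import numpy as np

def gene_similarity_graph(expr, pcc_thresh=0.5):
    """Binary gene-gene adjacency from expression correlation.

    expr: (cell_lines, genes) matrix; an edge links two genes whose
    expression PCC across cell lines meets the threshold in magnitude.
    """
    corr = np.corrcoef(expr, rowvar=False)       # (genes, genes) PCC
    adj = (np.abs(corr) >= pcc_thresh).astype(int)
    np.fill_diagonal(adj, 0)                     # no self-loops
    return adj
```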

Drug profiles
We built the 2-D chemical graph for a drug from the drug's SMILES representation using the Python package RDKit [12]. Suppose the $j$-th drug contains $n_j$ atoms and $m_j$ bonds. We denote the atom feature matrix, with $k$-dimensional features, as $d^{atom}_j \in \mathbb{R}^{n_j \times k}$, and the bond feature matrix, with $l$-dimensional features, as $d^{bond}_j \in \mathbb{R}^{m_j \times l}$. How the atoms are connected is given by the adjacency matrix $G^{d\text{-}atom}_j \in \{0,1\}^{n_j \times n_j}$. Beyond the 2-D atom-atom connections (a2a), we introduce bond-bond (b2b) connections by considering the angle between two bonds [5]. We used the Merck Molecular Force Field (MMFF) in RDKit to generate multiple 3D conformers for a given drug and optimized their geometries to obtain the atom coordinates. The coordinates of each atom in the spatial domain are then estimated and used to build the bond-bond-connection-based graph $G^{d\text{-}bond}_j$.
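The a2a and b2b adjacency construction from a bond list can be sketched as follows (a hypothetical helper operating on undirected bonds; in practice RDKit supplies the bond list, and the angle attributes come from the MMFF-optimized 3D coordinates):

```python
import numpy as np

def drug_graphs(n_atoms, bonds):
    """Atom-atom (a2a) and bond-bond (b2b) adjacency from a bond list.

    bonds: list of (u, v) atom pairs; two bonds are b2b-connected when
    they share an atom, i.e. they form an angle in the 3D conformer.
    """
    a2a = np.zeros((n_atoms, n_atoms), dtype=int)
    for u, v in bonds:
        a2a[u, v] = a2a[v, u] = 1
    m = len(bonds)
    b2b = np.zeros((m, m), dtype=int)
    for i in range(m):
        for j in range(i + 1, m):
            if set(bonds[i]) & set(bonds[j]):   # shared atom -> angle
                b2b[i, j] = b2b[j, i] = 1
    return a2a, b2b
```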

Table S4: Regression performances of machine learning models for mix tests.
Method | PCC ↑ | R² ↑ | RMSE ↓ | MAE ↓

Table S5: Regression performances of machine learning models for cell blind tests.