DeepAlloDriver: a deep learning-based strategy to predict cancer driver mutations

Abstract Driver mutations can contribute to the initial processes of cancer, and their identification is crucial for understanding tumorigenesis as well as for molecular drug discovery and development. Allostery regulates protein function away from the functional regions at an allosteric site. In addition to the known effects of mutations around functional sites, mutations at allosteric sites have been associated with protein structure, dynamics, and energy communication. As a result, identifying driver mutations at allosteric sites will be beneficial for deciphering the mechanisms of cancer and developing allosteric drugs. In this study, we provided a platform called DeepAlloDriver to predict driver mutations using a deep learning method that exhibited >93% accuracy and precision. Using this server, we found that a missense mutation in RRAS2 (Gln72 to Leu) might serve as an allosteric driver of tumorigenesis, revealing the mechanism of the mutation in knock-in mice and cancer patients. Overall, DeepAlloDriver would facilitate the elucidation of the mechanisms underlying cancer progression and help prioritize cancer therapeutic targets. The web server is freely available at: https://mdl.shsmu.edu.cn/DeepAlloDriver.


INTRODUCTION
Cancer is a genetic disease caused by mutations that directly or indirectly provide cancer cells with selective advantages for proliferation and invasion (1). The vast majority of mutations occurring in cancer cells are passengers, irrespective of the cancer phenotype and biological effect, while a small number of driver mutations confer cell growth and survival (2–4). A significant challenge in cancer therapy is distinguishing driver mutations from passenger mutations. With the development of whole genome and/or exome sequencing, the generation of resourceful databases, and machine/deep learning methods, several approaches, tools, and platforms have been proposed to identify driver mutations (5–14). However, a vital strategy for identifying driver mutations in the protein-coding regions is lacking.
Mutations in protein-coding regions are typically associated with altered biological functions. A small fraction of the mutations that occur at the catalytic sites has been extensively studied for their loss-of-function or gain-of-function effects, which are attributed to the disruption of the interaction between the natural ligand and protein (15–17). However, little attention has been paid to remote mutations that can also perturb protein function through allosteric regulation (15,16). Allosteric regulation plays a vital role in all cellular processes, including signal transduction, enzymatic catalysis, cellular metabolism, and gene regulation. From a structural perspective, allosteric mutations can induce a population shift and, therefore, destabilize or stabilize active or inactive states. Kurochkin et al. (17) demonstrated that some mutations in function-relevant distal sites resulted in an increase in the catalytic activity of the insulin-degrading enzyme by stabilizing its active state. Shen et al. (18) reported that deleterious mutations identified in cancer genomes were more significantly enriched at protein allosteric sites than tolerated mutations. Tee et al. (19) suggested that SNPs may function allosterically and that mutations at critical positions in the protein sequence can allosterically disrupt protein function. Additionally, targeting mutants at allosteric sites offers various advantages for drug discovery, including enhanced drug selectivity and efficacy compared to those at buried active/orthosteric sites. As a result, identifying allosteric driver mutations is critical for understanding tumorigenesis and for tumor-specific target discovery (20).
Based on previous studies (12,18), we proposed a deep learning method called DeepAlloDriver and offered a platform to identify allosteric driver mutations and annotate driver mutations, genes, and proteins. In the benchmarking test dataset, DeepAlloDriver detected driver mutations with >93% accuracy and precision. Furthermore, we employed DeepAlloDriver to identify an allosteric driver mutation in the Ras-related protein R-Ras2 (RRAS2) (Gln72 to Leu, RRAS2 Q72L), which is supported by a knock-in model (21). Overall, DeepAlloDriver offers a new perspective for allosteric drug target design in addition to revealing the mechanisms of cancer.

Workflow of DeepAlloDriver
DeepAlloDriver was developed as a platform to identify driver mutations at allosteric and potential allosteric sites. The web server is free for all users without a login.
The workflow of DeepAlloDriver is shown in Figure 1. First, users submit cancer samples, including gene symbols and amino acid substitutions. Substitutions are then extracted from the samples and mapped to allosteric or potential allosteric sites derived from the RCSB Protein Data Bank (PDB) (22) and Allosteric Database (ASD) (23). Subsequently, the substitutions located at the mapped sites are evaluated for their potential driver effects using the server. In addition, the server displays the prediction score of each substitution, the location of the predicted substitution on the protein structure, the biological pathways involving the corresponding gene or protein, and the modulators targeting the protein. Additional details of the workflow are provided in Supplementary Figure S1.
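The mapping step can be illustrated with a minimal sketch. This is not the server's implementation: the lookup table, the function names `map_to_sites` and `filter_mapped`, and the representation of a site as a set of residue positions are all assumptions made for illustration. The two example sites use the residue positions reported for NTRK1 and RRAS2 in the case studies of this paper.

```python
# Hypothetical lookup table: gene symbol -> list of allosteric sites,
# each site given as a set of residue positions. The two entries reuse
# the site residues reported in the NTRK1 and RRAS2 case studies; a real
# table would be derived from PDB and ASD.
ALLOSTERIC_SITES = {
    "NTRK1": [frozenset({507, 511, 512, 513, 514, 523, 525, 526})],
    "RRAS2": [frozenset({72, 73, 74, 75, 81, 83, 84, 110, 112})],
}


def map_to_sites(gene, position, sites=ALLOSTERIC_SITES):
    """Return the indices of the sites of `gene` that contain `position`."""
    return [i for i, site in enumerate(sites.get(gene, [])) if position in site]


def filter_mapped(substitutions, sites=ALLOSTERIC_SITES):
    """Keep only (gene, position) pairs that fall inside a known site."""
    return [(g, p) for g, p in substitutions if map_to_sites(g, p, sites)]


if __name__ == "__main__":
    candidates = [("NTRK1", 507), ("RRAS2", 72), ("RRAS2", 10)]
    print(filter_mapped(candidates))
```

Only substitutions that survive this filter would be passed on for scoring; substitutions in genes without annotated sites are simply dropped in this sketch.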

Input
To offer greater flexibility, the web server accepts three input formats. In the 'Input' text area, users can paste a valid list of amino acid substitutions (each item is a substitution written as follows: Simpleid;AR;G244D). Alternatively, users can upload cancer sample(s) in Mutation Annotation Format, which is a tab-delimited file containing somatic and/or germline mutation annotations. Moreover, uploading a tab-delimited file generated by ANNOVAR is permitted (24).
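As an illustration of the pasted-list format, the following sketch parses items of the form Simpleid;AR;G244D into their components. The field interpretation (sample identifier, gene symbol, substitution), the restriction to the 20 standard one-letter amino acid codes, and all names in the code are assumptions; the server's own validation rules may differ.

```python
import re
from typing import List, NamedTuple


class Substitution(NamedTuple):
    sample_id: str
    gene: str
    ref: str        # wild-type residue, one-letter code
    position: int   # residue number
    alt: str        # mutant residue, one-letter code


# Assumed pattern: one-letter residue, residue number, one-letter residue.
_AA = "ACDEFGHIKLMNPQRSTVWY"
_CHANGE = re.compile(rf"^([{_AA}])(\d+)([{_AA}])$")


def parse_item(line: str) -> Substitution:
    """Parse one 'sample;gene;substitution' item, e.g. 'Simpleid;AR;G244D'."""
    parts = [p.strip() for p in line.split(";")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 ';'-separated fields, got {len(parts)}: {line!r}")
    sample_id, gene, change = parts
    match = _CHANGE.match(change)
    if match is None:
        raise ValueError(f"unrecognized substitution: {change!r}")
    ref, pos, alt = match.groups()
    return Substitution(sample_id, gene, ref, int(pos), alt)


def parse_list(text: str) -> List[Substitution]:
    """Parse a pasted list, one item per line, skipping blank lines."""
    return [parse_item(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    print(parse_item("Simpleid;AR;G244D"))
```

Checking a list locally in this way can catch malformed items before a job is submitted.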

Nucleic Acids Research, 2023, Vol. 51, Web Server issue W131
A 'Job Name' is required; it helps users find their 'Job Serial' in the 'Job Queue'. DeepAlloDriver currently comprises 1949 proteins and 10 081 allosteric sites from ASD (http://mdl.shsmu.edu.cn/ASD) for the prediction of allosteric driver mutations. Nonspecialist users are advised to submit the example tasks to become familiar with the server before uploading their own samples. Example 1, which contains six substitution items, took 30 seconds to run; users can submit samples with <3600 mutation items.

Output
The output page consists of three parts. First, the 'Job Queue' bar lists the basic information about a job. Second, the middle 'Job Progress' section records the progress of calculations. Third, the 'Target Result' section outputs the prediction result table once a job is finished. Mutations in the submitted sample are ranked according to the 'Predicted Score' in the 'Target Result' table. The first six columns give the protein and mutation information: 'Sample ID', 'Gene/Protein', 'Uniprot ID', 'Mutation', 'PDB ID' and 'Predicted Score'. Furthermore, clicking the 'Show' button in each 'Entry' directs users to more detailed information about the mutation.
A detailed page opens when users click the 'Show' button in the 'Target Result' table, displaying the annotation for the mutation and its protein; this page contains 'Target information', 'Pathway in the Reactome' and 'Known drugs on the target'. At the top of the detailed page, an interactive 3D representation powered by the JSMol plugin shows the predicted substitution in the protein structure. The panel next to it lists useful information such as 'Gene Symbol', 'NCBI Gene ID' with a link to the NCBI Gene database, 'Function', 'PDB ID', 'Mutation', and 'Predicted Score'. The middle of the page shows the 'Pathway in Reactome' table, which summarizes the biological pathways in which the potential driver mutations were annotated from the Reactome pathway database (25). In particular, clicking the 'reactome' button at the end of each entry displays a 2D diagram of the signaling pathway. At the bottom, users can examine known modulators of the potential driver protein in the 'Known drugs on the target' table, which describes the compound name, the ID in DrugBank (26) or ChEMBL (27), the 2D structure, the molecular weight, and the indication of the compound. A detailed explanation of the pages is provided on the 'Help' page for user convenience.

PERFORMANCE
We collected three datasets: mutations from public databases (e.g. TCGA, ICGC and COSMIC), 17 181 known driver mutations from various resources (e.g. CGI and IntOGen), and allosteric and potential allosteric sites from the PDB and ASD databases (detailed in the 'Dataset Collections' section and Supplementary Tables S1 and S2 in Supplementary Information). Of the 17 181 mutations located in allosteric sites, 8565 were defined as positive allosteric driver mutations, and the same number of passenger mutations at allosteric sites were obtained from these resources as negative allosteric passenger mutations. The resulting 17 130 allosteric driver and passenger mutations were randomly split into training, validation, and test sets at a ratio of 8:1:1 (13 704 for the training set, 1713 for the validation set, and 1713 for the test set). In the training and test sets, 149 proteins and 1373 allosteric sites in these proteins were used according to the driver mutations (Supplementary Tables S3 and S4).
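The set sizes follow directly from the 8:1:1 ratio applied to 2 × 8565 = 17 130 mutations. The sketch below reproduces that arithmetic with a seeded random split; the function name, the seed, and the use of integer indices in place of real mutation records are assumptions, and the actual splitting procedure used for DeepAlloDriver is not specified beyond being random.

```python
import random


def split_8_1_1(items, seed=0):
    """Randomly split items into training, validation, and test sets at 8:1:1."""
    items = list(items)
    random.Random(seed).shuffle(items)   # seeded for reproducibility
    n = len(items)
    n_train = n * 8 // 10
    n_val = n // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]       # remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    # 8565 drivers plus 8565 passengers, represented here by indices only.
    n_total = 8565 + 8565
    train, val, test = split_8_1_1(range(n_total))
    print(len(train), len(val), len(test))  # 13704 1713 1713
```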
The classification power of DeepAlloDriver was benchmarked on a test set of 1713 mutants with hyperparameters optimized by 5-fold cross-validation (Supplementary Tables S5 and S6). The DeepAlloDriver score distributions of the allosteric driver and passenger mutations were built using an equivariant multi-head attention weighted graph neural network (EGNN), and a threshold of 0.5 best separated the score distributions of the positives and negatives (Supplementary Figure S2). We therefore used 0.5 as the threshold to compute the performance metrics, and mutations with a predicted probability >0.5 were assigned as drivers. Our model detected allosteric driver mutations with 94.1% accuracy, 93.8% precision, 94.3% recall, 93.9% specificity and 94.1% F1 score (Supplementary Tables S5 and S6). Furthermore, we analyzed the classification power of our model via the receiver operating characteristic (ROC) curve (Supplementary Figure S2). The area under the ROC curve (AUC) was 0.975, indicating the strong predictive power of DeepAlloDriver.
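For readers who wish to reproduce this kind of evaluation on their own predictions, the sketch below computes the five reported metrics from predicted scores and binary labels at a threshold of 0.5, using the standard confusion-matrix definitions. The function name and the toy scores are illustrative assumptions and are not taken from the DeepAlloDriver test set.

```python
def classification_metrics(scores, labels, threshold=0.5):
    """Compute accuracy, precision, recall, specificity, and F1.

    A mutation is called a driver when its score is strictly greater
    than `threshold`; labels are 1 for drivers and 0 for passengers.
    """
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)

    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(labels)),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


if __name__ == "__main__":
    # Toy example: three drivers and three passengers.
    toy_scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
    toy_labels = [1, 1, 1, 0, 0, 0]
    print(classification_metrics(toy_scores, toy_labels))
```

In this toy example, two of three drivers and two of three passengers are classified correctly, so every metric equals 2/3.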
DeepAlloDriver was also compared with the previous AlloDriver server using the aforementioned blind test set of 1713 allosteric mutations. As shown in Supplementary Table S7, DeepAlloDriver exhibited an average improvement of approximately 47% over AlloDriver across the accuracy, precision, and recall metrics. Additionally, it demonstrated a nearly 36% increase in the classification balance metrics, including specificity and the F1 score. DeepAlloDriver also achieved a much higher area under the receiver operating characteristic curve (AUROC), indicating its superior performance in classifying allosteric mutations. Overall, the comparison revealed that DeepAlloDriver performed significantly better than AlloDriver, providing a useful tool for understanding the allosteric mechanisms of cancer progression.

Case 1: allosteric driver mutation on NTRK1
The neurotrophic receptor tyrosine kinase 1 (NTRK1, also known as MTC, TRK, TRK1, TRKA or Trk-A) is a membrane-bound receptor that phosphorylates itself and members of the mitogen-activated protein kinase (MAPK) pathway upon neurotrophin binding. In addition to the most common oncogenic NTRK fusions, NTRK mutations have been explored as potential oncogenic events (28,29). According to the server results, Arg507 of the NTRK1 protein is located in an allosteric site composed of Arg507, Val511, Leu512, Lys513, Trp514, Lys523, Phe525 and Leu526, and the cysteine substitution at Arg507 (NTRK1 R507C) could trigger allosteric communication with a score of 1.000 (Figure 2A), leading to a negative effect on the catalytic function of NTRK1. Output analyses indicated that the region where Arg507 is located is also a hub of regulation across the protein kinase superfamily, and there are now 25 known type IV allosteric inhibitors or activators determined in experimental structures bound to this region, destabilizing or strengthening the interactions with regulatory subunits (30). Overall, these findings indicate the possibility of using NTRK1 R507C as a cancer target and offer a reference for the rational design of novel therapeutic agents.

Case 2: allosteric driver mutation on RRAS2
The Ras-related protein R-Ras2 (RRAS2, also known as teratocarcinoma oncogene 21 (TC21)) is a Ras-like GTPase that shares downstream effectors with the Ras subfamily proteins. The abnormal function of RRAS2 is a triggering factor that perturbs downstream tumor signaling cascades (such as MEK-ERK signaling and the PI3K-mTOR pathway), promoting malignant transitions in various cancers (31,32). Using DeepAlloDriver, we found that RRAS2 Q72L could act as a strong driver mutation (output score = 0.995) by inducing internal conformational regulation from an allosteric site (Gln72, Glu73, Glu74, Phe75, Gln81, Met83, Arg84, Gln110 and Arg112) (Figure 2B), providing a mechanism for the mutation as a potent oncogenic driver in both RRAS2 Q72L knock-in mice and cancer patients (21).

DISCUSSION
Allostery is a regulatory approach that transmits information in biological systems and can be utilized to decipher molecular mechanisms in a wide spectrum of biological processes and to discover cancer driver proteins. DeepAlloDriver provides an efficient service to help clinicians and biologists make better decisions in identifying allosteric driver mutations and carcinoma-relevant targets (30). Based on the results, the key factors contributing to the performance of DeepAlloDriver were investigated. First, DeepAlloDriver employs EGNNs, which are highly expressive and well suited for handling biomolecular graph data and capturing intricate patterns. Second, the larger dataset used to train DeepAlloDriver provided more diverse types of allosteric driver mutations, allowing the model to learn more complex patterns and generalize better to unseen data. Third, DeepAlloDriver uses raw data, specifically the 3D Cartesian coordinates of all Cα atoms in the protein, which enables the model to learn complex relationships and patterns directly from the data. To extend the scope of the server, 1949 proteins and 10 081 allosteric sites from ASD (http://mdl.shsmu.edu.cn/ASD) are provided for the prediction of allosteric driver mutations. Overall, DeepAlloDriver highlights the strong correlation between protein structure and function as well as the superior ability of EGNNs to predict allosteric driver mutations, and the effectiveness of the server was validated in the cancer drivers RRAS2 and NTRK1. This study has some limitations. It did not include driver mutation identification across more extensive disease types that are not currently associated with cancer. In addition, accumulative allosteric data incorporating allosteric modulators and proteins, together with the AlphaFold database (33), could allow for systematic profiling of allosteric sites in the human structural proteome.
Such improvements would make DeepAlloDriver a more useful resource for the identification of allosteric driver mutations.