Abstract

Drug resistance is a major threat to global health and a significant concern throughout the clinical treatment of diseases and drug development. Mutations in proteins that affect drug binding are a common cause of adaptive drug resistance. Quantitative estimation of how mutations affect the interaction between a drug and its target protein is therefore of vital significance for drug development and clinical practice. Computational methods relying on molecular dynamics simulations, Rosetta protocols and machine learning have been proven capable of predicting ligand affinity changes upon protein mutation. However, severely limited sample sizes and heavy noise, which induce overfitting and poor generalization, have impeded the wide adoption of machine learning for studying drug resistance. In this paper, we propose a robust machine learning method, termed SPLDExtraTrees, which can accurately predict ligand binding affinity changes upon protein mutation and identify resistance-causing mutations. Specifically, the proposed method ranks training data following a scheme that starts with easy-to-learn samples and gradually incorporates harder and more diverse samples into the training, iterating between sample weight recalculation and model updates. In addition, we calculate additional physics-based structural features to provide the machine learning model with valuable domain knowledge on proteins for these data-limited predictive tasks. The experiments substantiate the capability of the proposed method for predicting kinase inhibitor resistance under three scenarios, achieving predictive accuracy comparable with that of molecular dynamics and Rosetta methods at a much lower computational cost.

Introduction

In recent decades, drug resistance has been one of the major challenges in the development of anticancer and antimicrobial therapeutics. Resistance can arise via a variety of mechanisms, such as increased drug efflux, drug deactivation, activation of alternative signaling pathways and protein mutations. In particular, protein mutations that directly affect drug binding are the most common mechanism [21, 32, 53]. On the one hand, adaptive drug resistance is responsible for the failure of chemotherapy in many advanced cancers, and the demand for alternative treatments and new drugs that target mutated proteins has been steadily increasing; for example, drug development benefits from parallel exploration of inhibitors with different resistance profiles [2]. On the other hand, the rapid development of gene sequencing technologies enables precision medicine to provide personalized recommendations for suitable treatment plans based on a patient's genotype [14, 60]. All these emerging trends call for a better understanding and modeling of drug resistance and protein mutations, which directly contributes to overcoming resistance.

Protein kinases play a fundamental role in human cancer initiation and progression [8] and are among the most important drug targets [19]. Many small-molecule kinase inhibitors have been discovered for the treatment of diverse types of cancers by targeting different kinases; more than 62 drugs inhibiting about 20 targets have been approved by the FDA for the treatment of malignancies [40]. Most of these inhibitors target tyrosine kinases, which play key roles in the modulation of growth factor signaling and cell proliferation [4]. Tyrosine kinase inhibitors (TKIs) are advantageous in the treatment of chronic myeloid leukemia (CML) and non-small cell lung cancer owing to their high efficacy, high selectivity and low side effects, and some have even become first-line treatments [4, 38]. Despite these successes, the emergence of drug resistance remains a major challenge in cancer treatment and has forced the continuous development of new generations of inhibitors [25, 34, 54, 55]. Strikingly, over 25% of CML patients are reported to change TKIs at least once during treatment due to TKI resistance, most of which is caused by Abl kinase mutations [4, 35]. In addition, clinically observed kinase mutations display a long tail of rare and uncharacterized variants [19], so the sensitivity of known TKIs to these kinase mutants is unknown. The high diversity of kinase mutants makes experimental examination of their influence on TKIs time-consuming, expensive and often infeasible. Computational methods are therefore expected to indicate how mutations in target proteins would affect the sensitivity of TKIs.

Nowadays, three types of computational methods are commonly employed to estimate the affinity changes ($\Delta\Delta G$) of proteins with point mutations [2]. The first is molecular dynamics (MD) based free energy calculation, which uses first-principles statistical mechanics and employs unphysical (i.e. alchemical) intermediates to estimate the free energy changes of protein thermostability upon a site mutation [1, 15, 46, 52]. Despite its remarkable performance as a gold standard, it suffers from high computational overhead. The second option is Rosetta, a modeling program that applies mixed physics- and knowledge-based potentials [3, 5] for scoring binding strength. Compared with MD-based simulations, Rosetta with the REF15 scoring function achieves a relative balance between accuracy and computational cost. The third is machine learning [2], which is data-driven and uses expert-engineered features for predicting $\Delta\Delta G$. For example, Aldeghi et al. employ extremely randomized regression trees to predict TKI affinity change values. Once the input features related to the change in binding affinity upon mutation are calculated, the change can be predicted within a few seconds. This enables high-throughput evaluation of a wide range of protein mutations that could potentially induce drug resistance and would also benefit the discovery of new drugs targeting mutant proteins.

Although data-driven machine learning methods have been developed to assess the impact of mutations on protein stability [9, 11, 16, 18, 28, 36, 42] and protein–ligand binding affinity [22, 44, 49, 50, 61], very few have been developed to predict ligand binding affinity changes upon protein mutation and identify resistance-causing mutations [2, 47]. The main reason is that machine learning usually requires a large amount of labeled training data, whereas data for resistance-causing mutations can be impractical or expensive to acquire. Limited training data make prediction more challenging. Besides, random noise and systematic collection biases are inevitably present in the samples, which tends to induce overfitting and poor generalization. Sample reweighting is a commonly used strategy against this robust learning issue [45, 58]; its main idea is to impose weights on samples based on their reliability for training. Typical methods include self-paced learning (SPL) [26] and its variants [23, 24, 57, 58].

To alleviate the overfitting issue, we present a robust learning strategy, extremely randomized regression trees with diversity self-paced learning (SPLDExtraTrees), to predict ligand binding affinity changes upon protein mutation for the cancer target Abl kinase. It incorporates a sample reweighting strategy into the extremely randomized regression trees to adaptively suppress the negative influence of noisy data. Specifically, the proposed method selects not only easy samples (with small loss values) but also balances sample diversity (across protein families). Incorporating protein family diversity is expected to help the model quickly grasp easy and comprehensive information and achieve improved generalization. In addition, we incorporate additional physics-based structural features computed with the Rosetta REF15 scoring function to capture more sophisticated protein knowledge than the work in [2]. We systematically evaluate SPLDExtraTrees by conducting experiments on the human Abl kinase dataset under three scenarios. The experiments demonstrate the potential of SPLDExtraTrees for predicting ligand binding affinity changes upon protein mutations for Abl kinase. It outperforms state-of-the-art machine learning methods by a large margin in all scenarios considered and achieves predictive accuracy comparable with that of molecular dynamics and Rosetta methods with less computational time and resources. We expect the proposed method to effectively provide routine prediction of resistance-causing mutations across other protein targets.

Materials and Methods

In this section, we first present the datasets used for training and testing the machine learning methods. We then introduce the calculation of the input features associated with affinity changes upon protein mutation, as well as feature selection. Finally, we present the objective function of the proposed SPLDExtraTrees method and an efficient algorithm for optimizing it. The pipeline of the proposed method is summarized in Figure 1.

Figure 1

Schematic representation of the SPLDExtraTrees workflow. (A) The Platinum dataset and TKI dataset are used in this work, given wild-type and mutant protein and ligand structure files. (B) Feature calculation and selection: a total of 146 features that might contain useful information for predicting mutation-induced affinity changes are calculated. (C) The pipeline of the proposed SPLDExtraTrees method: SPLDExtraTrees gradually incorporates both easy samples (with small loss values) and diverse samples (from different protein families) into learning and then iterates between weight recalculation and model updating. A dotted block denotes a group with given $\lambda$ and $\gamma$, a colored box denotes a sample and the samples in each group are sorted by loss value (color from dark to light).

Datasets

The machine learning methods are trained on a subset of the Platinum dataset [37] and tested on the TKI dataset [19]. The Platinum dataset is a manually curated, literature-derived database that associates experimental information on changes in affinity with three-dimensional structures of protein–ligand complexes. Following the data preprocessing outlined in [2], we exclude samples from the training dataset if they: (1) do not have a point-mutated correspondent in the dataset; (2) contain broken ligand structures; or (3) contain poorly soluble ligands. The resulting training dataset contains 484 point mutations with associated affinity change values ($\Delta\Delta G$) across 84 proteins and 143 ligands. The test dataset contains 144 point-mutation samples with TKI affinity changes ($\Delta\Delta G$) across eight TKIs, caused by 31 Abl mutations. Table 1 summarizes the TKI dataset of 144 Abl kinase mutations and eight TKIs used in this work, and Figure 2 shows the distribution of experimental $\Delta\Delta G$ values for the Platinum and TKI datasets.
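A minimal sketch of this pre-filtering, assuming the dataset has been exported to a flat table; the column names (mutant_of, ligand_intact, soluble) are hypothetical placeholders, not the actual Platinum schema:

```python
import pandas as pd

# Hypothetical flat export of the Platinum dataset; column names are illustrative.
platinum = pd.read_csv("platinum.csv")

mask = (
    platinum["mutant_of"].notna()   # (1) keep entries with a point-mutated correspondent
    & platinum["ligand_intact"]     # (2) drop broken ligand structures
    & platinum["soluble"]           # (3) drop poorly soluble ligands
)
train = platinum[mask]              # 484 point mutations remain after filtering
```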

Table 1

Detailed information on the TKI dataset used in the experiments.

TKI | N$_{mut}$$^{1}$ | Resistant$^{2}$ | Susceptible$^{2}$ | PDB
Axitinib | 26 | 0 | 26 | 4WA9
Bosutinib | 21 | 4 | 17 | 3UE4
Dasatinib | 21 | 5 | 16 | 4XEY
Imatinib | 21 | 5 | 16 | 1OPJ
Nilotinib | 21 | 4 | 17 | 3CS9
Ponatinib | 21 | 0 | 21 | 3OXZ
Erlotinib | 7 | 1 | 6 | Dock to 3ue4$^{3}$
Gefitinib | 6 | 0 | 6 | Dock to 3ue4$^{3}$
Total | 144 | 19 | 125 |

$^{1}$ N$_{mut}$ is the total number of Abl kinase mutants. $^{2}$ Number of resistant/susceptible mutants using a 10-fold affinity change threshold, corresponding to a binding free energy change of 1.36 kcal/mol. $^{3}$ PDB: source PDB ID, or 'Dock to 3ue4' for the inhibitors without a co-crystal structure, which were docked with Glide-SP using 3ue4 as the receptor [19].

Figure 2

Distribution of the experimental $\Delta\Delta G$ values of the Platinum and TKI datasets. The line at $\Delta\Delta G = 1.36$ kcal/mol separates mutations defined as resistant from susceptible.
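As a quick sanity check on this threshold, a 10-fold change in the dissociation constant corresponds, at $T \approx 298$ K ($RT \approx 0.593$ kcal/mol), to

$$\Delta\Delta G = RT \ln \frac{K_d^{\mathrm{mut}}}{K_d^{\mathrm{wt}}} = RT \ln 10 \approx 0.593 \times 2.303 \approx 1.36 \ \text{kcal/mol}.$$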

Feature Calculation

We calculate a total of 146 features that bear potentially useful information for predicting affinity changes introduced by protein mutations.

First, 127 features are calculated from the crystallographic protein–ligand structures, the physicochemical properties of ligands and residues, and a fast empirical scoring function for ligand–target binding; these choices are consistent with the work in [2]. Specifically, 18 ligand properties (e.g. logP, molecular weight and apolar surface area) are calculated using RDKit version 2018.09.1. Thirteen residue mutation properties are calculated from precomputed properties of each amino acid (e.g. change in hydropathy index, flexibility and volume), including the changes in folding free energy upon mutation predicted by FoldX v4 [43]. A total of 21 features describing the mutation environment (e.g. radial count of protein atoms near the mutation, number of polar/apolar/charged residues in the binding pocket) are calculated using Biopython version 1.73. Six protein–ligand interaction features are calculated using the Protein–Ligand Interaction Profiler (PLIP) [41]. A total of 59 Vina features are calculated with the modified AutoDock Vina program [48] provided with DeltaVINA [49], which is also used to calculate 10 pharmacophore-based solvent accessible surface area features.
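As an illustration of the ligand-property features, the sketch below computes three of the 18 RDKit descriptors named above; the exact descriptor set used in [2] may differ, and the example molecule is arbitrary:

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors

def ligand_properties(smiles: str) -> dict:
    """Compute a few of the RDKit ligand properties mentioned above."""
    mol = Chem.MolFromSmiles(smiles)
    return {
        "logP": Crippen.MolLogP(mol),          # lipophilicity
        "mol_weight": Descriptors.MolWt(mol),  # molecular weight (Da)
        "tpsa": Descriptors.TPSA(mol),         # topological polar surface area
    }

print(ligand_properties("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin, purely as an example
```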

In addition, we integrate energy features associated with empirical and statistical potentials, aiming to capture more sophisticated protein–ligand complex information than that encoded in the traditional features [3]. We incorporate 19 Rosetta energy features computed with the Rosetta full-atom force field [3]. Specifically, the ligands are parametrized using Rosetta's molfile_to_params.py script, and the interface between the ligand and the protein is then minimized with RosettaScripts; the XML description of the complete protocol is available from the RosettaScripts documentation (https://new.rosettacommons.org/docs/latest). Ten structures are generated for each protein–ligand complex and the full-atom energies of the structure with the lowest energy score are extracted. The 19 full-atom energy terms of REF15 are incorporated into the feature set.
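A hedged sketch of how the lowest-energy structure's terms could be pulled from a standard Rosetta scorefile (score.sc), whose rows are whitespace-delimited lines prefixed with "SCORE:"; the file name and column handling are assumptions, not the exact pipeline used here:

```python
import pandas as pd

def lowest_energy_terms(scorefile: str = "score.sc") -> pd.Series:
    """Return the energy terms of the lowest-scoring decoy in a Rosetta scorefile."""
    with open(scorefile) as fh:
        rows = [line.split()[1:] for line in fh if line.startswith("SCORE:")]
    df = pd.DataFrame(rows[1:], columns=rows[0])        # first SCORE: line is the header
    df["total_score"] = df["total_score"].astype(float)
    return df.loc[df["total_score"].idxmin()]           # decoy with the lowest total energy
```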

Finally, the differences between the wild-type and mutant feature values are calculated and fed into the machine learning models for predicting the $\Delta\Delta G$ value, i.e. the mutation-induced difference in binding free energy. Figures S1–S6 in the supplementary material show the distribution of each selected feature on the Platinum and TKI datasets.

Feature Selection

Feature selection aims to select a subset of the initial features by removing those with little predictive power. Feature selection methods can be roughly divided into three categories: filter, wrapper and embedded methods [30]. Filter methods evaluate a feature based on its discriminative power without considering its correlations with other features [12, 13, 27, 29]. Wrapper methods use a particular machine learning method as the feature evaluation metric, select a subset of features according to the estimated prediction error and build the final learning machine [17, 33, 39]; however, they require extensive computation time. Embedded methods perform feature selection as part of the learning procedure and are more computationally efficient than wrapper methods [30, 56]. See Table S4 in the supplementary material for details.

In this work, feature selection is performed with the SelectFromModel meta-transformer of the sklearn library, an embedded feature selection method. When a tree-based model is coupled with SelectFromModel, feature importances can be calculated and used to discard irrelevant features. We allow the selection of any number of features that minimizes the mean squared error of $k$-fold ($k=10$) cross-validation on the Platinum dataset. The distribution of each selected feature on the Platinum and TKI datasets is shown in Figure S7 in the supplementary material.
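A minimal sketch of this embedded selection step; the data here are random placeholders standing in for the 146 difference features and experimental $\Delta\Delta G$ values, and the default importance threshold is shown rather than the CV-tuned one described above:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.normal(size=(484, 146))  # placeholder for the 146 difference features
y = rng.normal(size=484)         # placeholder for experimental ddG values

trees = ExtraTreesRegressor(n_estimators=500, random_state=0).fit(X, y)
selector = SelectFromModel(trees, prefit=True)  # default threshold: mean feature importance
X_selected = selector.transform(X)              # keep only features above the threshold
print(X_selected.shape)
```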

Proposed Method

We show that machine learning can achieve state-of-the-art results in predicting ligand binding affinity changes upon protein mutation if an appropriate learning strategy is adopted. The proposed SPLDExtraTrees method is built on extremely randomized regression trees (ExtraTrees) [2]. Instead of learning from all samples simultaneously, we introduce an SPL strategy that learns in a meaningful order, from easy samples to hard ones [6, 26], according to the definition of easiness provided below. In the sample selection process, we incorporate protein family information so that the model can obtain more comprehensive knowledge during learning and achieve better generalization.

Model Formulation

Given the training dataset $\mathcal{D}=\left\{\left(\boldsymbol{x}_{i}, y_{i}\right)\right\}_{i=1}^{n}$, where $\boldsymbol{x}_{i} \in \mathbb{R}^{m}$ represents the $i$-th input sample and $y_{i} \in \mathbb{R}$ the corresponding experimental $\Delta\Delta G$ value, the model learns a mapping from input to output through the decision function $f_{\boldsymbol{\beta}}: X \rightarrow Y$, where $\boldsymbol{\beta}$ denotes the model parameters. Let $L\left(y_{i}, f\left(\boldsymbol{x}_{i}, \boldsymbol{\beta}\right)\right)$ denote the loss function that measures the cost between the experimental $\Delta\Delta G$ value $y_{i}$ and the estimate $f\left(\boldsymbol{x}_{i}, \boldsymbol{\beta}\right)$. Here, we adopt the Root Mean Square Error (RMSE) as the loss function $L$ and ExtraTrees as the decision function $f$. The objective of a conventional machine learning model can be expressed as:
$$\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} L\left(y_{i}, f\left(\boldsymbol{x}_{i}, \boldsymbol{\beta}\right)\right). \qquad (1)$$
To improve the generalization ability in predicting binding affinity changes upon protein mutation, we incorporate a self-paced regularization term into the learning objective so that the model is learned in a meaningful order. Specifically, we define a latent weight variable $\boldsymbol{v}=\left[v_{1}, \cdots, v_{n}\right]$. The SPL model jointly learns the model parameters $\boldsymbol{\beta}$ and the latent weights $\boldsymbol{v}$ by minimizing:
$$\min_{\boldsymbol{\beta},\, \boldsymbol{v} \in[0,1]^{n}} \sum_{i=1}^{n} v_{i} L\left(y_{i}, f\left(\boldsymbol{x}_{i}, \boldsymbol{\beta}\right)\right)-\lambda \sum_{i=1}^{n} v_{i}, \qquad (2)$$
where $\lambda$ denotes the age parameter controlling the learning pace. Eq. (2) minimizes the weighted training loss together with the negative $l_{1}$-norm regularizer $-\|\boldsymbol{v}\|_{1}=-\sum_{i=1}^{n} v_{i}$ (since $v_{i} \geq 0$).
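For a fixed $\boldsymbol{\beta}$, Eq. (2) decouples across samples and the optimal weights take the standard SPL closed form [26]:

$$v_{i}^{*}=\begin{cases}1, & L\left(y_{i}, f\left(\boldsymbol{x}_{i}, \boldsymbol{\beta}\right)\right)<\lambda,\\ 0, & \text{otherwise,}\end{cases}$$

so samples whose loss falls below the current age parameter are deemed 'easy' and admitted into training.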

In the standard SPL model, samples are selected solely in terms of 'easiness'. In practice, however, diversity is an important aspect of learning that should also be considered. For instance, there is a large set of protein families with differing binding properties [10], and the correlation between data points is typically higher when samples belong to the same protein family than to different ones. We therefore encourage protein diversity in sample selection during self-paced training, which implies that the selected samples should be less similar to one another. To achieve this, we embed a diversity regularization term [24] into Eq. (2).

Formally, assume the training samples $\boldsymbol{X}=(\boldsymbol{x}_{1}, \dots, \boldsymbol{x}_{n}) \in \mathbb{R}^{m \times n}$ are partitioned into $b$ protein groups $\boldsymbol{X}^{(1)}, \dots, \boldsymbol{X}^{(b)}$, where the columns of $\boldsymbol{X}^{(j)} \in \mathbb{R}^{m \times n_{j}}$ correspond to the samples in the $j$-th group, $n_{j}$ is the number of samples in the $j$-th group and $\sum_{j=1}^{b} n_{j}=n$. Accordingly, denote the weight vector by $\boldsymbol{v}=\left[\boldsymbol{v}^{(1)}, \dots, \boldsymbol{v}^{(b)}\right]$, where $\boldsymbol{v}^{(j)}=\left(v_{1}^{(j)}, \dots, v_{n_{j}}^{(j)}\right)^{T} \in[0,1]^{n_{j}}$. On the one hand, the model should assign non-zero weights in $\boldsymbol{v}$; on the other hand, the non-zero weights should be dispersed across more protein groups $\boldsymbol{v}^{(j)}$ to increase diversity. Both requirements can be achieved simultaneously with the following optimization model:
$$\min_{\boldsymbol{\beta},\, \boldsymbol{v} \in[0,1]^{n}} \sum_{i=1}^{n} v_{i} L\left(y_{i}, f\left(\boldsymbol{x}_{i}, \boldsymbol{\beta}\right)\right)-\lambda\|\boldsymbol{v}\|_{1}-\gamma\|\boldsymbol{v}\|_{2,1}, \qquad (3)$$
where $\lambda$ and $\gamma$ are the parameters imposed on the easiness term (the negative $l_{1}$-norm: $-\|\boldsymbol{v}\|_{1}$) and the diversity term (the negative $l_{2,1}$-norm: $-\|\boldsymbol{v}\|_{2,1}$), respectively. The diversity term can be expressed as
$$\|\boldsymbol{v}\|_{2,1}=\sum_{j=1}^{b}\left\|\boldsymbol{v}^{(j)}\right\|_{2}. \qquad (4)$$
As shown in Eq. (3), this self-paced regularizer consists of two components. One is the negative $l_{1}$-norm inherited from conventional SPL, which favors selecting easy over complex samples. The other is the negative $l_{2,1}$-norm, which tends to select diverse samples scattered across different protein groups. It is well known that the $l_{2,1}$-norm leads to group sparsity [59], i.e. the non-zero entries of $\boldsymbol{v}$ are likely to concentrate in a few groups; the negative $l_{2,1}$-norm therefore has the opposite, anti-group-sparsity effect. As a result, this self-paced learning with diversity (SPLD) regularization selects not only easy samples but also diverse samples that are sufficiently dissimilar from what has already been learned.

In this paper, we combine ExtraTrees with SPLD: each node of the regression trees is trained on easy and diverse samples selected in a self-paced way. Compared with the conventional ExtraTrees method, SPLDExtraTrees is expected to better avoid overfitting and exhibit superior generalization.

Model Algorithm

We adopt an alternating optimization strategy to solve Eq. (3); the details are listed in Algorithm 1. One challenge is that optimizing $\boldsymbol{v}$ with a fixed $\boldsymbol{\beta}$ is a non-convex problem. Jiang et al. [24] proposed a simple yet effective algorithm that obtains the global optimum of this subproblem, as listed in Algorithm 2. According to Step 5 of Algorithm 2, a sample whose loss is smaller than the threshold $\lambda+\gamma \frac{1}{\sqrt{i}+\sqrt{i-1}}$ is selected for learning, where $i$ is the sample's rank with respect to its loss value within its group. Since the threshold decreases as the rank $i$ increases, Step 5 penalizes samples selected monotonically from the same group. Once Algorithm 2 yields the optimal $\boldsymbol{v}^{*}$, the alternating optimization strategy can be applied directly to solve Eq. (3). Following the work in [26], we initialize $\boldsymbol{v}$ by setting $v_{i}=1$ for randomly selected samples, and the self-paced parameters are scaled by factors $\mu_{1}$ and $\mu_{2}$ ($\mu_{1}, \mu_{2} \geq 1$) in Step 5 of Algorithm 1 at the end of every iteration. The work in [24] also suggests that the algorithm is more robust when the samples are first sorted in ascending order of their losses and $\lambda$ and $\gamma$ are then set according to statistics collected from the ranked samples; we adopt this mechanism to update the parameters. A Python sketch of this procedure is given below.
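The following is a compact sketch of the alternating scheme under assumed variable names (groups holds the protein family label of each sample); it illustrates the selection rule and parameter schedule rather than reproducing the exact implementation:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def spld_select(losses, groups, lam, gam):
    """Algorithm 2 sketch: select easy and diverse samples from per-sample losses."""
    v = np.zeros(len(losses))
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        idx = idx[np.argsort(losses[idx])]                # rank samples within the group
        for rank, i in enumerate(idx, start=1):
            threshold = lam + gam / (np.sqrt(rank) + np.sqrt(rank - 1))
            v[i] = 1.0 if losses[i] < threshold else 0.0  # threshold decays with the rank
    return v

def spld_extratrees(X, y, groups, lam=1.0, gam=1.0, mu1=1.1, mu2=1.1, n_iter=10):
    """Algorithm 1 sketch: alternate between model updates and weight recalculation."""
    rng = np.random.default_rng(0)
    v = (rng.random(len(y)) < 0.5).astype(float)          # random initialization of v [26]
    model = ExtraTreesRegressor(n_estimators=500, random_state=0)
    for _ in range(n_iter):
        sel = v > 0
        model.fit(X[sel], y[sel])                         # update beta on selected samples
        losses = np.abs(y - model.predict(X))             # per-sample absolute error
        v = spld_select(losses, groups, lam, gam)         # update v with beta fixed
        lam, gam = lam * mu1, gam * mu2                   # grow the model "age" each round
    return model, v
```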

Results and Discussion

In this section, we compare the proposed method, SPLDExtraTrees, with three types of computational methods: molecular dynamics, Rosetta and machine learning. To evaluate the capability of the proposed method for predicting drug response to tyrosine kinase site mutations, we test the model on the TKI dataset [2, 19].

Experimental Setup

We conduct experiments on the TKI dataset under three scenarios. The first scenario trains the machine learning methods on the Platinum dataset and tests them on the TKI dataset. Here, tyrosine kinase is absent from the training dataset, so we can evaluate whether the model extrapolates to this protein target. The second scenario trains the model on the Platinum dataset plus a small portion of the TKI dataset and tests it on the rest of the TKI dataset. To do this, we randomly select two samples per inhibitor from the TKI dataset and add them to the training pool along with the Platinum dataset. Compared with Scenario 1, this evaluates whether the model can improve its prediction of affinity changes in Abl mutants by learning from a small amount of tyrosine kinase information.

The third scenario trains and tests the model on the TKI dataset. We apply 8-fold nested cross-validation, following the setup in [2] (sketched below): in each iteration, the model selects a feature subset through 7-fold cross-validation over seven inhibitors and tests on the remaining one. This process is repeated eight times to obtain predictions for all entries of the TKI dataset. In this scenario, the model is trained on tyrosine kinase mutations only, so we can evaluate its interpolation capability.
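A sketch of the Scenario 3 outer split with scikit-learn's LeaveOneGroupOut, using random placeholder data; tki stores an assumed per-sample inhibitor label, so each outer fold holds out one of the eight TKIs:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.normal(size=(144, 30))     # placeholder features for the 144 TKI samples
y = rng.normal(size=144)           # placeholder experimental ddG values
tki = np.repeat(np.arange(8), 18)  # placeholder inhibitor label per sample

for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=tki):
    # Inner loop: select features by 7-fold CV over the seven training inhibitors,
    # then predict ddG once for the held-out inhibitor (the outer test fold).
    pass
```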

Models in Comparison

We compare SPLDExtraTrees with three types of methods. The first is molecular dynamics (MD) simulation using the Amber99sb*-ILDN force field with two different simulation protocols, A99 and A99$l$ [7, 20, 31, 51]; for simplicity, we show only the results obtained with the best-performing Amber force fields, as reported in [2]. The second is Rosetta, a modeling program that uses mixed physics- and knowledge-based potentials; its REF15 scoring function achieves the best published results on this task [2]. The third is machine learning; we use ExtraTrees [2] and ExtraTrees with a self-paced learning strategy (SPLExtraTrees) as competing methods.

Evaluation Metrics

Performance is evaluated by the Root Mean Square Error (RMSE) between the experimentally measured and calculated $\Delta\Delta G$ values, the Pearson correlation coefficient (Pearson) and the area under the precision-recall curve (AUPRC); the latter measures the ability to classify mutations as resistant or susceptible. Precision is the number of true positive resistant mutations divided by the total number of predicted resistant mutations, and recall is the number of true positive resistant mutations divided by the total number of true resistant mutations. Consistent with previous work [2, 19], resistant mutations are defined as those changing the affinity by at least 10-fold, i.e. $\Delta\Delta G_{exp} > 1.36$ kcal/mol.
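The three metrics can be computed as below; average precision is used as the usual stand-in for the area under the precision-recall curve, with the predicted $\Delta\Delta G$ serving as the ranking score for the resistant class:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import average_precision_score

def evaluate(ddg_exp, ddg_pred, threshold=1.36):
    """RMSE, Pearson correlation and AUPRC for ddG predictions (kcal/mol)."""
    rmse = float(np.sqrt(np.mean((ddg_exp - ddg_pred) ** 2)))
    pears = pearsonr(ddg_exp, ddg_pred)[0]
    auprc = average_precision_score(ddg_exp > threshold, ddg_pred)  # resistant = positive
    return rmse, pears, auprc
```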

Results

Table 2 summarizes the performance of the $\Delta\Delta G$ estimates of all competing methods on the TKI dataset, including computational costs (see Table S1 in the supplementary material for details). In terms of computational cost, molecular dynamics methods are by far the most expensive way to obtain accurate $\Delta\Delta G$ estimates. In contrast, the machine learning methods deliver $\Delta\Delta G$ estimates within seconds to minutes on a single CPU core, provided the necessary structure-based features have been calculated in advance. To assess binding affinity change prediction and resistant mutation identification, we evaluate the extrapolation and interpolation capability of the proposed method under the three scenarios. As shown in Table 2, SPLDExtraTrees outperforms ExtraTrees, SPLExtraTrees and MD under all three scenarios, and it is comparable with Rosetta for Scenarios 2 and 3, except for the Pearson correlation in Scenario 2.

Table 2

Summary of the computational methods used, their calculation costs and performance. Mean prediction performance $x^{upper}_{lower}$ over 20 repetitions is reported.

Abbreviation | Method | Force field or scoring function | Hardware | Compute hours per $\Delta\Delta G$ estimate | RMSE (kcal/mol) | Pearson | AUPRC
A99$^{1}$ | Molecular Dynamics | Amber99sb*-ILDN and GAFF v2.1 | 10 CPU cores and 1 GPU | 59 | $0.91^{1.05}_{0.77}$ | $0.44^{0.59}_{0.20}$ | $0.56^{0.77}_{0.32}$
A99$l$$^{1}$ | Molecular Dynamics | Amber99sb*-ILDN and GAFF v2.1 | 10 CPU cores and 1 GPU | 98 | $0.91^{1.09}_{0.74}$ | $0.42^{0.59}_{0.20}$ | $0.51^{0.75}_{0.27}$
REF15$^{2}$ | Rosetta | REF15 | 1 CPU core | 32 | $0.72^{0.83}_{0.60}$ | $0.67^{0.81}_{0.45}$ | $0.53^{0.74}_{0.29}$
Scenario 1:
ExtraTrees$^{*3}$ | ML | n/a | 1 CPU core | 0.02 | $0.87^{1.06}_{0.68}$ | $0.12^{0.29}_{-0.04}$ | $0.20^{0.39}_{0.10}$
SPLExtraTrees | ML | n/a | 1 CPU core | 0.02 | $0.75^{0.77}_{0.75}$ | $0.50^{0.54}_{0.38}$ | $0.48^{0.52}_{0.34}$
SPLDExtraTrees | ML | n/a | 1 CPU core | 0.02 | $0.73^{0.74}_{0.72}$ | $0.54^{0.56}_{0.47}$ | $0.50^{0.52}_{0.43}$
Scenario 2:
ExtraTrees | ML | n/a | 1 CPU core | 0.02 | $0.81^{0.89}_{0.66}$ | $0.34^{0.54}_{0.22}$ | $0.35^{0.47}_{0.22}$
SPLExtraTrees | ML | n/a | 1 CPU core | 0.02 | $0.73^{0.80}_{0.53}$ | $0.53^{0.65}_{0.38}$ | $0.46^{0.57}_{0.35}$
SPLDExtraTrees | ML | n/a | 1 CPU core | 0.02 | $0.70^{0.76}_{0.57}$ | $0.60^{0.68}_{0.49}$ | $0.55^{0.72}_{0.42}$
Scenario 3:
ExtraTrees$^{*3}$ | ML | n/a | 1 CPU core | 0.02 | $0.68^{0.80}_{0.55}$ | $0.57^{0.72}_{0.34}$ | $0.47^{0.68}_{0.25}$
SPLExtraTrees | ML | n/a | 1 CPU core | 0.02 | $0.59^{0.61}_{0.58}$ | $0.72^{0.73}_{0.70}$ | $0.56^{0.57}_{0.51}$
SPLDExtraTrees | ML | n/a | 1 CPU core | 0.02 | $0.58^{0.59}_{0.57}$ | $0.74^{0.75}_{0.72}$ | $0.56^{0.60}_{0.52}$

$^{1}$ Data for the molecular dynamics simulations with the A99 and A99$l$ force fields are obtained from [2]. $^{2}$ Data for the Rosetta REF15 scoring function are obtained from [2]. $^{3}$ Data for ExtraTrees$^{*}$ are obtained from [2].

For the first scenario, we trained the machine learning methods on the Platinum dataset and tested them on the TKI dataset to evaluate the models' extrapolation capability. The Platinum dataset was pre-filtered to exclude samples belonging to the same protein family as Abl kinase, in order to assess the generalization capacity of the models. Figure 3 shows scatter plots of the experimental versus calculated $\Delta\Delta G$ values. As shown in the upper panel, the SPL and SPLD strategies improved the robustness of ExtraTrees in predicting $\Delta\Delta G$. SPL alone enhanced the performance of ExtraTrees$^*$ (RMSE = 0.87 kcal/mol, Pearson = 0.12 and AUPRC = 0.20) by a large margin, yielding more accurate $\Delta\Delta G$ predictions with lower absolute errors (RMSE = 0.75 kcal/mol), stronger correlation (Pearson = 0.50) and better classification performance (AUPRC = 0.48). Incorporating protein family information in SPLDExtraTrees further improved the predictions (RMSE = 0.73 kcal/mol, Pearson = 0.54 and AUPRC = 0.50). Meanwhile, the additional physics-based structural features (obtained with the Rosetta REF15 scoring function) also provided useful information and contributed to the improvement (see Figure S8 and Table S1 in the supplementary material). We observed that the range of $\Delta\Delta G$ values predicted by SPLExtraTrees and SPLDExtraTrees is narrow, from 0.04 to 1.46 kcal/mol, which remains a challenge requiring further analysis. Interestingly, SPLDExtraTrees outperformed the molecular dynamics methods (A99 and A99$l$) by a considerable margin, demonstrating its ability to generalize to this kinase. However, SPLDExtraTrees slightly underperformed Rosetta (REF15) (RMSE = 0.72 kcal/mol, Pearson = 0.67 and AUPRC = 0.53), which might result from the absence of kinase family samples, those most closely related to the Abl proteins, from the training set.

Figure 3

Scatter plots of the experimental versus calculated $\Delta\Delta G$ values. The $x$-axis denotes the experimental $\Delta\Delta G$ value (kcal/mol) and the $y$-axis the calculated $\Delta\Delta G$ value (kcal/mol). Each $\Delta\Delta G$ estimate is color-coded according to its absolute error w.r.t. the experimental value; at 300 K, a 1.4 kcal/mol error corresponds to a 10-fold error in the $K_d$ change and a 2.8 kcal/mol error to a 100-fold error. 'Extrapolating' means the model is trained on the Platinum dataset and tested on the TKI dataset (Scenario 1); 'Extrapolating_kinase' means the model is trained as in Scenario 2; 'Interpolating' means the model is trained and tested on the TKI dataset (Scenario 3).

Therefore, in the second scenario, a small part of the TKI dataset was used together with the Platinum dataset for training, and the rest of the TKI dataset was used for testing. We assumed the model could improve its $\Delta\Delta G$ predictions by learning from a small amount of tyrosine kinase information. As shown in the middle panel of Figure 3, all three methods achieved higher prediction accuracy than in Scenario 1. In particular, the gap between SPLExtraTrees and SPLDExtraTrees widened when samples of the same protein family were present in the training set. Even with this small number of kinase samples, SPLDExtraTrees obtained good estimates (RMSE = 0.70 kcal/mol and AUPRC = 0.55) that were comparable with the Rosetta calculations (RMSE = 0.72 kcal/mol and AUPRC = 0.53), except for the Pearson metric (Table 2).

For the third scenario, the machine learning methods were trained and tested on the TKI dataset so that we could evaluate the interpolation capability of the models. As shown in the bottom panel of Figure 3, SPLDExtraTrees achieved the best performance across the benchmark: its Pearson and AUPRC improved by approximately 0.18 and 0.15, respectively, over ExtraTrees$^*$. The additional physics-based structural features again provided valuable information for improving predictive performance (see Figure S9 and Table S1 in the supplementary material for details). However, because only one protein family is present in both the training and test sets, SPLDExtraTrees gained no extra information from protein families and obtained results comparable with SPLExtraTrees. Moreover, SPLDExtraTrees attained more accurate $\Delta\Delta G$ estimates with lower absolute errors (RMSE = 0.58 kcal/mol), higher correlation (Pearson = 0.74) and better classification performance (AUPRC = 0.56) than Rosetta (RMSE = 0.72 kcal/mol, Pearson = 0.67 and AUPRC = 0.53). This implies that the proposed method achieves state-of-the-art $\Delta\Delta G$ estimates and discriminates well between resistant and susceptible mutations when trained on relevant data (i.e. tyrosine kinases).

To further illustrate the capability of the proposed method to accurately identify resistant mutations, we plot the Receiver Operating Characteristic (ROC) curve. As shown in Figure 4, the proposed method outperforms all competing methods by a large margin under all three scenarios. In the extrapolation case (Figure 4A), SPLDExtraTrees obtained an AUC of 0.867, more than a 10% improvement over molecular dynamics (A99) and Rosetta (REF15). In the interpolation case (Figure 4C), the AUC achieved by SPLDExtraTrees is approximately 14% and 13% higher than that of MD (A99) and Rosetta (REF15), respectively. However, most protein mutations in the dataset are susceptible rather than resistant, so the Precision-Recall Curve (PRC) is more informative than the ROC curve for evaluating the models on such an imbalanced data distribution. Figure 5 plots the PRC calculated on the TKI dataset for both the extrapolation and interpolation scenarios. As shown in Figure 5A, although the proposed method outperformed ExtraTrees$^*$ by a large margin, it performed slightly worse than molecular dynamics (A99) and Rosetta (REF15) in the extrapolation case. The average precision (AP) of SPLDExtraTrees holds roughly an 18% lead over ExtraTrees$^*$, implying that its resistant mutation predictions have higher precision and fewer false positives. In the interpolation case, the AP of SPLDExtraTrees is consistent with Rosetta (REF15) and superior to molecular dynamics (A99), implying that training the proposed method on tyrosine kinase-related data achieves remarkable performance. Furthermore, we found that the additional physics-based structural features provided valuable information for improving ROC and PRC performance; see Figures S10–S13 in the supplementary material for details.

Figure 4

Validation of all competing methods using Receiver Operating Characteristic (ROC) curves. The $x$-axis represents the False Positive Rate (FPR) and the $y$-axis the True Positive Rate (TPR). (A) The machine learning methods are trained in the extrapolation case (Scenario 1). (B) The machine learning methods are trained in the extrapolation case with a few kinase samples (Scenario 2). (C) The machine learning methods are trained in the interpolation case (Scenario 3).

Figure 5

Validation of all competing methods using the Precision-Recall Curve (PRC). The $x$-axis represents Recall and the $y$-axis Precision. (A) The machine learning methods are trained in the extrapolation case. (B) The machine learning methods are trained in the extrapolation case with a few kinase samples. (C) The machine learning methods are trained in the interpolation case.

Conclusions

In this paper, we propose a robust machine learning method for predicting ligand binding affinity changes upon protein mutation for the cancer target Abl kinase and for identifying resistant mutations. Instead of feeding all samples into the training pool at once, the proposed method starts by learning from easy and diverse samples and gradually takes more complex samples into consideration. By combining SPL with extremely randomized regression trees, the proposed method can quickly grasp easy and comprehensive knowledge and improve generalization performance. Furthermore, the diversity SPL regularization term can incorporate domain knowledge into model training. These hypotheses are substantiated by our experiments on the TKI dataset under three scenarios, which verify the effectiveness of the proposed method for predicting drug response to tyrosine kinase resistance. Our empirical results demonstrate that the proposed method outperforms molecular dynamics simulations with the Amber force fields and the traditional extremely randomized regression trees by a large margin across all scenarios. Although the Pearson correlation coefficient of the proposed method is slightly worse than that of the Rosetta REF15 scoring function in Scenario 1 (i.e. with no tyrosine kinase in the training set), the proposed method improves its ability to predict affinity changes in Abl mutants by learning from a small amount of tyrosine kinase information. In particular, when trained on the most relevant data (i.e. the TKI dataset), it achieves the best performance among all competing computational methods.

The above results demonstrate that a good combination of domain knowledge and machine learning can greatly improve prediction accuracy and generalization. Here, we mainly use protein family information as the grouping criterion for sample selection during SPLDExtraTrees training. We also considered using amino acid (aa) change types (e.g. hydrophobic aa to polar aa; see Table S2 in the supplementary material for more information) as the criterion. The aa change types also improved predictive performance by a small margin in Scenario 1, but much less markedly than protein family information. Interestingly, we obtained good predictions for most aa change types, failing to scale well only for the 'polar to hydrophobic' type (see Figure S14 in the supplementary material for details). This suggests that SPLDExtraTrees can still learn from aa change types, although grouping samples by protein family seems to convey more information pertinent to binding free energy changes.

We tested the proposed method on the TKI dataset, an Abl kinase mutation dataset, and expect SPLDExtraTrees to effectively provide routine predictions of resistance-causing mutations for kinases. In addition, this work mainly addressed single-point mutations, where one aa is replaced with another; predictions for more challenging mutation types (e.g. insertions, deletions and framework changes) and multi-point mutations need to be explored in future work. We expect our method could also be applied to these scenarios, although appropriate datasets and the incorporation of additional domain knowledge, through feature engineering or model training objectives, might be required.

Key Points
  • A robust machine learning method, termed SPLDExtraTrees, is developed to predict ligand binding affinity changes upon protein mutation and identify resistance-causing mutations. It incorporates domain knowledge into model training and improves generalization performance.

  • We calculate additional physics-based structural features to provide the machine learning model with valuable domain knowledge on proteins for these data-limited predictive tasks.

  • The experiments substantiate the capability of the proposed method for predicting kinase inhibitor resistance under three scenarios, achieving predictive accuracy comparable with that of molecular dynamics and Rosetta methods at much lower computational cost.

  • The proposed method would offer a timely and useful tool for investigating resistance-causing mutations, which would shed light upon novel drug design and precision medicine.

Supporting information Available

The Supplementary Material is available in the 'Supplementary Material.docx' file. Data S1: Detailed information on the dataset and numerical results for all calculations, provided as an Excel spreadsheet (XLSX). Data S2: Input files used for the training and testing of the machine learning methods, provided as an Excel spreadsheet (XLSX). The code is available at: https://github.com/pmobio/SPLDExtraTrees.

Author contributions statement

Z.Y.Y. and Y.J.X. designed the model and conducted the experiments, Z.Y.Y. and Z.F.Y. analyzed the results. Z.Y.Y., Z.F.Y. and C.Y.H. wrote the manuscript. Z.Y.Y., Z.F.Y., C.Y.H. and S.Y.Z. reviewed the manuscript.

Acknowledgments

The authors thank the anonymous reviewers for their valuable suggestions.

Zi-Yi Yang has been a researcher at Tencent Quantum Laboratory since 2020. She received her PhD degree in Computer Science from the Macau University of Science and Technology, Macau, China in 2020. From 2015 to 2017, she was a bioinformatics analysis engineer at BGI and China National GeneBank, Shenzhen, China. Her research interests include artificial intelligence-driven drug discovery, bioinformatics and machine learning.

Zhao-Feng Ye received his PhD in computational biology from the School of Medicine, Tsinghua University. He joined Tencent Quantum Lab to work on AI-assisted drug discovery. His research interests include virtual screening pipeline development, drug effect evaluation and drug target identification.

Yi-Jia Xiao is a fourth-year undergraduate from the Department of Computer Science and Technology of Tsinghua University. His current research interests include computational biology, natural language processing and their applications.

Chang-Yu Hsieh is currently a senior researcher at Tencent Quantum Laboratory since 2018. He received his PhD degree in Physics from the University of Ottawa in 2012 and worked as a postdoctoral researcher at the University of Toronto (2012–2013) and Massachusetts Institute of Technology (2013–2016), respectively. Before joining Tencent, he worked as a senior researcher at Singapore-MIT Alliance for Science and Technology (2017–2018). His research interests span across quantum information science, nonequilibrium statistical physics, theoretical chemistry and machine learning for chemistry.

Sheng-Yu Zhang obtained his bachelor degree in mathematics from Fudan University in 1999, his master degree in computer science from Tsinghua University in 2002 under the supervision of Mingsheng Ying, and his PhD in computer science from Princeton University in 2006 under the supervision of Prof. Andrew Yao. He then worked at the California Institute of Technology as a two-year postdoc, hosted by Prof. John Preskill and Prof. Leonard Schulman. He joined The Chinese University of Hong Kong as an assistant professor in 2008 and became an associate professor in 2014. In January 2018, he joined Tencent as a Distinguished Scientist and founding director of Tencent Quantum Laboratory.

Shengyu Zhang’s research interest lies in quantum computing, algorithm designing, computational complexity and foundation of artificial intelligence. He is an editor of Theoretical Computer Science and of International Journal of Quantum Information. He published numerous papers and served as PC member in leading conferences in theoretical computer science, quantum computing and artificial intelligence.

References

1.

Aldeghi
M
,
Gapsys
V
,
de
Groot
BL
.
Accurate estimation of ligand binding affinity changes upon protein mutation
.
ACS central science
2018
;
4
(
12
):
1708
18
.

2.

Aldeghi
M
,
Gapsys
V
,
de
Groot
BL
.
Predicting kinase inhibitor resistance: physics-based and data-driven approaches
.
ACS central science
2019
;
5
(
8
):
1468
74
.

3.

Alford
RF
,
Leaver-Fay
A
,
Jeliazkov
JR
, et al.
The rosetta all-atom energy function for macromolecular modeling and design
.
Journal of chemical theory and computation
2017
;
13
(
6
):
3031
48
.

4.

Arora
A
,
Scholar
EM
.
Role of tyrosine kinase inhibitors in cancer therapy
.
Journal of Pharmacology and Experimental Therapeutics
2005
;
315
(
3
):
971
9
.

5.

Barlow
S
,
Kyle
A
,
Thompson
S
, et al.
Flex ddg: Rosetta ensemble-based estimation of changes in protein–protein binding affinity upon mutation
.
J Phys Chem B
2018
;
122
(
21
):
5389
99
.

6.

Bengio
Y
,
Louradour
J
,
Collobert
R
, et al. Curriculum learning. In:
ICML' 09: The 26th Annual International Conference on Machine Learning held in conjunction with the 2007 International
, Montreal Quebec Canada, June 14–18,
2009
,
41
8
. Association for Computing Machinery, New York, NY, United States.

7.

Best
RB
,
Hummer
G
.
Optimized molecular dynamics force fields applied to the helix- coil transition of polypeptides
.
J Phys Chem B
2009
;
113
(
26
):
9004
15
.

8.

Bhullar
KS
,
Lagarón
NO
,
McGowan
EM
, et al.
Kinase-targeted cancer therapies: progress, challenges and future directions
.
Mol Cancer
2018
;
17
(
1
):
1
20
.

9.

Chen
Y
,
Haoyu
L
,
Zhang
N
, et al.
Premps: Predicting the impact of missense mutations on protein stability
.
PLoS Comput Biol
2020
;
16
(
12
):e1008543.

10. Das S, Dawson NL, Orengo CA. Diversity in protein domain superfamilies. Curr Opin Genet Dev 2015;35:40–9.

11. Dehouck Y, Grosfils A, Folch B, et al. Fast and accurate predictions of protein stability changes upon mutations using statistical potentials and neural networks: PoPMuSiC-2.0. Bioinformatics 2009;25(19):2537–43.

12. Ding C, Peng H. Minimum redundancy feature selection from microarray gene expression data. J Bioinform Comput Biol 2005;3(2):185–205.

13. Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 2002;97(457):77–87.

14. Fowler PW, Cole K, Gordon NC, et al. Robust prediction of resistance to trimethoprim in Staphylococcus aureus. Cell Chemical Biology 2018;25(3):339–49.

15. Gapsys V, Michielssens S, Seeliger D, et al. pmx: automated protein structure and topology generation for alchemical perturbations. J Comput Chem 2015;36:348–54.

16. Getov I, Petukh M, Alexov E. SAAFEC: predicting the effect of single point mutations on protein folding free energy using a knowledge-modified MM/PBSA approach. Int J Mol Sci 2016;17(4):512.

17. Golub TR, Slonim DK, Tamayo P, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999;286(5439):531–7.

18. Guerois R, Nielsen JE, Serrano L. Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J Mol Biol 2002;320(2):369–87.

19. Hauser K, Negron C, Albanese SK, et al. Predicting resistance of clinical Abl mutations to targeted kinase inhibitors using alchemical free-energy calculations. Communications Biology 2018;1(1):1–14.

20. Hornak V, Abel R, Okur A, et al. Comparison of multiple Amber force fields and development of improved protein backbone parameters. Proteins: Structure, Function, and Bioinformatics 2006;65(3):712–25.

21. Housman G, Byler S, Heerboth S, et al. Drug resistance in cancer: an overview. Cancers 2014;6(3):1769–92.

22. Ji B, He X, Zhai J, et al. Machine learning on ligand-residue interaction profiles to significantly improve binding affinity prediction. Brief Bioinform 2021;22(5):bbab054.

23. Jiang L, Meng D, Mitamura T, et al. Easy samples first: self-paced reranking for zero-example multimedia search. In: Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, Florida, USA, November 3–7, 2014, 547–56. Association for Computing Machinery, New York, NY, United States.

24. Lu J, Meng D, Yu S-I, et al. Self-paced learning with diversity. In: Ghahramani Z, Welling M, Cortes C, Lawrence N, Weinberger KQ (eds). 28th Annual Conference on Neural Information Processing Systems 2014, December 8–13, 2014, Montreal, Quebec, Canada, 2078–86. MIT Press, Cambridge, MA, United States.

25. Juchum M, Günther M, Laufer SA. Fighting cancer drug resistance: opportunities and challenges for mutation-specific EGFR inhibitors. Drug Resist Updat 2015;20:12–28.

26. Kumar MP, Packer B, Koller D. Self-paced learning for latent variable models. In: Lafferty J, Williams C, Shawe-Taylor J, Zemel R, Culotta A (eds). 24th Annual Conference on Neural Information Processing Systems 2010, 6–9 December 2010, Vancouver, British Columbia, Canada, Vol. 1, 2010. MIT Press, Cambridge, MA, United States.

27. Lee JW, Lee JB, Park M, et al. An extensive comparison of recent classification tools applied to microarray data. Computational Statistics & Data Analysis 2005;48(4):869–85.

28. Li G, Panday SK, Alexov E. SAAFEC-SEQ: a sequence-based method for predicting the effect of single point mutations on protein thermodynamic stability. Int J Mol Sci 2021;22(2):606.

29. Li T, Zhang C, Ogihara M. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics 2004;20(15):2429–37.

30. Liang Y, Liu C, Luan X-Z, et al. Sparse logistic regression with a L1/2 penalty for gene selection in cancer classification. BMC Bioinformatics 2013;14(1):1–12.

31. Lindorff-Larsen K, Piana S, Palmo K, et al. Improved side-chain torsion potentials for the Amber ff99SB protein force field. Proteins: Structure, Function, and Bioinformatics 2010;78(8):1950–8.

32. Lovly CM, Shaw AT. Molecular pathways: resistance to kinase inhibitors and implications for therapeutic strategies. Clin Cancer Res 2014;20(9):2249–56.

33. Monari G, Dreyfus G. Withdrawing an example from the training set: an analytic estimation of its effect on a non-linear parameterised model. Neurocomputing 2000;35(1–4):195–201.

34. Neel DS, Bivona TG. Resistance is futile: overcoming resistance to targeted therapies in lung adenocarcinoma. NPJ Precision Oncology 2017;1(1):1–6.

35. Patel AB, O'Hare T, Deininger MW. Mechanisms of resistance to Abl kinase inhibition in chronic myeloid leukemia and the development of next generation Abl kinase inhibitors. Hematology/Oncology Clinics 2017;31(4):589–612.

36. Pires DEV, Ascher DB, Blundell TL. mCSM: predicting the effects of mutations in proteins using graph-based signatures. Bioinformatics 2014;30(3):335–42.

37. Pires DEV, Blundell TL, Ascher DB. Platinum: a database of experimentally measured effects of mutations on structurally defined protein–ligand complexes. Nucleic Acids Res 2015;43(D1):D387–91.

38. Pottier C, Fresnais M, Gilon M, et al. Tyrosine kinase inhibitors in cancer: breakthrough and challenges of targeted therapy. Cancers 2020;12(3):731.

39. Rivals I, Personnaz L. MLPs (mono layer polynomials and multi layer perceptrons) for nonlinear modeling. The Journal of Machine Learning Research 2003;3:1383–98.

40. Roskoski R Jr. Properties of FDA-approved small molecule protein kinase inhibitors: a 2021 update. Pharmacol Res 2021;165:105463.

41. Salentin S, Schreiber S, Haupt VJ, et al. PLIP: fully automated protein–ligand interaction profiler. Nucleic Acids Res 2015;43(W1):W443–7.

42. Savojardo C, Fariselli P, Martelli PL, et al. INPS-MD: a web server to predict stability of protein variants from sequence and structure. Bioinformatics 2016;32(16):2542–4.

43. Schymkowitz J, Borg J, Stricher F, et al. The FoldX web server: an online force field. Nucleic Acids Res 2005;33(suppl_2):W382–8.

44. Shen C, Ye H, Wang Z, et al. Can machine learning consistently improve the scoring power of classical scoring functions? Insights into the role of machine learning in scoring functions. Brief Bioinform 2021;22(1):497–514.

45. Shu J, Xie Q, Yi L, et al. Meta-Weight-Net: learning an explicit mapping for sample weighting. In: Wallach HM, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox EA, Garnett R (eds). Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8–14 December 2019, Vancouver, BC, Canada, 2019.

46. Steinbrecher TB, Dahlgren M, Cappel D, et al. Accurate binding free energy predictions in fragment optimization. J Chem Inf Model 2015;55(11):2411–20.

47. Sun T, Chen Y, Wen Y, et al. PremPLI: a machine learning model for predicting the effects of missense mutations on protein–ligand interactions. Communications Biology 2021;4(1):1–11.

48. Trott O, Olson AJ. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem 2010;31(2):455–61.

49. Wang C, Zhang Y. Improving scoring-docking-screening powers of protein–ligand scoring functions using random forest. J Comput Chem 2017;38(3):169–77.

50. Wang DD, Zhu M, Yan H. Computationally predicting binding affinity in protein–ligand complexes: free energy-based simulations and machine learning-based scoring functions. Brief Bioinform 2021;22(3):bbaa107.

51. Wang J, Wolf RM, Caldwell JW, et al. Development and testing of a general Amber force field. J Comput Chem 2004;25(9):1157–74.

52. Wang L, Wu Y, Deng Y, et al. Accurate and reliable prediction of relative ligand binding potency in prospective drug discovery by way of a modern free-energy calculation protocol and force field. J Am Chem Soc 2015;137(7):2695–703.

53. Ward RA, Fawell S, Floc'h N, et al. Challenges and opportunities in cancer drug resistance. Chem Rev 2020;121(6):3297–351.

54. Weisberg E, Manley PW, Cowan-Jacob SW, et al. Second generation inhibitors of BCR-ABL for the treatment of imatinib-resistant chronic myeloid leukaemia. Nat Rev Cancer 2007;7(5):345–56.

55. Lu XY, Cai Q, Ding K. Recent developments in the third generation inhibitors of BCR-ABL for overriding T315I mutation. Curr Med Chem 2011;18(14):2146–57.

56. Yang Z-Y, Liang Y, Zhang H, et al. Robust sparse logistic regression with the Lq (0 < q < 1) regularization for feature selection using gene expression data. IEEE Access 2018;6:68586–95.

57. Yang Z-Y, Liu X-Y, Shu J, et al. Multi-view based integrative analysis of gene expression data for identifying biomarkers. Sci Rep 2019;9(1):1–15.

58. Yang Z, Wu N, Liang Y, et al. SMSPL: robust multimodal approach to integrative analysis of multiomics data. IEEE Transactions on Cybernetics 2020;1–14.

59. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J R Stat Soc Series B Stat Methodology 2006;68(1):49–67.

60. Zehir A, Benayed R, Shah RH, et al. Mutational landscape of metastatic cancer revealed from prospective clinical sequencing of 10,000 patients. Nat Med 2017;23(6):703–13.

61. Zilian D, Sotriffer CA. SFCscore RF: a random forest-based scoring function for improved affinity prediction of protein–ligand complexes. J Chem Inf Model 2013;53(8):1923–33.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)