An overview of machine learning methods for monotherapy drug response prediction

Abstract For an increasing number of preclinical samples, both detailed molecular profiles and their responses to various drugs are becoming available. Efforts to understand, and predict, drug responses in a data-driven manner have led to a proliferation of machine learning (ML) methods, with the longer term ambition of predicting clinical drug responses. Here, we provide a uniquely wide and deep systematic review of the rapidly evolving literature on monotherapy drug response prediction, with a systematic characterization and classification that comprises more than 70 ML methods in 13 subclasses, their input and output data types, modes of evaluation, and code and software availability. ML experts are provided with a fundamental understanding of the biological problem, and how ML methods are configured for it. Biologists and biomedical researchers are introduced to the basic principles of applicable ML methods, and their application to the problem of drug response prediction. We also provide systematic overviews of data sources commonly used for training and evaluating methods.


Introduction
The continued development of new high-throughput molecular profiling ('omics') technologies and their concomitant increased accessibility for biomedical applications make it increasingly attractive to study complex biological phenomena using data-driven and machine learning (ML) approaches [1]. ML is a specific subset of artificial intelligence that allows automatic learning from data, and has contributed to a wide range of genomics research [2]. One particular application of ML is drug response prediction (DRP), in which phenotypic responses of biological samples are predicted on the basis of their molecular profiles, with predictors often providing mechanistic insight.
In the context of biomedical research, DRP has, until now, largely focused on the response to anticancer drugs, possibly owing to the importance of cancer as a major public health issue, and the relative ease with which cancer cells can be cultured for in vitro drug screening assays. Cancer is a multifactorial and heterogeneous disease; each tumour typically has its own unique molecular characteristics, often leading to diverse responses to the same therapy for patients despite similar [...] DRP datasets [15][16][17][18] make them the most frequently used input for ML-based DRP.
Given the omics data acquired from cell lines or tumours, along with their phenotypic responses to drugs, various ML approaches have been employed for DRP, with different inputs, models and learning processes, and different levels of complexity and interpretability of the resulting DRP models. DRP models range from established simple linear regression models to more recent and more complex models, such as deep neural networks (DNNs). In addition, DRP methods can use biomolecular networks (between, e.g. genes and cell lines) that represent known or derived relationships.
The most recent comprehensive review of DRP methods, published in 2015 [19], naturally does not encompass important recent developments, such as recommender systems-based DRP. Other, more recent, reviews focus on specific aspects. Adam et al. [20] provide a broad view of methods linked to recent technologies, such as deep learning and single cell sequencing, and discuss important extensions, such as combination therapy and the prediction of drug resistance. Ali and Aittokallio [21] focus on the principled lessons from the NCI-DREAM7 community challenge [22]. Paltun et al. [23] discuss three specific subclasses of DRP methods based on data integration approaches. Furthermore, Chen and Zhang [24] provide a systematic performance evaluation of 17 selected DRP methods.
The present review thus intends to provide a structured, up to date, panorama of DRP methods from the past 10 years, rather than deeper technical discussions relating to specific classes of methods, which can be found in some of the above reviews and the original and ML literature. While introducing basic biological and ML concepts when needed, (i) we discuss all major databases commonly used to learn DRP ML models, together with a systematic overview of the drug response measurement that was used, numbers of drugs/samples, and URLs, (ii) we survey more than 70 DRP methods across 13 subcategories, together with a systematic overview of used inputs, outputs, evaluation approach, and code and software availability, and (iii) we discuss common approaches to evaluate DRP methods.

Databases
A number of ambitious experimental efforts have produced large pharmacogenomic datasets, i.e. collections of drug responses of clinical or preclinical samples, along with their molecular profiles. Box I provides an overview of commonly used preclinical drug response metrics that are used to represent the drug responses in these datasets.
Pharmacogenomic data have provided insight into the relationship between clinical and preclinical molecular profiles and responses to anticancer drugs, and drug response-predictive biomarkers. In this section, we survey the commonly used large publicly available pharmacogenomic datasets.
The Cancer Genome Atlas (TCGA) [25] provides genomic, epigenomic, transcriptomic and proteomic features of over 11 000 patients across 33 cancer types. TCGA is a collaborative effort between the National Cancer Institute and the National Human Genome Research Institute that started in 2006 with the aim of better understanding cancer biology and improving cancer treatment and prevention. In TCGA, patient drug responses are categorized into four classes (complete response, partial response, progressive disease and stable disease) based on tumour growth measured by RECIST, an imaging-based criterion to evaluate the response in solid tumours [26]. The TCGA dataset can be downloaded from the Genomic Data Commons Data Portal.
Based on the hypothesis that in vitro drug activity on cancer cell lines can capture essential aspects of in vivo drug activity on corresponding tumours [12], molecular and drug sensitivity profiles across a large number of cell lines have been generated; Table 1 provides an overview. In the late 1980s, a project led by the US National Cancer Institute measured the ability of thousands of anticancer drugs to inhibit cell growth across 60 human cancer cell lines [6,12]. The NCI-60 panel of measurements includes gene expression (GE), copy number variation (CNV), single nucleotide polymorphism (SNP), DNA methylation (DM) and proteomics profiles of each cell line. Analysis of this data resulted in the discovery of a number of gene mutations that are highly predictive of clinical responses [27] and provided new insights into drug activity modulators and drug-target interactions.
Following the NCI-60 effort, two large collaborative datasets were established: the Cancer Cell Line Encyclopedia (CCLE) [16] and the Genomics of Drug Sensitivity in Cancer (GDSC) [15]. CCLE is a collaboration of the Broad Institute and Novartis, and GDSC is a collaboration between The Cancer Genome Project at the Wellcome Trust Sanger Institute (UK) and the Center for Molecular Therapeutics at the Massachusetts General Hospital Cancer Center (USA). Both datasets contain GE, CNV, DM, and mutational status (MS) of around 1000 human cancer cell lines from various tumour types, together with pharmacological profiles across a large fraction of these cell lines, for 24 and 518 drugs, respectively, in CCLE and GDSC. Although several studies pointed at inconsistencies in the drug response measurements between CCLE and GDSC [28,29], others argued that the inconsistencies disappear under a specific, reasonable, interpretation of the data [18,30,31,32,33]. To shed more light on this issue, the Genentech Cell Line Screening Initiative [18] profiled 410 cell lines common to GDSC and CCLE using various approaches, including RNA-seq GE, MS and CNV, along with their responses to 16 drugs tested in both studies. The outcomes of this study supported the previously proposed mode of consistent interpretability across CCLE and GDSC data. PharmacoDB [34] provides large publicly available preclinical datasets that have been prepared for integrated use, with special emphasis on consistency.
The Cancer Therapeutics Response Portal (CTRP) [17] provided another pharmacogenomic dataset with a variety of different molecular features of cell lines, including MS, GE, CNV and tissue type. The last published version of this dataset includes 860 characterized cancer cell lines and their responses to 481 compounds. DepMap is a data repository that aggregates a number of different preclinical datasets (including GDSC, CCLE and CTRP).

DRP models
The high complexity of the biological response to drugs and the increasing feasibility of large-scale molecular profiling experiments make ML approaches increasingly attractive for modelling and prediction of drug responses. DRP models are most often established using a set of known phenotypic drug responses (supervised learning). Generally, the objective of DRP models is the prediction of the response to drugs for a given sample from its molecular profile (Figure 1). Regression estimates a quantitative drug response (Box I) for each sample, while classification predicts discrete labels that stratify samples into sensitive or resistant. To obtain ground truth labels in preclinical samples, the continuous measure of drug response is thresholded using approaches such as median-based [35], maximum concentration-based [14], waterfall-based [16] and predefined [36][37][38] thresholds. From a practical perspective, discretization enables the use of classification methods, but it also loses information, and, if predictions are to be interpreted in a clinical context, the clinical relevance of the resulting classes requires special attention. Feature selection, i.e. the maximal removal of data features with a minimal loss of problem-relevant information, is an important part of DRP, as it is a major antidote to the statistical and computational problems that the high-dimensional omics input data typically entail. For focused discussions of feature selection, we refer the reader to Webb and Copsey [39], and, for their use in DRP, to the study by Koras et al. [40].
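As an illustration of the discretization step described above, the following minimal sketch applies a median-based threshold to a continuous response measure. The function name, the example values and the convention that lower values mean higher sensitivity (as for IC50-like measures) are assumptions made for this example, not details taken from the cited studies.

```python
from statistics import median


def median_threshold_labels(responses):
    """Label samples relative to the median of a continuous response measure.

    Assumption for this sketch: lower values (e.g. log IC50) mean that the
    sample is more sensitive to the drug.
    """
    cutoff = median(responses)
    return ["sensitive" if r < cutoff else "resistant" for r in responses]


# Hypothetical log IC50 values of five cell lines for a single drug.
log_ic50 = [-1.2, 0.4, 2.1, -0.3, 1.5]
labels = median_threshold_labels(log_ic50)
print(list(zip(log_ic50, labels)))
```

Samples close to the cutoff receive opposite labels despite nearly identical responses, which is one concrete form of the information loss mentioned above.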
Network-based machine learning models are relatively recent, and increasingly popular for DRP. Here, networks (often equivalent to mathematical graphs) can refer to sets of interactions between cell lines, drugs, genes, proteins or their combination. Networks carry different types of information, and their use and interpretation vary greatly across the different uses for DRP.
DRP models can be classified as single-drug learning (SDL) and multi-drug learning (MDL). To predict the response to a given drug, MDL leverages data from other drugs; SDL exclusively uses data about the given drug. SDL and MDL differ in their capabilities and validation process. Table 2 provides a systematic overview of all DRP articles covered here.

SDL models
SDL models are trained to predict drug responses for a given drug, independent of other drugs. In this section, we survey the SDL approaches in the literature across seven subclasses.

Linear regression
Regression analysis is a set of statistical methods for estimating the relationship between a dependent variable and one or more independent variables. In its simplest form, linear regression uses a linear function to model this relationship, and enjoys remarkable popularity due to its simplicity and interpretability.
In DRP, the number of available observations often does not suffice to determine a unique model. Regularization techniques address this issue by limiting the possible solutions through additional assumptions [41,42], and are therefore a popular choice for DRP. Shrinkage is a form of regularization in which parameter values close to zero are preferred when they do not clearly contribute to the predictive performance of the model. Ridge regression [43] includes a form of shrinkage, and was used to assess whether models trained on preclinical samples can predict clinical drug responses [30,44]. Ridge regression typically does not yield sparse models (i.e. models with few independent variables that are easier to interpret). Lasso regularization, on the other hand, forces some coefficients to exactly zero, with their number controlled by a hyperparameter [45]. In a recent study, Huang et al. proposed Tissue-Guided Lasso to derive tissue-specific models through tissue-specific choices for their hyperparameter. They defined tissue-specific hyperparameters by training each lasso model on all cell lines, excluding those from the tissue of interest, and then calibrated the hyperparameter using data from that tissue [46]. One principled limitation of Lasso is that the number of non-zero coefficients cannot be higher than the number of data samples [42]. To derive sparse models even in cases when the number of features is larger than the sample size, the elastic net [41] combines ridge regression and lasso regularization, and has also been used for DRP [16,47,48,49,50].
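For orientation, the three regularized objectives discussed above can be written side by side. This is the common textbook form, given here as a sketch; the symbols are assumptions of this example: X is the matrix of molecular features (samples by features), y the vector of drug responses, β the coefficient vector, and λ, λ₁, λ₂ ≥ 0 the regularization hyperparameters. Individual papers and software packages may scale or parametrize the penalties differently.

```latex
\begin{aligned}
\text{ridge:} \quad & \hat{\beta} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \\
\text{lasso:} \quad & \hat{\beta} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1 \\
\text{elastic net:} \quad & \hat{\beta} = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2
\end{aligned}
```

The squared penalty of ridge regression shrinks coefficients without setting them to zero, whereas the absolute-value penalty of the lasso produces exact zeros; the elastic net contains both terms.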
Biomolecular networks represent interactions of biomolecules in the same functional context that can be leveraged for DRP. These interactions can be derived from prior knowledge or independent experimental data. A common use of networks for DRP is to use them for regularization, by assuming that model coefficients for functionally related molecules are similar. Sokolov et al. [51] introduced the generalized elastic net model that incorporates the Pathway Commons [52] network in an elastic net approach through a network-based regularization term.
Partial least squares regression (PLSR) is another type of linear method, based on latent variable projection [53]. In PLSR, input and output data are each projected into a (usually low-dimensional) new space, and then a relationship between them is established using linear regression. This makes PLSR suitable for analysing high-dimensional data and DRP [35,54].

Logistic regression
Instead of predicting drug response as a continuous variable, samples can also be classified as sensitive or resistant. Logistic regression is a common classification approach that can be derived, in a Bayesian framework, from a simple probabilistic model that expresses drug response as a function of molecular profiles [30,44,55,56]. Classifiers derived from these models have linear decision boundaries, i.e. lines or hyperplanes that separate different classes in the input space [42,57]. The assumption of linearity can be a strength in the case of small datasets, but a limitation when larger datasets would be able to resolve more complex decision boundaries. Huang et al. [58] proposed a logistic regression model that is additionally constrained based on biomolecular networks.
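The linear decision boundary mentioned above follows directly from the usual form of the model, sketched here with assumed notation: x is the molecular profile of a sample, w a weight vector, b an intercept, and a probability cutoff of 0.5 is taken as the classification rule.

```latex
P(\text{sensitive} \mid x) = \sigma\!\left(w^{\top} x + b\right),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}},
\qquad
P(\text{sensitive} \mid x) = \tfrac{1}{2} \;\Longleftrightarrow\; w^{\top} x + b = 0
```

The set of profiles satisfying the last equation is a hyperplane in the input space, which is the sense in which the classifier is linear.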

Support vector machines
Support vector machines (SVMs) [42,57,59] are ML models that can be used for classification and regression. SVMs are efficiently learnable, and can employ nonlinear kernels (nonlinear transformation functions). Dong et al. used an SVM to classify samples as sensitive or resistant. SVM regression (SVR) [60] has also been used for drug prioritization [61] and response prediction [62,63]. The selection of a kernel function and the interpretation of SVMs are typically not straightforward.

Nearest neighbour method
Nearest neighbour methods are simple and intuitively understandable classification and regression methods. In the K-Nearest Neighbour (KNN) method, the output for each new sample is determined on the basis of the outputs of a small number (commonly denoted by K) of molecular profiles that are the most similar to the profile of the new sample [42]. Li et al. [64] employed KNN regression for DRP, in which the sensitivity of a new sample is modelled as the average of the observed sensitivity values of its K nearest neighbours. Nearest neighbour methods tend to be problematic in the inadvertent, and typically unknown, presence of many irrelevant data features, and they are computationally intensive.
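The averaging principle of KNN regression can be sketched in a few lines. The Euclidean distance, the function name and the toy data below are assumptions chosen for illustration; the cited study may use a different similarity measure.

```python
import math


def knn_predict(train_profiles, train_responses, new_profile, k=3):
    """Predict a response as the mean response of the k most similar profiles.

    Similarity is measured by Euclidean distance (an assumption of this sketch).
    """
    scored = [
        (math.dist(profile, new_profile), response)
        for profile, response in zip(train_profiles, train_responses)
    ]
    scored.sort(key=lambda pair: pair[0])  # closest profiles first
    nearest = scored[:k]
    return sum(response for _, response in nearest) / len(nearest)


# Hypothetical two-feature profiles and drug sensitivities.
profiles = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
responses = [1.0, 2.0, 3.0, 10.0, 12.0]
prediction = knn_predict(profiles, responses, [0.2, 0.1], k=3)
print(prediction)
```

Every prediction scans all training profiles, which reflects the computational cost noted above, and each irrelevant feature added to the profiles distorts the distances on which the neighbour selection rests.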

Artificial neural networks and deep learning
Artificial neural networks (ANNs) are powerful ML models that can approximate arbitrary input-output relationships, given a [...] In their generic form, ANNs require more input data than simpler models. Ma et al. [65] used a transfer learning approach to address the problem of relatively little available DRP data for advanced preclinical models. In a first pre-training phase of their TCRP method, a multilayer ANN is learned using a larger number of basic preclinical training samples (cell lines). In the following learning phase, a small dataset from advanced preclinical models is used to recalibrate the model to the new context.
ANNs with many layers are called DNNs. Deep learning has recently attracted considerable attention across many fields and has become popular for DRP as well [66][67][68]. Encoders are particular types of ANNs, typically used in DNNs for feature reduction, i.e. the encoding of higher dimensional information using fewer dimensions with no or little loss of information. Stacked encoder-decoder networks can be trained simultaneously as autoencoders without the need for output data [69,70]. Ding et al. trained a four-layer autoencoder, and the output of each layer was separately used to train a logistic elastic net to predict drug responses [56]. Sharifi-Noghabi et al. [71] introduced a DNN that learns encoders of different molecular data types separately, and whose outputs are fed to a multilayer neural network for DRP. Determining the best structure of ANNs and interpreting the resulting models are non-trivial, and learning ANNs tends to be computationally expensive.
Graph neural networks (GNNs) are ANNs that operate on graphs (i.e. sets of interactions) as inputs. Recently, deep GNNs have been used to learn low-dimensional representations of biomolecular networks [72,73]. Ahmed et al. [74] employed two different GNN methods to use GE and a gene co-expression network, i.e. a network representing the correlation between the expression of gene pairs, to construct a GNN. In contrast to the use of GNNs for MDL (discussed in Section ANNs and Deep Learning below), GNNs have not been shown to improve significantly over the SDL performance of classical ML models.

Network similarity-based methods
Network similarity-based methods are based on the assumption that cell lines with similar network-level information respond similarly to the same drug. Class-specific networks, typically a sensitive network and a resistant network, are then constructed to capture features characteristic of the samples from each class. Although the above assumption underlying this relatively recent class of methods is simple, its validity needs to be established for each particular interpretation of network similarity.
Stanfield et al. [75] construct a 'base network' of genes and cell lines, in which cell lines are linked to their mutated genes, which, in turn, are connected by BioGRID protein-protein interactions [76]. To inject information about drug responses, the base network is then augmented with links between cell lines and the drug they are sensitive to, to obtain a sensitive network, and, analogously, a resistant network. The augmentation is designed to lead to tighter network connectivity between the drug, cell lines and genes involved in sensitivity or resistance, respectively. A new cell line is then classified based on the similarity of the connectivity of its base network with each of the class-specific networks. Kim et al. [77] proposed another biomolecular network-based classifier that exploits the presence of class-specific gene co-expression patterns. First, sensitive and resistant gene co-expression networks are constructed from the GE values of the corresponding cell lines. Next, for each gene in each of these networks, class-specific SVR models are trained to predict GE values on the basis of the GE of neighbouring genes in the class-specific network. For a new sample, the accuracy of the class-specific SVR prediction for any given gene is then evaluated, and interpreted as evidence for, or against, the membership of the new sample in the corresponding class.

Ensemble learning methods
By aggregating the predictions of multiple models, ensemble learning methods are often able to improve over the performance of any individual model [78]. Random forests (RFs) are established ensemble learning models that aggregate several regression or classification trees [79]. Trees are nonlinear ML models that predict the output by hierarchically partitioning the input space into a set of rectangles. The predicted output for the classification problem is the class label associated with the rectangle that corresponds to a new input. For the regression problem, the output is commonly the average value of the training outputs corresponding to each rectangle. RF trees are constructed from a random selection of the samples, and the remaining samples, called out-of-bag samples, can be used for validation. Riddick et al. [80] and Daemen et al. [81] initially used RF to predict drug sensitivity and identify molecular features associated with drug responses. Stetson et al. [82] selected differentially expressed genes between the most sensitive and resistant cell lines, and then used them to train three classifiers using RF, SVM and logistic elastic net. Wan & Pal [83] proposed an RF regression-based approach using multiple omics data. To this end, they trained multiple RF regressions for different data types, and combined the individual predictions using linear regression. Moreover, Xu et al. [84] proposed a method called AutoBorutaRF, which employs an autoencoder-based feature selection algorithm and the Boruta algorithm [85] to improve the performance of RF.
To incorporate biological prior knowledge into the RF, Oskooei et al. [86] presented NetBiTE, in which an RF is biased towards using the known drug targets and their STRING PPI [87] interactors based on a network diffusion algorithm, i.e. a mathematical transformation of omics data that allows the detection of hypothetically relevant network regions [88]. Zhou & Feng [89] proposed an ensemble method called gcForest, whose performance was found to be competitive with DNNs. A modified version of gcForest was used for drug response classification [38].
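The relationship between the samples used to build a tree and its out-of-bag samples can be made concrete with a small sketch. It assumes the usual bootstrap scheme, in which each tree draws as many samples as the dataset contains, with replacement; the function name and the dataset size are hypothetical.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw a bootstrap sample and return (in-bag indices, out-of-bag indices)."""
    # Sampling with replacement: some samples repeat, others are never drawn.
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
for tree_index in range(3):
    in_bag, out_of_bag = bootstrap_split(10, rng)
    print(tree_index, sorted(in_bag), out_of_bag)
```

Because each tree never sees its own out-of-bag samples, predictions on those samples provide a validation estimate without setting aside a separate dataset.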
Fang et al. [90] argue that, due to the uncertainty inherent in DRP prediction, clinical applications benefit from richer output than a single predicted drug response. Their RF-based predictor is adapted to non-standard output distributions, and yields prediction intervals that, as they argue, may provide a better basis for the choice of appropriate treatments.

MDL models
MDL models use data from multiple drugs to predict the response of a given drug. This may lead to improved predictions for drugs with training data, and, potentially, the prediction of responses to drugs for which no training data are available.

Multitask linear regression
Linear regression is also employed for MDL. In a DRP challenge competition performed by NCI-DREAM [22], the best-performing method was Bayesian multi-task multiple kernel learning (MKL), in which multiple data types (including original omics datasets and computationally derived data) are first kernelized (i.e. converted into nonlinear transfer functions), and then a global similarity matrix is constructed as their weighted sum. Next, a set of drug-specific weights, determined through multitask linear regression, are used to provide the final drug response values. Ammad-ud-din et al. [91] pointed out that, although kernels may enhance the model predictions, the resulting models are typically hard to interpret, in terms of the relationship of their inputs to their outputs. To address this issue, they created the MVLR model, which aggregates groups of molecular features that are known to be biologically connected, and uses them as inputs to a multitask linear regression. In another study, Yuan et al. [92] jointly trained multilinear regression models based on trace norm regularization, which can increase the dependencies between model parameters to yield a more manageable number of biologically meaningful features.
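The construction of a global similarity matrix as a weighted sum of kernels can be sketched as follows. The Gaussian kernel, the two toy data views and the hand-picked weights are assumptions of this example; in the Bayesian method described above, the kernel weights are inferred from the data rather than fixed in advance.

```python
import math


def gaussian_kernel(profiles, width=1.0):
    """Pairwise Gaussian similarities between samples of one data type."""
    return [
        [math.exp(-math.dist(a, b) ** 2 / (2 * width ** 2)) for b in profiles]
        for a in profiles
    ]


def combine_kernels(kernels, weights):
    """Global similarity matrix as the weighted sum of per-data-type kernels."""
    n = len(kernels[0])
    return [
        [sum(w * k[i][j] for w, k in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]


# Two hypothetical data views of the same three cell lines.
expression = [[0.0, 1.0], [0.1, 0.9], [3.0, 3.0]]
mutation = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
kernels = [gaussian_kernel(expression), gaussian_kernel(mutation)]
global_similarity = combine_kernels(kernels, weights=[0.7, 0.3])
print(global_similarity)
```

In this toy setting, the first two cell lines are similar in both views and therefore receive a high combined similarity, while the third remains distant from both.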

ANNs and deep learning
ANNs are widely used in MDL. Through shared model parameters, MDL models can leverage information between different drugs and reduce the problem of overfitting, i.e. the learning of parameters driven by characteristics of the training data that are not necessarily present in other data. MDL ANNs have either single or multiple outputs. Each output of a multi-output ANN predicts the response for a specific drug (Figure 2B). Chiu et al. [93] proposed a multitask DNN model for DRP called DeepDR. Their method is composed of mutation and GE encoders that reduce data dimensionality, followed by a multilayer ANN with multiple outputs.
Single-output ANNs, on the other hand, accept omics data along with drug features as input (e.g. chemical descriptors and biological targets; Figure 2C). Unlike multi-output ANNs, they share all model parameters among multiple drugs. The potential advantage of single-output ANNs is their ability to provide predictions for new drugs whose data have not been used during the training phase. Menden et al. [94] employed a fully connected ANN to predict drug response. DeepDSC [95] first reduces the dimension of GEs using an encoder whose output is combined with the chemical drug features to be fed to a multilayer neural network. RefDNN [96] trains, for each drug, an SDL logistic elastic net regression model that predicts the probability of drug resistance. To predict the resistance of a given cell line to a new query drug, the vector of predicted resistance probabilities is then weighted by a structure similarity profile of the query drug with all the drugs as an input to a neural network to classify the response as sensitive or resistant. In an effort to leverage the observed activity of signalling pathways, Tang & Gottlieb [97] proposed a pathway-based method named PathDSP. In their approach, drug structure data along with the pathways enriched for drug targets, GE, somatic mutation and CNV are fed as inputs to a DNN. Zhu et al. [98] used transfer learning to overcome the challenge of learning with limited amounts of data by pre-training their model on a dataset from a related context, and then finalizing the training process in the context in which the model is designed to predict.
Convolutional neural networks (CNNs) are specific types of non-fully connected feedforward ANNs that can exploit natural adjacency structures in the data (such as physical or temporal adjacency) to more efficiently detect global patterns in the input data. CNNs work well for the inputs that contain spatial or temporal patterns (such as images) [70,99,100]. Joo et al. [101] proposed a CNN method, called DeepIC50, to predict three classes of high, intermediate and low responsiveness based on mutation status and drug descriptors. Moreover, in CDRscan [102], the mutation data of cell lines and the molecular fingerprints of drugs are fed to five different versions of a CNN, whose average predicted output value is reported as the final output. Liu et al. [103] introduced a model named tCNNS, in which two CNN branches are respectively used for cell line and drug features to predict drug sensitivities. Bazgir et al. [104] proposed a method called REFINED to convert GE profiles to an image that encodes relevant patterns of the molecular features in specific regions. The resulting images are then fed to a CNN for DRP.
GNNs can be used to encode drug features as points in a low-dimensional space, in which similar drugs are close to each other, by representing the molecular structure of compounds as networks formed by sets of chemical bonds between atoms [105,106]. In DeepCDR [107] and GraphDRP [108], deep drug features learnt by GNNs are concatenated with deep multi-omics features, to be fed to feedforward ANNs for DRP. Manica et al. [109] proposed the MCA approach, in which genes are selected based on a network diffusion algorithm [86]. Attention-based convolutional encoders are a recent development from the field of natural language processing that has been used to reduce the dimensionality of the GE and drug chemical data, to be used as input to a fully connected neural network for DRP [110].

Recommender systems
Recommender systems (RS) are algorithms that yield a set of personalized item suggestions to a user. RSs were developed in the 1990s to automatically recommend movies and other products to users, based on their similarity to previously observed choices by the same user, or to the choices of other users with similar preferences [111,112]. When transposed to DRP, this principle implies that a cell line is highly sensitive to a drug when similar cell lines are highly sensitive to similar drugs. Generally, RSs can be neighbourhood based or model based [111]. The main input used for the RS-based DRP models is a cell line-drug sensitivity matrix, which holds the sensitivity values of cell lines to the screened drugs. Complementary information, such as cell line molecular features and drug descriptor features, can also be used, since it can refine the similarity estimates between different cell lines or different drugs.
Neighbourhood-based RS. Neighbourhood-based methods (also called memory-based methods) are simple types of RS, in which the cell line-drug sensitivity prediction is based on the sensitivities of cell lines with similar characteristics to similar drugs, on the basis of cell line similarity and drug similarity networks (Figure 3A) [113].
Zhang et al. [114] constructed cell line similarity and drug similarity networks using pairwise correlation of GEs and pairwise correlation of quantitative drug features, respectively. For a given cell line-drug pair, the predicted response is based on a weighted average of the known responses of other cell lines to the given drug and of the given cell line to other drugs. These weights are normalized versions of cell line similarities and drug similarities. Le and Pham [115] tried to improve on this by basing their prediction upon all input cell line-drug responses.
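A simplified sketch of such a similarity-weighted prediction is given below. It is not the published implementation: the mixing parameter `mix`, the function names and the toy matrices are assumptions made for illustration, and the similarities are taken as given rather than computed from GE or drug features.

```python
def weighted_average(values, weights):
    """Average of values with weights normalized to sum to one."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


def predict_response(R, cell_sim, drug_sim, c, d, mix=0.5):
    """Predict R[c][d] from similar cell lines (same drug) and similar drugs (same cell line)."""
    other_cells = [i for i in range(len(R)) if i != c]
    other_drugs = [j for j in range(len(R[0])) if j != d]
    from_cells = weighted_average(
        [R[i][d] for i in other_cells], [cell_sim[c][i] for i in other_cells]
    )
    from_drugs = weighted_average(
        [R[c][j] for j in other_drugs], [drug_sim[d][j] for j in other_drugs]
    )
    # Hypothetical mixing of the two partial predictions.
    return mix * from_cells + (1 - mix) * from_drugs


# Toy response matrix (rows: cell lines, columns: drugs) and similarity matrices.
R = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [6.0, 7.0, 8.0]]
cell_sim = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]]
drug_sim = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
prediction = predict_response(R, cell_sim, drug_sim, c=0, d=0)
print(prediction)
```

The entry being predicted is excluded from both averages, so the sketch can be used in a leave-one-out fashion on a fully observed matrix. Handling of missing entries is omitted here.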
One important drawback of the above approaches is their inability to predict in the case that drugs and cell lines are new (i.e. not part of the training data). To address this issue, Wei et al. [116] proposed an extension of Zhang's method called CDCN. Conversely, Sheng et al. [117] raised the possibility that distinct drug responses between two cell lines might only manifest as GE differences in a few specific genes directly related to the drug resistance. To take this into account, they defined cell line similarity scores based on drug-specific sets of genes. To identify this set of genes for any given drug, their method first identifies drugs that are similar in terms of their chemical structures. Second, for each of those similar drugs, those genes whose GEs are highly associated with drug response are determined. Third, the similarity of all other cell lines to the given cell line is defined, based on the expressions of those genes. Finally, a similarity-weighted average of the IC50 values observed in other cell lines, based on the drug and cell line similarities, is used to predict the most sensitive drugs for the given cell line.
The cell line similarity and drug similarity measures in the above studies are only based on the molecular features of cell lines and drug descriptors, respectively. Conversely, the observed cell line-drug responses can also be used to compute meaningful similarity measures [118,119].
Overall, neighbourhood-based RSs yield results based on a simple and intuitive principle. However, these methods tend to be heuristic in nature, due to their, typically ad hoc, selection of similarity measures and their subsequent heuristic downstream use. Further drawbacks of these methods are their typically high memory consumption and computational cost for testing samples, as well as their difficult application on sparse input response matrices [113].
Model-based RS. In contrast to neighbourhood-based methods, model-based RSs require an explicit model and a training process that determines the model parameters. Once learned, models are fast to apply to new data, and relatively little memory is needed to store model parameters. In addition, model-based RSs typically perform better in cases where the cell line-drug sensitivity matrix is sparse [113].
Different learning algorithms can be used to train parameters of model-based methods. Typical RS-based methods for DRP are matrix factorization methods that construct latent factor models [113]. Such methods commonly represent the cell line-drug sensitivity matrix as a product of two matrices, allowing an interpretation of cell lines and drugs as vectors in a latent space (Figure 3B).
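As an illustration of the latent factor idea, the following sketch factorizes a partially observed cell line-drug matrix with plain stochastic gradient descent and L2 regularization. It is a generic textbook variant, not one of the specific methods reviewed here; the function names and all hyperparameter values are assumptions chosen for this example.

```python
import random


def factorize(R, rank=2, epochs=2000, lr=0.01, reg=0.01, seed=0):
    """Approximate R (None marks unknown entries) as P times Q transposed.

    P holds one latent vector per cell line, Q one latent vector per drug.
    """
    rng = random.Random(seed)
    n_cells, n_drugs = len(R), len(R[0])
    P = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_cells)]
    Q = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_drugs)]
    known = [(i, j) for i in range(n_cells) for j in range(n_drugs)
             if R[i][j] is not None]
    for _ in range(epochs):
        rng.shuffle(known)
        for i, j in known:
            err = R[i][j] - sum(p * q for p, q in zip(P[i], Q[j]))
            for k in range(rank):
                p, q = P[i][k], Q[j][k]
                # Gradient step on the squared error with an L2 penalty.
                P[i][k] += lr * (err * q - reg * p)
                Q[j][k] += lr * (err * p - reg * q)
    return P, Q


def predict(P, Q, i, j):
    """Predicted response of cell line i to drug j: a latent dot product."""
    return sum(p * q for p, q in zip(P[i], Q[j]))
```

Only observed entries enter the loss, so unknown cell line-drug pairs are predicted from the learned latent vectors by calling `predict` on their indices.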
Ammad-ud-din et al. [120] extended a kernelized Bayesian matrix factorization (KBMF) method that was originally developed for drug-protein interaction analysis [121], to combine different omics data and drug properties for DRP. Here, kernels allow capturing nonlinear relationships between drug response and both omics and drug features. To improve model interpretability, they further introduced component-wise MKL. In this method, biological pathway-based groups of features are used as model inputs, thus incorporating prior biological knowledge in the model [122].
Similarity-regularized matrix factorization by Wang et al. [123] employs regularization by GE-based cell line similarity and chemical feature-based drug similarity measures. More recently, a modified regularization term for an analogous model was proposed by Moughari and Eslahchi [124]. They showed that cell line similarities based on GE and drug similarities based on drug targets provide the best performance among a variety of cell line and drug features. To account for redundant information in similarities, the WGRMF method [125] approximates the cell line similarity and drug similarity matrices with sparse matrices.
As the three above methods do not define any transformation matrix that projects cell lines and drug features into a latent space, they cannot predict drug sensitivity for new cell lines or drugs. Using a cell line projection matrix, Suphavilai et al. [126] developed a matrix factorization-based RS, which enables the prediction for new cell lines. In addition, Yang et al. [127] used a drug projection matrix that additionally enables the prediction for a new drug. They used a method called MACAU [128] to incorporate GE and drug target information to train parameters. Moreover, to make predictions for new cell lines, Emdadi & Eslahchi [129] employ a neighbourhood-based approach, in which the latent vector for the new cell line is calculated as the average latent vector of its neighbours. Their DSPLMF method trains a classification model based on logistic matrix factorization to predict the probability of a cell line being sensitive to a drug. Such models can be categorized as hybrid model-/neighbourhood-based RS. To further improve the performance of their method, they developed a feature selection method called Auto-HMM-LMF [130].
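The hybrid idea of placing a new cell line in the latent space via its neighbours can be sketched as below. This is a simplified illustration under stated assumptions, not the DSPLMF implementation: the function names, the neighbourhood size `k` and the plain logistic link without bias terms are choices made for this example.

```python
import math


def neighbour_latent_vector(sim_to_known, P, k=2):
    """Average the latent vectors of the k known cell lines most similar to a new one.

    sim_to_known[i]: similarity of the new cell line to known cell line i.
    P[i]: latent vector of known cell line i, learned by matrix factorization.
    """
    nearest = sorted(range(len(P)), key=lambda i: sim_to_known[i], reverse=True)[:k]
    rank = len(P[0])
    return [sum(P[i][f] for i in nearest) / len(nearest) for f in range(rank)]


def sensitivity_probability(cell_vec, drug_vec):
    """Logistic function of the latent dot product (no bias terms in this sketch)."""
    score = sum(a * b for a, b in zip(cell_vec, drug_vec))
    return 1.0 / (1.0 + math.exp(-score))
```

The averaged vector can then be paired with any learned drug vector to obtain a sensitivity probability for a cell line that was not part of the training data.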

Network representation learning-based models
Network representation learning methods project networks into a low-dimensional space [131], which can be used as the input features for ML. Yang et al. [132] proposed a method in which a response network is first constructed by combining cell line genetic aberrations, drug responses and a PPI network. In the response network, cell lines with similar drug responses are close to each other. Next, a graph representation learning algorithm is used to learn a projection of the cell lines from the response network to a space that preserves their neighbourhood proximity. Finally, an SVM is trained on the representation vector of cell lines for each drug to predict the drug response of cell lines. The idea behind this method is that the cell lines with similar neighbourhood topology profiles have similar drug responses. The DREMO method by Yu et al. [133] constructs two multilayer similarity networks for cell lines and drugs based on different molecular data types, and then diffusion component analysis [134] is used to embed the cell lines and drugs in low-dimensional feature vector spaces. Finally, the sensitivity of the cell lines to each drug is predicted using logistic regression.

Network diffusion-based methods
Zhang et al. [135] proposed a DRP method based on network diffusion on a heterogeneous network consisting of cell line similarities, drug similarities, target similarities, cell line-drug sensitivities and drug-target interactions. Their target similarity network is obtained by merging PPI and co-expression networks. In this method, a diffusion algorithm, called the information flow-based method [136], is used to infer which unknown cell line-drug pairs are sensitive on the basis of the known responses, by considering the density of the connections associated with the cell lines sensitive to drugs.
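To convey the general principle of network diffusion, the sketch below implements a random walk with restart on a small weighted graph. Note that this is a different, generic diffusion scheme swapped in for illustration, and not the information flow-based method of [136]; the function name, the restart probability and the toy graph are assumptions.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-9, max_iter=1000):
    """Diffuse scores from seed nodes over a graph given as an adjacency matrix.

    adj[i][j]: non-negative edge weight between nodes i and j.
    seeds: set of node indices where the walk (re)starts, e.g. nodes with known sensitivity.
    Returns one score per node; higher scores indicate closer proximity to the seeds.
    """
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    # Column-normalized transitions: W[i][j] is the probability of stepping from j to i.
    W = [[adj[i][j] / col_sums[j] if col_sums[j] > 0 else 0.0 for j in range(n)]
         for i in range(n)]
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [(1 - restart) * sum(W[i][j] * p[j] for j in range(n)) + restart * p0[i]
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy chain 0 - 1 - 2 - 3 with node 0 as the only seed.
chain = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
scores = random_walk_with_restart(chain, {0})
```

In a heterogeneous DRP network, nodes would correspond to cell lines, drugs and targets, and the diffused scores would be used to rank candidate cell line-drug pairs.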

Ensemble learning methods
Ensemble learning (introduced for SDL models in Section Ensemble Learning Methods) can also be applied for MDL models. Matlock et al. [137] used an ensemble of RF, KNN and ANN methods for DRP. Liu et al. integrated a low-rank matrix completion and a ridge regression model for DRP [138]. Sharma & Rani [139] proposed another ensemble learning approach incorporating KBMTL [140], STREAM [141], L21 [142] and multitask regression. They first trained four different regression methods for predicting drug responses and then combined the output of each model using weighted averaging. Su et al. [143] proposed the two-level Meta-GDBP method. The first level contains four independent predictive models, based on GE, chemical properties of drugs, biological processes (based on GO terms [144]) and biological pathways (based on KEGG pathways [145]). For each of these information sources, they evaluated the performances of elastic net and SVR predictors. The second-level model is a weighted linear combination of the outputs of the best-performing models in the first level.
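The weighted-averaging step shared by several of these ensembles can be sketched as follows. How the weights are chosen differs between methods; weighting each base model by its inverse validation error is an assumption made here purely for illustration, as are the function names.

```python
def inverse_error_weights(validation_errors):
    """Turn per-model validation errors (all > 0) into weights that sum to one."""
    inverse = [1.0 / e for e in validation_errors]
    total = sum(inverse)
    return [w / total for w in inverse]


def ensemble_predict(base_predictions, weights):
    """Weighted linear combination of the base models' predictions for one sample."""
    return sum(p * w for p, w in zip(base_predictions, weights))


# Two hypothetical base models; the first has a three times lower validation error.
weights = inverse_error_weights([1.0, 3.0])
combined = ensemble_predict([2.0, 4.0], weights)
```

To avoid optimistic bias, the validation errors used for the weights should come from data that were not used to train the base models.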

Evaluation of DRP models
Model evaluation, i.e. the assessment of the expected performance of a trained model on future data, needs to occur on test data that are independent from the training data. Otherwise, during training, the learned model parameters may be better adapted to the test data than could be expected for future data, thus leaving the test data evaluation optimistically biased. Data leakage, i.e. the (sometimes subtle) presence of information about the test data in the training data, may, for instance, occur when very similar data, such as drugs with similar structures, are present in both training and test sets, or when the entire dataset is used for data preprocessing, e.g. for normalization across cell lines. There are different common ways of splitting a single dataset into training and test data [57]. In repeated random subsampling, the dataset is repeatedly split at random into training and test sets, typically of given sizes. In each iteration, the model is trained on the training set and an evaluation metric (e.g. mean square error) is calculated on the test set. Finally, the model performance is estimated as the average evaluation metric over all iterations. In k-fold cross validation, the dataset is split into k subsets (folds, typically of identical size). In k iterations, each fold is used once as a test set, and the remaining data are used for training. K-fold cross validation ensures that all data samples appear in the training and test sets with the same frequency, and reduces stochastic fluctuations, particularly for small datasets. Moreover, in cases when k-fold cross validation would lead to training sets that are too small, typically, leave-one-out cross validation (LOOCV) is used, in which test sets of only a single sample are used in each iteration. It is worth noting that LOOCV is not recommended for large datasets since it is computationally expensive and leads to estimates of the model performance with high statistical variance [146].
Leave-pair-out cross validation (LPOCV) is a special case where only two samples are held out. Finally, there is a validation method specifically for RF called out-of-bag validation (refer to section Ensemble Learning Methods). In models with hyperparameters, an additional subset of the training data (validation set) is typically split off from the training set before training to evaluate the model under different hyperparameter values, before the version with the best hyperparameter values is evaluated on the test set.
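The k-fold procedure can be sketched in a few lines. The helper names, the fixed random seed and the `fit`/`predict` callables are assumptions for this example; setting `k` equal to the number of samples yields LOOCV.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists so that each sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must lie between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def cross_validate(X, y, fit, predict, k=5):
    """Average the mean square error over the k held-out folds.

    fit(X_train, y_train) returns a model; predict(model, x) returns a number.
    """
    fold_errors = []
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        squared = [(predict(model, X[i]) - y[i]) ** 2 for i in test]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / len(fold_errors)
```

Any preprocessing that learns from the data, such as normalization, would have to be performed inside `fit` on the training indices only, in order to avoid the data leakage discussed above.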
As the performance of MDL models for new data may depend on whether both drugs and cell lines are present or not in the training data (i.e. known or new), four different evaluation modes exist (Figure 4):
1. Known cell line/new drug: The prediction on new drugs typically uses known drug features. Leave-one-compound-out is a special case where only one drug is left out.
2. New cell line/known drug: This case is typical for personalized medicine applications. Leaving all cell lines derived from one tissue out as the test set is called leave-one-tissue-out.
3. New cell line/new drug: This evaluation corresponds to the most ambitious prediction task.
4. Known cell line/known drug: This evaluation corresponds to the least ambitious prediction task, which can be used to impute missing values.
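One way to obtain test sets for the first three modes from a single dataset is sketched below. The function name, the dictionary keys and the representation of the data as (cell line, drug, response) triples are assumptions for this example; the fourth mode (known cell line/known drug) would instead hold out random individual pairs from the training portion.

```python
def split_by_evaluation_mode(triples, held_out_cells, held_out_drugs):
    """Partition (cell line, drug, response) triples by evaluation mode.

    Cell lines in held_out_cells and drugs in held_out_drugs never occur
    in the training portion, so they are 'new' at test time.
    """
    splits = {
        "train": [],
        "known_cell_new_drug": [],
        "new_cell_known_drug": [],
        "new_cell_new_drug": [],
    }
    for cell, drug, response in triples:
        new_cell = cell in held_out_cells
        new_drug = drug in held_out_drugs
        if new_cell and new_drug:
            key = "new_cell_new_drug"
        elif new_cell:
            key = "new_cell_known_drug"
        elif new_drug:
            key = "known_cell_new_drug"
        else:
            key = "train"
        splits[key].append((cell, drug, response))
    return splits
```

Passing all cell lines of one tissue as `held_out_cells` corresponds to leave-one-tissue-out, and passing a single drug as `held_out_drugs` corresponds to leave-one-compound-out.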
Training and test sets can also be selected from different datasets. Cross-dataset evaluation thus also incorporates potential batch effects, i.e. systematically different characteristics between datasets, such as scale. In many real-world cases, heterogeneous output data present an additional challenge. For instance, regression models that predict IC50 values observed on cell lines may be difficult to evaluate against the drug effects as they are commonly characterized in clinical datasets.
In biomedical classification scenarios, the amount of available data may strongly vary across the different classes, which presents special challenges for learning and evaluation. To balance classes before learning, one can undersample the majority class, over-sample the minority class with replacement, or use synthetic data (e.g. SMOTE) [145,146]. For a discussion of subtleties and pitfalls of these approaches, we refer the reader to the review by Chawla [147].
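Random oversampling of minority classes with replacement, the simplest of the balancing strategies mentioned above, can be sketched as follows. The function name and the fixed seed are assumptions for this example; synthetic approaches such as SMOTE instead create new interpolated samples and are not shown here.

```python
import random
from collections import defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until all classes are equally large."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels
```

Balancing should be applied to the training set only, after the split into training and test data, since duplicated samples that end up on both sides of the split would constitute data leakage.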

Discussion and outlook
Machine learning is a subfield of artificial intelligence whose classification and prediction algorithms learn patterns underlying the data from examples. Omics data usable for DRP consist of different types of molecular profiles, typically from genomics, epigenomics, transcriptomics and proteomics. The different available data types capture different (potentially overlapping) facets of the biological drug response. Comparative studies found that, among different individual data types, GE enables the best predictive performance, which can then be improved only slightly by adding other data types [22,24,50,148]. Iorio et al. [14] reported that GE is the individual data type that leads to the best prediction performance across a large set of cancers. Interestingly, they also show that the prediction performance of GE is highly correlated with tissue types [14]. However, employing tissue types along with GE as model input so far did not lead to performance improvements [40,46,50,148]. In a study on myeloma, GE did not provide satisfactory results [149]. When considering individual cancer types by themselves, genomic features appear to be the most informative single data type [14].
Simple linear regression methods are easily interpretable and have been used for drug-specific biomarker detection [16]; however, they cannot capture more complex associations in the data. On the other hand, the complexity of the output of many more advanced DRP ML methods (for instance, those based on many different transcripts) limits their use for the development of drug resistance biomarkers, and the extraction of new knowledge about the molecular mechanisms behind drug responses. We expect that ongoing improvements in the interpretation of ANNs [150,151] will continue to be reflected in new approaches to DRP [65,67].
DRP ML models can be categorized as single-drug and multidrug models. MDLs can leverage data from multiple drugs, thus conferring robustness and the ability to predict responses to new drugs. MDL models have also been seen empirically to show better prediction performance than SDLs [94,116,126,129], and to lead to more robust biomarkers [126]. Overall, RS and deep learning have been the most widely used MDL methods for DRP. The results reported in the literature suggest the superiority of RSs [24,114,116,123,124,126,129,130] and deep learning methods [93,96,101,104,107] over more traditional approaches, such as linear regression, RF and SVM.
At this stage, statements about the current performance of ML-based DRP methods overall are difficult to make. In the clinic, molecular patient data are already an established input for treatment decisions [152], and the potential of ML-based DRP models to predict tumoral responses to anticancer drugs has been established [30,44,46,58,67,71,101]. However, ML-based DRP seems to be rarely used in clinical practice.
One major reason for the currently limited utility of DRP in the clinic may be the difficulty of addressing random and systematic variations between datasets, which is an issue in many applications of ML. Another reason may be the lack of interpretability of many DRP models in the context of other rich, and often unstructured, relevant additional clinical information that is currently not used by DRP models.
The future of drug prioritization in the clinic will undoubtedly be influenced by ML [153]. We believe that future improvements, especially towards the use of DRP models in the clinic, will be catalyzed in a few distinct ways. Firstly, it can be expected that, as in other ML applications, larger systematic datasets will be able to drive performance increases, especially if new datasets are aimed at the more comprehensive characterization of the relevant biology. This is particularly true if one considers the current paucity of systematic data outside of the cell line context. Future datasets with the largest positive impact on DRP performance may therefore well come from advanced preclinical models. Although more, and more diverse, data can be thought to help in any ML approach, it is in particular deep learning, as a maximally flexible class of models, that is fundamentally capable of capturing the more subtle characteristics of larger datasets. The increasing availability of computing power, and of computing as a part of the biomedical research enterprise, will likely enable the continued application of new, and computationally ambitious, ML technology to DRP.
A second, complementary, route towards higher performance of DRP models is the incorporation of prior knowledge. Until now, the incorporation of prior knowledge through biomolecular networks has led to only slight overall improvements at best [44,49,60]. However, Oskooei et al. [84] found clear improvements for specific drugs when additional information about drug targets was provided.
Various missing aspects of drug responses in the clinic may be addressable by either large datasets, or the incorporation of knowledge (or a combination of both), as long as various current systematic limitations of data and models are addressed. A first, major, systematic limitation of most DRP models is their focus on the short-term drug response in homogeneous cell populations. Neither the available cell line input data nor the learning approaches themselves currently address the complex evolutionary process that often leads to drug resistance in heterogeneous tumours [154]. Further development of clinically relevant machine learning-based DRP models may require the use of input data, such as genomic and transcriptomic single-cell data, that resolve the subclonal tumour structure [155], and explicit modelling of tumour evolution [156].
Another aspect of biological complexity that is typically not addressed by current data-driven approaches to DRP, or by the underlying data, is the immune system, which plays a major part in many diseases, including cancer [157]. The state of the immune system in relevant tissue can be highly associated with chemotherapy resistance [158], and immunomodulators have, in the last decade, become major weapons themselves against cancer and other diseases [159]. The dbGaP database [160] covers immune responses to drugs with RECIST parameters, and thus provides future opportunities for the incorporation of the immune system into DRP.
The microbiome is yet another complex part of human physiology known to regulate many aspects of human health, and, increasingly, it is also recognized as a mediator of drug effects [161,162]. Evolving insight into the microbiome and its role in drug action may stimulate the adoption of the microbiome as part of DRP models [163], or lead to models in which the microbiome can then either act as an enhancer, or even become itself an active part of therapy through the molecules it secretes [164]. Related datasets, such as TCGA [25] microbiome data, have so far not been included in DRP approaches.
Another area of potential future significance for machine learning-based DRP is the prediction of unwanted side effects or adverse drug reactions. Although methods for the prediction of more general adverse patient events currently appear to focus on electronic health records [165], increasing insight into the molecular basis of adverse drug reactions [166] and associated data may also permit the future integration with principles used today for DRP.
Eventually, DRP methods and their underlying principles may also be applicable to the study of the effects of environmental or food-related molecules on human health. Perhaps spurred by the human genome as the exhaustive parts list of human biomolecules, environmental science has started to adopt comprehensive approaches to profile exogenous chemicals, and to study their impact on human health [167].
As the effects of environmental chemicals on human biology can, at some level, be considered as similar to drug effects, DRP principles and methods to predict the impact of drug combinations on human health may be transposable to the prediction of the effects of chemical mixtures, where, similarly to drug combinations, exhaustive testing of chemical mixtures is typically not possible. Analogously, the study of interactions between drugs and environmental and food-related chemicals [168] will have to rely on mechanistic insight and data-driven models increasingly obtained with the help of machine learning methods, such as those developed for DRP.

Box I. Common preclinical drug response metrics
Several common drug sensitivity metrics are used in DRP input data. All are obtained from sigmoidal dose-response curves that record cell viability (the activity of a specific biological process, or the number of cells) across a range of drug concentrations. Drug sensitivity is commonly determined relative to two reference values: control (i.e. cell viability in the absence of the drug) and maximal response (the maximally observed difference of cell viability, relative to control). Based on these two concepts, pharmacogenomic databases use a variety of metrics to quantify growth inhibition [169,170], the most important being:
• IC50 (inhibitory concentration of 50%): the drug concentration at which cell viability is reduced to half of control.
• GI50 (growth inhibition of 50%): the drug concentration that reduces growth by half.
• EC50 (effective concentration of 50%): the drug concentration that causes half of the maximal response. Also called 'relative IC50'.
• AUC (area under the dose-response curve): the area under the dose-response curve.
• AA (Activity area): the area over the dose-response curve.
It is worth noting that cell lines that are sensitive to a drug will have lower IC50, GI50, EC50 and AUC values, and a higher AA value.
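The relationship between these metrics can be illustrated with a sigmoidal (Hill-type) curve. The sketch below rests on several assumptions that are not prescribed by the text above: viability is normalized so that the control equals 1, the curve has the three parameters `ec50`, `hill` and `e_inf` (the viability at saturating dose), and AUC is computed by the trapezoidal rule over log10 concentration and normalized by the tested range, so that AA = 1 - AUC. Databases differ in these conventions.

```python
import math


def viability(conc, ec50, hill, e_inf):
    """Hill-type dose-response curve with control viability normalized to 1."""
    return e_inf + (1.0 - e_inf) / (1.0 + (conc / ec50) ** hill)


def ic50(ec50, hill, e_inf):
    """Concentration at which viability equals half of control; None if never reached."""
    if e_inf >= 0.5:
        return None  # the curve plateaus above 50% viability
    return ec50 * (0.5 / (0.5 - e_inf)) ** (1.0 / hill)


def auc_and_aa(concs, viabilities):
    """Normalized area under, and over, the curve on a log10 concentration axis."""
    logs = [math.log10(c) for c in concs]
    area = sum((logs[i + 1] - logs[i]) * (viabilities[i] + viabilities[i + 1]) / 2.0
               for i in range(len(logs) - 1))
    auc = area / (logs[-1] - logs[0])
    return auc, 1.0 - auc


# Hypothetical five-point dilution series for a sensitive and a resistant cell line.
concs = [0.01, 0.1, 1.0, 10.0, 100.0]
sensitive = auc_and_aa(concs, [viability(c, 0.1, 1.0, 0.0) for c in concs])
resistant = auc_and_aa(concs, [viability(c, 10.0, 1.0, 0.0) for c in concs])
```

In this toy example the sensitive cell line obtains the lower AUC and the higher AA, and IC50 coincides with EC50 only when the curve drops to zero viability at saturating dose (`e_inf = 0`).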

Key Points
• Biological knowledge can be leveraged for machine learning of drug response prediction, e.g. through biomolecular networks.
• Recent advances in machine learning, e.g. in the areas of deep learning and multitask learning, have translated into increased performance of drug response prediction.
• Large-scale pharmacogenomic datasets for advanced preclinical models are urgently needed for the future development of data-driven drug response prediction.