TUGDA: task uncertainty guided domain adaptation for robust generalization of cancer drug response prediction from in vitro to in vivo settings

Abstract

Motivation: Large-scale cancer omics studies have highlighted the diversity of patient molecular profiles and the importance of leveraging this information to deliver the right drug to the right patient at the right time. Key challenges in learning predictive models for this include the high-dimensionality of omics data and heterogeneity in biological and clinical factors affecting patient response. The use of multi-task learning techniques has been widely explored to address dataset limitations for in vitro drug response models, while domain adaptation (DA) has been employed to extend them to predict in vivo response. In both of these transfer learning settings, noisy data for some tasks (or domains) can substantially reduce the performance for others compared to single-task (domain) learners, i.e. lead to negative transfer (NT).

Results: We describe a novel multi-task unsupervised DA method (TUGDA) that addresses these limitations in a unified framework by quantifying uncertainty in predictors and weighting their influence on shared feature representations. TUGDA's ability to rely more on predictors with low uncertainty allowed it to notably reduce cases of NT for in vitro models (94% overall) compared to state-of-the-art methods. For DA to in vivo settings, TUGDA improved over previous methods for patient-derived xenografts (9 out of 14 drugs) as well as patient datasets (significant associations in 9 out of 22 drugs). TUGDA's ability to avoid NT thus provides a key capability as we try to integrate diverse drug-response datasets to build consistent predictive models with in vivo utility.

Availability and implementation: https://github.com/CSB5/TUGDA.

Supplementary information: Supplementary data are available at Bioinformatics online.


Introduction
Advances in DNA sequencing technologies have galvanized a paradigm shift in medicine from a one-size-fits-all approach to precision medicine that is tailored to stratified populations based on molecular information (Chae et al., 2017). In oncology, an appreciation of the molecular diversity of cancers and the limitations of standard-of-care treatments have further driven this interest toward patient-specific options based on re-purposing drugs and identifying targeted drug combinations (Brown and Elenitoba-Johnson, 2020). The availability of a large number of cancer cell lines has provided ready models for collecting drug response data (Iorio et al., 2016). In combination with detailed omics profiles, these datasets present a unique opportunity to advance precision oncology based on state-of-the-art machine learning techniques (Jiang et al., 2018).
The complexity inherent in biological systems and omics data poses two main challenges in learning models that could have clinical utility. Firstly, the high dimensionality of omics data relative to the number of data points available can impact the generalizability of the models that are learnt (Azuaje, 2016). Joint models that predict response for many drugs in a multi-task learning (MTL) setting have been widely used to alleviate this limitation (Costello et al., 2014; Suphavilai et al., 2018; Wang et al., 2017; Zhang and Yang, 2018). Secondly, while cell line datasets are typically used to learn predictive models, they are not expected to capture key aspects relevant to in vivo response, including tumor heterogeneity and microenvironment, immune response and overall patient health (van Staveren et al., 2009). Previous works (Geeleher et al., 2014, 2017; Sakellaropoulos et al., 2019) assumed that batch effects were the main source of differences to correct for, without directly addressing biological variation. Recently, some methods have sought to use domain adaptation (DA) techniques to bridge the in vitro to in vivo gap (Mourragui et al., 2019, 2020; Sharifi-Noghabi et al., 2020).
An underlying principle shared by MTL and DA techniques is that transfer learning, whether across tasks or domains, relies on generalizing information through shared representations. Failing to do this effectively leads to negative transfer (NT), where predictive performance for target tasks or domains is instead hampered relative to single-task learning (STL) (Zhang et al., 2020). For MTL, this can happen when unrelated tasks are learnt together (potentially addressed by quantifying task relatedness, as in GO-MTL; Kumar and Daumé, 2012) or when poor predictors adversely impact the shared representation (potentially addressed by weighting transfer flows based on task loss, as in AMTL, Lee et al., 2016, and its extension Deep-AMTFL, Lee et al., 2018). For DA, NT can occur when there is weak or no similarity between domains (Kouw and Loog, 2021), and the method PRECISE (Mourragui et al., 2019) seeks to address this for drug response prediction via a robust manifold alignment process. A refinement of this idea, TRANSACT (Mourragui et al., 2020), uses kernel-PCA-based subspace alignment to further capture non-linear relationships between samples from in vitro and in vivo domains. However, to learn the similarity between domains, existing DA methods either do not take into account the conditional distributions (P_s(Y|X) and P_t(Y|X) for drug response Y given gene expression X in source s and target t), obtaining a subset of shared features that might be unrelated to drug response (Mourragui et al., 2019, 2020), or rely on the covariate-shift assumption (Sharifi-Noghabi et al., 2020), where marginal distributions for features (P_s(X) and P_t(X), for tasks/domains s and t) are allowed to vary while the conditional distribution for drug response is assumed to be the same (P_s(Y|X) = P_t(Y|X)) (Kouw and Loog, 2021; Zhao et al., 2019). This assumption can often lead to NT (Rampášek, 2020; Zhao et al., 2019), e.g. when drugs that are effective in vitro do not successfully translate to the clinical setting (Wilding and Bodmer, 2014).
We present a unified transfer learning approach (TUGDA) for MTL and DA that leverages task/domain uncertainty (rather than loss) and a relaxed covariate-shift assumption to improve the robustness of drug response prediction. Specifically, TUGDA captures both aleatoric (Kendall and Gal, 2017) and epistemic (Kendall et al., 2018) uncertainties, and uses them to weight task/domain-to-feature transfer. In addition, TUGDA relaxes the covariate-shift assumption across domains (P_s(Y|X) ≈ P_t(Y|X)) for tasks with low-confidence predictions using shared domain features. Our evaluations against state-of-the-art methods show that the use of uncertainties in guiding task-to-feature transfer reduces cases of negative transfer, with 94% of tasks free of NT overall and a 50% reduction in harder cases that have limited in vitro data. For in vivo settings, TUGDA outperformed previous methods in transferring drug response predictions to both patient-derived xenografts (PDX) and patient tumors. Overall, TUGDA represents a novel unified framework to leverage information from in vitro and in vivo settings and robustly predict cancer drug responses from molecular profiles.

Definitions and preliminaries
We define a dataset C = {X_t, y_t}_{t=1}^{T} consisting of gene expression profiles X_t ∈ R^{N_t×d} (d genes) and drug response values y_t ∈ R^{N_t×1} for T different drugs and N_t different data points (cell lines, xenografts or patients). In an MTL setting, we jointly learn predictive models for all T tasks under the following general framework:

    min_W Σ_{t=1}^{T} (1/N_t) ℓ(y_t, X_t w_t) + R(W)    (1)

where ℓ is the loss function (e.g. mean squared error, in our case) applied to each task t, with w_t representing task-specific parameters as columns of W ∈ R^{d×T}. The regularization term R is introduced to enforce priors over the task parameters and to improve generalization. This approach constrains joint learning in a naive manner (through the regularization term), and an approach to improve this is to assume that there exist shared latent bases across tasks (Argyriou et al., 2008; Kumar and Daumé, 2012). We can represent this assumption and improve Eq. (1) as follows:

    min_{L,S} Σ_{t=1}^{T} (1/N_t) ℓ(y_t, X_t L s_t) + μ ||L||_2^2 + λ Σ_{t=1}^{T} ||s_t||_1    (2)

where W from Eq. (1) is decomposed as W = LS, with L ∈ R^{d×k} representing the set of k latent bases, and S ∈ R^{k×T} the matrix containing the vectors s_t that combine those bases. The R term from Eq. (1) is then replaced to constrain L to be ℓ2-regularized while each s_t needs to be ℓ1-sparse, with the hyperparameters μ and λ controlling the extent of regularization. This framework can be extended to take advantage of neural networks and use multiple layers of shared features followed by a task-specific layer. Here we assume that L and S are parameters for the first and the second (task-specific) hidden layers, respectively. The approach in Eq. (2) tries to reduce the risk of negative transfer by forcing unrelated tasks to use disjoint latent spaces. Nevertheless, the shared bases are trained without consideration of the quality of task predictors, allowing noisy and unreliable predictors to become a source of NT (Lee et al., 2018).
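To make the W = LS factorization concrete, the following NumPy sketch evaluates an Eq. (2)-style objective (sum of per-task squared errors plus an ℓ2 penalty on L and an ℓ1 penalty on S). All dimensions, regularization weights and random data here are illustrative assumptions, not the settings used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: d genes, k latent bases, T drugs (tasks), n samples
d, k, T, n = 100, 8, 5, 30

L = rng.normal(scale=0.1, size=(d, k))   # shared latent bases (first layer)
S = rng.normal(scale=0.1, size=(k, T))   # task-specific combinations (W = L @ S)
X = rng.normal(size=(n, d))              # gene expression profiles
Y = rng.normal(size=(n, T))              # drug responses, one column per task

def mtl_objective(X, Y, L, S, mu=1e-2, lam=1e-2):
    """Sum of per-task MSEs + l2 penalty on L + l1 penalty on each s_t."""
    preds = X @ L @ S                            # predictions for all T tasks at once
    task_mse = ((preds - Y) ** 2).mean(axis=0)   # one MSE per task
    reg = mu * np.sum(L ** 2) + lam * np.sum(np.abs(S))
    return task_mse.sum() + reg

loss = mtl_objective(X, Y, L, S)
```

In a neural realization, L and S become the weights of the first (shared) and second (task-specific) layers, and the objective is minimized by gradient descent rather than evaluated in closed form.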
Assuming that task loss is a proxy for task reliability, the transfer from tasks to features can be guided (Lee et al., 2016, 2018) by extending Eq. (2) as follows:

    min_{L,S,A} Σ_{t=1}^{T} (1/N_t) ℓ(y_t, Z_t s_t) + ||Z − σ(Z S A)||_2^2 + α Σ_{t=1}^{T} ℓ_t ||a_t^o||_1 + R(L, S)    (3)

where Z_t = σ(X_t L) is the output of the first neural layer L followed by a non-linear activation function σ [ReLU (Nair and Hinton, 2010) in our case]. Z is interpreted as the shared feature space and is used by S (task-specific parameters) to predict drug responses. A is a matrix that controls the amount of transfer from task t to the k features through the row vector a_t^o (A's row vector). An autoencoder regularization is then imposed, aiming to reconstruct the latent features Z from the model output σ(Z S A). This feedback loop between Z and A imposed by the autoencoder is expected to control the influence of unreliable tasks (based on task loss) on the shared feature space. The hyperparameter α is multiplied by the training loss ℓ_t to control the sparsity of a_t^o, thus breaking the symmetry of transfer to features by forcing transfer from high-loss tasks to be more sparse. Despite this sophisticated formulation, the assumption that task loss is a proxy for reliability may be misleading, especially in cases of overfitting from limited in vitro training data (Hawkins, 2004).

Leveraging task uncertainty for multi-task learning
We aim to estimate two types of task uncertainty and explore their use as alternative weights for task-to-feature transfer (Kendall and Gal, 2017). The first type is aleatoric uncertainty, which captures uncertainty due to inherent noise in the experimental data being modeled. Specifically, as shown by Kendall et al. (2018), homoscedastic aleatoric uncertainty in MTL settings captures the relative confidence between tasks. As this uncertainty does not vary with input data, we can interpret it as task uncertainty reflecting the amount of noise inherent in drug response measurements. Let f^{w_t}(x) be the output function for input x and task weights w_t; we have the following relationship for the aleatoric uncertainty per task (σ_t) in a regression setting:

    Σ_{t=1}^{T} [ (1/(2σ_t^2)) ||y_t − f^{w_t}(x)||^2 + log σ_t ]    (4)

where σ_t is learnable along with the model parameters. Intuitively, from Eq. (4), σ_t can be interpreted as attenuating the loss when the model predictions are far away from the ground truth. As prior work has shown that MTL is strongly impacted by the relative weighting of task losses (Kendall and Gal, 2017), the use of aleatoric uncertainty in TUGDA could reduce NT by automatically learning optimal loss weights.
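The loss-attenuation effect of this homoscedastic weighting can be sketched in a few lines (function name and numeric values are illustrative): each task's squared error is scaled by 1/(2σ_t²) and penalized by log σ_t, so a noisy task that learns a large σ_t contributes less to the joint loss.

```python
import numpy as np

def aleatoric_mtl_loss(per_task_mse, log_sigma):
    """Kendall et al. (2018)-style weighting: mse_t / (2*sigma_t^2) + log(sigma_t).
    log_sigma holds the learnable parameters (log space keeps sigma positive)."""
    per_task_mse = np.asarray(per_task_mse, dtype=float)
    log_sigma = np.asarray(log_sigma, dtype=float)
    sigma2 = np.exp(2.0 * log_sigma)             # sigma_t^2
    return float(np.sum(per_task_mse / (2.0 * sigma2) + log_sigma))

# The same high-MSE task is attenuated when its learned sigma is large:
attenuated = aleatoric_mtl_loss([4.0], [1.0])    # sigma_t = e
unattenuated = aleatoric_mtl_loss([4.0], [0.0])  # sigma_t = 1
```

During training, gradient descent trades off the down-weighting term against the log σ_t penalty, so σ_t cannot grow without bound and settles at a value reflecting the task's noise level.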
A second type of task uncertainty accounted for in TUGDA is epistemic, representing the uncertainty in model parameters (Kendall and Gal, 2017). To capture it, TUGDA uses Bayesian neural networks (BNNs, where weights W ~ N(0, I)) to quantify model prediction uncertainties (Goan and Fookes, 2020). We use dropout variational inference (Gal and Ghahramani, 2016) for approximate inference in our model during training and testing (Srivastava et al., 2014), thus enabling sampling from an approximate posterior distribution over weights (q*_θ(W), in a tractable family) that minimizes the Kullback-Leibler divergence to the true model posterior (Gal and Ghahramani, 2016). We therefore extend Eq. (4) as:

    ℓ_BNN = Σ_{t=1}^{T} [ (1/(2σ_t^2)) ||y_t − f^{Ŵ_t}(x)||^2 + log σ_t ]    (5)

where Ŵ_t is sampled from the approximate distribution q*_θ(W). In this setting, predictions are obtained by forwarding each sample x through the model for P passes, with weights sampled according to dropout inference.
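The MC dropout procedure described above can be sketched as follows (a hypothetical two-layer model in NumPy; the dropout rate and number of passes are illustrative assumptions): dropout stays active at prediction time, and P stochastic forward passes yield a predictive mean and a per-task variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(x, L, S, p_drop=0.5, passes=50):
    """Monte Carlo dropout (Gal & Ghahramani, 2016): average over `passes`
    stochastic forward passes; the sample variance across passes estimates
    the epistemic uncertainty per task."""
    outs = []
    for _ in range(passes):
        # inverted dropout mask over the k latent units, active at test time
        mask = (rng.random(L.shape[1]) > p_drop) / (1.0 - p_drop)
        z = np.maximum(x @ L, 0.0) * mask    # ReLU latent features, dropped out
        outs.append(z @ S)                   # one prediction per task
    outs = np.stack(outs)                    # shape: (passes, T)
    return outs.mean(axis=0), outs.var(axis=0)

d, k, T = 20, 6, 3
L = rng.normal(size=(d, k))
S = rng.normal(size=(k, T))
x = rng.normal(size=d)
pred_mean, epistemic_var = mc_dropout_predict(x, L, S)
```

In a real BNN the dropout masks would be applied to trained layer weights within a deep learning framework; the sampling-and-averaging logic is the same.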
In the TUGDA framework, with BNN layers L and S and a decoder layer A to regularize the task-to-feature transfer, the epistemic uncertainty for a task t given a sample x is computed over P passes as:

    U_t(x) = (1/P) Σ_{p=1}^{P} f^{Ŵ_t^p}(x)^2 − ( (1/P) Σ_{p=1}^{P} f^{Ŵ_t^p}(x) )^2    (6)

Following this, TUGDA's novelty lies in formulating the use of task uncertainties to guide knowledge transfer from tasks t to features Z, which is accomplished by extending Eq. (3) as follows:

    min_{L,S,A} ℓ_BNN + ||Z − σ(Z S A)||_2^2 + α Σ_{t=1}^{T} U_t ||a_t^o||_1 + R(L, S)    (7)

where U_t is employed to weight a_t^o, thus forcing tasks with high uncertainty to transfer less to the shared feature space Z (via the autoencoder regularization). A model representation for MTL with TUGDA is depicted in Figure 1 (blue layers), showing how the influence of unreliable tasks is attenuated by both aleatoric (ℓ_BNN) and epistemic (U_t) uncertainties, and how the constraints on a_t^o are learnt in an end-to-end fashion.
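The uncertainty-weighted transfer term can be sketched as below (matrix sizes, the α value and the function name are illustrative assumptions): each task's transfer row in A is penalized in proportion to its uncertainty U_t, so uncertain tasks are driven toward sparse (near-zero) transfer.

```python
import numpy as np

def transfer_penalty(A, U, alpha=1.0):
    """alpha * sum_t U_t * ||a_t||_1 -- the l1 norm of task t's row of the
    transfer matrix A, weighted by that task's uncertainty U_t."""
    A = np.asarray(A, dtype=float)
    U = np.asarray(U, dtype=float)
    return float(alpha * np.sum(U * np.abs(A).sum(axis=1)))

# Identical transfer rows, different uncertainties: the uncertain tasks incur
# a much larger penalty, pushing their transfer weights toward zero.
A = np.array([[0.5, 0.5],
              [0.5, 0.5]])
penalty_confident = transfer_penalty(A, U=[0.1, 0.1])  # low uncertainty
penalty_uncertain = transfer_penalty(A, U=[2.0, 2.0])  # high uncertainty
```

When this penalty is added to the training loss, the optimizer can only keep large entries in a_t^o for tasks whose uncertainty U_t is low, which is exactly the asymmetric transfer behavior the text describes.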

Domain adaptation with task uncertainty and relaxed covariate-shift assumption
To enable domain adaptation from in vitro to in vivo settings while avoiding NT for tasks where similarity between domains is limited, we extend Eq. (7) by adding a discriminator module D (Fig. 1, D in gray) that is responsible for classifying an extracted feature Z from L(x) into the different domains (cell line, xenograft or patient tumor).
The idea here is to use adversarial learning to match the source (in vitro) and target (in vivo) marginal distributions (Ganin and Lempitsky, 2015). In this manner, we can describe the training process as a two-player game, where the module L(x) learns features that force D(L(x)) toward confusion, while D needs to accurately classify domains (Fig. 1, blue and gray; done in both supervised and unsupervised steps). In the end, L(x) is expected to learn features Z that are domain-invariant, and we can describe the learning process as:

    ℓ_adv = − (1/n_s) Σ_{i=1}^{n_s} log D(L(x_i^s)) − (1/n_t) Σ_{j=1}^{n_t} log(1 − D(L(x_j^t)))    (8)

with n_s and n_t being the number of training samples from the source (in vitro) and target (in vivo) domains, respectively. To enable this adversarial training, we employed the Gradient Reversal Layer (GRL) approach (Ganin and Lempitsky, 2015), which works by flipping the sign of the gradients that flow through D to the rest of the network during back-propagation. By adding a discriminator module D, we end up with a framework that jointly learns a shared space between domains (aligning the marginals) and uses these features to predict cancer drug response in an MTL setting. As we regularize transfer from tasks to features using task uncertainties, we constrain our model (through the a_t^o sparsity in A) to transfer less from predictions with high uncertainty based on shared features from different domains and tasks. An important by-product of this formulation is the relaxation of the covariate-shift assumption for transferring information from high-uncertainty predictors, on the basis that they are less likely to retain their predictions across domains. With this, TUGDA is trained in an end-to-end fashion as follows:

    ℓ_TUGDA = ℓ_BNN + ||Z − σ(Z S A)||_2^2 + α Σ_{t=1}^{T} U_t ||a_t^o||_1 + λ_adv ℓ_adv    (9)

with λ_adv as a hyperparameter that controls the influence of adversarial training.
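A gradient reversal layer can be sketched in a few lines (a toy forward/backward pair, not tied to any deep learning framework): it is the identity on the forward pass and multiplies the incoming gradient by −λ on the backward pass, so minimizing the discriminator's loss simultaneously pushes the feature extractor toward domain confusion.

```python
import numpy as np

class GradientReversal:
    """Ganin & Lempitsky (2015): identity forward, -lambda * grad backward."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, z):
        # features pass through to the discriminator unchanged
        return z

    def backward(self, grad_output):
        # sign-flipped (and scaled) gradient flows back to the feature layers
        return -self.lam * grad_output

grl = GradientReversal(lam=0.5)
z = np.array([1.0, -2.0])
grad = np.array([0.2, 0.4])
forwarded = grl.forward(z)
reversed_grad = grl.backward(grad)
```

In an autograd framework the same effect is obtained with a custom op whose backward pass negates and scales the gradient; everything downstream of the GRL (the discriminator) trains normally, while everything upstream (L) receives the reversed signal.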

Results
3.1 TUGDA reduces negative transfer in multi-task learning of in vitro drug responses

Dataset and baselines
To evaluate the MTL performance of TUGDA (Fig. 1, blue), we used the Genomics of Drug Sensitivity in Cancer (GDSC) database (Iorio et al., 2016), comparing against the state-of-the-art baselines Deep-GO-MTL, Deep-AMTL and Deep-AMTFL (Kumar and Daumé, 2012; Lee et al., 2016, 2018, respectively). Thus, all deep neural network models share the same number of layers until the prediction step (input layer, L and S; Fig. 1), and differ only in the regularization used. We performed 3-fold nested cross-validation (Varma and Simon, 2006) to report MTL performance. In this process, we select hyperparameters based on validation performance in the inner loop. The best-performing model of the inner loop is evaluated on an outer test fold (unseen cell lines). This process obtains a performance estimate unbiased by hyperparameter selection. We searched for the best set of hyperparameters (list and range in Supplementary Note S2) using the Tree-structured Parzen Estimator algorithm (Bergstra et al., 2011).

Results with cell line data
Models were trained to predict log IC50 values (concentration which kills 50% of cells; log-transformed) and compared in terms of the mean squared error (MSE) distribution across all 200 drugs. As can be seen in Figure 2a (Supplementary Fig. S1a, full distribution), TUGDA improves over prior methods, with the lowest median MSE of 1.65 and the highest Pearson correlation of 0.51 (Supplementary Fig. S1c). Higher performance was also observed in our ablation analysis, which consists of the following setup: TUGDA(-UT-E) is based solely on aleatoric uncertainty; TUGDA(-UT-A) uses only epistemic uncertainty; and TUGDA(-UT) uses both uncertainty types but excludes the use of U_t to weight a_t^o, i.e. the feedback loop from A to Z does not take task uncertainties into account. This analysis suggests that epistemic uncertainty plays a more important role in TUGDA's performance than aleatoric uncertainty, but the full model is key on this dataset (Fig. 2a). We also employed the Wilcoxon signed-rank test to compare TUGDA's performance with the baselines and observed that TUGDA is significantly better than all baseline methods (Fig. 2a, significance bars and asterisks on top).
To quantify NT behavior, STL-based MSEs were subtracted from the corresponding MTL-based MSEs, such that positive values indicate NT (Supplementary Fig. S2, full distribution). As shown in Figure 2b, TUGDA presented the fewest NT cases (12 out of 200 tasks, i.e. 94% of tasks with no NT), reducing the number of tasks with NT by 50% relative to the next best method (Deep-GO-MTL). We next focused our performance analysis on the more challenging tasks with smaller sample sizes (19 out of 200 drugs; median sample size 49, maximum 382) compared to the rest (minimum sample size 716, median 745). We devised this experimental setup to reflect a more realistic scenario where drug response data can be limited. Here again, TUGDA improved over the existing methods Deep-AMTL and Deep-AMTFL in terms of median MSE (Fig. 2c), and the ablation analysis highlights the utility of the full model. As can be seen from Figure 2d, NT cases were clearly enriched in this set of 19 tasks, and TUGDA reduces their number relative to the next best method (11 cases).

3.2 TUGDA provides a robust approach for domain adaptation from in vitro to in vivo response prediction

Datasets and baselines
We evaluated the unified TUGDA framework (Fig. 1, blue and gray modules) against existing unsupervised DA methods for transferring cancer drug responses from cell lines (in vitro) to two different in vivo settings: patient-derived xenografts (PDX) and patient tumors. PDX data was obtained from the Novartis Institutes for BioMedical Research (Gao et al., 2015), containing gene expression profiles (n = 399) and drug response values. Patient tumor gene expression profiles were obtained from TCGA (Network et al., 2013), as well as curated response data from Ding et al. (2016). All cell line, PDX and tumor data were processed using the same pipelines, with pre-processing steps and experimental setup as proposed in Mourragui et al. (2020). As baselines for both PDX and patient tumor predictions we employed (extended from Mourragui et al., 2020) the following: (i) an Elastic Net regression trained solely on cell line data; (ii) an Elastic Net regression trained solely on batch-corrected cell line data (Elastic Net + Combat), similar to Geeleher et al. (2017); (iii) a deep learning model (DL) (Mourragui et al., 2020); (iv) Deep Learning + Combat (DL + Combat), similar to Sakellaropoulos et al. (2019); as well as the unsupervised DA approaches (v) PRECISE (Mourragui et al., 2019) and (vi) TRANSACT (Mourragui et al., 2020) (see implementation details for all baselines in Supplementary Note S4). Similar to previous UDA methods (Mourragui et al., 2019, 2020), TUGDA is based on transductive learning (Kouw and Loog, 2021), where in the unsupervised learning step (Fig. 1) all target data (ignoring labels) is used to learn a domain-invariant space. The models are fine-tuned following the approach in Ganin and Lempitsky (2015), where the best set of hyperparameters was determined by minimizing the MSE loss on source data (cell line AUC) using the domain-invariant features. This procedure was done for PDX data (list and range of hyperparameters in Supplementary Note S3, Supplementary Tables S2 and S3) and patient data (Supplementary Note S3, Supplementary Tables S4 and S5). In both cases, the Tree-structured Parzen Estimator algorithm (Bergstra et al., 2011) was used to search for hyperparameters.

Results with PDX data
We evaluated the transfer of drug responses from GDSC cell lines to PDX data based on 14 shared drugs (extended from the seven drugs in Mourragui et al., 2020) and computed Spearman correlations between predicted (AUC) and measured response values in the PDX setting (PDX best average response, where lower values indicate sensitivity). Out of 14 drugs, TUGDA provided the highest Spearman correlation for 9 (Fig. 3: Alpelisib, Buparlisib, Cetuximab, LGK974, Luminespib, Paclitaxel, Ribociclib, Tamoxifen and Trametinib), while DL, TRANSACT and Elastic Net were the best methods for three (Afatinib, Gemcitabine and Ruxolitinib), one (Erlotinib) and one (Fluorouracil) drugs, respectively. Furthermore, when examining these results for moderate or higher correlations, TUGDA presented 8 out of 14 drugs above this threshold (0.3, dashed line in Fig. 3), followed by TRANSACT and Elastic Net with 5 and 4 drugs, respectively. Investigating the learnt feature space, we observed that cell lines and PDX samples from the same tissue tend to cluster together, showing that the model infers a biologically appropriate in vitro to in vivo transformation (see Supplementary Note S6, Supplementary Fig. S4).

Results with patient tumor data
For patient tumor data, we evaluated performance in transferring drug response predictions from cell lines to patients based on 22 drugs shared between GDSC and TCGA (extended by 5 drugs from Mourragui et al., 2020). As analyzed previously (Ding et al., 2016; Mourragui et al., 2020), TCGA drug responses were categorized into two groups: Responders ('Complete Response' and 'Partial Response') and Non-responders ('Stable Disease' and 'Progressive Disease'). Despite several additional sources of variation in patient response data (tumor heterogeneity and environment, immune response, patient health status), TUGDA showed significant associations in 9 out of 22 drugs (Table 1).

Discussion

TUGDA's strength lies in the fact that it represents a novel unified transfer learning approach for multi-task learning and domain adaptation that leverages the concept of task/domain uncertainty. These attributes align it with the fundamental challenges found in building predictive models for precision oncology, including sample size limitations, lack of curated in vivo data and violations of the covariate-shift assumption when taking drug responses into account. Our experiments show that TUGDA can provide notable benefits in a multi-task setting by reducing negative transfer, particularly when training data is limited. In addition, it shows promise as a way to robustly transfer information from in vitro data to in vivo settings, based on confidence in task predictions. In particular, for domain adaptation with patient data, we observed that TUGDA performed well for drugs that were often distinct from, and complementary to, those from previous STL DA methods, potentially due to its multi-task learning formulation finding an alternate optimum that minimizes the error for more drugs (Zhang and Yang, 2018). However, as a side effect of this, for a subset of drugs TUGDA was not able to present strong performance relative to STL DA methods (e.g. Fluorouracil and Gemcitabine for PDX data, and Cisplatin, Etoposide, Gemcitabine, Oxaliplatin and Trastuzumab for patient data). A possible future direction is to explore which tasks should be learned together and which tasks should be automatically downgraded to STL (Standley et al., 2020; Lozano and Swirszcz, 2012).
There are several potential pitfalls in the use of deep learning methods for computational biology, including unstable predictions (Mourragui et al., 2020), overfitting and a lack of interpretability. TUGDA's design seeks to address these by providing a stable training and prediction process (see Supplementary Note S8), and by employing Bayesian neural networks, L1 and L2 regularization for the feature and task-specific layers, dropout, and task uncertainties for regularizing task-to-feature transfer (instead of attention weights; Nguyen et al., 2020) to avoid overfitting. To address the interpretability gap (Gilpin et al., 2019), we explored and presented predictions for drugs with different mechanisms of action, trained on different domains (PDX or patient), that could be explained by the target's pathway.
TUGDA's approach to relaxing the covariate-shift assumption is a natural by-product of MTL using low-uncertainty features in an adversarial domain adaptation framework. This is distinct from prior work (Adel et al., 2017) that is based on learning the probability of label changes across source and target domains and using this to weight transfer. In a recent study, the concept of label shift has also been highlighted as a source of NT (Tan et al., 2020). Intrinsic differences between cancer cell lines and patient tumors (e.g. the enrichment of genomic alterations and the in vitro selection of subpopulations) make this scenario a likely one for domain adaptation in precision oncology. We envisage that TUGDA's framework can be extended to alleviate NT in the marginal distribution of drug responses as well, advancing the goal of realistic precision oncology models further.

Fig. 1. TUGDA framework for multi-task learning and domain adaptation in cancer drug response prediction. The layer L receives input data from cell lines (source data) in the supervised step, or from other domains (PDX or patients; target data) in the unsupervised step, and maps them to a latent space Z. Then, in the supervised step, the multi-task layer S uses these latent features to make predictions, as well as to compute task uncertainties U_t for regularizing the amount of transfer from tasks/domains in A to the latent features in Z via an autoencoder regularization. Using adversarial learning, in both the supervised and unsupervised steps, the discriminator D (in place to classify Z into different domains) receives the extracted features from Z and regularizes L to learn domain-invariant features. L, S, A and D each consist of a single fully connected layer. Cell line, PDX and tumor icons were created with BioRender.com.

Fig. 2. MTL performance evaluation using in vitro datasets. (a) Barplots (error bars represent standard deviation) showing MSE across tasks for different models, including state-of-the-art methods (Deep-GO-MTL, Deep-AMTL, Deep-AMTFL) and TUGDA and its ablated variants (median MSE values are shown at the bottom along with statistical significance bars on top, where the number of asterisks indicates the order of magnitude of the P-value, i.e. '****' signifies P < 1e-4). (b) Strip plots comparing the degree of negative transfer and the number of such tasks (shown in parentheses). (c) and (d) Barplots and strip plots comparing MSE and NT for tasks with smaller sample sizes (19 tasks).

Fig. 3. DA performance for predicting drug response in PDX models. Comparison of Spearman correlations between cell line-derived predictions and PDX response values for 14 drugs across different models. Numbers in parentheses represent PDX sample size. The dashed line marks the threshold for moderate or higher correlation.

Table 1. DA performance for predicting drug response in patient data.
Note: Drug names in parentheses are the corresponding matches from GDSC. We report P-values (in bold for P < 0.05) and effect sizes in brackets. Blue-colored values indicate significant associations with the largest effect size for a drug.