Multivariate Bayesian structured variable selection for pharmacogenomic studies

Precision cancer medicine aims to determine the optimal treatment for each patient. In-vitro cancer drug sensitivity screens combined with multi-omics characterization of the cancer cells have become an important tool to achieve this aim. Analyzing such pharmacogenomic studies requires flexible and efficient joint statistical models for associating drug sensitivity with high-dimensional multi-omics data. We propose a multivariate Bayesian structured variable selection model for sparse identification of omics features associated with multiple correlated drug responses. Since many anti-cancer drugs are designed for specific molecular targets, our approach makes use of known structure between responses and predictors, e.g. molecular pathways and related omics features targeted by specific drugs, via a Markov random field (MRF) prior for the latent indicator variables of the coefficients in sparse seemingly unrelated regression. The structure information included in the MRF prior can improve the model performance, i.e. variable selection and response prediction, compared to other common priors. In addition, we employ random effects to capture heterogeneity between cancer types in a pan-cancer setting. The proposed approach is validated by simulation studies and applied to the Genomics of Drug Sensitivity in Cancer data, which includes pharmacological profiling and multi-omics characterization of a large set of heterogeneous cell lines.


Introduction.
A large proportion of advanced solid tumors harbor potentially treatable genomic variants (Fontes Jardim et al., 2015;Le Tourneau et al., 2015;Von Hoff et al., 2010), but very few cancer patients actually benefit from genome-informed treatments (Marquart et al., 2018).Thus, there is great potential to improve the use and benefit of therapy for individual patients by better patient stratification and by patient-tailored design of therapies.Precision cancer medicine aims at guiding cancer patient treatment based on detailed molecular characterization of each patient's disease.One strategy that is rapidly gaining traction is ex vivo cancer drug sensitivity screening, which predicts responses to a range of potential therapies in cancer cell lines and patient-derived cells and identifies molecular features that are associated with drug response.Studies where both, drug sensitivity and molecular (multi-omics), data are available are commonly referred to as pharmacogenomic studies.In this article we employ a multivariate (multi-response) regression setup with high-dimensional input matrix to analyze pharmacogenomic data, where sensitivities to several drugs are the response variables and molecular (multi-)omics variables are the input features.We analyze data from the Genomics of Drug Sensitivity in Cancer (GDSC) database (Garnett et al., 2012;Yang et al., 2013), which contains the results from drug sensitivity screens to hundreds of cancer drugs for hundreds of cell lines representing diverse cancers in a pan-cancer setup and multi-omics characterization of these cell lines.Our approach can identify important genes affiliated with target pathways of the drugs (i.e.target genes) as well as genes whose dysfunction is known to drive cancer (cancer genes), which may guide personalized cancer therapies and aid discovery of potential new application areas of anti-cancer drugs in additional cancer types based on the identification of both tissue-specific and pan-cancer processes.
Large-scale in vitro cancer drug screens produce a large amount of drug sensitivity data which are expected to be correlated for drugs that have similar mechanisms of action or common target genes or pathways.Meanwhile, multi-omics information, including for example transcriptomics (gene expression), genomics (point mutations or copy number variations) or epigenomics (e.g.CpG methylation) data, is measured for the cancer cells, which is expected to guide personalized cancer therapies through prediction of drug sensitivity (Garnett et al., 2012;Barretina et al., 2012).The omics input data are often high-dimensional and are typically sparsely associated with the response variables in a structured manner, where variables corresponding to genes in the same molecular pathway can have similar association patterns with the drugs, for example because a drug targets a molecular signaling pathway which effects the expression of several genes in the pathway.In addition, since multiple omics characterizations reflect different aspects' of information of the same system or co-functionality of multiple gene features (Kim et al., 2019), an analysis of joint associations between the correlated multiple phenotypes (e.g, multiple drugs) and high-dimensional molecular features (i.e.multi-omics data) is desired, but poses both theoretical and computational challenges.Finally, it is expected that not all of the heterogeneity between the cancer samples can be explained by the available molecular data.In particular, a pan-cancer pharmacogenomic screen will include samples from multiple cancer types, which adds heterogeneity in the drug sensitivity due to the different tissue and cell types, even if the involved molecular pathways and mechanisms are the same.This leads us to include random effects in the model to reflect heterogeneity between cancer types.
There are a number of statistical and machine learning models developed for predicting drug sensitivity by using omics data (see e.g.Ballester et al. (2022); Sharifi-Noghabi et al. (2021); Feng et al. (2021); Adam et al. (2020)).These models are often designed for making accurate predictions, either within a single cancer type (Costello et al., 2014) or using a cancer-agnostic approach (Barretina et al., 2012).Furthermore, while emphasizing accurate predictions, many of the models lack effective variable selection options, making such black-box models less practical for biological studies or clinical applications.Huang et al. (2020) developed Tissue-guided LASSO for integrating cancer tissue of origin with genomic profiles, which just repeats the analysis in each cancer type, rather than jointly modelling the pan-cancer data.Zhao and Zucknick (2020) proposed Tree-guided group lasso with Integrative Penalty Factors to jointly model drug-drug similarities and heterogeneity of multi-omics from pan-cancer data, but do not take into account correlation structure across multiple omics data sources.
Bayesian modeling provides flexibility to specify the relationships in such complex data.There have been several Bayesian methods developed to deal with structure in complex data.For example, Bai et al. (2022) and Yang and Narisetty (2020) studied Bayesian group selection of high-dimensional predictors, but for univariate response variables.Liquet et al. (2017) extended the univariate response model to a multivariate model, but lack computational efficiency because they used a standard MCMC algorithm.Richardson et al. (2011) proposed hierarchical related regression (HRR) for multivariate response variables.HRR assumes a simple independence prior for the residual covariance matrix, and it applies an efficient Evolutionary Stochastic Search (ESS) algorithm based on Evolutionary Monte Carlo (Bottolo and Richardson, 2010).More complex priors, e.g., inverse Wishart or hyperinverse Wishart prior, can be used for the residual covariance matrix to learn structures between multivariate response variables (Petretto et al., 2010;Carvalho et al., 2007;Wang, 2010;Bhadra and Mallick, 2013;Bottolo et al., 2021).
Besides imposing different structured priors on the residual covariance matrix, it is necessary to also impose structured variable selection priors for high-dimensional predictors.Although independent spike-and-slab priors for variable selection are often used in highdimensional multivariate models (Jia and Xu, 2007;Bottolo et al., 2021;Ha et al., 2021;Chakraborty et al., 2021), a structured MRF prior can also be used for the latent indicator variables to introduce prior dependence between predictors (Chekouo et al., 2015(Chekouo et al., , 2016(Chekouo et al., , 2017)), and hyperpriors of the MRF prior can be used to infer the sparsity of the dependence structure.Lee et al. (2017) utilized the residual covariance matrix for the dependence structure in an MRF prior to encourage joint selection of the same predictor across several correlated response variables.In all these articles, an MRF prior is set for the latent variables of regression coefficients only corresponding to one response variable, which therefore does not allow to learn structures across multiple response variables.
In this article, we propose a multivariate Bayesian structured variable selection approach based on Richardson et al. (2011) and its extension by Bottolo et al. (2021), which can deal with multiple response variables (e.g., the cell lines' sensitivity to multiple cancer drugs) and high-dimensional genomic predictors, and possess computational efficiency through the ESS algorithm.Our proposed approach aims to include a known complex structure between multiple response variables and high-dimensional predictors via a flexible MRF prior for the latent indicator variables of the regression coefficient matrix.That is, we include known biological associations for the dependence structure in an MRF prior rather than doing MRF inference.Our use of the MRF prior has two main advantages: • it takes into account prior knowledge on inter-relations between predictors including across groups of predictors and across response variables, to improve model performance (i.e.variable selection and prediction), and • it performs posterior inference for the model in a more computationally efficient manner than the use of data-driven structured priors (e.g.multiplicative prior for the Bernoulli probability of the latent indicator variable (i.e.hotspot prior) by Richardson et al. (2011) and hyperprior for the MRF edge potentials by Chekouo et al. (2017)) would allow.
For example, Figure 1 illustrates two groups of drugs and their corresponding two groups of target genes or pathways across multiple omics characterizations.When using omics data to predict drug responses, the associations between the multiple drugs and omics features can include prior knowledge about the groups of drugs and their target genes or target pathways.An MRF prior is able to address the joint structure by adding the edges for omics features within a group of target genes or pathways that correspond to the group of their targeting drugs.In addition, if the drug responses are measured on cell lines from different cancer types or different tissues, we use random effects to capture the sample heterogeneity arising from these sample groups.An R package BayesSUR (Zhao et al., 2021) is available on the Comprehensive R Archive Network at https://CRAN.R-project.org/package=BayesSUR.The rest of the article is organized as follows.In Section 2, we introduce the Bayesian SSUR model, propose an MRF prior for the latent indicator variables of the coefficient matrix, and introduce random effects for sample groups.Section 3 compares the performances of Bayesian SSUR models with our MRF prior to the hotspot prior by Bottolo et al. (2011) with respect to (w.r.t.) structure recovery and prediction in simulated data.In Section 4, we analyse a pharmacogenomic data set from the GDSC database.In Section 5, we conclude the article with a discussion.

Methodology.
2.1.SSUR model.We study a multivariate regression model with a response matrix Y n×m from n samples and m response variables.All response variables are regressed on the same p predictors which are measured on the n samples, so that the predictor matrix is X n×p .Associations between the responses Y and predictors X are captured by a coefficient matrix B p×m .We first assume correlated response variables, but independent samples.Section 2.3 will then extend the model to allow for correlated samples.The classic SUR model is defined as where the residuals have correlated columns with covariance Ψ and independent rows, and vec{•} is to vectorize a matrix by column.
In the Bayesian framework, to efficiently sample from the posterior distribution of the regression coefficients from (1), Zellner and Ando (2010) reparametrized the SUR model and proposed a direct Monte Carlo procedure.Bottolo et al. (2021) used the same reparametrization for the SUR model, but with an inverse Wishart prior Ψ ∼ IW(ν, τ I m ).Briefly, then model (1) can be rewritten as where u l = y l − Xβ l .The reparametrized parameters (σ 2 j , ρ jl ) have priors where v is fixed and τ ∼ Gamma(a τ , b τ ).Note that the joint distribution f (Y|X, B, Ψ) is the same regardless of the order used for the decomposition since we are simply factorising it by chain-conditioning (Bottolo et al., 2021).
The reparametrization factorizes the likelihood across multiple response variables possible, which especially benefits high-dimensional response variables.If only a few of the p predictor variables are assumed to be associated with any of the response variables, we use a latent indicator matrix Γ = {γ kj } for variable selection.If γ kj = 1, then β kj = 0 and the kth predictor is regarded as an associated predictor to the jth response variable; otherwise γ kj = 0 and β kj = 0. Independent spike-and-slab priors (George and McCulloch, 1993;Brown et al., 1998) for the regression coefficients can be used to find a small subset of predictors that explains the variability of Y, for example: (4) where w ∼ IG(a w , b w ) and δ 0 (•) is the Dirac delta function.
We may not only introduce sparsity to the high-dimensional coefficient matrix, but also sparsity to the precision matrix Ψ −1 , which implies that the residuals u l = y l − Xβ l and u j = y j − Xβ j for only a few pairs of response variables l = j have non-zero partial correlations, assuming a multivariate normal distribution for the residuals.Such a sparse precision matrix can be conceptualized as a graph G, with nodes representing the residual variables u l , and edges between them corresponding to non-zero elements of the precision matrix.Bottolo et al. (2021) used a hyper-inverse Wishart prior for Ψ instead of an inverse Wishart prior, i.e.
It assumes an underlying decomposable graph G between residuals.The HIW prior on decomposable graphs greatly enhances computational power since the parameters are updated within each clique and there is no computationally expensive normalisation constant to calculate.Since the fully Bayesian estimation procedure produces edges averaged over many different graphs, the posterior mean graph can well approximate non-decomposable graphs (Fitch et al., 2014).A sparse graph G can result in sparse Ψ −1 .So Bottolo et al. (2021) specified a Bernoulli(η) prior for each edge of the graph.Then a Binomial prior is on the cardinality edge-set where η ∼ Beta(a η , b η ) controls the sparsity of the graph.Based on ( 5) and ( 6), the parameters σ 2 and ρ are indexed across the response variables of each clique of G rather than all response variables.In addition to sparse covariance selection, Bottolo et al. (2021) also used sparse variable selection for the predictor variables via a hotspot prior (i.e. a multiplicative prior) for the hyper-parameter ω kj in γ kj ∼ Ber(ω kj ).A guideline of prior specifications for the hyper-inverse Wishart prior and spike-and-slab prior can be found in Supplementary S1.
2.2.SSUR model with MRF prior.Figure 1 illustrates known relationships between drug responses and genomic predictors.As an example, imagine a group of drugs with the same mechanism of action, where the response of a cancer cell to these drugs depends on a certain gene to be silenced.Gene silencing can either occur via a genomic alteration (deletion event), missense mutation, or another down-regulation of gene expression.It might thus be observable in one or several omics features, e.g.gene expression, copy number variation or mutation data.We may include such prior knowledge in the SUR model (1), instead of using independent or hotspot priors (Richardson et al., 2011;Lewin et al., 2016;Bottolo et al., 2021).
We propose to use an MRF prior for the latent indicator vector γ = vec{Γ} to address prior structure for the associations between response variables and predictors.The MRF prior is (7) f (γ|d, e, G) ∝ exp{d1 γ + eγ Gγ}, where the scalar d controls overall model sparsity, scalar e determines the strength of the structure relationships between responses and predictors, and G is an mp × mp (possibly weighted) adjacency matrix representing a graph to include prior structure knowledge.Term d1 γ in (7) can be generalized to d γ, where the vector d will assign different relative contributions to the prior selection probabilities of the predictors.To specify the scalar d, we refer to Lee et al. (2017) by using log-odds of a rough model sparsity (i.e.proportion of nonzero regression coefficients).To specify e, Stingo et al. (2011) suggested a separate simulation from (7) over a grid of e to detect the "phase transition" value, and then specified a Beta prior on e.However, due to much computational cost in highdimensional γ, especially in multivariate regressions when searching some large values of e resulting in very dense models, we first estimate the roughly largest e and then use a grid search for e to identify its optimal value for the model.See Supplementary S2 for more details.
For the G matrix, we assign a positive edge potential {k+j(p−1), k +j (p−1)}-element if the latent indicator variables γ kj and γ k j are correlated.To illustrate the idea, we consider a simple case with three response variables (i.e.y 1 , y 2 and y 3 ) and four predictors (i.e.x 1 , x 2 , x 3 and x 4 ).When the predictors x 1 and x 2 are assumed a priori to be associated with responses y 1 and y 2 , and x 3 and x 4 are assumed to be associated with y 3 , then G is a 12 × 12 matrix given by equation ( 8).Any nonzero element in G above can be any positive number which indicates a weight for the prior relationship between two latent indicator variables.Here for simplicity, we construct a symmetric G matrix and assume all nonzero weights to be 1.
Note that we might not know all exact relationships between response variables and predictors, but we still formulate the matrix G based on what we know.For example, if we only know relationships between response variables and relationships between predictors, we can aggregate these relationships by G y ⊗ G x − I.Here we use −I to only allow zero diagonals in G, because nonzero diagonals are already captured by the term d1 γ.For example, if we assume that y 1 and y 2 are related w.r.t. each predictor, x 1 and x 2 are related w.r.t. each response variable, and x 3 and x 4 are related w.r.t. each response variable, this translates into the following three Kronecker products.We can then aggregate them by aligning their coordinates into the full matrix G.

Gy
for y 1 and y 2 Gy for y 1 ,y 2 and y 3 Gy for y 1 ,y 2 and y 3 2.3.SSUR model with MRF prior and random effects.The SSUR model with hotspot prior in Section 2.1 and SSUR model with MRF prior in Section 2.2 both assume independent and identically distributed samples conditional on the predictors.However, samples can be heterogeneous, especially in applications with large sample size.For example, largescale drug screens may include cell line samples from different cancer tissue types.We address the heterogeneity of multiple sample groups by introducing random effects into the model similarly to Chekouo et al. (2015).
Let Z n×T be indicator variables representing n samples from T heterogeneous groups.We define an SUR model which includes spike-and-slab priors (4), hyper-inverse Wishart prior (5), MRF prior (7) and random effects, where the random effects B 0 = {β 0,tj : t = 1, • • • , T ; j = 1, • • • , m}, and all priors above are mutually independent: Let us look into details of the random effects.For any ith sample and jth response variable, we have For the ith sample, the covariance between the jth and j th response variables is ψ jj that is the jj -element of Ψ, since Although the priors for the coefficients B 0 and B in (9) do not provide any correlation between different responses for the same sample, the hyper-inverse Wishart prior on Ψ models correlations between the response variables, and so does an inverse Wishart prior on Ψ.If we look at the reparametrization (3) from the inverse Wishart prior, or similarly from the hyper-inverse Wishart prior, correlations between the response variables are contained in the reparametrized parameter ρ.
For the jth response variable, the covariance between the ith and i th samples is in which the hyper-parameter w 0 in the random effect determines the correlation between two samples from the same group.We would like a weakly informative prior for β 0,tj based on previous studies or expert knowledge in applications.In pharmacogenomic studies from multiple cancer tissues, for predict drug responses a tissue effect is usually stronger than a gene effect.Therefore it is appropriate to specify a larger hyper-parameter w 0 than w.

Posterior computation.
Posterior inference for the SSUR model with the MRF prior with or without additional random effects can be done in a similar manner to Bottolo et al. (2021).For the SUR model ( 2) with a hyper-inverse Wishart prior for the residual covariance matrix Ψ and an MRF prior for the latent indicator variables γ, the joint posterior distribution is where ρ and σ 2 are vectors of {ρ jl } and {σ 2 j }, respectively.Since y j |X, B, ρ, σ 2 is normally distributed with mean Xβ j + l<j u l ρ jl and variance σ 2 j I n , we can obtain the full conditional distributions of the regression coefficients β j , w, σ 2 j , ρ jl and τ .The posterior distribution of the latent indicator variable γ = vec{Γ} is estimated by a Metropolis-Hastings sampler.The graph G of the hyper-inverse Wishart prior is sampled from a junction tree sampler which is essentially Metropolis-Hastings sampling (Green and Thomas, 2013), see Bottolo et al. (2021) for more details.If there are random effects for sample groups as in (9), the joint posterior distribution (10) includes parameters B 0 and w 0 , i.e.
We implement Gibbs samplers to obtain posterior estimates for B, ρ and σ 2 , and update the latent indicator variable Γ via a Metropolis-Hastings sampler with parallel tempering in the same way as Bottolo et al. (2021).Thompson sampling (Russo et al., 2018) is used to derive the proposal for each latent indicator γ kj .The hyper-parameter τ is updated via a random walk Metropolis sampler as proposed by Bottolo et al. (2021).To overcome the prohibitive computational time in high-dimensional settings, the ESS algorithm (Bottolo and Richardson, 2010;Richardson et al., 2011) is used to update the posteriors.For each iteration of the MCMC sampler, after sampling the latent indicator variables Γ, we first update the hyper-parameters (τ, w, w 0 , G), then update the parameters σ 2 and ρ, and finally the regression coefficient matrices B and B 0 (see Supplementary S3).At each iteration, the ESS algorithm implements a local move to add/delete and swap the latent indicator variables within each chain, and then a global move to exchange and crossover the latent indicators between any two parallel tempered chains.The temperature across all response variables is adapted based on the acceptance rate of the global exchange operator.The ESS algorithm with parallel tempering is effective in searching a high-dimensional model space with multiple modes (Bottolo and Richardson, 2010).
2.5.Model performance evaluation.To evaluate the performance of the proposed approach, we focus on structure recovery and prediction performance.The structure recovery includes the estimation of the latent indicator variable Γ which captures the relationships between response variables and high-dimensional predictors, and the estimation of the graph G which addresses the residual relationships between response variables.Predictive accuracy of Bayesian models for new data points can be measured by the expected log pointwise predictive density (elpd), which can be assessed by leave-one-out cross-validation (elpd loo ) or by the widely applicable information criterion (elpd waic ) (Vehtari et al., 2017).We also calculate the root mean squared prediction error (RMSPE) measured on an independent test data set for the median probability model (MPM) (Barbieri and Berger, 2004;Barbieri et al., 2021) in addition to the training data root mean squared error (RMSE).Vehtari et al. (2017) proposed an efficient computation for the Bayesian LOO estimate of out-of-sample predictive fit elpd loo = m j=1 n i=1 log f (y ij |y (−i)j ), where y (−i)j is the observation vector of the jth response variable except the ith observation.The LOO is estimated by , where N is the length of an MCMC chain and ϑ (t) is the MCMC samples at the tth iteration for all related parameters.The WAIC is estimated by where the second term above is used as a measure of the model complexity.
For future prediction, a single model may be required in some cases, for practical reasons or for simplicity.Barbieri and Berger (2004) suggested the median probability model (MPM), which is defined for each coefficient to be E[β kj |γ kj = 1, data] if P{γ kj = 1|data} > 0.5, or 0 otherwise.It can be estimated through MCMC estimates: where γ kj is the estimate of the tth MCMC iteration for the latent indicator variable of where Y n×m and X n×p were used to estimate BMP M , and Y * n ×m and X * n ×p are new data.
In this section, our SSUR model with MRF prior, denoted as SSUR-MRF, is evaluated w.r.t.structure recovery for the regression coefficient matrix and prediction performance of responses.We set up two simulation scenarios: one with independent samples and the other with heterogeneous and correlated samples.In the first scenario, our approach is compared with the SSUR model with a hotspot prior, denoted as SSUR-hotspot, which was studied by Bottolo et al. (2021).In the second scenario, our approach is compared with the SSUR-MRF model without random effects.The noise matrix U is simulated based on the multivariate normal distributed Ũ and a G-Wishart distribution (Mohammadi and Wit, 2019).We first simulate the G-Wishart distribution P ∼ W G (3, M ) where diagonals of M are 1 and the off-diagonals are 0.5, and then use Cholesky decomposition chol(P −1 ) to obtain the noise matrix U = Ũ • chol(P −1 ).We control the average signal-to-noise ratio defined by Bottolo et al. (2021), and set it to 10. In-dependent data sets X * and Y * are simulated based on the same scenario as validation data.In scenario 2, X, Γ, B Γ and U are generated in the same manner as scenario 1.We also include group indicators Z with independent row vectors z i ∼ multinomial(0.1, 0.2, 0.3, 0.4) (i = 1, • • • , n and the number of groups is set to T = 4), and random effects B 0 with each group effect from N (0, 2 2 ).The response dataset is generated from a linear mixed model Y = XB Γ + ZB 0 + U. Independent validation data sets are also simulated based on scenario 2. The algorithms for the two simulation scenarios can be found in Supplementary S4.Both simulation algorithms generate validation datasets independently with the same sample size to evaluate the performance of the proposed methods.

3.2.
Comparison of the SSUR-hotspot and SSUR-MRF models.We first compare our proposed SSUR-MRF model to the SSUR-hotspot model on simulated data generated with scenario 1.Our approach uses the network in Figure 2(a) as prior information to construct edge potentials for the MRF prior as illustrated in Section 2.2.Throughout this article, we refer to a predictor as being selected or identified, if the corresponding latent indicator variable has posterior mean larger than 0.5.Figure 3 shows that SSUR-hotspot and our SSUR-MRF both have good recovery for the residual structure between response variables (i.e.G).However, SSUR-MRF has better structure recovery of the latent indicator variable Γ.Table 1 reports higher accuracy, sensitivity and specificity of the estimator for Γ by SSUR-MRF than SSUR-hotspot.The two methods have similar elpd loo and elpd waic , but our approach has smaller RMSE and RMSPE.3.3.Sensitivity analysis for SSUR-MRF.The MRF prior can be strongly informative as the edge potentials were constructed according to the true relationships Γ between the simulated response variables and predictors.Here we use different edge potentials G in the MRF prior for a sensitivity analysis.Starting from the previously constructed edge potentials G, we partially delete true edge potentials, either uniformly or non-uniformly, or add noise edges, or aggregate Kronecker products between the three response groups and six predictor groups as shown in Figure 2(a).The four cases are as follows: • Case 1: delete 1%, 10%, 50% or 90% edges uniformly from the fully informative G.
This case for every block in Γ, some corresponding edge potentials in G are kept.• Case 2: delete 1%, 10%, 50%, 90% or 100% edges non-uniformly in consecutive chunks from the edge list 1 of the fully informative G.In this case, for some blocks in Γ all corresponding edge potentials in G are deleted.• Case 3: add 0.1%, 0.5%, 1% 2 noise edges to the fully informative G.
• Case 4: aggregate Kronecker products between response groups and predictor groups (see guidance in Section 2.2).
Table 2 Case 1 shows that our SSUR-MRF model can identify well truly associated predictors w.r.t.accuracy, sensitivity and specificity of the estimated Γ, and have stable prediction performance w.r.t.RMSE and RMSPE, when deleting 1%, 10%, 50% or 90% true edges uniformly.This indicates that our approach can recover a good structure of Γ and good prediction performance of responses, even if only a little true association knowledge across all patterns of Γ is used in the MRF prior.Case 2 (Table 2) where some of the patterns/blocks in Γ are fully unknown (i.e. when the corresponding blocks of edges in G are deleted) in the MRF prior, the sensitivity of variable selection and prediction performance w.r.t.RMSE and RMSPE becomes slightly worse.Figure 4 indicates that the information of the deleted blocks cannot be recovered fully, but will instead be estimated with a sparser Γ. Supplementary S5 shows slightly worse residual structure recovery (i.e.Ĝ) when deleting more edges non-uniformly.However, even the worst-case scenario in Case 2, when all edges are deleted, i.e. when the MRF prior with G = 0 degenerates to a Bernoulli prior without any known structure information between variables, has similar performance to the SSUR model with hotspot prior in Table 1.Case 3 (Table 2), where adding noise edges, shows similar variable selection and prediction performance to using true potential edges.Finally, Case 4 (Table 2), where aggregating Kronecker products for the edge potentials in the MRF prior, the variable selection remains similar to using true potential edges.Here, elpd loo and elpd waic do not change much between different cases, but they can be used as the objective function to optimize hyperparameters. 1 The coordinates of all nonzero entries of G are put in an edge list in order.Deleting edge potentials uniformly, e.g.deleting 1%, means that the 1 + (1 − 1/|E|)/1% • {0 : (1% • |E|)}th edges of the edge list are deleted.Deleting 1% edges non-uniformly (i.e.blocks of edges) means that the last 1% edges in the edge list are deleted.The edge list includes the edges of each pattern (i.e.association block) together.Adding 1% noise edges means that 1% • mp(mp − 1)/2 wrong edges are included randomly.
2 Note that 1% already exceeds the total number of true edges that are ∼ 0.3% of all possible edges.3.4.Results and discussion of SSUR-MRF with random effects.In the simulation scenario 2, T = 4 sample group variables are simulated to assess the performance of our SSUR-MRF model with random effects.Figure 5(a), 5(c) and Table 3 show similar recovery of the latent indicator variable Γ w.r.t.accuracy, sensitivity and specificity for both SSUR-MRF with and without random effects.However, SSUR-MRF model without random effects is difficult to recover the residual graph structure G (Figure 5(d)), while the model with random effects can recover well the true structure (Figure 5(b)).See also Table 3, which reports the recovery performance of G w.r.t.accuracy, sensitivity and specificity when thresholding its posterior mean at 0.5.For the response prediction, SSUR-MRF with random effects has smaller elpd loo , elpd waic , RMSE and RMSPE than SSUR-MRF without random effects (Table 3).In addition, for the accuracy of estimated regression coefficients, 2 by SSUR-MRF without random effects has larger error (0.050) than by SSUR-MRF with random effects (0.013).In addition, the  4. Analysis of the pharmacogenomic screen.
4.1.Pharmacogenomic data.We apply our approach to the Genomics of Drug Sensitivity in Cancer (GDSC) database (Yang et al., 2013;Garnett et al., 2012) to study the relationships between multiple cancer drugs and high-dimensional genomic features characterising cancer cell lines.The pharmacological and genomic data are from the archived dataset release 5 (https://www.cancerrxgene.org)preprocessed by Garnett et al. (2012).We would like to investigate how the MRF prior can help to improve inference for groups of drugs that are known to have correlated response; we therefore select two groups of cancer drugs with similar molecular targets and the generic non-targeted chemotherapy agent Methotrexate: four MAPK inhibitors (RDEA119, PD-0325901, CI-1040 AZD6244), two Bcr-Abl tyrosine kinase inhibitors (Nilotinib, Axitinib), and one chemotherapy agent (Methotrexate).
The seven drugs were tested on 499 cell lines from 13 cancer tissue types with complete drug sensitivity values.The drug sensitivity of the cell lines was summarized by the log 10 (IC 50 ) from in vitro drug concentration response experiments.Note that smaller log 10 (IC 50 ) values indicate higher sensitivity of a cell line to the drug; therefore a negative regression coefficient indicates that a positive increment of the value of a feature is associated with an increase in drug sensitivity.In order to explore the relationships between the three groups of drugs and the genomic profiles of the cell lines, we first preselect known cancer genes and their corresponding genomic features by following Garnett et al. (2012), including 426 copy number variation features (counts) and 68 mutated features (binary).To make a trade-off between the computational efficiency and amount of information from gene expression data, we then preselect three subsets of gene expression features with the largest variances over cell lines, which explain 10%, 30% and 50% of the variation, which results in 269, 1175 and 2602 gene expression features, respectively.This creates three data sets including both gene expression, copy number variation and mutation information, to predict drug sensitivity responses, i.e. feature set I with 763 predictors, feature set II with 1669 predictors, feature set III with 3096 predictors.9) in Section 2.3, we summarize some known biological relationships between the drugs and genomic information.First, all features (gene expression, copy number variation, mutation) corresponding to the same gene are assumed to be related.Such group of features are likely to be identified together corresponding to each drug, i.e. if one feature for a certain gene is a predictor of drug sensitivity, then the other features corresponding to the same gene are more likely to be predictors as well.This is illustrated in Figure 6(a) and results in a Kronecker product for the edge potentials Gy 7 drugs Second, the two Bcr-Abl tyrosine kinase inhibitors were developed to inhibit Bcr-Abl tyrosine kinase activity and proliferation of Bcr-Abl expressing cells, so the point mutation BCR-ABL, and features associated with genes BCR and ABL are related and likely to be identified together corresponding to the two Bcr-Abl inhibitors.This is illustrated in Figure 6(b) and results in a Kronecker product for the edge potentials Third, the four MAPK inhibitors were developed to reduce the activity of the MAPK pathway, so genes representing the MAPK pathway are likely to be identified together as potential predictor variables for drug sensitivity of the four MAPK inhibitors.Based on the set of genes linked to the MAPK pathway from the Kyoto Encyclopedia of Genes and Genomes (KEGG) PATHWAY database, we can construct another Kronecker product for the edge potentials.Here an edge potential of the G matrix, i.e. an edge weight, is 2, if the corresponding two features are both from the same gene and also belong to one group of drug target genes.Finally, we aggregate the individual Kronecker products by aligning their coordinates in the final G matrix for the MRF prior.
Other prior specifications, MCMC settings and diagnostics can be found in Supplementary S7 and S8.For comparison, we also run almost the same model as SSUR-MRF but with hyperparameter e = 0 in the MRF prior, which degenerates to a Bernoulli prior, and which we name SSUR-Ber.We choose SSUR-Ber instead of SSUR-hotspot as the comparison model, since it is easier to use than SSUR-hotspot, which has many tuning hyperparameters of the hotspot prior, and because SSUR-Ber has shown similar model performance as SSUR-hotspot shown in Section 3. 4.3.Results and discussion.Figure ?? shows an estimated residual structure between the seven drugs by our SSUR-MRF model based on Feature set III with the most genomic information.It does not only estimate residual correlation between any two MAPK inhibitors and between the two Bcr-Abl inhibitors, but also separates the chemotherapy drug Methotrexate from the other drugs.Supplementary S9 shows the residual structures between the seven drugs as estimated by the SSUR-MRF and SSUR-Ber models with feature sets I-III, respectively.We find that the structure estimated by our SSUR-MRF model based on feature set III is closest to our knowledge about the relationships between the seven drugs.To look at variable selection, a gene feature is considered to be identified if the estimated marginal selection probability of its coefficient is larger than 0.5, i.e. if the corresponding latent indicator variable has posterior mean larger than 0.5.Table 4 reports the numbers of identified features over the seven drugs by the SSUR-Ber and SSUR-MRF models.SSUR-Ber results in very sparse models and identifies a similar number of genomic features for each drug.In contrast, our SSUR-MRF model identifies more genomic features and finds a different model sparsity for the three drug groups, in particular relatively denser models for the four MAPK inhibitors.This indicates that our model is able to distinguish variable selection corresponding to different response variables.For the group with the two BCR-ABL inhibitors, i.e.Nilotinib and Axitinib, both SSUR-Ber and SSUR-MRF identify the mutation BCR-ABL associated with drug Nilotinib, as expected.For the group of the four MAPK inhibitors, Figure 8 displays the numbers of identified features by SSUR-Ber and SSUR-MRF.For feature sets I, II and III, SSUR-Ber identifies quite different features (Figure 8(a)), i.e. there is not much overlap.However, our SSUR-MRF model identifies 35 common features over the three feature sets.This reflects more stable variable selection due to using prior knowledge via the MRF prior.Table 5 further shows that the SSUR-Ber model does not identify any known target features for the MAPK inhibitors.Supplementary Table S10.1 shows the identified feature names for the MAPK inhibitors by SSUR-Ber.Overall, our SSUR-MRF model is able to identify many more features than SSUR-Ber, and identifies more known target features for the MAPK inhibitors.Figure 9 shows the names of features that were identified for the MAPK inhibitors by the SSUR-MRF model.The seven copy number variation and mutation features in Figure 9(a) are also in Figure 9(b) and (c), because only more gene expression features are selected by the models using feature sets II and III, but no additional mutation or copy number variation features.As more target gene expression features are used to construct the edge potentials in the MRF prior in the models built with feature sets II and III, our approach can identify more of them.As Figure 8(b) shows, we have identified 35 common features with the SSUR-MRF model over feature sets I, II and III, but only seven of these common features belong to known target genes of the corresponding drugs as shown in Figure 9.We find that the 28 other common identified features (listed in Supplementary Table S10.2) are cancer genes, i.e. genes that are known to be deregulated in cancer.The Cancer Gene Census summarizes how dysfunction of these genes drives cancer (Sondka et al., 2018).In Table 6, prediction performances of the SSUR-Ber and SSUR-MRF models are reported based on Feature set III which has the most genomic information.Overall, prediction performance is very similar between the two models.As for elpd loo or elpd waic , our SSUR-MRF model is slightly worse than SSUR-Ber.To assess the prediction performance of the median probability model, we need an independent data set for out-of-sample prediction to obtain RMSPE.For this purpose we gathered 46 cell lines with complete pharmacogenomic data from the updated GDSC data set by Smirnov et al. (2016), that were not included in our training data.Table 6 shows that SSUR-MRF has slightly better RMSPE than SSUR-Ber on this independent data set.Our approach also estimates the tissue-specific effects of 13 cancer types, which may indicate relationships between drug responses and cancer types.Figure 10 shows the estimated random effects by SSUR-MRF using the genomic feature set III. Negative effect estimates can indicate especially high effectiveness of a drug to kill cancer cells of the corresponding cancer type.We focus on the strongest (negative) effects, since most error bars of the posterior mean for the cancer type random effects are quite large.Methotrexate has the strongest average effect in blood cancer samples; it is known to be an effective chemotherapeutic agent in leukemia (Powell et al., 2010).Supplementary S12 shows that Methotrexate has much lower log(IC 50 ) values (i.e. more effectiveness) on cell lines from blood tissue type compared with other tissue types.Three of the four MAPK inhibitors (RDEA119, PD-0325901, CI-1040) have their strongest effect in skin cancer cell lines.Supplementary S12 also shows that these drugs have lower log(IC 50 ) values on skin tissue cell lines compared with other tissue types, while AZD6244 shows more variation.Nilotinib and Axitinib are common targeted therapies for chronic myelogenous leukemia with a BCR-ABL mutation (Halbach et al., 2016).We can observe the effect of Nilotinib on blood cancer samples in Figure 10 although with large error bars, and the quite low log(IC 50 ) on the only four BCR-ABL mutated blood cancer cell lines are shown in Supplementary S12.

Conclusion.
In this work, we have developed a multivariate Bayesian structured variable selection model for analyzing data from pharmacogenomic studies.Our model exploits the relationships between multiple correlated response variables (drug sensitivity measurements) and high-dimensional structured multi-omics input data for variable selection and to improve prediction.With our approach we want to (a) be able to borrow known information between response variables and predictors, (b) learn associations between response variables and predictors, and (c) understand the residual covariance of response variables.The proposed approach allows us to make use of known network information on the relationships between responses and predictors in an MRF prior for the variable selection indicator Γ, and to further simultaneously select predictors in a sparse manner and learn the residual covariance matrix between the response variables.In addition, we can take into account sample heterogeneity through random effects which are excluded from the variable selection.Guidance for specifying (weakly) informative hyper-parameters has been provided in the Supplement.
Through the simulation studies, we have demonstrated that the proposed approach can recover the network structure (i.e.latent indicator variable Γ) between multiple response variables and predictors, and predict responses well.We have found that including only a small amount of prior knowledge for most patterns/network groups (Figure 4(d)) will improve model performance over a model that does not any include prior knowledge.Our approach is also robust to noise in the prior information (i.e.false edge potentials) in the MRF prior (Table 2).Even if there is no prior association knowledge between drugs and genes/pathways (i.e.subgraphs), our approach has similar model performance as SSURhotspot.
In the pharmacogenomic data application, our approach robustly identified molecular targets of the targeted therapies, and also validated other known cancer-related genes.The use of known information in the MRF prior improved the prediction performance in the independent validation data compared to SSUR-Ber when applied to the largest input data set (feature set III).Through the random effects in our approach, cancer tissue effects were estimated, which could indicate potential relationships between drugs and cancer types.Nevertheless, there was still remaining heterogeneity within cancer types, e.g.reflecting molecular cancer sub-types.To address this, our model could be extended to multilevel random effects or a mixture approach could be employed for the random effects, e.g. by a flexible Dirichlet process prior (Li et al., 2010;Heinzl et al., 2012).i Supplementary materials for 'Multivariate Bayesian structured variable selection for pharmacogenomic studies' S1: Prior specifications for HIW prior and spike-and-slab prior.
For the hyper-parameters ν and τ in Ψ ∼ HIW G (ν, τ I m ), we specify a fixed ν = m + 2 and τ ∼ Gamma(a τ , b τ ).ν = m + 2 is suggested by Bottolo et al. (2021), because it is the smallest integer degree resulting in a proper prior for the reparametrized parameter σ 2 j .A sensitivity analysis for these hyper-parameters of the hyper-inverse Wishart prior can be found in Supplementary S5.
The hyper-parameters a w and b w control the variance of the spike-and-slab prior for non-zero regression coefficients in which determines the posterior scale for the estimated effects.In order to provide a sufficiently large scale for the effects, we would like to allow some posterior density for values of w that will correspond to non-neglible posterior density for values of β kj larger than the upper 95% confidence bound E[β kj ] + 1.96 Var[β kj ], where E[•] and Var[•] are the mean and variance of the prior for the effect β kj .This is to ensure that prior of w provides large enough variation to be able to cover a wide range of β kj .Since w has posterior conditional 0,tj respectively, a w and b w can be chosen with similar scales as the two factors.For example, we assume a model sparsity (i.e., proportion of nonzero coefficients) r sparsity = 1 mp kj γ kj ∈ (0, 1), and E[β kj ] and Var[β kj ] can either be estimated roughly from previous studies or elucidated using expert knowledge about typical effect sizes in similar studies.Then we specify where const a and const b are chosen to ensure that the 5% quantile of the posterior conditional ( ) is larger than E[β kj ] + 1.96 Var[β kj ].

S2: Prior specification for the MRF prior.
It is known that the hyper-parameter e in a Markov random field (MRF) prior f (γ|d, e, G) ∝ exp d1 γ + eγ Gγ can display a phase transition behaviour (Li and Zhang, 2010;Stingo et al., 2011) which results in a sharp increase in the number of selected predictors (i.e., β kj = 0) with a small change of e given a value (w, d).In the MRF prior, the model sparsity is logit −1 d which can be used to specify the hyper-parameter d.Here logit is log-odds that is the natural logarithm of the odds.
To specify e, Stingo et al. (2011); Lee et al. (2017) first looked for the phase transition boundary, and then implemented a grid search for the hyper-parameter.This might still search many values for e over the phase transition boundary, which would result in very dense models in high-dimensional settings.So we first determine the expected largest value e to avoid searching in the space of too dense models which would be computationally slow, and then implement the grid search strategy.If one wants to specify model sparsity in the range (c 1 , c 2 ), one can achieve this by specifying the hyper-parameter d subject to logit −1 d = c 1 and search hyper-parameter e from 0 to −d−ln(c −1 2 −1) (see below).Because of d =logit c 1 , c 1 represents a lower bound for the model sparsity which is reached if e = 0. We aim to give a roughly estimate for e ≥ 0 based on the assumed largest sparsity c 2 .
An estimate for e in the MRF prior and {g rj } rj = G.
One might expect to take the average over γ to be the assumed largest sparsity c 2 (i.e., roughly inclusion probability of each predictor), However, it is not easy to get an explicit or approximate expression for e via the equation above.
Our strategy is to assume at least one link/neighbour so that the parameter e is not too small.Since r =j g rj γ r counts the number of links/neighbours of the rth feature, we let (c 1 , c 2 ) = (0.1, 0.3), and then d = logit c 1 = −2 and grid interval e = (0, −d−ln(c −1 2 −1)) = (0, 1.2).We determine the final e = 1 for Algorithm 1, and the final e = 0.2 for Algorithm 2.
We run a MCMC sampler with 5 chains, 400 000 iterations in total with the first 200 000 iterations as a burn-in period.
For our approach SSUR-MRF model, we tested various hyperparameters of the hyperinverse Wishart prior based on the simulation scenario 1.Large ν = 100 or a τ in the hyper-inverse Wishart prior results in a denser Ĝ than G (Figure S5.1, S5.3, S5.5 and S5.7).But all hyperparameters of the hyper-inverse Wishart prior are not much sensitive to the structure recovery of Γ and response predictions (Table S3).To specify weak informative priors for the random effects for the 13 cancer tissue types, we set the hyper-parameter w 0 ∼ IG(54.6,400) by assuming each cancer tissue effect with prior mean 0 and prior standard deviation 3 and according to the prior specification in section S5.Corresponding to the three feature sets, the model sparsity is assumed to be 6%, 2% and 1% respectively 3 , so we obtain parameter d of the MRF prior with values -2.7, -4 and -4.6.Our SSUR-MRF model with random effects, hereafter denoted as SSUR-MRF, has good MCMC diagnostics (see Section S8).For comparison, we run a similar model with parameter e = 0 of the MRF prior, which is equivalent to independent Bernoulli priors for the latent indicator variables, here denoted as SSUR-Ber.In SSUR-Ber, the model sparsity hyper-parameter d in the prior f (γ|d) ∝ exp{d1 γ} is tuned to reach the best elpd loo and elpd waic .

Fig 2 :
Fig 2: Simulation scenarios: True relationships between response variables and predictors.(a) Network structure between Y and X; (b) latent indicator variable Γ for the associations between Y and X in the SUR model; (c) additional structure G between response variables not explained by XB.Black indicates a true relation between the response variables and predictors.

Fig 4 :
Fig 4: Results for simulation scenario 1: Sensitivity analysis for case 2, i.e. when blocks of edges are deleted (i.e.delete edges non-uniformly).

Fig 5 :
Fig 5: Results for simulation scenario 2: Posterior mean of Γ and G by the SSUR-MRF with random effects based on the simulated data from scenario 2

Fig 6 :
Fig 6: GDSC data application: Illustration of the assumed relationships between drugs and related gene features, which are used for the MRF prior.The right panel is for the Bcr-Abl fusion gene, its corresponding related features and the two Bcr-Abl tyrosine kinase inhibitor drugs.The left panel illustrates gene TP35 as one example with its corresponding features representing all three data sources; all seven drugs are shown to indicate that the relationship between the three gene features is valid in relation to all drugs in the data set.The rectangles indicate drugs, solid circles indicate gene features and dashed circles indicate that the elements inside are related.The edges between drugs and dashed circled gene features indicate assumed associations between the gene features and the drug sensitivity measurements for the drugs.The names with suffix ".GEX", ".CNV" and ".MUT" indicate features of expression, copy number variation and mutation, respectively.

Fig 7 :
Fig 7: GDSC data application: Estimated residual structure between the seven drugs by the SSUR-MRF model based on features set III with Ĝ thresholded at 0.5.

Fig 8 :
Fig 8: GDSC data application: A Venn diagram for the numbers of identified features for the MAPK inhibitors by SSUR-Ber (panel (a)) and SSUR-MRF (panel (b)) models and overlaps between the models fitted with feature sets I, II, and III.

Fig 9 :
Fig 9: GDSC data application: Estimated network between the MAPK inhibitors and identified target genes based on Ĝ and Γ thresholded at 0.5 by SSUR-MRF corresponding to feature set I, II and III respectively.

Fig 10 :
Fig 10: GDSC data application: Posterior estimates for the cancer tissue random effects for all drugs based on the median probability model.Random effects are centred around zero.Error bars are ± standard deviation of the posterior mean.
proper a w and b w according to this posterior conditional with 5% quantile larger than E[β kj ]+1.96Var[β kj ].Since a w and b w are combined with the factors 1
vi S6: Estimated G for the sensitivity analysis in Section 3.3.

Figure S6 :
Figure S6: Sensitivity analysis Case 2: delete blocks of edges vii S8: MCMC diagnostics by SSUR-Ber and SSUR-MRF in the GDSC data analysis.

xS9:
Estimated residual structure between the seven drugs in the GDSC data analysis.

Figure S9 :
Figure S9: GDSC data application: Estimated residual structure between the seven drugs by SSUR-MRF and SSUR-Ber based on Ĝ thresholded at 0.5.Panels (a)-(c) are estimated by SSUR-MRF, and panels (d)-(f ) are estimated by SSUR-Ber, corresponding to Feature sets I, II and III respectively

Figure S12 :
Figure S12: GDSC data application: Drug responses across cancer tissue types for each drug.Each grey dot corresponds each cancer cell line.Red points "•" correspond to the BCR-ABL mutated blood cell lines treated by drugs Nilotinib and Axitinib.

Table 1
Results for simulation scenario 1: Accuracy of variable selection and prediction performance of models SSUR-hotspot and SSUR-MRF prior

Table 2
Results for simulation scenario 1: Sensitivity analysis of SSUR-MRF with different MRF priors

Table 3
Results for simulation scenario 2: Structure recovery and prediction by SSUR-MRF with and without random effects

Table 4
GDSC data application: Number of identified genomic features corresponding to each drug by the SSUR-Ber and SSUR-MRF models.

Table 5
GDSC data application: Numbers of genomic features selected as predictors for the MAPK inhibitors in the SSUR-Ber and SSUR-MRF models.Note that Feature sets I, II and III include 40, 55, and 81 features corresponding to known target genes of the corresponding drug, respectively.

Table 6
GDSC data application: Prediction performance of the SSUR-Ber and SSUR-MRF models based on Feature set III.

Table S5 :
Performance of variable selection and response predictions by SSUR-MRFwith different hyper-inverse Wishart prior These specifications are the same as the SSUR-MRF model in Table1.

Table S11 :
GDSC data application: Similar prediction performance between the SSUR-Ber and SSUR-MRF models.
S12: Drug responses across cancer tissue types for each drug.