Variational Bayes for high-dimensional proportional hazards models with applications within gene expression

Abstract

Motivation: Few Bayesian methods for analyzing high-dimensional sparse survival data provide scalable variable selection, effect estimation and uncertainty quantification. Such methods often either sacrifice uncertainty quantification by computing maximum a posteriori estimates, or quantify the uncertainty at high (unscalable) computational expense.

Results: We bridge this gap and develop an interpretable and scalable Bayesian proportional hazards model for prediction and variable selection, referred to as sparse variational Bayes. Our method, based on a mean-field variational approximation, overcomes the high computational cost of Markov chain Monte Carlo whilst retaining useful features, providing a posterior distribution for the parameters and offering a natural mechanism for variable selection via posterior inclusion probabilities. The performance of our proposed method is assessed via extensive simulations and compared against other state-of-the-art Bayesian variable selection methods, demonstrating comparable or better performance. Finally, we demonstrate how the proposed method can be used for variable selection on two transcriptomic datasets with censored survival outcomes, and how the uncertainty quantification offered by our method can be used to provide an interpretable assessment of patient risk.

Availability and implementation: Our method has been implemented as a freely available R package, survival.svb (https://github.com/mkomod/survival.svb).

Supplementary information: Supplementary data are available at Bioinformatics online.

Since the variational probability distribution of $\beta_j$ conditional on $z_j = 1$ (i.e. the slab component) is singular with respect to the Dirac measure $\delta_0$, in this case it is sufficient to consider only the continuous part of the prior measure in the denominator of the Radon-Nikodym derivative, that is,
$$\frac{dQ_{\mu_j,\sigma_j \mid z_j = 1}}{d\Pi_j}(\beta_j) = \frac{dN(\mu_j, \sigma_j^2)}{w\, d\mathrm{Lap}(\lambda)}(\beta_j).$$
The last display thus provides an upper bound for the KL divergence and hence a surrogate objective function. Minimising this expression with respect to either $\mu_j$ or $\sigma_j$ gives the same minimisers as minimising the objective functions (8) and (9), respectively, and hence gives our update equations.

A.2 Update equation for $\gamma_j$
In a similar way, the KL divergence between $Q_{\mu,\sigma,\gamma}$ and $\Pi(\cdot \mid D)$, as a function of $\gamma_j$, can be written down explicitly. We now split this expression in terms of the events $z_j = 1$ and $z_j = 0$. Noting that, with $Q_{\mu,\sigma,\gamma}$-probability one, $\beta_j = 0$ if and only if $z_j = 0$, the expression can be rewritten up to a constant $C$ that does not depend on $\gamma_j$. Upper bounding the expectation in this expression using Jensen's inequality and substituting back, we can upper bound the KL divergence as a function of $\gamma_j$, again up to a constant not depending on $\gamma_j$, where we have used $\bar{w} = a_0/(a_0 + b_0)$. Setting the derivative of the resulting bound with respect to $\gamma_j$ equal to zero and rearranging gives the update equation (10) for $\gamma_j$.

B Goodness of fit measures
A commonly used measure of goodness of fit in the VI literature is the evidence lower bound (ELBO), which lower-bounds the Bayesian marginal likelihood and is defined, for $Q \in \mathcal{Q}$, as the expected log-likelihood under $Q$ minus the KL divergence between $Q$ and the prior [Bishop, 2006]. Intuitively, the ELBO trades off how well the model fits the data (first term) against how close the variational distribution is to the prior (second term). Picking the parameter that maximizes the ELBO is the variational analogue of empirical Bayes, which selects parameters by maximising the marginal likelihood. Since VI is used precisely when the marginal likelihood is intractable, it is natural to use the ELBO, which also measures the quality of the variational approximation [Bishop, 2006].
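For reference, the ELBO described above takes the standard textbook form, written here with the partial likelihood $L_p$ used throughout (this restates the general identity rather than quoting the main text):

```latex
\mathrm{ELBO}(Q)
\;=\; \mathbb{E}_{Q}\big[\log L_p(D;\beta)\big] \;-\; \mathrm{KL}\big(Q \,\|\, \Pi\big)
\;\le\; \log \int L_p(D;\beta)\, d\Pi(\beta),
```

with equality if and only if $Q$ coincides with the posterior, which is why a larger ELBO indicates a better variational approximation.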
Similar to our derivation of the coordinate-ascent update expression for γ j , we exploit the product structure of our variational distribution to evaluate an expression for the ELBO.
As we cannot evaluate a closed-form expression for $E_{\mu,\sigma,\gamma}[l_p(D;\beta)]$, we use Monte Carlo integration to estimate this quantity, i.e. we compute
$$\frac{1}{N}\sum_{i=1}^{N} l_p(D;\beta^{(i)}),$$
sampling $\beta_j^{(i)} \sim N(\mu_j, \sigma_j^2)$ with probability $\gamma_j$ or taking $\beta_j^{(i)} = 0$ with probability $1 - \gamma_j$, for $j = 1, \ldots, p$. The remaining term, $\mathrm{KL}(Q_{\mu,\sigma,\gamma} \| \Pi)$, can be evaluated explicitly.

Although popular, model selection based on the ELBO is not theoretically justified and can be sensitive to the setting; we therefore consider additional goodness of fit measures. One such measure, proposed by Nott et al. (2012), uses the variational posterior $\widehat{\Pi}$ to approximate the log-predictive density score (LPDS) on a held-out test set $D_{\mathrm{test}}$, where $D_{\mathrm{train}}$ is the training set used to compute $\widehat{\Pi}$ [Nott et al., 2012]. By Jensen's inequality, the expected log-likelihood bounds the LPDS. Hence, given that computing the ELBO for a validation set involves computing $E_{\widehat{\Pi}}[\log L_p(D_{\mathrm{test}};\beta)]$, we can cheaply obtain a bound for the LPDS to use as a goodness of fit measure.
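This Monte Carlo step can be sketched in a few lines. A minimal Python illustration (the package itself is in R; `log_plik`, standing in for the log partial likelihood $l_p(D;\cdot)$, and the argument names are ours):

```python
import numpy as np

def sample_variational_beta(mu, sigma, gamma, rng):
    """One draw from the factorised spike-and-slab variational family:
    beta_j ~ N(mu_j, sigma_j^2) with probability gamma_j, else beta_j = 0."""
    slab = rng.normal(mu, sigma)
    keep = rng.random(len(mu)) < gamma
    return np.where(keep, slab, 0.0)

def mc_expected_log_lik(log_plik, mu, sigma, gamma, n_samples=200, seed=0):
    """Monte Carlo estimate of E_Q[l_p(D; beta)] over n_samples draws."""
    rng = np.random.default_rng(seed)
    draws = [log_plik(sample_variational_beta(mu, sigma, gamma, rng))
             for _ in range(n_samples)]
    return float(np.mean(draws))
```

As a sanity check, with $\sigma_j = 0$ and $\gamma_j = 1$ every draw equals $\mu$, so the estimate reduces exactly to $l_p(D;\mu)$.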
A further goodness of fit measure specific to survival models is the concordance index (c-index), defined as
$$k = P(\eta_j > \eta_i \mid t_j < t_i),$$
where $\eta_k = \beta_0^\top x_k$ is referred to as the prognostic index. Intuitively, for two observations $(t_i, \delta_i = 1, x_i)$ and $(t_j, \delta_j = 1, x_j)$, when $\eta_j > \eta_i$ we would expect $t_i > t_j$. This remark follows from the form of the hazard function under the PHM, where we assume $h(t \mid x) = h_0(t)\exp(\beta_0^\top x)$. Therefore, when $\eta_j > \eta_i$ the hazard rate $h(t_j) > h(t_i)$ and thus we would expect $t_j$ to have failed before $t_i$. Often the c-index is estimated by
$$\widehat{k} = \frac{\sum_{i \ne j} \delta_j\, I(t_j < t_i)\, I(\widehat{\eta}_j > \widehat{\eta}_i)}{\sum_{i \ne j} \delta_j\, I(t_j < t_i)},$$
where the prognostic index is estimated as $\widehat{\eta}_j = \widehat{\beta}^\top x_j$ for a given point estimate $\widehat{\beta}$ of $\beta_0$, and $I(\cdot)$ is the indicator function [Harrell et al., 1982]. Notably, the c-index is not robust to censoring and tends to overestimate $k$ when there is a high degree of censoring [Gonen and Heller, 2005].
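Harrell's estimator can be computed with a direct double loop over pairs. A minimal Python sketch (ties in the prognostic index are counted as 1/2, a common convention not specified in the text):

```python
def harrell_c_index(times, events, eta):
    """Harrell's c-index: over pairs where the earlier time is an observed
    event, the proportion in which the earlier failure has the larger
    prognostic index. Ties in eta contribute 1/2."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # pair (i, j) is comparable if subject i's event is observed
            # and occurs before subject j's time
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if eta[i] > eta[j]:
                    concordant += 1.0
                elif eta[i] == eta[j]:
                    concordant += 0.5
    return concordant / comparable
```

A perfectly anti-ranked prognostic index gives a c-index of 0, a perfectly ranked one gives 1, and an uninformative one is near 0.5.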

C.1 Markov chain Monte Carlo sampler
To construct our sampler we note that the model given by (5) can be reformulated such that the prior is given by the hierarchy
$$w_j \overset{iid}{\sim} \mathrm{Beta}(a_0, b_0), \qquad z_j \mid w_j \overset{ind}{\sim} \mathrm{Bernoulli}(w_j), \qquad \beta_j \mid z_j \overset{ind}{\sim} z_j\,\mathrm{Lap}(\lambda) + (1 - z_j)\,\delta_0.$$
Algorithm 1 details our Gibbs sampler for the posterior in (5) based on the above formulation.
Notably, we introduce the notation $x_{k:l}$ for the subvector $(x_k, \ldots, x_l)$.

Algorithm 1: Spike-and-slab MCMC sampler. Require: $K$, the Metropolis-Hastings proposal kernel; $N$, the number of samples.
Ignoring the superscript for clarity, the distribution of $w_j \mid D, \beta, z, w_{-j}$ is conditionally independent of $D$, $\beta$, $z_{-j}$ and $w_{-j}$, i.e. it depends only on $z_j$. Therefore, by conjugacy, $w_j \mid z_j \sim \mathrm{Beta}(a_0 + z_j,\, b_0 + 1 - z_j)$. As $z_j$ is discrete, evaluating the RHS of (11) for $z_j = 0$ and $z_j = 1$ gives the unnormalised conditional probabilities; summing these gives the normalisation constant, and thus we can sample $z_j$ from a Bernoulli distribution with the resulting parameter. Finally, to sample $\beta$ we use a Metropolis-Hastings within Gibbs step, wherein a proposal is drawn from a random-walk proposal kernel $K$. The proposal is then accepted with probability $A$, taking the usual Metropolis-Hastings form, or rejected with probability $1 - A$, in which case the current value of $\beta$ is retained. The implementation of the sampler is available as an R package that can be installed from https:
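The first two conditional updates can be sketched as follows. A minimal Python illustration (the sampler itself is distributed as an R package; the Beta update assumes the per-coordinate Beta-Bernoulli hierarchy above, and the Metropolis-Hastings step for $\beta$ is omitted):

```python
import numpy as np

def sample_wj(zj, a0, b0, rng):
    """Conjugate update w_j | z_j ~ Beta(a0 + z_j, b0 + 1 - z_j), assuming
    w_j ~ Beta(a0, b0) and z_j | w_j ~ Bernoulli(w_j) a priori."""
    return rng.beta(a0 + zj, b0 + 1 - zj)

def sample_zj(log_p0, log_p1, rng):
    """Sample z_j from its Bernoulli full conditional, given the unnormalised
    log-probabilities of z_j = 0 and z_j = 1; returns (draw, P(z_j = 1))."""
    m = max(log_p0, log_p1)          # subtract the max before exponentiating
    p1 = np.exp(log_p1 - m) / (np.exp(log_p0 - m) + np.exp(log_p1 - m))
    return int(rng.random() < p1), p1
```

Working with unnormalised log-probabilities and subtracting their maximum avoids overflow when the two conditional likelihood terms differ by many orders of magnitude.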

D.1 Sensitivity to starting values
To examine the sensitivity with respect to the initialization of µ, σ and γ, we generate data as described in Section 3.1, taking (n, p, s, c) = (200, 1000, 10, 0.25). We consider the following initializations of µ:
• Random: where µ is initialized with random values.
• Ridge: where µ is the MLE under the ridge penalty.
• Elastic Net: where µ is the MLE under the elastic net penalty (equal mixture of ridge and LASSO penalties).
• LASSO: where µ is the MLE under the LASSO penalty.
In the last three cases a regularization hyperparameter of 100λ min is used, where λ min is the smallest hyperparameter value for which all estimated coefficients are equal to zero. In order to compute the MLE under the different penalties we use glmnet. To evaluate the different initialization methods we compute the ℓ 2 -error, ℓ 1 -error, TPR, FDR, AUC and runtime, presenting the median, lower (5%) and upper (95%) quantiles across 100 runs in Table 1.
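As an aside, the ridge case of this initialisation has a closed form in the linear-model setting. The sketch below is purely illustrative of the initialisation idea (the paper fits the penalised Cox model with glmnet, not a linear model); `reg` is a hypothetical regularisation strength playing the role of 100λ min:

```python
import numpy as np

def ridge_init(X, y, reg):
    """Closed-form ridge solution mu = (X'X + reg * I)^{-1} X'y, a stand-in
    for the penalised estimate used to initialise mu. A linear model is used
    here only to illustrate the effect of the penalty on starting values."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(p), X.T @ y)
```

As `reg` grows the initial µ shrinks towards zero, mimicking the heavy regularisation used above.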
Examining Table 1, we notice that within setting 1 the initialization method does not impact the performance of the method, with all methods performing equally. It is worth noting the runtime for random initialization is the largest, meaning the algorithm takes longer to converge given a poor initialization. Within settings 2-4, the initialization method can have a substantial effect on performance. For instance, within setting 3, initialization with the ridge penalty yields a median ℓ 2 -error of 1.019, whereas initialization with the LASSO penalty gives a median ℓ 2 -error of 0.394. Furthermore, poor initialization affects the variable selection of the method (Table 1). Overall, initialization of µ using the LASSO gave the best models, obtaining the best metrics across the different settings.
In some cases, however, initialization with the LASSO can be slower than other methods, e.g. the elastic net, which produced comparable models. To examine the sensitivity to the initial values of σ and γ, we compute and present the mean ℓ 2 -error, ℓ 1 -error, TPR, FDR, AUC and runtime (in seconds) across 100 runs.
Within the simplest setting we notice the method is not sensitive to the starting values, obtaining an ℓ 2 -error of 0.41, an ℓ 1 -error of 1.09, and the ideal values for the TPR, FDR and AUC across the different values of σ and γ (Figure 1). More interestingly, the method can be sensitive to the initial values of σ and γ in more complicated settings (settings 2-4). Specifically, when the γ j are small the performance is worse than when they are larger (Figures 2-4); for instance, a value of γ j = 0.01, j = 1, . . . , p, gives worse performance across all metrics than a value of γ j = 0.5. Furthermore, within setting 2, we notice the performance is sensitive to both the values of σ and γ, with the optimum across all metrics except the FDR given by σ j = 0.1 and γ j = 0.5 (Figure 2). Finally, for settings 3 and 4 the method is not particularly sensitive to the value of σ, however it can be sensitive to the value of γ. Therefore choosing a starting value of γ of at least 0.5 is appropriate within these settings. It is worth pointing out that in some cases overflow issues were encountered when the values of σ j and γ j were too large, for instance when γ j = 0.75 and σ j = 1.

Overall, the ELBO seems to be the most appropriate goodness of fit measure, as it is maximized for the model parameter λ that yields the lowest ℓ 2 -error and ℓ 1 -error and the largest TPR. Within these settings the ELL and c-index do not identify the model with the best performance metrics and should therefore be used cautiously.

D.2.2 Sensitivity to a 0
To examine the sensitivity with respect to a 0 , we fix λ = 1 and consider the grid of values A 0 = {1, 2, 5, 10, 20, 50, 100, 500}. For each a 0 ∈ A 0 we compute the VB posterior and report the respective performance metrics and goodness of fit measures (averaged across 100 replications) in Figure 6.
Generally, the proposed method is not particularly sensitive to the value of a 0 , performing equivalently for values between 1 and 100 (Figures 6(a)-(e)). However, for large values of a 0 the performance decreases: a large a 0 corresponds to an a priori belief that many coefficients are non-zero, so the resulting models contain many non-zero coefficients. This behaviour is observed in our results through the increased ℓ 1 -error, ℓ 2 -error and FDR. Finally, as a 0 increases the runtime increases (Figure 6(f)).
Regarding goodness of fit measures, the ELBO is maximized for models with correspondingly good performance metrics. Further, the ELL and c-index increase as a 0 increases, meaning models with many non-zero parameters are favoured. These performance metrics should therefore be used cautiously when tuning a 0 .

D.2.3 Sensitivity to λ and a 0
Finally, we consider the sensitivity to both parameters jointly. To do so, we fit the VB posterior over a grid of combinations of λ and a 0 , and find that the performance metrics are not consistently optimal for a single combination of λ and a 0 . For example, in setting 3 the optimal ℓ 2 -error is obtained when (λ, a 0 ) = (2, 100), whereas the optimal AUC is obtained when (λ, a 0 ) = (4, 100), meaning in practice a trade-off may need to be made between better point estimates and better variable selection.
As before, the ELBO seems to be the most appropriate goodness of fit measure for most settings.
However, in highly correlated settings, where there is no single "best" pair of parameter values, the optimal ELBO corresponds to the model with the smallest ℓ 1 -error [Figures 7(g) and 10(g)]. This means other measures are more appropriate if the practitioner wishes to optimise other performance metrics, e.g. the TPR. Finally, the ELL and c-index favour models with many non-zero coefficients and should therefore be used cautiously, e.g. only for tuning λ when a 0 is fixed.

E.2 Ovarian cancer dataset
We examine the low- and high-risk groups constructed using the prognostic index for the ovarian cancer dataset. To construct the groups, we selected the model fit with λ = 0.5 for the first fold, which had a validation c-index of 0.57, the lowest value obtained across the different values of λ and folds. We chose this value of λ and fold to demonstrate how a comparison of groups can highlight outliers and indicate when splitting based on the prognostic index may not be appropriate.
As before, to construct the groups, the validation set was split based on the median value of the prognostic index computed on the training set. The pairwise posterior probability that a patient within the high-risk group is at greater risk than a patient within the low-risk group is presented in Figure 11. Immediately we notice two patients, one within each group, that may be outliers: patient 2 within the high-risk group and patient 20 within the low-risk group. In the case of patient 2, examining the row from left to right we notice the posterior probability decreases from around 0.8 to 0.4, meaning that, at the furthest end, the highest-risk patients within the low-risk group are at greater risk than patient 2. Regarding patient 20 from the low-risk group, we notice the posterior probability across the column is lower in comparison to the other patients within the group. Furthermore, the mean posterior probability across the column is 0.73, meaning this may be a borderline case between the high- and low-risk groups.
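These pairwise probabilities are straightforward to estimate from posterior draws of the prognostic index. A minimal Python sketch (illustrative; `eta_high` and `eta_low` are hypothetical arrays of posterior samples, one column per patient in each group):

```python
import numpy as np

def pairwise_risk_prob(eta_high, eta_low):
    """Estimate P(eta_i > eta_j) for each high-risk patient i (rows) against
    each low-risk patient j (columns), averaging over posterior samples.
    eta_high: (S, n_high), eta_low: (S, n_low) -> output: (n_high, n_low)."""
    return (eta_high[:, :, None] > eta_low[:, None, :]).mean(axis=0)
```

A row whose probabilities fall towards 0.5, as for patient 2 above, flags a potential outlier in the high-risk group.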

E.3 Breast cancer dataset
Figure 11: Comparison of validation risk groups for the TCGA ovarian cancer dataset. Rows correspond to patients in the high-risk group and columns to patients in the low-risk group. Both rows and columns are sorted from lowest to highest risk.

Convergence diagnostics for the model fit to the breast cancer dataset are presented in Figure 12. As with the TCGA data, we notice the ELBO is increasing as we iterate the algorithm, suggesting the model fit is improving. Furthermore, we notice that KL(Q∥Π) is decreasing, suggesting that sparsity is being induced. Coupled with the fact that the validation expected log-likelihood is increasing, this suggests fewer spurious variables are being selected and the model is fitting better to the unseen validation set.

E.4 Model fits
Model fits for different values of λ for the two datasets are presented in Table 2 and Table 3. Reported are the mean and standard deviation across 10 training and validation folds of the ELBO, ELL, KL, c-index ($\widehat{k}$) and the number of parameters for which γ j > 0.5. Examining the different metrics, we notice the model is not particularly sensitive to the value of λ.

Figure 12: Breast cancer dataset convergence diagnostics for the model fit with λ = 2.5. Presented are the ELBO, ELL and KL for the training and validation sets. Each fold is presented in solid grey, and the mean over the 10 folds in dashed black. As with the TCGA data, we notice the model fit improves as we iterate, fitting better to the unseen validation set.