BITES: balanced individual treatment effect for survival data

Abstract

Motivation: Estimating the effects of interventions on patient outcome is one of the key aspects of personalized medicine. Their inference is often challenged by the fact that the training data comprises only the outcome for the administered treatment, and not for alternative treatments (the so-called counterfactual outcomes). Several methods were suggested for this scenario based on observational data, i.e. data where the intervention was not applied randomly, for both continuous and binary outcome variables. However, patient outcome is often recorded in terms of time-to-event data, comprising right-censored event times if an event does not occur within the observation period. Despite their enormous importance, time-to-event data are rarely used for treatment optimization. We suggest an approach named BITES (Balanced Individual Treatment Effect for Survival data), which combines a treatment-specific semi-parametric Cox loss with a treatment-balanced deep neural network; i.e. we regularize differences between treated and non-treated patients using Integral Probability Metrics (IPM).

Results: We show in simulation studies that this approach outperforms the state of the art. Furthermore, we demonstrate in an application to a cohort of breast cancer patients that hormone treatment can be optimized based on six routine parameters. We successfully validated this finding in an independent cohort.

Availability and implementation: We provide BITES as an easy-to-use python implementation including scheduled hyper-parameter optimization (https://github.com/sschrod/BITES). The data underlying this article are available in the CRAN repository at https://rdrr.io/cran/survival/man/gbsg.html and https://rdrr.io/cran/survival/man/rotterdam.html.

Supplementary information: Supplementary data are available at Bioinformatics online.

Introduction
Inferring the effect of interventions on outcomes is relevant in diverse domains, comprising precision medicine and epidemiology (Frieden, 2017) or marketing (Kohavi et al., 2009; Bottou et al., 2013). A fundamental issue of causal reasoning is that potential outcomes are observed only for the applied intervention but not for its alternatives (the counterfactuals). For instance, in medicine, only the factual treatment outcome is observed; the counterfactual outcomes remain hidden.
Estimates of Average Treatment Effects (ATE) do not necessarily hold on the level of individual patients, and the Individual Treatment Effect (ITE) has to be inferred from data (Holland, 1986). Solving the latter "missing data problem" was attempted repeatedly in the literature using machine learning methods in combination with counterfactual reasoning. There are two naive approaches to this issue: the treatment can be included as a covariate, or it can be used to stratify the model development, i.e. individual treatment-specific models are learned (also called a T-learner). Potential outcomes can then be estimated by changing the respective treatment covariate or model. These naive approaches are occasionally discussed in performance comparisons, e.g. in Chapfuwa et al. (2020) and Curth et al. (2021). An alternative approach is to match similar patients between treated and non-treated populations using, e.g., propensity scores (Rosenbaum and Rubin, 1983). This directly provides estimates of counterfactual outcomes. However, a central issue in this context is to define appropriate similarity measures, which should ideally also be valid in a high-dimensional variable space (King and Nielsen, 2019). Further alternatives are Causal Forests (Athey et al., 2016; Wager and Athey, 2017; Athey and Wager, 2019) or deep architectures such as the Treatment-Agnostic Representation Network (TARNet) (Johansson et al., 2016; Shalit et al., 2016). Neither method accounts for treatment selection biases, and thus both will be biased towards treatment-specific distributions. This issue was recently approached by several groups, which balanced the treated and non-treated distributions using model regularization via representations of Integral Probability Metrics (IPM) (Müller, 1991). Suggested methods are, e.g., balanced propensity score matching (Diamond and Sekhon, 2013; Li and Fu, 2017), deep implementations such as the Counterfactual Regression Network (CFRNet) (Johansson et al., 2016; Shalit et al., 2016)
or the autoencoder-based Deep-Treat (Atan et al., 2018). Recently, balancing was incorporated in a Generative Adversarial Net for inference of Individualized Treatment Effects (GANITE) (Yoon et al., 2018). Note that learning balanced representations involves a trade-off between predictive power and bias, since biased information can also be highly predictive.
All aforementioned approaches deal with continuous or binary response variables. In medicine, however, patient outcome is often recorded as time-to-event data, i.e. the time until an event occurs. The patient is (right-)censored at the last known follow-up if the event was not observed within the observation period. A plethora of statistical approaches deal with the analysis of time-to-event data (Martinussen and Scheike, 2006), of which one of the most popular is Cox's Proportional Hazards (PH) model (Cox, 1972). The Cox PH model is a semi-parametric approach for time-to-event data, which models the influence of variables on the baseline hazard. Here, the PH assumption implies an equal baseline hazard for all observations. In fact, the influence of variables can be estimated without any consideration of the baseline hazard function (Cox, 1972; Breslow, 1972). The Cox PH model is also highly relevant in the context of machine learning. It was adapted to the high-dimensional setting using l1 and l2 regularization terms by Tibshirani (1997), with applications ranging from the prediction of adverse events in patients with chronic kidney disease (Zacharias et al., 2021) to risk prediction in cancer entities (Jachimowicz et al., 2021; Staiger et al., 2020). The Cox PH model can also be adapted to deep learning architectures, as proposed by Katzman et al. (2018). Alternative machine-learning approaches to model time-to-event data include discrete-time Cox models built on multi-outcome feedforward architectures (Lee et al., 2018; Gensheimer and Narasimhan, 2019; Kvamme and Borgan, 2019) and random survival forests (RSF) (Ishwaran et al., 2008; Athey and Wager, 2019).
The prediction of ITEs from time-to-event data has received little attention in the machine learning community, which is surprising considering the enormous practical relevance of the topic. Seminal works are Chapfuwa et al. (2020) and Curth et al. (2021). Most recently, Curth et al. (2021) suggested learning discrete-time treatment-specific conditional hazard functions, which were estimated using a deep learning approach. Treatment and control distributions were balanced analogously to Shalit et al. (2016) using the p-Wasserstein distance (Kantorovitch, 1958; Ramdas et al., 2017). This approach, named SurvITE, was shown to outperform the current state of the art in simulation studies.
We propose to combine the loss of the Cox PH model with an IPM-regularized deep neural network architecture to balance the generating distributions of treated and non-treated patients. We named this approach "Balanced Individual Treatment Effect for Survival data" (BITES). We show that this approach, despite its apparent simplicity, outcompetes SurvITE as well as alternative state-of-the-art methods. First, we demonstrate the superior performance of BITES in simulation studies where we focus on biased treatment assignments and small sample sizes. Second, we used training data from the Rotterdam Tumour Bank (Foekens et al., 2000) to show that BITES can optimize hormone treatment in patients with breast cancer. We validated the latter model using data from a controlled randomized trial of the German Breast Cancer Study Group (GBSG) (Schumacher et al., 1994) and analyzed feature importance using SHAP (SHapley Additive exPlanations) values (Lundberg and Lee, 2017). We further provide an easy-to-use python implementation of BITES including scheduled hyper-parameter optimization.

Methods
Patient outcome can be recorded as (right-)censored time-to-event data. First, we will introduce models for such data, i.e. the Cox proportional hazards model and recent non-linear adaptations. Second, we will discuss the potential outcome model and how it can be used to model survival. Third, we introduce regularization techniques to account for unbalanced distributions and, finally, we will combine these methods in a deep neural network approach termed BITES to learn treatment recommender systems based on patient survival.

Survival Data
Let X be the space of covariates and T the space of available treatments. Further, let y ∈ Y be the observed survival times and E ∈ E = {0, 1} the corresponding event indicator. Denote the sample data of patient i by the triplet (x_i, y_i, E_i) ∈ X × Y × E. If the patient experiences the event within the observation period, y_i^{E=1} is the time until the event of interest occurs; otherwise, y_i^{E=0} is the censoring time. Let the survival times y be distributed according to f(y) with the corresponding cumulated event distribution F(y) = ∫_0^y f(y') dy'. The survival probability at time y is then given by S(y) = 1 − F(y). The hazard function

λ(y) = f(y) / S(y) = f(y) / (1 − F(y))

corresponds to the risk of dying at time y (Cox, 1972), i.e. a greater hazard corresponds to a greater risk of failure.
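These relations between density, cumulative distribution, survival and hazard can be checked numerically. A minimal sketch, assuming an exponential survival model purely for illustration (the constant rate `lam` is not part of the paper's setup):

```python
import numpy as np

# Illustrative example: exponential survival with constant hazard lam.
# f(y) = lam * exp(-lam * y),  F(y) = 1 - exp(-lam * y),  S(y) = 1 - F(y).
lam = 0.3
y = np.linspace(0.0, 10.0, 1001)

f = lam * np.exp(-lam * y)      # event density f(y)
F = 1.0 - np.exp(-lam * y)      # cumulated event distribution F(y)
S = 1.0 - F                     # survival probability S(y)

# The hazard f(y) / S(y) recovers the constant rate lam everywhere,
# which is the defining property of the exponential model.
hazard = f / S
print(np.allclose(hazard, lam))
```

For any other survival model the same identity λ(y) = f(y)/S(y) holds pointwise; only the exponential case yields a constant hazard.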
The Cox proportional hazards (PH) model assumes the hazard

λ(y; x) = λ_0(y) exp(β^T x).

Here, the model parameters are given by β and the baseline hazard function is λ_0(y) = λ(y; x = 0). Note that λ(y) = f(y)/(1 − F(y)) = −(d/dy) log(S(y)). According to Cox's PH assumption, all patients share the same baseline hazard function and, importantly, the baseline hazard cancels in maximum likelihood estimates of β. Thus, time dependence can be eliminated from the individual hazard prediction, and rather than learning the exact time to event, Cox regression learns an ordering of hazard rates. At every event time y_i^{E=1}, the set of patients at risk is given by R(y_i) = {j : y_j ≥ y_i}. The partial log-likelihood of the Cox model (Cox, 1972; Breslow, 1972) is given by

log L(β) = Σ_{i : E_i = 1} [ β^T x_i − log Σ_{j ∈ R(y_i)} exp(β^T x_j) ].   (1)

Faraggi and Simon (1995) suggested replacing the ordinary linear predictor function β^T x by a feedforward neural network with a single outcome node h_θ(x) and network parameters θ. Following this idea, Katzman et al. (2018) introduced DeepSurv, which showed improved performance compared to the linear case, particularly if non-linear covariate dependencies are present.
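The negative partial log-likelihood of Equation 1 can be sketched in a few lines of numpy. This is an illustrative re-implementation, not the paper's code; sorting by descending survival time turns the risk-set sum into a cumulative log-sum-exp:

```python
import numpy as np

def neg_cox_partial_loglik(h, y, e):
    """Negative Cox partial log-likelihood for arbitrary scores h.

    h: predicted log-hazard scores (beta^T x, or a network output h_theta(x))
    y: observed times; e: event indicators (1 = event, 0 = censored).
    Ties in y are handled naively.
    """
    order = np.argsort(-y)                 # descending time
    h, e = h[order], e[order]
    # log of the risk-set sum: patients j with y_j >= y_i are a prefix
    log_risk = np.logaddexp.accumulate(h)  # log sum_{j in R(y_i)} exp(h_j)
    return -np.sum((h - log_risk)[e == 1])

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
beta = np.array([1.0, -0.5, 0.0])
# survival times with hazard proportional to exp(beta^T x)
y = rng.exponential(scale=np.exp(-x @ beta))
e = np.ones(200)

# the true coefficients should score better than a mismatched guess
print(neg_cox_partial_loglik(x @ beta, y, e) < neg_cox_partial_loglik(x @ -beta, y, e))
```

Replacing `x @ beta` with the output of a neural network gives exactly the DeepSurv-style loss used below.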

The counterfactual problem
The outcome space for k multiple treatment options is given by Y = Y_0 × ... × Y_{k−1}. For simplicity, we will restrict the discussion to the binary case, k = 2, with a treated group, T = 1, and a control group, T = 0.
We consider the problem where only a single factual outcome is observed per patient, i.e. the outcomes for all other interventions, also known as the counterfactuals, are missing. Hence, the individual treatment effect (ITE), defined as

τ(x_i) = E[Y_1 − Y_0 | x = x_i],

can only be inferred based on potential outcome estimates (Rubin, 1974). We will build a recommendation model that assigns treatments to patients with predictions τ(x_i) > 0.
Following recent work (Johansson et al., 2016; 2020; Shalit et al., 2016; Alaa and van der Schaar, 2017; Athey and Wager, 2019; Wager and Athey, 2017; Yoon et al., 2018; Yao et al., 2018), we make the standard strong ignorability assumption, which has been shown to be a sufficient condition to make the ITE identifiable (Shalit et al., 2016; Pearl, 2017), i.e. it guarantees proper causal dependencies on the interventions. The strong ignorability assumption comprises the unconfoundedness and overlap assumptions:

Assumption 1 (Unconfoundedness) Covariates X do not simultaneously influence the treatment T and the potential outcomes, i.e. (Y_0, Y_1) ⊥ T | X.

This assumption ensures that the causal effect is not influenced by non-observable causal substructures such as confounding (Pearl, 2009). Correcting for confounding bias requires structural causal models, which are ambiguous in general and need to be justified based on domain knowledge (Pearl, 2008).
Assumption 2 (Overlap) There is a non-zero probability for each patient i to receive each of the treatments T ∈ T, i.e. 0 < p(T = 1 | x = x_i) < 1.

Balancing distributions
Strong ignorability only removes confounding artifacts. Imbalances of the generating distributions due to biased treatment administration might still be present. Balancing the generating distributions of the treated and control group has been shown to be effective both on the covariate space (Imai and Ratkovic, 2014) and on latent representations (Johansson et al., 2016; Shalit et al., 2016; Li and Fu, 2017; Yao et al., 2018; Huang et al., 2016; Johansson et al., 2020; Lu et al., 2020; D'Amour et al., 2017). This is achieved either by multi-task models or by IPMs. The latter quantify the difference of probability measures P and Q defined on a measurable space S by finding a function f ∈ F that maximizes (Müller, 1991)

IPM_F(P, Q) = sup_{f ∈ F} | E_{x∼P}[f(x)] − E_{x∼Q}[f(x)] |.

Most commonly used are the Maximum Mean Discrepancy (MMD), restricting the function space to reproducing kernel Hilbert spaces (Gretton et al., 2012), and the p-Wasserstein distance (Ramdas et al., 2017). Both have appealing properties and can be empirically estimated (Sriperumbudur et al., 2012). MMD has low sample complexity with a fast rate of convergence, which comes with low computational costs. A potential issue is that gradients vanish for overlapping means (Feydy et al., 2018). The p-Wasserstein distance, on the other hand, offers more stable gradients even for overlapping means, which comes with higher computational costs, i.e. solving a linear program. The computational burden can be reduced by entropically smoothing the latter and by using the Sinkhorn divergence, where W_p(P, Q) is the smoothed Optimal Transport (OT) loss defined in the following (Ramdas et al., 2017; Feydy et al., 2018).

Definition 2.1 (Smoothed Optimal Transport loss) For p ∈ [1, ∞), ε ≥ 0, and Borel probability measures P, Q on R^d, the entropically smoothed OT loss is

W_p(P, Q) = min_{π ∈ Γ(P,Q)} E_{(x,y)∼π}[ ||x − y||^p ] + ε KL(π || P ⊗ Q),   (2)

with Γ(P, Q) the set of all joint probability measures on R^d × R^d whose marginals are P and Q. Here, ε mediates the strength of the Kullback-Leibler divergence.
The Sinkhorn divergence can be efficiently calculated for ε > 0 (Cuturi, 2013). For p = 2 and ε = 0 we retrieve the quadratic Wasserstein distance, and in the limit ε → +∞ it becomes the MMD (Genevay et al., 2017). BITES tunes ε to take advantage of the more stable OT gradients to improve the overlap while remaining computationally efficient. In the following, we denote it by IPM_p(·, ·) to highlight the possibility to use any representation of the IPM. A thorough discussion of Sinkhorn divergences with 1- and 2-dimensional examples can be found in Feydy et al. (2018).
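The entropically smoothed OT loss can be approximated with plain Sinkhorn fixed-point iterations. The following is an illustrative numpy sketch, not the GeomLoss implementation used by BITES; p = 2, uniform sample weights, and a fixed iteration count are assumptions made for brevity:

```python
import numpy as np

def sinkhorn_ot(X, Y, eps=1.0, n_iter=200):
    """Entropically smoothed OT cost between two empirical samples (p = 2).

    X, Y: samples of shape (n, d) and (m, d) with uniform weights.
    Returns the transport cost <pi, C> under the (approximately)
    optimal entropic transport plan pi.
    """
    n, m = len(X), len(Y)
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # squared-Euclidean cost
    K = np.exp(-C / eps)                                # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):                             # Sinkhorn fixed point
        u = a / (K @ v)
        v = b / (K.T @ u)
    pi = u[:, None] * K * v[None, :]                    # transport plan
    return float((pi * C).sum())

rng = np.random.default_rng(0)
P = rng.normal(0.0, 1.0, size=(100, 2))
Q_near = rng.normal(0.2, 1.0, size=(100, 2))
Q_far = rng.normal(3.0, 1.0, size=(100, 2))

# a strongly shifted distribution costs more to transport onto P
print(sinkhorn_ot(P, Q_near) < sinkhorn_ot(P, Q_far))
```

Production implementations (e.g. GeomLoss) work in the log domain for numerical stability and add the debiasing terms that turn this cost into a proper Sinkhorn divergence; the sketch above only illustrates the mechanics.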

Treatment recommender systems
For comparison, we evaluated several strategies to build treatment recommender systems.

Cox regression model
We implemented the Cox regression as a T-learner with treatment-specific survival models using lifelines (Davidson-Pilon et al., 2021). Note that an ordinary Cox regression model which uses both the covariates X and the treatment variable T as predictors generally recommends the treatment with the better ATE; a treatment-specific term adds to β^T x, and thus the treatment which reduces the hazard most will always be recommended. Therefore, we did not include the latter approach and focus on the Cox T-learner in our analysis.
DeepSurv Katzman et al. (2018) suggested providing individual recommendations based on the predictions of a single model using T and X as covariates, i.e. based on τ_DS(x) = h_θ(x, T = 1) − h_θ(x, T = 0). Hence, it uses a treatment-independent baseline hazard, which could compromise the performance (Bellera et al., 2010; Xue et al., 2013).
Treatment-specific DeepSurv models To account for treatment-specific differences of the baseline hazard functions, we also estimated DeepSurv as a T-learner (T-DeepSurv), i.e. we learned models stratified for treatments. We then evaluated the time-dependent individual treatment effect based on the treatment-specific survival predictions.

Treatment-specific Random Survival Forests Analogously to the previous approach, we learned treatment-specific Random Survival Forests (RSF) (Ishwaran et al., 2008; Athey et al., 2016) using the implementation of scikit-survival (Pölsterl, 2020) to estimate the time-dependent ITE.
SurvITE Curth et al. (2021) suggested learning discrete-time treatment-specific conditional hazard functions, which were estimated using an individual outcome head for each time interval. We evaluated the time-dependent ITE to assign treatments, as for the previous two methods.

BITES
BITES model architecture BITES combines survival modeling with counterfactual reasoning, i.e. it facilitates the development of treatment recommender systems using time-to-event data. BITES uses the network architecture shown in Figure 1 with loss function

L = q L^{T=0}_Cox + (1 − q) L^{T=1}_Cox + α IPM_p(Φ(x)|_{T=0}, Φ(x)|_{T=1}),

where q is the fraction of patients in the control cohort (patients with T = 0) and L^T_Cox is given by the negative Cox partial log-likelihood of Equation 1, where we parametrize the hazard function h_T(Φ(x)) according to the network architecture illustrated in Figure 1. The latent representation Φ is regularized by an IPM term to reduce differences between the treatment and control distributions of non-confounding variables. Throughout the article, we used the Sinkhorn divergence of the smoothed OT loss with p = 2 as IPM term. Hence, the parameter ε in Equation 2 calibrates between the quadratic-Wasserstein distance (ε = 0) and the MMD (ε = ∞). The total strength of the IPM regularization is adjusted by α. Models with α = 0 do not balance treatment effects, and we therefore denote this method as "Individual Treatment Effects for Survival" (ITES). Models with α > 0 will be denoted as "Balanced Individual Treatment Effects for Survival" (BITES). (B)ITES uses the time-dependent ITE for treatment decisions. For the studies shown in this article, we assigned treatments based on the ITE evaluated at a survival probability of 50%, i.e. τ(x_i) = S^{-1}_{T=1}(0.5 | x_i) − S^{-1}_{T=0}(0.5 | x_i), the difference of the predicted median survival times.
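The composite objective can be sketched as a treatment-weighted sum of the two Cox losses plus the IPM penalty on the latent codes. A minimal sketch, in which the function names, the toy linear-kernel IPM, and the exact placement of the weight q are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def bites_loss(loss_cox_t0, loss_cox_t1, phi_t0, phi_t1, q, alpha, ipm):
    """Sketch of a BITES-style objective (names hypothetical).

    loss_cox_t0/1: negative Cox partial log-likelihoods per treatment arm
    phi_t0/1:      latent representations Phi(x) of control/treated patients
    q:             fraction of patients in the control arm (T = 0)
    alpha:         strength of the IPM regularization (alpha = 0 -> ITES)
    ipm:           any sample-based IPM estimator, e.g. a Sinkhorn divergence
    """
    return q * loss_cox_t0 + (1 - q) * loss_cox_t1 + alpha * ipm(phi_t0, phi_t1)

def mean_diff_ipm(a, b):
    # crude linear-kernel IPM: distance between the sample means
    return float(np.linalg.norm(a.mean(0) - b.mean(0)))

phi0 = np.zeros((10, 4))   # toy latent codes, control arm
phi1 = np.ones((10, 4))    # toy latent codes, treated arm (shifted)

base = bites_loss(1.0, 1.0, phi0, phi1, q=0.5, alpha=0.0, ipm=mean_diff_ipm)
reg  = bites_loss(1.0, 1.0, phi0, phi1, q=0.5, alpha=1.0, ipm=mean_diff_ipm)
print(reg > base)  # imbalance between the arms is penalized only for alpha > 0
```

In the actual model the two Cox losses and Φ are outputs of one shared network, so the IPM gradient pushes the shared layers towards treatment-balanced representations.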
Implementation BITES uses a deep architecture of densely connected layers, each followed by a dropout (Srivastava et al., 2014) and a batch normalization layer (Ioffe and Szegedy, 2015). It uses ReLU activation functions (Nair and Hinton, 2010) and is trained using the Adam optimizer (Kingma and Ba, 2014). Further, early stopping based on a non-decreasing validation loss and weight decay regularization (Krogh and Hertz, 1992) are used to improve generalization. Our implementation is based on the PyTorch machine learning library (Paszke et al., 2019) and the pycox package (Kvamme and Borgan, 2019). The Sinkhorn divergence is implemented using the GeomLoss package (Feydy et al., 2018). We provide an easy-to-use python implementation which includes hyper-parameter optimization using the ray[tune] package (Liaw et al., 2018) to efficiently distribute model training.

Performance measures
We used different measures to assess the performance of treatment recommendation systems, comprising measures for the quantification both of prediction performance and of treatment assignment. Discriminative performance was assessed using a time-dependent extension of Harrell's C-index (Harrell, 1982) to account for differing baseline hazards, which evaluates the concordance of the predicted survival probabilities for all comparable samples i and j at all event times y_i^{E=1} (Antolini et al., 2005). This reduces to Harrell's C-index for strictly ordered survival curves. To quantify the performance of treatment recommendations, we used the Precision in Estimation of Heterogeneous Effect (PEHE) score (Hill, 2011), which quantifies the difference in residuals between the factual and counterfactual outcomes. Note that the PEHE score can only be calculated if both the factual and counterfactual outcomes are known, which is usually only the case in simulation studies. Therefore, we restricted its application to the latter. There, we further quantified the proportion of correctly assigned "best treatments".
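The two treatment-assignment measures are straightforward to compute once true and estimated ITEs are available. A minimal sketch, assuming the standard squared-error form of the PEHE score (the toy arrays are purely illustrative):

```python
import numpy as np

def pehe(tau_hat, tau_true):
    """Precision in Estimation of Heterogeneous Effect (squared-error form)."""
    return float(np.mean((tau_hat - tau_true) ** 2))

def correct_assignment_rate(tau_hat, tau_true):
    """Fraction of patients for whom the recommended treatment (tau > 0)
    matches the truly beneficial one."""
    return float(np.mean((tau_hat > 0) == (tau_true > 0)))

tau_true = np.array([1.0, -0.5, 2.0, -1.0])   # ground-truth ITEs (simulation)
tau_hat  = np.array([0.8, -0.4, 1.5,  0.3])   # last patient is mispredicted

print(pehe(tau_hat, tau_true))               # 0.4975
print(correct_assignment_rate(tau_hat, tau_true))  # 0.75
```

Both quantities require the counterfactual outcome, which is why their use is restricted to simulated data.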

Simulation studies
We performed three exemplary simulation studies.First, we simulated a scenario where covariates affect survival only linearly.Second, we simulated data with additional non-linear dependencies, and, finally, we performed a simulation where the treatment assignments were biased by the covariates.
Linear simulation study In analogy to Alaa et al. (2017) and Lee et al. (2018), we simulated a 20-dimensional covariate vector x = (x_1, x_2) ∼ N(0, I) consisting of two 10-dimensional vectors x_1 and x_2. We set the parameters γ_1 = (0.1, ..., 0.1)^T and γ_2 = (15, 35, 55, 75, 95, 115, 135, 155, 175, 195)^T · 10^{-2} of the survival-time model, in which the first term in the exponent is treatment dependent while the second term affects survival under both treatments identically. This simulation gives an overall positive average treatment effect in ∼64% of the patients. Survival times exceeding 10 years were censored to resemble common censoring at the end of a study. Of the remaining samples, 50% were censored at a randomly drawn fraction f_c ∼ U(0, 1) of the true unobserved survival time. Samples were assigned randomly to the treated, T = 1, and control group, T = 0, without treatment administration bias. Finally, we added an error ε ∼ N(0, 0.1 · I) to all covariates. Detailed information about the hyper-parameter selection is given in the Supplementary materials.

Non-linear simulation study Next, we simulated non-linear treatment-outcome dependencies, setting the parameters γ_1 = (2, ..., 2)^T and γ_2 = (0.5, 0.9, 1.3, 1.7, 2.1, 2.5, 2.9, 3.3, 3.7, 4.1)^T of a model in which the first term imposes sizable non-linear effects which differ between both treatments. We further scaled the polynomials by c = 0.01 to yield realistic survival times of up to 10 years. This setting gives an overall positive average treatment effect in ∼64% of the patients.
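The linear data-generating recipe can be sketched as follows. The paper's exact survival-time equation is not reproduced here; the exponential model with a treatment-dependent first term in the exponent is an illustrative assumption that only mirrors the described structure:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# 20-dimensional covariates x = (x1, x2), each block 10-dimensional
x = rng.normal(size=(n, 20))
x1, x2 = x[:, :10], x[:, 10:]

gamma1 = np.full(10, 0.1)
gamma2 = np.arange(15, 196, 20) / 100.0     # (0.15, 0.35, ..., 1.95)

T = rng.integers(0, 2, size=n)              # random (unbiased) assignment

# Illustrative survival model: treatment-dependent first term in the
# exponent, treatment-independent second term (functional form assumed).
risk = T * (x1 @ gamma1) + x2 @ gamma2
y = rng.exponential(scale=np.exp(-risk))

# administrative censoring at 10 years, as in the described setup
event = (y <= 10.0).astype(int)
y = np.minimum(y, 10.0)

# random censoring of 50% of the remaining uncensored samples at a
# fraction f_c ~ U(0, 1) of the true survival time
uncens = np.flatnonzero(event == 1)
drop = rng.choice(uncens, size=len(uncens) // 2, replace=False)
y[drop] *= rng.uniform(0.0, 1.0, size=len(drop))
event[drop] = 0

x += rng.normal(0.0, 0.1, size=x.shape)     # observation noise on covariates
```

The biased variant of the study only changes the assignment of `T`, making it depend on selected entries of x_1 and x_2 with 90% probability.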
Figure 2(b) gives the performance of the evaluated methods in terms of Harrell's C-index. We observed that the ordinary Cox regression with linear predictor variables performs worst across all sample sizes, followed by RSF and SurvITE. Approximately equal performance was observed for the DeepSurv approaches, ITES and BITES. Among these methods, the treatment-specific DeepSurv models (T-DeepSurv) showed a higher variance across the simulation runs, in particular for the low sample sizes. Next, we studied the corresponding PEHE scores (Supplementary Figure S1) and the proportion of correctly assigned treatments (Figure 2(e)). We observed that, although DeepSurv performed well in terms of C-indices, its performance was highly compromised in the latter two measures. In fact, it was not able to outperform the recommendation based on the ATE, i.e. always assigning T = 1, which corresponds to the dashed horizontal line. We further observed that SurvITE performed worst in this scenario, with both substantially lower proportions of correctly assigned treatments and higher PEHE scores compared to the other methods. Here, T-DeepSurv, ITES and BITES performed best; however, the results of the former are inferior compared to ITES and BITES for sample sizes of n = 600 and n = 1200.
Non-linear simulation study with treatment bias Finally, we repeated the non-linear simulation study but now took into account a treatment assignment bias, i.e. the value of one or more covariates is indicative of the applied treatment. To simulate this effect, we assigned the treatment with a 90% probability if the fifth entry of x_1 or x_2 was larger than zero. To ensure that the unconfoundedness assumption holds, we set the corresponding entries of γ_1 and γ_2 to zero. This simulation study yields a positive treatment effect in ∼71% of the patients (dashed horizontal line in Figure 2(f)). Figures 2(c), 2(f) and Supplementary Figure S1 show the results in terms of C-index, correctly assigned treatments and PEHE scores, respectively. Similar to the previous studies, the best performing methods with respect to C-indices were the two DeepSurv models, ITES and BITES. With respect to correctly assigned treatments and PEHE scores, however, BITES consistently outperformed the other methods for reasonable sample sizes starting from n = 1200.
For n = 600, none of the methods was able to outperform a model where the treatment is always recommended (dashed line in Figure 2(f)).

BITES optimizes hormone treatment in patients with breast cancer
We retrieved pre-processed data of 1,545 breast cancer patients as used by Katzman et al. (2018), which were originally extracted from the Rotterdam Tumour Bank (Foekens et al., 2000). The available patient characteristics are age, menopausal status (pre/post), number of cancerous lymph nodes, tumor grade, and progesterone and estrogen receptor status. Of these patients, 339 were treated with a combination of chemotherapy and hormone therapy; the remaining patients were treated with chemotherapy only. Note that in this study the application of hormone treatment was not randomized.
In total ∼ 37% of the patients were censored.
We used these data to learn treatment recommender systems in order to predict the individual treatment effect of adding hormone therapy to chemotherapy. We performed hyper-parameter tuning as outlined in the Supplementary Material and selected the model with the lowest validation loss for each method.
Next, we evaluated the performance using test data from the controlled randomized GBSG trial (Schumacher et al., 1994). In the corresponding SHAP summary plots (Figure 4), high feature values are shown in red and low values in blue. We observed that the number of positive lymph nodes has the strongest impact on survival, with SHAP values ranging from ∼−0.5 to ∼1 in the groups with and without hormone treatment, where more positive lymph nodes (shown in red) indicate worse survival. Considering the menopausal status, we observed that patients post-menopause showed an increased risk of death in the group without hormone treatment (T = 0). Interestingly, this effect was substantially mitigated in the hormone-treated group. It was particularly interesting to observe that a high tumor grade (grade 3, shown in red) yields high SHAP values of up to 0.5 in the hormone-treated group. This effect was substantially mitigated in the group without hormone treatment. In summary, we observed strong hints that hormone treatment alleviates the negative effect of menopause, and increases the negative effect of high tumor grade on patient survival.

Conclusion
We presented BITES, a machine learning framework to optimize individual treatment decisions based on time-to-event data. It combines deep neural network counterfactual reasoning with Cox's proportional hazards model. It further enables balancing of treated and non-treated patients using integral probability metrics on a latent-layer data representation. We demonstrated in simulation studies that BITES outcompetes state-of-the-art methods with respect to prediction performance (Harrell's C-index), correctly assigned treatments, and PEHE scores. We observed that BITES can effectively capture both linear and non-linear covariate-outcome dependencies in both small and large observational studies. Moreover, we showed that BITES can be used to optimize hormone treatment in breast cancer patients. Using independent data from the GBSG Trial 2, we observed that BITES treatment recommendations might improve patient survival. In this context, SHAP values were demonstrated to enhance the interpretability and transparency of treatment recommendations.
Like most recently developed counterfactual tools, BITES depends on the strong ignorability assumption. Hence, caution is necessary when analyzing heavily confounded observational data. Future work needs to address more specialized time-to-event models, such as competing event models, and the generalization to multiple treatments and combinations thereof. Both could substantially broaden the scope of applications for BITES.
In summary, BITES facilitates treatment optimization from time-to-event data. In combination with SHAP values, BITES models can be easily interpreted on the level of individual patients, making them a versatile backbone for treatment recommender systems.

Figure 2. Figure 2(a) shows the distributions of Harrell's C-index evaluated on 1000 test samples for 50 consecutive simulation runs. We observed that across all investigated sample sizes (x-axis) the T-learner Cox regression showed superior performance, closely followed by ITES and BITES. These three methods performed equally well for the larger sample sizes n = 1800 and n = 2400. We further investigated the proportion of correctly assigned treatments, Figure 2(d), and the PEHE scores, Supplementary Figure S1, where we obtained qualitatively similar trends. RSF, DeepSurv and T-DeepSurv showed inferior performance with respect to C-indices, correctly assigned treatments and PEHE scores.

Figure 3.

Finally, we explored the feature importance of the BITES model using SHAP values (Lundberg and Lee, 2017), with results shown in Figure 4 for the treatment options T = 0 (no hormone treatment) and T = 1 (hormone treatment), respectively. Points correspond to patients, and positive (negative) SHAP values on the x-axis indicate an increased (decreased) risk of failure. Feature values are illustrated in colors ranging from red (high) to blue (low).

Figure 4. SHAP (SHapley Additive exPlanations) values for the best selected BITES model on the controlled randomized test samples of the RGBSG data. Red points correspond to high and blue points to low feature values. A positive SHAP value indicates an increased hazard and hence decreased survival chances, and vice versa.

Table 1. Predictive outcomes on the controlled randomized test set of the RGBSG data obtained by each of the discussed models with minimum validation loss found in a hyper-parameter grid search.