Distinguishing prognostic and predictive biomarkers: an information theoretic approach

Abstract Motivation The identification of biomarkers to support decision-making is central to personalized medicine, in both clinical and research scenarios. The challenge can be seen in two halves: identifying predictive markers, which guide the development/use of tailored therapies; and identifying prognostic markers, which guide other aspects of care and clinical trial planning, i.e. prognostic markers can be considered as covariates for stratification. Mistakenly assuming a biomarker to be predictive, when it is in fact largely prognostic (and vice-versa) is highly undesirable, and can result in financial, ethical and personal consequences. We present a framework for data-driven ranking of biomarkers on their prognostic/predictive strength, using a novel information theoretic method. This approach provides a natural algebra to discuss and quantify the individual predictive and prognostic strength, in a self-consistent mathematical framework. Results Our contribution is a novel procedure, INFO+, which naturally distinguishes the prognostic versus predictive role of each biomarker and handles higher order interactions. In a comprehensive empirical evaluation INFO+ outperforms more complex methods, most notably when noise factors dominate, and biomarkers are likely to be falsely identified as predictive, when in fact they are just strongly prognostic. Furthermore, we show that our methods can be 1–3 orders of magnitude faster than competitors, making it useful for biomarker discovery in ‘big data’ scenarios. Finally, we apply our methods to identify predictive biomarkers on two real clinical trials, and introduce a new graphical representation that provides greater insight into the prognostic and predictive strength of each biomarker. Availability and implementation R implementations of the suggested methods are available at https://github.com/sechidis. Supplementary information Supplementary data are available at Bioinformatics online.

• Global outcome modelling methods, which model the outcome function f(X, T).
The simplest way to model this function is with generalised linear regression models (Freidlin and Simon, 2005; Tian et al., 2014). Different extensions have been suggested in the literature, such as FindIt (Imai and Ratkovic, 2013), which uses a support vector machine (SVM) classifier with differing penalisation for the predictive vs the prognostic covariates. Another category of methods utilises potential outcome (or counterfactual) modelling (Rubin, 1974). A popular method in this class is Virtual Twins (VT) (Foster et al., 2011), which follows a three-step strategy. Firstly, it non-parametrically estimates the outcome function f(X, T) using random forests (RF) (Breiman, 2001). Then, it estimates the treatment effect (i.e. treatment difference) for each patient i: z(x_i) = f(x_i, 1) − f(x_i, 0). Finally, it identifies interesting subgroups by building a regression tree, using X as an input and the treatment difference Z as target variable. • Global treatment effect modelling methods, which directly model the treatment effect z(X).
One of the earliest subgroup identification methods is Interaction Trees (IT) (Su et al., 2008), an extension of the CART decision tree algorithm (Breiman et al., 1984) that utilises the treatment variable. The splitting criterion in IT is an interaction test between treatment and covariate, and as a result, IT splits the data in such a way that the heterogeneity with respect to the treatment contrast z(X) is increased. Loh et al. (2015) proposed a different method, the Generalized Unbiased Interaction Detection and Estimation (GUIDE) tree, which overcomes one of the inherent biases of CART: favouring variables with a large number of categories. While all the above methods are non-parametric, the treatment effect can also be modelled with parametric methods. For example, Tian et al. (2014) present a method where they fit a multivariate regression to predict the outcome, using as input the modified covariates z(X)·T/2 (where the treatment variable T is coded ±1), without the need to model the main effects. • Local modelling methods, which directly search for subgroups with an improved treatment effect z(X). Kehl and Ulm (2006) presented a method to solve the problem of responder identification in censored data by extending the bump hunting procedure (Friedman and Fisher, 1999). Another popular algorithm is Subgroup Identification based on Differential Effect Search (SIDES) (Lipkovich et al., 2011), which directly searches for specific areas in the covariate space with enhanced treatment-by-biomarker interactions. SIDES is a recursive partitioning method inspired by bump hunting.
For the experiments in our paper, we selected representative methods from each category: Virtual Twins (global outcome modelling), Interaction Trees and Modified Covariates Regression (global treatment effect modelling), and SIDES (local modelling). The primary objective of all these methods is subgroup discovery. We remind the reader that our primary objective is distinguishing prognostic/predictive markers by ranking them according to their effect. Some of the methods above, such as SIDES and IT, do quantify predictive strength via variable importance scores, and these scores can be used to derive predictive biomarker rankings. Other methods, such as VT/MCR, can be straightforwardly adapted to generate predictive biomarker rankings (more details in Section 5 of the Supplementary material). These methods can also be characterised from the perspective of the statistical tools each uses: penalised linear regression methods (such as MCR), counterfactual modelling methods (such as VT) and recursive partitioning methods (such as SIDES/IT).
In our article we use the tools of information theory to introduce a new approach that explicitly disentangles the prognostic and predictive strength of each biomarker. Using our new formalisation of the problem, we suggest two novel methods, INFO/INFO+, for predictive biomarker ranking. The methods also fit neatly into the second category of Lipkovich et al. (2017), as global treatment effect modelling methods, and, in summary, the following advantages hold: 1. Unlike penalised linear regression methods, our methods do not assume any functional form for the interaction and the main effects. Main paper's Section 3.1.5 explores this empirically, and shows that INFO+ captures high-order biomarker interactions without assuming any specific predictive model. 2. Unlike potential outcome modelling methods, our methods do not build a prediction model to derive the treatment effect. To derive this effect, the potential outcome methods build a prediction model that infers the outcome of each patient under the opposite treatment assignment, and as a result they make further modelling assumptions (Gelman and Hill, 2006, Chapter 9). There are scenarios where this counterfactual modelling may fail. For example, in the first step of VT, the classifiers used to estimate f(x, t) may completely ignore the treatment variable and focus only on the strongly prognostic biomarkers. Main paper's Section 3.1.4 explores these scenarios empirically, and shows that counterfactual modelling through VT is biased towards strongly prognostic biomarkers, while information theoretic modelling through INFO+ achieves a better trade-off between true positive and false positive decisions. 3. Unlike recursive partitioning methods, our methods use all the available data in every step for estimating the predictive strength and for capturing possible interactions between the biomarkers.
To do this in a sample-efficient way, we derive low-dimensional approaches in combination with shrinkage estimators that are suitable for 'small-sample size, high-dimensionality' scenarios (Hausser and Strimmer, 2009). Main paper's Sections 3.1.6 and 3.1.7 explore these scenarios empirically, comparing the performance of the different methods across different sample sizes and dimensionalities. Furthermore, Main paper's Section 3.1.8 explores the scenario of having subgroups of patients with an enhanced treatment effect. 4. Unlike all the previous methods, which derive full rankings of all biomarkers, our approach uses a step-wise procedure that can be used to derive partial rankings (i.e. the top-K predictive biomarkers). This brings a substantial improvement in computational cost. Main paper's Section 3.1.10 compares the different methods in terms of their computational complexity.
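The step-wise idea in point 4 can be illustrated as a greedy forward selection over biomarkers, scoring each candidate by the conditional mutual information between treatment and outcome given the already-selected set. The sketch below is only illustrative: it uses a naive maximum-likelihood (plug-in) estimator in Python, not the paper's R implementation with its low-dimensional approximations and shrinkage.

```python
import math
from collections import Counter

def mi(xs, ys):
    """Plug-in estimate of I(X; Y) in nats."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def cmi(ts, ys, xs):
    """Plug-in estimate of I(T; Y | X), as the expectation over the values of X."""
    n = len(ts)
    out = 0.0
    for v in set(xs):
        idx = [i for i in range(n) if xs[i] == v]
        out += (len(idx) / n) * mi([ts[i] for i in idx], [ys[i] for i in idx])
    return out

def top_k_predictive(T, Y, X_cols, k):
    """Greedy forward selection of a partial (top-K) predictive ranking:
    at each step, add the biomarker maximising I(T; Y | selected + candidate)."""
    selected, remaining = [], list(range(len(X_cols)))
    for _ in range(k):
        def score(j):
            joint = list(zip(*(X_cols[i] for i in selected + [j])))
            return cmi(T, Y, joint)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: X1 is purely predictive (Y = X1 XOR T), X2 is irrelevant noise.
T  = [0, 0, 0, 0, 1, 1, 1, 1]
X1 = [0, 0, 1, 1, 0, 0, 1, 1]
X2 = [0, 1, 0, 1, 0, 1, 0, 1]
Y  = [t ^ x for t, x in zip(T, X1)]

print(top_k_predictive(T, Y, [X1, X2], 1))  # the predictive biomarker ranks first
```

Because the ranking stops after K steps, only K passes over the candidate set are needed, which is where the computational saving over full-ranking methods comes from.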

Background on information theory
In information theory, a fundamental concept is the mutual information between two random variables X and Y (Shannon, 1948; Cover and Thomas, 2006; Steuer et al., 2002):

\[ I(X;Y) = \int_{\mathcal{X}} \int_{\mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \, dy \, dx. \]

Mutual information quantifies the amount of information two variables share; note that it is symmetric, I(X;Y) = I(Y;X). The choice of the base of the logarithm merely alters the scale: popular choices are 2, giving a value in bits, and e (the natural logarithm), giving a value in nats. The above is the generic phrasing, where x, y could be continuous or discrete valued. In the discrete data scenario, the integrals become sums over the possible values: for example the set {0, 1} if Y is a binary outcome variable. In a medical scenario the random variable X might represent some patient factors, and Y a binary health outcome. The mutual information quantifies the dependence between random variables: larger if they are dependent, and zero if they are statistically independent. In a practical situation, we imagine that our dataset {x_i, y_i}_{i=1}^n is a sample from the joint distribution p(x, y) of these random variables, and we estimate the relevant probabilities. When x, y are discrete valued, it is common to use maximum likelihood estimates (simple frequency counts), and the resulting estimator is denoted Î(X;Y). A second important concept for this paper is the conditional mutual information:

\[ I(X;Y|Z) = \int_{\mathcal{Z}} p(z) \int_{\mathcal{X}} \int_{\mathcal{Y}} p(x,y|z) \log \frac{p(x,y|z)}{p(x|z)\,p(y|z)} \, dy \, dx \, dz. \]

This quantifies the dependence between X and Y, once we already know the value of a third variable Z. It can equivalently be stated as the expectation of I(X;Y|Z = z) with respect to the distribution of Z. Again, the conditional mutual information is zero when X and Y are conditionally independent given Z, and increases with stronger dependence. In practice, when we have categorical data, we can estimate the conditional mutual information by using maximum likelihood estimates of the probabilities. This approach is known to have high error, especially for small sample sizes (Steuer et al., 2002). In Section 4 we present a shrinkage method (Efron, 2012; Hausser and Strimmer, 2009) for estimating the conditional mutual information, which achieves smaller error than the maximum likelihood approach, especially in small-sample scenarios.
Information theory provides an algebra with a well-defined set of semantics. For example, if we have two features X_1, X_2, and denote their joint random variable as the concatenation X_1X_2, the following identity holds:

\[ I(X_1 X_2; Y) = I(X_1; Y) + I(X_2; Y | X_1). \]

This is known as the chain rule (Cover and Thomas, 2006): the information about the outcome Y contained in the joint X_1X_2 decomposes into the individual information in X_1, plus a conditional term accounting for its interaction with X_2.
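The chain rule holds exactly even for plug-in sample estimates, which the following self-contained sketch (our own illustration, not the paper's implementation) verifies on a toy dataset:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Maximum-likelihood (plug-in) estimate of I(X; Y) in nats."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

# Toy sample: Y copies X1 exactly; X2 is pure noise.
x1 = [0, 0, 1, 1, 0, 1, 0, 1]
x2 = [0, 1, 0, 1, 0, 1, 0, 1]
y  = [0, 0, 1, 1, 0, 1, 0, 1]

x12 = list(zip(x1, x2))  # the joint variable X1X2

# Chain rule: I(X1X2; Y) = I(X1; Y) + I(X2; Y | X1).
lhs = mutual_information(x12, y)
# I(X2; Y | X1) computed as an expectation over the values of X1.
n, cmi = len(x1), 0.0
for v in set(x1):
    idx = [i for i in range(n) if x1[i] == v]
    cmi += (len(idx) / n) * mutual_information([x2[i] for i in idx],
                                               [y[i] for i in idx])
rhs = mutual_information(x1, y) + cmi
assert abs(lhs - rhs) < 1e-12
```

Here I(X1; Y) is already the full ln 2 nats (Y is a copy of X1), so the conditional term I(X2; Y | X1) contributes nothing, exactly as the noise interpretation of X2 suggests.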

Mutual information and log-linear models
In this section we will connect mutual information with log-linear models.
The main outcome of this is to show that I(T ; Y |X) captures only the predictive strength and does not contain any prognostic information.

Log-linear models and independence in two-way tables
We will present that connection using the notation of Agresti (2013, p. 339). First, let us assume that we have two variables, X (biomarker) and Y (target). For cell probabilities {π_ij}, the expected counts are:

\[ \mu_{ij} = n \pi_{ij}. \]

For multinomial sampling, under statistical independence the {μ_ij} have the structure:

\[ \mu_{ij} = n \pi_{i+} \pi_{+j}. \]

Since the independence formula is multiplicative, log μ_ij has additive form:

\[ \log \mu_{ij} = \lambda + \lambda_i^X + \lambda_j^Y, \]

where the superscripts X and Y label the variables and are not exponents. This is the log-linear model of independence (M_0). The ML fitted values are \(\hat\mu_{ij} = n_{i+} n_{+j} / n\), the estimated expected frequencies for the X² test of independence.
Statistically dependent variables satisfy a more complex log-linear model:

\[ \log \mu_{ij} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{YX}. \]

This is the saturated model (M_1). The {λ_ij^{YX}} are association terms that reflect deviations from independence. The model M_1 is hierarchical, which means that it includes all lower-order terms composed from the variables contained in a higher-order term. A reason for including lower-order terms is that, otherwise, the statistical significance and the interpretation of a higher-order term would depend on how the variables are coded. This is undesirable; with hierarchical models the same results occur no matter how the variables are coded.
The goodness-of-fit of the model of independence is measured by the deviance, which is the likelihood-ratio statistic for testing the null hypothesis that the model holds against the general alternative (i.e. the saturated model). For the log-linear model of independence the deviance can be written in terms of the mutual information as:

\[ G^2(M_0) = 2 n \hat I(X; Y). \]

So the mutual information, which quantifies the dependence between X and Y, can be seen as a measure of how well the model of independence M_0, the model without the interaction term, fits the data compared to the saturated model. This matches the exact definition of mutual information as the KL divergence between the joint p(x, y) (saturated model) and the product of the marginals p(x)p(y) (independence model):

\[ I(X;Y) = \mathrm{KL}\big( p(x,y) \,\|\, p(x)\,p(y) \big). \]

Prognostic biomarker rankings using I(X;Y): We can see that I(X;Y) quantifies how close the two models M_0 and M_1 are. Since the only difference between these two models is the interaction term λ_ij^{YX}, the "prognostic" term, the mutual information quantifies whether there is an interaction between X and Y. For that reason, we can derive prognostic biomarker rankings by ranking the biomarkers on the information they share with the target.
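The identity between the deviance of the independence model and the (plug-in) mutual information can be checked numerically. The sketch below is our own illustration in Python, working in nats:

```python
import math
from collections import Counter

def mi_nats(xs, ys):
    """Plug-in estimate of I(X; Y) in nats."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def deviance_independence(xs, ys):
    """Likelihood-ratio statistic G^2(M0) of the independence model
    against the saturated model M1, from the observed two-way table."""
    n = len(xs)
    nxy, nx, ny = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    # mu_hat = n_{i+} n_{+j} / n are the ML fitted values under M0
    return sum(2.0 * c * math.log(c / (nx[x] * ny[y] / n))
               for (x, y), c in nxy.items())

x = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]
y = [0, 0, 1, 1, 1, 0, 1, 0, 1, 0]
n = len(x)
assert abs(deviance_independence(x, y) - 2 * n * mi_nats(x, y)) < 1e-12
```

The algebra behind the assertion: G² = 2 Σ n_ij log(n_ij / μ̂_ij) = 2n Σ p̂_ij log(p̂_ij / (p̂_i p̂_j)) = 2n Î(X;Y).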

Log-linear models and independence in three-way tables
Let us assume now that, apart from the biomarker X and the target Y, we also have a third variable, the treatment T. This is a more complicated case, because there are several potential types of independence. In this section we explore the conditional independence of T and Y given X, and we explain why we use I(T;Y|X) to derive predictive biomarker rankings. The saturated log-linear model for three-way tables is:

\[ \log \mu_{ijk} = \lambda + \lambda_i^X + \lambda_j^Y + \lambda_k^T + \lambda_{ij}^{YX} + \lambda_{ik}^{TX} + \lambda_{jk}^{TY} + \lambda_{ijk}^{TYX}. \]

The model of conditional independence T ⊥⊥ Y | X is derived by setting the interaction coefficients λ_jk^{TY} and λ_ijk^{TYX} to zero (Agresti, 2013, Table 9.1). The deviance of the log-linear model of conditional independence can be written in terms of the conditional mutual information as:

\[ G^2(M_0) = 2 n \hat I(T; Y | X). \]

Predictive biomarker rankings using I(T;Y|X): From all the above, I(T;Y|X) quantifies how close the two models M_0 and M_1 are. Since the only difference between these two models is the interaction terms λ_jk^{TY} and λ_ijk^{TYX}, the conditional mutual information quantifies whether there is an interaction between T and Y. This measure does not take any prognostic information into account, because the prognostic term λ_ij^{YX} appears in both M_0 and M_1, and its effect therefore cancels out. An interesting point to mention is that the conditional mutual information also quantifies the overall treatment effect term λ_jk^{TY}; but since this term is independent of X, it will not affect the ranking. It could be argued that the predictive strength should be normalised by subtracting the overall treatment effect, i.e. I(T;Y|X) − I(T;Y). Since the normalising factor, I(T;Y), does not depend on the biomarker X, both normalised and un-normalised versions rank markers in exactly the same order.
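A toy example makes the prognostic/predictive separation concrete. For a purely prognostic marker (Y depends on X but not on T), I(T;Y|X) is zero; for a purely predictive marker with a qualitative interaction (Y = X XOR T), the marginal I(X;Y) is zero while I(T;Y|X) is maximal. The sketch below uses plug-in estimates and is our own illustration:

```python
import math
from collections import Counter

def mi(xs, ys):
    """Plug-in estimate of I(X; Y) in nats."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def cmi(ts, ys, xs):
    """I(T; Y | X) as the expectation of I(T; Y | X = x) over X."""
    n = len(ts)
    return sum((len(idx) / n) * mi([ts[i] for i in idx], [ys[i] for i in idx])
               for v in set(xs)
               for idx in [[i for i in range(n) if xs[i] == v]])

# Balanced design over T and X.
t = [0, 0, 1, 1, 0, 0, 1, 1]
x = [0, 1, 0, 1, 0, 1, 0, 1]
y_prognostic = x[:]                                # Y depends only on X
y_predictive = [xi ^ ti for xi, ti in zip(x, t)]   # Y = X XOR T

assert abs(cmi(t, y_prognostic, x)) < 1e-12            # prognostic-only: zero
assert abs(mi(x, y_prognostic) - math.log(2)) < 1e-12  # but strongly prognostic
assert abs(cmi(t, y_predictive, x) - math.log(2)) < 1e-12  # predictive: maximal
assert abs(mi(x, y_predictive)) < 1e-12                # yet no marginal X-Y link
```

This is exactly the cancellation argument above: the prognostic term enters both models and leaves I(T;Y|X) untouched, while the treatment-by-biomarker interaction is all that remains.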
Main paper's Section 3.1.3 shows experimentally that the conditional mutual information between the treatment and the outcome given a biomarker captures only the predictive strength of the biomarker.

Estimating Conditional Mutual Information
INFO+ derives predictive biomarker rankings by estimating conditional mutual information quantities. Here we describe the methods we implemented in our software for estimating mutual information for different data types (e.g. categorical and continuous covariates, and time-to-event outcomes).

Categorical data
Because clinical trials typically have small samples, it is desirable to use estimators that are tailored to this situation. For the purpose of this work, we focus on categorical data, and we now present a shrinkage estimator which is suitable for 'small n, large p' scenarios (Hausser and Strimmer, 2009) and which outperforms the conventional maximum likelihood approach, especially for small sample sizes. The conditional mutual information can be estimated as:

\[ \hat I(T;Y|X) = \sum_{t \in \mathcal{T}} \sum_{y \in \mathcal{Y}} \sum_{x \in \mathcal{X}} \hat p(t,y,x) \log \frac{\hat p(t,y|x)}{\hat p(t|x)\,\hat p(y|x)}. \quad (3) \]

To estimate the above expression, we need a way to estimate the joint probability p̂(t, y, x). The simplest approach is the maximum likelihood estimator (i.e. frequencies of observed counts), p̂_ML(t, y, x) = n_{t,y,x}/n, where n_{t,y,x} is shorthand for the count of the event T = t, Y = y, X = x, and n is the sample size. Plugging these probabilities into eq. (3), we derive the maximum likelihood estimator. The overall size of the sample space is d = |T| × |Y| × |X|, and it increases as we increase the dimensionality of X. In scenarios where n >> d the maximum likelihood estimator is known to be optimal (Hausser and Strimmer, 2009). Otherwise, a scenario that often occurs in clinical trials, it is challenging to estimate the probabilities, and as a result the conditional mutual information, reliably. This is caused by the small number of observed counts for each event, or, even worse, by events without any example (zero counts).
To overcome this limitation, Hausser and Strimmer (2009) derived a shrinkage estimator, similar in spirit to James and Stein (1961), for the multinomial probabilities. The main idea is to average over two different models: a high-dimensional one, which is a low-bias/high-variance estimator, and a low-dimensional one, which is high-bias/low-variance. In our work we follow the convention of choosing as the high-dimensional estimator the maximum likelihood estimate p̂_ML(t, y, x) = n_{t,y,x}/n, and as the low-dimensional estimator the uniform probabilities p_Uni(t, y, x) = 1/d. The shrinkage estimator is then the convex combination:

\[ \hat p_{Shrink}(t,y,x) = \lambda\, p_{Uni}(t,y,x) + (1-\lambda)\, \hat p_{ML}(t,y,x), \quad \forall t \in \mathcal{T}, y \in \mathcal{Y}, x \in \mathcal{X}. \quad (4) \]

Schäfer and Strimmer (2005) derived a closed-form expression for the shrinkage intensity λ* that is optimal in terms of mean squared error, which can be estimated through the following expression (Hausser and Strimmer, 2009):

\[ \hat\lambda^{*} = \frac{1 - \sum_{t,y,x} \hat p_{ML}(t,y,x)^2}{(n-1) \sum_{t,y,x} \big( p_{Uni}(t,y,x) - \hat p_{ML}(t,y,x) \big)^2}. \quad (5) \]

From eqs. (4) and (5) we can estimate the shrinkage probabilities. It can be shown that there is a one-to-one correspondence between the shrinkage and the Bayes estimates, which means that shrinkage is an empirical Bayes estimator (Hausser and Strimmer, 2009). Hausser and Strimmer (2009) used the shrinkage probabilities to estimate entropy in order to learn gene association networks, while Scutari and Brogini (2012) used them to derive a test of independence for learning the structure of Bayesian networks. In our work, by plugging the shrinkage probabilities into eq. (3) we derive a shrinkage estimator of the conditional mutual information, Î_Shrink(T;Y|X), which we use to derive predictive biomarker rankings through INFO and INFO+ (Main paper's Algorithm 1, Line 4). Figure 1 compares the performance of the maximum likelihood and shrinkage estimators in terms of mean squared error (MSE). To perform these experiments, we generated synthetic data with known population values of the conditional mutual information I(T;Y|X).
We used three different levels: 0 nats (in other words, T and Y conditionally independent given X), 0.01 nats and 0.10 nats. Furthermore, we assumed that the treatment and outcome variables are binary (i.e. |T| = |Y| = 2), while |X| = 25; X can be seen as the joint variable of two features that each take five distinct values, which is, for example, the conditional mutual information we need to estimate in INFO+ (Main paper's Algorithm 1, Line 4). As we observe, the shrinkage estimator performs better (i.e. lower MSE) and converges much faster than the maximum likelihood estimator.
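A minimal sketch of the shrinkage estimator of eqs. (4)-(5) plugged into the conditional mutual information, assuming binary T and Y and a categorical X with five levels. This is our own Python illustration, not the paper's R implementation:

```python
import math
import random
from collections import Counter
from itertools import product

def shrink_joint(triples, support):
    """Shrinkage estimate of the joint p(t, y, x): a convex combination of the
    ML frequencies and the uniform target, with closed-form intensity (eq. 5)."""
    n, d = len(triples), len(support)
    counts = Counter(triples)
    p_ml = {k: counts.get(k, 0) / n for k in support}
    u = 1.0 / d
    denom = (n - 1) * sum((u - p) ** 2 for p in p_ml.values())
    lam = 1.0 if denom == 0 else min(1.0, max(0.0,
          (1.0 - sum(p * p for p in p_ml.values())) / denom))
    return {k: lam * u + (1 - lam) * p for k, p in p_ml.items()}, lam

def cmi_from_joint(p):
    """I(T; Y | X) computed directly from a joint distribution p(t, y, x)."""
    px, ptx, pyx = Counter(), Counter(), Counter()
    for (t, y, x), v in p.items():
        px[x] += v
        ptx[(t, x)] += v
        pyx[(y, x)] += v
    return sum(v * math.log(v * px[x] / (ptx[(t, x)] * pyx[(y, x)]))
               for (t, y, x), v in p.items() if v > 0)

random.seed(1)
support = list(product([0, 1], [0, 1], range(5)))  # |T| = |Y| = 2, |X| = 5
# Small sample in which T is independent of Y given X (population CMI = 0).
sample = [(random.randint(0, 1), random.randint(0, 1), random.randrange(5))
          for _ in range(40)]
p_shrink, lam = shrink_joint(sample, support)
i_shrink = cmi_from_joint(p_shrink)  # pulled towards uniform, hence towards 0
```

Note that the shrinkage estimate assigns non-zero probability to every cell of the support, which sidesteps the zero-count problem discussed above.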

Continuous covariates
While in the experiments of the main paper we focused on categorical datasets, INFO+ can be used with different types of data using different types of estimators. Now we will provide a short review of some popular Machine Learning approaches for estimating mutual information.
When we have continuous variables (covariates and/or output), one straightforward way of estimating mutual information is to make a parametric assumption. For example, Brillinger (2004) derives an estimator under the popular Gaussian model. A non-parametric way to estimate mutual information is to partition the continuous space into discrete bins, a method also known as binning (Steuer et al., 2002). This can be done using a pre-specified fixed number of bins, or with adaptive partitioning methods, such as the Fraser-Swinney algorithm (Fraser and Swinney, 1986). A popular adaptive partitioning method is the Maximal Information Coefficient (MIC) (Reshef et al., 2011), a measure of dependence that uses a normalised version of the mutual information and selects the binning that maximises it. There are other types of non-parametric estimators of the mutual information, such as kernel-based approaches (Gretton et al., 2005) or nearest-neighbour estimators (Pál et al., 2010).
In our software we implemented an estimator for the conditional mutual information based on a non-parametric density estimation procedure. The core idea is to transform the continuous variables into categorical ones using a histogram-generation method, such as Scott's rule or the Freedman-Diaconis rule (Keen, 2010, Chapter 6), and then to use the shrinkage estimator presented earlier.
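This transform can be sketched as follows, using Scott's rule (bin width h = 3.49 σ n^(-1/3), a standard formula) to discretise a continuous covariate into categories that can then be fed to the shrinkage estimator; the implementation details here are our own illustration:

```python
import math
import statistics

def discretise_scott(values):
    """Discretise a continuous covariate with Scott's rule: h = 3.49 * sd * n^(-1/3).
    Returns (bin codes in 0..k-1, number of bins k)."""
    n = len(values)
    sd = statistics.pstdev(values)
    h = 3.49 * sd * n ** (-1.0 / 3.0)
    lo, hi = min(values), max(values)
    if h <= 0:                       # constant covariate: a single bin
        return [0] * n, 1
    k = max(1, math.ceil((hi - lo) / h))
    # map each value to its bin index, clipping the maximum onto the last bin
    return [min(k - 1, int((v - lo) / h)) for v in values], k

covariate = [0.1 * i for i in range(100)]  # toy continuous biomarker
codes, k = discretise_scott(covariate)
```

The resulting integer codes are ordinary categorical values, so the categorical machinery of Section 4 (including shrinkage) applies unchanged.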

Survival outcome
Another interesting scenario is survival data (also known as time-to-event data, or censored data). Beirlant et al. (1997) present a histogram-based method for estimating entropy from censored data. Extending their idea, we can derive a histogram-based method for estimating mutual information in censored data. Let s = 1, ..., S be the distinct times of observed events in either group. Then, for each timestep we have an outcome variable Y(s). For each patient, the value of this random variable is 0 if there is no progression information at this time point for this patient, or the value of the outcome, if we have the progression information. Finally, we estimate the conditional mutual information by averaging Î(T; Y(s)|X) across the timesteps s.
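A sketch of this averaging scheme follows. The encoding of Y(s) below (1 if a progression event is observed by time s, 0 otherwise, including for censored patients) is our own illustrative reading of the description above, and the estimates are naive plug-in values:

```python
import math
from collections import Counter

def mi(xs, ys):
    """Plug-in estimate of I(X; Y) in nats."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def cmi(ts, ys, xs):
    """I(T; Y | X) as the expectation of I(T; Y | X = x) over X."""
    n = len(ts)
    return sum((len(idx) / n) * mi([ts[i] for i in idx], [ys[i] for i in idx])
               for v in set(xs)
               for idx in [[i for i in range(n) if xs[i] == v]])

def survival_cmi(times, events, ts, xs):
    """Average I(T; Y(s) | X) over the distinct observed event times s,
    where Y(s) = 1 if a progression event is observed by time s, else 0."""
    n = len(times)
    event_times = sorted({times[i] for i in range(n) if events[i] == 1})
    total = 0.0
    for s in event_times:
        ys = [1 if (events[i] == 1 and times[i] <= s) else 0 for i in range(n)]
        total += cmi(ts, ys, xs)
    return total / len(event_times)

# Toy censored data: event indicator 1 = progression observed, 0 = censored.
times  = [1, 2, 2, 3, 4, 5, 5, 6]
events = [1, 1, 0, 1, 1, 0, 1, 1]
ts     = [0, 1, 0, 1, 0, 1, 0, 1]
xs     = [0, 0, 1, 1, 0, 0, 1, 1]
score = survival_cmi(times, events, ts, xs)
```

Since each per-timestep term is a conditional mutual information with a binary Y(s), the averaged score is non-negative and bounded by ln 2 nats.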

Normalised Conditional Mutual Information
When we have different types of variables (continuous variables, categorical variables with different numbers of levels), it is challenging to compare the values of the conditional mutual information directly. To overcome this issue, we should use normalised versions of the conditional mutual information, which take into account the diverse characteristics of each covariate (Vinh et al., 2010). The literature on deriving normalised information theoretic measures is vast (Kvålseth, 2017). In our implementation, we normalise the estimated conditional mutual information for each feature X.

Competing methods
In this section we will provide all necessary implementation details.
• MCR: One way to rank the biomarkers on their predictive strength is by fitting a generalised linear regression model and exploring the importance of the interaction terms. Recently, Tian et al. (2014) introduced the Modified Covariates Regression (MCR) method, in which the model is fitted with the modified covariates z(x)·T/2, without the need to model the main effects. The treatment variable T is coded as ±1, and in the original publication the authors used L1 regularisation (Tibshirani, 1996) to perform variable selection. In our case we are interested in deriving rankings, so it is more suitable to use L2 regularisation (i.e. a less sparse solution compared to L1 regularisation) (Hoerl and Kennard, 1970). Furthermore, following Tian et al. (2014), to mimic the realistic situation where we have no knowledge of the true functional form of the interaction effects, we assume a linear model z(x) = x. Lastly, to rank the strength of the different interaction terms we can use any generic method for deriving variable importance from generalised linear regression models (Kuhn and Johnson, 2013). In our experiments, to fit the penalised logistic regression models we used the R function glmnet (R package glmnet (Friedman et al., 2010)), and we optimised the regularisation parameter using 10-fold cross-validation and the one-standard-error rule, picking the most parsimonious model within one standard error of the minimum (Hastie et al., 2009). Then, we used the R function varImp (R package caret (Kuhn, 2008)) to derive a variable importance score for each biomarker, and using these scores we produced the predictive biomarker ranking.
• VT: Virtual Twins (Foster et al., 2011) is a method that builds upon the counterfactual modelling idea of deriving the treatment difference for each example i in the dataset: z(x_i) = f(x_i, 1) − f(x_i, 0). To derive this variable we used the R package aVirtualTwins (Vieille, 2016) and trained two Random Forest (RF) models, one in each treatment group. Following Foster et al. (2011), we fitted the two models using all default parameters of the R function randomForest, except for the number of trees per forest, which we set to 1000, and we also included the covariate-treatment interactions in the model. In the original publication, the authors used the dataset {x_i, z_i}_{i=1}^n for subgroup identification by building a regression tree. In our work, since we are interested in deriving predictive rankings, we used this dataset to derive a variable importance score for each biomarker. RF is a powerful method for deriving variable importance scores (Hastie et al., 2009), so we again used the R function randomForest to derive the scores, and we produced the predictive biomarker ranking from these scores. • SIDES: As we presented in Section 1, a popular recursive partitioning method for subgroup discovery is SIDES (Lipkovich et al., 2011). SIDEScreen (Lipkovich and Dmitrienko, 2014b) is an extension of SIDES that is more suitable for high-dimensional trials, and it can be used to derive a variable importance measure that captures the predictive power of each biomarker (Lipkovich and Dmitrienko, 2014a). Using these variable importance scores, which capture the relative predictive power of the biomarkers, we can rank the biomarkers according to their predictive strength.
In our experiments we used the R package RSIDES to implement SIDEScreen, with the following parameters, inspired by Lipkovich et al. (2017): the differential splitting criterion; maximum number of promising child subgroups retained for each parent group (width) M = 5; maximum number of levels (depth) L = 2; and a smallest sample size of 30 in either of the two child subgroups formed by a split; we used the default values for the remaining parameters. Furthermore, since we are interested in the variable importance ranking, we turned off the complexity control (Lipkovich and Dmitrienko, 2014a). • IT: Another recursive partitioning technique is Interaction Trees (IT) (Su et al., 2009). In our experiments we used the following parameters to generate ITs: a minimum of 20 observations for claiming a terminal node, a minimum of 5 observations in each treatment arm within either of the two child groups when splitting a parent node, and a maximum tree depth of 10. To derive variable importance scores we generated random forests of ITs (Su et al., 2008, Section 2.5) with 500 trees, and using these scores we derived the predictive biomarker ranking.
In the experiments of the main paper we focus on categorical datasets, so we can use the shrinkage estimator presented in Section 4, which is parameter-free. Thus, to use INFO and INFO+ in this scenario we do not need to set any parameters. The software that implements our proposed methods will be publicly available as an R package.

Description of simulation models

Table 1 presents three 'easy' models. In M-1 the Xs are generated as independent variables X_j ~ N(0, 1), j = 1, ..., p, and the model has a simple linear form with five prognostic variables X_Prog. = {X_1, X_2, X_3, X_4, X_5} and five predictive variables X_Pred. = {X_4, X_5, X_6, X_7, X_8}, while the rest, {X_9, ..., X_p}, are irrelevant (i.e. noise variables). Note that X_Prog. ∩ X_Pred. = {X_4, X_5}, which means that these two biomarkers carry both predictive and prognostic information. This is considered 'easy' because a method could identify a predictive marker by capitalising on its prognostic nature. M-2 is slightly more challenging for deriving predictive rankings, since there is no overlap between the predictive and prognostic markers. M-3 is more challenging still, as the covariates are correlated. To simulate challenging scenarios we generate correlations both within and between the covariates of the predictive, prognostic and irrelevant sets. In practice we generate the odd-numbered covariates {X_1, X_3, X_5, ...} to have an internal correlation of ρ = 0.7, and the even-numbered covariates {X_2, X_4, X_6, ...} to have the same internal correlation; i.e. given any two odd-valued indices i, j ∈ {1, ..., p}, the correlation between X_i and X_j is ρ = 0.7, and the same holds for any two even-valued indices. Table 2 presents two models of 'medium' difficulty. We still have five prognostic and five predictive variables, but this time with non-linear interactions: in M-4 we have second-order interaction terms, while in M-5 we have up to fourth-order interaction terms. In both models the covariates are correlated in the same way as in M-3.

Finally, Table 3 presents some 'hard' problems. M-6 has a predictive subgroup with an enhanced treatment effect, defined by thresholding two predictive biomarkers, X_5 and X_6; the threshold values are chosen such that ~50% of the data belong to the predictive subgroup. M-7 also has a predictive subgroup with an enhanced treatment effect, but this time the subgroup is defined by three predictive biomarkers, X_6, X_7 and X_8, and the threshold values are chosen such that ~25% of the data belong to the predictive subgroup. Table 4 presents another interesting scenario: a successful trial, where there is a significant effect of the treatment alone, i.e. there exists an overall treatment effect in the population. We simulated two different models. M-8 is an extension of the medium-difficulty model M-4, but this time there exists a significant effect of the treatment alone. M-9
Table 1. 'Easy' models: outputs Y are generated by logit[P(Y = 1|T, X)] = f(X, T).

Experiments with continuous covariates
In this section we repeat the experiments of Sections 3.1.4-3.1.8 of the main paper using continuous covariates. The implementations of SIDES and VT can handle continuous covariates directly, while for INFO+ we estimated the conditional mutual information using the non-parametric histogram-based estimator presented in Section 4, with Scott's rule for the density estimation. The figures in this section mirror Figure 7 of the main paper, but this time using continuous covariates.

Description of IPASS trial
The IPASS study (Mok et al., 2009; Fukuoka et al., 2011) was a Phase III, multicenter, randomized, open-label, parallel-group study comparing gefitinib (Iressa, AstraZeneca) with carboplatin (Paraplatin, Bristol-Myers Squibb) plus paclitaxel (Taxol, Bristol-Myers Squibb) as first-line treatment in clinically selected patients in East Asia who had advanced non-small-cell lung cancer (NSCLC). 1217 patients were randomized 1:1 between the treatment arms, and the primary endpoint was progression-free survival; for full details of the trial see Mok et al. (2009). Gefitinib is known to inhibit the epidermal growth factor receptor (EGFR), and is now indicated for the first-line treatment of patients with NSCLC whose tumours have specific EGFR mutations. We therefore expect baseline EGFR mutation status to appear as a strongly predictive biomarker. The covariates used in the IPASS study are shown in Table 5; for convenience we used the categorical covariates as defined in the study. Our clinical endpoint is progression-free survival, and statistically we model this as a Bernoulli endpoint, neglecting its time-to-event nature. We analysed the data at 78% maturity, when 950 subjects had had progression events. The following covariates have missing observations (shown in parentheses): X_5 (0.4%), X_12 (0.2%), X_13 (0.7%), X_14 (0.7%), X_16 (2%), X_17 (0.3%), X_18 (1%), X_19 (1%), X_20 (0.3%), X_21 (0.3%), X_22 (0.3%), X_23 (0.3%). Following Lipkovich et al. (2017), for patients with missing values in a biomarker X we create an additional category, a procedure known as the missing indicator method (Allison, 2001).
Table 5. Covariates used in the IPASS clinical trial.

Description of AURORA trial

AURORA was a randomized, double-blind, placebo-controlled, multicenter trial in which 2776 patients with end-stage renal disease were randomly assigned 1:1 to double-blind treatment with rosuvastatin at a dose of 10 mg or placebo. The primary endpoint was the time to a major cardiovascular event, defined as a nonfatal myocardial infarction, nonfatal stroke, or death from cardiovascular causes. All myocardial infarctions, strokes, and deaths were reviewed and adjudicated by a clinical end-point committee whose members were unaware of the randomized treatment assignments, in order to ensure consistency of the event diagnosis. For full details of the trial see Fellström et al. (2009). The 44 covariates used in the AURORA study are shown in Table 6. The covariates were of mixed type, i.e. both categorical and continuous. Following Lipkovich et al. (2017), for patients with missing values in a categorical covariate we created an additional category (i.e. the missing indicator method), while for missing values in continuous covariates we replaced all missing data with the mean of that variable (i.e. mean imputation) (Allison, 2001). Furthermore, because some of the competing methods (e.g. SIDES) did not support mixed-type covariates, to keep the comparison fair the continuous variables were discretised into 5 levels following an equal-width strategy.
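The preprocessing described above (the missing indicator method for categorical covariates, and mean imputation plus equal-width discretisation into 5 levels for continuous ones) can be sketched as follows. The function names and the toy values are ours; the rules are as stated in the text:

```python
def missing_indicator(values, missing=None):
    """Recode missing categorical values as an extra 'MISSING' category."""
    return ["MISSING" if v is missing else v for v in values]

def impute_and_discretise(values, levels=5, missing=None):
    """Mean-impute missing continuous values, then equal-width discretise
    into `levels` bins (codes 0..levels-1)."""
    observed = [v for v in values if v is not missing]
    mean = sum(observed) / len(observed)
    filled = [mean if v is missing else v for v in values]
    lo, hi = min(filled), max(filled)
    width = (hi - lo) / levels or 1.0   # guard against a constant covariate
    return [min(levels - 1, int((v - lo) / width)) for v in filled]

# Toy covariates (hypothetical values, not AURORA data).
sex = ["M", "F", None, "F"]
sbp = [120.0, None, 135.0, 160.0, 181.0]  # baseline SBP in mmHg

sex_cat = missing_indicator(sex)
sbp_cat = impute_and_discretise(sbp)
```

After this step every covariate is categorical, so the same ranking machinery can be applied uniformly across methods.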