Mark A. van de Wiel, Johannes Berkhof, Wessel N. van Wieringen, Testing the prediction error difference between 2 predictors, Biostatistics, Volume 10, Issue 3, July 2009, Pages 550–560, https://doi.org/10.1093/biostatistics/kxp011
Abstract
We develop an inference framework for the difference in errors between 2 prediction procedures. The 2 procedures may differ in any aspect and possibly utilize different sets of covariates. We apply training and testing on the same data set, which is accommodated by sample splitting. For each split, both procedures predict the response of the same samples, which results in paired residuals to which a signed-rank test is applied. Multiple splits result in multiple p-values. The median p-value and the mean inverse normal transformed p-value are proposed as summary (test) statistics, for which bounds on the overall type I error rate under a variety of assumptions are proven. A simulation study is performed to check type I error control of the least conservative bound. Moreover, it confirms superior power of our method with respect to a one-split approach. Our inference framework is applied to genomic survival data sets to study 2 issues: compare lasso and ridge regression and decide upon use of both methylation and gene expression markers or the latter only. The framework easily accommodates any prediction paradigm and allows comparing any 2, possibly nonmodel-based, prediction procedures.
1. INTRODUCTION
Several performance criteria for comparing competing prediction procedures exist. In the most general setting, prediction procedures are compared on the basis of their estimated prediction errors, usually obtained by applying cross-validation (CV) and/or bootstrapping principles. Alternatively, when the prediction procedures are model based, the model with an optimal Akaike or Bayesian information criterion (AIC/BIC) may be selected, thereby balancing goodness-of-fit and model complexity. Here, we discuss a new, inference-based approach.
The standard strategies for selecting a prediction procedure are designed for the situation of so-called “equal preference procedures”: a situation in which no practical arguments exist for preferring one procedure over the other. An alternative setting, considered here, is one of asymmetric preference: from the outset of the comparison there are practical reasons to favor a particular procedure. Several such practical reasons are discussed below.
First, consider the situation where the procedures to be compared use the same set of variables. However, procedure 1 may be preferred over procedure 2 because procedure 1 is the standard procedure that has proven itself in many applications (e.g. a Cox model in survival analysis). Alternatively, procedure 1 returns results that are easier to interpret or have less costly implications in practice than the results of procedure 2. The latter is illustrated in the first example of Section 4 for a regression setting using an L1 and an L2 penalty, respectively, on the regression coefficients.
The second situation is a data setting in which more than one type of covariates (each comprising a large number of variables) is available. When prediction procedure 1 uses fewer types of covariates than procedure 2, it is less costly to apply in practice. In Section 4, we illustrate this setting using a genomics data set with 2 types of covariates: gene expression and methylation markers.
Application of traditional model selection techniques to the situations above may be unsatisfactory. First, criteria like the Bayes factor, AIC, and BIC are likelihood based and do not facilitate the comparison of nonlikelihood- or nonmodel-based predictors. Second, what if the prediction accuracy of the alternative model is only marginally better than that of the standard one? Should the standard model be dropped? We are only willing to do so if the evidence in favor of the alternative model is beyond reasonable doubt.
In this paper, we restate the problem of selecting the best prediction procedure among 2 alternatives as a problem of hypothesis testing. We propose a formal inference strategy which allows one to assess whether one procedure predicts significantly better than the other. We consider the situation where no independent validation set for the comparison of the procedures is available, so training and testing have to be performed on the same data. Naturally, one may set a (random) part of the data apart for validation. However, the results may depend heavily on which samples are set apart, as shown in our examples. This is of course undesirable. Moreover, our simulation shows that one may lose power by using the one-split approach. Hence, also in an inference setting, we advocate multiple splits of the data set into training and test sets, using the same principles as CV. Combining the results from multiple splits brings about additional difficulties for the inference; we discuss solutions to these.
While there exists a vast literature on model comparison, little attention has been paid to inference for comparing prediction accuracies of 2 models or, more generally, prediction procedures. Perhaps the most closely related work is by Diebold and Mariano (1995), who also use a paired test to compare prediction accuracies. However, their setting of time series models differs from ours, in that the time ordering leads to a natural choice of the training and test sets. Technically, our work is also related to the Bayesian notion of posterior predictive p-values (Meng, 1994), in the sense that possibly dependent p-values are combined. However, our method does not rely on priors and focuses on dealing with the nuisance arising from multiple splits.
2. METHOD
Suppose we have one data set containing N independent samples. A sample j consists of covariate information Xj and a (possibly censored) response tj. In the censored case, δj indicates whether an event has taken place for sample j. This data set is split into a training set of size n = qN and a test set of size m = N − n. The choice of q will be discussed later. Given q, one has to choose how to split the original data set and how many splits to consider. We use conditional subsampling. Subsampling means that the test sets do not form a partition of the entire data set but are chosen randomly. In the case of survival data, censored and uncensored observations bear different amounts of information, so we condition the training set on containing the same proportion of censored data as the original data set; a minimal sketch of such a split is given below. The number of splits I needs to be fairly large since the variability from split to split can be substantial. We mostly use I = 50 and verify whether any increase in I is necessary to obtain more precise results.
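As an illustration, a conditional subsampling split can be coded in a few lines of R. The function below is a minimal sketch under our own naming (it is not the authors' implementation): it draws a training set of size roughly qN such that the proportion of censored samples in the training set matches that in the full data set.

```r
# Minimal sketch of conditional subsampling (not the authors' implementation).
# status: event indicator (1 = event observed, 0 = censored); q: split fraction.
conditional_split <- function(status, q = 4/5) {
  N <- length(status)
  idx_event    <- which(status == 1)
  idx_censored <- which(status == 0)
  # sample a fraction q of the events and a fraction q of the censored samples,
  # so the training set inherits the censoring proportion of the full data set
  train <- c(sample(idx_event,    round(q * length(idx_event))),
             sample(idx_censored, round(q * length(idx_censored))))
  list(train = sort(train), test = setdiff(seq_len(N), train))
}

# Toy usage: 100 samples with roughly 40% censoring
set.seed(1)
status <- rbinom(100, 1, 0.6)
split  <- conditional_split(status, q = 4/5)
```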
2.1 Hypothesis
For a given training set s, both procedures are built on s and then predict the responses of the samples in the corresponding test set, yielding a pair of residuals for each test sample. The null hypothesis (2.1) for this particular split states that the 2 procedures predict equally well, in the sense that the difference between their residuals is as likely to be positive as negative for each test sample; it is tested against the one-sided alternative that one prespecified procedure predicts systematically better than the other.
The fact that the training and test sets are not fixed in our setting implies that we need to test (2.1) simultaneously for all (many) possible training sets. Let 𝒮 denote the class of all training sets. Then, we test the combined null hypothesis (2.2), which states that (2.1) holds simultaneously for every training set s ∈ 𝒮.
In practice, we cannot include all elements of 𝒮 for computational reasons. Instead, a large enough subset is generated and the consistency of the result is checked. In our setting, a one-sided alternative is usually more relevant than a 2-sided one because of the asymmetric preference between the 2 procedures. We do not aim to detect local alternatives (for a particular training–test split) but focus on average behavior. However, we first discuss how to test (2.1) for a given training set and then how to combine the resulting p-values for the final testing of (2.2).
2.2 Residuals and test statistic
Definition of a residual depends on the type of response. In an uncensored continuous (e.g. linear regression) setting or a binary (e.g. classification) setting, one may simply consider the usual “predicted minus observed” residuals or standardized measures thereof. For censored data, the concept of a residual is less obvious and still a matter of debate (Hothorn and others, 2004). Integrated Brier scores may be used, thereby acknowledging that “time-to-event itself cannot adequately be predicted” in survival analysis settings (Graf and others, 1999). These residuals are detailed in the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org).
Once the residuals are available for a given test set, they may be used to compute a p-value under null hypothesis (2.1), which we then combine across splits to test (2.2). We use a signed-rank test, for example, the Wilcoxon signed-rank test, to reduce the influence of a few large residuals. Fast computation of the p-values for a given training set is desirable because it has to be repeated over all splits. When m is large, normal or saddlepoint approximations can be used, owing to the independence of the test samples within a given split. When m is small, the fast split-up algorithm (Van de Wiel, 2001) can be used for exact computations, as implemented in the R package “coin” (Hothorn and others, 2006).
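For one split, the per-split p-value can be obtained directly from the paired residuals. The sketch below uses the one-sided Wilcoxon signed-rank test from base R on hypothetical placeholder residuals; for small m, an exact test from the coin package could be substituted, as mentioned above.

```r
# Hypothetical residuals of procedures 1 and 2 for the m test samples of one split
set.seed(2)
res1 <- abs(rnorm(20, mean = 1.0))   # placeholder residuals, procedure 1
res2 <- abs(rnorm(20, mean = 1.2))   # placeholder residuals, procedure 2

# One-sided Wilcoxon signed-rank test on the paired differences R_{1,j} - R_{2,j};
# here small p-values indicate that procedure 1 tends to have smaller residuals
p_split <- wilcox.test(res1, res2, paired = TRUE,
                       alternative = "less", exact = FALSE)$p.value
```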
2.3 Combining p-values: type I error control
In the training–testing setting, one obtains I p-values, one from every split. How should we combine these p-values? The widely used approach of Fisher (1932) is inappropriate here because of the strong dependence between the p-values, which is caused by the overlap of the training and test samples of one split with those of another split. Instead, we take the following strategy.
Let p1, …, pI denote the p-values obtained from the I splits, and let pΦ,i = Φ−1(pi), i = 1, …, I, be their inverse normal transforms, where Φ denotes the standard normal distribution function. As summary statistics we use the median p-value, p̃ = med(p1, …, pI), and the mean inverse normal transformed p-value, p′ = Φ(p̄Φ), where p̄Φ = (pΦ,1 + ⋯ + pΦ,I)/I. The global null hypothesis (2.2) is rejected when the chosen summary p-value is at most α (2.6). Since Φ−1 is increasing, p̃ ≤ α is equivalent to p̃Φ = med(pΦ,1, …, pΦ,I) ≤ Φ−1(α). Therefore, P(p̃ ≤ α) = P(p̃Φ ≤ Φ−1(α)). Any explicit computations on P(p̃ ≤ α) and P(p̃Φ ≤ Φ−1(α)) require specification of the dependence structure between pΦ,1, …, pΦ,I. The marginal distributions are standard normal, a necessary, but not sufficient, condition for the joint distribution to be multivariate normal (MVN) with variances equal to 1. Assuming multivariate normality (as in Hartung, 1999), Theorems 1 and 2 assure validity of (2.6); see the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org) for the proofs.
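In R, the two summary p-values are easily computed from the vector of per-split p-values. The sketch below uses our own names, and the back-transformation pnorm(mean(qnorm(p))) for the mean inverse normal summary is our reading of the construction above rather than code taken from the paper.

```r
# p: vector of p-values from the I training-test splits (placeholder values here)
set.seed(3)
p <- runif(50, 0.02, 0.30)

p_med <- median(p)               # median summary p-value
p_phi <- qnorm(p)                # p_{Phi,i} = Phi^{-1}(p_i), standard normal under H0
p_inv <- pnorm(mean(p_phi))      # mean inverse normal transformed summary p-value

# Rejection rule: compare the chosen summary p-value with the nominal level
alpha  <- 0.05
reject <- p_med <= alpha
```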
As mentioned before, the overlap between the various training and test sets, along with the complex relationship between the training data and the prognostic function, complicates verification of the multivariate null distribution. For example, say that samples j and k belong to the training and test set, respectively, for split 1, while the reverse is true for split 2. A positive difference of residuals refers to a better prediction of procedure 1 with respect to procedure 2. Suppose that split 1 results in a positive difference for sample j. This event has a marginal probability of 1/2 under H0. When we are interested in the joint null distribution over the 2 splits, we need to be able to compute the conditional probability of a positive difference in the second split given the positive difference in the first one. This probability is likely to be somewhat larger than 1/2 due to the cross-dependence between the samples, but the complex training scheme does not allow explicit computation of this conditional probability. In addition, residuals of test samples that overlap between splits are dependent as well.
We believe that (2.7) and (2.8) will be conservative in many situations and support this by simulation results. Nevertheless, it is useful to have “worst-case” bounds for the type I error which hold under very general conditions.
The inequality follows directly from the Chebyshev-type inequalities of Camp (1922) and Vysochanskij and Petunin (1980), using the fact that E(p̄Φ) = 0. □
Note that the most pessimistic case, ρ̄ = 1, implies σ² = 1 and hence an upper bound for α = 0.05 equal to 2/9 × 1/1.65² ≈ 0.082. A mean covariance of, say, ρ̄ = 0.5 implies an upper bound approximately equal to 0.082 × 0.5 = 0.040. Inequality (2.9) also holds for the median, p̃Φ, but computation of σ² is not explicit in this case. Finally, dropping all assumptions on P, we obtain the following bound.
See the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org). □
We found that for α ≤ 0.1, the right-hand side of (2.10) is approximately equal to 2.5α. In particular, it equals 0.126 for α = 0.05. The upper bounds under various assumptions on the joint distribution of (pΦ,1, …, pΦ,I) are listed in Table 1 for several values of α.
Table 1. Type I error upper bounds under several assumptions

| α | MVN (Theorems 1 and 2) | Unimodal (Theorem 3, σ² = 1) | None (Theorem 4) |
| --- | --- | --- | --- |
| 0.001 | 0.001 | 0.0233 | 0.0026 |
| 0.002 | 0.002 | 0.0268 | 0.0052 |
| 0.005 | 0.005 | 0.0335 | 0.013 |
| 0.01 | 0.01 | 0.0411 | 0.0258 |
| 0.02 | 0.02 | 0.0527 | 0.051 |
| 0.05 | 0.05 | 0.0821 | 0.1255 |
| 0.1 | 0.1 | 0.1353* | 0.2456 |

*Assuming symmetry as well.
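The values in the Unimodal column can be checked numerically if one takes the bound in (2.9) to be 2σ²/(9Φ−1(α)²) with σ² = 1, which is the form suggested by the worked value 2/9 × 1/1.65² above; this is our reading, not a formula quoted from the paper.

```r
# Assumed form of the unimodal bound (2.9) with sigma^2 = 1:
#   P(reject) <= 2 * sigma^2 / (9 * qnorm(alpha)^2)
alpha <- c(0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1)
round(2 / (9 * qnorm(alpha)^2), 4)
# 0.0233 0.0268 0.0335 0.0411 0.0527 0.0821 0.1353, matching the Unimodal column
```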
2.4 Algorithm and software
Our testing procedure is summarized by the following algorithm. We assume that the median p-value p̃ is used as the summary p-value.

1. Initialize i = 1.
2. Randomly split the entire data set into a training and a test group, using conditional subsampling.
3. Apply both prediction procedures to the training set.
4. For a sample j in the corresponding test set, predict the outcome and compute the residuals Rk,j for both procedures, k = 1, 2.
5. Compute the pairwise difference between the residuals of procedure 1 and procedure 2, Dj = R1,j − R2,j.
6. Repeat 4. and 5. for all test samples.
7. Compute the signed rank Sj for all j = 1, …, m.
8. Compute the signed-rank test statistic S = a(S1) + ⋯ + a(Sm). (Using a(x) = x corresponds to the Wilcoxon signed-rank statistic.)
9. Under H0, each Sj is equally likely to be positive or negative. Therefore, compute the p-value pi by sign permutations.
10. Repeat 2. to 9. for i = 2, …, I. Finally, reject the global null hypothesis H0 when p̃ = med(p1, …, pI) ≤ α.
This algorithm is implemented in R (R Development Core Team, 2006). The code is available in the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org).
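To make the structure of the algorithm concrete, the following self-contained R sketch applies it to an uncensored continuous response, with ordinary least squares on 2 fixed covariates standing in for the preferred procedure 1 and least squares on all covariates for procedure 2. All names, the choice of absolute residuals, and the toy data are ours; this is an illustration of the algorithm above, not the authors' released code.

```r
# Multi-split testing for an uncensored continuous response (illustrative sketch).
# Procedure 1: linear model on the first 2 covariates (preferred a priori).
# Procedure 2: linear model on all covariates (the more expensive alternative).
multi_split_test <- function(X, y, I = 50, q = 4/5) {
  N <- nrow(X)
  p_values <- numeric(I)
  for (i in seq_len(I)) {
    train <- sample(seq_len(N), round(q * N))       # step 2 (plain subsampling: no censoring)
    test  <- setdiff(seq_len(N), train)
    d_tr  <- data.frame(y = y[train], X[train, , drop = FALSE])
    d_te  <- X[test, , drop = FALSE]
    fit1  <- lm(y ~ x1 + x2, data = d_tr)           # step 3, procedure 1
    fit2  <- lm(y ~ .,       data = d_tr)           # step 3, procedure 2
    r1    <- abs(y[test] - predict(fit1, newdata = d_te))   # steps 4-6: absolute residuals
    r2    <- abs(y[test] - predict(fit2, newdata = d_te))
    # steps 7-9: one-sided Wilcoxon signed-rank test on D_j = R_{1,j} - R_{2,j};
    # small p-values support the alternative that procedure 2 predicts better
    p_values[i] <- wilcox.test(r1, r2, paired = TRUE,
                               alternative = "greater", exact = FALSE)$p.value
  }
  list(p_values = p_values, p_median = median(p_values))  # step 10: median summary p-value
}

# Toy data: 10 covariates, only the first 2 carry signal
set.seed(4)
X <- as.data.frame(matrix(rnorm(100 * 10), 100, 10))
names(X) <- paste0("x", 1:10)
y <- X$x1 + 0.5 * X$x2 + rnorm(100)
out <- multi_split_test(X, y, I = 50, q = 4/5)
out$p_median
```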
3. SIMULATION
We performed a simulation study (1) to verify our conjecture that the most liberal bounds, (2.7) for p̃ and (2.8) for p̄Φ, remain conservative and (2) to compare the power of our procedure with that of a one-split procedure (the latter maintains the type I error exactly, up to the discreteness of the test statistic).
We used a setting with 2 sparse linear models. The number of available covariates is of the same order as the number of samples. When applying the one-split approach, one would like to know the consistency of the decision with respect to repeated splitting. The one-split approach is useful only if the probability of arriving at the same decision for another split is reasonably high, say larger than 80%. In the simulation, we have estimated this consistency as well. Here, we summarize the results of the simulation; details are given in the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org).
We observed that the multisplit approach using c0.05 = Φ−1(0.05) (Theorem 1) controls the type I error well below 0.05. Second, we observed that the one-split approach can be more powerful on average than the multisplit approaches, but only in the very low power region where its results are inconsistent. This is mainly due to splits for which the alternative is more pronounced. However, in the region where a one-split approach could be of practical use (reasonable power and consistency > 0.8), its power is dominated by the multisplit procedure with c0.05, and generally also by the most conservative approach, which guarantees type I error control under the conditions of Theorem 4.
4. APPLICATIONS
We consider 2 applications of our inference procedure. We focus on survival as the response variable. All summary p-values and graphs are initially based on I = 50 random splits with q = 4/5, using conditional subsampling. All testing is one-sided. Since extrapolation of the results to n = N may be a concern, we performed a graphical check by computing the average prediction error of both procedures for several values of q: q = 2/3, 3/4, 4/5, 17/20, 9/10. Multiple splits were used per fraction, but such that the test sets are nested for decreasing values of n′ = qN, to allow better comparability between the split fractions. If the 2 resulting curves do not converge between n = (4/5)N and n = N, it is probably safe to extrapolate the result, since the difference in estimated prediction errors does not decrease; if the curves converge, one should be very careful in extrapolating the results. A sketch of this check is given below.
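The graphical check can be sketched as follows; for brevity, this illustration uses a single nested sequence of test sets and 2 hypothetical linear-model procedures on regenerated toy data, whereas the paper uses multiple splits per fraction and survival-specific prediction errors.

```r
# Sketch of the extrapolation check: average absolute prediction error of 2
# hypothetical procedures for several split fractions q, with nested test sets
# (the test set for a larger q is contained in that for a smaller q).
set.seed(5)
X <- as.data.frame(matrix(rnorm(100 * 10), 100, 10))
names(X) <- paste0("x", 1:10)
y  <- X$x1 + 0.5 * X$x2 + rnorm(100)
N  <- nrow(X)
qs <- c(2/3, 3/4, 4/5, 17/20, 9/10)
perm <- sample(seq_len(N))                    # one random ordering defines the nesting
err <- sapply(qs, function(q) {
  test  <- perm[seq_len(N - round(q * N))]    # test sets shrink, and stay nested, as q grows
  train <- setdiff(seq_len(N), test)
  d_tr  <- data.frame(y = y[train], X[train, , drop = FALSE])
  d_te  <- X[test, , drop = FALSE]
  c(proc1 = mean(abs(y[test] - predict(lm(y ~ x1 + x2, data = d_tr), newdata = d_te))),
    proc2 = mean(abs(y[test] - predict(lm(y ~ .,       data = d_tr), newdata = d_te))))
})
matplot(qs, t(err), type = "b", pch = 1:2,
        xlab = "split fraction q", ylab = "average absolute prediction error")
```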
4.1 Microarray data set: lasso versus ridge
The first data set is the microarray data set of Bullinger and others (2004), which can be downloaded from the GEO database (accession number GSE425). It consists of 119 gene expression profiles from patients with acute myeloid leukemia and contains expression values of 6283 genes. Thirteen samples with more than 1800 missing values are removed, as are genes with over 20% missing values. The resulting data set contains 103 expression profiles of 4673 genes. Remaining missing values are imputed using K-nearest neighbors. Overall survival is used as the end point in the analysis. This is a high-dimensional setting in which the number of covariates, p, is larger than N. Recently, the comparison of methods for survival prediction from such data has received a lot of attention (Bøvelstad and others, 2007; Schumacher and others, 2007; Van Wieringen and others, 2009). These comparisons all evaluate several methods by several prediction error metrics. Here, we first focus on the comparison between lasso regression and ridge regression. We used implementations of the lasso and ridge as described in Park and Hastie (2007) and Van Houwelingen and others (2006), respectively. The variable selection property of the lasso motivates our asymmetric preference between the 2 prediction procedures. However, for 3 other microarray data sets it was shown (Bøvelstad and others, 2007) that ridge regression predicted slightly better than lasso regression. If ridge regression predicts significantly better than lasso regression for our data, it may be worthwhile to consider using this more expensive procedure.
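As a rough indication of how such a comparison is set up, the sketch below fits lasso and ridge Cox regressions on one training-test split using the glmnet package and toy data; glmnet is our substitute for the Park and Hastie (2007) and Van Houwelingen and others (2006) implementations used in the paper, and the censored-data residuals (e.g. integrated Brier scores) would still have to be computed from the resulting prognostic scores as described in the supplementary material.

```r
# Lasso versus ridge Cox regression on one training-test split (illustrative toy data;
# glmnet is used here instead of the implementations referenced in the text).
library(glmnet)

set.seed(6)
n_samples <- 100
n_genes   <- 300
x <- matrix(rnorm(n_samples * n_genes), n_samples, n_genes)
time   <- rexp(n_samples, rate = exp(0.3 * x[, 1]))    # toy survival times
status <- rbinom(n_samples, 1, 0.7)                    # toy event indicator
y <- cbind(time = time, status = status)               # survival response format for glmnet

train <- sample(seq_len(n_samples), round(0.8 * n_samples))
cv_lasso <- cv.glmnet(x[train, ], y[train, ], family = "cox", alpha = 1)  # lasso penalty
cv_ridge <- cv.glmnet(x[train, ], y[train, ], family = "cox", alpha = 0)  # ridge penalty

# Prognostic scores (linear predictors) for the test samples; residuals such as
# integrated Brier scores would be derived from these and compared as in Section 2
lp_lasso <- predict(cv_lasso, newx = x[-train, ], s = "lambda.min")
lp_ridge <- predict(cv_ridge, newx = x[-train, ], s = "lambda.min")
```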
Figure 1(a) displays the histogram of p-values as determined from the 50 splits. It is obvious that there is no significant difference, which is confirmed by the summary p-values: p̃ = 0.613 and p′ = 0.595. Figure 1(b) shows the potential effect of the split fraction on the prediction error. An interesting feature of this figure is that the lasso and ridge curves cross just after the commonly used split fraction q = 2/3. This implies that one should be careful when deciding between 2 procedures on the basis of cross-validated prediction errors for one given split fraction. In conclusion, there is no statistical evidence to prefer ridge over lasso in this application, so the lasso is recommended because of its additional feature selection property.
Fig. 1. Lasso versus ridge: (a) histogram of p-values and (b) prediction errors for several split fractions; median (solid) and quartiles (dashed).
4.2 Methylation and mRNA data set: 2 sets of markers versus 1
A second application of our test is illustrated on a second data set of 99 Dutch patients with acute myeloid leukemia. At diagnosis, mRNA expression was measured for 33 genes and its relation with survival was examined (Hess and others, 2007). In addition, the degree of methylation was assessed for 25 different regions. Here, we compare a lasso regression that uses both predictor sets with one that uses only the gene expression markers. If the procedure with both sets predicts significantly better than the procedure with gene expression alone, it may be considered for the prognosis of patients. Otherwise, the less costly alternative without methylation assessments is preferred.
We increased I to 250 because we were not satisfied with the precision of the estimates of the summary p-values from 50 splits. Figure 2(a) displays the histogram of p-values. The median p-value equals p̃ = 0.0458, while p′ = 0.0569. The resampling-based 90% upper confidence bounds for p̃ and p′ equal 0.0608 and 0.0659, respectively. Moreover, we observe from Figure 2(b) that the prediction error curves diverge after q = 4/5. Hence, there is substantial evidence that the methylation markers are crucial in addition to the mRNA markers for predicting survival in this data set.
Fig. 2. Methylation versus no methylation: (a) histogram of p-values and (b) prediction errors for several split fractions; median (solid) and quartiles (dashed).
5. DISCUSSION
5.1 Why not bootstrap?
Our testing procedure is based on simple CV. We have also considered 3 uses of the bootstrap for the purpose of inference, but none was found to be appropriate. Below, we discuss these bootstrapping schemes and why they fail for our aim.
First, in this splitting setting, bootstrapping the residuals in a multivariate fashion is not feasible due to the complex dependencies between residuals of test samples which are in each other's training set for 2 different splits. Hence, exchangeability of the samples is not guaranteed.
Second, it has been shown before that estimation of the prediction error may be improved by combining bootstrap CV with the apparent error rate (Schumacher and others, 2007). Bootstrap CV resamples a training set of size N from the original sample and uses the samples not present in the bootstrap sample as test samples. This results in a somewhat pessimistic estimate, since the bootstrap sample does not contain all the information contained in the original sample. Therefore, a weighted average with the apparent error (the error within the training sample) is computed. Unfortunately, this elegant approach seems inappropriate in an inference procedure. First, the residuals contributing to the apparent error cannot be used for testing our null hypothesis. Second, the bootstrapping procedure results in test samples of different sizes, hence containing different amounts of independent information. Combining these into one inference is not straightforward.
Finally, one could try to generate a confidence interval, for example, for the median residual, by resampling the entire data set several times. This confidence interval would then be used as an alternative to our testing procedure. Such a procedure would be computationally intensive. More importantly, we are interested in comparing the predictive performance of the 2 procedures trained on a large proportion (q) of the data. The resampling approach would imply a substantial loss of “independent” information in the training phase, which may not have equal consequences for the 2 procedures. Hence, the confidence interval would not reflect the uncertainty under our null hypothesis of interest, which concerns predictors trained on independent data.
5.2 Alternative permutation approaches
We have illustrated the use of our method for comparing 2 prediction procedures in several settings. Naturally, our method can also be used to test whether a particular data–procedure combination has any predictive value at all, simply by comparing it with a “no-covariates” predictor. In this setting, an alternative testing approach would be to apply multiple random permutations of the response variable and to reapply the prognostic method to each permutation. If the covariates contain any predictive value, then the observed prediction error should lie in the left tail of the resulting permutation distribution of the prediction error. However, in high-dimensional settings this approach is computationally cumbersome, since one needs many permutations to obtain a reasonable estimate of the p-value and each permutation requires a new training loop. Permutation could also be used in a setting with 2 sets of covariates: fix the response and the corresponding covariate values of set 1, while permuting the labels of the covariates of set 2. However, this approach may be invalid. The null hypothesis implies that the second covariate set is not associated with the response, but it does not imply that this set is unassociated with the covariates in the first set, which is a necessary condition (making every permutation equally likely under H0) for the resulting p-value to be valid.
5.3 Choice of the split fraction
The optimal choice of the ratio of the training and test set sizes in the proposed procedure depends on the data set under consideration. In particular, it is not immediate that the commonly used ratio 2:1 (q = 2/3) for the estimation of prediction error is optimal in this testing framework. Ideally, one wants n to approach N as closely as possible, so that the inference on a prediction based on n samples can be extrapolated to N. However, the power of the test collapses when the test sample size m = (1 − q)N is too small. If the data sets are very large in terms of the number of samples, it is likely that for both procedures the prediction error levels off after some value q = q′. This is a well-known phenomenon: the prediction error is bounded by the amount of information the covariates contain with respect to the response (Dobbin and Simon, 2007). If such a phenomenon is observed for both procedures, then we would consider using q = q′. However, this is a computationally intensive task, since it requires evaluation of both procedures under multiple split fractions. In Section 4, we illustrated a more practical alternative using the default split fraction q = 4/5 and assessing the validity of extrapolation of the results by studying prediction errors for several larger split fractions.
5.4 Extensions
While the multivariate normal assumption in Theorems 1 and 2 is difficult to verify, it may be useful to have an asymptotic justification, when the number of samples is large. The asymptotic results may heavily depend on the type of residuals as well as the prediction procedures considered. Hence, such a study requires a focus on particular prediction paradigms and procedures. Another extension of our method is to the comparison of more than 2 predictors. This would imply using a nonparametric K-sample test that accounts for blocking, such as the Friedman test, instead of the Wilcoxon signed-rank test. The theory on combining p-values from several splits may then be applied without any change.
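For the K-sample extension, the residuals of the K procedures on the m test samples of one split form a blocked design with the test samples as blocks, so the per-split p-value could, for instance, be obtained with the Friedman test; the sketch below uses placeholder residuals for K = 3 hypothetical procedures.

```r
# Residuals of K = 3 procedures on the same m test samples of one split
# (placeholder values); rows are test samples (blocks), columns are procedures.
set.seed(7)
m <- 25
resid_matrix <- cbind(proc1 = abs(rnorm(m, mean = 1.0)),
                      proc2 = abs(rnorm(m, mean = 1.1)),
                      proc3 = abs(rnorm(m, mean = 1.3)))

# Friedman test across procedures, blocking on test samples; the resulting
# per-split p-values would be combined over splits exactly as in Section 2.3
p_split <- friedman.test(resid_matrix)$p.value
```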
5.5 Conclusion
A unique characteristic of our method is its testing framework targeted at prediction. We connect 2 well-accepted concepts in applied statistics: prediction error rates and p-values as a measure of evidence against the null hypothesis. The procedure is entirely nonparametric; hence, it is robust against outliers and may also be used for nonlikelihood prediction methods. In fact, it may be used in any prediction or classification setting in which it is possible to define a residual of the predicted value with respect to the observation. The method owes much of its power to the pairing principle in the training–test splitting. Moreover, combining p-values avoids the arbitrariness related to the study of a single split. In conclusion, we suggest that our method may play an important role in the evaluation of competing prediction procedures and (sets of) prognostic factors in many studies.
FUNDING
Center for Medical Systems Biology established by the Netherlands Genomics Initiative/Netherlands Organization for Scientific Research.
SUPPLEMENTARY MATERIAL
Supplementary Material is available at http://www.biostatistics.oxfordjournals.org.
ACKNOWLEDGEMENTS
We thank Jelle Goeman for providing the implementation of ridge regression to us. Moreover, we are grateful to Etienne Roquain, José Ferreira, and Aad van der Vaart for discussions on this work. Conflict of Interest: None declared.






