Abstract

We develop an inference framework for the difference in prediction errors between 2 prediction procedures. The 2 procedures may differ in any aspect and may use different sets of covariates. Training and testing are performed on the same data set, which is accommodated by sample splitting: for each split, both procedures predict the responses of the same test samples, yielding paired residuals to which a signed-rank test is applied. Multiple splits result in multiple p-values. We propose the median p-value and the mean inverse normal transformed p-value as summary (test) statistics, and we prove bounds on the overall type I error rate under a variety of assumptions. A simulation study checks type I error control for the least conservative bound and confirms the superior power of our method with respect to a one-split approach. Our inference framework is applied to genomic survival data sets to study 2 issues: comparing lasso and ridge regression, and deciding whether to use both methylation and gene expression markers or gene expression markers only. The framework easily accommodates any prediction paradigm and allows comparing any 2, possibly nonmodel-based, prediction procedures.

1. INTRODUCTION

Several performance criteria for comparing competing prediction procedures exist. In the most general setting, prediction procedures are compared on the basis of their estimated prediction errors, usually obtained by applying cross-validation (CV) and/or bootstrapping principles. Alternatively, when the prediction procedures are model based, the model with an optimal Akaike or Bayesian information criterion (AIC/BIC) may be selected, thereby balancing goodness-of-fit and model complexity. Here, we discuss a new, inference-based approach.

The standard strategies for selecting a prediction procedure are designed for so-called "equal preference" settings: situations in which no practical arguments exist for preferring one procedure over the other. The alternative setting we consider here is one of asymmetric preference: from the outset of the comparison, there are practical reasons to favor a particular procedure. Several such practical reasons are discussed below.

First, consider the situation where the procedures to be compared use the same set of variables. However, procedure 1 may be preferred over procedure 2 because procedure 1 is the standard procedure that has proven itself in many applications (e.g. a Cox model in survival analysis). Alternatively, procedure 1 returns results which are easier to interpret or have less costly implications in practice than the results of procedure 2. The latter is illustrated in the first example of Section 4 for a regression setting using an L1 and an L2 penalty, respectively, on the regression coefficients.

The second situation is a data setting in which more than one type of (a large number of) covariates is available. When prediction procedure 1 uses fewer types of covariates than procedure 2, it is less costly to apply in practice. In Section 4, we illustrate this setting using a genomics data set with 2 types of covariates: gene expression and methylation markers.

Application of traditional model selection techniques to the situations above may be unsatisfactory. First, criteria like the Bayes factor, AIC, and BIC are likelihood based and do not facilitate the comparison of nonlikelihood- or nonmodel-based predictors. Second, what if the prediction accuracy of the alternative model is only marginally better than that of the standard one? Should the standard model be dropped? We are only willing to do so if the evidence in favor of the alternative model is beyond reasonable doubt.

In this paper, we restate the problem of selecting the better of 2 prediction procedures as a problem of hypothesis testing. We propose a formal inference strategy which allows one to assess whether one procedure predicts significantly better than the other. We consider the situation where no independent validation set for the comparison of the procedures is available, so training and testing have to be performed on the same data. Naturally, one may set a (random) part of the data apart for validation. However, the results may depend heavily on which samples are set apart, as shown in our examples. This is, of course, undesirable. Moreover, our simulation shows that one may lose power by using the one-split approach. Hence, also in an inference setting, we advocate multiple splits of the data set into training and test sets, using the same principles as CV. Combining the results from multiple splits brings about additional difficulties for our inference, and we discuss solutions to these.

While there exists a vast literature on model comparison, little attention has been paid to inference for comparing the prediction accuracies of 2 models or, more generally, 2 prediction procedures. Perhaps the most closely related work is by Diebold and Mariano (1995), who also use a paired test to compare prediction accuracies. However, their time series setting differs from ours in that the time ordering leads to a natural choice of the training and test sets. Technically, our work is also related to the Bayesian notion of posterior predictive p-values (Meng, 1994), in the sense that possibly dependent p-values are combined. However, our method does not rely on priors and focuses on dealing with the nuisance arising from multiple splits.

2. METHOD

Suppose we have one data set containing N independent samples. A sample j consists of covariate information X_j and a (possibly censored) response t_j. In the censored case, δ_j indicates whether an event has taken place for sample j. This data set is split into a training set of size n = qN and a test set of size m = N − n. The choice of q will be discussed later. Given q, one has to choose how to split the original data set and how many splits to consider. We use conditional subsampling. Subsampling refers to the test sets not forming a partition of the entire data set but being chosen randomly. In the case of survival data, censored and uncensored observations bear different amounts of information, so we condition the training set on containing the same proportion of censored data as the original data set. The number of splits I needs to be fairly large, since the variability from split to split can be substantial. We mostly use I = 50 and verify whether any increase in I is necessary to obtain more precise results.
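For illustration, a minimal R sketch of one conditional subsample; the censoring indicator and all variable names here are hypothetical:

    # One conditional subsample: sample the training set within each censoring
    # stratum so that it keeps the original proportion of censored samples.
    set.seed(1)
    N     <- 100
    delta <- rbinom(N, 1, 0.6)   # hypothetical censoring indicator (1 = event)
    q     <- 4/5
    train <- unlist(lapply(split(seq_len(N), delta),
                           function(idx) sample(idx, round(q * length(idx)))))
    test  <- setdiff(seq_len(N), train)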

2.1 Hypothesis

The purpose is to test whether, for a given size n of the training set, the prediction error on new samples differs between the 2 prediction procedures. Prediction error is measured by residuals, which are discussed later. The residuals are available pairwise because the same splits are used for both procedures. We assume that the pairwise differences of the residuals follow a continuous distribution function F(x). We propose to test a one-sample "symmetry around zero" hypothesis (Lehmann, 1975):
$$H_0\colon\; F(x) = 1 - F(-x) \quad \text{for all } x. \tag{2.1}$$
This null hypothesis allows for rank-based tests. Under an additional symmetry assumption, it reduces to the hypothesis that the median of the residual differences equals zero; symmetry may be checked through visual inspection of the data at hand.

The fact that the training and test sets are not fixed in our setting implies that we need to test (2.1) simultaneously for all (many) possible training sets. Let 𝒮 denote the class of all training sets. Then, we test

 
$$H_0\colon\; F_s(x) = 1 - F_s(-x) \quad \text{for all } x \text{ and all } s \in \mathcal{S}, \tag{2.2}$$
where $F_s$ denotes the distribution function of the paired residual differences when training on $s$.

In practice, we cannot include all elements of 𝒮 for computational reasons. Instead, a large enough subset is generated, and the consistency of the result is checked. In our setting, a one-sided alternative is usually more relevant than a 2-sided one because of the asymmetric preference between the 2 procedures. We do not aim to detect local alternatives (for a particular training–test split) but focus on average behavior. Below, we first discuss how to test (2.1) for a given training set and then how to combine the resulting p-values for the final testing of (2.2).

2.2 Residuals and test statistic

Definition of a residual depends on the type of response. In an uncensored continuous (e.g. linear regression) setting or a binary (e.g. classification) setting, one may simply consider the usual "predicted minus observed" residuals or standardized measures thereof. For censored data, the concept of residuals is less obvious and still a matter of debate (Hothorn and others, 2004). Integrated Brier scores may be used, thereby acknowledging that "time-to-event itself cannot adequately be predicted" in survival analysis settings (Graf and others, 1999). These residuals are detailed in the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org).

Once the residuals are available for a given test set, they may be used to compute p-values under null hypothesis (2.1), which we then combine to test (2.2). We use a signed-rank test, for example, the Wilcoxon signed-rank test, to reduce the influence of a few large residuals. Fast computation of the p-values for a given training set is desirable because it has to be repeated over all splits. When m is large, normal or saddlepoint approximations can be used, owing to the independence of the test samples within a given split. When m is small, the fast split-up algorithm (Van de Wiel, 2001) can be used for exact computation, as implemented in the R package "coin" (Hothorn and others, 2006).
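For concreteness, a minimal sketch in base R of the p-value computation for a single split; the residual vectors are hypothetical example values, and the one-sided alternative corresponds to procedure 1 yielding smaller residuals:

    # Paired residuals of the 2 procedures on the same test samples.
    R1 <- c(0.8, 1.1, 0.7, 1.9, 0.6, 1.2)   # hypothetical residuals, procedure 1
    R2 <- c(1.0, 1.4, 0.9, 1.7, 1.1, 1.6)   # hypothetical residuals, procedure 2
    D  <- R1 - R2                           # paired residual differences
    # Exact one-sided Wilcoxon signed-rank p-value (small m); for large m,
    # drop 'exact' or use the approximations mentioned above.
    p_i <- wilcox.test(D, mu = 0, alternative = "less", exact = TRUE)$p.value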

2.3 Combining p-values: type I error control

In the training–testing setting, one obtains I p-values, one from every split. How should we combine these p-values? The widely used approach of Fisher (1932) is inappropriate here because of the strong dependence between the p-values, which is caused by the overlap of the training and test samples of one split with those of another. Instead, we adopt the following strategy.

Reject (2.2) if
$$\tilde{p} = \operatorname{med}(p_1, \ldots, p_I) \le \alpha. \tag{2.3}$$
We also consider an alternative rejection criterion based on a different "summary statistic":
$$p' = \Phi\left(\frac{1}{I}\sum_{i=1}^{I}\Phi^{-1}(p_i)\right) \le \alpha, \tag{2.4}$$
equivalent to
$$\bar{p}_{\Phi} = \frac{1}{I}\sum_{i=1}^{I} p_{\Phi,i} \le \Phi^{-1}(\alpha), \tag{2.5}$$
where $p_{\Phi,i} = \Phi^{-1}(p_i)$, with $\Phi$ the standard normal cumulative distribution function (cdf). Use of (2.3) has a relatively simple interpretation, but the properties of the average (2.4) are more tractable, and the variance of the estimator (from a finite number of splits I) is smaller.
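Both summary statistics are one-liners in R; a minimal sketch with a hypothetical vector of split p-values:

    # Summary p-values over I splits (hypothetical example values).
    pvals <- c(0.03, 0.12, 0.04, 0.08, 0.02, 0.21)
    p_med <- median(pvals)                # median p-value, rule (2.3)
    p_inv <- pnorm(mean(qnorm(pvals)))    # mean inverse normal p-value, rule (2.4)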
In both cases, one would like to guarantee overall type I error control. Both (2.3) and (2.4) are proper rejection rules when, under $H_0$,
$$P(\tilde{p} \le \alpha) \le \alpha \quad \text{and} \quad P(p' \le \alpha) \le \alpha, \tag{2.6}$$
in the sense that the type I error is controlled. We motivate the use of (2.3) and (2.4) from several bounds for the type I error under various conditions. First note that for large m, $p_i \sim U[0,1]$, hence $p_{\Phi,i} \sim N(0,1)$. Moreover, $\tilde{p} \le \alpha$ is equivalent to $\tilde{p}_{\Phi} = \operatorname{med}(p_{\Phi,1}, \ldots, p_{\Phi,I}) \le \Phi^{-1}(\alpha)$. Therefore, $P(\tilde{p} \le \alpha) = P(\tilde{p}_{\Phi} \le \Phi^{-1}(\alpha))$. Any explicit computation of these probabilities requires specification of the dependence structure between $p_{\Phi,1}, \ldots, p_{\Phi,I}$. The marginal distributions are standard normal, a necessary, but not sufficient, condition for the joint distribution to be multivariate normal (MVN) with variances equal to 1. Assuming multivariate normality (as in Hartung, 1999), the following theorems assure validity of (2.6).

 

THEOREM 2.1
Let $(p_{\Phi,1}, \ldots, p_{\Phi,I}) = (X_1, \ldots, X_I) \sim \mathrm{MVN}(0, \Sigma)$, with $\Sigma_{ii} = 1$. Moreover, let $\tilde{p} = \operatorname{med}(p_1, \ldots, p_I)$. Then, for all $\alpha \le 0.5$,
$$P(\tilde{p} \le \alpha) \le \alpha. \tag{2.7}$$

 

THEOREM 2.2
Let $(p_{\Phi,1}, \ldots, p_{\Phi,I}) = (X_1, \ldots, X_I) \sim \mathrm{MVN}(0, \Sigma)$, with $\Sigma_{ii} = 1$. Moreover, let $p' = \Phi\big(I^{-1}\sum_{i=1}^{I} p_{\Phi,i}\big)$. Then, for all $\alpha \le 0.5$,
$$P(p' \le \alpha) \le \alpha. \tag{2.8}$$

 

Proof.

See supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org). □

As mentioned before, the overlap between several training and test sets, along with the complex relationship between the training data and the prognostic function, complicates verification of the multivariate null distribution. For example, say that samples j and k belong to the training and test set, respectively, for split 1, while the reverse is true for split 2. A positive difference of residuals refers to a better prediction of procedure 1 with respect to procedure 2. Suppose that split 1 results in a positive difference for sample j. This event has a marginal probability of 1/2 under H0. When we are interested in the joint null distribution over the 2 splits, we need to be able to compute the conditional probability of a positive difference in the second split given the positive difference in the first one. This probability is likely to be somewhat larger than 1/2 because of the cross-dependence between the samples, but the complexity of the training process does not allow its explicit computation. In addition, residuals of test samples that overlap between splits are also dependent.

We believe that (2.7) and (2.8) will be conservative in many situations and support this by simulation results. Nevertheless, it is useful to have “worst-case” bounds for the type I error which hold under very general conditions.

 

THEOREM 2.3
Let $P$ be the null law of $\bar{p}_{\Phi}$, let $\sigma^2$ be the variance of $\bar{p}_{\Phi}$, and let $\bar{\rho}$ be the mean covariance between $p_{\Phi,i}$ and $p_{\Phi,j}$, $i \neq j$. Then, for $\alpha \le 0.051$ and unimodal $P$, or $\alpha \le 0.124$ and symmetric, unimodal $P$, we have
$$P(\bar{p}_{\Phi} \le \Phi^{-1}(\alpha)) \le \frac{2\sigma^2}{9\,\Phi^{-1}(\alpha)^2}, \tag{2.9}$$
where $\sigma^2 = I^{-1}\big(1 + (I-1)\bar{\rho}\big)$.

 

Proof.

The inequality follows directly from the Chebyshev-type inequalities of Camp (1922) and Vysochanskij and Petunin (1980), using the fact that $E(\bar{p}_{\Phi}) = 0$. □

Note that the most pessimistic case, $\bar{\rho} = 1$, implies $\sigma^2 = 1$ and hence an upper bound for α = 0.05 equal to $2/9 \times 1/1.65^2 \approx 0.082$. A mean covariance of, say, $\bar{\rho} = 0.5$ implies an upper bound approximately equal to $0.082 \times 0.5 \approx 0.041$. Inequality (2.9) also holds for the median, $\tilde{p}_{\Phi}$, but the computation of $\sigma^2$ is not explicit in this case. Finally, dropping all assumptions on P, we obtain the following bound.

THEOREM 2.4
Let $P$ be the null law of $\bar{p}_{\Phi}$. Then,
$$P(\bar{p}_{\Phi} \le \Phi^{-1}(\alpha)) \le \Phi(s), \tag{2.10}$$
where $\Phi$ and $\phi$ denote the standard normal cdf and probability density function, respectively, and $s$ is the solution of $\phi(s) + \Phi^{-1}(\alpha)\Phi(s) = 0$.

 

Proof.

See supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org). □

We found that for α ≤ 0.1, the right-hand side of (2.10) is approximately equal to 2.5α. In particular, it equals 0.126 for α = 0.05. The upper bounds under various assumptions on the joint distribution of (pΦ,1, …, pΦ,I) are listed in Table 1 for several values of α.
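These bounds are straightforward to evaluate numerically; a minimal R sketch (taking the worst case σ² = 1 in Theorem 3) reproduces the last 2 columns of Table 1:

    # Upper bounds of Theorems 3 and 4 for the alpha values of Table 1.
    alpha <- c(0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1)
    # Theorem 3 (unimodal P), worst case sigma^2 = 1: 2*sigma^2/(9*qnorm(alpha)^2).
    bound_unimodal <- 2 / (9 * qnorm(alpha)^2)
    # Theorem 4 (no assumptions): Phi(s), with s solving phi(s) + qnorm(alpha)*Phi(s) = 0.
    bound_none <- sapply(alpha, function(a) {
      s <- uniroot(function(s) dnorm(s) + qnorm(a) * pnorm(s), c(-10, 0))$root
      pnorm(s)
    })
    round(cbind(alpha, bound_unimodal, bound_none), 4)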

Table 1. Type I error upper bounds under several assumptions

    α       MVN                  Unimodal               None
            (Theorems 1 and 2)   (Theorem 3, σ² = 1)    (Theorem 4)
    0.001   0.001                0.0233                 0.0026
    0.002   0.002                0.0268                 0.0052
    0.005   0.005                0.0335                 0.013
    0.01    0.01                 0.0411                 0.0258
    0.02    0.02                 0.0527                 0.051
    0.05    0.05                 0.0821                 0.1255
    0.1     0.1                  0.1353*                0.2456

    *Assuming symmetry as well.


2.4 Algorithm and software

Our testing procedure is summarized by the following algorithm. We assume that the median p-value $\tilde{p}$ is used as the summary p-value; a minimal R sketch of the complete procedure follows the list.

  1. Initiate i = 1.

  2. Randomly split the entire data set into a training set and a test set, using conditional subsampling.

  3. Apply both prediction procedures to the training set.

  4. For the corresponding test set, predict the outcomes and compute the residuals $R_{k,j}$ for both procedures, k = 1, 2.

  5. Compute the pairwise difference between the residuals of procedure 1 and procedure 2: $D_j = R_{1,j} - R_{2,j}$.

  6. Repeat steps 4 and 5 for all test samples.

  7. Compute the signed rank $S_j$ for all j = 1, …, m.

  8. Compute the signed-rank test statistic $S = \sum_{j=1}^{m} a(S_j)$. (Using a(x) = x corresponds to the Wilcoxon signed-rank statistic.)

  9. Under $H_0$, each $S_j$ is equally likely to be positive or negative. Therefore, compute the p-value $p_i$ by sign permutations.

  10. Repeat steps 2 to 9 for i = 2, …, I. Finally, reject the global null hypothesis $H_0$ when $\tilde{p} \le \alpha$.
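A minimal R sketch of the full algorithm, assuming user-supplied functions resid_fn1 and resid_fn2 (hypothetical names and interface, not from the paper) that train a procedure on the training rows and return one residual per test row:

    # Multisplit test; resid_fn1/resid_fn2 are hypothetical user-supplied
    # functions: train on the first argument, return test-set residuals.
    multisplit_test <- function(data, resid_fn1, resid_fn2,
                                I = 50, q = 4/5, strata = NULL) {
      N     <- nrow(data)
      pvals <- numeric(I)
      for (i in seq_len(I)) {
        # Step 2: conditional subsampling, stratified on e.g. censoring status.
        groups <- if (is.null(strata)) list(seq_len(N)) else split(seq_len(N), strata)
        train  <- unlist(lapply(groups, function(g) sample(g, round(q * length(g)))))
        test   <- setdiff(seq_len(N), train)
        # Steps 3-6: paired residual differences on the shared test set.
        D <- resid_fn1(data[train, ], data[test, ]) -
             resid_fn2(data[train, ], data[test, ])
        # Steps 7-9: one-sided Wilcoxon signed-rank p-value.
        pvals[i] <- wilcox.test(D, mu = 0, alternative = "less")$p.value
      }
      # Step 10: summary p-values; reject H0 when p_med <= alpha.
      list(p_med = median(pvals), p_inv_normal = pnorm(mean(qnorm(pvals))))
    }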

This algorithm is implemented in R (R Development Core Team, 2006). The code is available in the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org).

3. SIMULATION

We performed a simulation study (1) to verify our conjecture that the most liberal bounds, (2.7) for $\tilde{p}$ and (2.8) for $\bar{p}_{\Phi}$, remain conservative and (2) to compare the power of our procedure with that of a one-split procedure (the latter maintains the type I error exactly, up to the discreteness of the test statistic).

We used a setting with 2 sparse linear models. The number of available covariates is of the same order as the number of samples. When applying the one-split approach, one would like to know the consistency of the decision with respect to repeated splitting. The one-split approach is useful only if the probability of arriving at the same decision for another split is reasonably high, say larger than 80%. In the simulation, we estimated this consistency as well. Here, we summarize the results of the simulation; details are given in the supplementary material available at Biostatistics online (http://www.biostatistics.oxfordjournals.org).

We observed that the multisplit approach using $c_{0.05} = \Phi^{-1}(0.05)$ (Theorem 1) controls the type I error well below 0.05. Second, we observed that the one-split approach can be more powerful on average than the multisplit approaches, but only in the very low power region where its results are inconsistent. This is mainly due to splits for which the alternative is more pronounced. However, in the region where a one-split approach could be of practical use (reasonable power and consistency > 0.8), its power is dominated by that of the multisplit procedure with $c_{0.05}$, and generally also by that of the most conservative approach, which guarantees type I error control under the conditions of Theorem 4.

4. APPLICATIONS

We consider 2 applications of our inference procedure. We focus on survival as the response variable. All summary p-values and graphs are initially based on I = 50 random splits with q = 4/5, using conditional subsampling. All testing is one-sided. Since extrapolation of the results to n = N may be a concern, we performed a graphical check by computing the average prediction error of both procedures for several values of q: 2/3, 3/4, 4/5, 17/20, 9/10. Multiple splits per fraction were used, but such that the test sets are nested for decreasing test set size m = N − n, to allow better comparability between the split fractions. If the 2 resulting curves do not converge between n = (4/5)N and n = N, it is probably safe to extrapolate the result, since the difference in estimated prediction errors does not decrease; if the curves converge, one should be very careful in extrapolating the results.
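One way to obtain nested test sets across split fractions (our construction, ignoring the conditioning on censoring for brevity) is to fix one random ordering of the samples per replicate and truncate it:

    # Nested test sets across split fractions (one replicate): smaller test
    # sets are subsets of larger ones, improving comparability across q.
    set.seed(2)
    N   <- 103
    qs  <- c(2/3, 3/4, 4/5, 17/20, 9/10)
    ord <- sample(N)   # one random ordering of the samples
    test_sets <- lapply(qs, function(q) ord[seq_len(round((1 - q) * N))])
    # The q = 9/10 test set is contained in the q = 2/3 test set.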

4.1 Microarray data set: lasso versus ridge

The first data set is the microarray data set of Bullinger and others (2004), which can be downloaded from the GEO database (accession number GSE425). It consists of 119 gene expression profiles from patients with acute myeloid leukemia and contains expression values of 6283 genes. Thirteen samples with more than 1800 missing values are removed, as are genes with over 20% missing values. The resulting data set contains 103 expression profiles of 4673 genes. Remaining missing values are imputed using K-nearest neighbors. Overall survival is used as the end point in the analysis. This is a high-dimensional setting in which the number of covariates, p, is larger than N. Recently, comparison of methods for survival prediction from such data has received a lot of attention (Bøvelstad and others, 2007; Schumacher and others, 2007; Van Wieringen and others, 2009). These comparisons all evaluate several methods by several prediction error metrics. Here, we first focus on the comparison between lasso regression and ridge regression. We used implementations of lasso and ridge as described in Park and Hastie (2007) and Van Houwelingen and others (2006), respectively. The attractive variable selection property of the lasso motivates our asymmetric preference between the 2 prediction procedures. However, for 3 other microarray data sets it was shown (Bøvelstad and others, 2007) that ridge regression predicted slightly better than lasso regression. If ridge regression predicts significantly better than lasso regression for our data, it may be worthwhile to consider using this more expensive procedure.
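Purely as an illustration (our substitution, not the implementations of Park and Hastie, 2007, and Van Houwelingen and others, 2006, used here), a comparable lasso/ridge pair for censored survival data can be set up with the glmnet package, on simulated stand-in data:

    # Lasso and ridge Cox regression via glmnet, a stand-in for the
    # implementations referenced in the text; all data are simulated.
    library(glmnet)
    set.seed(3)
    n <- 60; p <- 200
    x      <- matrix(rnorm(n * p), n, p)                    # stand-in expression matrix
    y      <- cbind(time = rexp(n), status = rbinom(n, 1, 0.7))  # stand-in survival response
    x_test <- matrix(rnorm(20 * p), 20, p)
    fit_lasso <- cv.glmnet(x, y, family = "cox", alpha = 1)  # L1 penalty
    fit_ridge <- cv.glmnet(x, y, family = "cox", alpha = 0)  # L2 penalty
    # Test-set linear predictors at the cross-validated penalty.
    lp_lasso <- predict(fit_lasso, newx = x_test, s = "lambda.min")
    lp_ridge <- predict(fit_ridge, newx = x_test, s = "lambda.min")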

Figure 1(a) displays the histogram of p-values as determined from the 50 splits. Clearly, there is no significant difference, as the summary p-values confirm: $\tilde{p} = 0.613$ and $p' = 0.595$. Figure 1(b) shows the potential effect of the split fraction on the prediction error. An interesting feature of this figure is that the lasso and ridge curves cross just after the commonly used split fraction q = 2/3. This implies that one should be careful when deciding between 2 procedures on the basis of cross-validated prediction errors for one given split fraction. In conclusion, there is no statistical evidence to prefer ridge over lasso in this application, so the lasso is recommended because of its additional feature selection property.

Fig. 1. Lasso versus ridge: (a) histogram of p-values and (b) prediction errors for several split fractions; median (solid) and quartiles (dashed).

4.2 Methylation and mRNA data set: 2 sets of markers versus 1

A second application of our test is illustrated on a data set of 99 Dutch patients with acute myeloid leukemia. At diagnosis, gene expression was measured for 33 genes, and its relation with survival was examined (Hess and others, 2007). In addition, the degree of methylation was assessed for 25 different regions. Here, we compare a lasso regression that uses both predictor sets with one that uses only the gene expression markers. If the procedure with both sets predicts significantly better than the procedure with gene expression alone, it may be considered for the prognosis of patients. Otherwise, the less costly alternative without methylation assessments is preferred.

We increased I to 250 because we were not satisfied with the precision of the estimates of the summary p-values from 50 splits. Figure 2(a) displays the histogram of p-values. The median p-value equals $\tilde{p} = 0.0458$, while $p' = 0.0569$. The resampling-based 90% upper confidence bounds for $\tilde{p}$ and $p'$ equal 0.0608 and 0.0659, respectively. Moreover, we observe from Figure 2(b) that the prediction error curves diverge after q = 4/5. Hence, there is substantial evidence that the methylation markers are crucial in addition to the mRNA markers for predicting survival in this data set.

Fig. 2. Methylation versus no methylation: (a) histogram of p-values and (b) prediction errors for several split fractions; median (solid) and quartiles (dashed).

5. DISCUSSION

5.1 Why not bootstrap?

Our testing procedure is based on simple CV. We also considered 3 uses of the bootstrap for the purpose of inference, but none was found to be appropriate. Below, we discuss these bootstrapping schemes and why they fail for our aim.

First, in this splitting setting, bootstrapping the residuals in a multivariate fashion is not feasible because of the complex dependencies between residuals of test samples that appear in each other's training sets for different splits. Hence, exchangeability of the samples is not guaranteed.

Second, it has been shown that estimation of the prediction error may be improved by combining bootstrap CV with the apparent error rate (Schumacher and others, 2007). Bootstrap CV resamples, with replacement, a training set of size N from the original sample and uses the samples not present in the bootstrap sample as test samples. This results in a somewhat pessimistic estimate, since the bootstrap sample does not contain all the information in the original sample. Therefore, a weighted average with the apparent error (the error within the training sample) is computed. Unfortunately, this elegant approach seems inappropriate for an inference procedure. First, the residuals contributing to the apparent error cannot be used for testing our null hypothesis. Second, the bootstrapping procedure results in test samples of different sizes, hence containing different amounts of independent information. Combining these into one inference is not straightforward.

Finally, one could try to generate a confidence interval, for example, for the median residual, by resampling the entire data set several times. This confidence interval would then be used as an alternative to our testing procedure. Such a procedure would be computationally intensive. More importantly, we are interested in comparing the predictive performance of the 2 procedures trained on a large proportion (q) of the data. The resampling approach would imply a substantial loss of "independent" information in the training phase, which may not have equal consequences for the 2 procedures. Hence, the confidence interval would not reflect the uncertainty under our null hypothesis of interest, which concerns predictors trained on independent data.

5.2 Alternative permutation approaches

We have illustrated the use of our method for comparing 2 prediction procedures in several settings. Naturally, our method can also be used to test whether a particular data–procedure combination has any predictive value at all, simply by comparing it with a "no-covariates" predictor. In this setting, an alternative testing approach would be to apply multiple random permutations of the response variable and to reapply the prognostic method to each permutation. If the covariates contain any predictive value, the observed prediction error should lie in the left tail of the resulting permutation distribution of the prediction error. However, in high-dimensional settings this approach is computationally cumbersome, since one needs many permutations to obtain a reasonable estimate of the p-value and each permutation requires a new training loop. Permutation could also be used in a setting with 2 sets of covariates: fix the response and the corresponding covariate values of set 1, while permuting the labels of the covariates of set 2. However, this approach may be invalid: the null hypothesis implies that the second covariate set is not associated with the response, but it does not imply that this set is unassociated with the covariates in the first set, which is a necessary condition (making every permutation equally likely under H0) for the resulting p-value to be valid.

5.3 Choice of the split fraction

The optimal choice of the ratio of the training and test set sizes in the proposed procedure depends on the data set under consideration. In particular, it is not immediate that the ratio 2:1 (q = 2/3) commonly used for the estimation of prediction error is optimal in this testing framework. Ideally, one wants n to approach N as closely as possible, so that the inference on a prediction based on n samples can be extrapolated to N. However, the power of the test collapses when the test sample size m = (1 − q)N is too small. If the data sets are very large in terms of the number of samples, it is likely that for both procedures the prediction error levels off after some value q = q′. This is a well-known phenomenon: the prediction error is bounded by the amount of information the covariates contain with respect to the response (Dobbin and Simon, 2007). If such a phenomenon is observed for both procedures, then we would consider using q = q′. However, this is a computationally intensive task, since it requires evaluation of both procedures under multiple split fractions. In Section 4, we illustrated a more practical alternative: using the default split fraction q = 4/5 and assessing the validity of extrapolation of the results by studying prediction errors for several larger split fractions.

5.4 Extensions

While the multivariate normal assumption in Theorems 1 and 2 is difficult to verify, it may be useful to have an asymptotic justification when the number of samples is large. The asymptotic results may depend heavily on the type of residuals as well as on the prediction procedures considered. Hence, such a study requires a focus on particular prediction paradigms and procedures. Another extension of our method is to the comparison of more than 2 predictors. This would imply using a nonparametric K-sample test that accounts for blocking, such as the Friedman test, instead of the Wilcoxon signed-rank test. The theory on combining p-values from several splits may then be applied without any change.

5.5 Conclusion

A unique characteristic of our method is its testing framework targeted at prediction. We connect 2 well-accepted concepts in applied statistics: prediction error rates and p-values as a measure of evidence against the null hypothesis. The procedure is entirely nonparametric; hence, it is robust against outliers and may also be used for nonlikelihood prediction methods. In fact, it may be used in any prediction or classification setting in which it is possible to define a residual of the predicted value with respect to the observation. The method owes much of its power to the pairing principle in the training–test splitting. Moreover, combining p-values avoids the arbitrariness related to the study of a single split. In conclusion, we suggest that our method may play an important role in the evaluation of competing prediction procedures and (sets of) prognostic factors in many studies.

FUNDING

Center for Medical Systems Biology established by the Netherlands Genomics Initiative/Netherlands Organization for Scientific Research.

SUPPLEMENTARY MATERIAL

Supplementary Material is available at http://www.biostatistics.oxfordjournals.org.

ACKNOWLEDGMENTS

We thank Jelle Goeman for providing the implementation of ridge regression to us. Moreover, we are grateful to Etienne Roquain, José Ferreira, and Aad van der Vaart for discussions on this work. Conflict of Interest: None declared.

References

Bøvelstad, H. M., Nygaard, S., Størvold, H. L., Aldrin, M., Borgan, Ø., Frigessi, A. and Lingjaerde, O. C. (2007). Predicting survival from microarray data—a comparative study. Bioinformatics 23, 2080–2087.

Bullinger, L., Döhner, K., Bair, E., Fröhling, S., Schlenk, R. F., Tibshirani, R., Döhner, H. and Pollack, J. R. (2004). Use of gene-expression profiling to identify prognostic subclasses in adult acute myeloid leukemia. The New England Journal of Medicine 350, 1605–1616.

Camp, B. H. (1922). A new generalization of Tchebysheff's statistical inequality. Bulletin of the American Mathematical Society 28, 427–432.

Diebold, F. X. and Mariano, R. S. (1995). Comparing predictive accuracy. Journal of Business & Economic Statistics 13, 253–263.

Dobbin, K. K. and Simon, R. M. (2007). Sample size planning for developing classifiers using high-dimensional DNA microarray data. Biostatistics 8, 101–117.

Fisher, R. A. (1932). Statistical Methods for Research Workers, 4th edition. London: Oliver and Boyd.

Graf, E., Schmoor, C., Sauerbrei, W. and Schumacher, M. (1999). Assessment and comparison of prognostic classification schemes for survival data. Statistics in Medicine 18, 2529–2545.

Hartung, J. (1999). A note on combining dependent tests of significance. Biometrical Journal 41, 849–855.

Hess, C. J., Berkhof, J., Denkers, F., Ossenkoppele, G. J., Schouten, J. P., Oudejans, J. J., Waisfisz, Q. and Schuurhuis, G. J. (2007). Activated intrinsic apoptosis pathway is a key related prognostic parameter in acute myeloid leukemia. Journal of Clinical Oncology 25, 1209–1215.

Hothorn, T., Hornik, K., van de Wiel, M. A. and Zeileis, A. (2006). A lego system for conditional inference. The American Statistician 60, 257–263.

Hothorn, T., Lausen, B., Benner, A. and Radespiel-Tröger, M. (2004). Bagging survival trees. Statistics in Medicine 23, 77–91.

Lehmann, E. L. (1975). Nonparametrics: Statistical Methods Based on Ranks. San Francisco, CA: Holden-Day.

Meng, X.-L. (1994). Posterior predictive p-values. Annals of Statistics 22, 1142–1160.

Park, M. Y. and Hastie, T. (2007). L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society, Series B 69, 659–677.

R Development Core Team (2006). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.

Schumacher, M., Binder, H. and Gerds, T. (2007). Assessment of survival prediction models based on microarray data. Bioinformatics 23, 1768–1774.

Van de Wiel, M. A. (2001). The split-up algorithm: a fast symbolic method for computing p-values of rank statistics. Computational Statistics 16, 519–538.

Van Houwelingen, H. C., Bruinsma, T., Hart, A. A., Van't Veer, L. J. and Wessels, L. F. (2006). Cross-validated Cox regression on microarray gene expression data. Statistics in Medicine 25, 3201–3216.

Van Wieringen, W. N., Kun, D., Hampel, R. and Boulesteix, A.-L. (2009). Survival prediction using gene expression data: a review and comparison. Computational Statistics & Data Analysis 53, 1590–1603.

Vysochanskij, D. F. and Petunin, Y. I. (1980). Justification of the 3σ rule for unimodal distributions. Theory of Probability and Mathematical Statistics 21, 25–36.